Exploring Pandas Library with Movies Dataset
This example extends Example 9 by exploring the Pandas library with the Movies dataset from MovieLens. The analysis is done in a Jupyter Notebook. The movies notebook is a representative example worth keeping as a reference, as any data science study will follow similar steps.
Movies Data Analysis
This Jupyter notebook example shows the use of data science techniques to analyse movie data, using the MovieLens data (filename: ml-20m.zip) from https://grouplens.org/datasets/movielens/
I am using three CSV files from the downloaded data:
- movies.csv : movieId, title, genres
- tags.csv : userId, movieId, tag, timestamp
- ratings.csv : userId, movieId, rating, timestamp
In [1]:
import pandas as pd
In [2]:
movies = pd.read_csv('./MLData/02_Movielens/ml-20m/movies.csv', sep=',')
print(type(movies))
In [3]:
movies.head()
Out[3]:
In [4]:
tags = pd.read_csv('./MLData/02_Movielens/ml-20m/tags.csv', sep=',')
tags.head()
Out[4]:
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
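As a quick illustration of this convention (using arbitrary example values, not ones from the dataset), pandas can convert such integers directly:

```python
import pandas as pd

# 0 seconds after the epoch is midnight UTC, January 1, 1970
epoch_start = pd.to_datetime(0, unit='s')
print(epoch_start)  # 1970-01-01 00:00:00

# one day later: 86400 seconds
one_day = pd.to_datetime(86400, unit='s')
print(one_day)  # 1970-01-02 00:00:00
```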
In [5]:
ratings = pd.read_csv('./MLData/02_Movielens/ml-20m/ratings.csv', sep=',')
ratings.head()
Out[5]:
For current analysis, we will remove timestamp (we will come back to it!).
In [6]:
del tags['timestamp']
del ratings['timestamp']
In [7]:
tags.head(3)
Out[7]:
In [8]:
ratings.head(3)
Out[8]:
In [9]:
movies.shape
Out[9]:
In [10]:
tags.shape
Out[10]:
In [11]:
ratings.shape
Out[11]:
In [12]:
tags.index
Out[12]:
In [13]:
tags.columns
Out[13]:
Extract row 0, 11, 2000 from DataFrame using index location iloc
In [14]:
tags.iloc[ [0,11,2000]]
Out[14]:
In [15]:
movies.iloc[ [10, 50, 1500]]
Out[15]:
In [16]:
row_0 = tags.iloc[0]
type(row_0)
Out[16]:
Extract the 0th row: notice that it is in fact a Series.
In [17]:
print(row_0)
In [18]:
row_0.index
Out[18]:
In [19]:
row_0['userId']
Out[19]:
In [20]:
'rating' in row_0
Out[20]:
In [21]:
row_0.name
Out[21]:
In [22]:
row_0 = row_0.rename('first_row')
row_0.name
Out[22]:
In [23]:
print(row_0)
In [24]:
ratings.head(3)
Out[24]:
In [25]:
ratings.describe()
Out[25]:
In [26]:
ratings['rating'].describe()
Out[26]:
In [27]:
ratings.mean()
Out[27]:
In [28]:
ratings['rating'].mean()
Out[28]:
In [29]:
ratings['rating'].min()
Out[29]:
In [30]:
ratings['rating'].max()
Out[30]:
In [31]:
ratings['rating'].std()
Out[31]:
In [32]:
ratings['rating'].mode()
Out[32]:
In [33]:
ratings.corr()
Out[33]:
In [34]:
filter_1 = ratings['rating'] > 5
filter_1.any()
Out[34]:
In [35]:
filter_2 = ratings['rating'] < 1
filter_2.any()
Out[35]:
In [36]:
type(filter_2)
Out[36]:
In [37]:
filter_2.head()
Out[37]:
In [38]:
filter_2 = ratings['rating'] > 0
filter_2.all()
Out[38]:
In [39]:
movies.shape
Out[39]:
Does any row have a NULL value?
In [40]:
movies.isnull().any()
Out[40]:
In [41]:
tags.isnull().any()
Out[41]:
In [42]:
ratings.isnull().any()
Out[42]:
So the tags table has some missing values in the tag column.
In [43]:
tags.isnull().sum()
Out[43]:
In [44]:
tags.shape
Out[44]:
So in the tags table, 16 out of 465564 rows have missing values, which is a small number in comparison, so we can drop those 16 rows.
In [45]:
tags = tags.dropna()
In [46]:
tags.isnull().any()
Out[46]:
In [47]:
tags.shape
Out[47]:
So 465564 - 465548 = 16 rows have been removed, and no null rows are left.
The Pandas library has very useful data visualisation plot functions, such as the following:
- DataFrame.plot() : Gives a line graph, with each column drawn as a separate line.
- DataFrame.plot.area() : Gives an area plot.
- DataFrame.plot.bar() : Gives vertical bars for each column.
- DataFrame.plot.barh() : Gives horizontal bars for each column.
- DataFrame.plot.box() : Gives a box plot of the data distribution (min, max, and median values) for each column.
- DataFrame.plot.density() : Gives a Kernel Density Estimate plot.
- DataFrame.plot.hexbin() : Gives a hexbin plot.
- DataFrame.plot.hist() : Gives a histogram of the distribution of data, which can show skewness of the data.
- DataFrame.plot.kde() : Same as the Kernel Density Estimate plot.
- DataFrame.plot.line() : Gives a simple line plot.
- DataFrame.plot.pie() : Gives a pie chart.
- DataFrame.plot.scatter() : Gives a scatter plot.
- DataFrame.boxplot() : Gives a box plot from the DataFrame columns.
- DataFrame.hist() : Gives a histogram from the DataFrame columns.
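A minimal sketch of a few of these plot functions on a small synthetic DataFrame (the column names and values here are made up for illustration; the Agg backend is used so the sketch runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 2, 5], 'b': [2, 2, 4, 1]})

ax_line = df.plot()            # one line per column
ax_bar = df.plot.bar()         # vertical bars per column
ax_hist = df['a'].plot.hist()  # histogram of a single column
```

Each call returns a Matplotlib Axes object, which can be customized further before display.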
Matplotlib is a plotting library for Python, and Pandas leverages Matplotlib underneath for its plots. For Jupyter to plot the graphs inside the notebook, we have to tell Jupyter to plot inline. The percentage sign before matplotlib marks a special class of functions in Jupyter called magic functions.
In [48]:
%matplotlib inline
ratings.hist(column='rating', figsize=(10,5))
Out[48]:
In [49]:
ratings.boxplot(column='rating', figsize=(10,5))
Out[49]:
In [50]:
tags['tag'].head()
Out[50]:
In [51]:
movies[['title','genres']].head()
Out[51]:
In [52]:
ratings[:5]
Out[52]:
In [53]:
ratings[-5:]
Out[53]:
In [54]:
ratings[1000:1005]
Out[54]:
In [55]:
# Count of movies per tag in the tag database
tag_counts = tags['tag'].value_counts()
tag_counts[:5]
Out[55]:
Show only rows from tags DataFrame with tag value as "sci-fi".
In [56]:
scifi = tags.loc[tags['tag'] == 'sci-fi']
scifi.head(5)
Out[56]:
In [57]:
comedy = tags.loc[tags['tag'] == 'comedy']
comedy.head(5)
Out[57]:
In [58]:
tag_counts[:10].plot(kind='bar', figsize=(10,5))
Out[58]:
The sci-fi tag is applied most often, followed by "based on a book".
In [59]:
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][90:95]
Out[59]:
In [60]:
is_lowly_rated = ratings['rating'] <= 2.0
ratings[is_lowly_rated][15:20]
Out[60]:
In [61]:
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][10:15]
Out[61]:
In [62]:
is_scifi = movies['genres'].str.contains('Sci-Fi')
movies[is_scifi][10:15]
Out[62]:
In [63]:
movies[is_scifi].head(5)
Out[63]:
In [64]:
# Count of movies for each rating
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
Out[64]:
In [65]:
# Average rating for every movie in our ratings database
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
Out[65]:
In [66]:
# To see how many ratings are present per movie
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
Out[66]:
Merge DataFrames
Details in: http://pandas.pydata.org/pandas-docs/stable/merging.html
Note: Using merge() is better than using concat() or append() here, as merge() performs a key-based join of the rows rather than simply stacking them.
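The difference can be seen on two tiny frames (made up for illustration): concat() stacks all the rows, while merge() joins rows on matching keys.

```python
import pandas as pd

left = pd.DataFrame({'movieId': [1, 2], 'title': ['A', 'B']})
right = pd.DataFrame({'movieId': [2, 3], 'tag': ['fun', 'dark']})

# concat stacks rows: 4 rows, with NaN where a column is absent
stacked = pd.concat([left, right], sort=False)

# merge joins on the key: only movieId 2 appears in both frames
joined = left.merge(right, on='movieId', how='inner')

print(len(stacked), len(joined))  # 4 1
```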
In [67]:
movies.head()
Out[67]:
In [68]:
tags.head()
Out[68]:
In [69]:
ratings.head()
Out[69]:
In [70]:
movtags = movies.merge(tags, on='movieId', how='inner')
del movtags['userId']
movtags.head()
Out[70]:
Combine aggregation, merging, and filters to get useful analytics:
In [71]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()
Out[71]:
In [72]:
# First merging the movies table with the ratings table
box_off = movies.merge(ratings, on='movieId', how='inner')
del box_off['userId']
box_off.head()
Out[72]:
In [73]:
# Then merging the movies table with the avg_ratings dataframe in memory
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.head()
Out[73]:
In [74]:
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated][:5]
Out[74]:
In [75]:
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
Out[75]:
In [76]:
box_office[is_comedy & is_highly_rated][:5]
Out[76]:
In [77]:
movies.head()
Out[77]:
The movie titles also include the year, and the genres column can hold more than one genre, strung together with pipe characters.
In [78]:
movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:5]
Out[78]:
All the genres are split into a separate DataFrame. The number of columns equals the maximum number of genres in any single row, so nine columns here means at least one row has nine genres.
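That relationship can be verified directly. A sketch on a tiny stand-in frame (the genre strings below are illustrative, not taken from the dataset):

```python
import pandas as pd

# stand-in for the movies frame
demo = pd.DataFrame({'genres': ['Comedy', 'Action|Sci-Fi', 'Drama|Romance|War']})

# the longest pipe-separated list determines the column count of expand=True
max_genres = demo['genres'].str.split('|').str.len().max()
expanded = demo['genres'].str.split('|', expand=True)
print(max_genres, expanded.shape[1])  # 3 3
```

Running the same check on the real movies frame would confirm why nine columns appear above.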
In [79]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:5]
Out[79]:
Extract year from movie title e.g. separate "Toy Story" and "(1995)" so that year is extracted out.
In [80]:
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=True)
movies[-5:]
Out[80]:
In [81]:
tags2 = pd.read_csv('./MLData/02_Movielens/ml-20m/tags.csv', sep=',')
tags2.head()
Out[81]:
In [82]:
tags2.dtypes
Out[82]:
Note that the timestamp column has datatype int64.
This is a Unix time/POSIX time/epoch time format that records time in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
Our task is to convert this int64, the number of seconds since the 1970 UTC epoch, into one of the datetime formats so that Python renders it in a human-readable format.
In [83]:
tags2['parsed_time'] = pd.to_datetime(tags2['timestamp'], unit='s')
tags2['parsed_time'].dtype
Out[83]:
The data type datetime64[ns] maps to either <M8[ns] or >M8[ns], depending on the byte order of the hardware.
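This equivalence can be checked with NumPy, where the < and > prefixes denote little- and big-endian byte order:

```python
import sys
import numpy as np

# pick the M8[ns] spelling that matches this machine's byte order
native = '<' if sys.byteorder == 'little' else '>'

# the native-endian M8[ns] dtype is the same dtype as datetime64[ns]
print(np.dtype('datetime64[ns]') == np.dtype(native + 'M8[ns]'))  # True
```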
In [84]:
tags2.head()
Out[84]:
Selecting rows based on timestamps:
In [85]:
greater_than_t = tags2['parsed_time'] > '2015-02-01'
selected_rows = tags2[greater_than_t]
tags2.shape, selected_rows.shape
Out[85]:
In [86]:
tags2[greater_than_t][:5]
Out[86]:
Sorting the table using the timestamps:
In [87]:
tags2.sort_values(by='parsed_time', ascending=True)[:5]
Out[87]:
In [88]:
ratings.tail()
Out[88]:
In [89]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.head()
Out[89]:
In [90]:
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.tail()
Out[90]:
In [91]:
joined.corr()
Out[91]:
In [92]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[-10:]
Out[92]:
In [93]:
yearly_average.info()
Note that the year column has datatype "object", as it contains some unintended data like "Das Millionenspiel" or "2009–". This will create problems in graph plotting, so we need to clean these values.
In [94]:
yearly_average['year'] = pd.to_numeric(yearly_average['year'], errors='coerce')
yearly_average[-10:]
Out[94]:
In [95]:
yearly_average.info()
In [96]:
yearly_average.isnull().sum()
Out[96]:
In [97]:
yearly_average = yearly_average.dropna(axis=0)
yearly_average[-10:]
Out[97]:
In [98]:
yearly_average.isnull().sum()
Out[98]:
In [99]:
yearly_average['year'] = yearly_average['year'].astype('int64', errors='ignore')
yearly_average[-10:]
Out[99]:
In [100]:
yearly_average[:].plot(x='year', y='rating', figsize=(13,8), grid=True)
Out[100]:
Do some years look better for box office movies than others?
Yes, around the years 1900 and 1920, films used to get higher ratings.
Does any data point seem like an outlier in some sense?
No; as the ratings lie well within 2.5 to 5.0, no data point appears to be an outlier.
For details on optimizing Pandas code for speed and efficiency, see this presentation at PyCon 2017.
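As one small illustration of the kind of speed-up such material covers (a sketch, not taken from the presentation itself): vectorized column arithmetic replaces a Python-level loop over rows with a single operation on the whole column.

```python
import pandas as pd

df = pd.DataFrame({'rating': [3.0, 4.5, 2.0, 5.0]})

# slow, row-at-a-time style: iterates in Python
halved_loop = [r / 2 for r in df['rating']]

# vectorized equivalent: one call over the entire column at once
halved_vec = df['rating'] / 2

print(list(halved_vec) == halved_loop)  # True
```

On a frame the size of the 20M ratings table, the vectorized form is typically orders of magnitude faster.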