Exploring Pandas Library with Movies Dataset
This example extends Example 9 by exploring the Pandas library with the Movies dataset from MovieLens. The analysis is done in a Jupyter Notebook. The movies notebook is a representative example worth keeping as a reference, as any data science study will follow similar steps.
Movies Data Analysis
This Jupyter notebook example shows the use of data science techniques to analyse movie data, using the MovieLens data (filename: ml-20m.zip) from https://grouplens.org/datasets/movielens/
I am using three CSV files from the downloaded data:
- movies.csv : movieId, title, genres
- tags.csv : userId, movieId, tag, timestamp
- ratings.csv : userId, movieId, rating, timestamp
In [1]:
import pandas as pd
In [2]:
movies = pd.read_csv('./MLData/02_Movielens/ml-20m/movies.csv', sep=',')
print(type(movies))
In [3]:
movies.head()
Out[3]:
In [4]:
tags = pd.read_csv('./MLData/02_Movielens/ml-20m/tags.csv', sep=',')
tags.head()
Out[4]:
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
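As a quick illustration of this convention (using arbitrary example values, not ones from the dataset), pandas can convert such integers directly:

```python
import pandas as pd

# 0 seconds after the epoch is midnight UTC, January 1, 1970
epoch_start = pd.to_datetime(0, unit='s')
print(epoch_start)  # 1970-01-01 00:00:00

# one day later: 86400 seconds
one_day = pd.to_datetime(86400, unit='s')
print(one_day)  # 1970-01-02 00:00:00
```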
In [5]:
ratings = pd.read_csv('./MLData/02_Movielens/ml-20m/ratings.csv', sep=',')
ratings.head()
Out[5]:
For current analysis, we will remove timestamp (we will come back to it!).
In [6]:
del tags['timestamp']
del ratings['timestamp']
In [7]:
tags.head(3)
Out[7]:
In [8]:
ratings.head(3)
Out[8]:
In [9]:
movies.shape
Out[9]:
In [10]:
tags.shape
Out[10]:
In [11]:
ratings.shape
Out[11]:
In [12]:
tags.index
Out[12]:
In [13]:
tags.columns
Out[13]:
Extract row 0, 11, 2000 from DataFrame using index location iloc
In [14]:
tags.iloc[ [0,11,2000]]
Out[14]:
In [15]:
movies.iloc[ [10, 50, 1500]]
Out[15]:
In [16]:
row_0 = tags.iloc[0]
type(row_0)
Out[16]:
Extract the 0th row: notice that it is in fact a Series.
In [17]:
print(row_0)
In [18]:
row_0.index
Out[18]:
In [19]:
row_0['userId']
Out[19]:
In [20]:
'rating' in row_0
Out[20]:
In [21]:
row_0.name
Out[21]:
In [22]:
row_0 = row_0.rename('first_row')
row_0.name
Out[22]:
In [23]:
print(row_0)
In [24]:
ratings.head(3)
Out[24]:
In [25]:
ratings.describe()
Out[25]:
In [26]:
ratings['rating'].describe()
Out[26]:
In [27]:
ratings.mean()
Out[27]:
In [28]:
ratings['rating'].mean()
Out[28]:
In [29]:
ratings['rating'].min()
Out[29]:
In [30]:
ratings['rating'].max()
Out[30]:
In [31]:
ratings['rating'].std()
Out[31]:
In [32]:
ratings['rating'].mode()
Out[32]:
In [33]:
ratings.corr()
Out[33]:
In [34]:
filter_1 = ratings['rating'] > 5
filter_1.any()
Out[34]:
In [35]:
filter_2 = ratings['rating'] < 1
filter_2.any()
Out[35]:
In [36]:
type(filter_2)
Out[36]:
In [37]:
filter_2.head()
Out[37]:
In [38]:
filter_2 = ratings['rating'] > 0
filter_2.all()
Out[38]:
In [39]:
movies.shape
Out[39]:
Does any row have a NULL value?
In [40]:
movies.isnull().any()
Out[40]:
In [41]:
tags.isnull().any()
Out[41]:
In [42]:
ratings.isnull().any()
Out[42]:
So the tags table has some missing values in the tag column.
In [43]:
tags.isnull().sum()
Out[43]:
In [44]:
tags.shape
Out[44]:
So in the tags table, 16 out of 465564 rows have missing values, which is a small number in comparison, so we can drop those 16 rows.
In [45]:
tags = tags.dropna()
In [46]:
tags.isnull().any()
Out[46]:
In [47]:
tags.shape
Out[47]:
So 465564 - 465548 = 16 rows have been removed, and no null rows are left.
The Pandas library has very useful data visualisation plot functions, such as the following:
- DataFrame.plot() : Gives a line graph, with each column drawn as a separate line.
- DataFrame.plot.area() : Gives an area plot.
- DataFrame.plot.bar() : Gives vertical bars for each column.
- DataFrame.plot.barh() : Gives horizontal bars for each column.
- DataFrame.plot.box() : Gives a box plot of the data distribution (min, max, and median values) for each column.
- DataFrame.plot.density() : Gives a Kernel Density Estimate plot.
- DataFrame.plot.hexbin() : Gives a hexbin plot.
- DataFrame.plot.hist() : Gives a histogram of the distribution of data, which can show skewness of the data.
- DataFrame.plot.kde() : Same as the Kernel Density Estimate plot.
- DataFrame.plot.line() : Gives a simple line plot.
- DataFrame.plot.pie() : Gives a pie chart.
- DataFrame.plot.scatter() : Gives a scatter plot.
- DataFrame.boxplot() : Gives a box plot from the DataFrame columns.
- DataFrame.hist() : Gives a histogram from the DataFrame columns.
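A minimal sketch of a few of these plot functions on a small synthetic DataFrame (the column names and values here are made up for illustration; the Agg backend is used so the sketch runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 2, 5], 'b': [2, 2, 4, 1]})

ax_line = df.plot()            # one line per column
ax_bar = df.plot.bar()         # vertical bars per column
ax_hist = df['a'].plot.hist()  # histogram of a single column
```

Each call returns a Matplotlib Axes object, which can be customized further before display.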
Matplotlib is a plotting library for Python, and Pandas leverages Matplotlib underneath for its plots. For Jupyter to plot the graphs inside the notebook, we have to tell Jupyter to plot inline. The percentage sign before matplotlib marks a special class of functions in Jupyter called magic functions.
In [48]:
%matplotlib inline
ratings.hist(column='rating', figsize=(10,5))
Out[48]:
In [49]:
ratings.boxplot(column='rating', figsize=(10,5))
Out[49]:
In [50]:
tags['tag'].head()
Out[50]:
In [51]:
movies[['title','genres']].head()
Out[51]:
In [52]:
ratings[:5]
Out[52]:
In [53]:
ratings[-5:]
Out[53]:
In [54]:
ratings[1000:1005]
Out[54]:
In [55]:
# Count of movies per tag in the tag database
tag_counts = tags['tag'].value_counts()
tag_counts[:5]
Out[55]:
Show only rows from tags DataFrame with tag value as "sci-fi".
In [56]:
scifi = tags.loc[tags['tag'] == 'sci-fi']
scifi.head(5)
Out[56]:
In [57]:
comedy = tags.loc[tags['tag'] == 'comedy']
comedy.head(5)
Out[57]:
In [58]:
tag_counts[:10].plot(kind='bar', figsize=(10,5))
Out[58]:
The sci-fi tag is applied most often, followed by "based on a book".
In [59]:
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][90:95]
Out[59]:
In [60]:
is_lowly_rated = ratings['rating'] <= 2.0
ratings[is_lowly_rated][15:20]
Out[60]:
In [61]:
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][10:15]
Out[61]:
In [62]:
is_scifi = movies['genres'].str.contains('Sci-Fi')
movies[is_scifi][10:15]
Out[62]:
In [63]:
movies[is_scifi].head(5)
Out[63]:
In [64]:
# Count of movies for each rating
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
Out[64]:
In [65]:
# Average rating for every movie in our ratings database
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
Out[65]:
In [66]:
# To see how many ratings are present per movie
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
Out[66]:
Merge DataFrames
Details in: http://pandas.pydata.org/pandas-docs/stable/merging.html
Note: Using merge() is better than using concat() or append() here, as merge() performs a key-based join of the rows rather than simply stacking them.
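The difference can be seen on two tiny frames (made up for illustration): concat() stacks all the rows, while merge() joins rows on matching keys.

```python
import pandas as pd

left = pd.DataFrame({'movieId': [1, 2], 'title': ['A', 'B']})
right = pd.DataFrame({'movieId': [2, 3], 'tag': ['fun', 'dark']})

# concat stacks rows: 4 rows, with NaN where a column is absent
stacked = pd.concat([left, right], sort=False)

# merge joins on the key: only movieId 2 appears in both frames
joined = left.merge(right, on='movieId', how='inner')

print(len(stacked), len(joined))  # 4 1
```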
In [67]:
movies.head()
Out[67]:
In [68]:
tags.head()
Out[68]:
In [69]:
ratings.head()
Out[69]:
In [70]:
movtags = movies.merge(tags, on='movieId', how='inner')
del movtags['userId']
movtags.head()
Out[70]:
Combine aggregation, merging, and filters to get useful analytics:
In [71]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()
Out[71]:
In [72]:
# First merging the movies table with the ratings table
box_off = movies.merge(ratings, on='movieId', how='inner')
del box_off['userId']
box_off.head()
Out[72]:
In [73]:
# Then merging the movies table with the avg_ratings dataframe in memory
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.head()
Out[73]:
In [74]:
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated][:5]
Out[74]:
In [75]:
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
Out[75]:
In [76]:
box_office[is_comedy & is_highly_rated][:5]
Out[76]:
In [77]:
movies.head()
Out[77]:
The movie titles also include the year, and the genres column can hold more than one genre, strung together with pipe characters.
In [78]:
movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:5]
Out[78]:
All the genres are split into a separate DataFrame. The number of columns equals the maximum number of genres in any single row, so nine columns here means at least one row has nine genres.
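That relationship can be verified directly. A sketch on a tiny stand-in frame (the genre strings below are illustrative, not taken from the dataset):

```python
import pandas as pd

# stand-in for the movies frame
demo = pd.DataFrame({'genres': ['Comedy', 'Action|Sci-Fi', 'Drama|Romance|War']})

# the longest pipe-separated list determines the column count of expand=True
max_genres = demo['genres'].str.split('|').str.len().max()
expanded = demo['genres'].str.split('|', expand=True)
print(max_genres, expanded.shape[1])  # 3 3
```

Running the same check on the real movies frame would confirm why nine columns appear above.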
In [79]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:5]
Out[79]:
Extract year from movie title e.g. separate "Toy Story" and "(1995)" so that year is extracted out.
In [80]:
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=True)
movies[-5:]
Out[80]:
In [81]:
tags2 = pd.read_csv('./MLData/02_Movielens/ml-20m/tags.csv', sep=',')
tags2.head()
Out[81]:
In [82]:
tags2.dtypes
Out[82]:
Note that the timestamp column has datatype int64.
This is a Unix time/POSIX time/epoch time format that records time in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
Our task is to convert this int64, the number of seconds since the 1970 UTC epoch, into one of the datetime formats so that Python renders it in a human-readable format.
In [83]:
tags2['parsed_time'] = pd.to_datetime(tags2['timestamp'], unit='s')
tags2['parsed_time'].dtype
Out[83]:
The data type datetime64[ns] maps to either <M8[ns] or >M8[ns], depending on the byte order of the hardware.
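This equivalence can be checked with NumPy, where the < and > prefixes denote little- and big-endian byte order:

```python
import sys
import numpy as np

# pick the M8[ns] spelling that matches this machine's byte order
native = '<' if sys.byteorder == 'little' else '>'

# the native-endian M8[ns] dtype is the same dtype as datetime64[ns]
print(np.dtype('datetime64[ns]') == np.dtype(native + 'M8[ns]'))  # True
```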
In [84]:
tags2.head()
Out[84]:
Selecting rows based on timestamps:
In [85]:
greater_than_t = tags2['parsed_time'] > '2015-02-01'
selected_rows = tags2[greater_than_t]
tags2.shape, selected_rows.shape
Out[85]:
In [86]:
tags2[greater_than_t][:5]
Out[86]:
Sorting the table using the timestamps:
In [87]:
tags2.sort_values(by='parsed_time', ascending=True)[:5]
Out[87]:
In [88]:
ratings.tail()
Out[88]:
In [89]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.head()
Out[89]:
In [90]:
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.tail()
Out[90]:
In [91]:
joined.corr()
Out[91]:
In [92]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[-10:]
Out[92]:
In [93]:
yearly_average.info()
Note that the year column has datatype "object", as it contains some unintended data like "Das Millionenspiel" or "2009–". This will create problems in graph plotting, so we need to clean these values.
In [94]:
yearly_average['year'] = pd.to_numeric(yearly_average['year'], errors='coerce')
yearly_average[-10:]
Out[94]:
In [95]:
yearly_average.info()
In [96]:
yearly_average.isnull().sum()
Out[96]:
In [97]:
yearly_average = yearly_average.dropna(axis=0)
yearly_average[-10:]
Out[97]:
In [98]:
yearly_average.isnull().sum()
Out[98]:
In [99]:
yearly_average['year'] = yearly_average['year'].astype('int64', errors='ignore')
yearly_average[-10:]
Out[99]:
In [100]:
yearly_average[:].plot(x='year', y='rating', figsize=(13,8), grid=True)
Out[100]:
Do some years look better for box office movies than others?
Yes, around the years 1900 and 1920, films used to get higher ratings.
Does any data point seem like an outlier in some sense?
No; as the ratings lie well within 2.5 to 5.0, no data point appears to be an outlier.
For details on optimizing Pandas code for speed and efficiency, see this presentation at PyCon 2017.
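As one small illustration of the kind of speed-up such material covers (a sketch, not taken from the presentation itself): vectorized column arithmetic replaces a Python-level loop over rows with a single operation on the whole column.

```python
import pandas as pd

df = pd.DataFrame({'rating': [3.0, 4.5, 2.0, 5.0]})

# slow, row-at-a-time style: iterates in Python
halved_loop = [r / 2 for r in df['rating']]

# vectorized equivalent: one call over the entire column at once
halved_vec = df['rating'] / 2

print(list(halved_vec) == halved_loop)  # True
```

On a frame the size of the 20M ratings table, the vectorized form is typically orders of magnitude faster.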