Oct 6, 2018

Example 9: Exploring Pandas Library

In Jupyter Notebook



Pandas stands for “Python Data Analysis Library” is one of the mostly used and useful libraries in Python for Data Science analysis tasks. The library’s name originally derives from panel data, a common term for multidimensional data sets encountered in statistics and econometrics. This example explores the Pandas library basics. The analysis is done using Jupyter Notebook.

The Pandas Library


The Pandas library has handy functions for data ingestion, generating descriptive statistics on data, data cleaning, subsetting, filtering, insertion, deletion and aggregation. Following are the characteristics of the Pandas library:
  • Pandas library provides a number of data analysis-friendly features.
  • Pandas builds up NumPy, so most of the NumPy advantages still hold true.
  • Enables ingestion and manipulation of heterogeneous data types in an intuitive fashion.
  • Enables combining large data sets using merge and join.
  • Also provides visualizations and fast generation of data plots.
  • Pandas has two data structures:
    • Series, is like a fit sized dictionary.
    • DataFrame, is a 2D elastic data structure that supports heterogeneous data.
In [1]:
import pandas as pd

Pandas Series

In [2]:
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
Out[2]:
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
In [3]:
ser.index
Out[3]:
Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')
In [4]:
ser['nancy']
Out[4]:
300
In [5]:
ser.loc['nancy']
Out[5]:
300
In [6]:
ser.loc[['nancy', 'bob']]
Out[6]:
nancy    300
bob      foo
dtype: object
In [7]:
ser[[1,3,4]]
Out[7]:
bob     foo
dan     bar
eric    500
dtype: object
In [8]:
ser.iloc[[1,3,4]]
Out[8]:
bob     foo
dan     bar
eric    500
dtype: object
In [9]:
'bob' in ser
Out[9]:
True
In [10]:
ser
Out[10]:
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
In [11]:
ser * 2
Out[11]:
tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object
In [12]:
ser[['nancy', 'eric']] ** 2
Out[12]:
nancy     90000
eric     250000
dtype: object

Pandas DataFrame

In [13]:
dict = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
dict
Out[13]:
{'one': apple    100.0
 ball     200.0
 clock    300.0
 dtype: float64, 'two': apple      111.0
 ball       222.0
 cerill     333.0
 dancy     4444.0
 dtype: float64}
In [14]:
df = pd.DataFrame(dict)
print(df)
          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0
In [15]:
df
Out[15]:
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0
In [16]:
df.index
Out[16]:
Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')
In [17]:
df.columns
Out[17]:
Index(['one', 'two'], dtype='object')
In [18]:
pd.DataFrame(dict, index=['dancy', 'ball', 'apple'])
Out[18]:
one two
dancy NaN 4444.0
ball 200.0 222.0
apple 100.0 111.0
In [19]:
pd.DataFrame(dict, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
Out[19]:
two five
dancy 4444.0 NaN
ball 222.0 NaN
apple 111.0 NaN

Create DataFrame from List of Python dictionaries

In [20]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
data
Out[20]:
[{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
In [21]:
pd.DataFrame(data)
Out[21]:
alex alice dora ema joe
0 1.0 NaN NaN NaN 2.0
1 NaN 20.0 10.0 5.0 NaN
In [22]:
pd.DataFrame(data, index=['red', 'blue'])
Out[22]:
alex alice dora ema joe
red 1.0 NaN NaN NaN 2.0
blue NaN 20.0 10.0 5.0 NaN
In [23]:
pd.DataFrame(data, columns=['joe', 'dora', 'alice'])
Out[23]:
joe dora alice
0 2.0 NaN NaN
1 NaN 10.0 20.0

Basic DataFrame Operations

In [24]:
df
Out[24]:
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0
In [25]:
df['one']
Out[25]:
apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64
In [26]:
df['three'] = df['one'] * df['two']
df
Out[26]:
one two three
apple 100.0 111.0 11100.0
ball 200.0 222.0 44400.0
cerill NaN 333.0 NaN
clock 300.0 NaN NaN
dancy NaN 4444.0 NaN
In [27]:
df['flag'] = df['one'] > 250
df
Out[27]:
one two three flag
apple 100.0 111.0 11100.0 False
ball 200.0 222.0 44400.0 False
cerill NaN 333.0 NaN False
clock 300.0 NaN NaN True
dancy NaN 4444.0 NaN False
In [28]:
three = df.pop('three')
three
Out[28]:
apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64
In [29]:
df
Out[29]:
one two flag
apple 100.0 111.0 False
ball 200.0 222.0 False
cerill NaN 333.0 False
clock 300.0 NaN True
dancy NaN 4444.0 False
In [30]:
del df['two']
df
Out[30]:
one flag
apple 100.0 False
ball 200.0 False
cerill NaN False
clock 300.0 True
dancy NaN False
In [31]:
df.insert(2, 'copy_of_one', df['one'])
df
Out[31]:
one flag copy_of_one
apple 100.0 False 100.0
ball 200.0 False 200.0
cerill NaN False NaN
clock 300.0 True 300.0
dancy NaN False NaN
In [32]:
df['one_upper_half'] = df['one'][:2]
df
Out[32]:
one flag copy_of_one one_upper_half
apple 100.0 False 100.0 100.0
ball 200.0 False 200.0 200.0
cerill NaN False NaN NaN
clock 300.0 True 300.0 NaN
dancy NaN False NaN NaN

Reading external data from file in pandas

In [33]:
file = "/Users/vrowm/PycharmProjects/DataScience/MyData/boston.csv"
mydata = pd.read_csv(file)
mydata.head()
Out[33]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PT B LSTAT MV
0 0.00632 18.0 2.31 0 0.538 6.575 65.199997 4.0900 1 296 15.300000 396.899994 4.98 24.000000
1 0.02731 0.0 7.07 0 0.469 6.421 78.900002 4.9671 2 242 17.799999 396.899994 9.14 21.600000
2 0.02729 0.0 7.07 0 0.469 7.185 61.099998 4.9671 2 242 17.799999 392.829987 4.03 34.700001
3 0.03237 0.0 2.18 0 0.458 6.998 45.799999 6.0622 3 222 18.700001 394.630005 2.94 33.400002
4 0.06905 0.0 2.18 0 0.458 7.147 54.200001 6.0622 3 222 18.700001 396.899994 5.33 36.200001
In [34]:
mydata.tail()
Out[34]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PT B LSTAT MV
501 0.06263 0.0 11.93 0 0.573 6.593 69.099998 2.4786 1 273 21.0 391.989990 9.67 22.4
502 0.04527 0.0 11.93 0 0.573 6.120 76.699997 2.2875 1 273 21.0 396.899994 9.08 20.6
503 0.06076 0.0 11.93 0 0.573 6.976 91.000000 2.1675 1 273 21.0 396.899994 5.64 23.9
504 0.10959 0.0 11.93 0 0.573 6.794 89.300003 2.3889 1 273 21.0 393.450012 6.48 22.0
505 0.04741 0.0 11.93 0 0.573 6.030 80.800003 2.5050 1 273 21.0 396.899994 7.88 11.9
In [35]:
mydata.shape
Out[35]:
(506, 14)
In [36]:
mydata.describe()
Out[36]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PT B LSTAT MV
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674030 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148862 2.105710 8.707259 168.537116 2.164946 91.294863 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377487 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440002 11.360000 21.200001
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.074999 5.188425 24.000000 666.000000 20.200001 396.225006 16.954999 25.000000
max 88.976196 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.899994 37.970001 50.000000
In [37]:
mydata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM     506 non-null float64
ZN       506 non-null float64
INDUS    506 non-null float64
CHAS     506 non-null int64
NOX      506 non-null float64
RM       506 non-null float64
AGE      506 non-null float64
DIS      506 non-null float64
RAD      506 non-null int64
TAX      506 non-null int64
PT       506 non-null float64
B        506 non-null float64
LSTAT    506 non-null float64
MV       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB
In [38]:
#mydata.isnull()
mydata.isnull().sum()
Out[38]:
CRIM     0
ZN       0
INDUS    0
CHAS     0
NOX      0
RM       0
AGE      0
DIS      0
RAD      0
TAX      0
PT       0
B        0
LSTAT    0
MV       0
dtype: int64
In [39]:
#mydata['TAX']
mydata['TAX'].value_counts()
mydata['TAX'].head()
Out[39]:
0    296
1    242
2    242
3    222
4    222
Name: TAX, dtype: int64
In [ ]:
 


For more details on Pandas library please see Example 10: Movies Data Analysis with real dataset of movies.