The Pandas Library

The Pandas library has handy functions for data ingestion, descriptive statistics, data cleaning, subsetting, filtering, insertion, deletion, and aggregation. Its main characteristics are:

- Pandas provides a number of data-analysis-friendly features.
- Pandas builds on NumPy, so most of NumPy's advantages still hold.
- It enables ingestion and manipulation of heterogeneous data types in an intuitive fashion.
- It enables combining large data sets using merge and join.
- It also provides plotting helpers for fast generation of data plots.

Pandas has two main data structures:
- Series: a one-dimensional labeled array that behaves much like a fixed-size dictionary.
- DataFrame: a 2D, size-mutable data structure that supports heterogeneous data.
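
The merge and join features mentioned above are not demonstrated elsewhere in this post, so here is a minimal sketch. The two small tables (`people`, `depts`) and the `age` column are hypothetical and made up purely for illustration.

```python
import pandas as pd

# Hypothetical tables, invented for this illustration only.
people = pd.DataFrame({'name': ['tom', 'bob', 'nancy'], 'dept_id': [1, 2, 2]})
depts = pd.DataFrame({'dept_id': [1, 2], 'dept': ['sales', 'ops']})

# merge: SQL-style join on a shared column.
merged = people.merge(depts, on='dept_id', how='left')

# join: aligns the right-hand frame on the index of the left-hand frame.
ages = pd.DataFrame({'age': [31, 45]}, index=['tom', 'nancy'])
joined = people.set_index('name').join(ages)

print(merged)
print(joined)
```

Rows with no match on the right-hand side (here `bob` in `joined`) come back as NaN, because `join` defaults to a left join.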

import pandas as pd

Pandas Series

ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser

tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

ser.index

Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')

ser['nancy']

300

ser.loc['nancy']

300

ser.loc[['nancy', 'bob']]

nancy    300
bob      foo
dtype: object

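# Integer keys fall back to positions here because the index holds strings.
# Recent pandas versions deprecate this fallback; ser.iloc[[1, 3, 4]] is the explicit form.
ser[[1,3,4]]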

bob     foo
dan     bar
eric    500
dtype: object

ser.iloc[[1,3,4]]

bob     foo
dan     bar
eric    500
dtype: object

'bob' in ser

True
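
As with a dictionary, the `in` operator on a Series tests the index labels, not the stored values. The sketch below rebuilds the same `ser` and contrasts the two checks; using `isin` for the value test is one possible choice, not the only one.

```python
import pandas as pd

ser = pd.Series([100, 'foo', 300, 'bar', 500],
                ['tom', 'bob', 'nancy', 'dan', 'eric'])

label_hit = 'bob' in ser              # looks at the index labels
value_as_label = 'foo' in ser         # 'foo' is a value, not a label
value_hit = ser.isin(['foo']).any()   # looks at the values instead

print(label_hit, value_as_label, bool(value_hit))
```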

ser

tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

ser * 2

tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object

ser[['nancy', 'eric']] ** 2

nancy     90000
eric     250000
dtype: object
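
The `foofoo` result above comes from the `object` dtype: pandas applies Python's own `*` to each element, so strings are repeated while numbers are doubled. A minimal sketch of the contrast with a purely numeric Series, plus label alignment in arithmetic, follows; the Series `numeric` and `other` are made up for illustration.

```python
import pandas as pd

mixed = pd.Series([100, 'foo', 300], index=['tom', 'bob', 'nancy'])
numeric = pd.Series([100, 200, 300], index=['tom', 'bob', 'nancy'])

doubled_mixed = mixed * 2       # element-wise Python semantics, dtype stays object
doubled_numeric = numeric * 2   # vectorized integer arithmetic
print(doubled_mixed.dtype, doubled_numeric.dtype)

# Arithmetic between two Series aligns on labels, not on positions.
other = pd.Series([1, 2], index=['nancy', 'tom'])
total = numeric + other
print(total)
```

Labels present in only one operand (here `bob`) yield NaN in the result.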

Pandas DataFrame

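# Note: the name dict shadows Python's built-in dict type; outside a short demo, a different name is safer.
dict = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
        'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}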
dict

{'one': apple    100.0
 ball     200.0
 clock    300.0
 dtype: float64, 'two': apple      111.0
 ball       222.0
 cerill     333.0
 dancy     4444.0
 dtype: float64}

df = pd.DataFrame(dict)
print(df)

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0

df

df.index

Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

df.columns

Index(['one', 'two'], dtype='object')

pd.DataFrame(dict, index=['dancy', 'ball', 'apple'])

pd.DataFrame(dict, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
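
The rendered tables for the last two calls are not shown above, so the sketch below rebuilds the same input (under the name `data`, to avoid shadowing the built-in `dict`) and inspects the result. Passing `index` and `columns` selects and reorders labels; a requested column that does not exist in the input, such as `'five'`, is filled with NaN.

```python
import pandas as pd

data = {'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
        'two': pd.Series([111., 222., 333., 4444.],
                         index=['apple', 'ball', 'cerill', 'dancy'])}

# Keep only three row labels, in the requested order.
rows_only = pd.DataFrame(data, index=['dancy', 'ball', 'apple'])

# Also restrict the columns; 'five' is absent from the input.
subset = pd.DataFrame(data, index=['dancy', 'ball', 'apple'],
                      columns=['two', 'five'])

print(rows_only)
print(subset)
```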

Create DataFrame from List of Python dictionaries

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
data

[{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

pd.DataFrame(data)

pd.DataFrame(data, index=['red', 'blue'])

pd.DataFrame(data, columns=['joe', 'dora', 'alice'])
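
Since the rendered tables are missing here as well, this short sketch shows what the calls above produce: each dictionary becomes one row, the union of the keys becomes the columns, and keys missing from a row become NaN.

```python
import pandas as pd

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

full = pd.DataFrame(data, index=['red', 'blue'])
picked = pd.DataFrame(data, columns=['joe', 'dora', 'alice'])

print(full)
print(picked)
```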

Basic DataFrame Operations

df

df['one']

apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64

df['three'] = df['one'] * df['two']
df

df['flag'] = df['one'] > 250
df

three = df.pop('three')
three

apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64

df

del df['two']
df

df.insert(2, 'copy_of_one', df['one'])
df

df['one_upper_half'] = df['one'][:2]
df
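
The operations above (`pop`, `del`, `insert`, column assignment) all modify `df` in place. As a sketch of a possible alternative, `assign` and `drop` return new frames and leave the original untouched. The frame below is rebuilt by hand from the values shown earlier, and `.iloc[:2]` is used as the explicit positional form of `[:2]`.

```python
import pandas as pd

df = pd.DataFrame({'one': [100., 200., None, 300., None],
                   'two': [111., 222., 333., None, 4444.]},
                  index=['apple', 'ball', 'cerill', 'clock', 'dancy'])

# assign() returns a new frame; df itself is left unchanged.
with_cols = df.assign(three=df['one'] * df['two'], flag=df['one'] > 250)

# drop() is the non-mutating counterpart of del / pop.
without_two = with_cols.drop(columns=['two'])

# Assigning a shorter Series aligns on the index; unmatched rows become NaN.
with_cols['one_upper_half'] = df['one'].iloc[:2]

print(with_cols)
print(without_two.columns.tolist())
```

Comparisons against NaN evaluate to False, which is why `flag` is False for `cerill` and `dancy`.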

Reading external data from file in pandas

file = "/Users/vrowm/PycharmProjects/DataScience/MyData/boston.csv"
mydata = pd.read_csv(file)
mydata.head()

mydata.tail()

mydata.shape

(506, 14)
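
The path above is specific to the author's machine. To try `read_csv` without that file, the sketch below reads from an in-memory string instead; the three rows reuse the CRIM, ZN and TAX values shown in the `head()` output of this post, and `csv_text` is a stand-in for the real file.

```python
import io
import pandas as pd

# Stand-in for boston.csv: three rows and three columns copied from the post's head() output.
csv_text = """CRIM,ZN,TAX
0.00632,18.0,296
0.02731,0.0,242
0.02729,0.0,242
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

`read_csv` accepts a path, a URL, or any file-like object, and infers a dtype for each column from its contents.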

mydata.describe()

mydata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM     506 non-null float64
ZN       506 non-null float64
INDUS    506 non-null float64
CHAS     506 non-null int64
NOX      506 non-null float64
RM       506 non-null float64
AGE      506 non-null float64
DIS      506 non-null float64
RAD      506 non-null int64
TAX      506 non-null int64
PT       506 non-null float64
B        506 non-null float64
LSTAT    506 non-null float64
MV       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB

#mydata.isnull()
mydata.isnull().sum()

CRIM     0
ZN       0
INDUS    0
CHAS     0
NOX      0
RM       0
AGE      0
DIS      0
RAD      0
TAX      0
PT       0
B        0
LSTAT    0
MV       0
dtype: int64
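
This data set has no missing values, so the counts are all zero. For a frame that does contain gaps, the same check can be followed by dropping or filling them. A minimal sketch on a made-up frame (`frame` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, for illustration only.
frame = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})

missing_per_column = frame.isnull().sum()   # count NaN per column
complete_rows = frame.dropna()              # keep rows without any NaN
filled = frame.fillna(frame.mean())         # replace NaN with the column mean

print(missing_per_column)
print(complete_rows)
print(filled)
```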

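#mydata['TAX']
# A notebook cell displays only its last expression, so the value_counts() result is not shown below.
mydata['TAX'].value_counts()
mydata['TAX'].head()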

0    296
1    242
2    242
3    222
4    222
Name: TAX, dtype: int64
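
Because the `value_counts()` result is not displayed above, here is a small sketch of what it returns, applied to just the five TAX values printed by `head()`. The name `tax_head` is introduced only for this example.

```python
import pandas as pd

# The five TAX values shown in the head() output above.
tax_head = pd.Series([296, 242, 242, 222, 222], name='TAX')

counts = tax_head.value_counts()                 # occurrences of each distinct value
shares = tax_head.value_counts(normalize=True)   # the same, as fractions of the total

print(counts)
print(shares)
```

The result is itself a Series indexed by the distinct values and sorted by frequency in descending order.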

Output of mydata.head():

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PT	B	LSTAT	MV
0	0.00632	18.0	2.31	0.538	6.575	65.199997	4.0900	1	296	15.300000	396.899994	4.98	24.000000
1	0.02731	0.0	7.07	0.469	6.421	78.900002	4.9671	2	242	17.799999	396.899994	9.14	21.600000
2	0.02729	0.0	7.07	0.469	7.185	61.099998	4.9671	2	242	17.799999	392.829987	4.03	34.700001
3	0.03237	0.0	2.18	0.458	6.998	45.799999	6.0622	3	222	18.700001	394.630005	2.94	33.400002
4	0.06905	0.0	2.18	0.458	7.147	54.200001	6.0622	3	222	18.700001	396.899994	5.33	36.200001

Output of mydata.tail():

	CRIM	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PT	B	LSTAT	MV
501	0.06263	11.93	0.573	6.593	69.099998	2.4786	1	273	21.0	391.989990	9.67	22.4
502	0.04527	11.93	0.573	6.120	76.699997	2.2875	1	273	21.0	396.899994	9.08	20.6
503	0.06076	11.93	0.573	6.976	91.000000	2.1675	1	273	21.0	396.899994	5.64	23.9
504	0.10959	11.93	0.573	6.794	89.300003	2.3889	1	273	21.0	393.450012	6.48	22.0
505	0.04741	11.93	0.573	6.030	80.800003	2.5050	1	273	21.0	396.899994	7.88	11.9

Output of mydata.describe():

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PT	B	LSTAT	MV
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.363636	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	356.674030	12.653063	22.532806
std	8.601545	23.322453	6.860353	0.253994	0.115878	0.702617	28.148862	2.105710	8.707259	168.537116	2.164946	91.294863	7.141062	9.197104
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	0.320000	1.730000	5.000000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	375.377487	6.950000	17.025000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	391.440002	11.360000	21.200001
75%	3.677083	12.500000	18.100000	0.000000	0.624000	6.623500	94.074999	5.188425	24.000000	666.000000	20.200001	396.225006	16.954999	25.000000
max	88.976196	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.899994	37.970001	50.000000
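
The result of `describe()` is an ordinary DataFrame, so individual statistics can be selected by label. A minimal sketch on a made-up single-column frame (`frame` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data, for illustration only.
frame = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})

summary = frame.describe()

# Pick individual statistics by their row label.
mean_x = summary.loc['mean', 'x']
median_x = summary.loc['50%', 'x']

# Custom percentiles can be requested explicitly.
wide = frame.describe(percentiles=[0.1, 0.9])

print(mean_x, median_x)
print(wide.index.tolist())
```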

Oct 6, 2018

Example 9: Exploring Pandas Library

In Jupyter Notebook