In Jupyter Notebook
Pandas, whose name is often expanded as “Python Data Analysis Library”, is one of the most widely used libraries in Python for data science tasks. The name originally derives from “panel data”, a common term for multidimensional data sets encountered in statistics and econometrics. This example explores the basics of the Pandas library. The analysis is done in a Jupyter Notebook.
The Pandas Library
The Pandas library has handy functions for data ingestion, generating descriptive statistics on data, data cleaning, subsetting, filtering, insertion, deletion and aggregation. Following are the characteristics of the Pandas library:
- Pandas library provides a number of data analysis-friendly features.
- Pandas builds on NumPy, so most of the NumPy advantages still hold true.
- Enables ingestion and manipulation of heterogeneous data types in an intuitive fashion.
- Enables combining large data sets using merge and join.
- Provides quick plotting of data through built-in plotting methods.
- Pandas has two main data structures:
  - Series: a one-dimensional labeled array that behaves like a fixed-size dictionary.
  - DataFrame: a two-dimensional, size-mutable data structure that supports heterogeneous data.
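The list above mentions merging, filtering and aggregation, which the cells in this example do not exercise. The following is a minimal sketch of those three operations. The frames `sales` and `stores`, their column names and their values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
sales = pd.DataFrame({
    'store': ['A', 'B', 'A', 'C'],
    'amount': [100.0, 250.0, 50.0, 75.0],
})
stores = pd.DataFrame({
    'store': ['A', 'B'],
    'city': ['Boston', 'Austin'],
})

# Left join on the shared key; store 'C' has no match, so its city is NaN
merged = sales.merge(stores, on='store', how='left')

# Filtering: keep only the rows whose amount is above 60
large = merged[merged['amount'] > 60]

# Aggregation: total amount per store
totals = sales.groupby('store')['amount'].sum()
```

A left join is used here so that every row of `sales` is kept even when `stores` has no matching entry; other values of `how` (such as `'inner'`) would drop the unmatched row instead.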
In [1]:
import pandas as pd
In [2]:
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
Out[2]:
In [3]:
ser.index
Out[3]:
In [4]:
ser['nancy']
Out[4]:
In [5]:
ser.loc['nancy']
Out[5]:
In [6]:
ser.loc[['nancy', 'bob']]
Out[6]:
In [7]:
ser[[1,3,4]]  # integer positions on a label index; recent pandas versions deprecate this in favour of .iloc
Out[7]:
In [8]:
ser.iloc[[1,3,4]]
Out[8]:
In [9]:
'bob' in ser
Out[9]:
In [10]:
ser
Out[10]:
In [11]:
ser * 2
Out[11]:
In [12]:
ser[['nancy', 'eric']] ** 2
Out[12]:
In [13]:
# Note: naming this variable `dict` shadows Python's built-in dict type
dict = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
        'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
dict
Out[13]:
In [14]:
df = pd.DataFrame(dict)
print(df)
In [15]:
df
Out[15]:
In [16]:
df.index
Out[16]:
In [17]:
df.columns
Out[17]:
In [18]:
pd.DataFrame(dict, index=['dancy', 'ball', 'apple'])
Out[18]:
In [19]:
# 'five' is not a key in dict, so that column is filled with NaN
pd.DataFrame(dict, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
Out[19]:
In [20]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
data
Out[20]:
In [21]:
pd.DataFrame(data)
Out[21]:
In [22]:
pd.DataFrame(data, index=['red', 'blue'])
Out[22]:
In [23]:
pd.DataFrame(data, columns=['joe', 'dora', 'alice'])
Out[23]:
In [24]:
df
Out[24]:
In [25]:
df['one']
Out[25]:
In [26]:
df['three'] = df['one'] * df['two']
df
Out[26]:
In [27]:
df['flag'] = df['one'] > 250
df
Out[27]:
In [28]:
three = df.pop('three')
three
Out[28]:
In [29]:
df
Out[29]:
In [30]:
del df['two']
df
Out[30]:
In [31]:
df.insert(2, 'copy_of_one', df['one'])
df
Out[31]:
In [32]:
df['one_upper_half'] = df['one'][:2]
df
Out[32]:
In [33]:
file = "/Users/vrowm/PycharmProjects/DataScience/MyData/boston.csv"
mydata = pd.read_csv(file)
mydata.head()
Out[33]:
In [34]:
mydata.tail()
Out[34]:
In [35]:
mydata.shape
Out[35]:
In [36]:
mydata.describe()
Out[36]:
In [37]:
mydata.info()
In [38]:
#mydata.isnull()
mydata.isnull().sum()
Out[38]:
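Counting missing values is usually the first step of the data cleaning mentioned in the introduction. As a rough sketch of two common follow-ups, the example below drops incomplete rows and, alternatively, fills gaps with the column mean. The frame `frame` and its values are hypothetical and are not taken from the Boston data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; values invented for illustration
frame = pd.DataFrame({
    'TAX': [296.0, np.nan, 242.0],
    'AGE': [65.2, 78.9, np.nan],
})

missing_per_column = frame.isnull().sum()  # number of NaN values in each column
dropped = frame.dropna()                   # keep only rows without any missing value
filled = frame.fillna(frame.mean())        # replace each NaN with its column mean
```

Which option is appropriate depends on the data: dropping rows discards information, while filling with the mean keeps every row but can distort the distribution.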
In [39]:
#mydata['TAX']
mydata['TAX'].value_counts()  # result is not displayed; a notebook cell shows only its last expression
mydata['TAX'].head()
Out[39]:
For more details on the Pandas library, see Example 10: Movies Data Analysis, which uses a real movies dataset.