Pandas
Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are. Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.
Series
The first main data type we will learn about for pandas is the Series. Let's import Pandas and explore the Series object. A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
Let's explore this concept through some examples:
import numpy as np
import pandas as pd
You can convert a list, numpy array, or dictionary to a Series:
labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
arr = np.array([10, 20, 30])
d = {'a': 10, 'b': 20, 'c': 30}
Using Lists
pd.Series(data=my_list)
0    10
1    20
2    30
dtype: int64
pd.Series(data=my_list, index=labels)
a    10
b    20
c    30
dtype: int64
pd.Series(my_list, labels)
a    10
b    20
c    30
dtype: int64
NumPy Arrays
pd.Series(arr)
0    10
1    20
2    30
dtype: int64
pd.Series(arr, labels)
a    10
b    20
c    30
dtype: int64
Dictionary
pd.Series(d)
a    10
b    20
c    30
dtype: int64
Data in a Series
A pandas Series can hold a variety of object types:
pd.Series(data=labels)
0    a
1    b
2    c
dtype: object
# Even functions (although unlikely that you will use this)
pd.Series([sum, print, len])
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
Using an Index
The key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).
Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:
ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser1
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])
ser2
USA        1
Germany    2
Italy      5
Japan      4
dtype: int64
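Because the index works like a dictionary, grabbing a value by its label is a fast lookup. A minimal sketch using the same ser1 as above:

```python
import pandas as pd

# Same series as above; label lookups behave like dictionary access
ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

ser1['USA']    # label-based lookup
ser1.iloc[0]   # positional lookup is still available via .iloc
```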
Operations are then also aligned based on index:
ser1 + ser2
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
Let's stop here for now and move on to DataFrames, which will expand on the concept of Series!
DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5, 4), index='A B C D E'.split(), columns='W X Y Z'.split())
df
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Selection and Indexing
Let's learn the various methods to grab data from a DataFrame:
df['W']
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
# Pass a list of column names
df[['W', 'Z']]
| | W | Z |
|-|:-|:-|
|A | 2.706850 | 0.503826 |
|B | 0.651118 | 0.605965 |
|C | -2.018168 | -0.589001 |
|D | 0.188695 | 0.955057 |
|E | 0.190794 | 0.683509 |
# SQL-style syntax (NOT RECOMMENDED!)
df.W
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
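One reason the attribute-style syntax is discouraged: if a column name collides with an existing DataFrame attribute, or contains spaces, the dot form silently gives you the wrong thing or fails. A small illustration with a hypothetical frame (not the df above):

```python
import pandas as pd

# Hypothetical frame whose column names clash with DataFrame attributes
# or contain spaces
demo = pd.DataFrame({'count': [1, 2], 'W X': [3, 4]})

demo['count']   # bracket notation always returns the column
demo.count      # attribute access returns the DataFrame.count *method* instead
```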
DataFrame Columns are just Series
type(df['W'])
pandas.core.series.Series
Creating a new column
df['new'] = df['W'] + df['Y']
df
| | W | X | Y | Z | new |
|-|:-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
Removing Columns
df.drop('new', axis=1)
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
#Not inplace unless specified!
df
| | W | X | Y | Z | new |
|-|:-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
df.drop('new', axis=1, inplace=True)
df
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Can also drop rows this way:
df.drop('E', axis=0)
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
Selecting Rows
df.loc['A']
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64
Or select based on position instead of label:
df.iloc[2]
W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
Selecting a subset of rows and columns
df.loc['B', 'Y']
-0.84807698340363147
df.loc[['A', 'B'], ['W', 'Y']]
| | W | Y |
|-|:-|:-|
|A | 2.706850 | 0.907969 |
|B | 0.651118 | -0.848077 |
Conditional Selection
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
df
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df > 0
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|A | True | True | True | True |
|B | True | False | False | True |
|C | False | True | True | False |
|D | True | False | False | True |
|E | True | True | True | True |
df[df > 0]
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|B | 0.651118 | NaN | NaN | 0.605965 |
|C | NaN | 0.740122 | 0.528813 | NaN |
|D | 0.188695 | NaN | NaN | 0.955057 |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df[df['W'] > 0]
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df[df['W'] > 0]['Y']
A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
df[df['W'] > 0][['Y', 'X']]
| | Y | X |
|-|:-|:-|
|A | 0.907969 | 0.628133 |
|B | -0.848077 | -0.319318 |
|D | -0.933237 | -0.758872 |
|E | 2.605967 | 1.978757 |
For two conditions you can use | and & with parentheses:
df[(df['W'] > 0) & (df['Y'] > 1)]
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
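An OR filter works the same way with |. Note that Python's and/or keywords raise an error on Series, which is why the element-wise & and | operators plus the parentheses are required. A self-contained sketch that rebuilds the same df:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

# Rebuild the same df as above so this sketch runs on its own
np.random.seed(101)
df = pd.DataFrame(randn(5, 4), index='A B C D E'.split(), columns='W X Y Z'.split())

# Rows where W is negative OR Y is greater than 2
result = df[(df['W'] < 0) | (df['Y'] > 2)]
```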
More Index Details
Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!
df
| | W | X | Y | Z |
|-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
#Reset to default 0,1...n index
df.reset_index()
| | index | W | X | Y | Z |
|-|:-|:-|:-|:-|:-|
|0 | A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
|3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
|4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df
| | W | X | Y | Z | States |
|-|:-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States')
| States | W | X | Y | Z |
|-|:-|:-|:-|:-|
|CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
|OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
|CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df
| | W | X | Y | Z | States |
|-|:-|:-|:-|:-|:-|
|A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
|B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
|C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
|D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
|E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States', inplace=True)
df
| States | W | X | Y | Z |
|-|:-|:-|:-|:-|
|CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
|NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
|WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
|OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
|CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Multi-Index and Index Hierarchy
Let us go over how to work with a MultiIndex. First we'll create a quick example of what a multi-indexed DataFrame looks like:
# Index Levels
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hier_index = list(zip(outside, inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
df = pd.DataFrame(np.random.randn(6, 2), index=hier_index, columns=['A', 'B'])
df
| | | A | B |
|-|:-|:-|:-|
|G1 | 1 | 0.153661 | 0.167638 |
| | 2 | -0.765930 | 0.962299 |
| | 3 | 0.902826 | -0.537909 |
|G2 | 1 | -1.549671 | 0.435253 |
| | 2 | 1.259904 | -0.447898 |
| | 3 | 0.266207 | 0.412580 |
Now let's show how to index this! For index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:
df.loc['G1']
| | A | B |
|-|:-|:-|
|1 | 0.153661 | 0.167638 |
|2 | -0.765930 | 0.962299 |
|3 | 0.902826 | -0.537909 |
df.loc['G1'].loc[1]
A    0.153661
B    0.167638
Name: 1, dtype: float64
df.index.names
FrozenList([None, None])
df.index.names = ['Group', 'Num']
df
| Group | Num | A | B |
|-|:-|:-|:-|
|G1 | 1 | 0.153661 | 0.167638 |
| | 2 | -0.765930 | 0.962299 |
| | 3 | 0.902826 | -0.537909 |
|G2 | 1 | -1.549671 | 0.435253 |
| | 2 | 1.259904 | -0.447898 |
| | 3 | 0.266207 | 0.412580 |
df.xs('G1')
| Num | A | B |
|-|:-|:-|
|1 | 0.153661 | 0.167638 |
|2 | -0.765930 | 0.962299 |
|3 | 0.902826 | -0.537909 |
df.xs(('G1', 1))
A    0.153661
B    0.167638
Name: (G1, 1), dtype: float64
df.xs(1, level='Num')
| Group | A | B |
|-|:-|:-|
|G1 | 0.153661 | 0.167638 |
|G2 | -1.549671 | 0.435253 |
Missing Data
Let's show a few convenient methods to deal with missing data in pandas:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})
df
| | A | B | C |
|-|:-|:-|:-|
|0 | 1.0 | 5.0 | 1 |
|1 | 2.0 | NaN | 2 |
|2 | NaN | NaN | 3 |
df.dropna()
| | A | B | C |
|-|:-|:-|:-|
|0 | 1.0 | 5.0 | 1 |
df.dropna(thresh=2)
| | A | B | C |
|-|:-|:-|:-|
|0 | 1.0 | 5.0 | 1 |
|1 | 2.0 | NaN | 2 |
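dropna can also work column-wise, and NaNs can be filled by propagating earlier values forward rather than with a constant. A short sketch with the same df (the forward-fill example is an addition, not from the original notes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# axis=1 drops *columns* containing any NaN (only C survives here)
df.dropna(axis=1)

# Forward-fill: propagate the last valid value down each column
filled = df.ffill()
```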
df.fillna(value='FILL VALUE')
| | A | B | C |
|-|:-|:-|:-|
|0 | 1 | 5 | 1 |
|1 | 2 | FILL VALUE | 2 |
|2 | FILL VALUE | FILL VALUE | 3 |
df['A'].fillna(value=df['A'].mean())
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
Groupby
The groupby method allows you to group rows of data together and call aggregate functions.
import pandas as pd
# Create dataframe
data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)
df
| | Company | Person | Sales |
|-|:-|:-|:-|
|0 | GOOG | Sam | 200 |
|1 | GOOG | Charlie | 120 |
|2 | MSFT | Amy | 340 |
|3 | MSFT | Vanessa | 124 |
|4 | FB | Carl | 243 |
|5 | FB | Sarah | 350 |
Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group based on Company. This will create a DataFrameGroupBy object:
You can save this object as a new variable:
by_comp = df.groupby("Company")
And then call aggregate methods off the object:
by_comp.mean()
| Company | Sales |
|-|:-|
|FB | 296.5 |
|GOOG | 160.0 |
|MSFT | 232.0 |
df.groupby('Company').mean()
| Company | Sales |
|-|:-|
|FB | 296.5 |
|GOOG | 160.0 |
|MSFT | 232.0 |
More examples of aggregate methods:
by_comp.std()
| Company | Sales |
|-|:-|
|FB | 75.660426 |
|GOOG | 56.568542 |
|MSFT | 152.735065 |
by_comp.min()
| Company | Person | Sales |
|-|:-|:-|
|FB | Carl | 243 |
|GOOG | Charlie | 120 |
|MSFT | Amy | 124 |
by_comp.max()
| Company | Person | Sales |
|-|:-|:-|
|FB | Sarah | 350 |
|GOOG | Sam | 200 |
|MSFT | Vanessa | 340 |
by_comp.count()
| Company | Person | Sales |
|-|:-|:-|
|FB | 2 | 2 |
|GOOG | 2 | 2 |
|MSFT | 2 | 2 |
by_comp.describe()
| | | Sales |
|-|:-|:-|
|FB | count | 2.000000 |
| | mean | 296.500000 |
| | std | 75.660426 |
| | min | 243.000000 |
| | 25% | 269.750000 |
| | 50% | 296.500000 |
| | 75% | 323.250000 |
| | max | 350.000000 |
|GOOG | count | 2.000000 |
| | mean | 160.000000 |
| | std | 56.568542 |
| | min | 120.000000 |
| | 25% | 140.000000 |
| | 50% | 160.000000 |
| | 75% | 180.000000 |
| | max | 200.000000 |
|MSFT | count | 2.000000 |
| | mean | 232.000000 |
| | std | 152.735065 |
| | min | 124.000000 |
| | 25% | 178.000000 |
| | 50% | 232.000000 |
| | 75% | 286.000000 |
| | max | 340.000000 |
by_comp.describe().transpose()
Company    FB                                                            GOOG         ...         MSFT
        count   mean        std    min     25%    50%     75%    max   count   mean  ...    75%    max  count   mean         std    min    25%    50%    75%    max
Sales     2.0  296.5  75.660426  243.0  269.75  296.5  323.25  350.0     2.0  160.0  ...  180.0  200.0    2.0  232.0  152.735065  124.0  178.0  232.0  286.0  340.0

1 rows × 24 columns
by_comp.describe().transpose()['GOOG']
| | count | mean | std | min | 25% | 50% | 75% | max |
|-|:-|:-|:-|:-|:-|:-|:-|:-|
|Sales | 2.0 | 160.0 | 56.568542 | 120.0 | 140.0 | 160.0 | 180.0 | 200.0 |
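If you need several aggregates at once, the .agg() method computes them in a single pass; a sketch with the same grouped data:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# Several aggregates of Sales per company in one call
summary = df.groupby('Company')['Sales'].agg(['mean', 'min', 'max'])
```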
Merging, Joining, and Concatenating
There are three main ways of combining DataFrames together: merging, joining, and concatenating. In this lecture we will discuss these three methods with examples.
Example DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df1
| | A | B | C | D |
|-|:-|:-|:-|:-|
|0 | A0 | B0 | C0 | D0 |
|1 | A1 | B1 | C1 | D1 |
|2 | A2 | B2 | C2 | D2 |
|3 | A3 | B3 | C3 | D3 |
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df2
| | A | B | C | D |
|-|:-|:-|:-|:-|
|4 | A4 | B4 | C4 | D4 |
|5 | A5 | B5 | C5 | D5 |
|6 | A6 | B6 | C6 | D6 |
|7 | A7 | B7 | C7 | D7 |
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
df3
| | A | B | C | D |
|-|:-|:-|:-|:-|
|8 | A8 | B8 | C8 | D8 |
|9 | A9 | B9 | C9 | D9 |
|10 | A10 | B10 | C10 | D10 |
|11 | A11 | B11 | C11 | D11 |
Concatenation
Concatenation basically glues DataFrames together. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:
pd.concat([df1, df2, df3])
| | A | B | C | D |
|-|:-|:-|:-|:-|
|0 | A0 | B0 | C0 | D0 |
|1 | A1 | B1 | C1 | D1 |
|2 | A2 | B2 | C2 | D2 |
|3 | A3 | B3 | C3 | D3 |
|4 | A4 | B4 | C4 | D4 |
|5 | A5 | B5 | C5 | D5 |
|6 | A6 | B6 | C6 | D6 |
|7 | A7 | B7 | C7 | D7 |
|8 | A8 | B8 | C8 | D8 |
|9 | A9 | B9 | C9 | D9 |
|10 | A10 | B10 | C10 | D10 |
|11 | A11 | B11 | C11 | D11 |
pd.concat([df1, df2, df3], axis=1)
| | A | B | C | D | A | B | C | D | A | B | C | D |
|-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
|0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
|1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
|2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
|3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
|4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
|5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
|6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
|7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
|8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A8 | B8 | C8 | D8 |
|9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A9 | B9 | C9 | D9 |
|10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A10 | B10 | C10 | D10 |
|11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A11 | B11 | C11 | D11 |
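pd.concat can also tag each input frame with a key, producing a hierarchical index that records which frame each row came from. A sketch with two small frames (abbreviated versions of the df1 and df2 above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A4', 'A5'], 'B': ['B4', 'B5']}, index=[4, 5])

# keys= labels each block, giving a MultiIndex of (key, original index)
stacked = pd.concat([df1, df2], keys=['first', 'second'])
```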
Example DataFrames
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
left
| | A | B | key |
|-|:-|:-|:-|
|0 | A0 | B0 | K0 |
|1 | A1 | B1 | K1 |
|2 | A2 | B2 | K2 |
|3 | A3 | B3 | K3 |
right
| | C | D | key |
|-|:-|:-|:-|
|0 | C0 | D0 | K0 |
|1 | C1 | D1 | K1 |
|2 | C2 | D2 | K2 |
|3 | C3 | D3 | K3 |
Merging
The merge function allows you to merge DataFrames together using similar logic to merging SQL tables. For example:
pd.merge(left, right, how='inner', on='key')
| | A | B | key | C | D |
|-|:-|:-|:-|:-|:-|
|0 | A0 | B0 | K0 | C0 | D0 |
|1 | A1 | B1 | K1 | C1 | D1 |
|2 | A2 | B2 | K2 | C2 | D2 |
|3 | A3 | B3 | K3 | C3 | D3 |
Or to show a more complicated example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
| | A | B | key1 | key2 | C | D |
|-|:-|:-|:-|:-|:-|:-|
|0 | A0 | B0 | K0 | K0 | C0 | D0 |
|1 | A2 | B2 | K1 | K0 | C1 | D1 |
|2 | A2 | B2 | K1 | K0 | C2 | D2 |
pd.merge(left, right, how='outer', on=['key1', 'key2'])
| | A | B | key1 | key2 | C | D |
|-|:-|:-|:-|:-|:-|:-|
|0 | A0 | B0 | K0 | K0 | C0 | D0 |
|1 | A1 | B1 | K0 | K1 | NaN | NaN |
|2 | A2 | B2 | K1 | K0 | C1 | D1 |
|3 | A2 | B2 | K1 | K0 | C2 | D2 |
|4 | A3 | B3 | K2 | K1 | NaN | NaN |
|5 | NaN | NaN | K2 | K0 | C3 | D3 |
pd.merge(left, right, how='right', on=['key1', 'key2'])
| | A | B | key1 | key2 | C | D |
|-|:-|:-|:-|:-|:-|:-|
|0 | A0 | B0 | K0 | K0 | C0 | D0 |
|1 | A2 | B2 | K1 | K0 | C1 | D1 |
|2 | A2 | B2 | K1 | K0 | C2 | D2 |
|3 | NaN | NaN | K2 | K0 | C3 | D3 |
pd.merge(left, right, how='left', on=['key1', 'key2'])
| | A | B | key1 | key2 | C | D |
|-|:-|:-|:-|:-|:-|:-|
|0 | A0 | B0 | K0 | K0 | C0 | D0 |
|1 | A1 | B1 | K0 | K1 | NaN | NaN |
|2 | A2 | B2 | K1 | K0 | C1 | D1 |
|3 | A2 | B2 | K1 | K0 | C2 | D2 |
|4 | A3 | B3 | K2 | K1 | NaN | NaN |
Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
left.join(right)
| | A | B | C | D |
|-|:-|:-|:-|:-|
|K0 | A0 | B0 | C0 | D0 |
|K1 | A1 | B1 | NaN | NaN |
|K2 | A2 | B2 | C2 | D2 |
left.join(right, how='outer')
| | A | B | C | D |
|-|:-|:-|:-|:-|
|K0 | A0 | B0 | C0 | D0 |
|K1 | A1 | B1 | NaN | NaN |
|K2 | A2 | B2 | C2 | D2 |
|K3 | NaN | NaN | C3 | D3 |
Operations
There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category.
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [444, 555, 666, 444], 'col3': ['abc', 'def', 'ghi', 'xyz']})
df.head()
| | col1 | col2 | col3 |
|-|:-|:-|:-|
|0 | 1 | 444 | abc |
|1 | 2 | 555 | def |
|2 | 3 | 666 | ghi |
|3 | 4 | 444 | xyz |
Info on Unique Values
df['col2'].unique()
array([444, 555, 666])
df['col2'].value_counts()
444    2
555    1
666    1
Name: col2, dtype: int64
Selecting Data
# Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1'] > 2) & (df['col2'] == 444)]
newdf
| | col1 | col2 | col3 |
|-|:-|:-|:-|
|3 | 4 | 444 | xyz |
Applying Functions
def times2(x):
    return x * 2
df['col1'].apply(times2)
0    2
1    4
2    6
3    8
Name: col1, dtype: int64
df['col3'].apply(len)
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
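apply also accepts a lambda, which is usually how one-off transformations are written:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# Same doubling as the times2 function above, written inline
doubled = df['col1'].apply(lambda x: x * 2)
```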
Permanently Removing a Column
del df['col1']
df
| | col2 | col3 |
|-|:-|:-|
|0 | 444 | abc |
|1 | 555 | def |
|2 | 666 | ghi |
|3 | 444 | xyz |
Get column and index names
df.columns
Index(['col2', 'col3'], dtype='object')
df.index
RangeIndex(start=0, stop=4, step=1)
Sorting and Ordering a DataFrame
df
| | col2 | col3 |
|-|:-|:-|
|0 | 444 | abc |
|1 | 555 | def |
|2 | 666 | ghi |
|3 | 444 | xyz |
df.sort_values(by='col2')  # inplace=False by default
| | col2 | col3 |
|-|:-|:-|
|0 | 444 | abc |
|3 | 444 | xyz |
|1 | 555 | def |
|2 | 666 | ghi |
Find Null Values or Check for Null Values
df.isnull()
| | col2 | col3 |
|-|:-|:-|
|0 | False | False |
|1 | False | False |
|2 | False | False |
|3 | False | False |
# Drop rows with NaN values
df.dropna()
| | col2 | col3 |
|-|:-|:-|
|0 | 444 | abc |
|1 | 555 | def |
|2 | 666 | ghi |
|3 | 444 | xyz |
Filling in NaN values with something else
df = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                   'col2': [np.nan, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})
df.head()
| | col1 | col2 | col3 |
|-|:-|:-|:-|
|0 | 1.0 | NaN | abc |
|1 | 2.0 | 555.0 | def |
|2 | 3.0 | 666.0 | ghi |
|3 | NaN | 444.0 | xyz |
df.fillna('FILL')
| | col1 | col2 | col3 |
|-|:-|:-|:-|
|0 | 1 | FILL | abc |
|1 | 2 | 555 | def |
|2 | 3 | 666 | ghi |
|3 | FILL | 444 | xyz |
data = {'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'two', 'one', 'one'],
        'C': ['x', 'y', 'x', 'y', 'x', 'y'],
        'D': [1, 3, 2, 5, 4, 1]}
df = pd.DataFrame(data)
df
| | A | B | C | D |
|-|:-|:-|:-|:-|
|0 | foo | one | x | 1 |
|1 | foo | one | y | 3 |
|2 | foo | two | x | 2 |
|3 | bar | two | y | 5 |
|4 | bar | one | x | 4 |
|5 | bar | one | y | 1 |
df.pivot_table(values='D', index=['A', 'B'], columns=['C'])
| A | B | x | y |
|-|:-|:-|:-|
|bar | one | 4.0 | 1.0 |
| | two | NaN | 5.0 |
|foo | one | 1.0 | 3.0 |
| | two | 2.0 | NaN |
CSV Data
df = pd.read_csv('example')
df
| | a | b | c | d |
|-|:-|:-|:-|:-|
|0 | 0 | 1 | 2 | 3 |
|1 | 4 | 5 | 6 | 7 |
|2 | 8 | 9 | 10 | 11 |
|3 | 12 | 13 | 14 | 15 |
CSV Output
df.to_csv('example', index=False)
Excel Data
Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images; a workbook containing images or macros may cause the read_excel method to crash.
pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')
| | a | b | c | d |
|-|:-|:-|:-|:-|
|0 | 0 | 1 | 2 | 3 |
|1 | 4 | 5 | 6 | 7 |
|2 | 8 | 9 | 10 | 11 |
|3 | 12 | 13 | 14 | 15 |
Excel Output
df.to_excel('Excel_Sample.xlsx', sheet_name='Sheet1')
HTML Data
You may need to install htmllib5,lxml, and BeautifulSoup4. In your terminal/command prompt run:
python -m pip install lxml
python -m pip install html5lib
python -m pip install BeautifulSoup4
Pandas can read tables out of HTML.
Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df[0]
| | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date | Loss Share Type | Agreement Terminated | Termination Date |
|-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
|0 | First CornerStone Bank | King of Prussia | PA | 35312 | First-Citizens Bank & Trust Company | May 6, 2016 | July 12, 2016 | none | NaN | NaN |
|1 | Trust Company Bank | Memphis | TN | 9956 | The Bank of Fayette County | April 29, 2016 | August 4, 2016 | none | NaN | NaN |
|2 | North Milwaukee State Bank | Milwaukee | WI | 20364 | First-Citizens Bank & Trust Company | March 11, 2016 | June 16, 2016 | none | NaN | NaN |
|3 | Hometown National Bank | Longview | WA | 35156 | Twin City Bank | October 2, 2015 | April 13, 2016 | none | NaN | NaN |
|4 | The Bank of Georgia | Peachtree City | GA | 35259 | Fidelity Bank | October 2, 2015 | April 13, 2016 | none | NaN | NaN |
|... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
|543 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | December 14, 2000 | March 17, 2005 | none | NaN | NaN |
|544 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | October 13, 2000 | March 17, 2005 | none | NaN | NaN |
SQL Data
Note
If you are completely unfamiliar with SQL you can check out my other course: "Complete SQL Bootcamp" to learn SQL.
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and reduce dependency on DB-specific APIs. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python's standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.
If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter which respect the Python DB-API.
See also the cookbook examples for some advanced strategies.
The key functions are:
- read_sql_table(table_name, con[, schema, ...]): Read SQL database table into a DataFrame.
- read_sql_query(sql, con[, index_col, ...]): Read SQL query into a DataFrame.
- read_sql(sql, con[, index_col, ...]): Read SQL query or database table into a DataFrame.
- DataFrame.to_sql(name, con[, flavor, ...]): Write records stored in a DataFrame to a SQL database.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
df.to_sql('data', engine)
sql_df = pd.read_sql('data', con=engine)
sql_df
| | index | a | b | c | d |
|-|:-|:-|:-|:-|:-|
|0 | 0 | 0 | 1 | 2 | 3 |
|1 | 1 | 4 | 5 | 6 | 7 |
|2 | 2 | 8 | 9 | 10 | 11 |
|3 | 3 | 12 | 13 | 14 | 15 |
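read_sql_query also accepts a plain DB-API connection, so for SQLite you don't even need SQLAlchemy. A self-contained sketch using the stdlib sqlite3 driver (the table name 'data' and the frame contents here are illustrative):

```python
import sqlite3
import pandas as pd

# Pandas accepts a raw sqlite3 connection for SQLite work
conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'a': [0, 4], 'b': [1, 5]})
df.to_sql('data', conn, index=False)

# Run an actual SQL query instead of reading the whole table
sql_df = pd.read_sql_query('SELECT a, b FROM data WHERE a > 0', conn)
```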