Pandas — Data Overviewing, Slicing and Selection




Untitled







Data Overview, Slicing and Selection

In [1]:
import pandas as pd
In [2]:
stock_data = pd.read_csv("pair_ETF_0302.csv")

When you read from raw data file, you get a Dataframe.

In [3]:
type(stock_data)
Out[3]:
pandas.core.frame.DataFrame

Data Overview

Get an overview of the data that you are working on. In this stage, you will focus on getting a board idea of data size, data type and information.

In [4]:
stock_data.head() #This is a function call, remember the ()
Out[4]:
pairs ave_return total_return volatility sharp_ratio
0 (‘NYT’, ‘BANC’) -1683.602874 -15.167593 139.660600 -0.108603
1 (‘EEA’, ‘ESS’) 7.915179 0.121772 123.131991 0.000989
2 (‘BXP’, ‘LFC’) 1194.149695 11.593686 88.070115 0.131642
3 (‘PCF’, ‘ANIK’) 2.441353 0.021605 13.484533 0.001602
4 (‘NDSN’, ‘POL’) -293.861701 -3.719768 36.593718 -0.101650
In [5]:
stock_data.tail()
Out[5]:
pairs ave_return total_return volatility sharp_ratio
58587 (‘INFO’, ‘MYF’) 1111.644206 8.964873 84.062452 0.106645
58588 (‘AL’, ‘CSL’) 917.521116 7.399364 63.715749 0.116131
58589 (‘PFL’, ‘DSW’) -876.763054 -7.070670 49.698238 -0.142272
58590 (‘HWBK’, ‘KRNY’) -106.253729 -0.856885 22.987587 -0.037276
58591 (‘AN’, ‘KAP’) 36.888918 0.297491 7.444777 0.039960

You can access columns the below methods

In [6]:
stock_data.pairs.head()
Out[6]:
0    ('NYT', 'BANC')
1     ('EEA', 'ESS')
2     ('BXP', 'LFC')
3    ('PCF', 'ANIK')
4    ('NDSN', 'POL')
Name: pairs, dtype: object
In [7]:
stock_data["pairs"].head()
Out[7]:
0    ('NYT', 'BANC')
1     ('EEA', 'ESS')
2     ('BXP', 'LFC')
3    ('PCF', 'ANIK')
4    ('NDSN', 'POL')
Name: pairs, dtype: object

You will get a data type of Series. You can view Series as a single unit of Dataframe.

In [8]:
type(stock_data["pairs"])
Out[8]:
pandas.core.series.Series

You can access specific cell like this.

In [9]:
stock_data["pairs"][0]
Out[9]:
"('NYT', 'BANC')"
In [10]:
type(stock_data["pairs"][0])
Out[10]:
str

Slicing and Selection

You can access a certain portion of data. Please remind that index starts from 0.

In [11]:
stock_data[:].head() #Showing all slices
Out[11]:
pairs ave_return total_return volatility sharp_ratio
0 (‘NYT’, ‘BANC’) -1683.602874 -15.167593 139.660600 -0.108603
1 (‘EEA’, ‘ESS’) 7.915179 0.121772 123.131991 0.000989
2 (‘BXP’, ‘LFC’) 1194.149695 11.593686 88.070115 0.131642
3 (‘PCF’, ‘ANIK’) 2.441353 0.021605 13.484533 0.001602
4 (‘NDSN’, ‘POL’) -293.861701 -3.719768 36.593718 -0.101650
In [12]:
stock_data[2:].head()  #start from the 3rd row
Out[12]:
pairs ave_return total_return volatility sharp_ratio
2 (‘BXP’, ‘LFC’) 1194.149695 11.593686 88.070115 0.131642
3 (‘PCF’, ‘ANIK’) 2.441353 0.021605 13.484533 0.001602
4 (‘NDSN’, ‘POL’) -293.861701 -3.719768 36.593718 -0.101650
5 (‘DAKT’, ‘LPG’) -578.896261 -5.122976 255.150999 -0.020078
6 (‘AGR’, ‘MXIM’) 2203.294225 21.600924 163.622809 0.132017
In [13]:
stock_data[:3] #get until the 4th row
Out[13]:
pairs ave_return total_return volatility sharp_ratio
0 (‘NYT’, ‘BANC’) -1683.602874 -15.167593 139.660600 -0.108603
1 (‘EEA’, ‘ESS’) 7.915179 0.121772 123.131991 0.000989
2 (‘BXP’, ‘LFC’) 1194.149695 11.593686 88.070115 0.131642
In [14]:
stock_data["pairs"].head()  # Usually I use column name to access column
Out[14]:
0    ('NYT', 'BANC')
1     ('EEA', 'ESS')
2     ('BXP', 'LFC')
3    ('PCF', 'ANIK')
4    ('NDSN', 'POL')
Name: pairs, dtype: object

More Slicing and Selection

This doesn’t work because it’s accessing the index, and the Dataframe doesn’t have the index named 0

In [15]:
stock_data[0] 
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-e3654389130b> in <module>()
----> 1 stock_data[0]

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2683             return self._getitem_multilevel(key)
   2684         else:
-> 2685             return self._getitem_column(key)
   2686 
   2687     def _getitem_column(self, key):

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   2690         # get column
   2691         if self.columns.is_unique:
-> 2692             return self._get_item_cache(key)
   2693 
   2694         # duplicate columns & possible reduce dimensionality

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   2484         res = cache.get(item)
   2485         if res is None:
-> 2486             values = self._data.get(item)
   2487             res = self._box_item_values(item, values)
   2488             cache[item] = res

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3063                 return self._engine.get_loc(key)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066 
   3067         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0
In [16]:
stock_data.iloc[0] #But you can access nth row using .iloc[n]
Out[16]:
pairs           ('NYT', 'BANC')
ave_return              -1683.6
total_return           -15.1676
volatility              139.661
sharp_ratio           -0.108603
Name: 0, dtype: object
In [17]:
type(stock_data.iloc[0]) #it returns a Series
Out[17]:
pandas.core.series.Series
In [18]:
stock_data.iloc[0:2] #You can also feed a portion to .iloc and it will return Dataframe
Out[18]:
pairs ave_return total_return volatility sharp_ratio
0 (‘NYT’, ‘BANC’) -1683.602874 -15.167593 139.660600 -0.108603
1 (‘EEA’, ‘ESS’) 7.915179 0.121772 123.131991 0.000989
In [19]:
stock_data.loc[[1,2]] #.loc can also be feeded a list of index
Out[19]:
pairs ave_return total_return volatility sharp_ratio
1 (‘EEA’, ‘ESS’) 7.915179 0.121772 123.131991 0.000989
2 (‘BXP’, ‘LFC’) 1194.149695 11.593686 88.070115 0.131642

Using .loc[], you can conduct more complex selection

In [20]:
stock_data.loc[:,"pairs"].head() # .loc can also be used to access columns 
Out[20]:
0    ('NYT', 'BANC')
1     ('EEA', 'ESS')
2     ('BXP', 'LFC')
3    ('PCF', 'ANIK')
4    ('NDSN', 'POL')
Name: pairs, dtype: object

Remember to feed a list when you want to select multiple columns or rows

In [21]:
stock_data.loc[:,["pairs","total_return"]].head()
Out[21]:
pairs total_return
0 (‘NYT’, ‘BANC’) -15.167593
1 (‘EEA’, ‘ESS’) 0.121772
2 (‘BXP’, ‘LFC’) 11.593686
3 (‘PCF’, ‘ANIK’) 0.021605
4 (‘NDSN’, ‘POL’) -3.719768
In [22]:
stock_data.loc[stock_data["ave_return"]>0].head()
Out[22]:
pairs ave_return total_return volatility sharp_ratio
1 (‘EEA’, ‘ESS’) 7.915179 0.121772 123.131991 0.000989
2 (‘BXP’, ‘LFC’) 1194.149695 11.593686 88.070115 0.131642
3 (‘PCF’, ‘ANIK’) 2.441353 0.021605 13.484533 0.001602
6 (‘AGR’, ‘MXIM’) 2203.294225 21.600924 163.622809 0.132017
8 (‘PUK’, ‘UTX’) 184.141559 1.629571 28.527874 0.057122

.loc is for index access, which means you should feed index or index generator into .iloc[]

In [23]:
stock_data.loc[[True,True,False]] #Boolean can be feeded into .loc, empty means false
Out[23]:
pairs ave_return total_return volatility sharp_ratio
0 (‘NYT’, ‘BANC’) -1683.602874 -15.167593 139.660600 -0.108603
1 (‘EEA’, ‘ESS’) 7.915179 0.121772 123.131991 0.000989

To use more complex selection, you can use lambda functions

In [24]:
stock_data.loc[lambda stock_data:stock_data["total_return"]>200000]
Out[24]:
pairs ave_return total_return volatility sharp_ratio
4684 (‘ITG’, ‘OIS’) inf inf NaN NaN
12016 (‘LEA’, ‘CTX’) 1.873543e+08 1.510922e+06 1.684723e+07 0.089684
24034 (‘ENVA’, ‘CTX’) 1.718711e+08 1.386057e+06 1.537679e+07 0.090140
34122 (‘TVPT’, ‘FULT’) inf inf NaN NaN
42235 (‘FOE’, ‘AOD’) inf inf NaN NaN
42749 (‘ALTR’, ‘EPD’) inf inf NaN NaN
46295 (‘FORM’, ‘AMSF’) inf inf NaN NaN
46678 (‘GS’, ‘ONB’) inf inf NaN NaN
54771 (‘NOV’, ‘CTX’) 1.789989e+08 1.443540e+06 1.601276e+07 0.090149

One Last Thing

As a good practice, you should always use .loc to set value. Because if you use direct access, like stock_data[“pairs”], Python will return a copy of the Dataframe, which is not changeable.

In [25]:
stock_data.loc[0,"ave_return"] = 0
In [26]:
stock_data.head()
Out[26]:
pairs ave_return total_return volatility sharp_ratio
0 (‘NYT’, ‘BANC’) 0.000000 -15.167593 139.660600 -0.108603
1 (‘EEA’, ‘ESS’) 7.915179 0.121772 123.131991 0.000989
2 (‘BXP’, ‘LFC’) 1194.149695 11.593686 88.070115 0.131642
3 (‘PCF’, ‘ANIK’) 2.441353 0.021605 13.484533 0.001602
4 (‘NDSN’, ‘POL’) -293.861701 -3.719768 36.593718 -0.101650

Series & Numpy Array

Pandas’s Series is internally Numpy Array. In the following demonstration, you will see if you take .values of the Series, you get an array.

In [30]:
stock_data.head()["pairs"].values
Out[30]:
array(["('NYT', 'BANC')", "('EEA', 'ESS')", "('BXP', 'LFC')",
       "('PCF', 'ANIK')", "('NDSN', 'POL')"], dtype=object)