How to Correctly Install CuDF for Python

Introduction of CuDF

CuDF is a powerful tool in the era of big data. It uses the Cuda GPU computing framework to speed up data ETL and offers a Pandas-like interface. The tool is developed by the RAPIDS AI team.

You can check out their Git repo here.

I love the tool. It gives me a way to make full use of my expensive graphics card, which most of the time is only used for gaming. More importantly, for a company like Owler, which has to handle 14 million+ company profiles, even a basic data transformation task can take days. This tool can speed the process up by roughly 80x. GPU computing has long been the norm for ML/AI; CuDF makes it just as useful for the upstream part of the pipeline, the ETL. And ETL is in high demand at almost every company with digital capacity in the world.
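To give a feel for the Pandas-like interface, here is a minimal sketch (it assumes CuDF is already installed, and the file and column names are made up for illustration):

import cudf  # GPU DataFrame library from the RAPIDS team

gdf = cudf.read_csv("companies.csv")            # same call shape as pd.read_csv
big = gdf[gdf["employee_count"] > 100]          # pandas-style boolean filtering, run on the GPU
summary = big.groupby("country")["revenue"].mean()
print(summary.to_pandas())                      # bring the result back to the CPU as a pandas Series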

The Challenges

It’s nice to have this tool for our day-to-day data work. However, the convenience comes at a cost: the installation of CuDF is quite confusing and hard to follow. It also has OS and Python version limitations. Currently, it only works on Linux with Python 3.7+, and it only provides a conda-forge way to install; otherwise, you need to build from source.

The errors range from solving-environment failures and dependency conflicts to the inability to find the GPU, and so on. I have installed CuDF on a number of machines, including my personal desktop, AWS instances, and so on. Each time, I had to spend hours dealing with multiple kinds of errors and retrying again and again. When it finally worked, I didn’t know which step was the critical one because there were so many variables. Worst of all, when you hit a dependency conflict error, you have to wait a very long time, through four “solving environment” attempts, before conda displays the conflicting packages for you.

But the good news is that, after the most recent installation, I finally understand the cause of the complications, and I can summarize an easy-to-follow guide for anyone who wants to enjoy this tool.

In short, the key is to use Miniconda, or to create a fresh environment in Anaconda, for the installation.

Let me walk through the steps.

Installing the Nvidia Cuda framework (Ubuntu)

Installing Cuda is simple when you have the right machine. You can follow the guide here from the Nvidia official webpage. If you encounter an installation error, check that you selected the right architecture and that you meet the hardware/driver requirements.

However, if you have an older version of Cuda installed and wish to upgrade it, the Nvidia guide won’t help you. The correct way is to uninstall the older Cuda version before doing anything from the guide. The reason is that, at least on Ubuntu, the installation steps change your apt-get source list; once you do that, you will no longer be able to cleanly uninstall the older version, and it may cause conflicts.

To uninstall Cuda on Ubuntu, you can try the following steps.

  • Remove nvidia-cuda-toolkit
sudo apt-get remove nvidia-cuda-toolkit

You may want to use this command instead to remove all the dependencies as well.

sudo apt-get remove --auto-remove nvidia-cuda-toolkit
  • Remove Cuda
sudo apt-get remove cuda

If you forgot to remove the older version of Cuda before installing the new one, you will need to remove all of Cuda’s dependencies and start the new installation over.

sudo apt-get remove --auto-remove cuda
  • Install Cuda by following the Nvidia official guide linked above, then run the quick check below.
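Once that installation finishes, a quick sanity check (assuming a default Ubuntu install) is to confirm that the driver and the toolkit are both visible:

nvidia-smi
nvcc --version

nvidia-smi should list your GPU and driver version, and nvcc --version should print the Cuda compiler release you just installed. If nvcc is not found, add the Cuda bin directory (typically /usr/local/cuda/bin) to your PATH, as described in the post-installation steps of the Nvidia guide.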

Install CuDF

I highly recommend installing CuDF in Miniconda. This avoids most of the package dependency conflicts. If you do run into dependency conflicts, you will probably get the error below, or you will be left waiting forever at the final “solving environment” step.

Dependencies error

Miniconda is a much leaner version of Anaconda. It comes with very few packages installed out of the box, so the CuDF installation is much less likely to run into conflicts.

If you already have Anaconda installed and wish to keep it, you can try creating a new conda environment with no default packages for CuDF.

conda create --no-default-packages -n myenv python=3.7
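With the new environment activated, you can install CuDF from the RAPIDS conda channels. Treat the command below as a template rather than the exact line: the cudf, Python, and cudatoolkit versions have to match your setup, and the release selector on the RAPIDS site will generate the precise command for you.

conda activate myenv
conda install -c rapidsai -c nvidia -c conda-forge cudf python=3.7 cudatoolkit=10.2

If everything went well, importing the library should print a version number instead of an error:

python -c "import cudf; print(cudf.__version__)"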

Pandas — Data Overview, Slicing and Selection





In [1]:
import pandas as pd
In [2]:
stock_data = pd.read_csv("pair_ETF_0302.csv")

When you read from a raw data file, you get a DataFrame.

In [3]:
type(stock_data)
Out[3]:
pandas.core.frame.DataFrame

Data Overview

Get an overview of the data that you are working on. At this stage, you focus on getting a broad idea of the data size, data types, and general information.

In [4]:
stock_data.head() #This is a function call, remember the ()
Out[4]:
pairs ave_return total_return volatility sharp_ratio
0 ('NYT', 'BANC') -1683.602874 -15.167593 139.660600 -0.108603
1 ('EEA', 'ESS') 7.915179 0.121772 123.131991 0.000989
2 ('BXP', 'LFC') 1194.149695 11.593686 88.070115 0.131642
3 ('PCF', 'ANIK') 2.441353 0.021605 13.484533 0.001602
4 ('NDSN', 'POL') -293.861701 -3.719768 36.593718 -0.101650
In [5]:
stock_data.tail()
Out[5]:
pairs ave_return total_return volatility sharp_ratio
58587 ('INFO', 'MYF') 1111.644206 8.964873 84.062452 0.106645
58588 ('AL', 'CSL') 917.521116 7.399364 63.715749 0.116131
58589 ('PFL', 'DSW') -876.763054 -7.070670 49.698238 -0.142272
58590 ('HWBK', 'KRNY') -106.253729 -0.856885 22.987587 -0.037276
58591 ('AN', 'KAP') 36.888918 0.297491 7.444777 0.039960
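Besides head() and tail(), a few other one-liners give you the data size, data types, and general information mentioned above (a minimal sketch; output omitted):

stock_data.shape       # (number of rows, number of columns)
stock_data.dtypes      # the dtype of each column
stock_data.info()      # column names, non-null counts, dtypes, memory usage
stock_data.describe()  # summary statistics for the numeric columns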

You can access columns using the methods below.

In [6]:
stock_data.pairs.head()
Out[6]:
0    ('NYT', 'BANC')
1     ('EEA', 'ESS')
2     ('BXP', 'LFC')
3    ('PCF', 'ANIK')
4    ('NDSN', 'POL')
Name: pairs, dtype: object
In [7]:
stock_data["pairs"].head()
Out[7]:
0    ('NYT', 'BANC')
1     ('EEA', 'ESS')
2     ('BXP', 'LFC')
3    ('PCF', 'ANIK')
4    ('NDSN', 'POL')
Name: pairs, dtype: object

You will get a data type of Series. You can view a Series as a single column of a DataFrame.

In [8]:
type(stock_data["pairs"])
Out[8]:
pandas.core.series.Series

You can access a specific cell like this.

In [9]:
stock_data["pairs"][0]
Out[9]:
"('NYT', 'BANC')"
In [10]:
type(stock_data["pairs"][0])
Out[10]:
str

Slicing and Selection

You can access a certain portion of the data. Please note that the index starts from 0.

In [11]:
stock_data[:].head() #A full slice selects every row
Out[11]:
pairs ave_return total_return volatility sharp_ratio
0 ('NYT', 'BANC') -1683.602874 -15.167593 139.660600 -0.108603
1 ('EEA', 'ESS') 7.915179 0.121772 123.131991 0.000989
2 ('BXP', 'LFC') 1194.149695 11.593686 88.070115 0.131642
3 ('PCF', 'ANIK') 2.441353 0.021605 13.484533 0.001602
4 ('NDSN', 'POL') -293.861701 -3.719768 36.593718 -0.101650
In [12]:
stock_data[2:].head()  #start from the 3rd row
Out[12]:
pairs ave_return total_return volatility sharp_ratio
2 ('BXP', 'LFC') 1194.149695 11.593686 88.070115 0.131642
3 ('PCF', 'ANIK') 2.441353 0.021605 13.484533 0.001602
4 ('NDSN', 'POL') -293.861701 -3.719768 36.593718 -0.101650
5 ('DAKT', 'LPG') -578.896261 -5.122976 255.150999 -0.020078
6 ('AGR', 'MXIM') 2203.294225 21.600924 163.622809 0.132017
In [13]:
stock_data[:3] #get the first 3 rows (up to, but not including, index 3)
Out[13]:
pairs ave_return total_return volatility sharp_ratio
0 ('NYT', 'BANC') -1683.602874 -15.167593 139.660600 -0.108603
1 ('EEA', 'ESS') 7.915179 0.121772 123.131991 0.000989
2 ('BXP', 'LFC') 1194.149695 11.593686 88.070115 0.131642
In [14]:
stock_data["pairs"].head()  # Usually I use column name to access column
Out[14]:
0    ('NYT', 'BANC')
1     ('EEA', 'ESS')
2     ('BXP', 'LFC')
3    ('PCF', 'ANIK')
4    ('NDSN', 'POL')
Name: pairs, dtype: object

More Slicing and Selection

This doesn’t work because a single label inside [] is treated as a column lookup, and the DataFrame doesn’t have a column named 0.

In [15]:
stock_data[0] 
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-e3654389130b> in <module>()
----> 1 stock_data[0]

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2683             return self._getitem_multilevel(key)
   2684         else:
-> 2685             return self._getitem_column(key)
   2686 
   2687     def _getitem_column(self, key):

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   2690         # get column
   2691         if self.columns.is_unique:
-> 2692             return self._get_item_cache(key)
   2693 
   2694         # duplicate columns & possible reduce dimensionality

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   2484         res = cache.get(item)
   2485         if res is None:
-> 2486             values = self._data.get(item)
   2487             res = self._box_item_values(item, values)
   2488             cache[item] = res

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3063                 return self._engine.get_loc(key)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066 
   3067         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0
In [16]:
stock_data.iloc[0] #But you can access the nth row using .iloc[n]
Out[16]:
pairs           ('NYT', 'BANC')
ave_return              -1683.6
total_return           -15.1676
volatility              139.661
sharp_ratio           -0.108603
Name: 0, dtype: object
In [17]:
type(stock_data.iloc[0]) #it returns a Series
Out[17]:
pandas.core.series.Series
In [18]:
stock_data.iloc[0:2] #You can also feed a slice to .iloc, and it will return a DataFrame
Out[18]:
pairs ave_return total_return volatility sharp_ratio
0 ('NYT', 'BANC') -1683.602874 -15.167593 139.660600 -0.108603
1 ('EEA', 'ESS') 7.915179 0.121772 123.131991 0.000989
In [19]:
stock_data.loc[[1,2]] #.loc can also be fed a list of index labels
Out[19]:
pairs ave_return total_return volatility sharp_ratio
1 ('EEA', 'ESS') 7.915179 0.121772 123.131991 0.000989
2 ('BXP', 'LFC') 1194.149695 11.593686 88.070115 0.131642

Using .loc[], you can conduct more complex selections.

In [20]:
stock_data.loc[:,"pairs"].head() # .loc can also be used to access columns 
Out[20]:
0    ('NYT', 'BANC')
1     ('EEA', 'ESS')
2     ('BXP', 'LFC')
3    ('PCF', 'ANIK')
4    ('NDSN', 'POL')
Name: pairs, dtype: object

Remember to feed a list when you want to select multiple columns or rows

In [21]:
stock_data.loc[:,["pairs","total_return"]].head()
Out[21]:
pairs total_return
0 ('NYT', 'BANC') -15.167593
1 ('EEA', 'ESS') 0.121772
2 ('BXP', 'LFC') 11.593686
3 ('PCF', 'ANIK') 0.021605
4 ('NDSN', 'POL') -3.719768
In [22]:
stock_data.loc[stock_data["ave_return"]>0].head()
Out[22]:
pairs ave_return total_return volatility sharp_ratio
1 ('EEA', 'ESS') 7.915179 0.121772 123.131991 0.000989
2 ('BXP', 'LFC') 1194.149695 11.593686 88.070115 0.131642
3 ('PCF', 'ANIK') 2.441353 0.021605 13.484533 0.001602
6 ('AGR', 'MXIM') 2203.294225 21.600924 163.622809 0.132017
8 ('PUK', 'UTX') 184.141559 1.629571 28.527874 0.057122

.loc is for label-based access, which means you feed index labels (or a boolean mask over the index) into .loc[]

In [23]:
stock_data.loc[[True,True,False]] #A boolean mask can also be fed into .loc
Out[23]:
pairs ave_return total_return volatility sharp_ratio
0 ('NYT', 'BANC') -1683.602874 -15.167593 139.660600 -0.108603
1 ('EEA', 'ESS') 7.915179 0.121772 123.131991 0.000989

For more complex selections, you can use a lambda function.

In [24]:
stock_data.loc[lambda stock_data:stock_data["total_return"]>200000]
Out[24]:
pairs ave_return total_return volatility sharp_ratio
4684 ('ITG', 'OIS') inf inf NaN NaN
12016 ('LEA', 'CTX') 1.873543e+08 1.510922e+06 1.684723e+07 0.089684
24034 ('ENVA', 'CTX') 1.718711e+08 1.386057e+06 1.537679e+07 0.090140
34122 ('TVPT', 'FULT') inf inf NaN NaN
42235 ('FOE', 'AOD') inf inf NaN NaN
42749 ('ALTR', 'EPD') inf inf NaN NaN
46295 ('FORM', 'AMSF') inf inf NaN NaN
46678 ('GS', 'ONB') inf inf NaN NaN
54771 ('NOV', 'CTX') 1.789989e+08 1.443540e+06 1.601276e+07 0.090149

One Last Thing

As a good practice, you should always use .loc to set values. If you use direct (chained) access instead, like stock_data["pairs"][0], pandas may hand you a copy of the data rather than a view of the original DataFrame, and the assignment will not stick (see the sketch after the cells below).

In [25]:
stock_data.loc[0,"ave_return"] = 0
In [26]:
stock_data.head()
Out[26]:
pairs ave_return total_return volatility sharp_ratio
0 ('NYT', 'BANC') 0.000000 -15.167593 139.660600 -0.108603
1 ('EEA', 'ESS') 7.915179 0.121772 123.131991 0.000989
2 ('BXP', 'LFC') 1194.149695 11.593686 88.070115 0.131642
3 ('PCF', 'ANIK') 2.441353 0.021605 13.484533 0.001602
4 ('NDSN', 'POL') -293.861701 -3.719768 36.593718 -0.101650
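For contrast, here is a minimal sketch of the chained-assignment pattern to avoid (not from the original notebook). Depending on your pandas version, it may modify a temporary copy and/or emit a SettingWithCopyWarning instead of reliably updating stock_data:

# Chained indexing: stock_data["ave_return"] may return a copy,
# so assigning into it is not guaranteed to touch stock_data itself.
stock_data["ave_return"][0] = 0

# Preferred: a single .loc call that addresses row and column together.
stock_data.loc[0, "ave_return"] = 0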

Series & Numpy Array

Internally, a Pandas Series is backed by a NumPy array. In the following demonstration, you can see that taking .values of a Series gives you an array.

In [30]:
stock_data.head()["pairs"].values
Out[30]:
array(["('NYT', 'BANC')", "('EEA', 'ESS')", "('BXP', 'LFC')",
       "('PCF', 'ANIK')", "('NDSN', 'POL')"], dtype=object)

Pandas — Reading Raw Data

Pandas is a powerful data organization and analysis toolbox in Python. It has functions and methods that can help you clean, organize, and present data in a much nicer way than plain Python data structures.

I will not be able to cover all Pandas functions/methods; instead, I will try to uncover the logic and methodology underneath the package, so you don’t need to remember everything and can still get things done efficiently.

First of all, I will talk about reading raw data from various sources using Pandas.

As you can see, the pd.read_xxx family of functions (pd.read_csv, pd.read_excel, pd.read_json, pd.read_sql, and so on) lets you read various data file formats into Pandas.

Let’s take a closer look at read_csv, since CSV is the most common raw data file type.

One trick to get a first look at a method/function that you are using is to press the TAB key, which will show you all the available options.

Usually, you can just feed the file path to read the file.

But sometimes you get a broken data frame, and those reading options can help you get a better result.
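As an illustration, here are the options I reach for most often when a file does not load cleanly. The file name, delimiter, and column names below are made up, but every keyword is a standard pd.read_csv option:

import pandas as pd

df = pd.read_csv(
    "raw_export.txt",
    sep="|",                       # the file is pipe-delimited rather than comma-delimited
    header=None,                   # the file has no header row...
    names=["ticker", "price"],     # ...so provide the column names yourself
    na_values=["N/A", "missing"],  # extra strings to treat as NaN
    encoding="latin-1",            # handle non-UTF-8 exports
    nrows=1000,                    # read only the first 1,000 rows while debugging
)
print(df.head())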