Data Overview, Slicing and Selection
import pandas as pd
stock_data = pd.read_csv("pair_ETF_0302.csv")
When you read a raw data file with pd.read_csv, you get a DataFrame.
type(stock_data)
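If you do not have the CSV file at hand, the same behaviour can be reproduced from an in-memory string. This is only a sketch: the rows below are made up, and only the column names (pairs, total_return, ave_return) are taken from the cells later in this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real pair_ETF_0302.csv is not shown here
csv_text = """pairs,total_return,ave_return
AAA-BBB,250000,0.02
CCC-DDD,-10000,-0.01
"""

# read_csv accepts any file-like object, not only a path
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))   # the result is a DataFrame
print(toy.shape)   # (number of rows, number of columns)
```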
Data Overview
Get an overview of the data you are working on. At this stage, you focus on getting a broad idea of the data's size, its data types and the information it contains.
stock_data.head() #This is a function call, remember the ()
stock_data.tail()
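Besides head() and tail(), a few other calls are commonly used to check size and data types. The sketch below uses a small made-up DataFrame named df (hypothetical values; only the column names come from this notebook), so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical toy data standing in for stock_data
df = pd.DataFrame({
    "pairs": ["AAA-BBB", "CCC-DDD", "EEE-FFF", "GGG-HHH"],
    "total_return": [250000.0, -10000.0, 120000.0, 310000.0],
    "ave_return": [0.02, -0.01, 0.01, 0.03],
})

print(df.shape)       # (rows, columns); an attribute, so no ()
print(df.dtypes)      # data type of each column
df.info()             # column names, non-null counts and memory usage
print(df.describe())  # summary statistics for the numeric columns
```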
You can access a column with either of the methods below.
stock_data.pairs.head()
stock_data["pairs"].head()
Either way, you get a Series. You can think of a Series as a single column of a DataFrame.
type(stock_data["pairs"])
You can access a specific cell like this.
stock_data["pairs"][0]
type(stock_data["pairs"][0])
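To make the column and cell access above concrete, here is a minimal sketch on made-up data (the values are hypothetical). It also shows .at, which is an alternative way to read a single cell by row label and column name.

```python
import pandas as pd

# Hypothetical toy data standing in for stock_data
df = pd.DataFrame({
    "pairs": ["AAA-BBB", "CCC-DDD", "EEE-FFF"],
    "ave_return": [0.02, -0.01, 0.01],
})

by_attribute = df.pairs      # attribute access; only works for valid Python names
by_bracket = df["pairs"]     # bracket access; works for any column name

first_cell = df["pairs"][0]  # column first, then the row labelled 0
same_cell = df.at[0, "pairs"]  # row label and column name in one step

print(type(by_bracket))
print(first_cell, same_cell)
```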
Slicing and Selection
You can access a certain portion of the data. Remember that indexing starts from 0.
stock_data[:].head() #Showing all slices
stock_data[2:].head() #start from the 3rd row
stock_data[:3] #the first three rows; the slice stops before the 4th row
stock_data["pairs"].head() # Usually I use column name to access column
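The slices above follow the usual Python rule that the start is included and the stop is excluded. A short sketch on made-up data (hypothetical values) shows which rows each slice keeps.

```python
import pandas as pd

# Hypothetical toy data with the default 0..3 index
df = pd.DataFrame({
    "pairs": ["AAA-BBB", "CCC-DDD", "EEE-FFF", "GGG-HHH"],
    "ave_return": [0.02, -0.01, 0.01, 0.03],
})

from_third = df[2:]   # rows at positions 2 and 3
first_three = df[:3]  # rows at positions 0, 1 and 2; position 3 is excluded

print(list(from_third.index))
print(list(first_three.index))
```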
More Slicing and Selection
This doesn't work: stock_data[0] looks up a column label, and the DataFrame has no column named 0, so pandas raises a KeyError.
stock_data[0] #raises KeyError because there is no column named 0
stock_data.iloc[0] #But you can access nth row using .iloc[n]
type(stock_data.iloc[0]) #it returns a Series
stock_data.iloc[0:2] #You can also feed a portion to .iloc and it will return Dataframe
stock_data.loc[[1,2]] #.loc can also be fed a list of index labels
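With the default 0, 1, 2, ... index, .iloc and .loc look interchangeable. The difference shows up once the index labels are not the row positions. The sketch below assumes a made-up DataFrame whose index is 10, 20, 30 (hypothetical data, chosen only to make the contrast visible).

```python
import pandas as pd

# Hypothetical toy data with a non-default index
df = pd.DataFrame(
    {"pairs": ["AAA-BBB", "CCC-DDD", "EEE-FFF"],
     "ave_return": [0.02, -0.01, 0.01]},
    index=[10, 20, 30],
)

first_row = df.iloc[0]       # position 0, which carries the label 10
by_label = df.loc[[20, 30]]  # rows whose labels are 20 and 30

by_position = df.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label_slice = df.loc[10:20]  # labels 10 through 20; the end label is included

print(first_row["pairs"])
print(list(by_label["pairs"]))
```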
Using .loc[], you can conduct more complex selection
stock_data.loc[:,"pairs"].head() # .loc can also be used to access columns
Remember to feed a list when you want to select multiple columns or rows
stock_data.loc[:,["pairs","total_return"]].head()
stock_data.loc[stock_data["ave_return"]>0].head()
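The condition inside .loc[] is itself a Boolean Series, one True or False per row. The sketch below, on made-up data (hypothetical values), builds that mask explicitly and also combines two conditions with a column selection.

```python
import pandas as pd

# Hypothetical toy data standing in for stock_data
df = pd.DataFrame({
    "pairs": ["AAA-BBB", "CCC-DDD", "EEE-FFF", "GGG-HHH"],
    "total_return": [250000.0, -10000.0, 120000.0, 310000.0],
    "ave_return": [0.02, -0.01, 0.01, 0.03],
})

mask = df["ave_return"] > 0  # Boolean Series aligned with df's index
positive = df.loc[mask]      # keeps only the rows where mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses
both = df.loc[
    (df["ave_return"] > 0) & (df["total_return"] > 200000),
    ["pairs", "total_return"],
]

print(mask.tolist())
print(both)
```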
.loc is for label-based access, which means you should pass index labels, or something that selects them such as a Boolean mask, into .loc[]
stock_data.head(3).loc[[True,True,False]] #A Boolean list can be fed into .loc; recent pandas versions require it to be as long as the index
For more complex selections, you can pass a lambda function to .loc. It is called with the DataFrame and should return a valid selector such as a Boolean mask.
stock_data.loc[lambda df: df["total_return"] > 200000]
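A lambda is most useful inside a method chain, where the intermediate DataFrame has no name to refer to. The sketch below is one possible illustration on made-up data; the column return_in_thousands is a hypothetical helper column, not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for stock_data
df = pd.DataFrame({
    "pairs": ["AAA-BBB", "CCC-DDD", "EEE-FFF", "GGG-HHH"],
    "total_return": [250000.0, -10000.0, 120000.0, 310000.0],
})

result = (
    df.assign(return_in_thousands=lambda d: d["total_return"] / 1000)
      # d is the DataFrame produced by assign(), which has no variable name
      .loc[lambda d: d["return_in_thousands"] > 200]
)

print(result)
```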
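One Last Thing
As good practice, you should always use .loc to set values. Chained access such as stock_data["ave_return"][0] = 0 first selects the column and then assigns into that intermediate result, which may be a temporary copy rather than the original DataFrame, so the assignment can fail to take effect. pandas warns about this pattern.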
stock_data.loc[0,"ave_return"] = 0
stock_data.head()
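The same .loc pattern works for a single cell and for every row that matches a condition. A minimal sketch on made-up data (hypothetical values):

```python
import pandas as pd

# Hypothetical toy data standing in for stock_data
df = pd.DataFrame({
    "pairs": ["AAA-BBB", "CCC-DDD", "EEE-FFF"],
    "total_return": [250000.0, -10000.0, 120000.0],
    "ave_return": [0.02, -0.01, 0.01],
})

# Set one cell: row label 0, column "ave_return"
df.loc[0, "ave_return"] = 0

# Set every row that matches a condition in a single statement
df.loc[df["total_return"] < 0, "ave_return"] = 0.0

print(df)
```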
Series & NumPy Array
For common dtypes such as numbers, a pandas Series is backed by a NumPy array. In the following demonstration, you will see that taking .values of a Series gives you an array.
stock_data.head()["pairs"].values
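As a small sketch of this on made-up numeric data (hypothetical values), the result of .values can be checked against numpy.ndarray directly. Series.to_numpy() is the method the pandas documentation recommends for the same conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; a numeric column is used on purpose
df = pd.DataFrame({
    "total_return": [250000.0, -10000.0, 120000.0, 310000.0],
})

arr = df["total_return"].values         # attribute access, no ()
same_arr = df["total_return"].to_numpy()  # method form of the same conversion

print(type(arr))
print(arr.sum())  # NumPy operations work directly on the array
```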