Run Regression in Python with the statsmodels Package

Run Regression
In [9]:
from statsmodels import api as sm
from my_libs import *

Regress the VIX index on SPY returns

  • Need to convert the result into an np.array
  • Need to cast the type to float
In [51]:
spy = get_price_data(["SPY"],method='day',back_day=20).dropna().Return.values.astype(float)
spy_ = spy*30
All price data of Close is actually Adj Close
Connection Successful
Finished SPY

Constructed a model of vix = intercept + b0 * spy + b1 * spy_, where spy_ = spy * 30 is perfectly collinear with spy

In [52]:
ip = pd.DataFrame({"spy":spy,"spy_":spy_})
dp = get_price_data(["^VIX"],method='day',back_day=20).dropna().Return.values.astype(float)
All price data of Close is actually Adj Close
Connection Successful
no data for ^VIX
'NoneType' object has no attribute 'index'
switching to realtimeday method
Finished ^VIX
In [53]:
ip = sm.add_constant(ip)
/home/ken/.local/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
In [54]:
sm.OLS(dp,ip).fit().summary()
/home/ken/.local/lib/python2.7/site-packages/scipy/stats/stats.py:1416: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
  "anyway, n=%i" % int(n))
Out[54]:
OLS Regression Results
Dep. Variable: y R-squared: 0.737
Model: OLS Adj. R-squared: 0.713
Method: Least Squares F-statistic: 30.80
Date: Sun, 04 Aug 2019 Prob (F-statistic): 0.000173
Time: 19:40:53 Log-Likelihood: 25.241
No. Observations: 13 AIC: -46.48
Df Residuals: 11 BIC: -45.35
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.0033 0.011 0.305 0.766 -0.021 0.027
spy -0.0109 0.002 -5.550 0.000 -0.015 -0.007
spy_ -0.3256 0.059 -5.550 0.000 -0.455 -0.196
Omnibus: 9.222 Durbin-Watson: 1.071
Prob(Omnibus): 0.010 Jarque-Bera (JB): 4.912
Skew: -1.262 Prob(JB): 0.0858
Kurtosis: 4.641 Cond. No. 5.85e+17


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.82e-35. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
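That warning is expected here: spy_ is exactly 30 × spy, so the design matrix is rank-deficient. A minimal numpy sketch with synthetic returns (hypothetical values, not the SPY data above) reproduces the symptom:

```python
import numpy as np

rng = np.random.default_rng(0)
spy = rng.normal(0, 0.01, 13)      # 13 synthetic daily returns
spy_ = spy * 30                    # perfectly collinear copy

# Design matrix with a constant column, mirroring sm.add_constant(ip)
X = np.column_stack([np.ones_like(spy), spy, spy_])

print(np.linalg.matrix_rank(X))    # 2, not 3: one column is redundant
print(np.linalg.cond(X))           # enormous condition number, like Cond. No. above
```

Whenever one regressor is an exact linear function of another, OLS cannot attribute the effect to either one, which is why both slope estimates above share the same t-statistic.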

OLS Regression Model

We will talk about the Ordinary Least Squares (OLS) model in Python. In this article, we will just go over how to run OLS in Python without explaining how to interpret the results.

Here, we will use the sklearn and statsmodels packages to perform OLS modeling and compare the differences.

In [105]:
from sklearn import linear_model as lm
import statsmodels.api as sm
from data_source_lib import *
from matplotlib import pyplot as pl

Use our data source class to get AAPL and SPY daily stock prices

In [121]:
data_source = get_stock_data(tic_list=["AAPL","SPY"],freq = "daily")
data = data_source.get_ondemand_data()

# We can screen each stock by ticker names
AAPL = data[data.Ticker=="AAPL"]
SPY = data[data.Ticker=="SPY"]
Finished SPY
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.11it/s]

First, we will use the sklearn package

In [109]:
# define the instance 
reg = lm.LinearRegression()
In [122]:
# Before applying the data, we should 
# turn it into numpy array

AAPL = np.array(AAPL[["close"]]) # we have imported numpy in our data source library
SPY = np.array(SPY[["close"]])

# The reason I use SPY[["close"]] is to get 2D array
In [123]:
reg.fit(X=AAPL,y=SPY)
Out[123]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [124]:
reg.coef_
Out[124]:
array([[0.39143143]])
In [125]:
reg.intercept_
Out[125]:
array([199.09234191])
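To sanity-check the fitting procedure, here is a minimal sketch on synthetic data (not the AAPL/SPY series) where the true slope and intercept are known in advance:

```python
import numpy as np
from sklearn import linear_model as lm

rng = np.random.default_rng(42)
x = rng.uniform(100, 200, 200).reshape(-1, 1)  # 2D array, as reg.fit expects
y = 0.4 * x + 199.0                            # known slope and intercept, no noise

reg = lm.LinearRegression()
reg.fit(X=x, y=y)
print(reg.coef_)       # recovers ~0.4
print(reg.intercept_)  # recovers ~199.0
```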

However, the sklearn package doesn’t offer the full statistical information for the OLS model, so we should use statsmodels instead for more detail.

Second, the statsmodels Package

In [133]:
# This is to add a constant into the independent side of the model

AAPL2 = sm.add_constant(AAPL)
In [134]:
model = sm.OLS(SPY,AAPL2)
In [135]:
model = model.fit()
In [136]:
model.summary()
Out[136]:
OLS Regression Results
Dep. Variable: y R-squared: 0.639
Model: OLS Adj. R-squared: 0.636
Method: Least Squares F-statistic: 221.5
Date: Sun, 28 Oct 2018 Prob (F-statistic): 1.89e-29
Time: 23:33:17 Log-Likelihood: -383.96
No. Observations: 127 AIC: 771.9
Df Residuals: 125 BIC: 777.6
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 199.0923 5.344 37.257 0.000 188.516 209.668
x1 0.3914 0.026 14.882 0.000 0.339 0.443
Omnibus: 35.484 Durbin-Watson: 0.103
Prob(Omnibus): 0.000 Jarque-Bera (JB): 59.620
Skew: -1.304 Prob(JB): 1.13e-13
Kurtosis: 5.114 Cond. No. 2.44e+03

Graphing/Plotting




There is a lot to say about graphing, or plotting. The package we use in Python is matplotlib, versus ggplot in R. To illustrate plotting, I also import numpy here to create some sample datasets.

In [4]:
from matplotlib import pyplot as plt
import numpy as np

You can simply create plots like this:

In [28]:
x = np.linspace(-np.pi, np.pi, 255, endpoint=True)
# numpy's linspace function is pretty good at creating an x axis

y = np.sin(x)
plt.plot(x,y)
plt.show()
In [32]:
# Of course you can overlap plots

x =np.linspace(-np.pi, np.pi, 255,endpoint=True)
y = np.sin(x)
z = 2*x
plt.plot(x,y)
plt.plot(x,z)
plt.show()

Just make sure both plot calls run before plt.show() and the lines will be overlaid on the same axes. The x and y arrays in each call must have matching lengths, otherwise matplotlib will raise an error.

If you want to customize the output, you need to do more.

Creating Subplot

You can have more than one plot on one canvas. The way to control this is the subplot() method.

The syntax is plt.subplot(nrows, ncols, index).

The top left plot of a 2×2 grid is plt.subplot(221).
The plot to its right is plt.subplot(222).

In [40]:
my_plot = plt.subplot(221) 
my_plot.plot(x,y)
# usually we store it in a variable for further formatting

my_plot2=plt.subplot(222) 
my_plot2.plot(x,z)

plt.show()

Setting The Plot Space

In [42]:
# Set the canvas
# The values in figsize are (width, height) in inches
plt.figure(figsize=(8,5), dpi=80)

my_plot = plt.subplot(111)
my_plot.plot(x,z)

# You can also set how the plot is being framed

my_plot.spines['right'].set_color('none')
my_plot.spines['top'].set_color('none')
my_plot.xaxis.set_ticks_position('bottom')
my_plot.yaxis.set_ticks_position('left')

# This property sets where the axis spines are placed
my_plot.spines['left'].set_position(('axes',0))
my_plot.spines['bottom'].set_position(("axes",0))

##### What if we change "axes" to "data"?

plt.show()
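The commented question above can be answered with a short off-screen sketch (the Agg backend is assumed so this runs without a display): 'axes' coordinates are fractions of the axes box, while 'data' coordinates pin the spine to a value on the other axis.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
from matplotlib import pyplot as plt

ax = plt.subplot(111)
ax.plot([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

# 'axes' coordinates: 0 is the left/bottom edge of the axes box
ax.spines['left'].set_position(('axes', 0))
# 'data' coordinates: ('data', 0) draws the bottom spine through y=0,
# so the x axis passes through the data origin
ax.spines['bottom'].set_position(('data', 0))

print(ax.spines['bottom'].get_position())
```

With ('data', 0) the spines cross at the origin of the data, which is why the next example's sine curve appears centered on its axes.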

And More

In [40]:
# Set the canvas
# The values in figsize are (width, height) in inches
plt.figure(figsize=(8,5), dpi=80)

my_plot = plt.subplot(111)

# You can also set color, line width, style and label
my_plot.plot(x, y, color="red", linewidth=1.5, linestyle="-", label="sine")

# You can also set how the plot is being framed

my_plot.spines['right'].set_color('none')
my_plot.spines['top'].set_color('none')
my_plot.xaxis.set_ticks_position('bottom')
my_plot.yaxis.set_ticks_position('left')

# This property sets where the axis spines are placed
my_plot.spines['left'].set_position(('data',0))
my_plot.spines['bottom'].set_position(("data",0))


# I can also manipulate the axes
plt.xlim(x.min()*1.1, x.max()*1.1)  # set limits of the current axis
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi],
           [r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])
plt.ylim(y.min()*1.1, y.max()*1.1)
plt.yticks([-1, 0, +1])

# annotate a specific point
plt.annotate(r'$\sin(\frac{\pi}{2})=1$',
             xy=(np.pi/2, 1), xycoords='data',
             xytext=(60, 40), textcoords='offset points', fontsize=16,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))

plt.show()

Scatter Plot and More

In [44]:
import time
np.random.seed(int(time.time()))
#trial = [i for i in np.random.rand(100) ]
trial = np.array(np.random.rand(100))
y = trial *2
plt.scatter(trial,y)

# we can save the picture file
plt.savefig("test.png",dpi=72)

Histogram

Let’s get some finance data for this example

In [1]:
from data_source_lib import *

# import our magic library
In [11]:
get_ins = get_stock_data(["AAPL"],freq= "daily",day_range=300)
my_data = get_ins.get_ondemand_data()["close"]
Finished AAPL
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.07it/s]
In [26]:
plt.figure(figsize=(8,5), dpi=80)
my_hist = plt.hist(my_data)
plt.xticks(range(50,300,10))
plt.show()

Technical Analysis

You may have heard of technical analysts. Technical indicators are the basis of quantitative analysis of the stock market. There is a package in Python, talib, that makes calculating all the technical indicators much easier. I chose some of my favorite indicators and integrated them into my code library.

Applying talib is a little bit tricky, but take a look at this one line of code.

price.loc[price.Ticker==i,"ADXR"]= ta.ADXR(price.loc[price.Ticker==i].High.values, price.loc[price.Ticker==i].Low.values, price.loc[price.Ticker==i].Close.values, timeperiod=14)

The variable price is a DataFrame returned from our data-getting object. Since we only want to analyze one stock’s time series at a time, we filter for one ticker per calculation using price.Ticker == i. The ADXR (Average Directional Movement Index Rating) indicator then takes the High, Low, and Close series; basically, pass them into the ta.ADXR() function.
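The key trick in that line is boolean-mask assignment: compute an indicator on one ticker's rows and write it back to exactly those rows. A minimal pandas sketch (fabricated numbers, with a plain rolling mean standing in for the talib call so it runs without TA-Lib):

```python
import pandas as pd

price = pd.DataFrame({
    "Ticker": ["AAPL"] * 4 + ["SPY"] * 4,
    "Close":  [10.0, 11.0, 12.0, 13.0, 100.0, 101.0, 102.0, 103.0],
})

for i in set(price.Ticker):
    mask = price.Ticker == i
    # rolling(2).mean() stands in for a talib indicator;
    # the result is assigned back to the masked rows only
    price.loc[mask, "MA_fast"] = price.loc[mask].Close.rolling(2).mean()

print(price)
```

Because the left and right sides of the assignment share the same row index, pandas aligns the values automatically, and each ticker's rolling window never leaks into another ticker's rows.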

For more information about the talib, check TA-Lib : Technical Analysis Library

Here’s the full picture of the function.

 











In [16]:
import talib as ta
import data_source_lib as da
In [15]:
get_data = da.get_stock_data(["AAPL"],freq = "daily")
price = get_data.get_ondemand_data()
Finished AAPL
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.56it/s]
In [40]:
def get_technicals(price):
    import pandas as pd
    import tqdm
    from IPython.display import clear_output
    if not isinstance(price, pd.DataFrame):
        raise TypeError("Please feed a DataFrame object")
    for i in tqdm.tqdm(list(set(price.Ticker))):
        mask = price.Ticker == i
        price.loc[mask, "ADXR"] = ta.ADXR(price.loc[mask].High.values, price.loc[mask].Low.values,
                                          price.loc[mask].Close.values, timeperiod=14)
        price.loc[mask, "APO"] = ta.APO(price.loc[mask].Close.values, fastperiod=12, slowperiod=26, matype=0)
        price.loc[mask, "AROONOSC"] = ta.AROONOSC(price.loc[mask].High.values, price.loc[mask].Low.values, timeperiod=14)
        price.loc[mask, "CCI"] = ta.CCI(price.loc[mask].High.values, price.loc[mask].Low.values,
                                        price.loc[mask].Close.values, timeperiod=14)
        price.loc[mask, "MFI"] = ta.MFI(price.loc[mask].High.values, price.loc[mask].Low.values,
                                        price.loc[mask].Close.values,
                                        price.loc[mask].Volume.values.astype(float), timeperiod=14)
        price.loc[mask, "MACD"], price.loc[mask, "MACD_signal"], price.loc[mask, "MACD_hist"] = \
            ta.MACD(price.loc[mask].Close.values, fastperiod=12, slowperiod=26, signalperiod=9)
        price.loc[mask, "ROCP"] = ta.ROCP(price.loc[mask].Close.values, timeperiod=10)
        price.loc[mask, "RSI"] = ta.RSI(price.loc[mask].Close.values, timeperiod=14)
        # moving averages computed per ticker, not over the whole frame
        price.loc[mask, "MA_fast"] = price.loc[mask].Close.rolling(10).mean()
        price.loc[mask, "MA_slow"] = price.loc[mask].Close.rolling(30).mean()
        clear_output()
        print "\nDone:", i

Create Your Own Abstract Data Reader Object

Based on the available stock data sources we have, we should create our own data reader as an abstract data structure: we hide the data-fetching process, so the object only takes in commands and gives out a standardized data table.

Here are all the dependencies we need:

from enum import Enum
from datetime import datetime, timedelta
import pandas as pd
import time
from IPython.display import clear_output
import tqdm
import requests as re
import json

One thing to mention is that json is used to parse the web-request data before wrapping it into a DataFrame, and tqdm is a package for progress-bar display.

Our goal is to build a class object to get data, so let’s build the constructor first. Since getting data from sources involves many arguments, one of the best practices for me is to initialize those parameters in the constructor.

     def __init__(self, tic_list, output="table", **kwargs):
        self.arg_list = {"freq": 'minutes', "start_date": datetime.now()-timedelta(days=256),
                         "end_date": datetime.now(), "day_range": 256, "file_name": ""}

        self.tic_list = tic_list
        self.output = output
        self.error = []  # collect tickers that failed after all retries

        for key, arg in kwargs.iteritems():

            if key in ["freq", "start_date", "end_date"]:
                self.arg_list[key] = arg

            if key in ["day_range"]:
                self.arg_list[key] = arg
                self.arg_list["start_date"] = datetime.now()-timedelta(days=arg)

In the constructor, I take in a list of tickers and a couple of data-format settings: frequency, start date, and end date. I also set a variable to control the output type, which gives me the flexibility to choose among data storage options. Later, I will talk about using a MongoDB database to store stock data.
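The pattern above (a defaults dict overridden by recognized keywords) can be sketched in isolation. This is a simplified, hypothetical version of the constructor; the class and attribute names are illustrative only:

```python
from datetime import datetime, timedelta

class FetcherConfig(object):
    """Hypothetical, stripped-down version of the get_stock_data constructor."""
    def __init__(self, tic_list, output="table", **kwargs):
        # defaults, overridden only when the caller passes a known keyword
        self.arg_list = {"freq": "minutes",
                         "start_date": datetime.now() - timedelta(days=256),
                         "end_date": datetime.now()}
        self.tic_list = tic_list
        self.output = output
        for key, arg in kwargs.items():
            if key in ("freq", "start_date", "end_date"):
                self.arg_list[key] = arg
            if key == "day_range":
                # a day range implies a start date relative to now
                self.arg_list["start_date"] = datetime.now() - timedelta(days=arg)

cfg = FetcherConfig(["AAPL"], freq="daily", day_range=30)
print(cfg.arg_list["freq"])          # daily
```

Unknown keywords are silently ignored, so callers can pass whatever they like; a stricter design would raise on unrecognized keys.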

Next comes the most important part: the data query functions.

The first one is the historical data query function, get_ondemand_data()

def get_ondemand_data(self, interval = 1):
         
         self.result = pd.DataFrame()
         
         for i in tqdm.tqdm(range(len(self.tic_list))):
             trial = 0
             i = self.tic_list[i]
             while trial <3:
                 try:
                     api_key = '95b5894daf3abced33fe48e7f265315e'
                     start_date=self.arg_list["start_date"].strftime("%Y%m%d%H%M%S")
                     end_date=self.arg_list["end_date"].strftime("%Y%m%d%H%M%S")
                     # This is the required format for datetimes to access the API

                     api_url = 'http://marketdata.websol.barchart.com/getHistory.csv?' + \
                                             'key={}&symbol={}&type={}&startDate={}&endDate={}&interval={}'\
                                              .format(api_key, i, self.arg_list["freq"], start_date,end_date,interval)

                     temp = pd.read_csv(api_url, parse_dates=['timestamp'])
                     temp.set_index('timestamp', inplace=True)



                     #index= pd.MultiIndex.from_product([[i],temp.index])
                     #temp=pd.DataFrame(data=temp.values,index=index,columns=temp.columns)

                     self.result = self.result.append(temp)
                     clear_output()
                     print "Finished", i
                     
                     #time.sleep(5)
                     trial=3

                 except Exception as e:
                     print e
                     print "error occurred in getting data for", i
                     trial +=1
                     time.sleep(10)
                     if trial == 3:
                         self.error.append([i,'get_ondemand'])
         return self.data_output()

I won’t go deep on this, but please note that I added an outer loop to retry the data request in case a connection error occurs.
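The retry pattern can be isolated into a small sketch. Here with_retries and flaky are hypothetical stand-ins for the API call; the real method inlines this logic with trial counters:

```python
import time

def with_retries(fetch, max_trials=3, wait=0.0):
    """Call fetch() up to max_trials times; return its result, or None if all fail."""
    for trial in range(max_trials):
        try:
            return fetch()
        except Exception as e:
            print(e)
            time.sleep(wait)  # back off before retrying
    return None

calls = {"n": 0}
def flaky():
    # fails twice, then succeeds, simulating a transient connection error
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("connection error")
    return "data"

print(with_retries(flaky))  # succeeds on the third trial
```

After the last failed trial the original code also records the ticker in self.error, so failures can be inspected after the batch finishes.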

The second one is the get stock quote function, get_quote()

def get_quote(self):

    self.result = pd.DataFrame()

    for i in tqdm.tqdm(range(len(self.tic_list))):
        i = self.tic_list[i]

        # the request and parsing must happen inside the loop, once per ticker
        profile = "https://financialmodelingprep.com/api/company/price/{}".format(i)

        temp = re.get(profile, verify=False).text

        # strip the markup around the JSON payload before parsing
        temp = temp.replace("\n", "").replace("<pre>", "").replace("</pre>", "")

        temp = json.loads(temp)

        temp = pd.DataFrame(temp).transpose()

        self.result = self.result.append(temp)

    return self.data_output()

Lastly, we need to create a function to standardize the output.

def data_output(self):
  
   self.result = self.result.reset_index()
   self.result["Close"] = self.result["close"]
   self.result = self.result.rename(columns={'symbol':'Ticker','timestamp':"TimeStamp","high":"High","low":"Low","open":"Open","volume":"Volume"})
   self.result["Return"]=( self.result.Close.diff(1)/self.result.Close)
   
   if self.output == "table":
       
       return self.result
    
   if self.output == "file":
       self.result.to_csv(self.arg_list["file_name"])
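A tiny sketch of what data_output does to the raw columns, on fabricated rows. Note that Return here follows the author's definition above: the price difference divided by the current close, not the previous one:

```python
import pandas as pd

raw = pd.DataFrame({"symbol": ["AAPL", "AAPL"],
                    "close":  [100.0, 102.0],
                    "volume": [10, 20]})

# rename to the standardized column names, keeping the lowercase close too
out = raw.rename(columns={"symbol": "Ticker", "volume": "Volume"})
out["Close"] = out["close"]
# the author's Return definition: diff divided by the *current* close
out["Return"] = out.Close.diff(1) / out.Close

print(out.Return.tolist())  # first row is NaN, no previous close to diff against
```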

Put them all together. You can save them into a .py file so that next time you want to use them, you can just import the file by name.












In [34]:
from enum import Enum
from datetime import datetime, timedelta
import pandas as pd
import time
from IPython.display import clear_output
import tqdm
import requests as re
import json
In [ ]:
class get_stock_data():
    
     def __init__(self,tic_list, output="table", **kwargs):
        self.arg_list = {"freq": 'minutes',"start_date": datetime.now()-timedelta(days =256),\
                    "end_date":datetime.now(), "day_range": 256, "file_name":""}
        
        self.tic_list = tic_list
        self.output = output
        
        
        for key , arg in kwargs.iteritems():
            
            if key in ["freq","start_date","end_date"]:
                self.arg_list[key]=arg
            
            if key in ["day_range"]:
                self.arg_list[key]=arg
                self.arg_list["start_date"] = datetime.now()-timedelta(days =arg)
    
        self.error = []
    
     def data_output(self):
       
        self.result = self.result.reset_index()
        self.result["Close"] = self.result["close"]
        self.result = self.result.rename(columns={'symbol':'Ticker','timestamp':"TimeStamp","high":"High","low":"Low","open":"Open","volume":"Volume"})
        self.result["Return"]=( self.result.Close.diff(1)/self.result.Close)
        
        if self.output == "table":
            
            return self.result
    
        if self.output == "file":
            self.result.to_csv(self.arg_list["file_name"])
     
        

    
     def get_ondemand_data(self, interval = 1):
            
            self.result = pd.DataFrame()
            
            for i in tqdm.tqdm(range(len(self.tic_list))):
                trial = 0
                i = self.tic_list[i].upper()
                while trial <3:
                    try:
                        api_key = '95b5894daf3abced33fe48e7f265315e'
                        start_date=self.arg_list["start_date"].strftime("%Y%m%d%H%M%S")
                        end_date=self.arg_list["end_date"].strftime("%Y%m%d%H%M%S")
                        # This is the required format for datetimes to access the API

                        api_url = 'http://marketdata.websol.barchart.com/getHistory.csv?' + \
                                                'key={}&symbol={}&type={}&startDate={}&endDate={}&interval={}'\
                                                 .format(api_key, i, self.arg_list["freq"], start_date,end_date,interval)

                        temp = pd.read_csv(api_url, parse_dates=['timestamp'])
                        temp.set_index('timestamp', inplace=True)



                        #index= pd.MultiIndex.from_product([[i],temp.index])
                        #temp=pd.DataFrame(data=temp.values,index=index,columns=temp.columns)

                        self.result = self.result.append(temp)
                        clear_output()
                        print "Finished", i
                        
                        #time.sleep(5)
                        trial=3

                    except Exception as e:
                        print e
                        print "error occurred in getting data for", i
                        trial +=1
                        time.sleep(10)
                        if trial == 3:
                            self.error.append([i,'get_ondemand'])
            return self.data_output()
           
            
            
     def get_quote(self):
        
        self.result = pd.DataFrame()
        
        for i in tqdm.tqdm(range(len(self.tic_list))):
            i = self.tic_list[i].upper()

            # the request and parsing must happen inside the loop, once per ticker
            profile = "https://financialmodelingprep.com/api/company/price/{}".format(i)

            temp = re.get(profile, verify=False).text

            # strip the markup around the JSON payload before parsing
            temp = temp.replace("\n", "").replace("<pre>", "").replace("</pre>", "")

            temp = json.loads(temp)

            temp = pd.DataFrame(temp).transpose()

            self.result = self.result.append(temp)

        return self.data_output()
            
            
         
        
    
In [38]:
my = get_stock_data(["AAPL"],day_range=2)
my.get_ondemand_data().head()
Finished AAPL

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.24s/it]

Out[38]:
TimeStamp Ticker tradingDay Open High Low close Volume Close Return
0 2018-09-21 13:30:00 AAPL 2018-09-21 220.78 221.0988 220.6601 221.0599 8036258 221.0599 NaN
1 2018-09-21 13:31:00 AAPL 2018-09-21 221.05 221.1700 220.1900 220.4000 153898 220.4000 -0.002994
2 2018-09-21 13:32:00 AAPL 2018-09-21 220.42 220.4600 220.1200 220.2200 160661 220.2200 -0.000817
3 2018-09-21 13:33:00 AAPL 2018-09-21 220.22 220.5650 219.9509 220.4600 257272 220.4600 0.001089
4 2018-09-21 13:34:00 AAPL 2018-09-21 220.47 220.7000 220.2906 220.6101 129467 220.6101 0.000680