kenli | Ken Li, FRM

Building an Interactive Data Exploration App with R Shiny

November 17, 2023

Introduction

In this tutorial, we will walk through the creation of an interactive data exploration application using R Shiny. This app allows users to filter data, view various charts, and download them for further analysis.

Prerequisites

Basic understanding of R programming
R and RStudio installed
Shiny, ggplot2, and DT packages installed

App Overview

Our R Shiny app includes:

A filterable table
Interactive charts including bar plots, scatter plots, and line plots
Data download functionality

Getting Started

First, ensure you have the required libraries:

library(shiny)
library(DT)
library(ggplot2)

Data Preparation

Load and preprocess your data. In our case, we are reading from a CSV file and creating bins for age and income:

dataset = read.csv("dataset.csv")
# Create bins for age and income
dataset$AGE_Bin = cut(dataset$AGE,5,include.lowest = TRUE)
dataset$INCOME_Bin = cut(dataset$INCOME,5,include.lowest = TRUE,dig.lab = 6)
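
To make the binning step concrete, here is a minimal Python sketch of equal-width binning, the idea behind cut(x, 5). It is an illustration only: the function name equal_width_bins and the sample ages are hypothetical, and R's cut() handles interval edges differently (right-closed intervals, slightly widened outer limits), which this sketch does not reproduce.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin
        return [0 for _ in values]
    # The maximum value is placed in the last bin instead of opening a new one
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 33, 47, 52, 68]  # hypothetical sample, not from dataset.csv
print(equal_width_bins(ages, 5))
```

Each returned index plays the role of one level of AGE_Bin or INCOME_Bin in the app.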

The code consists of two parts, the UI and the server. I will lay out the complete code of each part here, and later in the article I will delve into the very intuitive UI design in Shiny.

Building the UI

The user interface (UI) is designed with fluidPage for a responsive layout.

ui <-   fluidPage(
    
    h1("Rshiny Homework"),
    h2("Demographic Exploration"),
    h3("Filterable Table"),
    DT::dataTableOutput("table"),
    br(),
    h3("Charts"),
    selectInput(
        "option",
        "Demography",
        c("AGE_Bin","INCOME_Bin","GENDER"),
        selected = NULL,
        multiple = FALSE,
        selectize = TRUE,
        width = NULL,
        size = NULL
    ),
    
    actionButton("gobutton", "View Chart", class = "btn-success"),
    plotOutput("disPlot"),
    downloadButton(outputId = "disPlot_download", label = "Download Chart",class = "btn-success"),
    
    br(),
    hr(),
    br(),
    h3("Relationship Between Variables"),
    
    tabsetPanel(
        tabPanel("Scatter", 
                 plotOutput("Scatter", brush="selected_range"),
                 br(),
                 downloadButton(outputId = "scatter_download", label = "Download Chart",class = "btn-success"),
                 br(),
                 br(),
                 DT::dataTableOutput("brushed_table")
        ),
        tabPanel("Distribution", 
                 plotOutput("displot2"),
                 downloadButton(outputId = "displot2_download", label = "Download Chart",class = "btn-success"),
                 br(),
                 plotOutput("displot3"),
                 downloadButton(outputId = "displot3_download", label = "Download Chart",class = "btn-success")
                 
        )
    ),
    
    br(),
    hr(),
    br(),
    h3("Line Plot"),
    plotOutput("lineplot"),
    downloadButton(outputId = "lineplot_download", label = "Download Chart",class = "btn-success"),
    br(),
    plotOutput("lineplot2"),
    downloadButton(outputId = "lineplot2_download", label = "Download Chart",class = "btn-success")
)

Server Logic

The server function contains the logic for rendering plots and tables based on user input. All backend data handling and plot construction go here.

server <- function(input,output, session) {
    
    library(ggplot2)
    library(shiny)
    library(DT)
    # library(stringr)
    
    #setwd("C:/Users/kli4/Downloads/Shiny_HW")
    
    dataset = read.csv("dataset.csv")
    dataset$AGE_Bin = cut(dataset$AGE,5,include.lowest = TRUE)
    dataset$INCOME_Bin = cut(dataset$INCOME,5,include.lowest = TRUE,dig.lab = 6)
    # dataset$INCOME_Bin <- lapply(strsplit(gsub("]|[[(]", "", levels(dataset$INCOME_Bin)), ","),
    #           prettyNum, big.mark=".", decimal.mark=",", input.d.mark=".", preserve.width="individual")
    
    
    plot_var <- eventReactive(input$gobutton,{
        
        selection <- input$option
        
        data_agg <-aggregate(x=dataset$Customer, by=list(SELECTION=dataset[,c(selection)],TREATMENT = dataset[,"TREATMENT"]),length)
        names(data_agg) = c("SELECTION","TREATMENT", "Customer")
        
        return(data_agg)
        
    })
    
    
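    # Build the bar chart in a reactive so the plot and its download share one object
    dis_plot <- reactive({
        ggplot(plot_var(), aes(x=SELECTION,y=Customer,fill=TREATMENT)) + geom_bar(position="stack",stat="identity")
    })
    
    output$disPlot <- renderPlot({
        dis_plot()
    })
    
    # Register the download handler once, outside renderPlot
    output$disPlot_download <- downloadHandler(
        filename = function() { paste(input$option, '.jpg', sep='') },
        content = function(file){
            ggsave(file,plot=dis_plot())
        })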
    

    output$table <- DT::renderDataTable(datatable(dataset))
 
    scatter_plot <- ggplot(dataset, aes(x=AGE,y=INCOME)) + geom_point()
    
    scatter_plot = scatter_plot + facet_grid(GENDER ~ TREATMENT)
    
    output$Scatter <- renderPlot({
        scatter_plot
    })
    
    scatter_brushed <- reactive({
        
        my_brush <- input$selected_range
        sel_range <- brushedPoints(dataset, my_brush)
        return(sel_range)
        
    })
    output$brushed_table <- DT::renderDataTable(DT::datatable(scatter_brushed()))
    
    
    
    displot2 <- ggplot(dataset, aes(online.Activity.A)) + geom_histogram(aes(fill=AGE_Bin), bins = 5)
    
    displot2 = displot2 + facet_grid(GENDER ~ TREATMENT)
    
    displot3 <- ggplot(dataset, aes(online.ACTIVITY.B)) + geom_histogram(aes(fill=AGE_Bin), bins = 5)
    
    displot3 = displot3 + facet_grid(GENDER ~ TREATMENT)
    
    output$displot2 <- renderPlot({
        displot2
    })
    
    output$displot3 <- renderPlot({
        displot3
    })
    # 
    # scatter_brushed2 <- reactive({
    #   
    #   my_brush <- input$selected_range2
    #   sel_range <- brushedPoints(dataset, my_brush)
    #   return(sel_range)
    #   
    # })
    # output$brushed_table2 <- DT::renderDataTable(DT::datatable(scatter_brushed2()))
    
    data_agg2 <-aggregate(list(Activity_A=dataset$online.Activity.A), by=list(DAY=dataset$DAY,TREATMENT=dataset$TREATMENT,GENDER=dataset$GENDER),mean)
    
    lineplot <- ggplot(data_agg2, aes(x=DAY, y=Activity_A, group=c(TREATMENT))) + geom_line(aes(color=TREATMENT)) + geom_point()
    lineplot = lineplot + facet_grid(GENDER ~ TREATMENT)
    
    output$lineplot <- renderPlot({
        lineplot
    })
    
    data_agg2 <-aggregate(list(Activity_B=dataset$online.ACTIVITY.B), by=list(DAY=dataset$DAY,TREATMENT=dataset$TREATMENT, GENDER=dataset$GENDER),mean)
    
    lineplot2 <- ggplot(data_agg2, aes(x=DAY, y=Activity_B, group=c(TREATMENT))) + geom_line(aes(color=TREATMENT)) + geom_point()
    lineplot2 = lineplot2 + facet_grid(GENDER ~ TREATMENT)
    
    output$lineplot2 <- renderPlot({
        lineplot2
    })
    
    #Downloads
    
    output$lineplot2_download <- downloadHandler(
        filename = "Activity_B Line.jpg",
        content = function(file){
            ggsave(file,plot=lineplot2)
        })
    
    output$lineplot_download <- downloadHandler(
        filename = "Activity_A Line.jpg",
        content = function(file){
            ggsave(file,plot=lineplot)
        })
    
    output$displot2_download <- downloadHandler(
        filename = "ActivityA_Dist.jpg",
        content = function(file){
            ggsave(file,plot=displot2)
        })
    output$displot3_download <- downloadHandler(
        filename = "ActivityB_Dist.jpg",
        content = function(file){
            ggsave(file,plot=displot3)
        })
    
    output$scatter_download <- downloadHandler(
        filename = "Age_Income.jpg",
        content = function(file){
            ggsave(file,plot=scatter_plot)
        })
    

}

UI Design in R Shiny

UI design in R Shiny is easy and intuitive. The concept is that each HTML element is written as an R function. Let’s dive into how the UI is designed in our R Shiny app, using the provided code as an example.

Basic Structure

R Shiny UI is structured using functions defining the layout and its elements. The fluidPage() function is often used for its responsive layout capabilities, meaning the app’s interface adjusts nicely to different screen sizes.

ui <- fluidPage(
    # UI components are nested here
)

Organizing Content with Headers and Separators

Headers (h1, h2, h3, etc.) and separators (hr()) are used to organize content and improve readability. In our app, headers indicate different sections:

h1("Rshiny Homework"),
h2("Demographic Exploration"),
h3("Filterable Table"),

Data Display

The DT::dataTableOutput() function is used to render data tables in the UI. This function takes an output ID as an argument, linking it to the server logic that provides the data:

DT::dataTableOutput("table"),

Interactive Inputs

Interactive inputs, such as selectInput, allow users to interact with the app and control what data or plot is displayed. In our app, selectInput is used for choosing the demographic variable to display in a chart:

selectInput(
    "option",
    "Demography",
    c("AGE_Bin", "INCOME_Bin", "GENDER"),
    selected = NULL,
    multiple = FALSE,
    selectize = TRUE,
    width = NULL,
    size = NULL
),

Action Buttons

Action buttons, created with actionButton(), trigger reactive events in the server. Our app uses an action button to generate plots based on user selection:

actionButton("gobutton", "View Chart", class = "btn-success"),

Displaying Plots

To display plots, plotOutput() is used. This function references an output ID from the server side where the plot is rendered:

plotOutput("disPlot"),

Interactive Plots

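I use ggplot2 to build the plots. The scatter plot of AGE against INCOME becomes interactive through the brush argument of plotOutput(): the points the user selects are picked up by brushedPoints() on the server and shown in a table below the chart. The plot itself is created with: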

scatter_plot <- ggplot(dataset, aes(x=AGE,y=INCOME)) + geom_point()

Tabbed Panels

Tabbed panels, created with tabsetPanel(), help in organizing content into separate views within the same space. Each tabPanel holds different content:

tabsetPanel(
    tabPanel("Scatter", ...),
    tabPanel("Distribution", ...)
),

Download Handlers

We provide functionality for users to download plots as JPEG files:

output$scatter_download <- downloadHandler(
    filename = "Age_Income.jpg",
    content = function(file){
        ggsave(file,plot=scatter_plot)
    })

downloadButton(outputId = "scatter_download", label = "Download Chart", class = "btn-success"),

Running the App

Finally, to run the app, use:

shinyApp(ui = ui, server = server)

Airflow Systemd Config File

June 11, 2021

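Pay attention to the PATH environment variable and to the User and Group settings; they need to match the account and the Python environment where Airflow is installed.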

[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service

[Service]
#EnvironmentFile=/etc/sysconfig/airflow
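# systemd does not expand $PATH here, so list the full search path explicitly
Environment="PATH=/home/ken/miniconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"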
User=ken
Group=ken
Type=simple
ExecStart=/home/ken/miniconda3/bin/airflow webserver
Restart=on-failure
RestartSec=5s
PrivateTmp=true

[Install]
WantedBy=multi-user.target

How to Correctly Install CuDF for Python

May 31, 2021

Introduction of CuDF

CuDF is a powerful tool in the era of big data. It uses the GPU computing framework Cuda to speed up data ETL and offers a Pandas-like interface. The tool is developed by the RAPIDS AI team.

You can check out their Git repo here.

I love the tool. It gives me a way to make full use of my expensive graphics card, which most of the time is only used for gaming. Most importantly, for a company like Owler, which has to handle 14 million+ company profiles, even a basic data transformation task might take days. This tool can potentially speed up the process by about 80+ times. GPU computing has become the norm for ML/AI. CuDF makes it useful for the upstream part of the flow as well, the ETL. And ETL is in high demand at almost every company with digital capacity in the world.

The Challenges

It’s nice to have this tool for our day-to-day data work. However, the convenience comes at a cost: the installation of CuDF is quite confusing and hard to follow. It also has some limitations in OS and Python versions. Currently, it only works with Linux and Python 3.7+. And it only provides a conda-forge way to install; otherwise, you need to build from source.

The errors range from failures while solving the environment and dependency conflicts to the inability to find the GPU. I have installed CuDF on several machines, including a personal desktop and AWS servers. Each time, I had to spend hours dealing with multiple kinds of errors and retrying again and again. When it finally worked, I didn’t know which step was the critical one because there were so many variables. Worst of all, when you hit a dependency conflict, you have to wait a very long time, through 4 solving-environment attempts, before the conflicting packages are displayed.

But the good news is that, after the most recent installation, I finally understand the cause of the complication and can summarize an easy-to-follow guide for anyone who wants to enjoy this tool.

In short, the key is, use miniconda or create a new environment in anaconda to install.

Let me walk through the steps.

Installing the Nvidia Cuda framework (Ubuntu)

Installing Cuda is simple when you have the right machine. You can follow the guide here from Nvidia official webpage. If you encounter an installation error, please check if you are selecting the right architecture and meet the hardware/driver requirements.

However, if you have an older version of Cuda installed and wish to upgrade it, the Nvidia guide won’t help you. The correct way is to uninstall the older Cuda version first, before doing anything from the guide. The reason is that, at least on Ubuntu, the installation step changes your apt-get sources; once that happens, you can no longer uninstall the older version, and it may cause conflicts.

To uninstall Cuda on Ubuntu, you can try the following steps.

Remove nvidia-cuda-toolkit

sudo apt-get remove nvidia-cuda-toolkit

You may want to use this command instead to remove all the dependencies as well.

sudo apt-get remove --auto-remove nvidia-cuda-toolkit

Remove Cuda

sudo apt-get remove cuda

If you forgot to remove the older Cuda version before installing the new one, you need to remove Cuda together with all of its dependencies and start the new installation over.

sudo apt-get remove --auto-remove cuda

Install cuda by following the Nvidia official guide. Link above.

Install CuDF

I highly recommend installing CuDF in miniconda. This avoids most package dependency conflicts. If you do have dependency conflicts, you will probably get a conflict error, or you will be waiting forever during the last solving environment step.

Miniconda is a much cleaner version of Anaconda. It has very few packages installed out of the box, so the CuDF installation is far less likely to run into conflicts.

If you already have Anaconda installed and wish to keep it, you can try creating a new conda environment with no default packages for CuDF.

conda create --no-default-packages -n myenv python=3.7
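
Once the environment is activated and the installation has finished, a quick check from Python can confirm whether the package is visible in that environment. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical, and finding the package does not prove that the GPU itself is usable.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be found in the active environment."""
    return importlib.util.find_spec(package_name) is not None


# "cudf" is the import name of CuDF; True is printed only when it is
# installed in the environment that runs this script.
print("cudf available:", is_installed("cudf"))
```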

Building Sales Prediction Models

February 3, 2021

In this work sample, I demonstrate how to build a predictive model for sales data using Python with Sklearn, Statsmodels, and other libraries. The sample uses a linear regression model and a time series model.

Homework - Modeling (2)

In [52]:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import sklearn.preprocessing as pre
import sklearn.metrics as metric
import sklearn.linear_model as lm
import sklearn as skl
import statsmodels.tsa as tsa
import seaborn as sns
from  matplotlib import pyplot as plt
pd.set_option('display.max_rows', 90)

In [103]:

raw_data = pd.read_csv("Test_data.csv")

In [104]:

raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      92 non-null     object 
 1   sales     92 non-null     float64
 2   m_tv      92 non-null     float64
 3   m_rd      92 non-null     float64
 4   m_online  92 non-null     float64
 5   price     92 non-null     float64
 6   promo     92 non-null     int64  
 7   holidays  30 non-null     float64
dtypes: float64(6), int64(1), object(1)
memory usage: 5.9+ KB

In [105]:

raw_data.describe()

Out[105]:

	sales	m_tv	m_rd	m_online	price	promo	holidays
count	92.000000	92.000000	92.000000	92.000000	92.000000	92.000000	30.0
mean	5.184400	0.560531	0.418528	0.209329	-0.001423	10.532609	1.0
std	0.108638	0.064835	0.038475	0.019127	0.060020	4.031336	0.0
min	4.857496	0.366586	0.294887	0.142494	-0.140803	2.000000	1.0
25%	5.116671	0.514974	0.396311	0.198650	-0.040600	8.000000	1.0
50%	5.201595	0.578128	0.426288	0.212973	0.001642	10.000000	1.0
75%	5.265476	0.611251	0.442458	0.224060	0.035743	13.000000	1.0
max	5.385093	0.656982	0.484950	0.236512	0.151495	22.000000	1.0

In [106]:

raw_data.head(5)

Out[106]:

	date	sales	m_tv	m_rd	m_online	price	promo	holidays
0	3/26/2017	5.206100	0.596845	0.471012	0.227368	-0.128877	9	NaN
1	4/2/2017	5.385093	0.618830	0.482700	0.226155	-0.078913	15	NaN
2	4/9/2017	5.230386	0.606462	0.468500	0.210219	-0.016652	9	NaN
3	4/16/2017	5.078445	0.547290	0.452152	0.197514	0.054442	14	1.0
4	4/23/2017	5.216146	0.469708	0.434500	0.200274	0.121796	8	NaN

It looks like Sundays can be marked as holidays as well.

In [107]:

pd.to_datetime(raw_data.date,format="%m/%d/%Y")

raw_data.index = pd.to_datetime(raw_data.date,format="%m/%d/%Y")

raw_data.index.freq = "W"

raw_data = raw_data.drop("date",axis=1)
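
Before assigning a weekly frequency to the index, it can be worth confirming that the dates really are one week apart. The sketch below does this with the standard library for the five dates shown in raw_data.head(5); it is only a sanity check under the assumption that the remaining rows follow the same pattern.

```python
from datetime import datetime, timedelta

# First five dates from raw_data.head(5)
dates = ["3/26/2017", "4/2/2017", "4/9/2017", "4/16/2017", "4/23/2017"]
parsed = [datetime.strptime(d, "%m/%d/%Y") for d in dates]

# weekday() returns 6 for Sunday
all_sundays = all(d.weekday() == 6 for d in parsed)
# Consecutive rows should be exactly one week apart
weekly = all(b - a == timedelta(days=7) for a, b in zip(parsed, parsed[1:]))

print(all_sundays, weekly)
```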

In [108]:

sns.heatmap(raw_data.isnull(),yticklabels=False,cbar=False,cmap="viridis");

In [109]:

raw_data = raw_data.fillna(0)

EDA - Exploratory Data Analysis

Plot the sales series

In [110]:

ax = raw_data.sales.plot(figsize=(20,6))
for x in raw_data.query('holidays==1').index:       
    ax.axvline(x=x, color='k', alpha = 0.3);
ax.autoscale()

Vertical lines mark the holidays

Explore seasonality

In [111]:

season = tsa.seasonal.seasonal_decompose(raw_data.sales,period=7)

In [112]:

ax = season.plot();
ax.set_figheight(8)
ax.set_figwidth(12)

Seasonality explains only a very small portion of the data (about 0.025m of sales). We can assume there is no strong seasonality in the sales series.

In [113]:

# Augmented Dickey-Fuller test to check whether the sales series is stationary

result = tsa.stattools.adfuller(raw_data.sales,autolag='AIC') 

In [114]:

result[1]  

Out[114]:

3.942673135297742e-05

The p-value is less than 0.01, so we reject the unit-root null hypothesis at the 99% confidence level. The sales series has no unit root; it is stationary.

In [115]:

fig, ax = plt.subplots(2,3,figsize = (12,8))
# fig.set_figheight(8)
# fig.set_figwidth(17)

sns.distplot(raw_data.sales,ax=ax[0][0])
sns.distplot(raw_data.m_tv,ax=ax[0][1])
sns.distplot(raw_data.promo,ax=ax[0][2])
sns.distplot(raw_data.m_rd,ax=ax[1][0])
sns.distplot(raw_data.m_online,ax=ax[1][1])
sns.distplot(raw_data.price,ax=ax[1][2])

Out[115]:

<matplotlib.axes._subplots.AxesSubplot at 0x7ffb7bfeeaf0>

The features and the sales series look roughly normal but are skewed to the left

Check for outliers

In [116]:

sns.boxplot(x="variable",y="value",data=pd.melt(raw_data.drop(["promo","holidays"],axis=1)))

Out[116]:

<matplotlib.axes._subplots.AxesSubplot at 0x7ffb7c13ed00>

In [117]:

fig, ax = plt.subplots(2,3,figsize = (12,8))
# fig.set_figheight(8)
# fig.set_figwidth(17)

sns.boxplot(raw_data.sales,ax=ax[0][0]    ,orient="v")
sns.boxplot(raw_data.m_tv,ax=ax[0][1]     ,orient="v" )
sns.boxplot(raw_data.promo,ax=ax[0][2]    ,orient="v" )
sns.boxplot(raw_data.m_rd,ax=ax[1][0]     ,orient="v" )
sns.boxplot(raw_data.m_online,ax=ax[1][1] ,orient="v" )
sns.boxplot(raw_data.price,ax=ax[1][2]    ,orient="v")

Out[117]:

<matplotlib.axes._subplots.AxesSubplot at 0x7ffb7c719cd0>

There are not many outliers. Let's keep them for now.

Train Test Split

In [122]:

## Train Test Split

train = raw_data[:-20]
test = raw_data[-20:]

x_train = train.drop("sales",axis=1)
y_train = train.sales

x_test = test.drop("sales",axis=1)
y_test = test.sales
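
Because the rows are ordered in time, the split keeps that order instead of shuffling: the last 20 weeks are held out. A minimal sketch of the same idea on a plain list, where the helper chronological_split and the stand-in row numbers are hypothetical:

```python
def chronological_split(rows, n_test):
    """Keep the time order: the last n_test rows become the test set."""
    return rows[:-n_test], rows[-n_test:]


weeks = list(range(92))  # stand-in for the 92 weekly rows
train_rows, test_rows = chronological_split(weeks, 20)

print(len(train_rows), len(test_rows))
```

Every training row precedes every test row, so the model is always evaluated on weeks it has not seen.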

Start with a simple model with all features available

In [136]:

model = skl.linear_model.LinearRegression().fit(x_train,y_train)

pd.DataFrame(model.coef_,x_train.columns,columns=['Coefficient'])

prediction = model.predict(x_train)

print ("R Square is %.4f"%metric.r2_score(y_train,prediction))

print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(y_train,prediction)))

R Square is 0.6440
RMSE is 0.0644

In [137]:

prediction_test = model.predict(x_test)

print ("R Square is %.4f"%metric.r2_score(y_test,prediction_test))

print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(y_test,prediction_test)))

R Square is 0.6680
RMSE is 0.0600
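
For reference, the two metrics printed above can be written out by hand. The sketch below uses the standard definitions (R Square as one minus the ratio of residual to total sum of squares, RMSE as the square root of the mean squared error); the sample values are hypothetical and not taken from the data set.

```python
import math


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot, the share of variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [5.2, 5.4, 5.1, 5.3]      # hypothetical values
y_pred = [5.25, 5.35, 5.15, 5.25]  # hypothetical predictions

print("R Square is %.4f" % r2_score(y_true, y_pred))
print("RMSE is %.4f" % rmse(y_true, y_pred))
```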

In [138]:

sns.residplot(y_train,prediction-y_train);

As we can see some small clusters in the residual plot, we can assume that there's information that was not explained by the features.

In [139]:

# Look at whether all features are significant or not
ip = sm.add_constant(x_train)
dp = y_train
sm.OLS(dp,ip).fit().summary()

Out[139]:

OLS Regression Results
Dep. Variable:	sales	R-squared:	0.644
Model:	OLS	Adj. R-squared:	0.611
Method:	Least Squares	F-statistic:	19.60
Date:	Sun, 20 Dec 2020	Prob (F-statistic):	6.54e-13
Time:	23:00:11	Log-Likelihood:	95.323
No. Observations:	72	AIC:	-176.6
Df Residuals:	65	BIC:	-160.7
Df Model:	6
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	4.2199	0.122	34.455	0.000	3.975	4.464
m_tv	0.9603	0.152	6.313	0.000	0.657	1.264
m_rd	0.6732	0.233	2.893	0.005	0.209	1.138
m_online	0.6775	0.502	1.350	0.182	-0.325	1.680
price	0.5054	0.144	3.518	0.001	0.218	0.792
promo	0.0012	0.002	0.593	0.555	-0.003	0.005
holidays	-0.0195	0.019	-1.027	0.308	-0.057	0.018

Omnibus:	0.269	Durbin-Watson:	2.146
Prob(Omnibus):	0.874	Jarque-Bera (JB):	0.359
Skew:	0.137	Prob(JB):	0.836
Kurtosis:	2.789	Cond. No.	750.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
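
Several numbers in the summary can be cross-checked from the others. The sketch below recomputes the adjusted R-squared, AIC, and BIC from the reported R-squared, log-likelihood, and sample size, using the usual textbook formulas; the variable names are my own.

```python
import math

n, k = 72, 6       # observations and regressors (excluding the constant)
r2 = 0.644         # R-squared from the summary
log_lik = 95.323   # Log-Likelihood from the summary

# Adjusted R-squared penalizes the number of regressors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

n_params = k + 1   # regressors plus the constant
aic = 2 * n_params - 2 * log_lik
bic = n_params * math.log(n) - 2 * log_lik

print(round(adj_r2, 3), round(aic, 1), round(bic, 1))
```

The results agree with the reported 0.611, -176.6, and -160.7.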

In [140]:

x_train_v2 = x_train.drop(["m_online","promo","holidays"],axis=1)

model_v2 = skl.linear_model.LinearRegression().fit(x_train_v2,y_train)

prediction_v2 = model_v2.predict(x_train_v2)

print ("In-sample valuation\n")

print ("R Square is %.4f"%metric.r2_score(y_train,prediction_v2))

print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(y_train,prediction_v2)))

In-sample valuation

R Square is 0.6230
RMSE is 0.0663

In [141]:

## Out of sample test V2

x_test_v2 = x_test.drop(["m_online","promo","holidays"],axis=1)

prediction_v2_test = model_v2.predict(x_test_v2)

print ("Out-of-sample valuation")

print ("R Square is %.4f"%metric.r2_score(y_test,prediction_v2_test))

print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(y_test,prediction_v2_test)))

Out-of-sample valuation
R Square is 0.6750
RMSE is 0.0594

Try to remove outliers since we are not doing time series

In [142]:

raw_data_noout = raw_data[raw_data.sales >= 4.9]
raw_data_noout = raw_data_noout[raw_data_noout.m_tv >= 0.4]
raw_data_noout = raw_data_noout[raw_data_noout.m_rd >= 0.32]

In [143]:

## Train Test Split

train = raw_data_noout[:-20]
test =  raw_data_noout [-20:]

x_train_noout = train.drop("sales",axis=1)
y_train_noout = train.sales

# V2 independent variables
x_train_v2_noout = x_train_noout.drop(["m_online","promo","holidays"],axis=1)

In [144]:

model_v2_noout = skl.linear_model.LinearRegression().fit(x_train_v2_noout,y_train_noout)

print (pd.DataFrame(model_v2_noout.coef_,x_train_v2_noout.columns,columns=['Coefficient']))

prediction_v2_noout = model_v2_noout.predict(x_train_v2)

print ("In-sample valuation\n")
print ("R Square is %.4f"%metric.r2_score(y_train,prediction_v2_noout))
print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(y_train,prediction_v2_noout)))

       Coefficient
m_tv      0.854639
m_rd      0.889650
price     0.378369
In-sample valuation

R Square is 0.6152
RMSE is 0.0669

No significant improvement after removing outliers

Standardizing variables

In [145]:

# Standardized

norm = pre.StandardScaler().fit_transform(raw_data.drop(["holidays"],axis=1))

raw_data_norm = pd.DataFrame(norm,columns=raw_data.drop(["holidays"],axis=1).columns).set_index(raw_data.index)

raw_data_norm = raw_data_norm.join(raw_data[["holidays"]])


fig, ax = plt.subplots(2,3,figsize = (12,8))
# fig.set_figheight(8)
# fig.set_figwidth(17)

sns.distplot(raw_data_norm.sales,ax=ax[0][0])
sns.distplot(raw_data_norm.m_tv,ax=ax[0][1])
sns.distplot(raw_data_norm.promo,ax=ax[0][2])
sns.distplot(raw_data_norm.m_rd,ax=ax[1][0])
sns.distplot(raw_data_norm.m_online,ax=ax[1][1])
sns.distplot(raw_data_norm.price,ax=ax[1][2])

Out[145]:

<matplotlib.axes._subplots.AxesSubplot at 0x7ffb78eb0df0>

In [146]:

train = raw_data_norm[:-20]
test =  raw_data_norm [-20:]

x_train_norm = train.drop("sales",axis=1)
y_train_norm = train.sales

# V2 independent variables
x_train_v2_norm = x_train_norm.drop(["m_online","promo","holidays"],axis=1)

x_test_norm = test.drop("sales",axis=1)
x_test_norm = x_test_norm.drop(["m_online","promo","holidays"],axis=1)
y_test_norm = test.sales

In [147]:

# In sample
model_v2_norm = skl.linear_model.LinearRegression().fit(x_train_v2_norm,y_train_norm)

print (pd.DataFrame(model_v2_norm.coef_,x_train_v2_norm.columns,columns=['Coefficient']))

prediction_v2_norm = model_v2_norm.predict(x_train_v2_norm)

print ("In-sample valuation\n")
print ("R Square is %.4f"%metric.r2_score(y_train_norm,prediction_v2_norm))
print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(y_train_norm,prediction_v2_norm)))

       Coefficient
m_tv      0.602596
m_rd      0.288124
price     0.249482
In-sample valuation

R Square is 0.6230
RMSE is 0.6132

In [148]:

# Out of sample

prediction_v2_norm = model_v2_norm.predict(x_test_norm)

print ("out-of-sample valuation\n")
print ("R Square is %.4f"%metric.r2_score(y_test_norm,prediction_v2_norm))
print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(y_test_norm,prediction_v2_norm)))

out-of-sample valuation

R Square is 0.6750
RMSE is 0.5497

In [529]:

plot_data = pd.DataFrame({"Real":y_test_norm,"Prediction":prediction_v2_norm})
plot_data.plot()

Out[529]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fa45ded7e50>

In [366]:

sns.residplot(y_train_norm,prediction_v2_norm-y_train_norm)

Out[366]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fa45ab9b400>

Standardizing the data did not give us any improvement either
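
This is expected for ordinary least squares: rescaling the variables changes the units of the coefficients and of the RMSE, but not the fit itself. That is why R Square stays at 0.6230 and 0.6750 while the RMSE moves from about 0.066 to about 0.61, which is roughly the raw RMSE divided by the standard deviation of sales. A minimal sketch with one feature and hypothetical data:

```python
import math
import statistics


def fit_line(x, y):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def r2_and_rmse(x, y):
    """Fit the line on (x, y) and return in-sample R Square and RMSE."""
    slope, intercept = fit_line(x, y)
    pred = [slope * a + intercept for a in x]
    my = statistics.fmean(y)
    ss_res = sum((b - p) ** 2 for b, p in zip(y, pred))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(y))


def zscore(v):
    """Standardize to zero mean and unit (population) standard deviation."""
    m, s = statistics.fmean(v), statistics.pstdev(v)
    return [(a - m) / s for a in v]


x = [0.47, 0.55, 0.60, 0.61, 0.62]  # hypothetical feature values
y = [5.22, 5.08, 5.21, 5.23, 5.39]  # hypothetical sales values

r2_raw, rmse_raw = r2_and_rmse(x, y)
r2_std, rmse_std = r2_and_rmse(zscore(x), zscore(y))

print("%.4f %.4f" % (r2_raw, r2_std))
print("%.4f %.4f" % (rmse_std, rmse_raw / statistics.pstdev(y)))
```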

ARMA Model

Since we have not gathered any other features at this point, it's not a bad idea to look at the sales time series itself. We already know that the sales series is stationary and that the holiday variable is not significant, so we can fit an ARMA model to the sales time series.

In [15]:

# Train Test Split
train = raw_data[:-20]
test = raw_data[-20:]
train_arma = train.sales
test_arma = test.sales

By comparing different orders, we found ARMA(2,2) is the best

In [155]:

model_arma = tsa.arima_model.ARIMA(train_arma,order=(2,0,2)).fit()

model_arma.summary()

Out[155]:

ARMA Model Results
Dep. Variable:	sales	No. Observations:	72
Model:	ARMA(2, 2)	Log Likelihood	68.623
Method:	css-mle	S.D. of innovations	0.090
Date:	Sun, 20 Dec 2020	AIC	-125.246
Time:	23:11:19	BIC	-111.586
Sample:	03-26-2017	HQIC	-119.808
	- 08-05-2018

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	5.1917	0.012	434.728	0.000	5.168	5.215
ar.L1.sales	1.1234	0.024	47.128	0.000	1.077	1.170
ar.L2.sales	-0.9838	0.018	-53.400	0.000	-1.020	-0.948
ma.L1.sales	-1.0280	0.060	-17.274	0.000	-1.145	-0.911
ma.L2.sales	0.9998	0.081	12.296	0.000	0.840	1.159

Roots
	Real	Imaginary	Modulus	Frequency
AR.1	0.5709	-0.8310j	1.0082	-0.1542
AR.2	0.5709	+0.8310j	1.0082	0.1542
MA.1	0.5141	-0.8579j	1.0001	-0.1641
MA.2	0.5141	+0.8579j	1.0001	0.1641
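
The Roots table can be reproduced from the coefficients. The sketch below solves the AR polynomial for the reported ar.L1 and ar.L2 values; because the inputs are rounded to four decimals, the last digit may differ slightly from the table.

```python
import cmath

phi1, phi2 = 1.1234, -0.9838  # ar.L1.sales and ar.L2.sales from the summary

# Roots of the AR polynomial 1 - phi1*z - phi2*z**2 = 0,
# rewritten as (-phi2)*z**2 + (-phi1)*z + 1 = 0
a, b, c = -phi2, -phi1, 1.0
disc = cmath.sqrt(b * b - 4 * a * c)
roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]

for z in roots:
    print(f"{z.real:.4f} {z.imag:+.4f}j modulus={abs(z):.4f}")
```

A modulus above 1 is what stationarity requires, but values this close to 1 (and MA roots at 1.0001) suggest the fitted model sits near the boundary, so the estimates are best read with some caution.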

In [156]:

predictions_arima = model_arma.predict(start=len(train_arma), 
                                 end=len(train_arma)+len(test_arma)-1, 
                                 dynamic=False, 
                                 typ='levels')

print ("R Square is %.4f"%metric.r2_score(test_arma,predictions_arima))
print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(test_arma,predictions_arima)))

R Square is 0.1870
RMSE is 0.0939

In [19]:

plot_data = pd.DataFrame({"Real":test_arma,"Prediction":predictions_arima})
plot_data.plot()

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x7ffb774b4340>

In [20]:

predictions_arima = model_arma.predict(start=0, 
                                 end=len(train_arma)+len(test_arma)-1, 
                                 dynamic=False, 
                                 typ='levels')

In [21]:

predictions_arima

Out[21]:

2017-03-26    5.191661
2017-04-02    5.195336
2017-04-09    5.244598
2017-04-16    5.191784
2017-04-23    5.120293
                ...   
2018-11-25    5.145701
2018-12-02    5.100121
2018-12-09    5.134042
2018-12-16    5.216988
2018-12-23    5.276798
Freq: W-SUN, Length: 92, dtype: float64

Add a smoothing feature and ARMA prediction to the V2 model

In [35]:

# Work on a copy so the original frame is not modified
raw_data_ = raw_data.copy()

raw_data_["sale_pred"] = predictions_arima

# Copy the slices so the assignments below do not raise SettingWithCopyWarning
train = raw_data_[:-20].copy()
test = raw_data_[-20:].copy()

train.loc[:,"sales_MA"] = train.sales.rolling(2).mean()
test.loc[:,"sales_MA"] = test.sales.rolling(2).mean()


train = train.dropna()
test = test.dropna()

x_train_v3 = train.drop("sales",axis=1).drop(["m_online","promo","holidays"],axis=1)
y_train_v3 = train.sales


x_test_v3 = test.drop("sales",axis=1).drop(["m_online","promo","holidays"],axis=1)
y_test_v3 = test.sales
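
To see what the sales_MA feature contains, here is a minimal standard-library sketch of a trailing rolling mean with a window of 2, mirroring rolling(2).mean(). The helper name rolling_mean is hypothetical; the sample values are the first four sales figures from head(), rounded.

```python
def rolling_mean(values, window):
    """Mean of the current value and the window-1 values before it; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i - window + 1 : i + 1]
            out.append(sum(chunk) / window)
    return out


sales = [5.2061, 5.3851, 5.2304, 5.0784]  # first sales values, rounded
print(rolling_mean(sales, 2))
```

The first entry is missing, which is why dropna() leaves 71 training rows. Note also that each window includes the current week's sales, so this feature is only available once that week's sales are known.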

In [36]:

model_v3 = skl.linear_model.LinearRegression().fit(x_train_v3,y_train_v3)

prediction_v3 = model_v3.predict(x_train_v3)

print ("In-sample valuation\n")

print ("R Square is %.4f"%metric.r2_score(y_train_v3,prediction_v3))

print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(y_train_v3,prediction_v3)))

In-sample valuation

R Square is 0.7654
RMSE is 0.0526

In [40]:

prediction_v3 = model_v3.predict(x_test_v3)

print ("Out-of-sample valuation\n")

print ("R Square is %.4f"%metric.r2_score(y_test_v3,prediction_v3))

print ("RMSE is %.4f"%np.sqrt(metric.mean_squared_error(y_test_v3,prediction_v3)))

Out-of-sample valuation

R Square is 0.7220
RMSE is 0.0558

In [41]:

ip = sm.add_constant(x_train_v3)
dp = y_train_v3
sm.OLS(dp,ip).fit().summary()

Out[41]:

OLS Regression Results
Dep. Variable:	sales	R-squared:	0.765
Model:	OLS	Adj. R-squared:	0.747
Method:	Least Squares	F-statistic:	42.42
Date:	Sun, 20 Dec 2020	Prob (F-statistic):	3.43e-19
Time:	21:25:34	Log-Likelihood:	108.32
No. Observations:	71	AIC:	-204.6
Df Residuals:	65	BIC:	-191.1
Df Model:	5
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	3.5799	1.017	3.519	0.001	1.548	5.611
m_tv	0.6535	0.180	3.629	0.001	0.294	1.013
m_rd	0.6625	0.206	3.213	0.002	0.251	1.074
price	0.0917	0.125	0.732	0.467	-0.159	0.342
sale_pred	-0.5319	0.193	-2.757	0.008	-0.917	-0.147
sales_MA	0.7183	0.122	5.896	0.000	0.475	0.962

Omnibus:	0.682	Durbin-Watson:	2.326
Prob(Omnibus):	0.711	Jarque-Bera (JB):	0.666
Skew:	-0.223	Prob(JB):	0.717
Kurtosis:	2.837	Cond. No.	1.19e+03

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.19e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [157]:

# Keyword arguments are required by recent seaborn releases
sns.residplot(x=y_test_v3, y=prediction_v3 - y_test_v3);

In [158]:

plot_data = pd.DataFrame({"Real":y_test_v3,"Prediction":prediction_v3})
plot_data.plot();


Reef Tank Lesson

August 27, 2020 (updated September 12, 2020)

While removing the fish tank cover significantly helps reduce the water temperature, the water will evaporate much more quickly. To avoid salinity going too high, keep the salinity below 1.025 (the upper limit of the proper range).

A saltwater tank rarely has a low-oxygen condition. If you see fish breathing hard, it’s likely a bacterial spike, especially if it comes with cloudy water.

Removing Fish Tank Cover

May 28, 2020

There’s a reason to remove the fish tank cover: it increases water evaporation, which keeps the water temperature lower during summertime.

Beginner’s Guide to Caring for a Fish Tank

October 12, 2019 (updated March 7, 2020)

(for maintaining an existing tank, not for starting a new one)

	Care	Important Notes
Daily	Observe the fish’s activity and check whether they are active and eager for food. If not, do an emergency check. Feed a small amount of food that can be consumed within 3 minutes (literally just a little: 3 to 5 pieces or 1/5 of a spoon).	
Weekly	Top up the tank with drinking water (not hot, not cold) to the original level.	Water change guideline: water is the key to keeping fish. Test the temperature with your finger and make sure the water feels neither hot nor cold before adding it; this ensures the new water is close to the tank temperature, so the fish won’t be shocked by a temperature change.
Every other week	Change 1/3 of the tank water. Use the old water to water plants (it’s rich in nutrients). Look at the back filter and see if it needs to be cleaned.	Add water slowly to avoid creating a current. You can use the tank lid to help soften the water flow. Use drinking water to clean the filter material.
[Emergency check] Fish become less active or sick	Observe the fins, tail, and body of the fish and see if there are any holes, rot, or other imperfections; if so, use: this. If you find no sign of those, or you see some white dots on the fish’s body, use this. Fish getting sick usually means they are stressed by the water, so do a series of water changes: change half of the water every other day for at least a week, following the water change guideline. You can dose this magic beneficial bacteria every time you change the water.	There are generally two kinds of fish diseases: bacterial and parasitic. Take extra care in selecting the right treatment. Usually, if you observe strange dots or colors on the skin, use a bacterial treatment first. If you see the fish behaving strangely, use a parasitic treatment; in that case, don’t use a bacterial treatment, as it may kill the fish. During the treatment, also raise the temperature to 85 °F.
Adding fish	Only buy fish from a clean tank; if some fish in the tank seem sick, don’t buy from it. If you have this, dose less than 1 ml in the bag before acclimating the fish. To acclimate the fish: pour excess water out of the bag, but make sure the fish have room to move. Float the bag in the tank. While it floats, add a small amount of tank water to the bag every 10 minutes. After you have added more than half a bag of water and at least 1 hour has passed, use a net to move the fish to the tank. Don’t add the bag water to the tank (this prevents diseases).	If you don’t have a quarantine tank, rotating the two treatments can prevent illness when adding fish, but make sure no fish has a parasitic illness, and dose 12 hours apart.

Run Regression in Python with the Statsmodels Package

August 5, 2019

Run Regression

In [9]:

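import pandas as pd  # used below to build the regressor DataFrame
from statsmodels import api as sm
from my_libs import *  # author's private helpers (get_price_data is presumably defined here)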

Regress the VIX index returns on the SPY returns

The returned data needs to be converted into an np.array
The values need to be cast to float

In [51]:

spy = get_price_data(["SPY"],method='day',back_day=20).dropna().Return.values.astype(float)
spy_ = spy*30

All price data of Close is actually Adj Close
Connection Successful
Finished SPY

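Construct a model of the form vix = intercept + b0 * spy + b1 * spy_, where spy_ = spy * 30. Because spy_ is an exact multiple of spy, the two regressors are perfectly collinear, which is why the summary below reports Df Model: 1 and warns about a singular design matrix.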

In [52]:

ip = pd.DataFrame({"spy":spy,"spy_":spy_})
dp = get_price_data(["^VIX"],method='day',back_day=20).dropna().Return.values.astype(float)

All price data of Close is actually Adj Close
Connection Successful
no data for ^VIX
'NoneType' object has no attribute 'index'
switching to realtimeday method
Finished ^VIX

In [53]:

ip = sm.add_constant(ip)

/home/ken/.local/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

In [54]:

sm.OLS(dp,ip).fit().summary()

/home/ken/.local/lib/python2.7/site-packages/scipy/stats/stats.py:1416: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
  "anyway, n=%i" % int(n))

Out[54]:

OLS Regression Results
Dep. Variable:	y	R-squared:	0.737
Model:	OLS	Adj. R-squared:	0.713
Method:	Least Squares	F-statistic:	30.80
Date:	Sun, 04 Aug 2019	Prob (F-statistic):	0.000173
Time:	19:40:53	Log-Likelihood:	25.241
No. Observations:	13	AIC:	-46.48
Df Residuals:	11	BIC:	-45.35
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	0.0033	0.011	0.305	0.766	-0.021	0.027
spy	-0.0109	0.002	-5.550	0.000	-0.015	-0.007
spy_	-0.3256	0.059	-5.550	0.000	-0.455	-0.196

Omnibus:	9.222	Durbin-Watson:	1.071
Prob(Omnibus):	0.010	Jarque-Bera (JB):	4.912
Skew:	-1.262	Prob(JB):	0.0858
Kurtosis:	4.641	Cond. No.	5.85e+17

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.82e-35. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
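
The singular-design warning above follows from spy_ being exactly 30 times spy. The sketch below illustrates this with synthetic returns, not the actual SPY or VIX data; the sample size of 13 and the slope of -9.8 are assumptions chosen only to resemble the output above. It shows that the design matrix has rank 2 rather than 3, and that a minimum-norm least-squares fit splits a single slope across the two collinear columns in a 1:30 ratio, which is roughly the ratio between the spy and spy_ coefficients in the summary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the return series (assumption: NOT real market data)
spy = rng.normal(0.0, 0.01, size=13)
spy_ = spy * 30                      # exact multiple of spy, as in the post
vix = 0.003 - 9.8 * spy + rng.normal(0.0, 0.02, size=13)

# Design matrix with a constant, spy, and the redundant spy_ column
X = np.column_stack([np.ones_like(spy), spy, spy_])
rank = np.linalg.matrix_rank(X)      # 2, not 3: one column is redundant

# Minimum-norm least-squares solution for the rank-deficient design
coef = np.linalg.lstsq(X, vix, rcond=None)[0]
combined_slope = coef[1] + 30 * coef[2]

# Reference fit that uses only the constant and spy
coef_single = np.linalg.lstsq(X[:, :2], vix, rcond=None)[0]

print("rank of design matrix:", rank)
print("b0 + 30 * b1 =", combined_slope)
print("slope from single-regressor fit =", coef_single[1])
```

Only the combination b0 + 30 * b1 is identified by the data, so the individual coefficients and their t-statistics should not be interpreted separately; dropping spy_ gives the same fit with a well-conditioned design.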

Linear Discriminant Analysis (LDA)

June 3, 2019 (updated July 27, 2019)

A scoring model is a family of statistical tools developed from qualitative and quantitative empirical data that determines the appropriate parameters and variables for predicting default. Linear discriminant analysis (LDA) is one of the most popular statistical methods used for developing scoring models. An LDA-based model is a reduced form model due to its dependency on exogenous variable selection, the default composition, and the default definition. A scoring function is a linear function of variables produced by an LDA. The variables are chosen based on their estimated contribution to the likelihood of default and come from an extensive pool of qualitative features and accounting ratios. The contributions (i.e., weights) of each accounting ratio to the overall score are represented by Altman’s Z-score. Although there are many discriminant analysis methods, the one referenced in this topic is the ordinary least squares method.

LDA categorizes firms into two groups: the first represents performing (solvent) firms and the second represents defaulting (insolvent) firms. One of the challenges of this categorization is whether or not it is possible to predict which firms will be solvent and which will be insolvent prior to default. A Z-score is assigned to each firm at some point prior to default on the basis of both financial and nonfinancial information. A Z cut-off point is used to differentiate both groups, although it is imperfect as both solvent and insolvent firms may have similar scores. This may lead to incorrect classifications.

Altman proposed the following LDA model:
[latex]Z = 1.21x_1 +1.40x_2 + 3.30x_3 + 0.6x_4 + 0.999x_5[/latex]

where:

[latex]x_1[/latex] = working capital / total assets

[latex]x_2[/latex] = accrued capital reserves / total assets

[latex]x_3[/latex] = EBIT / total assets

[latex]x_4[/latex] = equity market value / face value of term debt

[latex]x_5[/latex] = sales / total assets

In this model, the higher the Z-score, the more likely it is that a firm will be classified in the group of solvent firms. The Z-score cut-off (also known as the discriminant threshold) was set at Z = 2.673. The model was used not only to plug in current values to determine a Z-score, but also to perform stress tests to show what would happen to each component (and its associated weighting) if a financial factor changed.
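
As a small illustration of how the scoring function and the cut-off are applied, the sketch below computes the Z-score with the weights and threshold quoted above. The firm ratios and the stressed scenario are hypothetical values made up for the example, and the names `altman_z` and `classify` are illustrative.

```python
# Weights for x_1 ... x_5 and the cut-off, as quoted in the text above
ALTMAN_WEIGHTS = (1.21, 1.40, 3.30, 0.6, 0.999)
Z_CUTOFF = 2.673


def altman_z(ratios):
    """Weighted sum of the five ratios (x_1 ... x_5)."""
    if len(ratios) != len(ALTMAN_WEIGHTS):
        raise ValueError("expected exactly five ratios")
    return sum(w * x for w, x in zip(ALTMAN_WEIGHTS, ratios))


def classify(z):
    """Higher scores fall in the solvent group, lower ones in the insolvent group."""
    return "solvent" if z >= Z_CUTOFF else "insolvent"


# Hypothetical firm: (x_1, x_2, x_3, x_4, x_5)
base = (0.25, 0.30, 0.12, 1.50, 1.10)
# Hypothetical stress test: EBIT / total assets and sales / total assets both fall
stressed = (0.25, 0.30, 0.02, 1.50, 0.80)

z_base = altman_z(base)
z_stressed = altman_z(stressed)
print(round(z_base, 4), classify(z_base))
print(round(z_stressed, 4), classify(z_stressed))
```

In this made-up case the base scenario scores above the cut-off and the stressed scenario scores below it, which is the kind of sensitivity a stress test is meant to expose.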

Another example of LDA is the RiskCalc model, which was developed by Moody’s. It incorporates variables that span several areas, such as financial leverage, growth, liquidity, debt coverage, profitability, size, and assets. The model is tailored to individual countries, with the model for a country like Italy driven by the positive impact on credit quality of factors such as higher profitability, higher liquidity, lower financial leverage, strong activity ratios, high growth, and larger company sizes.

With LDA, one of the main goals is to optimize variable coefficients such that Z-scores minimize the inevitable overlapping zone between solvent and insolvent firms. For two groups of borrowers with similar Z-scores, the overlapping zone is a risk area where firms may end up incorrectly classified. Historical versions of LDA would sometimes treat this zone as a gray area, allowing for three Z-score range interpretations to determine who would be granted funding: very safe borrowers, very risky borrowers, and the middle ground of borrowers that merited further investigation. In the current world, LDA incorporates the two additional objectives of measuring default probability and assigning ratings.
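
To make the idea of choosing coefficients that separate the two groups more concrete, here is a minimal sketch of a two-group linear discriminant on synthetic data. It uses Fisher’s formulation (weights proportional to the inverse pooled covariance times the difference in group means) rather than the least squares route mentioned earlier, and every number in it is simulated for illustration, not taken from real firms.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic two-ratio samples (assumption: illustrative only), balanced groups
solvent = rng.normal(loc=[0.30, 0.15], scale=0.08, size=(200, 2))
insolvent = rng.normal(loc=[0.10, 0.02], scale=0.08, size=(200, 2))

mu_s = solvent.mean(axis=0)
mu_i = insolvent.mean(axis=0)

# Pooled within-group covariance matrix
n_s, n_i = len(solvent), len(insolvent)
pooled = ((n_s - 1) * np.cov(solvent, rowvar=False)
          + (n_i - 1) * np.cov(insolvent, rowvar=False)) / (n_s + n_i - 2)

# Discriminant weights: direction that best separates the group means
weights = np.linalg.solve(pooled, mu_s - mu_i)

# Z-scores for each group and a midpoint cut-off (balanced sample, equal costs)
z_s = solvent @ weights
z_i = insolvent @ weights
cutoff = 0.5 * (z_s.mean() + z_i.mean())

# Firms on the wrong side of the cut-off sit in the overlapping zone
misclassified = int((z_s < cutoff).sum() + (z_i >= cutoff).sum())
error_rate = misclassified / (n_s + n_i)
print("weights:", weights)
print("cut-off:", cutoff)
print("share misclassified:", error_rate)
```

The misclassified share is a direct measure of the overlapping zone: no choice of cut-off removes it entirely when the two score distributions overlap.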

The process of fitting empirical data into a statistical model is called calibration. LDA calibration involves quantifying the probability of default by using statistical-based outputs of ratings systems and accounting for differences between the default rates of samples and the overall population. This process implies that more work is still needed, even after the scoring function is estimated and Z-scores are obtained, before the model can be used. In the case of the model being used simply to accept or reject credit applications, calibration simply involves adjusting the Z-score cut-off to account for differences between sample and population default rates. In the case of the model being used to categorize borrowers into different ratings classes (thereby assigning default probabilities to borrowers), calibration will include a cut-off adjustment and a potential rescaling of Z-score default quantifications.

Because of the relative infrequency of actual defaults, a more accurate model can be derived by attempting to create more balanced samples with relatively equal (in size) groups of both performing and defaulting firms. However, the risk of equalizing the sample group sizes is that the model applied to a real population will tend to overpredict defaults. To protect against this risk, the results obtained from the sample must be calibrated. If the model is only used to classify potential borrowers into performing versus defaulting firms, calibration will only involve adjusting the Z cut-off using Bayes’ theorem to equate the frequency of defaulting borrowers per the model to the frequency in the actual population.

Prior probabilities represent the probability of default when there is no collected evidence on the borrower. Prior probabilities [latex]q_{insolv}[/latex] and [latex]q_{solv}[/latex] represent the prior probabilities of insolvency and solvency, respectively. One proposed solution is to adjust the cut-off point by the following relation:

[latex]\ln\left(\frac{q_{solv}}{q_{insolv}}\right)[/latex]

If it is the case that the prior probabilities are equal (which would occur in a balanced sample), there is no adjustment needed to the cut-off point (i.e., relation is equal to 0). If the population is unbalanced, an adjustment is made by adding an amount from the relation just shown to the original cut-off quantity.

For example, assume a sample exists where the cut-off point is 1.00. Over the last 20 years, the average default rate is 3.75% (i.e., [latex]q_{insolv}[/latex] = 3.75%). This implies that [latex]q_{solv}[/latex] is equal to 96.25%, and the relation will dictate that we must add [latex]\ln\left(\frac{96.25\%}{3.75\%}\right)[/latex] or 3.25 to the cut-off point (1.00 + 3.25 = 4.25).
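
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the relation above, taking the default rate as 3.75% (the value consistent with a solvency prior of 96.25%); the function name `prior_adjustment` is illustrative.

```python
import math


def prior_adjustment(q_insolv):
    """Cut-off shift ln(q_solv / q_insolv), with q_solv = 1 - q_insolv."""
    q_solv = 1.0 - q_insolv
    return math.log(q_solv / q_insolv)


adjustment = prior_adjustment(0.0375)      # 3.75% average default rate
adjusted_cutoff = 1.00 + adjustment        # original cut-off of 1.00
balanced = prior_adjustment(0.5)           # balanced sample: no shift

print(round(adjustment, 2))       # 3.25
print(round(adjusted_cutoff, 2))  # 4.25
print(balanced)                   # 0.0
```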

The risk is the potential misclassification of borrowers leading to unfavorable decisions: rejecting a borrower in spite of them being solvent, or accepting a borrower that ends up defaulting. In the case of the first borrower, the cost of the error is an opportunity cost ([latex]COST_{solv/insolv}[/latex]). In the case of the second borrower, the cost is the loss given default ([latex]COST_{insolv/solv}[/latex]). These costs are not equal, so the correct approach may be to adjust the cut-off point to account for these different costs by adjusting the relation equation as follows:

[latex]\ln\left(\frac{q_{solv} \times COST_{solv/insolv}}{q_{insolv} \times COST_{insolv/solv}}\right)[/latex]

Extending the earlier example, imagine the current assessment of loss given default is 50% and the opportunity cost is 20%. The cut-off score will require an adjustment of: [latex]\ln\left(\frac{96.25\% \times 20\%}{3.75\% \times 50\%}\right)[/latex] = 2.33.
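
The cost-weighted adjustment can be verified the same way. The sketch below assumes the figures from the example (priors of 96.25% and 3.75%, an opportunity cost of 20%, and a loss given default of 50%); the function and parameter names are illustrative.

```python
import math


def cost_adjusted_shift(q_insolv, cost_solv_insolv, cost_insolv_solv):
    """Cut-off shift ln((q_solv * COST_solv/insolv) / (q_insolv * COST_insolv/solv))."""
    q_solv = 1.0 - q_insolv
    return math.log((q_solv * cost_solv_insolv) / (q_insolv * cost_insolv_solv))


shift = cost_adjusted_shift(
    q_insolv=0.0375,        # prior probability of insolvency
    cost_solv_insolv=0.20,  # opportunity cost of rejecting a solvent borrower
    cost_insolv_solv=0.50,  # loss given default when accepting an insolvent borrower
)
print(round(shift, 2))  # 2.33
```

Because the loss given default exceeds the opportunity cost, the shift (2.33) is smaller than the prior-only shift of 3.25.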

The cut-off point selection is very sensitive to factors such as overall credit portfolio profile, the market risk environment, market trends, funding costs, past performance/budgets, and customer segment competitive positions.

Note that LDA models typically offer only two decisions: accept or reject. Modern internal rating systems, which are based on the concept of default probability, require more options for decisions.

(Risk Model Discussion) Distinguish between the Structural and the Reduced-form Approaches

June 3, 2019 (updated August 3, 2019)

Distinguish between the structural approaches and the reduced-form approaches to predicting default.

The foundation of a structural approach (e.g., the Merton model) is the financial and economic theoretical assumptions that describe the overall path to default. Under this approach, building a model involves estimating the formal relationships that link the relevant variables of the model. In contrast, reduced form models (e.g., statistical and numerical approaches) arrive at a final solution using the set of variables that is most statistically suitable without factoring in the theoretical or conceptual causal relationships among variables.

A reduced form model will not make any ex ante assumptions about causal drivers for default (unlike structural models); specific firm characteristics are linked to default, using statistics to tie them to default data. As such, the default event itself represents a real-life event. The independent variables in these models are combined based on their estimated contribution to the final result and can change in terms of relevance depending on firm size, firm sector, and economic cycle stage.

A significant model risk in reduced form approaches results from a model’s dependency on the sample used to estimate it. To derive valid results, there must be a strong level of homogeneity between the sample and the population to which the model is applied.

Reduced form models used for credit risk can be classified into statistical-based and numerical-based categories. Statistical-based models use variables and relations that are selected and calibrated by statistical procedures. Numerical-based approaches use algorithms that connect actual defaults with observed variables. Both approaches can aggregate profiles, such as industry, sector, size, location, capitalization, and form of incorporation, into homogeneous top-down segment classifications. A bottom-up approach may also be used, which would classify variables based on case-by-case impacts. While numerical and statistical methods are primarily considered bottom-up approaches, experts-based approaches tend to be the most bottom-up.