11. Working with Data II: Statistics

In this chapter we introduce summary statistics and statistical graphing methods using the pandas and statsmodels libraries.

11.1. Importing Data

We will be working with three data sets: Lecture_Data_Excel_a.csv, Lecture_Data_Excel_b.csv, and Lecture_Data_Excel_c.xlsx.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
#from ggplot import *
from scipy import stats as st
import math as m
import seaborn as sns
import time  # Imports system time module to time your script

plt.close('all')  # close all open figures
# Read in small data from .csv file
# Filepath
filepath = 'Lecture_Data/'
# In windows you can also specify the absolute path to your data file
# filepath = 'C:/Dropbox/Towson/Teaching/3_ComputationalEconomics/Lectures/Lecture_Data/'

# ------------- Load data --------------------
df = pd.read_csv(filepath + 'Lecture_Data_Excel_a.csv', dtype={'Frequency': float})

df = df.drop('Unnamed: 2', axis=1)
df = df.drop('Unnamed: 3', axis=1)

Let us now have a look at the dataframe to see what we have got.

print(df)
                 Area  Frequency
0          Accounting       73.0
1             Finance       52.0
2  General management       36.0
3     Marketing sales       64.0
4               other       28.0

We have data on field of study in the column Area, and we have absolute frequencies that tell us how many students study in each area. We next form relative frequencies by dividing the absolute frequencies by the total number of students. We put the results in a new column called relFrequency.

# Make new column with relative frequency
df['relFrequency'] = df['Frequency']/df['Frequency'].sum()
# Let's have a look at the DataFrame with the new column
print(df)
                 Area  Frequency  relFrequency
0          Accounting       73.0      0.288538
1             Finance       52.0      0.205534
2  General management       36.0      0.142292
3     Marketing sales       64.0      0.252964
4               other       28.0      0.110672

We can now interact with the DataFrame. Let us first simply “grab” the relative frequencies out of the DataFrame into a numpy array. This is now simply a numerical vector that you have already worked with in previous chapters.

xv = df['relFrequency'].values
print('Array xv is:', xv)
Array xv is: [0.28853755 0.2055336  0.14229249 0.25296443 0.11067194]

11.2. Making Simple Graphs from our Data

11.2.1. Bar Charts

We first try to make a bar chart of the relative frequencies.

fig, ax = plt.subplots()

ax.plot(1. + np.arange(len(xv)), xv, '-o')
plt.show()
_images/Slides_Data_2_FalseBarchart_1.png

Hey, that’s not a bar chart. Let’s try again.

fig, ax = plt.subplots()

ax.bar(1. + np.arange(len(xv)), xv, align='center')

# Annotate with text
ax.set_xticks(1. + np.arange(len(xv)))
for i, val in enumerate(xv):
    ax.text(i+1, val/2, str(round(val*100, 2))+'%', va='center', ha='center', color='black')
ax.set_ylabel('%')
ax.set_title('Relative Frequency Barchart')
plt.show()

# We can also save this graph as a pdf in a subfolder called 'Graphs'
# You will have to make this subfolder first though.

#fig.savefig('./Graphs/fig1.pdf')
_images/Slides_Data_2_Barchart_1.png

The commented-out savefig command saves the bar chart as fig1.pdf into a subfolder called Graphs. Don't forget to create this subfolder first, otherwise Python will throw an error when it cannot find the folder.
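If you prefer, you can let Python create the subfolder for you before saving. Here is a minimal sketch using the os module (assuming the figure object fig from above still exists):

import os

os.makedirs('./Graphs', exist_ok=True)   # create the subfolder if it does not exist yet
fig.savefig('./Graphs/fig1.pdf')         # save the current figure as a pdf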

You also see that the height of the bars is simply the y-coordinate of the points in the previous plot.

11.2.2. Pie Charts

We next make a pie chart using the relative frequencies stored in vector xv.

fig, ax = plt.subplots()
ax.pie(xv, labels = np.round(xv*100, 2), shadow=True)
# Annotate with text
ax.set_title('Relative Frequency Pie Chart')
plt.show()
_images/Slides_Data_2_Piechart_1.png

11.2.3. Histograms

Next we use a new data file called Lecture_Data_Excel_b.csv. Let us first import the data and have a quick look.

df = pd.read_csv(filepath + 'Lecture_Data_Excel_b.csv')
print(df.head())
   Height  Response  AverageMathSAT  Age  Female  Education  Race
0     1.1         3             370   20       0          2  Hisp
1     1.2         4             393   20       0          4   Wht
2     1.3         4             413   20       0          4   Blk
3     1.4         5             430   20       0          4   Wht
4     1.5         3             440   20       0          2   Mex

Let us clean this up a bit and drop a column that we do not plan to use, the Response column.

df = pd.read_csv(filepath + 'Lecture_Data_Excel_b.csv')
df = df.drop('Response' , axis=1)
print(df.head())
   Height  AverageMathSAT  Age  Female  Education  Race
0     1.1             370   20       0          2  Hisp
1     1.2             393   20       0          4   Wht
2     1.3             413   20       0          4   Blk
3     1.4             430   20       0          4   Wht
4     1.5             440   20       0          2   Mex

The remaining DataFrame has observations on height, age, and other variables. We first make a histogram of the continuous variable Height. The actual command for the histogram is hist. It returns three objects, which we call prob, bins, and patches, and which we use below to add information to the histogram.

heightv = df['Height'].values

# Initialize
N = len(heightv) # number of obs.
B = 8           # number of bins in histogram

fig, ax = plt.subplots()
plt.subplots_adjust(wspace = 0.8, hspace = 0.8)

prob, bins, patches = ax.hist(heightv, bins=B, align='mid' )
plt.show()
_images/Slides_Data_2_HistogramA_1.png

We next add some text to the histogram so that we can convey even more information to the reader/user of our research.

heightv = df['Height'].values

# Initialize
N = len(heightv) # number of obs.
B = 8           # number of bins in histogram

fig, ax = plt.subplots()
plt.subplots_adjust(wspace = 0.8, hspace = 0.8)

prob, bins, patches = ax.hist(heightv, bins=B, align='mid' )
# Annotate with text
for i, p in enumerate(prob):
    percent = int(float(p)/N*100)
    # only annotate non-zero values
    if percent:
        ax.text(bins[i]+0.3, p/2.0, str(percent)+'%', rotation=75, va='bottom', ha='center')

ax.set_xlabel('Height groups')
ax.set_ylabel('Number of obs')
ax.set_title('Histogram of Height')
ax.set_xlim(min(heightv),max(heightv))

plt.show()
_images/Slides_Data_2_HistogramB_1.png

We can also make histograms with pandas' built-in plotting methods. However, this method does not hand back the bin counts and bin edges directly, so you cannot add text at the appropriate positions as conveniently as before (though see the sketch after the next code block for a workaround).

heightv = df['Height'].values

# Initialize
N = len(heightv) # number of obs.
B = 8           # number of bins in histogram

fig, ax = plt.subplots()
plt.subplots_adjust(wspace = 0.8, hspace = 0.8)

df['Height'].hist(bins = B, ax = ax, color = 'k', alpha = 0.3)
#
plt.show()
_images/Slides_Data_2_Histogram3_1.png
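If you do want to annotate the bars produced by the pandas method after all, you can recover their positions and heights from the matplotlib Axes object. A minimal sketch, assuming fig and ax from the block above:

# Each bar is a Rectangle patch on the axes; read off its position and height
for rect in ax.patches:
    height = rect.get_height()
    if height > 0:
        ax.text(rect.get_x() + rect.get_width()/2, height/2,
                str(int(height)), ha='center', va='center')
plt.show()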

Now we add a couple more variables and make a series of histograms all in the same figure.

heightv = df['Height'].values

# Initialize
N = len(heightv) # number of obs.
B = 8           # number of bins in histogram

fig, ax = plt.subplots(2,2)
plt.subplots_adjust(wspace = 0.8, hspace = 0.8)

prob, bins, patches = ax[0,0].hist(heightv, bins=B, align='mid' )
# Annotate with text
for i, p in enumerate(prob):
    percent = int(float(p)/N*100)
    # only annotate non-zero values
    if percent:
        ax[0,0].text(bins[i]+0.3, p/2.0, str(percent)+'%', rotation=75, va='bottom', ha='center')

ax[0,0].set_xlabel('Height groups')
ax[0,0].set_ylabel('Number of obs')
ax[0,0].set_title('Histogram of Height')
ax[0,0].set_xlim(min(heightv),max(heightv))

# Using Panda's built in histogram method
df['Height'].hist(bins = B, ax = ax[0,1], color = 'k', alpha = 0.3)
ax[0,1].set_title('Histogram of Height')
ax[0,1].set_xlabel('Height groups')
ax[0,1].set_ylabel('Number of obs')
#
df['AverageMathSAT'].hist(bins = B, ax = ax[1,0], color = 'g', alpha = 0.3)
ax[1,0].set_title('Histogram of Math SAT Scores')
#
#
df['Age'].hist(bins = B, ax = ax[1,1], color = 'r', alpha = 0.3)
ax[1,1].set_title('Histogram of Age')
#
plt.show()
_images/Slides_Data_2_Histogram4_1.png

Note

Another way to graph a histogram is by using the powerful ggplot package, which is a translation of R's ggplot2 package into Python. It follows a completely different graphing philosophy based on the grammar of graphics by:

Hadley Wickham. A layered grammar of graphics. Journal of Computational and Graphical Statistics, vol. 19, no. 1, pp. 3–28, 2010.

After installing the ggplot library, you would plot the data as follows:

print( ggplot(aes(x='Height'), data = df) + geom_histogram() +
ggtitle("Histogram of Height using ggplot") + labs("Height", "Freq"))

Warning

The Python ggplot package does not appear to be compatible with newer versions of Python (3.6 and later). As of this writing, you would need to install an older version of Python to use this library.

We can also use the seaborn library to quickly make fancy-looking histograms, as in Fig. 11.1 below.

import seaborn as sns
sns.histplot(df['AverageMathSAT'].dropna(), bins=10, color='g', kde=True)  # kde=True adds the density curve
plt.show()
_images/Slides_Data_2_Histogram2_1.png

Fig. 11.1 Histogram with Distribution Added

11.2.4. Boxplots

A boxplot of height is made as follows:

fig, ax = plt.subplots()
ax.boxplot(heightv)
# Annotate with text
ax.set_title('Boxplot of Height')
plt.show()
_images/Slides_Data_2_Boxplot_1.png

11.3. Summary Statistics

We next go through some basic summary statistics measures.

11.3.1. Measures of Central Tendency

A quick way to produce summary statistics of our data set is to use pandas' describe command.

print(df.describe())
          Height  AverageMathSAT        Age     Female  Education
count  60.000000       60.000000  60.000000  60.000000  60.000000
mean    6.816667      493.666667  20.933333   0.333333   3.016667
std     3.571743       34.018274   1.176992   0.475383   1.096863
min     1.100000      370.000000  20.000000   0.000000   1.000000
25%     3.950000      478.500000  20.000000   0.000000   2.000000
50%     6.850000      507.000000  20.000000   0.000000   3.000000
75%     9.650000      509.250000  22.000000   1.000000   4.000000
max    12.700000      542.000000  23.000000   1.000000   5.000000

Or we can produce the same statistics by hand. We first calculate the mean, median, and mode. Note that mode is part of the stats package, which was imported as st using from scipy import stats as st, so we have to prefix the command with st: st.mode(heightv). The other commands are part of the numpy package and therefore carry the prefix np.

# Number of observations (sample size)
N = len(heightv)

# Mean - Median - Mode
print("Mean(height)=  {:.3f}".format(np.mean(heightv)))
print("Median(height)=  {:.3f}".format(np.median(heightv)))

# Mode (value with highest frequency)
print("Mode(height)=  {}".format(st.mode(heightv)))
Mean(height)=  6.817
Median(height)=  6.850
Mode(height)=  ModeResult(mode=array([1.1]), count=array([1]))
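For a continuous variable such as height the mode is not very informative because every value occurs only once. For a categorical variable such as Race, pandas' own methods are usually more useful; a short sketch:

# Most frequent category and the full frequency table of a categorical variable
print(df['Race'].mode())          # value(s) with the highest frequency
print(df['Race'].value_counts())  # absolute frequency of each category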

We can also just summarize a variable using the st.describe command from the stats package.

# Summary stats
print("Summary Stats")
print("-------------")
print(st.describe(heightv))
Summary Stats
-------------
DescribeResult(nobs=60, minmax=(1.1, 12.7), mean=6.8166666666666655, variance=12.757344632768362, skewness=0.019099619781273072, kurtosis=-1.1611486008105545)

11.3.2. Measures of Dispersion

We now calculate the range, variance and standard deviations. Remember that for variance and standard deviation there is a distinction between population and sample.

# Range: largest minus smallest element
print("Range = {:.3f}".format(max(heightv)-min(heightv)))

# population variance
print("Population variance = {:.3f}".\
  format(np.sum((heightv-np.mean(heightv))**2)/N))

# sample variance
print("Sample variance     = {:.3f}".\
  format(np.sum((heightv-np.mean(heightv))**2)/(N-1)))

print("Pop.   standard dev = {:.3f}".\
  format(np.sqrt(np.sum((heightv-np.mean(heightv))**2)/N)))

print("Sample standard dev = {:.3f}". \
  format(np.sqrt(np.sum((heightv-np.mean(heightv))**2)/(N-1))))

#Or simply
print("Pop.   standard dev = {:.3f}".format(np.std(heightv)))
Range = 11.600
Population variance = 12.545
Sample variance     = 12.757
Pop.   standard dev = 3.542
Sample standard dev = 3.572
Pop.   standard dev = 3.542
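Numpy can also compute the sample versions directly through the ddof (delta degrees of freedom) argument, so you do not have to type out the formulas; a minimal sketch:

# ddof=0 (the default) gives the population versions, ddof=1 the sample versions
print("Sample variance     = {:.3f}".format(np.var(heightv, ddof=1)))
print("Sample standard dev = {:.3f}".format(np.std(heightv, ddof=1)))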

11.3.3. Measures of Relative Standing

Percentiles are calculated as follows:

print("1 quartile (25th percentile) = {:.3f}".\
  format(st.scoreatpercentile(heightv, 25)))
print("2 quartile (50th percentile) = {:.3f}".\
  format(st.scoreatpercentile(heightv, 50)))
print("3 quartile (75th percentile) = {:.3f}".\
  format(st.scoreatpercentile(heightv, 75)))

# Interquartile range: Q3-Q1 or P_75-P_25
print("IQR = P75 - P25              = {:.3f}".\
  format(st.scoreatpercentile(heightv, 75)\
  -st.scoreatpercentile(heightv, 25)))
1 quartile (25th percentile) = 3.950
2 quartile (50th percentile) = 6.850
3 quartile (75th percentile) = 9.650
IQR = P75 - P25              = 5.700
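The same quantities can also be computed with numpy's percentile function; a minimal sketch:

# Quartiles and interquartile range with numpy
q1, q2, q3 = np.percentile(heightv, [25, 50, 75])
print("Q1 = {:.3f}, Q2 = {:.3f}, Q3 = {:.3f}, IQR = {:.3f}".format(q1, q2, q3, q3 - q1))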

11.3.4. Data Aggregation or Summary Statistics by Subgroup

We next want to summarize the data by subgroup. Let's assume that we would like to compare the average Height by Race. We can use the groupby() method to accomplish this.

groupRace = df.groupby('Race')
print('Summary by Race', groupRace.mean())
Summary by Race         Height  AverageMathSAT        Age    Female  Education
Race
Blk   6.278261      491.652174  20.826087  0.304348   3.565217
Hisp  6.525000      472.750000  21.500000  0.250000   1.750000
Mex   6.171429      491.571429  20.928571  0.285714   2.428571
Oth   8.750000      516.000000  21.000000  0.500000   2.500000
Wht   7.917647      500.411765  20.941176  0.411765   3.117647

We could also use a different category like gender.

groupGender = df.groupby('Female')
print('Summary by Gender', groupGender.mean())
Summary by Gender         Height  AverageMathSAT   Age  Education
Female
0        4.765           479.7  21.4       2.95
1       10.920           521.6  20.0       3.15

Or we can also combine the two or more categories and summarize the data as follows.

groupRaceGender = df.groupby(['Race', 'Female'])
print('Summary by Race and Gender', groupRaceGender.mean())
Summary by Race and Gender                 Height  AverageMathSAT      Age  Education
Race Female
Blk  0        4.350000      479.562500  21.1875   3.562500
     1       10.685714      519.285714  20.0000   3.571429
Hisp 0        5.600000      461.333333  22.0000   1.666667
     1        9.300000      507.000000  20.0000   2.000000
Mex  0        4.480000      481.900000  21.3000   2.100000
     1       10.400000      515.750000  20.0000   3.250000
Oth  0        6.300000      506.000000  22.0000   1.000000
     1       11.200000      526.000000  20.0000   4.000000
Wht  0        5.310000      480.600000  21.6000   3.400000
     1       11.642857      528.714286  20.0000   2.714286

Be careful and don’t forget to add the various categories in a list using the brackets [...]].

11.4. Measures of Linear Relationship

11.4.1. Covariance

We first grab the two variables Age and Height from the DataFrame and define them as the numpy arrays xv and yv.

xv = df['Age'].values
yv = df['Height'].values
n = len(xv)

We then go ahead and calculate the covariance.

# Population covariance
print("Population covariance = {:.3f}".\
  format(np.sum((xv-np.mean(xv))*(yv-np.mean(yv)))/n))

# Sample covariance
print("Sample covariance     = {:.3f}".\
  format(np.sum((xv-np.mean(xv))*(yv-np.mean(yv)))/(n-1)))

# or simply
print("Sample covariance     = {:.3f}".format(df.Age.cov(df.Height)))
Population covariance = -0.159
Sample covariance     = -0.162
Sample covariance     = -0.162

11.4.2. Correlation Coefficient

We can calculate the correlation coefficient by hand or using the pandas toolbox command.

print("Correlation coefficient =  {:.3f}".\
  format((np.sum((xv-np.mean(xv))*(yv-np.mean(yv)))/n)\
  /(np.std(xv)*np.std(yv))))

print("Correlation coefficient = {:.3f}".format(df.Age.corr(df.Height)))
Correlation coefficient =  -0.038
Correlation coefficient = -0.038
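If you want the pairwise correlations between several numeric variables at once, pandas can produce the entire correlation matrix. A minimal sketch, restricting the DataFrame to numeric columns so that the string column Race does not get in the way:

# Correlation matrix of selected numeric variables
print(df[['Height', 'Age', 'AverageMathSAT', 'Education']].corr())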

11.4.3. Regression Line: Example 1

We first generate some data. A variable x and a variable y.

# Define data. 2 vectors xv and yv
xv = np.arange(1,9,1)
yv = np.array([6,1,9,5,17,12,14,15])

We then “cast” these into a pandas DataFrame.

# Define data frame
df = pd.DataFrame(np.array([xv, yv]).T, columns = ['x', 'y'])

We first make a scatterplot with a least squares trend line: \(y = \beta_0 + \beta_1 x + \epsilon\).

p = np.polyfit(xv, yv, 1)
print("p = {}".format(p))

# Scatterplot
fig, ax = plt.subplots()
ax.set_title('Linear regression with polyfit()')
ax.plot(xv, yv, 'o', label = 'data')
ax.plot(xv, np.polyval(p,xv),'-', label = 'Linear regression')
ax.legend(['Data', 'OLS'], loc='best')
plt.show()
p = [1.77380952 1.89285714]
_images/Slides_Data_2_Scatter1_1.png

Example 1 Again with Statsmodels library

We then run the same regression using a more general method called ols which is part of statsmodels. When calling the OLS function you need to add the module name (statsmodels was imported as sm) in front of it: sm.OLS(). You then define the dependent variable y and the independent (explanatory) variables x.

Warning

If you google for OLS and Python you may come across code that uses Pandas directly to run an OLS regression, as in pd.OLS(y=Ydata, x=Xdata). This will no longer work, as this API has been deprecated in newer versions of Pandas. You need to use the statsmodels library instead.

import statsmodels.api as sm
from patsy import dmatrices

# Run OLS regression
# This line adds a constant number column to the X-matrix and extracts
# the relevant columns from the dataframe
y, X = dmatrices('y ~ x', data=df, return_type='dataframe')
res = sm.OLS(y, X).fit()

# Show coefficient estimates
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.609
Model:                            OLS   Adj. R-squared:                  0.544
Method:                 Least Squares   F-statistic:                     9.358
Date:                Thu, 12 May 2022   Prob (F-statistic):             0.0222
Time:                        15:51:22   Log-Likelihood:                -20.791
No. Observations:                   8   AIC:                             45.58
Df Residuals:                       6   BIC:                             45.74
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.8929      2.928      0.646      0.542      -5.272       9.058
x              1.7738      0.580      3.059      0.022       0.355       3.193
==============================================================================
Omnibus:                        0.555   Durbin-Watson:                   3.176
Prob(Omnibus):                  0.758   Jarque-Bera (JB):                0.309
Skew:                           0.394   Prob(JB):                        0.857
Kurtosis:                       2.449   Cond. No.                         11.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/home/jjung/anaconda3/envs/pweaveEnv/lib/python3.9/site-packages/scipy/stats/stats.py:1541: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "

Or if you simply want to have a look at the estimated coefficients you can access them via:

print('Parameters:')
print(res.params)
Parameters:
Intercept    1.892857
x            1.773810
dtype: float64

Finally, we use the model to make a prediction of y when x = 8.5.

# Make prediction for x = 8.5
betas = res.params.values
print("Prediction of y for x = 8.5 is: {:.3f}".\
  format(np.sum(betas * np.array([1,8.5]))))
Prediction of y for x = 8.5 is: 16.970
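Alternatively, you can let statsmodels compute the prediction by passing a row that has the same layout as the design matrix X (constant first, then x); a minimal sketch:

# Predicted y for x = 8.5 using the fitted results object
x_new = np.array([[1.0, 8.5]])   # [constant, x]
print("Prediction of y for x = 8.5 is: {:.3f}".format(res.predict(x_new)[0]))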

We can also use the results from the OLS command to make a scatterplot with the trendline through it.

fig, ax = plt.subplots()
ax.set_title('Linear Regression with Fitted Values')
ax.plot(xv, yv, 'o', label = 'data')
ax.plot(xv, res.predict(X).values,'k-', label = 'Linear regression')
ax.legend(['Data', 'OLS'], loc='best')
plt.show()
_images/Slides_Data_2_Scatter2_1.png

As you have seen above, you can use the estimates from this model to make predictions of Y (called Y-hat) for specific values of the variable X. We next use all the X values from our original data, predict for each one the corresponding Y-value (\(\hat{y}\)), and then calculate the difference between the actual Y value in the data and the predicted value Y-hat. In the graph above, this is essentially the distance of the blue dots (the data) from the black line (the prediction of the model). These differences, \(y-\hat{y}\), are also called the residual errors of your model. The OLS procedure chooses the beta estimates to minimize these residual errors (more precisely, OLS minimizes the sum of all squared errors, \(\sum_i{(y_i-\hat{y}_i)^2}\)).

Here is the code to accomplish this.

y_hatv = res.predict(X).values  # Predicts y-hat based on all X values in data
residualsv = yv - y_hatv

fig, ax = plt.subplots()
ax.set_title('Residual Plot')
ax.plot(yv, residualsv, 'o', label = 'data')
ax.plot(yv, np.zeros(len(yv)),'k-')
ax.set_xlabel('Y - Dependent Variable')
plt.show()
_images/Slides_Data_2_ResidualScatter_1.png

If the variance of the errors differs systematically over the range of X-values we call this heteroskedasticity. If you detect heteroskedasticity you have to adjust for it using a more “robust” estimator. These robust procedures adjust the standard errors (typically making them larger) to account for the effect that heteroskedasticity has on your estimates.

Here is how you estimate the “robust” version of your regression.

import statsmodels.api as sm
from patsy import dmatrices

# Run OLS regression
# This line adds a constant number column to the X-matrix and extracts
# the relevant columns from the dataframe
y, X = dmatrices('y ~ x', data=df, return_type='dataframe')
res_rob = sm.OLS(y, X).fit(cov_type = 'HC3')

# Show coefficient estimates
print(res_rob.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.609
Model:                            OLS   Adj. R-squared:                  0.544
Method:                 Least Squares   F-statistic:                     11.04
Date:                Thu, 12 May 2022   Prob (F-statistic):             0.0160
Time:                        15:51:23   Log-Likelihood:                -20.791
No. Observations:                   8   AIC:                             45.58
Df Residuals:                       6   BIC:                             45.74
Df Model:                           1
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.8929      3.363      0.563      0.574      -4.699       8.484
x              1.7738      0.534      3.322      0.001       0.727       2.820
==============================================================================
Omnibus:                        0.555   Durbin-Watson:                   3.176
Prob(Omnibus):                  0.758   Jarque-Bera (JB):                0.309
Skew:                           0.394   Prob(JB):                        0.857
Kurtosis:                       2.449   Cond. No.                         11.5
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
/home/jjung/anaconda3/envs/pweaveEnv/lib/python3.9/site-packages/scipy/stats/stats.py:1541: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "

Note

Stata uses HC1 as its default when you run reg y x, r, while most R methods default to HC3. Always be sure to check the documentation.

If you compare the previous results to the robust results you will find that the coefficient estimates are identical, but the standard errors change; in this tiny sample the intercept's standard error becomes larger while the slope's becomes slightly smaller.
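If you wanted to mimic Stata's default instead, you could request HC1 standard errors. A minimal sketch, assuming y and X from the regression above:

# Same point estimates, Stata-style HC1 robust standard errors
res_hc1 = sm.OLS(y, X).fit(cov_type='HC1')
print(res_hc1.bse)   # standard errors of the coefficient estimates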

11.4.4. Regression Line: Example 2

In this example we estimate an OLS model with categorical (dummy) variables. This is a bit more involved because we increase the number of explanatory variables and some of them are categorical. In order to use categorical variables in our OLS regression we first have to turn them into so-called dummy variables (i.e., indicator variables that are either 0 or 1).

I will next demonstrate how to make dummy variables. I first show you a more cumbersome but more powerful way that gives you full control over exactly how your dummy variable is defined. Then I show you a quick way to do it if your categorical variable already comes in the form of distinct categories.

We first import our dataset and investigate the form of our categorical variable Race.

df = pd.read_csv(filepath + 'Lecture_Data_Excel_b.csv')
print(df['Race'].describe())

# Let's rename one variable name
df = df.rename(columns={'AverageMathSAT': 'AvgMathSAT'})
count      60
unique      5
top       Blk
freq       23
Name: Race, dtype: object

From this summary we see that our categorical variable has five unique race categories. We therefore start by making a new variable for each one of these race categories, pre-filled with zeros.

Note

I use a prefix d_ for my dummy variables which is a convenient naming convention that allows for quick identification of dummy variables for users who want to work with these data in the future.

df['d_Wht'] = 0
df['d_Blk'] = 0
df['d_Mex'] = 0
df['d_Hisp'] = 0
df['d_Oth'] = 0

We next need to set these dummy variables equal to one whenever the person is of that race. Let us first have a look at what the following command produces.

print(df['Race']=='Wht')
0     False
1      True
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
12    False
13    False
14    False
15    False
16    False
17    False
18     True
19    False
20    False
21    False
22     True
23    False
24    False
25    False
26    False
27     True
28    False
29    False
30    False
31     True
32    False
33     True
34    False
35    False
36    False
37     True
38    False
39     True
40     True
41    False
42    False
43    False
44    False
45     True
46    False
47    False
48    False
49    False
50    False
51    False
52    False
53     True
54    False
55    False
56     True
57     True
58     True
59     True
Name: Race, dtype: bool

This is a boolean vector (a Series of True/False values). It is True whenever the person's race is white and False otherwise. We can use this vector “inside” the DataFrame as a conditional statement. We can combine it with the .loc[] indexer of pandas to replace values in the d_Wht column whenever a condition on the Race column is met. Here is the syntax that replaces zeros in the d_Wht column with the value one whenever the person's race equals Wht in the Race column.

df.loc[(df['Race'] == 'Wht'), 'd_Wht'] = 1
print(df.head())
   Height  Response  AvgMathSAT  Age  Female  Education  Race  d_Wht  d_Blk  \
0     1.1         3         370   20       0          2  Hisp      0      0
1     1.2         4         393   20       0          4   Wht      1      0
2     1.3         4         413   20       0          4   Blk      0      0
3     1.4         5         430   20       0          4   Wht      1      0
4     1.5         3         440   20       0          2   Mex      0      0

   d_Mex  d_Hisp  d_Oth
0      0       0      0
1      0       0      0
2      0       0      0
3      0       0      0
4      0       0      0

You can do this now for the other four race categories to get your complete set of dummy variables.

df.loc[(df['Race'] == 'Blk'), 'd_Blk'] = 1
df.loc[(df['Race'] == 'Mex'), 'd_Mex'] = 1
df.loc[(df['Race'] == 'Hisp'), 'd_Hisp'] = 1
df.loc[(df['Race'] == 'Oth'), 'd_Oth'] = 1
print(df.head())
   Height  Response  AvgMathSAT  Age  Female  Education  Race  d_Wht  d_Blk  \
0     1.1         3         370   20       0          2  Hisp      0      0
1     1.2         4         393   20       0          4   Wht      1      0
2     1.3         4         413   20       0          4   Blk      0      1
3     1.4         5         430   20       0          4   Wht      1      0
4     1.5         3         440   20       0          2   Mex      0      0

   d_Mex  d_Hisp  d_Oth
0      0       1      0
1      0       0      0
2      0       0      0
3      0       0      0
4      1       0      0

This is a very powerful way to make your own dummy variables, as it gives you complete control over what you code as 0 and what you code as 1. Inside .loc[] you can use more complex conditional statements such as >, <, >=, <=, as well as the logical operators | (which stands for the logical or) and & (which stands for the logical and).

Warning

Python's or and and statements require single truth values. For pandas Series these are considered ambiguous, so you should use the element-wise (“bitwise”) operators | (or) and & (and) instead.

We next create a dummy variable that indicates whether somebody is Mexican or Hispanic.

df['d_Mex_Hisp'] = 0
df.loc[((df['Race'] == 'Mex') | (df['Race'] == 'Hisp')), 'd_Mex_Hisp'] = 1
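The same pattern works with & and with inequality conditions. As a purely illustrative example (the column d_Tall_Fem is made up here), the following marks women with a height above 10:

# Hypothetical dummy: tall women (Height > 10 and Female == 1)
df['d_Tall_Fem'] = 0
df.loc[(df['Height'] > 10) & (df['Female'] == 1), 'd_Tall_Fem'] = 1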

Here is the quick way to make dummy variables out of a categorical variable with distinct categories. I first import the data again, so we can work with the original (raw) data.

df = pd.read_csv(filepath + 'Lecture_Data_Excel_b.csv')

dummies = pd.get_dummies(df['Race'], prefix = 'd')
df = df.join(dummies)
print(df.head())
   Height  Response  AverageMathSAT  Age  Female  Education  Race  d_Blk  \
0     1.1         3             370   20       0          2  Hisp      0
1     1.2         4             393   20       0          4   Wht      0
2     1.3         4             413   20       0          4   Blk      1
3     1.4         5             430   20       0          4   Wht      0
4     1.5         3             440   20       0          2   Mex      0

   d_Hisp  d_Mex  d_Oth  d_Wht
0       1      0      0      0
1       0      0      0      1
2       0      0      0      0
3       0      0      0      1
4       0      1      0      0

We next use OLS again to run the regression. At the end we make a prediction based on some values for the independent variables x1, x2, etc.

When you specify your regression with the race dummy variables you need to make sure that you drop one of them. Here I drop the d_Wht category; the dropped category is referred to as the base category. Which category you drop is arbitrary, but you must drop one of them, as you would otherwise run into the problem of perfect multicollinearity.

import statsmodels.api as sm
from patsy import dmatrices

y, X = dmatrices('Height ~ Age + Education + Female + d_Blk + d_Hisp + d_Mex + d_Oth', data=df, return_type='dataframe')
res = sm.OLS(y, X).fit()

# Show coefficient estimates
print(res.summary())
print()
print('Parameters:')
print(res.params)
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Height   R-squared:                       0.944
Model:                            OLS   Adj. R-squared:                  0.936
Method:                 Least Squares   F-statistic:                     124.4
Date:                Thu, 12 May 2022   Prob (F-statistic):           3.52e-30
Time:                        15:51:23   Log-Likelihood:                -74.722
No. Observations:                  60   AIC:                             165.4
Df Residuals:                      52   BIC:                             182.2
Df Model:                           7
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -35.1898      2.662    -13.220      0.000     -40.531     -29.849
Age            1.8944      0.124     15.269      0.000       1.645       2.143
Education     -0.0514      0.127     -0.406      0.686      -0.305       0.203
Female         8.7359      0.308     28.387      0.000       8.118       9.353
d_Blk         -0.4600      0.299     -1.540      0.130      -1.059       0.139
d_Hisp        -1.1084      0.534     -2.074      0.043      -2.181      -0.036
d_Mex         -0.6566      0.338     -1.941      0.058      -1.335       0.022
d_Oth         -0.0816      0.681     -0.120      0.905      -1.448       1.285
==============================================================================
Omnibus:                        0.297   Durbin-Watson:                   0.581
Prob(Omnibus):                  0.862   Jarque-Bera (JB):                0.478
Skew:                          -0.097   Prob(JB):                        0.787
Kurtosis:                       2.607   Cond. No.                         486.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Parameters:
Intercept   -35.189807
Age           1.894377
Education    -0.051375
Female        8.735906
d_Blk        -0.459986
d_Hisp       -1.108371
d_Mex        -0.656576
d_Oth        -0.081628
dtype: float64

When you interpret the coefficient estimates of the race dummy variables, you interpret them relative to the dropped category, which is “white” in my case. Here this would mean that, on average, African Americans are \(0.46\) height units shorter than individuals in the omitted base category. Similarly, Hispanics are on average \(1.1084\) height units shorter than individuals in the base category. Keep in mind that most of these estimates are not statistically significant at conventional levels (large p-values) and the sample is extremely small. So be careful with your statements!

Prediction: If age = 20, education = 12 years, gender = female and you are Mexican then the predicted height is …

# Take out results and define it as numpy-vector
betas = res.params.values
print("Prediction: {}".\
  format(np.sum(betas * np.array([1, 20, 12, 1, 0, 0, 1,0]))))
Prediction: 10.16056748631931

And here is the residual plot again to check for heteroskedasticity:

yv = y['Height'].values
y_hatv = res.predict(X).values  # Predicts y-hat based on all X values in data
residualsv = yv - y_hatv

fig, ax = plt.subplots()
ax.set_title('Residual Plot')
ax.plot(yv, residualsv, 'o', label = 'data')
ax.plot(yv, np.zeros(len(yv)),'k-')
ax.set_xlabel('Height (Y dependent variable)')
plt.show()
_images/Slides_Data_2_ResidualScatter2_1.png

11.5. Measures of Non-Linear Relationship

We next demonstrate how to run Probit and Logit regressions. These types of regression models are used when the dependent variable is a categorical 0/1 variable. If you instead ran a simple linear OLS regression on such a variable, the predictions from this linear probability model could fall outside of the [0,1] range. The Probit and Logit specifications ensure that the predictions are probabilities that fall within the [0,1] range.

Make sure that you have already downloaded the Excel data file Lecture_Data_Excel_c.xlsx.

We next import the Excel file data directly from Excel using the pd.read_excel() function from Pandas.

Important

In order for this to work you need to make sure that the xlrd package is installed. Open a terminal window and type:

conda install xlrd

Press yes at the prompt and it will install the package.
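Note that newer versions of pandas no longer read .xlsx files with xlrd but with the openpyxl package. If the import below fails, installing openpyxl (conda install openpyxl) and, if needed, naming the engine explicitly should help; a minimal sketch:

# Read the sheet named 'binary' with the engine spelled out explicitly
df = pd.read_excel(filepath + "Lecture_Data_Excel_c.xlsx", sheet_name='binary', engine='openpyxl')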

You can now use the command. It is important to specify the Excel Sheet that you want to import data from since an Excel file can have multiple spreadsheets inside.

In our example the data resides in the sheet with the name binary. So we specify this in the read_excel() function.

# Read entire excel spreadsheet
filepath = 'Lecture_Data/'
df = pd.read_excel(filepath + "Lecture_Data_Excel_c.xlsx", 'binary')

# Check how many sheets we have inside the excel file
#xl.sheet_names

# Pick one sheet and define it as your DataFrame by parsing a sheet
#df = xl.parse("binary")

print(df.head())

# rename the 'rank' column because there is also a DataFrame method called 'rank'
df.columns = ["admit", "gre", "gpa", "prestige"]
   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4

We next make dummies out of the rank variable.

# dummify rank
dummy_ranks = pd.get_dummies(df['prestige'], prefix='d_prest')

# Join dataframes
df = df.join(dummy_ranks)
print(df.head())
   admit  gre   gpa  prestige  d_prest_1  d_prest_2  d_prest_3  d_prest_4
0      0  380  3.61         3          0          0          1          0
1      1  660  3.67         3          0          0          1          0
2      1  800  4.00         1          1          0          0          0
3      1  640  3.19         4          0          0          0          1
4      0  520  2.93         4          0          0          0          1

11.5.1. Logit

We now estimate a Logit model as follows:

# Import the correct version of statsmodels
import statsmodels.api as sm
from patsy import dmatrices

y, X = dmatrices('admit ~ gre + gpa + d_prest_2 + d_prest_3 + d_prest_4', data=df, return_type='dataframe')

# Define the model
logit = sm.Logit(y, X)

# fit the model
res = logit.fit()

# Print results
print(res.summary())
Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                          Logit   Df Residuals:                      394
Method:                           MLE   Df Model:                            5
Date:                Thu, 12 May 2022   Pseudo R-squ.:                 0.08292
Time:                        15:51:24   Log-Likelihood:                -229.26
converged:                       True   LL-Null:                       -249.99
Covariance Type:            nonrobust   LLR p-value:                 7.578e-08
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.9900      1.140     -3.500      0.000      -6.224      -1.756
gre            0.0023      0.001      2.070      0.038       0.000       0.004
gpa            0.8040      0.332      2.423      0.015       0.154       1.454
d_prest_2     -0.6754      0.316     -2.134      0.033      -1.296      -0.055
d_prest_3     -1.3402      0.345     -3.881      0.000      -2.017      -0.663
d_prest_4     -1.5515      0.418     -3.713      0.000      -2.370      -0.733
==============================================================================

A slightly different summary:

print(res.summary2())
                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: 0.083
Dependent Variable: admit            AIC:              470.5175
Date:               2022-05-12 15:51 BIC:              494.4663
No. Observations:   400              Log-Likelihood:   -229.26
Df Model:           5                LL-Null:          -249.99
Df Residuals:       394              LLR p-value:      7.5782e-08
Converged:          1.0000           Scale:            1.0000
No. Iterations:     6.0000
------------------------------------------------------------------
               Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
------------------------------------------------------------------
Intercept     -3.9900    1.1400  -3.5001  0.0005  -6.2242  -1.7557
gre            0.0023    0.0011   2.0699  0.0385   0.0001   0.0044
gpa            0.8040    0.3318   2.4231  0.0154   0.1537   1.4544
d_prest_2     -0.6754    0.3165  -2.1342  0.0328  -1.2958  -0.0551
d_prest_3     -1.3402    0.3453  -3.8812  0.0001  -2.0170  -0.6634
d_prest_4     -1.5515    0.4178  -3.7131  0.0002  -2.3704  -0.7325
=================================================================

And here is the residual plot again to check for heteroskedasticity:

yv = y['admit'].values
y_hatv = res.predict(X).values  # Predicts y-hat based on all X values in data
residualsv = yv - y_hatv

fig, ax = plt.subplots()
ax.set_title('Residual Plot')
ax.plot(yv, residualsv, 'o', label = 'data')
ax.plot(yv, np.zeros(len(yv)),'k-')
ax.set_xlabel('admit (Y dependent variable)')
plt.show()
_images/Slides_Data_2_ResidualScatter3_1.png

Not surprisingly, since the dependent variable can only take on the values 0 or 1, the residual plot shows residuals lined up at these two locations on the horizontal axis.
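Because the fitted Logit model returns probabilities, we can also predict the admission probability for a hypothetical applicant. A small sketch (the GRE and GPA values are made up; the row must follow the column order of X: constant, gre, gpa, d_prest_2, d_prest_3, d_prest_4):

# Hypothetical applicant: GRE 600, GPA 3.5, rank-3 undergraduate institution
x_new = np.array([[1.0, 600, 3.5, 0, 1, 0]])
print("Predicted admission probability: {:.3f}".format(res.predict(x_new)[0]))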

11.5.2. Probit

If you want to run a Probit regression, you can do:

import statsmodels.api as sm
from patsy import dmatrices

y, X = dmatrices('admit ~ gre + gpa + d_prest_2 + d_prest_3 + d_prest_4', data=df, return_type='dataframe')
# Define the model
probit = sm.Probit(y, X)

# Fit the model
res = probit.fit()

# Print results
print(res.summary())
Optimization terminated successfully.
         Current function value: 0.573016
         Iterations 5
                          Probit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                         Probit   Df Residuals:                      394
Method:                           MLE   Df Model:                            5
Date:                Thu, 12 May 2022   Pseudo R-squ.:                 0.08313
Time:                        15:51:25   Log-Likelihood:                -229.21
converged:                       True   LL-Null:                       -249.99
Covariance Type:            nonrobust   LLR p-value:                 7.219e-08
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.3868      0.674     -3.541      0.000      -3.708      -1.066
gre            0.0014      0.001      2.120      0.034       0.000       0.003
gpa            0.4777      0.195      2.444      0.015       0.095       0.861
d_prest_2     -0.4154      0.195     -2.126      0.033      -0.798      -0.032
d_prest_3     -0.8121      0.209     -3.893      0.000      -1.221      -0.403
d_prest_4     -0.9359      0.246     -3.810      0.000      -1.417      -0.454
==============================================================================

A slightly different summary:

print(res.summary2())
                         Results: Probit
=================================================================
Model:              Probit           Pseudo R-squared: 0.083
Dependent Variable: admit            AIC:              470.4132
Date:               2022-05-12 15:51 BIC:              494.3620
No. Observations:   400              Log-Likelihood:   -229.21
Df Model:           5                LL-Null:          -249.99
Df Residuals:       394              LLR p-value:      7.2189e-08
Converged:          1.0000           Scale:            1.0000
No. Iterations:     5.0000
------------------------------------------------------------------
               Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
------------------------------------------------------------------
Intercept     -2.3868    0.6741  -3.5408  0.0004  -3.7080  -1.0656
gre            0.0014    0.0006   2.1200  0.0340   0.0001   0.0026
gpa            0.4777    0.1955   2.4441  0.0145   0.0946   0.8608
d_prest_2     -0.4154    0.1954  -2.1261  0.0335  -0.7983  -0.0325
d_prest_3     -0.8121    0.2086  -3.8934  0.0001  -1.2210  -0.4033
d_prest_4     -0.9359    0.2456  -3.8101  0.0001  -1.4173  -0.4545
=================================================================

11.6. Tutorials

Some of the notes above are summaries of these tutorials: