12. Working with Data III: Case Study

In this chapter we work through a case study using coronavirus data. We will track infection numbers and plot figures using the latest data on the spread of the coronavirus.

12.1. Importing Data

We will be working with data from the Johns Hopkins Whiting School of Engineering, Center for Systems Science and Engineering. Their GitHub portal is at: https://github.com/CSSEGISandData

This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).

You can find their dashboard with all the visual information under this link: Dashboard

We will use their data (it’s updated twice daily) and make our own graphs. We first need to import the data from their website. We can do this with the Pandas function pd.read_csv(), which accepts a URL as well as a local file name.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ----------------------------------------
i_downloadData = 1   # Indicator flag whether you want to freshly download the
                     # data
# ----------------------------------------

if i_downloadData == 1:
    urlBase = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'
    urlConf = urlBase + 'time_series_covid19_confirmed_global.csv'
    urlDead = urlBase + 'time_series_covid19_deaths_global.csv'
    urlRec  = urlBase + 'time_series_covid19_recovered_global.csv'

    # Download and Save
    dfConf = pd.read_csv(urlConf, on_bad_lines='skip')
    #dfConf.to_pickle('CoronaConfirmed')

    dfDead = pd.read_csv(urlDead, on_bad_lines='skip')
    #dfDead.to_pickle('CoronaDeath')

    dfRec = pd.read_csv(urlRec, on_bad_lines='skip')
    #dfRec.to_pickle('CoronaRecovered')

Note

If you want to store the data locally and not download it every time you run your script file, you could save the data first with:

# Save data locally on harddrive
dfConf.to_pickle('CoronaConfirmed')
dfDead.to_pickle('CoronaDeath')
dfRec.to_pickle('CoronaRecovered')

And then simply read it from your hard drive using:

# Read data from harddrive
dfConf = pd.read_pickle('CoronaConfirmed')
dfDead = pd.read_pickle('CoronaDeath')
dfRec  = pd.read_pickle('CoronaRecovered')

You would then of course have to comment out the web-reading section above or set the i_downloadData flag equal to zero so that the downloading part gets skipped.
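The download-or-load logic described in this note could also be wrapped in a small helper. The following is only a sketch and not part of the original script: the function name load_cached and its loader argument are hypothetical, and the stand-in loader returns made-up data so that the example runs without an internet connection.

```python
import os
import tempfile
import pandas as pd


def load_cached(path, loader, refresh=False):
    """Return a dataframe from a local pickle, or build it and save it.

    `loader` is any zero-argument function that returns a dataframe,
    for example: lambda: pd.read_csv(urlConf)
    """
    if refresh or not os.path.exists(path):
        df = loader()          # e.g. download from the web
        df.to_pickle(path)     # keep a local copy for the next run
    else:
        df = pd.read_pickle(path)
    return df


# Demonstration with a stand-in loader (made-up data, no download)
calls = []

def fake_loader():
    calls.append(1)
    return pd.DataFrame({'Country/Region': ['A', 'B'], '1/22/20': [0, 1]})

cache_file = os.path.join(tempfile.mkdtemp(), 'CoronaConfirmed')
first = load_cached(cache_file, fake_loader)    # runs the loader, writes the pickle
second = load_cached(cache_file, fake_loader)   # reads the pickle instead
print(len(calls))                               # number of times the loader ran
```

Passing refresh=True would force a fresh download, which plays the same role as setting i_downloadData to 1.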

Instead of saving the data to a file I will simply copy the imported data into new dataframes that I am not going to manipulate.

dfConf_orig = dfConf.copy()
dfDead_orig = dfDead.copy()
dfRec_orig  = dfRec.copy()

For each application I will then copy the original data from the _orig dataframes.

Have a careful look at the data. Use the Variable Explorer tab in Spyder to investigate the dataframe. Each row is an observation over time of confirmed coronavirus infections, with Province/State as the “smallest” geographical unit. The second column tells you which country the Province/State belongs to, and the next two columns hold the Latitude/Longitude coordinates of the Province/State that reported the cases. These are followed by the daily observations for this Province/State.
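If you are not working in Spyder, the same inspection can be done with a few Pandas commands. The sketch below uses a tiny made-up extract (the numbers are invented, only the column layout mirrors the description above) and reads it from a string so that it runs without a download; with the real data you would apply the same commands to dfConf.

```python
import io
import pandas as pd

# Made-up three-row extract with the column layout described above
csv_text = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Afghanistan,33.9,67.7,0,0,0
New South Wales,Australia,-33.9,151.2,0,1,1
Hubei,China,31.0,112.3,10,12,15
"""
dfSample = pd.read_csv(io.StringIO(csv_text))

print(dfSample.shape)              # (number of rows, number of columns)
print(list(dfSample.columns[:4]))  # the four descriptive columns
print(list(dfSample.columns[4:]))  # the daily observation columns
print(dfSample.head())             # first rows of the dataframe
```

Note that countries reported as a whole have no entry in the Province/State column, which Pandas reads as a missing value (NaN).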

12.2. Plotting Cases of Infections

I first copy the original data into new dataframes because I want to keep the raw data intact and untouched in case I want to come back to it later, which we will!

dfConf = dfConf_orig.copy()
dfDead = dfDead_orig.copy()
dfRec  = dfRec_orig.copy()

We next add a column with the total number of confirmed coronavirus cases for each Province/State. The time series of cases starts in column five (position 4) and holds cumulative counts, so the last column already contains the total to date. We therefore copy the last column with iloc[:,-1] instead of summing over the columns.

dfConf['Confirmed']=dfConf.iloc[:,-1]
dfDead['Dead']=dfDead.iloc[:,-1]
dfRec['Recovered']=dfRec.iloc[:,-1]
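The code above takes the last column rather than a sum over all dates because the counts in the time series are cumulative. A minimal sketch with made-up numbers illustrates the difference: the daily new cases add up to the last column, whereas summing the cumulative columns would overcount. The dataframe dfToy and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts for two provinces
dfToy = pd.DataFrame({
    'Province/State': ['P1', 'P2'],
    'Country/Region': ['X', 'X'],
    'Lat': [0.0, 1.0],
    'Long': [0.0, 1.0],
    '1/22/20': [1, 0],
    '1/23/20': [3, 2],
    '1/24/20': [6, 2],
})

counts = dfToy.iloc[:, 4:].to_numpy()            # cumulative counts per day
latest = dfToy.iloc[:, -1]                       # total to date (last column)
daily_new = np.diff(counts, axis=1, prepend=0)   # new cases on each day

print(latest.tolist())                 # total cases to date
print(daily_new.sum(axis=1).tolist())  # the same totals, rebuilt from daily changes
print(counts.sum(axis=1).tolist())     # summing cumulative counts overcounts
```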

We then drop the entire time series and only keep the total number of cases. We are not interested in the single day observations for this first summary graph.

# Drop all date columns: positions 4 up to (but excluding) the new last column
dfConf.drop(columns=dfConf.columns[4:-1], inplace=True)
dfDead.drop(columns=dfDead.columns[4:-1], inplace=True)
dfRec.drop(columns=dfRec.columns[4:-1], inplace=True)

We next merge the three dataframes together by Province/State, Country/Region, Lat, and Long variables so that we have the sum of all confirmed infection cases, the sum of all corona virus associated deaths, and the sum of all the recovered cases for each Province/State in the same dataframe.

dftemp = pd.merge(dfConf, dfDead,
   on=['Province/State', 'Country/Region','Lat','Long'], how='inner')
df = pd.merge(dftemp, dfRec,
   on=['Province/State', 'Country/Region','Lat','Long'], how='inner')

We now have one dataframe with the total number of confirmed coronavirus cases, the total number of deaths due to the coronavirus, and the total number of recorded recoveries from a coronavirus infection.
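Keep in mind that an inner merge only keeps rows whose key values appear in both dataframes, so a Province/State that is missing from one of the files is silently dropped. The sketch below uses two made-up dataframes (left and right are hypothetical names) to show this, and uses an outer merge with indicator=True as one way to see which rows did not match.

```python
import pandas as pd

keys = ['Province/State', 'Country/Region', 'Lat', 'Long']

# Made-up data: P2 only has confirmed cases, P3 only has deaths
left = pd.DataFrame({
    'Province/State': ['P1', 'P2'],
    'Country/Region': ['X', 'X'],
    'Lat': [0.0, 1.0], 'Long': [0.0, 1.0],
    'Confirmed': [100, 50],
})
right = pd.DataFrame({
    'Province/State': ['P1', 'P3'],
    'Country/Region': ['X', 'Y'],
    'Lat': [0.0, 2.0], 'Long': [0.0, 2.0],
    'Dead': [4, 1],
})

inner = pd.merge(left, right, on=keys, how='inner')
outer = pd.merge(left, right, on=keys, how='outer', indicator=True)

print(inner)                                  # only rows found in both
print(outer[['Province/State', '_merge']])    # where each row came from
```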

We next plot the infection cases by the latitude and longitude of the province/state where they were recorded. We plot circles and use the number of cases divided by 1000 as the circle size. The larger the circle in the plot, the more cases have been recorded for that Latitude/Longitude coordinate.

ax = df.plot(kind="scatter", x="Long", y="Lat", alpha=0.4,
    s=df["Confirmed"]/1000, label="Confirmed Infections", color = "Blue")
df.plot(kind="scatter", x="Long", y="Lat", alpha=0.4,
    s=df["Dead"]/1000, label="Deaths", color="Red", ax=ax)
plt.legend()
plt.show()
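In matplotlib the s argument of a scatter plot is the marker area in points squared, so dividing the case numbers by 1000 makes locations with few cases almost invisible. One possible tweak, sketched here with made-up numbers, is to impose a minimum marker size. The floor of 5 is an arbitrary choice for illustration.

```python
import pandas as pd

confirmed = pd.Series([250, 12_000, 3_400_000])   # made-up case counts

sizes = confirmed / 1000               # sizes as used in the plot above
sizes_floor = sizes.clip(lower=5)      # never smaller than 5 points squared

print(sizes.tolist())
print(sizes_floor.tolist())
```

The resulting series could then be passed as s=sizes_floor in the plot command.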

From this graph you can already see the outline of countries. However, it would be better if we could superimpose the information onto a real map of the world. We do this next.

12.3. Plotting Cases of Infections on a Map

We next use the Cartopy library to plot the same information superimposed on a world map.

Note

You will need to install the cartopy library via the command line. Open a command line terminal and type:

conda install -c conda-forge cartopy

Press Enter, then type y for yes when it prompts you. This should install the cartopy library.

We will next import the cartopy library.

import cartopy.crs as ccrs

fig = plt.figure(figsize=(14, 14))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
# Label both scatter layers so that plt.legend() has entries to show
plt.scatter(df['Long'].values, df['Lat'].values, transform=ccrs.PlateCarree(),
        label="Confirmed Infections", s=df["Confirmed"]/1000, c="Blue",
        linewidth=0, alpha=0.4)
plt.scatter(df['Long'].values, df['Lat'].values, transform=ccrs.PlateCarree(),
        label="Deaths", s=df["Dead"]/1000, c="Red", linewidth=0, alpha=0.4)

plt.legend()
plt.show()

You now have a nice plot of the world map and the corona cases superimposed on it.

12.4. Plotting Time Series of Cases

We now go back to the original dataframe with the time series data of confirmed coronavirus cases. We then drop some variables that we do not need, such as Province/State, Lat, and Long.

# Here is the original data again
dfConf = dfConf_orig.copy()
dfDead = dfDead_orig.copy()
dfRec  = dfRec_orig.copy()

Now pick the confirmed cases and drop some columns:

dfConf_t = dfConf
dfConf_t = dfConf_t.drop(columns = ['Province/State', 'Lat', 'Long'])

Let us have a look at the first three rows to see what our raw data look like.

print(dfConf_t[0:3])

  Country/Region  1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  \
0    Afghanistan        0        0        0        0        0        0
1        Albania        0        0        0        0        0        0
2        Algeria        0        0        0        0        0        0

   1/28/20  1/29/20  1/30/20  ...  1/17/22  1/18/22  1/19/22  1/20/22  \
0        0        0        0  ...   158826   158974   159070   159303
1        0        0        0  ...   233654   236486   239129   241512
2        0        0        0  ...   226749   227559   228918   230470

   1/21/22  1/22/22  1/23/22  1/24/22  1/25/22  1/26/22
0   159516   159548   159649   159896   160252   160692
1   244182   246412   248070   248070   248859   251015
2   232325   234536   236670   238885   241406   243568

[3 rows x 737 columns]

We next add up all the cases for each day by Country/Region using the groupby() function that comes with Pandas.

dfConf_t = dfConf_t.groupby('Country/Region').sum()

And again, let us peek at the data real quick.

print(dfConf_t[0:3])

                1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  1/28/20  \
Country/Region
Afghanistan           0        0        0        0        0        0        0
Albania               0        0        0        0        0        0        0
Algeria               0        0        0        0        0        0        0

                1/29/20  1/30/20  1/31/20  ...  1/17/22  1/18/22  1/19/22  \
Country/Region                             ...
Afghanistan           0        0        0  ...   158826   158974   159070
Albania               0        0        0  ...   233654   236486   239129
Algeria               0        0        0  ...   226749   227559   228918

                1/20/22  1/21/22  1/22/22  1/23/22  1/24/22  1/25/22  1/26/22
Country/Region
Afghanistan      159303   159516   159548   159649   159896   160252   160692
Albania          241512   244182   246412   248070   248070   248859   251015
Algeria          230470   232325   234536   236670   238885   241406   243568

[3 rows x 736 columns]
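To see what groupby() followed by sum() does, here is a minimal sketch on made-up data in which one country is reported in two separate rows. The dataframe dfToy and its numbers are invented for illustration.

```python
import pandas as pd

# Made-up data: Australia is split into two rows (two provinces)
dfToy = pd.DataFrame({
    'Country/Region': ['Australia', 'Australia', 'Austria'],
    '1/22/20': [1, 2, 0],
    '1/23/20': [3, 4, 1],
})

# One row per country; the provinces are added up for each day
byCountry = dfToy.groupby('Country/Region').sum()
print(byCountry)
```

The grouping variable Country/Region becomes the index of the result, which is why the real dataframe has one column fewer after this step.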

We now transpose the dataframe because we want the time observations as rows and not columns.

dfConf_t = dfConf_t.T
print(dfConf_t[0:3])

Country/Region  Afghanistan  Albania  Algeria  Andorra  Angola  \
1/22/20                   0        0        0        0       0
1/23/20                   0        0        0        0       0
1/24/20                   0        0        0        0       0

Country/Region  Antigua and Barbuda  Argentina  Armenia  Australia  Austria  \
1/22/20                           0          0        0          0        0
1/23/20                           0          0        0          0        0
1/24/20                           0          0        0          0        0

Country/Region  ...  United Kingdom  Uruguay  Uzbekistan  Vanuatu  Venezuela  \
1/22/20         ...               0        0           0        0          0
1/23/20         ...               0        0           0        0          0
1/24/20         ...               0        0           0        0          0

Country/Region  Vietnam  West Bank and Gaza  Yemen  Zambia  Zimbabwe
1/22/20               0                   0      0       0         0
1/23/20               2                   0      0       0         0
1/24/20               2                   0      0       0         0

[3 rows x 196 columns]

We are now ready to convert the index of the dataframe into a date index so that we can use the built-in time series commands in Pandas.

# Converting the index as date
dfConf_t.index = pd.to_datetime(dfConf_t.index)

print(dfConf_t[0:3])

Country/Region  Afghanistan  Albania  Algeria  Andorra  Angola  \
2020-01-22                0        0        0        0       0
2020-01-23                0        0        0        0       0
2020-01-24                0        0        0        0       0

Country/Region  Antigua and Barbuda  Argentina  Armenia  Australia  Austria  \
2020-01-22                        0          0        0          0        0
2020-01-23                        0          0        0          0        0
2020-01-24                        0          0        0          0        0

Country/Region  ...  United Kingdom  Uruguay  Uzbekistan  Vanuatu  Venezuela  \
2020-01-22      ...               0        0           0        0          0
2020-01-23      ...               0        0           0        0          0
2020-01-24      ...               0        0           0        0          0

Country/Region  Vietnam  West Bank and Gaza  Yemen  Zambia  Zimbabwe
2020-01-22            0                   0      0       0         0
2020-01-23            2                   0      0       0         0
2020-01-24            2                   0      0       0         0

[3 rows x 196 columns]
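The column labels in this data set are written month first, for example 1/22/20. One optional refinement, sketched below on made-up values, is to state the format explicitly when converting so that there is no ambiguity between month-first and day-first dates. Once the index holds dates, you can also select a whole month with a partial date string.

```python
import pandas as pd

labels = pd.Index(['1/22/20', '1/23/20', '12/1/21'])

# %m = month, %d = day, %y = two-digit year
dates = pd.to_datetime(labels, format='%m/%d/%y')
print(dates)

# Made-up series with the date index
s = pd.Series([0, 2, 5], index=dates)
print(s.loc['2020-01'])   # all observations from January 2020
```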

Depending on when you download the data, the last row may still be empty (all zeros). We check for this and drop the last observation only in that case.

if np.sum(dfConf_t.iloc[-1,:].values) > 0:
    # Do nothing
    print('Data complete.')
else:
    # If data is not there yet, drop last row
    dfConf_t = dfConf_t[:-1]
#end

Data complete.

We are now ready to plot the time series for different countries. We can choose the number of days we want to plot. Here we choose the most recent 700 observations.

nrObs = -700  # Just plot the recent 700 obs (days)

ax = dfConf_t['US'].iloc[nrObs:].plot()

ax.set_title('US: Number of Infections')
# Customize the major grid
ax.grid(which='major', linestyle='-', linewidth='0.5', color='Black')
# Customize the minor grid
ax.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.legend()
plt.show()

In order to plot multiple countries in a single graph, I first make a list of countries and then run a loop over this list and invoke the plot command. Otherwise we would have a lot of repeated code, which is bad programming style.

#%% Plot

countryList = ['US', 'China', 'Korea, South', 'Austria', 'Japan', 'Italy', \
                'Germany', 'Spain', 'France', 'United Kingdom']

# Shorter list for alternative graph
# countryList = ['US', 'Korea, South', 'Austria', 'Japan', \
#                'Germany', 'Spain', 'France', 'United Kingdom']

nrObs = -700

ax = dfConf_t[countryList[0]].iloc[nrObs:].plot(marker = '*')
for x in countryList[1:]:
    dfConf_t[x].iloc[nrObs:].plot(ax=ax, marker = '.')

ax.set_title('Corona Virus Infections: Absolute Levels')
# Customize the major grid
ax.grid(which='major', linestyle='-', linewidth='0.5', color='Black')
# Customize the minor grid
ax.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.legend()
plt.show()
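As an alternative to the loop, Pandas can plot several columns in one call if you select them with a list. The sketch below uses a small made-up dataframe; with the real data you would write dfConf_t[countryList].iloc[nrObs:].plot(). The line matplotlib.use('Agg') is only there so that the sketch runs without a display and should be left out in Spyder.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend; omit this line in Spyder
import matplotlib.pyplot as plt
import pandas as pd

# Made-up cumulative counts for three countries
idx = pd.date_range('2020-01-22', periods=5, freq='D')
dfToy = pd.DataFrame({'US': [0, 1, 2, 5, 9],
                      'Italy': [0, 0, 3, 4, 4],
                      'Japan': [1, 1, 2, 2, 3]}, index=idx)

countries = ['US', 'Italy']

# One call draws one line per selected column and adds a legend
ax = dfToy[countries].iloc[-3:].plot(marker='.')
ax.set_title('Toy data: Absolute Levels')
ax.grid(which='major', linestyle='-', linewidth=0.5, color='black')

n_lines = len(ax.get_lines())
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(n_lines, legend_labels)
plt.close('all')
```

The loop version remains useful when individual lines need different markers, as in the graph above where the first country is drawn with a star.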

We next investigate the changes in the numbers from one day to the next using the diff() function. It basically subtracts consecutive observations from each other, i.e., it takes the number of infections from day t and subtracts the number of infections from the prior day t-1.

ax = dfConf_t[countryList[0]].iloc[nrObs:].diff().plot(marker = '*')
for x in countryList[1:]:
    dfConf_t[x].iloc[nrObs:].diff().plot(ax=ax, marker = '.')

ax.set_title('Corona Virus Infections: Daily Increases')
# Customize the major grid
ax.grid(which='major', linestyle='-', linewidth='0.5', color='Black')
# Customize the minor grid
ax.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.legend()
plt.show()
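A small detail about the order of operations: if you slice first and then call diff(), the first plotted day has no predecessor inside the slice and becomes a missing value. Taking the difference first and slicing afterwards keeps that day. The sketch below shows both variants on a made-up series.

```python
import pandas as pd

idx = pd.date_range('2020-01-22', periods=5, freq='D')
cumulative = pd.Series([0, 2, 2, 7, 10], index=idx)   # made-up cumulative counts

slice_then_diff = cumulative.iloc[-3:].diff()   # first value is NaN
diff_then_slice = cumulative.diff().iloc[-3:]   # first value is a real change

print(slice_then_diff)
print(diff_then_slice)
```

For a graph with several hundred observations the single missing point is hardly visible, so either order works for the plots in this chapter.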

12.5. Plotting Time Series of Corona Virus Deaths

We again need to transform the data into a time series dataframe first.

# Number of deaths
dfDead_t = dfDead
dfDead_t = dfDead_t.drop(columns = ['Province/State', 'Lat', 'Long'])
dfDead_t = dfDead_t.groupby('Country/Region').sum()
dfDead_t = dfDead_t.T
# Converting the index as date
dfDead_t.index = pd.to_datetime(dfDead_t.index)
if np.sum(dfDead_t.iloc[-1,:].values) > 0:
    # Do nothing
    print('Data complete.')
else:
    # If data is not there yet, drop last row
    dfDead_t = dfDead_t[:-1]
#end

Data complete.
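These are the same steps that we applied to the confirmed cases earlier. Since repeated code is bad programming style, one option is to collect the steps in a function. The following is a sketch under that assumption: the function name to_country_time_series is hypothetical, the explicit date format is an optional addition, and the toy dataframe is made up.

```python
import pandas as pd


def to_country_time_series(df):
    """Turn a wide Province/State dataframe into a date-indexed one by country."""
    out = df.drop(columns=['Province/State', 'Lat', 'Long'])
    out = out.groupby('Country/Region').sum()   # add up provinces per country
    out = out.T                                 # dates as rows, countries as columns
    out.index = pd.to_datetime(out.index, format='%m/%d/%y')
    # Drop the last row if it holds no data yet (all zeros)
    if out.iloc[-1, :].sum() == 0:
        out = out.iloc[:-1]
    return out


# Made-up data with an empty last day
dfToy = pd.DataFrame({
    'Province/State': ['P1', 'P2', None],
    'Country/Region': ['X', 'X', 'Y'],
    'Lat': [0.0, 1.0, 2.0],
    'Long': [0.0, 1.0, 2.0],
    '1/22/20': [1, 0, 2],
    '1/23/20': [3, 2, 2],
    '1/24/20': [0, 0, 0],
})

ts = to_country_time_series(dfToy)
print(ts)
```

With the real data you could then write dfConf_t = to_country_time_series(dfConf) and dfDead_t = to_country_time_series(dfDead).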

And we can now plot the information. We again start with the absolute levels.

ax = dfDead_t[countryList[0]].iloc[nrObs:].plot(marker = '*')
for x in countryList[1:]:
    dfDead_t[x].iloc[nrObs:].plot(ax=ax, marker = '.')

ax.set_title('Corona Virus Deaths: Absolute Levels')
# Customize the major grid
ax.grid(which='major', linestyle='-', linewidth='0.5', color='Black')
# Customize the minor grid
ax.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.legend()
plt.show()

The daily changes in the number of deaths can be plotted as follows:

ax = dfDead_t[countryList[0]].iloc[nrObs:].diff().plot(marker = '*')
for x in countryList[1:]:
    dfDead_t[x].iloc[nrObs:].diff().plot(ax=ax, marker = '.')

ax.set_title('Corona Virus Deaths: Daily Increases')
# Customize the major grid
ax.grid(which='major', linestyle='-', linewidth='0.5', color='Black')
# Customize the minor grid
ax.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.legend()
plt.show()

12.6. Key Concepts and Summary

Note

Importing data from GitHub
Plotting scatterplots
Plotting time series data
Differencing time series data

12.7. Self-Check Questions

Todo

Generate graphs that track the number of coronavirus infections for all 50 US states.
Generate graphs that show the daily change in the number of infections for all 50 US states.