We use the Pandas package for data work and import it with the import statement. Pandas allows easy organization of data in the spirit of the DataFrame concept in R. You can think of a DataFrame as an array with labels, similar to an Excel spreadsheet.
A DataFrame can be created in the following ways:
From another DataFrame
From a numpy array
From lists
From another Pandas data structure called a Series, which is basically a one-dimensional DataFrame, i.e., a single column or row of a matrix
From a file, such as a CSV file
Later in this chapter we will be working with two data sets.
If you scrape information from a website or a pdf document, you often end up with your data stored across various lists. Sometimes these lists simply hold the column information of a specific variable; sometimes they are nested lists, where one inner list contains the information of all variables for one observation. Pandas is (usually) intelligent enough to handle them all in a very similar way. I next illustrate the two most common list structures in data work.
11.1.1 Every List Holds Information of a Separate Variable
Let us assume you observe information about 4 people. The unit of observation is therefore a person. For each person you know the name, age, and the city where they live. This information is stored in 3 separate lists.
There is an easy way to combine these lists into a DataFrame for easy statistical analysis. The process has two steps:
import pandas as pd

names = ['Katie', 'Nick', 'James', 'Eva']
ages = [32, 32, 36, 31]
locations = ['London', 'Baltimore', 'Atlanta', 'Madrid']

# Now zip the lists together
zipped = list(zip(names, ages, locations))

# Assign the zipped lists into a DataFrame
myData = pd.DataFrame(zipped, columns=['Name', 'Age', 'Location'])
print(myData)
Name Age Location
0 Katie 32 London
1 Nick 32 Baltimore
2 James 36 Atlanta
3 Eva 31 Madrid
# Define the vectors
names <- c('Katie', 'Nick', 'James', 'Eva')
ages <- c(32, 32, 36, 31)
locations <- c('London', 'Baltimore', 'Atlanta', 'Madrid')

# Create a data frame
myData <- data.frame(Name = names, Age = ages, Location = locations)

# Print the data frame
print(myData)
Name Age Location
1 Katie 32 London
2 Nick 32 Baltimore
3 James 36 Atlanta
4 Eva 31 Madrid
11.1.2 Every List Holds Information of one Unit of Observation
In this example, the information is stored per person. So each person is stored in a list that contains all 3 variables. These person lists are then stored in another list, so we are now dealing with a nested list.
The steps to move the information from the lists into a DataFrame are pretty much the same. Pandas is intelligent in that way and recognizes the structure of the data.
import pandas as pd

# This is the structure of your data
data = [['Katie', 32, 'London'],
        ['Nik', 32, 'Toronto'],
        ['James', 36, 'Atlanta'],
        ['Evan', 31, 'Madrid']]

# Assign the multi-dimensional (or nested) list to a DataFrame
myData = pd.DataFrame(data, columns=['Name', 'Age', 'Location'])
print(myData)
Name Age Location
0 Katie 32 London
1 Nik 32 Toronto
2 James 36 Atlanta
3 Evan 31 Madrid
# This is the structure of your data
data <- list(
  c('Katie', 32, 'London'),
  c('Nik', 32, 'Toronto'),
  c('James', 36, 'Atlanta'),
  c('Evan', 31, 'Madrid')
)

# Assign the nested list to a data frame
myData <- data.frame(
  Name = sapply(data, `[`, 1),
  Age = sapply(data, `[`, 2),
  Location = sapply(data, `[`, 3)
)

# Print the data frame
print(myData)
Name Age Location
1 Katie 32 London
2 Nik 32 Toronto
3 James 36 Atlanta
4 Evan 31 Madrid
11.2 Read Small Data Set from a Comma Separated (.csv) File
It is more common that your data is stored in a file, such as a text file, an Excel spreadsheet, or a file format of some other data management software such as Stata, R, SAS, or SPSS. The Pandas library provides routines (i.e., commands) that allow you to import data directly from the most common file formats.
In our first example, we will work with data stored as a comma separated file, Lecture_Data/Lecture_Data_Excel_a.csv. We first import this file using the pd.read_csv command from the Pandas library.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats as st
import math as m
import seaborn as sns
import time  # Imports system time module to time your script

plt.close('all')  # close all open figures
# Read in small data from .csv file

# Filepath
filepath = 'Lecture_Data/'

# In Windows you can also specify the absolute path to your data file
# filepath = 'C:/Dropbox/Towson/Teaching/3_ComputationalEconomics/Lectures/Lecture_Data/'

# ------------- Load data --------------------
df = pd.read_csv(filepath + 'Lecture_Data_Excel_a.csv', dtype={'Frequency': float})

# Let's have a look at it.
print(df)
Area Frequency Unnamed: 2 Unnamed: 3
0 Accounting 73.0 NaN NaN
1 Finance 52.0 NaN NaN
2 General management 36.0 NaN NaN
3 Marketing sales 64.0 NaN NaN
4 other 28.0 NaN NaN
Alternatively, we could have used the following command:
df = pd.read_table(filepath + 'Lecture_Data_Excel_a.csv', sep=',')

# Let's have a look at it.
print(df)
Area Frequency Unnamed: 2 Unnamed: 3
0 Accounting 73 NaN NaN
1 Finance 52 NaN NaN
2 General management 36 NaN NaN
3 Marketing sales 64 NaN NaN
4 other 28 NaN NaN
# Filepath
filepath <- 'Lecture_Data/'

# In Windows, you can also specify the absolute path to your data file
# filepath <- 'C:/Dropbox/Towson/Teaching/3_ComputationalEconomics/Lectures/Lecture_Data/'

# Load data
df <- read.csv(paste0(filepath, 'Lecture_Data_Excel_a.csv'))

# Let's have a look at it
print(df)
Area Frequency X X.1
1 Accounting 73 NA NA
2 Finance 52 NA NA
3 General management 36 NA NA
4 Marketing sales 64 NA NA
5 other 28 NA NA
The best way to import Excel files directly is probably to use the built-in importer in Pandas.
Important
In order for this to work you need to make sure that the openpyxl package is installed (the older xlrd package is only needed for legacy .xls files). Open a terminal window and type:

conda install openpyxl

Press yes at the prompt and it will install the package.
df = pd.read_excel(filepath + 'Lecture_Data_Excel_a.xlsx', sheet_name='Lecture_Data_Excel_a')

# Let's have a look at it.
print(df)
Area Frequency Unnamed: 2 Unnamed: 3
0 Accounting 73 NaN NaN
1 Finance 52 NaN NaN
2 General management 36 NaN NaN
3 Marketing sales 64 NaN NaN
4 other 28 NaN NaN
# Load the readxl package
library(readxl)

# Filepath
filepath <- 'Lecture_Data/'

# In Windows, you can also specify the absolute path to your data file
# filepath <- 'C:/Dropbox/Towson/Teaching/3_ComputationalEconomics/Lectures/Lecture_Data/'

# Read entire Excel spreadsheet
df <- read_excel(paste0(filepath, 'Lecture_Data_Excel_a.xlsx'),
                 sheet = 'Lecture_Data_Excel_a', col_names = TRUE)

# Let's have a look at it
print(df)
# A tibble: 5 × 4
Area Frequency `Unnamed: 2` `Unnamed: 3`
<chr> <dbl> <lgl> <lgl>
1 Accounting 73 NA NA
2 Finance 52 NA NA
3 General management 36 NA NA
4 Marketing sales 64 NA NA
5 other 28 NA NA
Note
Note how I also specified the Excel sheet_name. This is convenient in case you have more complex Excel spreadsheets with multiple tabs in them.
The DataFrame has an index for each row and a column header. We can query a DataFrame as follows:
print('Shape', df.shape)
print('-------------------------')
print('Number of rows', len(df))
print('-------------------------')
print('Column headers', df.columns)
print('-------------------------')
print('Data types', df.dtypes)
print('-------------------------')
print('Index', df.index)
print('-------------------------')
Shape (5, 4)
-------------------------
Number of rows 5
-------------------------
Column headers Index(['Area', 'Frequency', 'Unnamed: 2', 'Unnamed: 3'], dtype='object')
-------------------------
Data types Area object
Frequency int64
Unnamed: 2 float64
Unnamed: 3 float64
dtype: object
-------------------------
Index RangeIndex(start=0, stop=5, step=1)
-------------------------
# Shape - number of rows and columns
cat("Shape: ", nrow(df), " rows, ", ncol(df), " columns\n")

# Number of rows
cat("Number of rows: ", nrow(df), "\n")

# Column headers
cat("Column headers: ", colnames(df), "\n")

# Data types
cat("Data types: ", sapply(df, class), "\n")

# Index
cat("Index: ", 1:nrow(df), "\n")
Shape: 5 rows, 4 columns
Number of rows: 5
Column headers: Area Frequency Unnamed: 2 Unnamed: 3
Data types: character numeric logical logical
Index: 1 2 3 4 5
Drop columns 3 and 4 and let's have a look at the data again.
# axis = 1 means that you are dropping a column
df.drop('Unnamed: 2', axis=1, inplace=True)
df.drop('Unnamed: 3', axis=1, inplace=True)

# Let's have a look at it
print(df)
Area Frequency
0 Accounting 73
1 Finance 52
2 General management 36
3 Marketing sales 64
4 other 28
# Load the dplyr library if you haven't already
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# Remove columns by name
df <- df %>% select(!c('Unnamed: 2', 'Unnamed: 3'))

# Print the modified data frame
print(df)
# A tibble: 5 × 2
Area Frequency
<chr> <dbl>
1 Accounting 73
2 Finance 52
3 General management 36
4 Marketing sales 64
5 other 28
Warning
Sometimes column headers have white spaces in them. It is probably a good idea to remove all white spaces from header names. You can use the str.replace() method on the column index to accomplish this:
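The code for this step did not survive in this copy of the notes; a self-contained sketch of the idea (the column names are made up for illustration) could look like:

```python
import pandas as pd

# Toy DataFrame whose headers contain white spaces
df = pd.DataFrame({'First Name': ['Katie', 'Nick'], 'Age ': [32, 32]})

# Replace one or more whitespace characters in every header with nothing
df.columns = df.columns.str.replace(r'\s+', '', regex=True)

print(df.columns.tolist())
```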
The \s+ is a regular expression, which is basically a pattern matching language. The \s stands for a whitespace character and the + means look for one or more spaces and replace them with '', well, nothing, i.e., a blank string.
We can make new columns using information from the existing columns.
Note
Making new columns in the DataFrame is very similar to making new variables in Stata using the generate or gen command.
More specifically, we next generate a new variable called relFrequency that contains the relative frequencies.
# Make new column with relative frequency
df['relFrequency'] = df['Frequency']/df['Frequency'].sum()

# Let's have a look at it
print(df)
Area Frequency relFrequency
0 Accounting 73 0.288538
1 Finance 52 0.205534
2 General management 36 0.142292
3 Marketing sales 64 0.252964
4 other 28 0.110672
# Make a new column with relative frequency
df$relFrequency <- df$Frequency/sum(df$Frequency)

# Print the data frame
print(df)
# A tibble: 5 × 3
Area Frequency relFrequency
<chr> <dbl> <dbl>
1 Accounting 73 0.289
2 Finance 52 0.206
3 General management 36 0.142
4 Marketing sales 64 0.253
5 other 28 0.111
We can also just grab the relative frequencies out of the DataFrame and store them in a familiar numpy array. This is now a vector. You can only do this if the column contains nothing but numbers.
df['random'] = df['Frequency']/np.sqrt(df['Frequency'].sum())

# Let's have a look at it.
print(df)
Area Frequency relFrequency random
0 Accounting 73 0.288538 4.589471
1 Finance 52 0.205534 3.269212
2 General management 36 0.142292 2.263301
3 Marketing sales 64 0.252964 4.023646
4 other 28 0.110672 1.760345
# Add a new column 'random'
df$random <- df$Frequency/sqrt(sum(df$Frequency))

# Print the modified data frame
print(df)

# A tibble: 5 × 4
  Area               Frequency relFrequency random
  <chr>                  <dbl>        <dbl>  <dbl>
1 Accounting                73        0.289   4.59
2 Finance                   52        0.206   3.27
3 General management        36        0.142   2.26
4 Marketing sales           64        0.253   4.02
5 other                     28        0.111   1.76
11.3 Some Pandas Tricks
11.3.1 Renaming Variables
For renaming column names we use a dictionary.
Note
Remember, dictionaries are similar to lists, but the indexing of the elements is not done with 0, 1, 2, etc.; instead we use user-defined keys, so that a dictionary consists of key-value pairs such as:
{'key1': value1, 'key2': value2, ... }
It works as follows. The dictionary that you hand in defines the old column name as the key and the new column name as the associated value.
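The Python cell itself is missing in this copy; consistent with the output below and the R version, it was presumably a rename() call along these lines (toy data stands in for the full DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'Area': ['Accounting', 'Finance'], 'Frequency': [73, 52]})

# Old names are the keys, new names are the values
df = df.rename(columns={'Area': 'area', 'Frequency': 'absFrequency'})

print(df.head(3))
```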
area absFrequency relFrequency random
0 Accounting 73 0.288538 4.589471
1 Finance 52 0.205534 3.269212
2 General management 36 0.142292 2.263301
# Rename columns 'Area' to 'area' and 'Frequency' to 'absFrequency'
colnames(df)[colnames(df) == "Area"] <- "area"
colnames(df)[colnames(df) == "Frequency"] <- "absFrequency"

# Display the first 3 rows of the modified data frame
head(df, 3)

# A tibble: 3 × 4
  area               absFrequency relFrequency random
  <chr>                     <dbl>        <dbl>  <dbl>
1 Accounting                   73        0.289   4.59
2 Finance                      52        0.206   3.27
3 General management           36        0.142   2.26
The head(x) method allows us to print x rows from the top of the DataFrame, whereas the tail(x) method does the same from the bottom of the DataFrame.
11.3.2 Dropping and Adding Columns
We next add a new column at a specific position in the DataFrame. Here it is going to be the third column and, since Python indexing starts at zero, this is indicated as loc=2.
# Drop a column first so everything fits
df = df.drop('random', axis=1)

# Create a new column with some missing values
df['team'] = pd.Series([2, 4, np.nan, 6, 10], index=df.index)

# or insert new column at specific location
df.insert(loc=2, column='position', value=[np.nan, 1, np.nan, 1, 3])
print(df)
area absFrequency position relFrequency team
0 Accounting 73 NaN 0.288538 2.0
1 Finance 52 1.0 0.205534 4.0
2 General management 36 NaN 0.142292 NaN
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
R indexing starts at one, so the new third column goes right after df[,2]:
# Drop the 'random' column
df$random <- NULL

# Create a new 'team' column with missing values
df$team <- c(2, 4, NA, 6, 10)

# Insert a new 'position' column at the third position with specific values
df <- data.frame(df[, 1:2], position = c(NA, 1, NA, 1, 3), df[, 3:ncol(df)])

# Print the updated data frame
print(df)
area absFrequency position relFrequency team
1 Accounting 73 NA 0.2885375 2
2 Finance 52 1 0.2055336 4
3 General management 36 NA 0.1422925 NA
4 Marketing sales 64 1 0.2529644 6
5 other 28 3 0.1106719 10
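The Python counterpart for removing rows with missing values is dropna(), whose output is shown next. The original cell is missing here; a self-contained sketch (with a cut-down DataFrame) of the call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'area': ['Accounting', 'Finance'],
                   'position': [np.nan, 1.0],
                   'team': [2.0, 4.0]})

# Keep only rows without any missing values
df2 = df.dropna()

print(df2)
```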
area absFrequency position relFrequency team
1 Finance 52 1.0 0.205534 4.0
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
# Create a new DataFrame without missing values
df2 <- df[complete.cases(df), ]

# Print the resulting DataFrame
print(df2)
area absFrequency position relFrequency team
2 Finance 52 1 0.2055336 4
4 Marketing sales 64 1 0.2529644 6
5 other 28 3 0.1106719 10
Overwriting NaN with specific values

If we want to overwrite the NaN entries with values, we can use the fillna() method. In this example we replace all missing values with zeros.
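The corresponding Python cells are missing in this copy. A sketch of both steps shown in the outputs that follow, appending a row of NaNs with pd.concat() and then overwriting missing values with fillna() (toy data for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'area': ['Finance', 'other'], 'team': [4.0, 10.0]})

# Append a row of missing values
new_row = pd.DataFrame([{'area': np.nan, 'team': np.nan}])
df = pd.concat([df, new_row], ignore_index=True)
print(df.tail(3))

# Overwrite all NaN entries with zeros
df = df.fillna(0)
print(df)
```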
area absFrequency position relFrequency team
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
5 NaN NaN NaN NaN NaN
# Create a row with missing values
new_row <- data.frame(area = NA, absFrequency = NA, position = NA,
                      relFrequency = NA, team = NA)

# Add the new row to the DataFrame
df <- rbind(df, new_row)

# Print the last 3 rows
print(tail(df, 3))
area absFrequency position relFrequency team
4 Marketing sales 64 1 0.2529644 6
5 other 28 3 0.1106719 10
6 <NA> NA NA NA NA
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
# Assuming you want to drop the 6th row
df_temp <- df[-6, ]

# Print the updated DataFrame
print(df_temp)
area absFrequency position relFrequency team
1 Accounting 73 0 0.2885375 2
2 Finance 52 1 0.2055336 4
3 General management 36 0 0.1422925 0
4 Marketing sales 64 1 0.2529644 6
5 other 28 3 0.1106719 10
If you want the change to be permanent you need to either use the df.drop(5, inplace=True) option or you reassign the DataFrame to itself df = df.drop(5). The default dimension for dropping data is the row in the DataFrame, so you do not need to specify the , 0 part in the drop command.
However, if you drop a column, then you need the dimension specifier , 1.
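To make these defaults concrete, a small sketch with a toy DataFrame (not the chapter's data):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# Rows are the default dimension: drop the row with index 2
df = df.drop(2)

# Columns need the axis specifier
df = df.drop('y', axis=1)

print(df)
```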
Note
You can use the pd.concat() function to append one or more Pandas DataFrames together. There is also a DataFrame.append() method, but it is deprecated and has been removed in recent releases, so do not use it.
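A sketch of pd.concat() appending three small DataFrames, mirroring the R example below:

```python
import pandas as pd

# Create three data frames
df1 = pd.DataFrame({'x': [10, 11, 12], 'y': [20, 21, 22]})
df2 = pd.DataFrame({'x': [30, 31, 32], 'y': [40, 41, 42]})
df3 = pd.DataFrame({'x': [50, 51, 52], 'y': [60, 61, 62]})

# Stack the DataFrames on top of each other and renumber the index
combined = pd.concat([df1, df2, df3], ignore_index=True)

print(combined)
```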
# Create three data frames
df1 <- data.frame(x = c(10, 11, 12), y = c(20, 21, 22))
df2 <- data.frame(x = c(30, 31, 32), y = c(40, 41, 42))
df3 <- data.frame(x = c(50, 51, 52), y = c(60, 61, 62))

# Combine all three data frames
combined <- rbind(df1, df2, df3)

# View the final data frame
print(combined)
Before running a regression, a researcher might like to drop certain observations that are clearly not correct such as negative age observations for instance.
Let us assume that in our current DataFrame we have information that tells us that a position variable must be strictly positive and cannot be zero. This would mean that row 0 and row 2 would need to be dropped from our DataFrame.
Before we mess up our DataFrame, let's make a new one and then drop row 0 and 2 based on the value condition of variable position.
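The Python cell is missing here; consistent with the output below, it was presumably a boolean-indexing step such as (cut-down data for illustration):

```python
import pandas as pd

df = pd.DataFrame({'area': ['Accounting', 'Finance', 'General management'],
                   'position': [0.0, 1.0, 0.0]})

# Keep only observations with strictly positive position
df_temp = df[df['position'] > 0]

print(df_temp)
```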
area absFrequency position relFrequency team
1 Finance 52.0 1.0 0.205534 4.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
df_temp <- subset(df_temp, position > 0)

# View the filtered data frame
print(df_temp)
area absFrequency position relFrequency team
2 Finance 52 1 0.2055336 4
4 Marketing sales 64 1 0.2529644 6
5 other 28 3 0.1106719 10
As you can see, we accomplished this by "keeping" the observations for which the position value is strictly positive, as opposed to "dropping" the zero-valued ones.
And let's also have a look at the original DataFrame again. In the meantime a new observation (Economics) was added, and the DataFrame was sorted by absFrequency in descending order:
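The sorting cell itself did not survive in this copy; judging from the sorted output shown next and the inplace discussion that follows, it was presumably something like (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({'area': ['Accounting', 'Finance', 'Marketing sales'],
                   'absFrequency': [73.0, 52.0, 64.0]})

# Sort by 'absFrequency' in descending order, overwriting df in place
df.sort_values(by='absFrequency', ascending=False, inplace=True)

print(df)
```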
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
3 Marketing sales 64.0 1.0 0.252964 6.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
# Sort the data frame by 'absFrequency' in descending order
df <- df[order(-df$absFrequency), ]

# View the sorted data frame
print(df)
area absFrequency position relFrequency team
1 Accounting 73 0 0.2885375 2
4 Marketing sales 64 1 0.2529644 6
2 Finance 52 1 0.2055336 4
3 General management 36 0 0.1422925 0
5 other 28 3 0.1106719 10
6 Economics 25 NA NA NA
The inplace=True option will immediately overwrite the DataFrame df with the new, sorted, one. If you set inplace=False you would have to assign a new DataFrame in order to see the changes:
# Sort the data frame by 'absFrequency' in descending order
df1 <- df[order(-df$absFrequency), ]
Where df1 would now be the sorted DataFrame and the original df DataFrame would still be unsorted.
Note
However, we now have two objects with identical data in it. This is often redundant and can eventually become a memory problem if the DataFrames are very large. Setting inplace=True is probably a good default setting.
If we want we could then reindex the DataFrame according to the new sort order.
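The Python cell is missing here; reindexing after a sort is typically done with reset_index(), e.g. (toy data standing in for the sorted DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'absFrequency': [73.0, 64.0, 52.0]}, index=[0, 3, 1])

# Replace the old, shuffled index with a fresh 0, 1, 2, ... range
df.reset_index(drop=True, inplace=True)

print(df)
```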
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Marketing sales 64.0 1.0 0.252964 6.0
2 Finance 52.0 1.0 0.205534 4.0
3 General management 36.0 0.0 0.142292 0.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
# Reset the row names or row indices
rownames(df) <- NULL
print(df)
area absFrequency position relFrequency team
1 Accounting 73 0 0.2885375 2
2 Marketing sales 64 1 0.2529644 6
3 Finance 52 1 0.2055336 4
4 General management 36 0 0.1422925 0
5 other 28 3 0.1106719 10
6 Economics 25 NA NA NA
11.4 Merging DataFrames or Adding Columns
Let us create a new DataFrame that has the variable team in common with the previous DataFrame. This new DataFrame contains additional team information that we would like to merge into the original DataFrame.
# Create the second DataFrame with team information
df_team = pd.DataFrame({'team': [3, 4, 5, 6, 7, 8, 9, 10],
                        'sales': [500, 300, 250, 450, 345, 123, 432, 890]})

# Drop the 'area' column so everything fits
df = df.drop('area', axis=1)

print('-----------------------------')
print('First, main, DataFrame: LEFT')
print('-----------------------------')
print(df)
print('')
print('-----------------------------')
print('Second new DataFrame: RIGHT')
print('-----------------------------')
print(df_team)
-----------------------------
First, main, DataFrame: LEFT
-----------------------------
absFrequency position relFrequency team
0 73 0 0.28 2
1 64 1 0.25 6
2 52 1 0.20 4
3 36 0 0.14 0
4 28 3 0.11 10
-----------------------------
Second new DataFrame: RIGHT
-----------------------------
team sales
0 3 500
1 4 300
2 5 250
3 6 450
4 7 345
5 8 123
6 9 432
7 10 890
# Create the second data frame with team information
df_team <- data.frame(team = 3:10,
                      sales = c(500, 300, 250, 450, 345, 123, 432, 890))

# Remove the 'area' column from the df data frame
df <- df[, -which(names(df) == 'area')]

# Display the 'df' data frame
cat('-----------------------------\n')
cat('First, main, DataFrame: LEFT\n')
cat('-----------------------------\n')
print(df)
cat('\n')

# Display the 'df_team' data frame
cat('-----------------------------\n')
cat('Second new DataFrame: RIGHT\n')
cat('-----------------------------\n')
print(df_team)
-----------------------------
First, main, DataFrame: LEFT
-----------------------------
absFrequency position relFrequency team
1 73 0 0.28 2
2 64 1 0.25 6
3 52 1 0.20 4
4 36 0 0.14 0
5 28 3 0.11 10
-----------------------------
Second new DataFrame: RIGHT
-----------------------------
team sales
1 3 500
2 4 300
3 5 250
4 6 450
5 7 345
6 8 123
7 9 432
8 10 890
When merging two DataFrames we need to find a common variable that appears in both DataFrames along which we can connect them. In our example we use the team column as the merge key (or key variable) because it appears in both DataFrames. We now have four different options for merging the two DataFrames.
Note
Merge Summary
The inner merge only keeps observations that are present in both DataFrames.
The left merge keeps all of the observations of the original (left) DataFrame and adds the new information from the right DataFrame where available.
The right merge keeps all of the observations of the new (right) DataFrame and adds the information from the original main DataFrame where available.
The outer merge merges by team whenever possible but keeps the information from both DataFrames, even for observations where no merge/overlap occurred.
Now we do it step by step, inner merge first. As you will see, the inner merge only keeps observations that are present in both DataFrames. This is often too restrictive, as many observations will be dropped.
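The inner-merge cell is missing in this copy; a sketch with cut-down stand-ins for the two DataFrames:

```python
import pandas as pd

df = pd.DataFrame({'absFrequency': [73, 52, 28], 'team': [2, 4, 10]})
df_team = pd.DataFrame({'team': [3, 4, 10], 'sales': [500, 300, 890]})

# Inner merge: keep only teams present in BOTH DataFrames
inner_merged = pd.merge(df, df_team, on='team', how='inner')

print(inner_merged)
```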
The left merge keeps the original (or main) DataFrame and adds the new information from the second DataFrame. When a team number from the new DataFrame cannot be found in the main DataFrame, its info is simply dropped.
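The left-merge cell is likewise missing; a sketch with the same cut-down stand-ins:

```python
import pandas as pd

df = pd.DataFrame({'absFrequency': [73, 52, 28], 'team': [2, 4, 10]})
df_team = pd.DataFrame({'team': [3, 4, 10], 'sales': [500, 300, 890]})

# Left merge: keep every row of df, add sales where the team matches
left_merged = pd.merge(df, df_team, on='team', how='left')

print(left_merged)
```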
Right Merge
-----------
absFrequency position relFrequency team sales
0 NaN NaN NaN 3 500
1 52.0 1.0 0.20 4 300
2 NaN NaN NaN 5 250
3 64.0 1.0 0.25 6 450
4 NaN NaN NaN 7 345
5 NaN NaN NaN 8 123
6 NaN NaN NaN 9 432
7 28.0 3.0 0.11 10 890
# Right Merge
cat('Right Merge\n')
cat('-----------\n')
right_merged <- merge(df, df_team, by = 'team', all.y = TRUE)
print(right_merged)
Right Merge
-----------
team absFrequency position relFrequency sales
1 3 NA NA NA 500
2 4 52 1 0.20 300
3 5 NA NA NA 250
4 6 64 1 0.25 450
5 7 NA NA NA 345
6 8 NA NA NA 123
7 9 NA NA NA 432
8 10 28 3 0.11 890
The outer merge method merges by team whenever possible but keeps the info from both DataFrames, even for observations where no merge/overlap occurred.
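The outer-merge cell is missing as well; a sketch with the same cut-down stand-ins:

```python
import pandas as pd

df = pd.DataFrame({'absFrequency': [73, 52, 28], 'team': [2, 4, 10]})
df_team = pd.DataFrame({'team': [3, 4, 10], 'sales': [500, 300, 890]})

# Outer merge: keep all teams from both DataFrames
outer_merged = pd.merge(df, df_team, on='team', how='outer')

print(outer_merged)
```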
Outer Merge
-----------
absFrequency position relFrequency team sales
0 73.0 0.0 0.28 2 NaN
1 64.0 1.0 0.25 6 450.0
2 52.0 1.0 0.20 4 300.0
3 36.0 0.0 0.14 0 NaN
4 28.0 3.0 0.11 10 890.0
5 NaN NaN NaN 3 500.0
6 NaN NaN NaN 5 250.0
7 NaN NaN NaN 7 345.0
8 NaN NaN NaN 8 123.0
9 NaN NaN NaN 9 432.0
Outer Merge
-----------
team absFrequency position relFrequency sales
1 0 36 0 0.14 NA
2 2 73 0 0.28 NA
3 3 NA NA NA 500
4 4 52 1 0.20 300
5 5 NA NA NA 250
6 6 64 1 0.25 450
7 7 NA NA NA 345
8 8 NA NA NA 123
9 9 NA NA NA 432
10 10 28 3 0.11 890
11.5 Converting Types and Replacing Values in DataFrames
11.5.1 Converting Column Types
When converting text into a DataFrame to conduct statistical analysis, it is sometimes necessary to convert a column of numbers that is still stored as strings into floats so that we can do math. Let's drop some more columns first so the DataFrame fits nicely in the output window.
We now generate a column of "words" or "strings" that happen to be numbers. Since they are stored as words, we cannot run statistical analysis on these numbers yet.
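The Python cells for these steps are missing in this copy; a sketch consistent with the outputs below (the numeric values are copied from them):

```python
import pandas as pd

df = pd.DataFrame({'position': [0, 1, 1, 0, 3],
                   'relFrequency': [0.28, 0.25, 0.20, 0.14, 0.11],
                   'team': [2, 6, 4, 0, 10]})

# Generate a column of numbers stored as strings
df['string'] = ['2', '40', '34', '6', '10']
print(df)

# describe() only summarizes the numeric columns, so 'string' is left out
print(df.describe())
```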
position relFrequency team string
0 0 0.28 2 2
1 1 0.25 6 40
2 1 0.20 4 34
3 0 0.14 0 6
4 3 0.11 10 10
position relFrequency team
count 5.000000 5.000000 5.000000
mean 1.000000 0.196000 4.400000
std 1.224745 0.071624 3.847077
min 0.000000 0.110000 0.000000
25% 0.000000 0.140000 2.000000
50% 1.000000 0.200000 4.000000
75% 1.000000 0.250000 6.000000
max 3.000000 0.280000 10.000000
# Add a new column named 'string' to your data frame
df$string <- c('2', '40', '34', '6', '10')

# Print the updated data frame
print(df)

# Print summary statistics for the numeric columns
summary(df)
position relFrequency team string
1 0 0.28 2 2
2 1 0.25 6 40
3 1 0.20 4 34
4 0 0.14 0 6
5 3 0.11 10 10
position relFrequency team string
Min. :0 Min. :0.110 Min. : 0.0 Length:5
1st Qu.:0 1st Qu.:0.140 1st Qu.: 2.0 Class :character
Median :1 Median :0.200 Median : 4.0 Mode :character
Mean :1 Mean :0.196 Mean : 4.4
3rd Qu.:1 3rd Qu.:0.250 3rd Qu.: 6.0
Max. :3 Max. :0.280 Max. :10.0
We first need to reassign the data type of the string column. We then rename the column and print summary statistics.
# Transform strings into floats, i.e., words into numbers
df['string'] = df['string'].astype(float)

# Rename the column
df = df.rename(columns={'string': 'salary'})

# Print summary statistics
print(df.describe())
position relFrequency team salary
count 5.000000 5.000000 5.000000 5.000000
mean 1.000000 0.196000 4.400000 18.400000
std 1.224745 0.071624 3.847077 17.343587
min 0.000000 0.110000 0.000000 2.000000
25% 0.000000 0.140000 2.000000 6.000000
50% 1.000000 0.200000 4.000000 10.000000
75% 1.000000 0.250000 6.000000 34.000000
max 3.000000 0.280000 10.000000 40.000000
# Transform strings into numbers
df$string <- as.numeric(df$string)

# Rename the column
colnames(df)[colnames(df) == 'string'] <- 'salary'

# Print summary statistics for the numeric columns
summary(df)

    position relFrequency      team          salary    
 Min.   :0   Min.   :0.110   Min.   : 0.0   Min.   : 2.0  
 1st Qu.:0   1st Qu.:0.140   1st Qu.: 2.0   1st Qu.: 6.0  
 Median :1   Median :0.200   Median : 4.0   Median :10.0  
 Mean   :1   Mean   :0.196   Mean   : 4.4   Mean   :18.4  
 3rd Qu.:1   3rd Qu.:0.250   3rd Qu.: 6.0   3rd Qu.:34.0  
 Max.   :3   Max.   :0.280   Max.   :10.0   Max.   :40.0  
11.5.2 Replacing Values in a DataFrame Conditional on Criteria
If we need to replace values in a DataFrame based on certain criteria it is often more efficient to use indexing as opposed to loops. Let us first create a DataFrame with some random values:
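The Python cell creating the random DataFrame is missing in this copy; a sketch using numpy's random generator (the seed value is arbitrary):

```python
import numpy as np
import pandas as pd

# Seed for reproducibility
rng = np.random.default_rng(123)

# 10 rows, 5 columns of uniform random numbers between 0 and 100
df1 = pd.DataFrame(rng.uniform(0, 100, size=(10, 5)))

print(df1)
```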
# Set a seed for reproducibility
set.seed(123)

# Create a data frame with random numbers
df1 <- as.data.frame(matrix(runif(50, min = 0, max = 100), nrow = 10))

# Print the data frame
print(df1)
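The Python version of the conditional replacement is also missing; a sketch using boolean indexing with .loc, mirroring the R cell that follows (column numbers are positional labels here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
df1 = pd.DataFrame(rng.uniform(0, 100, size=(10, 5)))

# Build the condition first, then allow mixed types in the column
mask = df1[1] < 30
df1[1] = df1[1].astype(object)

# Replace all values below 30 in the second column with 'Low'
df1.loc[mask, 1] = 'Low'

print(df1)
```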
# Create a data frame with random numbers
df1 <- as.data.frame(matrix(runif(50, min = 0, max = 100), nrow = 10))

# Set values less than 30 in the second column to 'Low'
df1[df1[, 2] < 30, 2] <- 'Low'

# Print the data frame
print(df1)
# Create a data frame with random numbers
df1 <- as.data.frame(matrix(runif(50, min = 0, max = 100), nrow = 10))

# Set values greater than 70 in the fourth column to 1000
df1[df1[, 4] > 70, 4] <- 1000

# Print the data frame
print(df1)
And finally we can combine logical statements and replace values on a combination of conditions. So let's replace all the values in column 5 that are between 20 and 80 with the string middle.
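In Python this combined condition would use & with parentheses around each comparison; a sketch (the missing original cell presumably looked similar):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
df1 = pd.DataFrame(rng.uniform(0, 100, size=(10, 5)))

# Combine two conditions with & (note the parentheses)
mask = (df1[4] > 20) & (df1[4] < 80)
df1[4] = df1[4].astype(object)

# Replace values in the fifth column between 20 and 80 with 'Middle'
df1.loc[mask, 4] = 'Middle'

print(df1)
```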
# Create a data frame with random numbers
df1 <- as.data.frame(matrix(runif(50, min = 0, max = 100), nrow = 10))

# Set values in the fifth column between 20 and 80 to 'Middle'
df1[df1[, 5] > 20 & df1[, 5] < 80, 5] <- 'Middle'

# Print the data frame
print(df1)