11  Working with Data I: Data Cleaning

Chapter Learning Objectives
  • Read Data
  • Pandas Library and DataFrames
  • Manipulating Data in DataFrames

We use the Pandas package for data work. We can import the Pandas library using the import statement. Pandas allows easy organization of data in the spirit of the DataFrame concept in R. You can think of a DataFrame as an array with row and column labels, similar to an Excel spreadsheet.

A DataFrame can be created in the following ways:

  • From another DataFrame
  • From a numpy array
  • From lists
  • From another pandas data structure called a Series, which is basically a one-dimensional DataFrame, i.e., a single column or row of a matrix
  • From a file, such as a CSV file
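
To make these options concrete, here is a minimal sketch (with made-up values) of three of these construction routes:

import pandas as pd
import numpy as np

# From a dictionary of lists (one list per column)
df_dict = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

# From a numpy array, supplying column labels
df_array = pd.DataFrame(np.zeros((3, 2)), columns=['x', 'y'])

# From a pandas Series (a single labeled column)
df_series = pd.DataFrame(pd.Series([1, 2, 3], name='x'))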

Later in this chapter we will be working with two data sets.

11.1 Create DataFrame from Multiple Lists

If you scrape information from a website or a PDF document, you often end up with your data stored across various lists. Sometimes these lists simply hold the column information of a specific variable; sometimes the lists are nested lists, where one inner list contains the information of all variables for one observation. Pandas is (usually) intelligent enough to handle them all in a very similar way. I next illustrate the two list structures most often found in data work.

11.1.1 Every List Holds Information of a Separate Variable

Let us assume you observe information about 4 people. The unit of observation is therefore a person. For each person you know the name, age, and the city where they live. This information is stored in 3 separate lists.

There is an easy way to combine these lists into a DataFrame for easy statistical analysis. The process has two steps:

  1. Use the zip() command to combine the lists
  2. Assign the "zipped" lists to a Pandas DataFrame
import pandas as pd

names = ['Katie', 'Nick', 'James', 'Eva']
ages = [32, 32, 36, 31]
locations = ['London', 'Baltimore', 'Atlanta', 'Madrid']

# Now zip the lists together
zipped = list(zip(names, ages, locations))

# Assign the zipped lists into a DataFrame
myData = pd.DataFrame(zipped, columns=['Name', 'Age', 'Location'])
print(myData)
    Name  Age   Location
0  Katie   32     London
1   Nick   32  Baltimore
2  James   36    Atlanta
3    Eva   31     Madrid
# Define the vectors
names <- c('Katie', 'Nick', 'James', 'Eva')
ages <- c(32, 32, 36, 31)
locations <- c('London', 'Baltimore', 'Atlanta', 'Madrid')

# Create a data frame
myData <- data.frame(Name = names, Age = ages, Location = locations)

# Print the data frame
print(myData)
   Name Age  Location
1 Katie  32    London
2  Nick  32 Baltimore
3 James  36   Atlanta
4   Eva  31    Madrid

11.1.2 Every List Holds Information of One Unit of Observation

In this example, the information is stored per person. So each person is stored in a list that contains all 3 variables. Then these person lists are stored in another list. We are now dealing with a nested list.

The steps to move the information from the lists into a DataFrame are pretty much the same. Pandas is intelligent in that way and recognizes the structure of the data.

Again the process has two steps:

  1. Use the zip() command to combine the lists
  2. Assign the "zipped" lists to a Pandas DataFrame
import pandas as pd

# This is the structure of your data
data = [['Katie', 32, 'London'], ['Nik', 32, 'Toronto'], ['James', 36, 'Atlanta'], ['Evan', 31, 'Madrid']]

# Assign the multi-dimensional (or nested) list to DataFrame
myData = pd.DataFrame(data, columns=['Name', 'Age', 'Location'])
print(myData)
    Name  Age Location
0  Katie   32   London
1    Nik   32  Toronto
2  James   36  Atlanta
3   Evan   31   Madrid
# This is the structure of your data
data <- list(
  c('Katie', 32, 'London'),
  c('Nik', 32, 'Toronto'),
  c('James', 36, 'Atlanta'),
  c('Evan', 31, 'Madrid')
)

# Assign the nested list to a DataFrame
myData <- data.frame(
  Name = sapply(data, `[`, 1),
  Age = sapply(data, `[`, 2),
  Location = sapply(data, `[`, 3)
)

# Print the data frame
print(myData)
   Name Age Location
1 Katie  32   London
2   Nik  32  Toronto
3 James  36  Atlanta
4  Evan  31   Madrid

11.2 Read Small Data Set from a Comma Separated (.csv) File

It is more common that your data is stored in a file such as a text file, Excel spreadsheet, or a file format of some other data management software such as Stata, R, SAS, or SPSS. The Pandas library does provide you with routines (i.e., commands) that will allow you to import data directly from the most common file formats.
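
All of these import routines follow the same pd.read_*() naming pattern. As a quick sketch (the file names here are just placeholders):

import pandas as pd

# df = pd.read_csv('mydata.csv')      # comma separated text files
# df = pd.read_excel('mydata.xlsx')   # Excel spreadsheets
# df = pd.read_stata('mydata.dta')    # Stata data files
# df = pd.read_spss('mydata.sav')     # SPSS data files (requires pyreadstat)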

In our first example, we will work with data stored as a comma separated file called Lecture_Data_Excel_a.csv, located in the Lecture_Data folder. We first import this file using the pd.read_csv command from the Pandas library.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats as st
import math as m
import seaborn as sns
import time  # Imports system time module to time your script

plt.close('all')  # close all open figures
# Read in small data from .csv file
# Filepath
filepath = 'Lecture_Data/'
# In windows you can also specify the absolute path to your data file
# filepath = 'C:/Dropbox/Towson/Teaching/3_ComputationalEconomics/Lectures/Lecture_Data/'

# ------------- Load data --------------------
df = pd.read_csv(filepath + 'Lecture_Data_Excel_a.csv', dtype={'Frequency': float})

# Let's have a look at it.
print(df)
                 Area  Frequency  Unnamed: 2  Unnamed: 3
0          Accounting       73.0         NaN         NaN
1             Finance       52.0         NaN         NaN
2  General management       36.0         NaN         NaN
3     Marketing sales       64.0         NaN         NaN
4               other       28.0         NaN         NaN

Alternatively, we could have used the following command:

df = pd.read_table(filepath + 'Lecture_Data_Excel_a.csv', sep=',')

# Let's have a look at it.
print(df)
                 Area  Frequency  Unnamed: 2  Unnamed: 3
0          Accounting         73         NaN         NaN
1             Finance         52         NaN         NaN
2  General management         36         NaN         NaN
3     Marketing sales         64         NaN         NaN
4               other         28         NaN         NaN
# Filepath
filepath <- 'Lecture_Data/'
# In Windows, you can also specify the absolute path to your data file
# filepath <- 'C:/Dropbox/Towson/Teaching/3_ComputationalEconomics/Lectures/Lecture_Data/'

# Load data
df <- read.csv(paste0(filepath, 'Lecture_Data_Excel_a.csv'))

# Let's have a look at it
print(df)
                Area Frequency  X X.1
1         Accounting        73 NA  NA
2            Finance        52 NA  NA
3 General management        36 NA  NA
4    Marketing sales        64 NA  NA
5              other        28 NA  NA

The best way to import Excel files directly is probably to use the built-in importer in Pandas.

Important

In order for this to work you need to make sure that the openpyxl package is installed (modern versions of Pandas use it to read .xlsx files; the older xlrd package is only needed for legacy .xls files). Open a terminal window and type:

conda install openpyxl

Press yes at the prompt and it will install the package.

You can use it as follows:

# Read entire excel spreadsheet
df = pd.read_excel(filepath + 'Lecture_Data_Excel_a.xlsx', \
     sheet_name = 'Lecture_Data_Excel_a', header = 0, usecols = 'A:D')
# Let's have a look at it.
print(df)
                 Area  Frequency  Unnamed: 2  Unnamed: 3
0          Accounting         73         NaN         NaN
1             Finance         52         NaN         NaN
2  General management         36         NaN         NaN
3     Marketing sales         64         NaN         NaN
4               other         28         NaN         NaN
# Load the readxl package
library(readxl)

# Filepath
filepath <- 'Lecture_Data/'
# In Windows, you can also specify the absolute path to your data file
# filepath <- 'C:/Dropbox/Towson/Teaching/3_ComputationalEconomics/Lectures/Lecture_Data/'

# Read entire Excel spreadsheet
df <- read_excel(paste0(filepath, 'Lecture_Data_Excel_a.xlsx'), sheet = 'Lecture_Data_Excel_a', col_names = TRUE)

# Let's have a look at it
print(df)
# A tibble: 5 × 4
  Area               Frequency `Unnamed: 2` `Unnamed: 3`
  <chr>                  <dbl> <lgl>        <lgl>       
1 Accounting                73 NA           NA          
2 Finance                   52 NA           NA          
3 General management        36 NA           NA          
4 Marketing sales           64 NA           NA          
5 other                     28 NA           NA          
Note

Note how I also specified the Excel sheet_name. This is convenient in case you have more complex Excel spreadsheets with multiple tabs in them.

The DataFrame has an index for each row and a column header. We can query a DataFrame as follows:

print('Shape', df.shape)
print('-------------------------')
print('Number of rows', len(df))
print('-------------------------')
print('Column headers', df.columns)
print('-------------------------')
print('Data types', df.dtypes)
print('-------------------------')
print('Index', df.index)
print('-------------------------')
Shape (5, 4)
-------------------------
Number of rows 5
-------------------------
Column headers Index(['Area', 'Frequency', 'Unnamed: 2', 'Unnamed: 3'], dtype='object')
-------------------------
Data types Area           object
Frequency       int64
Unnamed: 2    float64
Unnamed: 3    float64
dtype: object
-------------------------
Index RangeIndex(start=0, stop=5, step=1)
-------------------------
# Shape - Number of Rows and Columns
cat("Shape: ", nrow(df), " rows, ", ncol(df), " columns\n")

# Number of Rows
cat("Number of rows: ", nrow(df), "\n")

# Column Headers
cat("Column headers: ", colnames(df), "\n")

# Data Types
cat("Data types: ", sapply(df, class), "\n")

# Index
cat("Index: ", 1:nrow(df), "\n")
Shape:  5  rows,  4  columns
Number of rows:  5 
Column headers:  Area Frequency Unnamed: 2 Unnamed: 3 
Data types:  character numeric logical logical 
Index:  1 2 3 4 5 

Drop columns 3 and 4 and let's have a look at the data again.

# axis = 1 means that you are dropping a column
df.drop('Unnamed: 2', axis=1, inplace=True)

df.drop('Unnamed: 3', axis=1, inplace=True)

# Let's have a look at it
print(df)
                 Area  Frequency
0          Accounting         73
1             Finance         52
2  General management         36
3     Marketing sales         64
4               other         28
# Load the dplyr library if you haven't already
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Remove columns by name
df <- df %>% select(!c('Unnamed: 2', 'Unnamed: 3'))

# Print the modified data frame
print(df)
# A tibble: 5 × 2
  Area               Frequency
  <chr>                  <dbl>
1 Accounting                73
2 Finance                   52
3 General management        36
4 Marketing sales           64
5 other                     28
Warning

Sometimes column headers have white spaces in them. It is probably a good idea to remove all white spaces from header names. You can use the str.replace() method to accomplish this:

df.columns = df.columns.str.replace(r'\s+', '', regex=True)
# Remove spaces from column names
colnames(df) <- gsub("\\s+", "", colnames(df))

The \s+ is a regular expression, which is basically a pattern-matching language. The \s stands for a whitespace character and the + means look for one or more of them and replace them with '', i.e., an empty (blank) string.
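
As a small illustration, here is a sketch with a hypothetical DataFrame whose headers contain spaces:

import pandas as pd

df_demo = pd.DataFrame({'First Name': ['Ann'], 'Zip Code': ['21252']})
df_demo.columns = df_demo.columns.str.replace(r'\s+', '', regex=True)
print(df_demo.columns)  # Index(['FirstName', 'ZipCode'], dtype='object')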

We can make new columns using information from the existing columns.

Note

Making new columns in the DataFrame is very similar to making new variables in Stata using the generate or gen command.

More specifically, we next generate a new variable called relFrequency that contains the relative frequencies.

# Make new column with relative frequency
df['relFrequency'] = df['Frequency']/df['Frequency'].sum()

# Let's have a look at it
print(df)
                 Area  Frequency  relFrequency
0          Accounting         73      0.288538
1             Finance         52      0.205534
2  General management         36      0.142292
3     Marketing sales         64      0.252964
4               other         28      0.110672
# Make a new column with relative frequency
df$relFrequency <- df$Frequency / sum(df$Frequency)

# Print the data frame
print(df)
# A tibble: 5 × 3
  Area               Frequency relFrequency
  <chr>                  <dbl>        <dbl>
1 Accounting                73        0.289
2 Finance                   52        0.206
3 General management        36        0.142
4 Marketing sales           64        0.253
5 other                     28        0.111

We can also just grab the relative frequencies out of the DataFrame and store them in a familiar NumPy array. The result is a vector. This only makes sense if the column contains nothing but numbers.

xv = df['relFrequency'].values

print('Array xv is:', xv)
Array xv is: [0.28853755 0.2055336  0.14229249 0.25296443 0.11067194]
xv <- df$relFrequency

cat('Array xv is:', xv, '\n')
Array xv is: 0.2885375 0.2055336 0.1422925 0.2529644 0.1106719 

You can now do simple calculations with this vector as you did in the previous chapters such as:

print('The sum of the vector xv is: ', xv.sum())
The sum of the vector xv is:  1.0
# Calculate the sum of the vector xv
sum_xv <- sum(xv)
cat("The sum of the vector xv is:", sum_xv, "\n")
The sum of the vector xv is: 1 

Let us make an additional column so we have more data to play with:

df['random'] = df['Frequency']/np.sqrt(df['Frequency'].sum())

# Let's have a look at it.
print(df)
                 Area  Frequency  relFrequency    random
0          Accounting         73      0.288538  4.589471
1             Finance         52      0.205534  3.269212
2  General management         36      0.142292  2.263301
3     Marketing sales         64      0.252964  4.023646
4               other         28      0.110672  1.760345
# Add a new column 'random'
df$random <- df$Frequency / sqrt(sum(df$Frequency))

# Print the modified data frame
print(df)
# A tibble: 5 × 4
  Area               Frequency relFrequency random
  <chr>                  <dbl>        <dbl>  <dbl>
1 Accounting                73        0.289   4.59
2 Finance                   52        0.206   3.27
3 General management        36        0.142   2.26
4 Marketing sales           64        0.253   4.02
5 other                     28        0.111   1.76

11.3 Some Pandas Tricks

11.3.1 Renaming Variables

For renaming column names we use a dictionary.

Note

Remember, dictionaries are similar to lists, but the elements of a dictionary are not indexed with 0, 1, 2, etc.; instead we use user-defined keys, so that a dictionary consists of key-value pairs such as:

{'key1': value1, 'key2': value2, ... }

It works as follows. The dictionary that you hand in defines each old column name as a key and the new column name as the associated value.

df = df.rename(columns={'Area': 'area', 'Frequency': 'absFrequency'})

print(df.head(3))
                 area  absFrequency  relFrequency    random
0          Accounting            73      0.288538  4.589471
1             Finance            52      0.205534  3.269212
2  General management            36      0.142292  2.263301
# Rename columns 'Area' to 'area' and 'Frequency' to 'absFrequency'
colnames(df)[colnames(df) == "Area"] <- "area"
colnames(df)[colnames(df) == "Frequency"] <- "absFrequency"

# Display the first 3 rows of the modified data frame
head(df, 3)
# A tibble: 3 × 4
  area               absFrequency relFrequency random
  <chr>                     <dbl>        <dbl>  <dbl>
1 Accounting                   73        0.289     73
2 Finance                      52        0.206     52
3 General management           36        0.142     36

The head(x) method allows us to print x rows from the top of the DataFrame, whereas the tail(x) method does the same from the bottom of the DataFrame.
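
For example, with our current DataFrame (a quick sketch; output omitted):

print(df.head(2))   # first two rows of the DataFrame
print(df.tail(2))   # last two rows of the DataFrame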

11.3.2 Dropping and Adding Columns

We next add a new column at a specific position in the DataFrame. Here it is going to be the third column and since Python indexes starting from zero, this is indicated as loc=2.

# Drop a column first so everything fits
df = df.drop('random', axis=1)

# Create a new column with some missing values
df['team'] = pd.Series([2, 4, np.nan, 6, 10], index=df.index)

# or insert new column at specific location
df.insert(loc=2, column='position', value=[np.nan, 1, np.nan, 1, 3])

print(df)
                 area  absFrequency  position  relFrequency  team
0          Accounting            73       NaN      0.288538   2.0
1             Finance            52       1.0      0.205534   4.0
2  General management            36       NaN      0.142292   NaN
3     Marketing sales            64       1.0      0.252964   6.0
4               other            28       3.0      0.110672  10.0

R indexes starting from one, so the new column is going to be inserted at the third position, following column df[,2]:

# Drop the 'random' column
df$random <- NULL

# Create a new 'team' column with missing values
df$team <- c(2, 4, NA, 6, 10)

# Insert a new 'position' column at the second position with specific values
df <- data.frame(df[,1:2], position = c(NA, 1, NA, 1, 3), df[,3:ncol(df)])

# Print the updated dataframe
print(df)
                area absFrequency position relFrequency team
1         Accounting           73       NA    0.2885375    2
2            Finance           52        1    0.2055336    4
3 General management           36       NA    0.1422925   NA
4    Marketing sales           64        1    0.2529644    6
5              other           28        3    0.1106719   10

11.3.3 Missing Values or NaNs

Count NaNs

Count the number of rows with missing values.

nans = df.shape[0] - df.dropna().shape[0]
print('{} rows have missing values'.format(nans))
2 rows have missing values
nans <- sum(rowSums(is.na(df)) > 0)
cat(nans, 'rows have missing values\n')
2 rows have missing values
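
If you want the number of missing values per column instead, you can chain the isnull() method with sum(); a minimal sketch:

# Count NaNs column by column (each True counts as 1 when summed)
print(df.isnull().sum())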

Select rows with NaNs

If you want to select the rows with missing values you can index them with the isnull() method.

print(df[df.isnull().any(axis=1)])
                 area  absFrequency  position  relFrequency  team
0          Accounting            73       NaN      0.288538   2.0
2  General management            36       NaN      0.142292   NaN
# Print rows with missing values
print(df[apply(is.na(df), 1, any), ])
                area absFrequency position relFrequency team
1         Accounting           73       NA    0.2885375    2
3 General management           36       NA    0.1422925   NA

Or if you want to search specific columns for missing values you can do

print(df[df['team'].isnull()])
                 area  absFrequency  position  relFrequency  team
2  General management            36       NaN      0.142292   NaN
# Print rows where 'team' has missing values
print(df[is.na(df$team), ])
                area absFrequency position relFrequency team
3 General management           36       NA    0.1422925   NA

Or if you want to grab all the rows where a specific column has no missing values, you can index the DataFrame with the notnull() method.

print(df[df['team'].notnull()])
              area  absFrequency  position  relFrequency  team
0       Accounting            73       NaN      0.288538   2.0
1          Finance            52       1.0      0.205534   4.0
3  Marketing sales            64       1.0      0.252964   6.0
4            other            28       3.0      0.110672  10.0
# Print rows where 'team' is not null
print(df[!is.na(df$team), ])
             area absFrequency position relFrequency team
1      Accounting           73       NA    0.2885375    2
2         Finance           52        1    0.2055336    4
4 Marketing sales           64        1    0.2529644    6
5           other           28        3    0.1106719   10

Delete rows with NaNs

print(df.dropna())
              area  absFrequency  position  relFrequency  team
1          Finance            52       1.0      0.205534   4.0
3  Marketing sales            64       1.0      0.252964   6.0
4            other            28       3.0      0.110672  10.0
# Remove rows with missing values
df_temp <- na.omit(df)

# Print the resulting DataFrame
print(df_temp)
             area absFrequency position relFrequency team
2         Finance           52        1    0.2055336    4
4 Marketing sales           64        1    0.2529644    6
5           other           28        3    0.1106719   10

You can also reassign it to the original DataFrame or assign a new one.

df2 = df.dropna()
print(df2)
              area  absFrequency  position  relFrequency  team
1          Finance            52       1.0      0.205534   4.0
3  Marketing sales            64       1.0      0.252964   6.0
4            other            28       3.0      0.110672  10.0
# Create a new DataFrame without missing values
df2 <- df[complete.cases(df), ]

# Print the resulting DataFrame
print(df2)
             area absFrequency position relFrequency team
2         Finance           52        1    0.2055336    4
4 Marketing sales           64        1    0.2529644    6
5           other           28        3    0.1106719   10

Overwriting NaNs with specific values

If we want to overwrite the NaN entries with values, we can use the fillna() method. In this example we replace all missing values with zeros.

df.fillna(value=0, inplace=True)
print(df)
                 area  absFrequency  position  relFrequency  team
0          Accounting            73       0.0      0.288538   2.0
1             Finance            52       1.0      0.205534   4.0
2  General management            36       0.0      0.142292   0.0
3     Marketing sales            64       1.0      0.252964   6.0
4               other            28       3.0      0.110672  10.0
# Replace missing values with 0
df[is.na(df)] <- 0

# Print the DataFrame
print(df)
                area absFrequency position relFrequency team
1         Accounting           73        0    0.2885375    2
2            Finance           52        1    0.2055336    4
3 General management           36        0    0.1422925    0
4    Marketing sales           64        1    0.2529644    6
5              other           28        3    0.1106719   10
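
The fillna() method also accepts a dictionary mapping column names to fill values, so each column can be treated differently. A hypothetical sketch (assuming the NaNs had not been filled yet):

# Fill 'position' with 0 but 'team' with the column mean
df_filled = df.fillna({'position': 0, 'team': df['team'].mean()})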

11.3.4 Dropping and Adding Rows to a DataFrame

We can use the loc indexer (with iloc to look up the last row's index label) to add an empty row to the DataFrame.

df.loc[df.iloc[-1].name + 1,:] = np.nan
print(df.tail(3))
              area  absFrequency  position  relFrequency  team
3  Marketing sales          64.0       1.0      0.252964   6.0
4            other          28.0       3.0      0.110672  10.0
5              NaN           NaN       NaN           NaN   NaN
# Create a row with missing values
new_row <- data.frame(area = NA, absFrequency = NA, position = NA, relFrequency = NA, team = NA)

# Add the new row to the DataFrame
df <- rbind(df, new_row)

# Print the last 3 rows
print(tail(df, 3))
             area absFrequency position relFrequency team
4 Marketing sales           64        1    0.2529644    6
5           other           28        3    0.1106719   10
6            <NA>           NA       NA           NA   NA

We can then fill this empty row with data.

df.loc[df.index[-1], 'area'] = 'Economics'
df.loc[df.index[-1], 'absFrequency'] = 25
print(df)
                 area  absFrequency  position  relFrequency  team
0          Accounting          73.0       0.0      0.288538   2.0
1             Finance          52.0       1.0      0.205534   4.0
2  General management          36.0       0.0      0.142292   0.0
3     Marketing sales          64.0       1.0      0.252964   6.0
4               other          28.0       3.0      0.110672  10.0
5           Economics          25.0       NaN           NaN   NaN
# Update values in the last row
df[nrow(df), 'area'] <- 'Economics'
df[nrow(df), 'absFrequency'] <- 25

# Print the updated DataFrame
print(df)
                area absFrequency position relFrequency team
1         Accounting           73        0    0.2885375    2
2            Finance           52        1    0.2055336    4
3 General management           36        0    0.1422925    0
4    Marketing sales           64        1    0.2529644    6
5              other           28        3    0.1106719   10
6          Economics           25       NA           NA   NA

If you want to drop this row you can index it directly with index value 5 (i.e., row 6):

print(df.drop(5))
                 area  absFrequency  position  relFrequency  team
0          Accounting          73.0       0.0      0.288538   2.0
1             Finance          52.0       1.0      0.205534   4.0
2  General management          36.0       0.0      0.142292   0.0
3     Marketing sales          64.0       1.0      0.252964   6.0
4               other          28.0       3.0      0.110672  10.0
# Assuming you want to drop the 6th row
df_temp <- df[-6, ]

# Print the updated DataFrame
print(df_temp)
                area absFrequency position relFrequency team
1         Accounting           73        0    0.2885375    2
2            Finance           52        1    0.2055336    4
3 General management           36        0    0.1422925    0
4    Marketing sales           64        1    0.2529644    6
5              other           28        3    0.1106719   10

If you want the change to be permanent you need to either use the df.drop(5, inplace=True) option or reassign the DataFrame to itself: df = df.drop(5). The default dimension for dropping data is the row (axis=0), so you do not need to specify it in the drop command.

However, if you drop a column, then you need the dimension specifier axis=1.
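
Newer versions of pandas also accept the more explicit index= and columns= keywords, so you do not have to remember the axis numbers; a small sketch:

df_temp = df.drop(index=[5])         # same as df.drop(5, axis=0)
df_temp = df.drop(columns=['team'])  # same as df.drop('team', axis=1)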

Note

You can use the pd.concat() function to append one or more Pandas DataFrames together. There is also a DataFrame append() method, but it is deprecated (and removed in recent pandas releases), so do not use it.

import pandas as pd

#create three DataFrames
df1 = pd.DataFrame({'x': [10, 11, 12],
                    'y': [20, 21, 22]})

df2 = pd.DataFrame({'x': [30, 31, 32],
                    'y': [40, 41, 42]})

df3 = pd.DataFrame({'x': [50, 51, 52],
                    'y': [60, 61, 62]})

# Append all three DataFrames together
combined = pd.concat([df1, df2, df3], ignore_index=True)

# View final DataFrame
print(combined)
    x   y
0  10  20
1  11  21
2  12  22
3  30  40
4  31  41
5  32  42
6  50  60
7  51  61
8  52  62
# Create three data frames
df1 <- data.frame(x = c(10, 11, 12), y = c(20, 21, 22))
df2 <- data.frame(x = c(30, 31, 32), y = c(40, 41, 42))
df3 <- data.frame(x = c(50, 51, 52), y = c(60, 61, 62))

# Combine all three data frames
combined <- rbind(df1, df2, df3)

# View the final data frame
print(combined)
   x  y
1 10 20
2 11 21
3 12 22
4 30 40
5 31 41
6 32 42
7 50 60
8 51 61
9 52 62

If we didn’t use the ignore_index argument, the index of the resulting DataFrame would retain the original index values for each individual DataFrame:

# Append all three DataFrames together
combined = pd.concat([df1, df2, df3])

# View final DataFrame
print(combined)
    x   y
0  10  20
1  11  21
2  12  22
0  30  40
1  31  41
2  32  42
0  50  60
1  51  61
2  52  62
# In R, rbind() always renumbers the rows of the combined data frame,
# so there is no direct analog of the ignore_index argument; appending
# the three data frames again gives the same result as before:

combined <- rbind(df1, df2, df3)

# View the final data frame
print(combined)
   x  y
1 10 20
2 11 21
3 12 22
4 30 40
5 31 41
6 32 42
7 50 60
8 51 61
9 52 62

11.3.5 Dropping Rows Based on Conditions

Before running a regression, a researcher might like to drop certain observations that are clearly not correct such as negative age observations for instance.

Let us assume that in our current DataFrame we have information telling us that the position variable must be strictly positive, i.e., it cannot be zero. This would mean that row 0 and row 2 need to be dropped from our DataFrame.

Before we mess up our DataFrame, let's make a copy of it and then drop rows 0 and 2 based on the value condition on the variable position.

# Make an independent copy so that changes do not affect the original
df_temp = df.copy()

print(df_temp)
                 area  absFrequency  position  relFrequency  team
0          Accounting          73.0       0.0      0.288538   2.0
1             Finance          52.0       1.0      0.205534   4.0
2  General management          36.0       0.0      0.142292   0.0
3     Marketing sales          64.0       1.0      0.252964   6.0
4               other          28.0       3.0      0.110672  10.0
5           Economics          25.0       NaN           NaN   NaN
df_temp <- df

# View the copied data frame
print(df_temp)
                area absFrequency position relFrequency team
1         Accounting           73        0    0.2885375    2
2            Finance           52        1    0.2055336    4
3 General management           36        0    0.1422925    0
4    Marketing sales           64        1    0.2529644    6
5              other           28        3    0.1106719   10
6          Economics           25       NA           NA   NA

Now drop the rows with a position variable value of zero or less.

df_temp = df_temp[df_temp['position']>0]
print(df_temp)
              area  absFrequency  position  relFrequency  team
1          Finance          52.0       1.0      0.205534   4.0
3  Marketing sales          64.0       1.0      0.252964   6.0
4            other          28.0       3.0      0.110672  10.0
df_temp <- subset(df_temp, position > 0)

# View the filtered data frame
print(df_temp)
             area absFrequency position relFrequency team
2         Finance           52        1    0.2055336    4
4 Marketing sales           64        1    0.2529644    6
5           other           28        3    0.1106719   10

As you can see, we accomplished this by "keeping" the observations for which the position value is strictly positive, as opposed to "dropping" the ones that are zero or missing.

And also let's make sure that the original DataFrame is still what it was.

print(df)
                 area  absFrequency  position  relFrequency  team
0          Accounting          73.0       0.0      0.288538   2.0
1             Finance          52.0       1.0      0.205534   4.0
2  General management          36.0       0.0      0.142292   0.0
3     Marketing sales          64.0       1.0      0.252964   6.0
4               other          28.0       3.0      0.110672  10.0
5           Economics          25.0       NaN           NaN   NaN
print(df)
                area absFrequency position relFrequency team
1         Accounting           73        0    0.2885375    2
2            Finance           52        1    0.2055336    4
3 General management           36        0    0.1422925    0
4    Marketing sales           64        1    0.2529644    6
5              other           28        3    0.1106719   10
6          Economics           25       NA           NA   NA

11.3.6 Sorting and Reindexing DataFrames

We next sort the DataFrame by a certain column (from highest to lowest) using the sort_values() DataFrame method.

Warning

The old pandas method .sort() was replaced with .sort_values() and no longer works in newer versions of pandas.

df.sort_values('absFrequency', ascending=False, inplace=True)
print(df)
                 area  absFrequency  position  relFrequency  team
0          Accounting          73.0       0.0      0.288538   2.0
3     Marketing sales          64.0       1.0      0.252964   6.0
1             Finance          52.0       1.0      0.205534   4.0
2  General management          36.0       0.0      0.142292   0.0
4               other          28.0       3.0      0.110672  10.0
5           Economics          25.0       NaN           NaN   NaN
# Sort the data frame by 'absFrequency' in descending order
df <- df[order(-df$absFrequency), ]

# View the sorted data frame
print(df)
                area absFrequency position relFrequency team
1         Accounting           73        0    0.2885375    2
4    Marketing sales           64        1    0.2529644    6
2            Finance           52        1    0.2055336    4
3 General management           36        0    0.1422925    0
5              other           28        3    0.1106719   10
6          Economics           25       NA           NA   NA

The inplace=True option will immediately overwrite the DataFrame df with the new, sorted one. If you set inplace=False, you have to assign the result to a new DataFrame in order to see the changes:

df1 = df.sort_values('absFrequency', ascending=False, inplace=False)
# Sort the data frame by 'absFrequency' in descending order
df1 <- df[order(-df$absFrequency), ]

Here df1 is the sorted DataFrame and the original df DataFrame is still unsorted.

Note

However, we now have two objects with identical data in it. This is often redundant and can eventually become a memory problem if the DataFrames are very large. Setting inplace=True is probably a good default setting.

If we want, we can then reindex the DataFrame according to the new sort order.

df.index = range(len(df.index))
print(df)
                 area  absFrequency  position  relFrequency  team
0          Accounting          73.0       0.0      0.288538   2.0
1     Marketing sales          64.0       1.0      0.252964   6.0
2             Finance          52.0       1.0      0.205534   4.0
3  General management          36.0       0.0      0.142292   0.0
4               other          28.0       3.0      0.110672  10.0
5           Economics          25.0       NaN           NaN   NaN
# Reset the row names or row indices
rownames(df) <- NULL
print(df)
                area absFrequency position relFrequency team
1         Accounting           73        0    0.2885375    2
2    Marketing sales           64        1    0.2529644    6
3            Finance           52        1    0.2055336    4
4 General management           36        0    0.1422925    0
5              other           28        3    0.1106719   10
6          Economics           25       NA           NA   NA
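
An equivalent and perhaps more idiomatic way to do this in pandas is the reset_index() method; a minimal sketch:

# drop=True discards the old index instead of keeping it as a new column
df = df.reset_index(drop=True)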

11.4 Merging DataFrames or Adding Columns

Let us create a new DataFrame that has the variable team in common with the previous DataFrame. This new DataFrame contains additional team information that we would like to merge into the original DataFrame.

First redefine the original DataFrame:

df= pd.DataFrame({'area': ['Accounting','Marketing','Finance','Management', 'other'],\
                  'absFrequency': [73,64,52,36,28], \
                  'position': [0,1,1,0,3], \
                  'relFrequency': [0.28,0.25,0.20,0.14,0.11], \
                  'team': [2,6,4,0,10]})

print(df)
         area  absFrequency  position  relFrequency  team
0  Accounting            73         0          0.28     2
1   Marketing            64         1          0.25     6
2     Finance            52         1          0.20     4
3  Management            36         0          0.14     0
4       other            28         3          0.11    10
# Create the data frame
df <- data.frame(
  area = c('Accounting', 'Marketing', 'Finance', 'Management', 'other'),
  absFrequency = c(73, 64, 52, 36, 28),
  position = c(0, 1, 1, 0, 3),
  relFrequency = c(0.28, 0.25, 0.20, 0.14, 0.11),
  team = c(2, 6, 4, 0, 10)
)

print(df)
        area absFrequency position relFrequency team
1 Accounting           73        0         0.28    2
2  Marketing           64        1         0.25    6
3    Finance           52        1         0.20    4
4 Management           36        0         0.14    0
5      other           28        3         0.11   10

Second, we define the new DataFrame and add some values to it:

df_team = pd.DataFrame({'team': [3,4,5,6,7,8,9,10], \
                        'sales': [500,300,250,450,345,123,432,890]})

print(df_team)
   team  sales
0     3    500
1     4    300
2     5    250
3     6    450
4     7    345
5     8    123
6     9    432
7    10    890
# Create the data frame
df_team <- data.frame(
  team = c(3, 4, 5, 6, 7, 8, 9, 10),
  sales = c(500, 300, 250, 450, 345, 123, 432, 890)
)

print(df_team)
  team sales
1    3   500
2    4   300
3    5   250
4    6   450
5    7   345
6    8   123
7    9   432
8   10   890

This new DataFrame contains some sales information for each team.

Before we merge in the new info we drop the area column so that the DataFrame fits on the screen. We now work with the following two DataFrames.

df = df.drop('area', axis=1)
print('-----------------------------')
print('First, main,  DataFrame: LEFT')
print('-----------------------------')
print(df)
print('')
print('-----------------------------')
print('Second new DataFrame: RIGHT')
print('-----------------------------')
print(df_team)
-----------------------------
First, main,  DataFrame: LEFT
-----------------------------
   absFrequency  position  relFrequency  team
0            73         0          0.28     2
1            64         1          0.25     6
2            52         1          0.20     4
3            36         0          0.14     0
4            28         3          0.11    10

-----------------------------
Second new DataFrame: RIGHT
-----------------------------
   team  sales
0     3    500
1     4    300
2     5    250
3     6    450
4     7    345
5     8    123
6     9    432
7    10    890
# Remove the 'area' column from the df data frame
df <- df[, -which(names(df) == 'area')]

# Display the 'df' data frame
cat('-----------------------------\n')
cat('First, main,  DataFrame: LEFT\n')
cat('-----------------------------\n')
print(df)
cat('\n')

# Display the 'df_team' data frame
cat('-----------------------------\n')
cat('Second new DataFrame: RIGHT\n')
cat('-----------------------------\n')
print(df_team)
-----------------------------
First, main,  DataFrame: LEFT
-----------------------------
  absFrequency position relFrequency team
1           73        0         0.28    2
2           64        1         0.25    6
3           52        1         0.20    4
4           36        0         0.14    0
5           28        3         0.11   10

-----------------------------
Second new DataFrame: RIGHT
-----------------------------
  team sales
1    3   500
2    4   300
3    5   250
4    6   450
5    7   345
6    8   123
7    9   432
8   10   890

When merging two DataFrames we need to find a common variable that appears in both DataFrames along which we can connect them. In our example we use the team column as the merge key (or key variable) because it appears in both DataFrames. We now have four different options for merging the two DataFrames.

The four merge options: Inner Merge, Left Merge, Right Merge, and Outer Merge.

Note

Merge Summary

  1. The inner merge only keeps observations that are present in both DataFrames.
  2. The left merge keeps all observations of the original (or main, i.e., left) DataFrame and adds the new information from the right DataFrame where available.
  3. The right merge keeps all observations of the new (right) DataFrame and adds the information of the original main DataFrame where available.
  4. The outer merge merges by team whenever possible but keeps the information from both DataFrames, even for observations where no merge/overlap occurred.

Now we do it step by step, starting with the inner merge. As you will see, the inner merge only keeps observations that are present in both DataFrames. This is often too restrictive, as many observations will be dropped.

print('Inner Merge')
print('-----------')
print(pd.merge(df, df_team, on='team', how='inner'))
Inner Merge
-----------
   absFrequency  position  relFrequency  team  sales
0            64         1          0.25     6    450
1            52         1          0.20     4    300
2            28         3          0.11    10    890
# Inner Merge
cat('Inner Merge\n')
cat('-----------\n')
inner_merged <- merge(df, df_team, by='team', all=FALSE)
print(inner_merged)
Inner Merge
-----------
  team absFrequency position relFrequency sales
1    4           52        1         0.20   300
2    6           64        1         0.25   450
3   10           28        3         0.11   890

The left merge keeps all observations of the original (or main) DataFrame and adds the new information from the right DataFrame where available. When a team number from the new DataFrame cannot be found in the main DataFrame, its information is simply dropped.

print('Left Merge')
print('----------')
print(pd.merge(df, df_team, on='team', how='left'))
Left Merge
----------
   absFrequency  position  relFrequency  team  sales
0            73         0          0.28     2    NaN
1            64         1          0.25     6  450.0
2            52         1          0.20     4  300.0
3            36         0          0.14     0    NaN
4            28         3          0.11    10  890.0
# Left Merge
cat('Left Merge\n')
cat('----------\n')
left_merged <- merge(df, df_team, by='team', all.x=TRUE)
print(left_merged)
Left Merge
----------
  team absFrequency position relFrequency sales
1    0           36        0         0.14    NA
2    2           73        0         0.28    NA
3    4           52        1         0.20   300
4    6           64        1         0.25   450
5   10           28        3         0.11   890

The right merge keeps all observations of the new DataFrame and adds the information from the original main DataFrame where available.

print('Right Merge')
print('-----------')
print(pd.merge(df, df_team, on='team', how='right'))
Right Merge
-----------
   absFrequency  position  relFrequency  team  sales
0           NaN       NaN           NaN     3    500
1          52.0       1.0          0.20     4    300
2           NaN       NaN           NaN     5    250
3          64.0       1.0          0.25     6    450
4           NaN       NaN           NaN     7    345
5           NaN       NaN           NaN     8    123
6           NaN       NaN           NaN     9    432
7          28.0       3.0          0.11    10    890
# Right Merge
cat('Right Merge\n')
cat('-----------\n')
right_merged <- merge(df, df_team, by='team', all.y=TRUE)
print(right_merged)
Right Merge
-----------
  team absFrequency position relFrequency sales
1    3           NA       NA           NA   500
2    4           52        1         0.20   300
3    5           NA       NA           NA   250
4    6           64        1         0.25   450
5    7           NA       NA           NA   345
6    8           NA       NA           NA   123
7    9           NA       NA           NA   432
8   10           28        3         0.11   890

The outer merge method merges by team whenever possible but keeps the info from both DataFrames, even for observations where no merge/overlap occurred.

print('Outer Merge')
print('-----------')
print(pd.merge(df, df_team, on='team', how='outer'))
Outer Merge
-----------
   absFrequency  position  relFrequency  team  sales
0          73.0       0.0          0.28     2    NaN
1          64.0       1.0          0.25     6  450.0
2          52.0       1.0          0.20     4  300.0
3          36.0       0.0          0.14     0    NaN
4          28.0       3.0          0.11    10  890.0
5           NaN       NaN           NaN     3  500.0
6           NaN       NaN           NaN     5  250.0
7           NaN       NaN           NaN     7  345.0
8           NaN       NaN           NaN     8  123.0
9           NaN       NaN           NaN     9  432.0
# Outer Merge
cat('Outer Merge\n')
cat('-----------\n')
outer_merged <- merge(df, df_team, by='team', all=TRUE)
print(outer_merged)
Outer Merge
-----------
   team absFrequency position relFrequency sales
1     0           36        0         0.14    NA
2     2           73        0         0.28    NA
3     3           NA       NA           NA   500
4     4           52        1         0.20   300
5     5           NA       NA           NA   250
6     6           64        1         0.25   450
7     7           NA       NA           NA   345
8     8           NA       NA           NA   123
9     9           NA       NA           NA   432
10   10           28        3         0.11   890

11.5 Converting Types and Replacing Values in DataFrames

11.5.1 Converting Column Types

When reading text into a DataFrame to conduct statistical analysis, it is sometimes necessary to convert a column of numbers that is still stored as strings into floats so that we can do math. Let's drop one more column first so the DataFrame fits nicely in the output window.

df = df.drop('absFrequency', axis=1)
# Let's have a look at it
print(df)
   position  relFrequency  team
0         0          0.28     2
1         1          0.25     6
2         1          0.20     4
3         0          0.14     0
4         3          0.11    10
# Drop 'absFrequency' column
df$absFrequency <- NULL

# Print the data frame
print(df)
  position relFrequency team
1        0         0.28    2
2        1         0.25    6
3        1         0.20    4
4        0         0.14    0
5        3         0.11   10

We now generate a column of "words" or "strings" that happen to be numbers. Since they are defined as words, we cannot run statistical analysis on these numbers yet.

df['string'] = pd.Series(['2','40','34','6','10'], index=df.index)

# Print DataFrame
print(df)

# Print summary statistics
print(df.describe())
   position  relFrequency  team string
0         0          0.28     2      2
1         1          0.25     6     40
2         1          0.20     4     34
3         0          0.14     0      6
4         3          0.11    10     10
       position  relFrequency       team
count  5.000000      5.000000   5.000000
mean   1.000000      0.196000   4.400000
std    1.224745      0.071624   3.847077
min    0.000000      0.110000   0.000000
25%    0.000000      0.140000   2.000000
50%    1.000000      0.200000   4.000000
75%    1.000000      0.250000   6.000000
max    3.000000      0.280000  10.000000
# Add a new column named 'string' to your data frame
df$string <- c('2', '40', '34', '6', '10')

# Print the updated data frame
print(df)

# Print summary statistics for the numeric columns
summary(df)
  position relFrequency team string
1        0         0.28    2      2
2        1         0.25    6     40
3        1         0.20    4     34
4        0         0.14    0      6
5        3         0.11   10     10
    position  relFrequency        team         string         
 Min.   :0   Min.   :0.110   Min.   : 0.0   Length:5          
 1st Qu.:0   1st Qu.:0.140   1st Qu.: 2.0   Class :character  
 Median :1   Median :0.200   Median : 4.0   Mode  :character  
 Mean   :1   Mean   :0.196   Mean   : 4.4                     
 3rd Qu.:1   3rd Qu.:0.250   3rd Qu.: 6.0                     
 Max.   :3   Max.   :0.280   Max.   :10.0                     

We first need to reassign the data type of the string column. We then rename the column and print summary statistics.

# Transform strings into floats, i.e., words into numbers
df['string'] = df['string'].astype(float)

# Rename the column
df = df.rename(columns={'string': 'salary'})

# Print summary statistics
print(df.describe())
       position  relFrequency       team     salary
count  5.000000      5.000000   5.000000   5.000000
mean   1.000000      0.196000   4.400000  18.400000
std    1.224745      0.071624   3.847077  17.343587
min    0.000000      0.110000   0.000000   2.000000
25%    0.000000      0.140000   2.000000   6.000000
50%    1.000000      0.200000   4.000000  10.000000
75%    1.000000      0.250000   6.000000  34.000000
max    3.000000      0.280000  10.000000  40.000000
# Transform strings into numbers
df$string <- as.numeric(df$string)

# Rename the column
colnames(df)[colnames(df) == 'string'] <- 'salary'

# Print summary statistics for the numeric columns
summary(df)
    position  relFrequency        team          salary    
 Min.   :0   Min.   :0.110   Min.   : 0.0   Min.   : 2.0  
 1st Qu.:0   1st Qu.:0.140   1st Qu.: 2.0   1st Qu.: 6.0  
 Median :1   Median :0.200   Median : 4.0   Median :10.0  
 Mean   :1   Mean   :0.196   Mean   : 4.4   Mean   :18.4  
 3rd Qu.:1   3rd Qu.:0.250   3rd Qu.: 6.0   3rd Qu.:34.0  
 Max.   :3   Max.   :0.280   Max.   :10.0   Max.   :40.0  
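
If a string column may contain entries that cannot be parsed as numbers, astype(float) will raise an error. A more forgiving alternative is pd.to_numeric() with errors='coerce', which turns unparseable entries into NaN; a sketch with made-up values:

import pandas as pd

s = pd.Series(['2', '40', 'n/a', '6'])
print(pd.to_numeric(s, errors='coerce'))  # 2.0, 40.0, NaN, 6.0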

11.5.2 Replacing Values in a DataFrame Conditional on Criteria

If we need to replace values in a DataFrame based on certain criteria it is often more efficient to use indexing as opposed to loops. Let us first create a DataFrame with some random values:

df1 = pd.DataFrame(np.random.rand(10,5)*100)
print(df1)
           0          1          2          3          4
0  21.651551  20.045638  33.662727  93.198514  86.568045
1  90.834640  10.420275  14.093457  14.044287  14.252533
2   3.306218  46.425814  87.380927  11.623223  31.251596
3   3.830065  41.256834  23.384440  24.823248   5.872269
4   9.606805  58.170679  91.473171   2.374509  61.006675
5  16.074765  33.722850  16.766003  49.042524  28.847451
6  32.572317  54.522931  91.582778   9.422688  88.284203
7  37.438779  78.856242   5.942955  34.583682  94.225941
8  94.182952  87.189188  38.605325  84.628497  30.175344
9  78.336114  16.287495  19.028794  87.068721  58.704766
# Set a seed for reproducibility
set.seed(123)

# Create a data frame with random numbers
df1 <- as.data.frame(matrix(runif(50, min = 0, max = 100), nrow = 10))

# Print the data frame
print(df1)
         V1        V2       V3        V4       V5
1  28.75775 95.683335 88.95393 96.302423 14.28000
2  78.83051 45.333416 69.28034 90.229905 41.45463
3  40.89769 67.757064 64.05068 69.070528 41.37243
4  88.30174 57.263340 99.42698 79.546742 36.88455
5  94.04673 10.292468 65.57058  2.461368 15.24447
6   4.55565 89.982497 70.85305 47.779597 13.88061
7  52.81055 24.608773 54.40660 75.845954 23.30341
8  89.24190  4.205953 59.41420 21.640794 46.59625
9  55.14350 32.792072 28.91597 31.818101 26.59726
10 45.66147 95.450365 14.71136 23.162579 85.78277

We next replace all the values in column 2 that are smaller than 30 with the string Low.

df1.loc[df1[1] < 30, 1] = 'Low'
print(df1)
           0          1          2          3          4
0  21.651551        Low  33.662727  93.198514  86.568045
1  90.834640        Low  14.093457  14.044287  14.252533
2   3.306218  46.425814  87.380927  11.623223  31.251596
3   3.830065  41.256834  23.384440  24.823248   5.872269
4   9.606805  58.170679  91.473171   2.374509  61.006675
5  16.074765   33.72285  16.766003  49.042524  28.847451
6  32.572317  54.522931  91.582778   9.422688  88.284203
7  37.438779  78.856242   5.942955  34.583682  94.225941
8  94.182952  87.189188  38.605325  84.628497  30.175344
9  78.336114        Low  19.028794  87.068721  58.704766
# Create a data frame with random numbers
df1 <- as.data.frame(matrix(runif(50, min = 0, max = 100), nrow = 10))

# Set values less than 30 in the second column to 'Low'
df1[df1[,2] < 30, 2] <- 'Low'

# Print the data frame
print(df1)
          V1               V2          V3       V4        V5
1   4.583117 66.5115194628015 75.44751586 24.36195 13.069569
2  44.220007              Low 62.92211316 66.80556 65.310193
3  79.892485 38.3969637798145 71.01824014 41.76468 34.351647
4  12.189926              Low  0.06247733 78.81958 65.675813
5  56.094798 81.4640038879588 47.53165741 10.28646 32.037324
6  20.653139 44.8516341391951 22.01188852 43.48927 18.769112
7  12.753165 81.0064353048801 37.98165377 98.49570 78.229430
8  75.330786 81.2389509519562 61.27710033 89.30511  9.359499
9  89.504536 79.4342321110889 35.17979092 88.64691 46.677904
10 37.446278 43.9831687603146 11.11354243 17.50527 51.150546

We next replace all the values of column 4 that are larger than 70 with the value 1000.

df1.loc[(df1[3]>70), 3] = 1000
print(df1)
           0          1          2            3          4
0  21.651551        Low  33.662727  1000.000000  86.568045
1  90.834640        Low  14.093457    14.044287  14.252533
2   3.306218  46.425814  87.380927    11.623223  31.251596
3   3.830065  41.256834  23.384440    24.823248   5.872269
4   9.606805  58.170679  91.473171     2.374509  61.006675
5  16.074765   33.72285  16.766003    49.042524  28.847451
6  32.572317  54.522931  91.582778     9.422688  88.284203
7  37.438779  78.856242   5.942955    34.583682  94.225941
8  94.182952  87.189188  38.605325  1000.000000  30.175344
9  78.336114        Low  19.028794  1000.000000  58.704766
# Create a data frame with random numbers
df1 <- as.data.frame(matrix(runif(50, min = 0, max = 100), nrow = 10))

# Set values greater than 70 in the fourth column to 1000
df1[df1[,4] > 70, 4] <- 1000

# Print the data frame
print(df1)
         V1        V2       V3         V4        V5
1  59.99890 93.529980 64.78935   61.92565 31.170220
2  33.28235 30.122890 31.98206 1000.00000 40.947495
3  48.86130  6.072057 30.77200   67.29991  1.046711
4  95.44738 94.772694 21.97676 1000.00000 18.384952
5  48.29024 72.059627 36.94889   52.11357 84.272932
6  89.03502 14.229430 98.42192   65.98384 23.116178
7  91.44382 54.928466 15.42023 1000.00000 23.909996
8  60.87350 95.409124  9.10440 1000.00000  7.669117
9  41.06898 58.548335 14.19069 1000.00000 24.572368
10 14.70947 40.451028 69.00071   43.94315 73.213521

And finally, we can combine logical statements and replace values based on a combination of conditions. Let's replace all the values in column 5 that are between 20 and 80 with the string Middle.

df1.loc[((df1[4]>20) & (df1[4]<80)), 4] = 'Middle'
print(df1)
           0          1          2            3          4
0  21.651551        Low  33.662727  1000.000000  86.568045
1  90.834640        Low  14.093457    14.044287  14.252533
2   3.306218  46.425814  87.380927    11.623223     Middle
3   3.830065  41.256834  23.384440    24.823248   5.872269
4   9.606805  58.170679  91.473171     2.374509     Middle
5  16.074765   33.72285  16.766003    49.042524     Middle
6  32.572317  54.522931  91.582778     9.422688  88.284203
7  37.438779  78.856242   5.942955    34.583682  94.225941
8  94.182952  87.189188  38.605325  1000.000000     Middle
9  78.336114        Low  19.028794  1000.000000     Middle
# Create a data frame with random numbers
df1 <- as.data.frame(matrix(runif(50, min = 0, max = 100), nrow = 10))

# Set values between 20 and 80 in the fifth column to 'Middle'
df1[df1[,5] > 20 & df1[,5] < 80, 5] <- 'Middle'

# Print the data frame
print(df1)
         V1       V2       V3       V4               V5
1  84.74532 50.22996 62.99731 83.97678           Middle
2  49.75273 35.39046 18.38285 31.24482           Middle
3  38.79090 64.99852 86.36441 70.82903 98.5640884377062
4  24.64490 37.47140 74.65680 26.50178           Middle
5  11.10965 35.54454 66.82846 59.43432 93.7314089154825
6  38.99944 53.36879 61.80179 48.12898           Middle
7  57.19353 74.03344 37.22381 26.50327           Middle
8  21.68928 22.11029 52.98357 56.45904           Middle
9  44.47680 41.27461 87.46823 91.31882 15.2346616843715
10 21.79907 26.56867 58.17501 90.18744           Middle
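
An equivalent way to express such a replacement in pandas is the mask() method, which replaces values wherever a condition is True; a sketch, assuming column 4 still held its original numeric values:

cond = (df1[4] > 20) & (df1[4] < 80)
df1[4] = df1[4].mask(cond, 'Middle')
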
Key Concepts and Summary
  • DataFrames
  • Cleaning data
  1. Download data
  2. Make a new column with a new variable
  3. Sort the DataFrame with respect to value in the third column