10. Working with Data I: Data Cleaning
We use the Pandas package for data work. We can import the Pandas library using the import statement. Pandas allows easy organization of data in the spirit of the DataFrame concept in R. You can think of a DataFrame as an array with labels, similar to an Excel spreadsheet.
A DataFrame can be created in the following ways:
- From another DataFrame
- From a numpy array
- From lists
- From another pandas data structure called Series, which is basically a one-dimensional DataFrame, i.e., a column or row of a matrix (see the short sketch after this list)
- From a file such as a CSV file
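As a quick illustration of the Series route, here is a minimal sketch (the labels and values are made up for the example):

import pandas as pd

# A Series is a labeled one-dimensional array
s = pd.Series([73, 52, 36], index=['Accounting', 'Finance', 'Management'])
# A one-column DataFrame built directly from the Series
df_from_series = pd.DataFrame({'Frequency': s})
print(df_from_series)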
Later in this chapter we will be working with two data sets. You can download them from here:
Or the Excel version of it:
Here is a second data set in comma separated (.csv) format:
10.1. Create a DataFrame from Multiple Lists
If you scrape information from a website or a pdf document, you often end up with your data stored across various lists. Sometimes these lists simply hold the column information of a specific variable; sometimes the lists are nested lists, where one list contains the information of all variables per observation. pandas is (usually) intelligent enough to handle them all in a very similar way. I next illustrate the two list structures most often found in data work.
10.1.1. Every List Holds Information of a Separate Variable
Let us assume you observe information about 4 people. The unit of observation is therefore a person. For each person you know the name, age, and the city where they live. This information is stored in 3 separate lists.
There is an easy way to combine these lists into a DataFrame for easy statistical analysis. The process has two steps:
1. Use the zip() command to combine the lists
2. Assign the "zipped" lists to a Pandas DataFrame
import pandas as pd
names = ['Katie', 'Nick', 'James', 'Eva']
ages = [32, 32, 36, 31]
locations = ['London', 'Baltimore', 'Atlanta', 'Madrid']
# Now zip the lists together
zipped = list(zip(names, ages, locations))
# Assign the zipped lists into a Dataframe
myData = pd.DataFrame(zipped, columns=['Name', 'Age', 'Location'])
print(myData)
Name Age Location
0 Katie 32 London
1 Nick 32 Baltimore
2 James 36 Atlanta
3 Eva 31 Madrid
10.1.2. Every List Holds Information of One Unit of Observation
In this example, the information is stored per person. So each person is stored in a list that contains all 3 variables. These person lists are then stored in another list, so we are now dealing with a nested list.
The steps to move the information from the lists into a DataFrame are even simpler here. Pandas is intelligent in that way and recognizes the structure of the data: because each inner list already holds one complete observation, you can skip the zip() step and assign the nested list to a Pandas DataFrame directly.
import pandas as pd
# This is the structure of your data
data = [['Katie', 32, 'London'], ['Nik', 32, 'Toronto'], ['James', 36, 'Atlanta'], ['Evan', 31, 'Madrid']]
# Assign the multi-dimensional (or nested) list to Dataframe
myData = pd.DataFrame(data, columns=['Name', 'Age', 'Location'])
print(myData)
Name Age Location
0 Katie 32 London
1 Nik 32 Toronto
2 James 36 Atlanta
3 Evan 31 Madrid
10.2. Read Small Data Set from a Comma Separated (.csv) File
It is more common that your data is stored in a file such as a text file, an Excel spreadsheet, or a file format of some other data-management software such as Stata, R, SAS, or SPSS. The Pandas library provides routines (i.e., commands) that allow you to import data directly from the most common file formats.
In our first example, we will work with data stored as a comma separated file, called Lecture_Data_Excel_a.csv. We first import this file using the pd.read_csv command from the Pandas library.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats as st
import math as m
import seaborn as sns
import time # Imports system time module to time your script
plt.close('all') # close all open figures
# Read in small data from .csv file
# Filepath
filepath = 'Lecture_Data/'
# In windows you can also specify the absolute path to your data file
# filepath = 'C:/Dropbox/Towson/Teaching/3_ComputationalEconomics/Lectures/Lecture_Data/'
# ------------- Load data --------------------
df = pd.read_csv(filepath + 'Lecture_Data_Excel_a.csv', dtype={'Frequency': float})
# Let's have a look at it.
print(df)
Area Frequency Unnamed: 2 Unnamed: 3
0 Accounting 73.0 NaN NaN
1 Finance 52.0 NaN NaN
2 General management 36.0 NaN NaN
3 Marketing sales 64.0 NaN NaN
4 other 28.0 NaN NaN
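The two Unnamed columns come from empty columns in the raw file. If you already know which columns you need, one optional way to avoid them is the usecols argument of pd.read_csv:

# Only read the two columns we actually need
df = pd.read_csv(filepath + 'Lecture_Data_Excel_a.csv', usecols=['Area', 'Frequency'])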
Alternatively we could have used the following command:
df = pd.read_table(filepath + 'Lecture_Data_Excel_a.csv', sep=',')
# Let's have a look at it.
print(df)
Area Frequency Unnamed: 2 Unnamed: 3
0 Accounting 73 NaN NaN
1 Finance 52 NaN NaN
2 General management 36 NaN NaN
3 Marketing sales 64 NaN NaN
4 other 28 NaN NaN
The best way to import Excel files directly is probably to use the built-in importer in Pandas.
Important
In order for this to work you need to make sure that the xlrd package is installed. Open a terminal window and type:
conda install xlrd
Press yes at the prompt and it will install the package. Note that newer versions of Pandas read .xlsx files with the openpyxl package instead; it can be installed the same way with conda install openpyxl.
You can use it as follows:
# Read entire excel spreadsheet
df = pd.read_excel(filepath + 'Lecture_Data_Excel_a.xlsx', \
sheet_name = 'Lecture_Data_Excel_a', header = 0, usecols = 'A:D')
# Let's have a look at it.
print(df)
Area Frequency Unnamed: 2 Unnamed: 3
0 Accounting 73 NaN NaN
1 Finance 52 NaN NaN
2 General management 36 NaN NaN
3 Marketing sales 64 NaN NaN
4 other 28 NaN NaN
Note how I also specified the Excel sheet_name; this is convenient in case you have more complex Excel spreadsheets with multiple tabs in them.
The DataFrame has an index for each row and a column header. We can query a DataFrame as follows:
print('Shape', df.shape)
print('-------------------------')
print('Number of rows', len(df))
print('-------------------------')
print('Column headers', df.columns)
print('-------------------------')
print('Data types', df.dtypes)
print('-------------------------')
print('Index', df.index)
print('-------------------------')
Shape (5, 4)
-------------------------
Number of rows 5
-------------------------
Column headers Index(['Area', 'Frequency', 'Unnamed: 2', 'Unnamed:
3'], dtype='object')
-------------------------
Data types Area object
Frequency int64
Unnamed: 2 float64
Unnamed: 3 float64
dtype: object
-------------------------
Index RangeIndex(start=0, stop=5, step=1)
-------------------------
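If you prefer a single overview, the info() method is an optional one-stop alternative that bundles most of this information:

# Prints shape, column names, non-null counts, and dtypes in one call
df.info()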
Drop columns 3 and 4 and let's have a look at the data again.
# axis = 1 means that you are dropping a column
df.drop('Unnamed: 2', axis=1, inplace=True)
df.drop('Unnamed: 3', axis=1, inplace=True)
# Let's have a look at it
print(df)
Area Frequency
0 Accounting 73
1 Finance 52
2 General management 36
3 Marketing sales 64
4 other 28
Warning
Sometimes column headers have white spaces in them. It is probably a good idea to remove all white spaces from header names. You can use the replace() command to accomplish this:
df.columns = df.columns.str.replace(r'\s+', '', regex=True)
The r'\s+' is a regular expression, which is basically a pattern matching language. The \s matches any whitespace character and the + means look for one or more of them; they are replaced with '', i.e., an empty string.
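Here is a small self-contained illustration of what the replacement does, using made-up headers:

import pandas as pd

demo = pd.DataFrame({' First Name': ['Katie'], 'Home City ': ['London']})
print(demo.columns.tolist())   # [' First Name', 'Home City ']
demo.columns = demo.columns.str.replace(r'\s+', '', regex=True)
print(demo.columns.tolist())   # ['FirstName', 'HomeCity']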
We can make new columns using information from the existing columns.
Note
Making new columns in the DataFrame is very similar to making new
variables in Stata using the generate
or gen
command.
More specifically, we next generate a new variable called relFrequency
that contains the relative
frequencies.
# Make new column with relative frequency
df['relFrequency'] = df['Frequency']/df['Frequency'].sum()
# Let's have a look at it
print(df)
Area Frequency relFrequency
0 Accounting 73 0.288538
1 Finance 52 0.205534
2 General management 36 0.142292
3 Marketing sales 64 0.252964
4 other 28 0.110672
We can also just grab the relative frequencies out of the DataFrame and store them in a familiar numpy array. This is now a vector. Note that this only works if the column contains nothing but numbers.
xv = df['relFrequency'].values
print('Array xv is:', xv)
Array xv is: [0.28853755 0.2055336 0.14229249 0.25296443 0.11067194]
You can now do simple calculations with this vector as you did in the previous chapters such as:
print('The sum of the vector xv is: ', xv.sum())
The sum of the vector xv is: 1.0
Let us make an additional column so we have more data to play with:
df['random'] = df['Frequency']/np.sqrt(df['Frequency'].sum())
# Let's have a look at it.
print(df)
Area Frequency relFrequency random
0 Accounting 73 0.288538 4.589471
1 Finance 52 0.205534 3.269212
2 General management 36 0.142292 2.263301
3 Marketing sales 64 0.252964 4.023646
4 other 28 0.110672 1.760345
10.3. Some Pandas Tricks
10.3.1. Renaming Variables
For renaming column names we use a dictionary.
Note
Remember, dictionaries are similar to lists, but the elements of a dictionary are not indexed with 0, 1, 2, etc.; instead we use user-defined keys, so that a dictionary consists of key-value pairs such as:
{'key1': value1, 'key2': value2, ... }
It works as follows. The dictionary that you hand in here defines the old column name as key and the new column name as the associated value.
df = df.rename(columns={'Area': 'area', 'Frequency': 'absFrequency'})
print(df.head(3))
area absFrequency relFrequency random
0 Accounting 73 0.288538 4.589471
1 Finance 52 0.205534 3.269212
2 General management 36 0.142292 2.263301
The head(x) method allows us to print x rows from the top of the DataFrame, whereas the tail(x) method does the same from the bottom of the DataFrame.
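For instance, a quick optional check of the bottom of our DataFrame:

# Print the last two rows
print(df.tail(2))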
10.3.2. Dropping and Adding Columns
# Drop a column first so everything fits
df = df.drop('random', axis=1)
# Create a new column with some missing values
df['team'] = pd.Series([2, 4, np.nan, 6, 10], index=df.index)  # np.nan (np.NaN was removed in newer numpy)
# or insert a new column at a specific location
df.insert(loc=2, column='position', value=[np.nan, 1, np.nan, 1, 3])
print(df)
area absFrequency position relFrequency team
0 Accounting 73 NaN 0.288538 2.0
1 Finance 52 1.0 0.205534 4.0
2 General management 36 NaN 0.142292 NaN
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
10.3.3. Missing Values or NaNs
Count NaNs
Count the number of rows with missing values.
nans = df.shape[0] - df.dropna().shape[0]
print('{} rows have missing values'.format(nans))
2 rows have missing values
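If you instead want the number of missing values per column, an optional complement is to combine isnull() with sum():

# Count missing values column by column
print(df.isnull().sum())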
Select rows with NaNs or overwrite NaNs
If you want to select the rows with missing values you can index them with the isnull() method.
print(df[df.isnull().any(axis=1)])
area absFrequency position relFrequency team
0 Accounting 73 NaN 0.288538 2.0
2 General management 36 NaN 0.142292 NaN
Or if you want to search specific columns for missing values you can do
print(df[df['team'].isnull()])
area absFrequency position relFrequency team
2 General management 36 NaN 0.142292 NaN
Or if you want to grab all the rows without missing observations you can index the DataFrame with the notnull() method.
print(df[df['team'].notnull()])
area absFrequency position relFrequency team
0 Accounting 73 NaN 0.288538 2.0
1 Finance 52 1.0 0.205534 4.0
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
Delete rows with NaNs
print(df.dropna())
area absFrequency position relFrequency team
1 Finance 52 1.0 0.205534 4.0
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
You can also reassign it to the original DataFrame or assign a new one.
df2 = df.dropna()
print(df2)
area absFrequency position relFrequency team
1 Finance 52 1.0 0.205534 4.0
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
Overwriting NaN with specific values
If we want to overwrite the NaN entries with values we can use the fillna() method. In this example we replace all missing values with zeros.
df.fillna(value=0, inplace=True)
print(df)
area absFrequency position relFrequency team
0 Accounting 73 0.0 0.288538 2.0
1 Finance 52 1.0 0.205534 4.0
2 General management 36 0.0 0.142292 0.0
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
10.3.4. Dropping and Adding Rows to a DataFrame
We can use the append() method to add rows to the DataFrame. (Careful: append() was removed in newer versions of pandas; an equivalent pd.concat() sketch follows the example below.)
df = df.append(pd.Series(
[np.nan]*len(df.columns), # Fill cells with NaNs
index=df.columns),
ignore_index=True)
print(df.tail(3))
area absFrequency position relFrequency team
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
5 NaN NaN NaN NaN NaN
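If your pandas version no longer has append(), an equivalent sketch using pd.concat() looks like this:

# Build a one-row DataFrame of NaNs with the same columns as df ...
empty_row = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
# ... and concatenate it below the existing rows
df = pd.concat([df, empty_row], ignore_index=True)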
We can then fill this empty row with data.
df.loc[df.index[-1], 'area'] = 'Economics'
df.loc[df.index[-1], 'absFrequency'] = 25
print(df)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
If you want to drop this row you can:
print(df.drop(5, 0))
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
If you want the change to be permanent you need to either use the df.drop(5, inplace=True) option or reassign the DataFrame to itself: df = df.drop(5). The default dimension for dropping data is the row of the DataFrame, so you do not need to specify the , 0 part in the drop command. However, if you drop a column, then you need the dimension specifier , 1.
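A slightly more readable optional alternative is to spell out the axis keyword, which also works in newer pandas versions where the bare positional dimension is no longer accepted:

# Drop the row with index label 5 (rows are the default, axis=0)
df_no_row = df.drop(5, axis=0)
# Drop the 'team' column instead
df_no_col = df.drop('team', axis=1)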
10.3.5. Dropping Rows Based on Conditions
Before running a regression, a researcher might want to drop observations that are clearly not correct, such as negative age values.
Let us assume that in our current dataframe we have information telling us that the position variable must be strictly positive and cannot be zero. This would mean that row 0 and row 2 need to be dropped from our dataframe. Before we mess up our dataframe, let's make a new one and then drop rows 0 and 2 based on the value condition of the variable position.
df_temp = df.copy()  # copy() creates an independent DataFrame; a plain df_temp = df would only create a new reference
print(df_temp)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
Now drop the rows with a position
variable value of zero or less.
df_temp = df_temp[df_temp['position']>0]
print(df_temp)
area absFrequency position relFrequency team
1 Finance 52.0 1.0 0.205534 4.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
As you can see, we accomplished this by "keeping" the observations for which the position value is strictly positive, as opposed to "dropping" the ones that are zero or negative.
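If you prefer to phrase the same step as an explicit drop, an equivalent optional sketch collects the index labels of the offending rows first:

# Equivalent drop-based version, starting again from df
df_alt = df.copy()
bad_rows = df_alt[df_alt['position'] <= 0].index
# Note: rows where position is NaN are not caught by <= 0,
# unlike the keep-filter above, which drops them as well
df_alt = df_alt.drop(bad_rows)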
And also let’s make sure that the original dataframe is still what it was.
print(df)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
10.3.6. Sorting and Reindexing DataFrames
We next sort the DataFrame by a certain column (from highest to lowest) using
the sort_values()
DataFrame method.
Warning
The old pandas method/function .sort() was replaced with .sort_values() and no longer works in newer versions of pandas.
df.sort_values('absFrequency', ascending=False, inplace=True)
print(df)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
3 Marketing sales 64.0 1.0 0.252964 6.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
The inplace=True
option will immediately overwrite the DataFrame df
with the new, sorted, one. If you set inplace=False
you would have to
assign a new DataFrame in order to see the changes:
df1 = df.sort_values('absFrequency', ascending=False, inplace=False)
Where df1
would now be the sorted DataFrame and the original df
DataFrame would still be unsorted.
Note
However, we now have two objects holding identical data. This is often redundant and can eventually become a memory problem if the DataFrames are very large. Setting inplace=True is probably a good default setting.
If we want we could then reindex the DataFrame according to the new sort order.
df.index = range(len(df.index))
print(df)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Marketing sales 64.0 1.0 0.252964 6.0
2 Finance 52.0 1.0 0.205534 4.0
3 General management 36.0 0.0 0.142292 0.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
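Pandas also has a built-in method for this step; reset_index() with drop=True is an optional one-liner that does the same reindexing:

# Replace the old index with a fresh 0..n-1 range
df = df.reset_index(drop=True)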
10.3.7. Merging DataFrames or Adding Columns
Let us create a new DataFrame that has the variable team
in common with the previous
DataFrame. This new DataFrame contains additional team information that we
would like to merge into the original DataFrame.
First redefine the original DataFrame:
df = pd.DataFrame({'area': ['Accounting','Marketing','Finance','Management', 'other'],\
'absFrequency': [73,64,52,36,28], \
'position': [0,1,1,0,3], \
'relFrequency': [0.28,0.25,0.20,0.14,0.11], \
'team': [2,6,4,0,10]})
print(df)
area absFrequency position relFrequency team
0 Accounting 73 0 0.28 2
1 Marketing 64 1 0.25 6
2 Finance 52 1 0.20 4
3 Management 36 0 0.14 0
4 other 28 3 0.11 10
Second, we define the new DataFrame and add some values to it:
df_team = pd.DataFrame({'team': [3,4,5,6,7,8,9,10], \
'sales': [500,300,250,450,345,123,432,890]})
print(df_team)
team sales
0 3 500
1 4 300
2 5 250
3 6 450
4 7 345
5 8 123
6 9 432
7 10 890
This new DataFrame contains some sales information for each team.
Before we merge in the new info we drop the area
column so that
the DataFrame fits on the screen. We now work with the following two
dataframes.
df = df.drop('area', axis=1)
print('-----------------------------')
print('First, main, DataFrame: LEFT')
print('-----------------------------')
print(df)
print('')
print('-----------------------------')
print('Second new DataFrame: RIGHT')
print('-----------------------------')
print(df_team)
-----------------------------
First, main, DataFrame: LEFT
-----------------------------
absFrequency position relFrequency team
0 73 0 0.28 2
1 64 1 0.25 6
2 52 1 0.20 4
3 36 0 0.14 0
4 28 3 0.11 10
-----------------------------
Second new DataFrame: RIGHT
-----------------------------
team sales
0 3 500
1 4 300
2 5 250
3 6 450
4 7 345
5 8 123
6 9 432
7 10 890
When merging two DataFrames we need to find a common variable that appears in both DataFrames along which we can connect them. In our example we use the team column as merge key (or key variable) because it appears in both DataFrames. We now have four different options for merging the two DataFrames.
Note
Merge Summary
The inner merge only keeps observations that are present in both DataFrames.
The left merge keeps all of the observations of the original (or main, left) DataFrame and adds the new information from the right DataFrame where available.
The right merge keeps all of the observations of the new (right) DataFrame and adds the info of the original main DataFrame where available.
The outer merge merges by team whenever possible but keeps the info from both DataFrames, even for observations where no merge/overlap occurred.
Now we do it step by step. Inner merge first. As you will see, the inner merge only keeps observations that are present in both DataFrames. This is often too restrictive as many observations will be dropped.
print('Inner Merge')
print('-----------')
print(pd.merge(df, df_team, on='team', how='inner'))
Inner Merge
-----------
absFrequency position relFrequency team sales
0 64 1 0.25 6 450
1 52 1 0.20 4 300
2 28 3 0.11 10 890
The left merge keeps the original (or main) DataFrame and adds the new information from the right DataFrame. When it cannot find a team number of the new DataFrame in the main DataFrame, it simply drops that info from the new DataFrame.
print('Left Merge')
print('----------')
print(pd.merge(df, df_team, on='team', how='left'))
Left Merge
----------
absFrequency position relFrequency team sales
0 73 0 0.28 2 NaN
1 64 1 0.25 6 450.0
2 52 1 0.20 4 300.0
3 36 0 0.14 0 NaN
4 28 3 0.11 10 890.0
The right merge keeps all rows of the new DataFrame and adds the matching info of the original main DataFrame.
print('Right Merge')
print('-----------')
print(pd.merge(df, df_team, on='team', how='right'))
Right Merge
-----------
absFrequency position relFrequency team sales
0 NaN NaN NaN 3 500
1 52.0 1.0 0.20 4 300
2 NaN NaN NaN 5 250
3 64.0 1.0 0.25 6 450
4 NaN NaN NaN 7 345
5 NaN NaN NaN 8 123
6 NaN NaN NaN 9 432
7 28.0 3.0 0.11 10 890
The outer merge method merges by team whenever possible but keeps the info from both DataFrames, even for observations where no merge/overlap occurred.
print('Outer Merge')
print('-----------')
print(pd.merge(df, df_team, on='team', how='outer'))
Outer Merge
-----------
absFrequency position relFrequency team sales
0 73.0 0.0 0.28 2 NaN
1 64.0 1.0 0.25 6 450.0
2 52.0 1.0 0.20 4 300.0
3 36.0 0.0 0.14 0 NaN
4 28.0 3.0 0.11 10 890.0
5 NaN NaN NaN 3 500.0
6 NaN NaN NaN 5 250.0
7 NaN NaN NaN 7 345.0
8 NaN NaN NaN 8 123.0
9 NaN NaN NaN 9 432.0
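When a merge does not behave as expected, it can help to see where each row came from. The optional indicator argument of pd.merge adds a _merge column flagging each row as 'left_only', 'right_only', or 'both':

# Diagnostic merge: the _merge column records each row's origin
merged = pd.merge(df, df_team, on='team', how='outer', indicator=True)
print(merged['_merge'].value_counts())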
10.3.8. Converting Column Types
When converting text into a DataFrame to conduct statistical analysis, it is sometimes necessary to convert a column of numbers that is still stored as strings into floats, so that we can do math with it. Let's drop another column first so the DataFrame fits nicely in the output window.
df = df.drop('absFrequency', axis=1)
# Let's have a look at it
print(df)
position relFrequency team
0 0 0.28 2
1 1 0.25 6
2 1 0.20 4
3 0 0.14 0
4 3 0.11 10
We now generate a column of "words" or "strings" that happen to be numbers. Since they are stored as words, we cannot run statistical analysis on these numbers yet.
df['string'] = pd.Series(['2','40','34','6','10'], index=df.index)
# Print DataFrame
print(df)
# Print summary statistics
print(df.describe())
position relFrequency team string
0 0 0.28 2 2
1 1 0.25 6 40
2 1 0.20 4 34
3 0 0.14 0 6
4 3 0.11 10 10
position relFrequency team
count 5.000000 5.000000 5.000000
mean 1.000000 0.196000 4.400000
std 1.224745 0.071624 3.847077
min 0.000000 0.110000 0.000000
25% 0.000000 0.140000 2.000000
50% 1.000000 0.200000 4.000000
75% 1.000000 0.250000 6.000000
max 3.000000 0.280000 10.000000
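You can confirm why describe() skipped the new column by checking the data types; the string column shows up as object:

# The 'string' column is stored as text (dtype object), so describe() ignores it
print(df.dtypes)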
We first need to reassign the data type of the string
column. We then
rename the column and print summary statistics.
# Transform strings into floats, i.e., words into numbers
df['string'] = df['string'].astype(float)
# Rename the column
df = df.rename(columns={'string': 'salary'})
# Print summary statistics
print(df.describe())
position relFrequency team salary
count 5.000000 5.000000 5.000000 5.000000
mean 1.000000 0.196000 4.400000 18.400000
std 1.224745 0.071624 3.847077 17.343587
min 0.000000 0.110000 0.000000 2.000000
25% 0.000000 0.140000 2.000000 6.000000
50% 1.000000 0.200000 4.000000 10.000000
75% 1.000000 0.250000 6.000000 34.000000
max 3.000000 0.280000 10.000000 40.000000
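If a column might contain entries that cannot be parsed as numbers, astype(float) will raise an error. An optional, more forgiving alternative is pd.to_numeric, which can turn unparsable entries into NaN:

# errors='coerce' converts anything unparsable into NaN instead of raising an error
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')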
10.3.9. Replacing Values in a DataFrame Conditional on Criteria
If we need to replace values in a DataFrame based on certain criteria it is often more efficient to use indexing as opposed to loops. Let us first create a DataFrame with some random values:
df1 = pd.DataFrame(np.random.rand(10,5)*100)
print(df1)
0 1 2 3 4
0 76.231650 73.856596 98.223677 37.419310 56.534611
1 93.020350 59.102390 87.029799 6.289979 81.704176
2 67.060606 30.773903 27.633734 35.819372 52.103559
3 80.603039 63.205036 16.050168 67.673033 19.890453
4 53.266469 8.721019 42.579826 36.000875 15.518005
5 20.139680 49.216850 9.924973 31.078271 77.523813
6 37.838850 65.905862 96.325081 99.530884 48.992624
7 52.408115 40.192827 60.132690 21.112231 52.065246
8 93.561547 36.599218 38.663248 56.510712 26.487473
9 8.446148 78.818387 74.815600 40.085175 90.895161
We next replace all the values in column 2 (the column labeled 1) that are smaller than 30 with the string Low. Note that we use .loc indexing; the chained form df1[1][df1[1]<30] = 'Low' is unreliable and triggers a SettingWithCopyWarning in newer versions of pandas.
df1.loc[df1[1] < 30, 1] = 'Low'
print(df1)
0 1 2 3 4
0 76.231650 73.856596 98.223677 37.419310 56.534611
1 93.020350 59.10239 87.029799 6.289979 81.704176
2 67.060606 30.773903 27.633734 35.819372 52.103559
3 80.603039 63.205036 16.050168 67.673033 19.890453
4 53.266469 Low 42.579826 36.000875 15.518005
5 20.139680 49.21685 9.924973 31.078271 77.523813
6 37.838850 65.905862 96.325081 99.530884 48.992624
7 52.408115 40.192827 60.132690 21.112231 52.065246
8 93.561547 36.599218 38.663248 56.510712 26.487473
9 8.446148 78.818387 74.815600 40.085175 90.895161
We next replace all the values of column 4 (label 3) that are larger than 70 with the value 1000.
df1.loc[(df1[3]>70), 3] = 1000
print(df1)
0 1 2 3 4
0 76.231650 73.856596 98.223677 37.419310 56.534611
1 93.020350 59.10239 87.029799 6.289979 81.704176
2 67.060606 30.773903 27.633734 35.819372 52.103559
3 80.603039 63.205036 16.050168 67.673033 19.890453
4 53.266469 Low 42.579826 36.000875 15.518005
5 20.139680 49.21685 9.924973 31.078271 77.523813
6 37.838850 65.905862 96.325081 1000.000000 48.992624
7 52.408115 40.192827 60.132690 21.112231 52.065246
8 93.561547 36.599218 38.663248 56.510712 26.487473
9 8.446148 78.818387 74.815600 40.085175 90.895161
And finally, we can combine logical statements and replace values based on a combination of conditions. Let's replace all the values in column 5 (label 4) that are between 20 and 80 with the string Middle.
df1.loc[((df1[4]>20) & (df1[4]<80)), 4] = 'Middle'
print(df1)
0 1 2 3 4
0 76.231650 73.856596 98.223677 37.419310 Middle
1 93.020350 59.10239 87.029799 6.289979 81.704176
2 67.060606 30.773903 27.633734 35.819372 Middle
3 80.603039 63.205036 16.050168 67.673033 19.890453
4 53.266469 Low 42.579826 36.000875 15.518005
5 20.139680 49.21685 9.924973 31.078271 Middle
6 37.838850 65.905862 96.325081 1000.000000 Middle
7 52.408115 40.192827 60.132690 21.112231 Middle
8 93.561547 36.599218 38.663248 56.510712 Middle
9 8.446148 78.818387 74.815600 40.085175 90.895161
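For simple two-way replacements, numpy's where function is another loop-free option. Here is a minimal sketch on a fresh example DataFrame (df1 above now mixes strings and floats, so we start clean):

import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.rand(5, 3) * 100)
# Label each cell 'High' if it exceeds 50 and 'LowVal' otherwise, in one vectorized step
labels = pd.DataFrame(np.where(df2 > 50, 'High', 'LowVal'),
                      index=df2.index, columns=df2.columns)
print(labels)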