10. Working with Data I: Data Cleaning
We use the Pandas package for data work. We can import the Pandas library using the import statement. Pandas allows easy organization of data in the spirit of the DataFrame concept in R. You can think of a DataFrame as an array with labels, similar to an Excel spreadsheet.
A DataFrame can be created in the following ways:
- From another DataFrame
- From a numpy array
- From lists
- From another pandas data structure called Series, which is basically a one-dimensional DataFrame, i.e., a column or row of a matrix (see the short sketch after this list)
- From a file such as a CSV file
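As a quick illustration of the Series route, here is a minimal sketch (the labels and values are made up for the example):

import pandas as pd

# A Series is a labeled one-dimensional array
s = pd.Series([73, 52, 36], index=['Accounting', 'Finance', 'Management'])
# A one-column DataFrame built directly from the Series
df_from_series = pd.DataFrame({'Frequency': s})
print(df_from_series)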
Later in this chapter we will be working with two data sets. You can download them from here:
Or the Excel version of it:
Here is a second data set in comma separated (.csv) format:
10.1. Create a DataFrame from Multiple Lists
If you scrape information from a website or a pdf document, you often end up with your data stored across various lists. Sometimes these lists simply hold the column information of a specific variable; sometimes the lists are nested lists, where one list contains the information of all variables per observation. pandas is (usually) intelligent enough to handle them all in a very similar way. I next illustrate the two list structures most often found in data work.
10.1.1. Every List Holds Information of a Separate Variable
Let us assume you observe information about 4 people. The unit of observation is therefore a person. For each person you know the name, age, and the city where they live. This information is stored in 3 separate lists.
There is an easy way to combine these lists into a DataFrame for easy statistical analysis. The process has two steps:
1. Use the zip() command to combine the lists
2. Assign the "zipped" lists to a Pandas DataFrame
import pandas as pd
names = ['Katie', 'Nick', 'James', 'Eva']
ages = [32, 32, 36, 31]
locations = ['London', 'Baltimore', 'Atlanta', 'Madrid']
# Now zip the lists together
zipped = list(zip(names, ages, locations))
# Assign the zipped lists into a Dataframe
myData = pd.DataFrame(zipped, columns=['Name', 'Age', 'Location'])
print(myData)
Name Age Location
0 Katie 32 London
1 Nick 32 Baltimore
2 James 36 Atlanta
3 Eva 31 Madrid
10.1.2. Every List Holds Information of One Unit of Observation
In this example, the information is stored per person. So each person is stored in a list that contains all 3 variables. These person lists are then stored in another list, so we are now dealing with a nested list.
The steps to move the information from the lists into a DataFrame are even simpler here. Pandas is intelligent in that way and recognizes the structure of the data: because each inner list already holds one complete observation, you can skip the zip() step and assign the nested list to a Pandas DataFrame directly.
import pandas as pd
# This is the structure of your data
data = [['Katie', 32, 'London'], ['Nik', 32, 'Toronto'], ['James', 36, 'Atlanta'], ['Evan', 31, 'Madrid']]
# Assign the multi-dimensional (or nested) list to Dataframe
myData = pd.DataFrame(data, columns=['Name', 'Age', 'Location'])
print(myData)
Name Age Location
0 Katie 32 London
1 Nik 32 Toronto
2 James 36 Atlanta
3 Evan 31 Madrid
10.2. Read Small Data Set from a Comma Separated (.csv) File
It is more common that your data is stored in a file such as a text file, an Excel spreadsheet, or a file format of some other data-management software such as Stata, R, SAS, or SPSS. The Pandas library provides routines (i.e., commands) that allow you to import data directly from the most common file formats.
In our first example, we will work with data stored as a comma separated file, called Lecture_Data_Excel_a.csv. We first import this file using the pd.read_csv command from the Pandas library.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats as st
import math as m
import seaborn as sns
import time # Imports system time module to time your script
plt.close('all') # close all open figures
# Read in small data from .csv file
# Filepath
filepath = 'Lecture_Data/'
# In windows you can also specify the absolute path to your data file
# filepath = 'C:/Dropbox/Towson/Teaching/3_ComputationalEconomics/Lectures/Lecture_Data/'
# ------------- Load data --------------------
df = pd.read_csv(filepath + 'Lecture_Data_Excel_a.csv', dtype={'Frequency': float})
# Let's have a look at it.
print(df)
Area Frequency Unnamed: 2 Unnamed: 3
0 Accounting 73.0 NaN NaN
1 Finance 52.0 NaN NaN
2 General management 36.0 NaN NaN
3 Marketing sales 64.0 NaN NaN
4 other 28.0 NaN NaN
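The two Unnamed columns come from empty columns in the raw file. If you already know which columns you need, one optional way to avoid them is the usecols argument of pd.read_csv:

# Only read the two columns we actually need
df = pd.read_csv(filepath + 'Lecture_Data_Excel_a.csv', usecols=['Area', 'Frequency'])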
Alternatively we could have used the following command:
df = pd.read_table(filepath + 'Lecture_Data_Excel_a.csv', sep=',')
# Let's have a look at it.
print(df)
Area Frequency Unnamed: 2 Unnamed: 3
0 Accounting 73 NaN NaN
1 Finance 52 NaN NaN
2 General management 36 NaN NaN
3 Marketing sales 64 NaN NaN
4 other 28 NaN NaN
The best way to import Excel files directly is probably to use the built-in importer in Pandas.
Important
In order for this to work you need to make sure that the xlrd package is installed. Open a terminal window and type:
conda install xlrd
Press yes at the prompt and it will install the package. Note that newer versions of Pandas read .xlsx files with the openpyxl package instead; it can be installed the same way with conda install openpyxl.
You can use it as follows:
# Read entire excel spreadsheet
df = pd.read_excel(filepath + 'Lecture_Data_Excel_a.xlsx', \
sheet_name = 'Lecture_Data_Excel_a', header = 0, usecols = 'A:D')
# Let's have a look at it.
print(df)
Area Frequency Unnamed: 2 Unnamed: 3
0 Accounting 73 NaN NaN
1 Finance 52 NaN NaN
2 General management 36 NaN NaN
3 Marketing sales 64 NaN NaN
4 other 28 NaN NaN
Note how I also specified the Excel sheet_name; this is convenient in case you have more complex Excel spreadsheets with multiple tabs in them.
The DataFrame has an index for each row and a column header. We can query a DataFrame as follows:
print('Shape', df.shape)
print('-------------------------')
print('Number of rows', len(df))
print('-------------------------')
print('Column headers', df.columns)
print('-------------------------')
print('Data types', df.dtypes)
print('-------------------------')
print('Index', df.index)
print('-------------------------')
Shape (5, 4)
-------------------------
Number of rows 5
-------------------------
Column headers Index(['Area', 'Frequency', 'Unnamed: 2', 'Unnamed:
3'], dtype='object')
-------------------------
Data types Area object
Frequency int64
Unnamed: 2 float64
Unnamed: 3 float64
dtype: object
-------------------------
Index RangeIndex(start=0, stop=5, step=1)
-------------------------
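If you prefer a single overview, the info() method is an optional one-stop alternative that bundles most of this information:

# Prints shape, column names, non-null counts, and dtypes in one call
df.info()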
Drop columns 3 and 4 and let's have a look at the data again.
# axis = 1 means that you are dropping a column
df.drop('Unnamed: 2', axis=1, inplace=True)
df.drop('Unnamed: 3', axis=1, inplace=True)
# Let's have a look at it
print(df)
Area Frequency
0 Accounting 73
1 Finance 52
2 General management 36
3 Marketing sales 64
4 other 28
Warning
Sometimes column headers have white spaces in them. It is probably a good idea to remove all white spaces from header names. You can use the replace() command to accomplish this:
df.columns = df.columns.str.replace(r'\s+', '', regex=True)
The r'\s+' is a regular expression, which is basically a pattern matching language. The \s matches any whitespace character and the + means look for one or more of them; they are replaced with '', i.e., an empty string.
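Here is a small self-contained illustration of what the replacement does, using made-up headers:

import pandas as pd

demo = pd.DataFrame({' First Name': ['Katie'], 'Home City ': ['London']})
print(demo.columns.tolist())   # [' First Name', 'Home City ']
demo.columns = demo.columns.str.replace(r'\s+', '', regex=True)
print(demo.columns.tolist())   # ['FirstName', 'HomeCity']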
We can make new columns using information from the existing columns.
Note
Making new columns in the DataFrame is very similar to making new
variables in Stata using the generate
or gen
command.
More specifically, we next generate a new variable called relFrequency
that contains the relative
frequencies.
# Make new column with relative frequency
df['relFrequency'] = df['Frequency']/df['Frequency'].sum()
# Let's have a look at it
print(df)
Area Frequency relFrequency
0 Accounting 73 0.288538
1 Finance 52 0.205534
2 General management 36 0.142292
3 Marketing sales 64 0.252964
4 other 28 0.110672
We can also just grab the relative frequencies out of the DataFrame and store them in a familiar numpy array. This is now a vector. Note that this only works if the column contains nothing but numbers.
xv = df['relFrequency'].values
print('Array xv is:', xv)
Array xv is: [0.28853755 0.2055336 0.14229249 0.25296443 0.11067194]
You can now do simple calculations with this vector as you did in the previous chapters such as:
print('The sum of the vector xv is: ', xv.sum())
The sum of the vector xv is: 1.0
Let us make an additional column so we have more data to play with:
df['random'] = df['Frequency']/np.sqrt(df['Frequency'].sum())
# Let's have a look at it.
print(df)
Area Frequency relFrequency random
0 Accounting 73 0.288538 4.589471
1 Finance 52 0.205534 3.269212
2 General management 36 0.142292 2.263301
3 Marketing sales 64 0.252964 4.023646
4 other 28 0.110672 1.760345
10.3. Some Pandas Tricks
10.3.1. Renaming Variables
For renaming column names we use a dictionary.
Note
Remember, dictionaries are similar to lists, but the elements of a dictionary are not indexed with 0, 1, 2, etc.; instead we use user-defined keys, so that a dictionary consists of key-value pairs such as:
{'key1': value1, 'key2': value2, ... }
It works as follows. The dictionary that you hand in here defines the old column name as key and the new column name as the associated value.
df = df.rename(columns={'Area': 'area', 'Frequency': 'absFrequency'})
print(df.head(3))
area absFrequency relFrequency random
0 Accounting 73 0.288538 4.589471
1 Finance 52 0.205534 3.269212
2 General management 36 0.142292 2.263301
The head(x) method allows us to print x rows from the top of the DataFrame, whereas the tail(x) method does the same from the bottom of the DataFrame.
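For instance, a quick optional check of the bottom of our DataFrame:

# Print the last two rows
print(df.tail(2))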
10.3.2. Dropping and Adding Columns
# Drop a column first so everything fits
df = df.drop('random', axis=1)
# Create a new column with some missing values
df['team'] = pd.Series([2, 4, np.nan, 6, 10], index=df.index)  # np.nan (np.NaN was removed in newer numpy)
# or insert a new column at a specific location
df.insert(loc=2, column='position', value=[np.nan, 1, np.nan, 1, 3])
print(df)
area absFrequency position relFrequency team
0 Accounting 73 NaN 0.288538 2.0
1 Finance 52 1.0 0.205534 4.0
2 General management 36 NaN 0.142292 NaN
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
10.3.3. Missing Values or NaNs
Count NaNs
Count the number of rows with missing values.
nans = df.shape[0] - df.dropna().shape[0]
print('{} rows have missing values'.format(nans))
2 rows have missing values
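If you instead want the number of missing values per column, an optional complement is to combine isnull() with sum():

# Count missing values column by column
print(df.isnull().sum())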
Select rows with NaNs or overwrite NaNs
If you want to select the rows with missing values you can index them with the isnull() method.
print(df[df.isnull().any(axis=1)])
area absFrequency position relFrequency team
0 Accounting 73 NaN 0.288538 2.0
2 General management 36 NaN 0.142292 NaN
Or if you want to search specific columns for missing values you can do
print(df[df['team'].isnull()])
area absFrequency position relFrequency team
2 General management 36 NaN 0.142292 NaN
Or if you want to grab all the rows without missing observations you can index the DataFrame with the notnull() method.
print(df[df['team'].notnull()])
area absFrequency position relFrequency team
0 Accounting 73 NaN 0.288538 2.0
1 Finance 52 1.0 0.205534 4.0
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
Delete rows with NaNs
print(df.dropna())
area absFrequency position relFrequency team
1 Finance 52 1.0 0.205534 4.0
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
You can also reassign it to the original DataFrame or assign a new one.
df2 = df.dropna()
print(df2)
area absFrequency position relFrequency team
1 Finance 52 1.0 0.205534 4.0
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
Overwriting NaN with specific values
If we want to overwrite the NaN entries with values we can use the fillna() method. In this example we replace all missing values with zeros.
df.fillna(value=0, inplace=True)
print(df)
area absFrequency position relFrequency team
0 Accounting 73 0.0 0.288538 2.0
1 Finance 52 1.0 0.205534 4.0
2 General management 36 0.0 0.142292 0.0
3 Marketing sales 64 1.0 0.252964 6.0
4 other 28 3.0 0.110672 10.0
10.3.4. Dropping and Adding Rows to a DataFrame
We can use the append() method to add rows to the DataFrame. (Careful: append() was removed in newer versions of pandas; an equivalent pd.concat() sketch follows the example below.)
df = df.append(pd.Series(
[np.nan]*len(df.columns), # Fill cells with NaNs
index=df.columns),
ignore_index=True)
print(df.tail(3))
area absFrequency position relFrequency team
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
5 NaN NaN NaN NaN NaN
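If your pandas version no longer has append(), an equivalent sketch using pd.concat() looks like this:

# Build a one-row DataFrame of NaNs with the same columns as df ...
empty_row = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
# ... and concatenate it below the existing rows
df = pd.concat([df, empty_row], ignore_index=True)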
We can then fill this empty row with data.
df.loc[df.index[-1], 'area'] = 'Economics'
df.loc[df.index[-1], 'absFrequency'] = 25
print(df)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
If you want to drop this row you can:
print(df.drop(5, 0))
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
If you want the change to be permanent you need to either use the df.drop(5, inplace=True) option or reassign the DataFrame to itself: df = df.drop(5). The default dimension for dropping data is the row of the DataFrame, so you do not need to specify the , 0 part in the drop command. However, if you drop a column, then you need the dimension specifier , 1.
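A slightly more readable optional alternative is to spell out the axis keyword, which also works in newer pandas versions where the bare positional dimension is no longer accepted:

# Drop the row with index label 5 (rows are the default, axis=0)
df_no_row = df.drop(5, axis=0)
# Drop the 'team' column instead
df_no_col = df.drop('team', axis=1)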
10.3.5. Dropping Rows Based on Conditions
Before running a regression, a researcher might want to drop observations that are clearly not correct, such as negative age values.
Let us assume that in our current dataframe we have information telling us that the position variable must be strictly positive and cannot be zero. This would mean that row 0 and row 2 need to be dropped from our dataframe. Before we mess up our dataframe, let's make a new one and then drop rows 0 and 2 based on the value condition of the variable position.
df_temp = df.copy()  # copy() creates an independent DataFrame; a plain df_temp = df would only create a new reference
print(df_temp)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
Now drop the rows with a position
variable value of zero or less.
df_temp = df_temp[df_temp['position']>0]
print(df_temp)
area absFrequency position relFrequency team
1 Finance 52.0 1.0 0.205534 4.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
As you can see, we accomplished this by "keeping" the observations for which the position value is strictly positive, as opposed to "dropping" the ones that are zero or negative.
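If you prefer to phrase the same step as an explicit drop, an equivalent optional sketch collects the index labels of the offending rows first:

# Equivalent drop-based version, starting again from df
df_alt = df.copy()
bad_rows = df_alt[df_alt['position'] <= 0].index
# Note: rows where position is NaN are not caught by <= 0,
# unlike the keep-filter above, which drops them as well
df_alt = df_alt.drop(bad_rows)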
And also let’s make sure that the original dataframe is still what it was.
print(df)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
3 Marketing sales 64.0 1.0 0.252964 6.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
10.3.6. Sorting and Reindexing DataFrames
We next sort the DataFrame by a certain column (from highest to lowest) using
the sort_values()
DataFrame method.
Warning
The old pandas method/function .sort() was replaced with .sort_values() and no longer works in newer versions of pandas.
df.sort_values('absFrequency', ascending=False, inplace=True)
print(df)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
3 Marketing sales 64.0 1.0 0.252964 6.0
1 Finance 52.0 1.0 0.205534 4.0
2 General management 36.0 0.0 0.142292 0.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
The inplace=True
option will immediately overwrite the DataFrame df
with the new, sorted, one. If you set inplace=False
you would have to
assign a new DataFrame in order to see the changes:
df1 = df.sort_values('absFrequency', ascending=False, inplace=False)
Where df1
would now be the sorted DataFrame and the original df
DataFrame would still be unsorted.
Note
However, we now have two objects holding identical data. This is often redundant and can eventually become a memory problem if the DataFrames are very large. Setting inplace=True is probably a good default setting.
If we want we could then reindex the DataFrame according to the new sort order.
df.index = range(len(df.index))
print(df)
area absFrequency position relFrequency team
0 Accounting 73.0 0.0 0.288538 2.0
1 Marketing sales 64.0 1.0 0.252964 6.0
2 Finance 52.0 1.0 0.205534 4.0
3 General management 36.0 0.0 0.142292 0.0
4 other 28.0 3.0 0.110672 10.0
5 Economics 25.0 NaN NaN NaN
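Pandas also has a built-in method for this step; reset_index() with drop=True is an optional one-liner that does the same reindexing:

# Replace the old index with a fresh 0..n-1 range
df = df.reset_index(drop=True)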
10.3.7. Merging DataFrames or Adding Columns
Let us create a new DataFrame that has the variable team
in common with the previous
DataFrame. This new DataFrame contains additional team information that we
would like to merge into the original DataFrame.
First redefine the original DataFrame:
df = pd.DataFrame({'area': ['Accounting','Marketing','Finance','Management', 'other'],\
'absFrequency': [73,64,52,36,28], \
'position': [0,1,1,0,3], \
'relFrequency': [0.28,0.25,0.20,0.14,0.11], \
'team': [2,6,4,0,10]})
print(df)
area absFrequency position relFrequency team
0 Accounting 73 0 0.28 2
1 Marketing 64 1 0.25 6
2 Finance 52 1 0.20 4
3 Management 36 0 0.14 0
4 other 28 3 0.11 10
Second, we define the new DataFrame and add some values to it:
df_team = pd.DataFrame({'team': [3,4,5,6,7,8,9,10], \
'sales': [500,300,250,450,345,123,432,890]})
print(df_team)
team sales
0 3 500
1 4 300
2 5 250
3 6 450
4 7 345
5 8 123
6 9 432
7 10 890
This new DataFrame contains some sales information for each team.
Before we merge in the new info we drop the area
column so that
the DataFrame fits on the screen. We now work with the following two
dataframes.
df = df.drop('area', axis=1)
print('-----------------------------')
print('First, main, DataFrame: LEFT')
print('-----------------------------')
print(df)
print('')
print('-----------------------------')
print('Second new DataFrame: RIGHT')
print('-----------------------------')
print(df_team)
-----------------------------
First, main, DataFrame: LEFT
-----------------------------
absFrequency position relFrequency team
0 73 0 0.28 2
1 64 1 0.25 6
2 52 1 0.20 4
3 36 0 0.14 0
4 28 3 0.11 10
-----------------------------
Second new DataFrame: RIGHT
-----------------------------
team sales
0 3 500
1 4 300
2 5 250
3 6 450
4 7 345
5 8 123
6 9 432
7 10 890
When merging two DataFrames we need to find a common variable that appears in both DataFrames along which we can connect them. In our example we use the team column as merge key (or key variable) because it appears in both DataFrames. We now have four different options for merging the two DataFrames.
Note
Merge Summary
The inner merge only keeps observations that are present in both DataFrames.
The left merge keeps all of the observations of the original (or main, left) DataFrame and adds the new information from the right DataFrame where available.
The right merge keeps all of the observations of the new (right) DataFrame and adds the info of the original main DataFrame where available.
The outer merge merges by team whenever possible but keeps the info from both DataFrames, even for observations where no merge/overlap occurred.
Now we do it step by step. Inner merge first. As you will see, the inner merge only keeps observations that are present in both DataFrames. This is often too restrictive as many observations will be dropped.
print('Inner Merge')
print('-----------')
print(pd.merge(df, df_team, on='team', how='inner'))
Inner Merge
-----------
absFrequency position relFrequency team sales
0 64 1 0.25 6 450
1 52 1 0.20 4 300
2 28 3 0.11 10 890
The left merge keeps the original (or main) DataFrame and adds the new information from the right DataFrame. When it cannot find a team number of the new DataFrame in the main DataFrame, it simply drops that info from the new DataFrame.
print('Left Merge')
print('----------')
print(pd.merge(df, df_team, on='team', how='left'))
Left Merge
----------
absFrequency position relFrequency team sales
0 73 0 0.28 2 NaN
1 64 1 0.25 6 450.0
2 52 1 0.20 4 300.0
3 36 0 0.14 0 NaN
4 28 3 0.11 10 890.0
The right merge keeps all rows of the new DataFrame and adds the matching info of the original main DataFrame.
print('Right Merge')
print('-----------')
print(pd.merge(df, df_team, on='team', how='right'))
Right Merge
-----------
absFrequency position relFrequency team sales
0 NaN NaN NaN 3 500
1 52.0 1.0 0.20 4 300
2 NaN NaN NaN 5 250
3 64.0 1.0 0.25 6 450
4 NaN NaN NaN 7 345
5 NaN NaN NaN 8 123
6 NaN NaN NaN 9 432
7 28.0 3.0 0.11 10 890
The outer merge method merges by team whenever possible but keeps the info from both DataFrames, even for observations where no merge/overlap occurred.
print('Outer Merge')
print('-----------')
print(pd.merge(df, df_team, on='team', how='outer'))
Outer Merge
-----------
absFrequency position relFrequency team sales
0 73.0 0.0 0.28 2 NaN
1 64.0 1.0 0.25 6 450.0
2 52.0 1.0 0.20 4 300.0
3 36.0 0.0 0.14 0 NaN
4 28.0 3.0 0.11 10 890.0
5 NaN NaN NaN 3 500.0
6 NaN NaN NaN 5 250.0
7 NaN NaN NaN 7 345.0
8 NaN NaN NaN 8 123.0
9 NaN NaN NaN 9 432.0
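When a merge does not behave as expected, it can help to see where each row came from. The optional indicator argument of pd.merge adds a _merge column flagging each row as 'left_only', 'right_only', or 'both':

# Diagnostic merge: the _merge column records each row's origin
merged = pd.merge(df, df_team, on='team', how='outer', indicator=True)
print(merged['_merge'].value_counts())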
10.3.8. Converting Column Types
When converting text into a DataFrame to conduct statistical analysis, it is sometimes necessary to convert a column of numbers that is still stored as strings into floats, so that we can do math with it. Let's drop another column first so the DataFrame fits nicely in the output window.
df = df.drop('absFrequency', axis=1)
# Let's have a look at it
print(df)
position relFrequency team
0 0 0.28 2
1 1 0.25 6
2 1 0.20 4
3 0 0.14 0
4 3 0.11 10
We now generate a column of "words" or "strings" that happen to be numbers. Since they are stored as words, we cannot run statistical analysis on these numbers yet.
df['string'] = pd.Series(['2','40','34','6','10'], index=df.index)
# Print DataFrame
print(df)
# Print summary statistics
print(df.describe())
position relFrequency team string
0 0 0.28 2 2
1 1 0.25 6 40
2 1 0.20 4 34
3 0 0.14 0 6
4 3 0.11 10 10
position relFrequency team
count 5.000000 5.000000 5.000000
mean 1.000000 0.196000 4.400000
std 1.224745 0.071624 3.847077
min 0.000000 0.110000 0.000000
25% 0.000000 0.140000 2.000000
50% 1.000000 0.200000 4.000000
75% 1.000000 0.250000 6.000000
max 3.000000 0.280000 10.000000
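You can confirm why describe() skipped the new column by checking the data types; the string column shows up as object:

# The 'string' column is stored as text (dtype object), so describe() ignores it
print(df.dtypes)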
We first need to reassign the data type of the string
column. We then
rename the column and print summary statistics.
# Transform strings into floats, i.e., words into numbers
df['string'] = df['string'].astype(float)
# Rename the column
df = df.rename(columns={'string': 'salary'})
# Print summary statistics
print(df.describe())
position relFrequency team salary
count 5.000000 5.000000 5.000000 5.000000
mean 1.000000 0.196000 4.400000 18.400000
std 1.224745 0.071624 3.847077 17.343587
min 0.000000 0.110000 0.000000 2.000000
25% 0.000000 0.140000 2.000000 6.000000
50% 1.000000 0.200000 4.000000 10.000000
75% 1.000000 0.250000 6.000000 34.000000
max 3.000000 0.280000 10.000000 40.000000
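If a column might contain entries that cannot be parsed as numbers, astype(float) will raise an error. An optional, more forgiving alternative is pd.to_numeric, which can turn unparsable entries into NaN:

# errors='coerce' converts anything unparsable into NaN instead of raising an error
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')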
10.3.9. Replacing Values in a DataFrame Conditional on Criteria
If we need to replace values in a DataFrame based on certain criteria it is often more efficient to use indexing as opposed to loops. Let us first create a DataFrame with some random values:
df1 = pd.DataFrame(np.random.rand(10,5)*100)
print(df1)
0 1 2 3 4
0 76.231650 73.856596 98.223677 37.419310 56.534611
1 93.020350 59.102390 87.029799 6.289979 81.704176
2 67.060606 30.773903 27.633734 35.819372 52.103559
3 80.603039 63.205036 16.050168 67.673033 19.890453
4 53.266469 8.721019 42.579826 36.000875 15.518005
5 20.139680 49.216850 9.924973 31.078271 77.523813
6 37.838850 65.905862 96.325081 99.530884 48.992624
7 52.408115 40.192827 60.132690 21.112231 52.065246
8 93.561547 36.599218 38.663248 56.510712 26.487473
9 8.446148 78.818387 74.815600 40.085175 90.895161
We next replace all the values in column 2 (the column labeled 1) that are smaller than 30 with the string Low. Note that we use .loc indexing; the chained form df1[1][df1[1]<30] = 'Low' is unreliable and triggers a SettingWithCopyWarning in newer versions of pandas.
df1.loc[df1[1] < 30, 1] = 'Low'
print(df1)
0 1 2 3 4
0 76.231650 73.856596 98.223677 37.419310 56.534611
1 93.020350 59.10239 87.029799 6.289979 81.704176
2 67.060606 30.773903 27.633734 35.819372 52.103559
3 80.603039 63.205036 16.050168 67.673033 19.890453
4 53.266469 Low 42.579826 36.000875 15.518005
5 20.139680 49.21685 9.924973 31.078271 77.523813
6 37.838850 65.905862 96.325081 99.530884 48.992624
7 52.408115 40.192827 60.132690 21.112231 52.065246
8 93.561547 36.599218 38.663248 56.510712 26.487473
9 8.446148 78.818387 74.815600 40.085175 90.895161
We next replace all the values of column 4 (label 3) that are larger than 70 with the value 1000.
df1.loc[(df1[3]>70), 3] = 1000
print(df1)
0 1 2 3 4
0 76.231650 73.856596 98.223677 37.419310 56.534611
1 93.020350 59.10239 87.029799 6.289979 81.704176
2 67.060606 30.773903 27.633734 35.819372 52.103559
3 80.603039 63.205036 16.050168 67.673033 19.890453
4 53.266469 Low 42.579826 36.000875 15.518005
5 20.139680 49.21685 9.924973 31.078271 77.523813
6 37.838850 65.905862 96.325081 1000.000000 48.992624
7 52.408115 40.192827 60.132690 21.112231 52.065246
8 93.561547 36.599218 38.663248 56.510712 26.487473
9 8.446148 78.818387 74.815600 40.085175 90.895161
And finally, we can combine logical statements and replace values based on a combination of conditions. Let's replace all the values in column 5 (label 4) that are between 20 and 80 with the string Middle.
df1.loc[((df1[4]>20) & (df1[4]<80)), 4] = 'Middle'
print(df1)
0 1 2 3 4
0 76.231650 73.856596 98.223677 37.419310 Middle
1 93.020350 59.10239 87.029799 6.289979 81.704176
2 67.060606 30.773903 27.633734 35.819372 Middle
3 80.603039 63.205036 16.050168 67.673033 19.890453
4 53.266469 Low 42.579826 36.000875 15.518005
5 20.139680 49.21685 9.924973 31.078271 Middle
6 37.838850 65.905862 96.325081 1000.000000 Middle
7 52.408115 40.192827 60.132690 21.112231 Middle
8 93.561547 36.599218 38.663248 56.510712 Middle
9 8.446148 78.818387 74.815600 40.085175 90.895161
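For simple two-way replacements, numpy's where function is another loop-free option. Here is a minimal sketch on a fresh example DataFrame (df1 above now mixes strings and floats, so we start clean):

import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.rand(5, 3) * 100)
# Label each cell 'High' if it exceeds 50 and 'LowVal' otherwise, in one vectorized step
labels = pd.DataFrame(np.where(df2 > 50, 'High', 'LowVal'),
                      index=df2.index, columns=df2.columns)
print(labels)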