13. Working with Data from the Web I

When you try to scrape or “harvest” data from the web, it is useful to know the basics of html, the markup language that is used to create websites. A browser interprets the html code and shows you the formatted website.

When we download information from the web, we usually download this source code and then run a parser over it to extract the information we are after.

Note

You can look at the source code with all the html tags of a website in your browser. Simply right-click on the page and select View frame source or View page source, depending on your browser.

13.1. Beautiful Soup Library

The Python library BeautifulSoup helps us parse the information out of the html data that we get after downloading a web page. In addition, the urlopen function from the standard library module urllib.request allows us to download the source code of websites. We therefore first need to import these libraries.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re

The first import allows us to open web pages, and the second, BeautifulSoup, parses the html code and stores it in an easily accessible object. This object has methods that are tailor-made for extracting information from the html code of the website that we scrape. The other two packages are Pandas and Regular Expressions. The latter is useful for pattern matching, as you will see below; see also the chapter on Regular Expressions in these lecture notes.
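As a minimal sketch of how these pieces fit together, we can first download and parse a very simple page. The example below uses example.com, a domain reserved for documentation examples; it is only an illustration, not part of the YouTube exercise.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the raw html source code of the page
html = urlopen('http://example.com/').read()

# Parse the html into a BeautifulSoup object
soup = BeautifulSoup(html.decode('utf-8', 'ignore'), 'html.parser')

# The soup object now gives us convenient access to parts of the page
print(soup.title.get_text())   # prints: Example Domain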

13.2. Scraping YouTube

13.2.1. Introduction

In this exercise we will download all video links from the YouTube starting page. For each video we will extract the title and the number of views, likes, and dislikes. We then store everything in a Pandas dataframe for further statistical analysis.

We first define the YouTube URL and assign it to a variable.

# Define url to Youtube
jurl = 'https://www.youtube.com/'

We next open the YouTube site, read the page’s html code, and assign it to the variable html.

html = urlopen(jurl).read()

We then assign the html code to the BeautifulSoup data format which allows us to sort through the html code more systematically.

soup = BeautifulSoup(html.decode('utf-8', 'ignore'))

Note

You may see a warning message that no parser was explicitly specified. To suppress it you could try:

soup = BeautifulSoup(html.decode('utf-8', 'ignore'), "lxml")

or alternatively:

soup = BeautifulSoup(html.decode('utf-8', 'ignore'), "html.parser")

or also:

soup = BeautifulSoup(html.decode('utf-8', 'ignore'), features="lxml")

(The html.parser option ships with Python, whereas lxml is a third-party package that must be installed separately.)

13.2.2. Extracting Information from html Code

Next, look at the source code of the YouTube webpage (right-click on the webpage and select View page source) and convince yourself that the links to the videos all sit inside a tags with the same CSS classes. If you inspect the soup object, it is a bit easier to see that all videos are part of a link section that starts with:

<a aria-hidden="true" class="yt-uix-sessionlink spf-link">

Note

Here <a> is the anchor element that starts a section containing a link. A typical piece of html code with a link looks something like this:

<a href="http://example.com/">Link to example.com</a>

where href stands for hyper reference (i.e., the link) and the </a> at the end of the line closes the link section. That is, <a …> is the opening tag and </a> is the closing tag.

Also note that the actual links to the videos all start with the html attribute:

href="/watch?v= ... "

where instead of the “…” you will see a code for a particular video such as -6BvA4U1dLI.
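As a small self-contained illustration of how BeautifulSoup reads such an anchor tag, consider the made-up snippet below (the video code is the example one from above). The attribute lookup tag['href'] returns the link target.

# A made-up anchor tag, just for illustration
snippet = '<a href="/watch?v=-6BvA4U1dLI" aria-hidden="true">A video</a>'
demo = BeautifulSoup(snippet, 'html.parser')

# find() returns the first matching <a> tag
tag = demo.find('a')
print(tag['href'])       # prints: /watch?v=-6BvA4U1dLI
print(tag.get_text())    # prints: A video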

We next extract all these a sections from the soup object. We limit the extraction to the particular a sections that contain the actual links to videos, as opposed to other links that are also part of the html code of YouTube’s starting page. Since we observed above that the video links sit inside a specific a class with the attribute aria-hidden set to true and class set to yt-uix-sessionlink spf-link, we filter for these particular a sections.

linkSections = soup.findAll('a', attrs={'aria-hidden':'true', 'class':'yt-uix-sessionlink spf-link'})

We can now define an empty list in which to store all the links to the videos, and then run a loop through the extracted sections to collect all links that start with href="/watch?v=". We append all these links to the link_list list.

Let us test this first before we run a loop. Let’s have a look at the first element of our extracted list.

print(linkSections[0])
<a aria-hidden="true" class="yt-uix-sessionlink spf-link"
   data-sessionlink="itct=CE8QlDUYDCITCLnPhtOJw-gCFQq8wQodiykNfzIKZy1oaWdoLXRydloPRkV3aGF0X3RvX3dhdGNo"
   href="/watch?v=BqbfT5r-DzA"><div class="yt-thumb video-thumb"><span class="yt-thumb-simple">
<img alt="" data-thumb="https://i.ytimg.com/vi/BqbfT5r-DzA/hqdefault.jpg?sqp=-oaymwEiCMQBEG5IWvKriqkDFQgBFQAAAAAYASUAAMhCPQCAokN4AQ==&amp;rs=AOn4CLCXRKXynrkq7kA90QUE7SIWtRpaBw"
   data-ytimg="1" height="110" onload=";window.__ytRIL &amp;&amp; __ytRIL(this)" src="/yts/img/pixel-vfl3z5WfW.gif" width="196"/>
<span aria-hidden="true" class="video-time">2:10</span></span></div></a>

Next let us have a look at the link to the video in that section.

print(linkSections[0]['href'])
/watch?v=BqbfT5r-DzA

If we combine this with https://www.youtube.com/, which we have already stored in the variable jurl, we have the complete link to the video, which we can copy and paste into a browser.

link_list = []

for link in linkSections:
    # Store all links in a list
    newLink = link['href']
    link_list.append(newLink)

Let’s print the first six entries of the list.

print(link_list[0:6])
['/watch?v=BqbfT5r-DzA', '/watch?v=4wS4PYnWSrw',
'/watch?v=l3Rm8avK6kM', '/watch?v=EoLQUdt8xsk',
'/watch?v=2AdUsBwuJSk', '/watch?v=BWb8YqAVTgw']

13.2.3. Storing Information in a DataFrame

We then create an empty dataframe that has the same number of rows as our list. In addition, we add empty columns so that we can store the title, view, like, and dislike information later on.

index = range(len(link_list))
columns = ['Links', 'Title', 'Views', 'Likes', 'Dislikes']
df = pd.DataFrame(index=index, columns=columns)

We next assign the link_list with all the video links to the dataframe and prepend the base URL stored in jurl to each link.

df['Links'] = link_list
df['Links'] = jurl + df['Links']
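Since jurl already ends with a slash and every link starts with one, this concatenation produces addresses with a double slash (https://www.youtube.com//watch?v=...), as you will see in the output below. Browsers tolerate this, but if you prefer clean addresses you could use urljoin from the standard library instead. This is an optional alternative, not part of the original script.

from urllib.parse import urljoin

# urljoin resolves the relative link against the base URL
# and avoids the double slash of naive string concatenation
print(urljoin(jurl, '/watch?v=BqbfT5r-DzA'))
# prints: https://www.youtube.com/watch?v=BqbfT5r-DzA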

We then start the loop that runs through our list of video links and opens each video page separately. We grab the title, the number of views, the number of likes, and the number of dislikes, and store these data in the current row of our dataframe.

We use regular expressions to find the numbers of views, likes and dislikes in the text string that is generated by the soup.select() and soup.find_all() functions.

The term r'\d+' is regular expression syntax. Think of it as a pattern that we search for in the html code of the website. The letter d stands for digit, and the plus sign that follows indicates that the numbers we are looking for can consist of one or more digits. Have a look at the textbook chapter on regular expressions.
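As a quick illustration on a made-up string, the pattern grabs every run of consecutive digits:

# \d matches a single digit; + means "one or more of them"
print(re.findall(r'\d+', 'Uploaded in 2020, 448950 views'))
# prints: ['2020', '448950']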

Warning

The following code might not run all the way through. The reason is that when you try to open websites with a crawler script, you may run into a server-side issue where a website cannot be accessed for a split second. In that case the line:

html = urlopen(df['Links'][i]).read()

in the script below returns an empty object. The script then tries to read from this object, which ends in errors such as:

IndexError: list index out of range

We learn how to deal with such issues in the next section.

13.2.4. Extract the View Count for the First Video

Let us now start with the link to the first video. We again read in the html code and parse it into a soup object.

i = 0
# Open first youtube video link
html = urlopen(df['Links'][i]).read()

# Assign it to Soup object
soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
# To suppress the warning, specify the parser explicitly:
#soup = BeautifulSoup(html.decode('utf-8', 'ignore'), features="lxml")

We first extract the title. This is such a common task that it is pre-programmed as a function in the Beautiful Soup library.

# Extract info and store in dataframe
df['Title'][i] = soup.title.get_text()
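To see what soup.title returns compared to get_text(), here is a small self-contained example with a made-up html string:

mini = BeautifulSoup('<html><head><title>My Video - YouTube</title></head></html>', 'html.parser')
print(mini.title)             # prints: <title>My Video - YouTube</title>
print(mini.title.get_text())  # prints: My Video - YouTube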

We next look for the information about the view counts. This requires a bit of detective work. After staring at the html code for a while we find that the number of views is embedded in an html block with the CSS class watch-view-count. We use this information and select the block of text where this class occurs.

print(soup.select('.watch-view-count'))
[<div class="watch-view-count">448,950 views</div>]

This returns a list, so let us just grab the content (first entry) of this list.

print(soup.select('.watch-view-count')[0])
<div class="watch-view-count">448,950 views</div>

Now let us just get the text between the html tags using the get_text() function.

print(soup.select('.watch-view-count')[0].get_text())
448,950 views

Ok, our number is in there. Let us now split this string into separate strings so that we can grab our number more easily.

print(soup.select('.watch-view-count')[0].get_text().split())
['448,950', 'views']

Almost there, grab the first element of this list.

print(soup.select('.watch-view-count')[0].get_text().split()[0])
448,950

Now we use Regular Expressions to search for everything that is not a digit and replace it with nothing. In other words, we delete every character that is not a digit.

print(re.sub('[^0-9]', '', soup.select('.watch-view-count')[0].get_text().split()[0]))
448950

This is still a string, so let us re-type it as an integer so that we can assign it to our dataframe and do math with it.

print(int(re.sub('[^0-9]', '', soup.select('.watch-view-count')[0].get_text().split()[0])))
448950

And finally stick it into our dataframe in the Views column at row zero, which is the first row.

df['Views'][i] = int(re.sub('[^0-9]', '', \
            soup.select('.watch-view-count')[0].get_text().split()[0]))
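Since this chain of calls is rather long, you could also wrap the steps in a small helper function. The sketch below just bundles the commands from above; the function name is our own choice and not part of the original script.

def extract_view_count(soup):
    # Select the view-count block, take its text, keep the first word,
    # and strip out every character that is not a digit
    text = soup.select('.watch-view-count')[0].get_text().split()[0]
    return int(re.sub('[^0-9]', '', text))

df['Views'][i] = extract_view_count(soup)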

Let us have a quick look at the dataframe to see whether it has been stored in the correct position.

print(df.head())
                                          Links  \
0  https://www.youtube.com//watch?v=BqbfT5r-DzA
1  https://www.youtube.com//watch?v=4wS4PYnWSrw
2  https://www.youtube.com//watch?v=l3Rm8avK6kM
3  https://www.youtube.com//watch?v=EoLQUdt8xsk
4  https://www.youtube.com//watch?v=2AdUsBwuJSk

                                               Title   Views Likes
Dislikes
0  Trump extends coronavirus guidelines until Apr...  448950   NaN
NaN
1                                                NaN     NaN   NaN
NaN
2                                                NaN     NaN   NaN
NaN
3                                                NaN     NaN   NaN
NaN
4                                                NaN     NaN   NaN
NaN

13.2.5. Extract the Likes and Dislikes for the First Video

Next we follow a similar procedure to extract the “likes” information. After some digging in the html code of the web page we find that the “likes” information sits in a button tag with the button title “I like this”. We use this in the find_all() method of the soup object we created above.

# Extracting number of likes
print(soup.find_all('button', \
  attrs={'title': 'I like this'}))
[<button aria-label="like this video along with 4,457 other people"
 class="yt-uix-button yt-uix-button-size-default yt-uix-button-opacity yt-uix-button-has-icon no-icon-markup like-button-renderer-like-button like-button-renderer-like-button-unclicked yt-uix-clickcard-target yt-uix-tooltip"
 data-force-position="true" data-orientation="vertical" data-position="bottomright"
 onclick=";return false;" title="I like this" type="button"><span class="yt-uix-button-content">4,457</span></button>]

We almost have it. Let us get rid of the comma using the replace() function. This function only works on strings, so we first re-type the result we just obtained as a string using the str() function.

print(str(soup.find_all('button', \
  attrs={'title': 'I like this'})).replace(",",""))
[<button aria-label="like this video along with 4457 other people"
 class="yt-uix-button yt-uix-button-size-default yt-uix-button-opacity yt-uix-button-has-icon no-icon-markup like-button-renderer-like-button like-button-renderer-like-button-unclicked yt-uix-clickcard-target yt-uix-tooltip"
 data-force-position="true" data-orientation="vertical" data-position="bottomright"
 onclick=";return false;" title="I like this" type="button"><span class="yt-uix-button-content">4457</span></button>]

Let us store this whole thing as variable a.

a = str(soup.find_all('button', \
  attrs={'title': 'I like this'})).replace(",","")

Now let us just get the digits so that we can build our number again. Remember, right now it is a string (i.e., a “word”) that we cannot do math with. So let us look for all the digits of the number using Regular Expressions and then re-type it into a floating point number (i.e., a decimal number) using the float() function.

print(re.findall(r'\d+', a))
print(float(re.findall(r'\d+', a)[0]))
['4457', '4457']
4457.0

Here is the entire procedure for “likes” and “dislikes”, which works in an identical way.

# Extracting number of likes
a = str(soup.find_all('button', \
  attrs={'title': 'I like this'})).replace(",","")
df['Likes'][i] = float(re.findall(r'\d+', a)[0])

# Extracting number of dislikes
a = str(soup.find_all('button', \
  attrs={'title': 'I dislike this'})).replace(",","")
df['Dislikes'][i] =  float(re.findall(r'\d+', a)[0])
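The two blocks differ only in the button title, so you could also fold them into a single helper function. Again, this is just a sketch with a name of our own choosing, not part of the original script.

def extract_button_count(soup, title):
    # Find the button with the given title, remove thousands separators,
    # and return the first number that appears in it
    a = str(soup.find_all('button', attrs={'title': title})).replace(",", "")
    return float(re.findall(r'\d+', a)[0])

df['Likes'][i] = extract_button_count(soup, 'I like this')
df['Dislikes'][i] = extract_button_count(soup, 'I dislike this')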

Let us have a look at the dataframe again.

print(df.head())
                                          Links  \
0  https://www.youtube.com//watch?v=BqbfT5r-DzA
1  https://www.youtube.com//watch?v=4wS4PYnWSrw
2  https://www.youtube.com//watch?v=l3Rm8avK6kM
3  https://www.youtube.com//watch?v=EoLQUdt8xsk
4  https://www.youtube.com//watch?v=2AdUsBwuJSk

                                               Title   Views Likes
Dislikes
0  Trump extends coronavirus guidelines until Apr...  448950  4457
794
1                                                NaN     NaN   NaN
NaN
2                                                NaN     NaN   NaN
NaN
3                                                NaN     NaN   NaN
NaN
4                                                NaN     NaN   NaN
NaN

Finally, here is the entire code that extracts all the information for the first video:

# First video i.e., first element of the Links column in dataframe
i = 0

# Open first youtube video link
html = urlopen(df['Links'][i]).read()

# Assign it to Soup object
soup = BeautifulSoup(html.decode('utf-8', 'ignore'), features="lxml")

# Extract info and store in dataframe
df['Title'][i] = soup.title.get_text()

# Extract number of views
df['Views'][i] = int(re.sub('[^0-9]', '', \
    soup.select('.watch-view-count')[0].get_text().split()[0]))

# Extracting number of likes
a = str(soup.find_all('button', \
    attrs={'title': 'I like this'})).replace(",","")
df['Likes'][i] = float(re.findall(r'\d+', a)[0])

# Extracting number of dislikes
a = str(soup.find_all('button', \
    attrs={'title': 'I dislike this'})).replace(",","")
df['Dislikes'][i] =  float(re.findall(r'\d+', a)[0])

13.2.6. Extract Remaining Video Info Using Loop

We finally scrape all the other videos in the same way. In order to make this a bit nicer we simply put it into a loop.

for i in range(len(link_list)):
    if i<5 or i>len(link_list)-5:
        print('{} out of {}'.format(i, len(link_list)))

    # Open the current youtube video link
    html = urlopen(df['Links'][i]).read()

    # Assign it to Soup object
    soup = BeautifulSoup(html.decode('utf-8', 'ignore'))

    # Extract info and store in dataframe
    df['Title'][i] = soup.title.get_text()
    df['Views'][i] = int(re.sub('[^0-9]', '', \
      soup.select('.watch-view-count')[0].get_text().split()[0]))

    # Extracting number of likes
    a = str(soup.find_all('button', \
      attrs={'title': 'I like this'})).replace(",","")
    df['Likes'][i] = float(re.findall(r'\d+', a)[0])

    # Extracting number of dislikes
    a = str(soup.find_all('button', \
      attrs={'title': 'I dislike this'})).replace(",","")
    df['Dislikes'][i] =  float(re.findall(r'\d+', a)[0])

Running this loop may result in an error if there is a server issue for one of the many videos we try to scrape. Do not worry; we will fix this now.

13.3. How to Deal with Errors

Sometimes a website is down or cannot be read for some reason. In this case the line in the above script that opens the webpage, html = urlopen(df['Links'][i]).read(), may return an empty object. The next line of the script that uses this object would then break the code and throw an error.

In order to circumvent that, we can put the entire code block into a try-except statement. The Python interpreter will then try to load the content of the website and extract all the information from it. If the interpreter is not able to load the website, then, instead of breaking the code, it simply jumps into the alternative branch (the except part) and continues running the commands there.

for i in range(len(link_list)):
    if i<5 or i>len(link_list)-5:
        print('{} out of {}'.format(i, len(link_list)))

    # Sometimes a website is down or cannot be read for some reason.
    # The try-except statement keeps the loop running in that case.
    try:
        # Open the current youtube video link
        html = urlopen(df['Links'][i]).read()

        # Assign it to Soup object
        soup = BeautifulSoup(html.decode('utf-8', 'ignore'))

        # Extract info and store in dataframe
        df['Title'][i] = soup.title.get_text()
        df['Views'][i] = int(re.sub('[^0-9]', '', \
          soup.select('.watch-view-count')[0].get_text().split()[0]))

        # Extracting number of likes
        a = str(soup.find_all('button', \
          attrs={'title': 'I like this'})).replace(",","")
        df['Likes'][i] = float(re.findall(r'\d+', a)[0])

        # Extracting number of dislikes
        a = str(soup.find_all('button', \
          attrs={'title': 'I dislike this'})).replace(",","")
        df['Dislikes'][i] =  float(re.findall(r'\d+', a)[0])

    except Exception as e:
        print('Something is wrong with link {}'.format(i))
        print('Probably a server side issue!')
        # The next line prints the error message
        print(e)
0 out of 37
1 out of 37
Something is wrong with link 1
Probably a server side issue!
list index out of range
2 out of 37
3 out of 37
4 out of 37
Something is wrong with link 4
Probably a server side issue!
list index out of range
Something is wrong with link 10
Probably a server side issue!
list index out of range
Something is wrong with link 24
Probably a server side issue!
list index out of range
33 out of 37
Something is wrong with link 33
Probably a server side issue!
list index out of range
34 out of 37
35 out of 37
36 out of 37

If you write your code in this way, your program will never break in case you hit a bad link. It will simply print the exception message and then continue with the next link from the link_list.
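Before analyzing the data it can also be useful to check how many rows stayed empty because of such bad links. A quick way to do this, using standard Pandas functions (a small sketch, not part of the original script), is:

# Count how many videos could not be scraped (their Views entry is still NaN)
n_missing = df['Views'].isna().sum()
print('{} out of {} links could not be read'.format(n_missing, len(df)))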

We finally sort the data by the number of views, starting with the most-viewed video, and print the first few entries:

Warning

In Pandas, the method df.sort() was replaced by df.sort_values(). Make sure you use the latter.

print(df.sort_values('Views', ascending = False).head())
                                           Links  \
7   https://www.youtube.com//watch?v=OPdbdjctx2I
9   https://www.youtube.com//watch?v=NaY91YjVbEM
8   https://www.youtube.com//watch?v=cLmCJKT5ssw
23  https://www.youtube.com//watch?v=sX8GgDgjq00
11  https://www.youtube.com//watch?v=GYMF5mhuBUE

                                                Title      Views
Likes  \
7   Jimmy and Kevin Hart Ride a Roller Coaster - Y...  106701515
824491
9   Musical Genre Challenge with Ariana Grande - Y...   68500194
1.2407e+06
8   Jack Black Performs His Legendary Sax-A-Boom w...   47298301
1.099e+06
23  Tiniest Puppy Loves To Race Around On His Whee...   37824437
593040
11  Whisper Challenge Veterans Day Reunion Surpris...   34100237
728669

   Dislikes
7     14032
9     21610
8     17020
23    14018
11    10213

13.4. More Tutorials

Here are additional web tutorials about Python web scraping that you can check out: