Webscraping

We’ll need some packages to start: requests, beautifulsoup4 and selenium, plus pandas for handling the table. Requesting elements from a static web page is straightforward. As an example, let’s grab the table of multiple Olympic gold medalists from Wikipedia, then create a bar plot showing which sports have the most multiple gold medal winners.

First we have to grab the data from the URL, then pass it to beautifulsoup4, which parses the HTML, then pass the result to pandas. Let’s start by importing the packages we need.

import requests as rq
import bs4
import pandas as pd

Next, we download the web page with requests.

url = 'https://en.wikipedia.org/wiki/List_of_multiple_Olympic_gold_medalists'
page = rq.get(url)
## print out the first 99 characters just to see what it looks like
page.text[0 : 99]
'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-l'
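The request above assumes the download succeeds. As a minimal sketch (not part of the original example), a small helper can set a timeout, send a User-Agent header, and fail loudly on HTTP errors. The name `fetch_html`, the `get` parameter, and the User-Agent string are hypothetical choices made for illustration.

```python
import requests as rq


def fetch_html(url, get=rq.get, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised without a network connection (an assumption
    of this sketch, not something the original example does).
    """
    # The User-Agent value is a made-up placeholder; use one that identifies you.
    headers = {"User-Agent": "webscraping-class-example/0.1"}
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the Wikipedia URL should then return the same text as `page.text`, or raise an exception if the server refuses the request.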

Now let’s read the page into bs4 and find the tables in it. We pass the class name wikitable to specify which tables we want, since Wikipedia marks its data tables with that class. To find the class names a page uses, you can use a web tool like SelectorGadget or view the page source.

bs4page = bs4.BeautifulSoup(page.text, 'html.parser')
tables = bs4page.find_all('table',{'class':"wikitable"})
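To see what the class filter does, here is a small sketch on a made-up HTML snippet (the snippet and its contents are invented for illustration, not taken from Wikipedia). An element matches when any one of its classes equals the requested name, so a table with `class="wikitable sortable"` is still found.

```python
import bs4

# A made-up page with one data table and one navigation table
html = """
<html><body>
  <table class="wikitable sortable">
    <tr><th>Sport</th></tr>
    <tr><td>Swimming</td></tr>
  </table>
  <table class="navbox">
    <tr><td>Navigation links</td></tr>
  </table>
</body></html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Without a filter, every <table> element is returned
all_tables = soup.find_all("table")

# With the class filter, only tables carrying the wikitable class are returned
wiki_tables = soup.find_all("table", {"class": "wikitable"})

# The same filter written as a CSS selector
wiki_tables_css = soup.select("table.wikitable")

print(len(all_tables), len(wiki_tables), len(wiki_tables_css))
```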

Now we take the HTML that we’ve saved and read it into pandas. Fortunately, pandas has a read_html function. The page contains several wikitable tables, so we take the first one, convert it to a string, and read it in. read_html always returns a list of data frames, so we again take the first element.

from io import StringIO
# Read the first table into pandas via a StringIO object
# Note: recent versions of pandas deprecate passing a raw HTML string to read_html, so we wrap it in StringIO
medals = pd.read_html(StringIO(str(tables[0])))[0]
medals = medals.dropna()
medals.head()
   No.          Athlete         Nation       Sport      Years   Games  Gender  Gold  Silver  Bronze  Total
0    1   Michael Phelps  United States    Swimming  2000–2016  Summer       M  23.0     3.0     2.0   28.0
1    2  Larisa Latynina   Soviet Union  Gymnastics  1956–1964  Summer       F   9.0     5.0     4.0   18.0
2    3      Paavo Nurmi        Finland   Athletics  1920–1928  Summer       M   9.0     3.0     0.0   12.0
3    4       Mark Spitz  United States    Swimming  1968–1972  Summer       M   9.0     1.0     1.0   11.0
4    5       Carl Lewis  United States   Athletics  1984–1996  Summer       M   9.0     1.0     0.0   10.0
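The two indexing steps and the dropna call are easier to follow on a toy input. The sketch below uses a made-up two-table HTML string (the athlete names and numbers are invented) and assumes an HTML parser that pandas can use, such as lxml, is installed. It shows that read_html returns one data frame per table, and that dropna removes any row with a missing cell.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; the second row of the first table has an empty cell
html = """
<table class="wikitable">
  <tr><th>Athlete</th><th>Sport</th><th>Gold</th></tr>
  <tr><td>Athlete A</td><td>Swimming</td><td>5</td></tr>
  <tr><td>Athlete B</td><td>Athletics</td><td></td></tr>
</table>
<table class="wikitable">
  <tr><th>Sport</th><th>Events</th></tr>
  <tr><td>Swimming</td><td>35</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
frames = pd.read_html(StringIO(html))
first = frames[0]

# The empty Gold cell is read as NaN, so dropna() removes that row
cleaned = first.dropna()

print(len(frames))
print(first)
print(cleaned)
```

A numeric column that contains NaN is stored as floating point, which is one plausible reason the medal counts above print as 23.0 rather than 23.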

Now we’re in a position to build our plot. Let’s look at the count of multiple gold medal winners (the athletes listed in this table) by sport and games.

medals[['Sport', 'Games']].value_counts().plot.bar();
[Figure: bar plot of the number of athletes for each combination of Sport and Games]
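As a small sketch of what the plotted quantity looks like, here is value_counts applied to two columns of a toy data frame (the rows are invented for illustration). The result is a Series indexed by each (Sport, Games) pair, sorted from most to least frequent, which is what the bar plot draws.

```python
import pandas as pd

# Toy data standing in for the medals table
toy = pd.DataFrame(
    {
        "Sport": ["Swimming", "Swimming", "Athletics", "Biathlon"],
        "Games": ["Summer", "Summer", "Summer", "Winter"],
    }
)

# Count how often each (Sport, Games) combination occurs
counts = toy[["Sport", "Games"]].value_counts()

print(counts)
```

Because the result is sorted in descending order, the tallest bar comes first. With many sports, `counts.plot.barh()` may give more readable labels than vertical bars.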

Selenium

If the page relies on JavaScript, basic web scraping may not work. In this case, you not only need to get and parse the page, but also to interact with the JavaScript. For this, enter Selenium. This is a browser automation tool with Python bindings that lets you drive a real browser from code and automate web navigation. For this class, we’re going to work on static web pages, so we won’t need Selenium.