Webscraping

We’ll need some packages to start: requests, beautifulsoup4, pandas and, for pages that rely on JavaScript, selenium. Requesting elements from a static web page is straightforward. As an example, let’s grab the table of multiple Olympic gold medalists from Wikipedia, then create a bar plot showing which sports have the most multiple medal winners.

First we grab the data from the URL, then pass it to beautifulsoup4, which parses the HTML, and finally pass the result to pandas. Let’s start by importing the packages we need.

import requests as rq
import bs4
import pandas as pd

We then download the web page and store the response in page.

url = 'https://en.wikipedia.org/wiki/List_of_multiple_Olympic_gold_medalists'
page = rq.get(url)
# print out the first 99 characters just to see what it looks like
page.text[:99]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-l'
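The request above assumes that the download succeeds. A slightly more defensive version might check the HTTP status and set a timeout. The sketch below is only one way to do this; the function name fetch_html and the User-Agent string are hypothetical choices made for illustration and do not come from these notes.

```python
import requests as rq


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising an error on a bad response."""
    # Hypothetical User-Agent string; some sites ask scrapers to identify themselves.
    headers = {"User-Agent": "webscraping-notes-example/0.1"}
    response = rq.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, fetch_html(url) would return the same text as page.text in the cell above, or raise an exception if the server refused the request.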

Now let’s read the page into bs4 and find the tables in it. We ask for table elements whose class is wikitable, so that only the tables we want are returned. To find out which classes a page uses, you can use a web tool such as SelectorGadget, or view the page source.

bs4page = bs4.BeautifulSoup(page.text, 'html.parser')
tables = bs4page.find_all('table',{'class':"wikitable"})
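To see what the class filter does without downloading anything, here is a small sketch on a made-up HTML snippet (sample_html is hypothetical, not taken from Wikipedia). It shows that find_all matches a table when wikitable is one of several classes, and that a CSS selector gives the same result.

```python
import bs4

# Hypothetical HTML with one table we want and one we do not.
sample_html = """
<table class="wikitable sortable"><tr><th>Athlete</th></tr><tr><td>A</td></tr></table>
<table class="infobox"><tr><td>not wanted</td></tr></table>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Matches tables whose class list contains "wikitable", even alongside other classes.
wikitables = soup.find_all("table", {"class": "wikitable"})

# The same query written as a CSS selector.
selected = soup.select("table.wikitable")

print(len(wikitables), len(selected))
```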

Now we take the HTML that we’ve saved and read it into pandas. Fortunately, pandas has a read_html function. Since there are multiple tables, we grab the first one, convert it to a string and read it in. read_html returns a list of data frames, so we take the first element of the result.

from io import StringIO
# Read the first table into pandas via a StringIO object
# Note: recent versions of pandas deprecate passing a literal HTML string to read_html, so we wrap it in StringIO
medals = pd.read_html(StringIO(str(tables[0])))[0]
medals = medals.dropna()
medals.head()

	No.	Athlete	Nation	Sport	Years	Games	Gender	Gold	Silver	Bronze	Total
0	1	Michael Phelps	United States	Swimming	2000–2016	Summer	M	23.0	3.0	2.0	28.0
1	2	Larisa Latynina	Soviet Union	Gymnastics	1956–1964	Summer	F	9.0	5.0	4.0	18.0
2	3	Paavo Nurmi	Finland	Athletics	1920–1928	Summer	M	9.0	3.0	0.0	12.0
3	4	Mark Spitz	United States	Swimming	1968–1972	Summer	M	9.0	1.0	1.0	11.0
4	5	Carl Lewis	United States	Athletics	1984–1996	Summer	M	9.0	1.0	0.0	10.0
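The medal counts above are shown as floats (23.0 rather than 23). In pandas, a numeric column that contains missing values is stored as float, which is the likely cause here. A minimal sketch of converting the counts back to integers after dropna, using two rows from the output above plus a hypothetical empty row standing in for whatever was missing in the raw table:

```python
import numpy as np
import pandas as pd

# Two rows from the output above, plus a hypothetical all-missing row.
raw = pd.DataFrame({
    "Athlete": ["Michael Phelps", "Larisa Latynina", None],
    "Gold": [23, 9, np.nan],
    "Total": [28, 18, np.nan],
})

print(raw["Gold"].dtype)  # float, because of the missing value

# Drop incomplete rows, then cast the count columns to integers.
clean = raw.dropna().astype({"Gold": int, "Total": int})
print(clean)
```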

Now we’re in a position to build our plot. Let’s count the winners of 4 or more medals by sport and games.

medals[['Sport', 'Games']].value_counts().plot.bar();

[Figure: bar plot of the number of athletes for each combination of sport and games]
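It can help to look at the counts themselves before plotting them. The sketch below rebuilds just the Sport and Games columns of the five rows shown earlier (so the numbers describe only those five rows, not the full table) and shows what value_counts returns for a pair of columns.

```python
import pandas as pd

# Sport and Games for the five rows shown in medals.head() above.
sample = pd.DataFrame({
    "Sport": ["Swimming", "Gymnastics", "Athletics", "Swimming", "Athletics"],
    "Games": ["Summer", "Summer", "Summer", "Summer", "Summer"],
})

# One count per (Sport, Games) combination, sorted from most to least frequent.
counts = sample[["Sport", "Games"]].value_counts()
print(counts)

# counts.plot.bar() would then draw one bar per combination, as in the cell above.
```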

Selenium

If the page relies on JavaScript, basic web scraping may not work. In this case, you need not only to get and parse the page, but also to let its JavaScript run and possibly interact with it. This is where Selenium comes in. It is a browser automation tool with Python bindings that lets you drive a real browser from code and automate web navigation. For this class, we’re going to work on static web pages, so we won’t need Selenium.
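For reference only, a minimal sketch of what fetching a JavaScript-rendered page with Selenium might look like. It assumes Selenium 4 and a recent Chrome are installed; the function name fetch_rendered_html is hypothetical, and pages that load content asynchronously may need explicit waits that are not shown here.

```python
def fetch_rendered_html(url):
    """Return the HTML of a page after a real browser has loaded it."""
    # Imported here so the rest of these notes run without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source is the HTML as the browser currently sees it.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The returned string could then be passed to bs4.BeautifulSoup in the same way as page.text above.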
