{ "cells": [ { "cell_type": "markdown", "id": "1c287ce1-a43c-4f15-81ba-1a3526660991", "metadata": {}, "source": [ "# Webscraping\n", "\n", "We'll need a few packages to start: `requests`, `beautifulsoup4` and `selenium`. Requesting elements from a static web page is straightforward. As an example, let's grab the table of multiple Olympic medalists from Wikipedia, then create a bar plot showing which sports have the most multiple-medal winners.\n", "\n", "We first grab the data from the URL, then pass it to `beautifulsoup4`, which parses the HTML, and finally pass the result to pandas. Let's start by importing the packages we need." ] }, { "cell_type": "code", "execution_count": 1, "id": "3634b055-263c-428b-9983-7aa2d8c5a9ba", "metadata": {}, "outputs": [], "source": [ "import requests as rq\n", "import bs4\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "90f9e575-beec-4ead-afd8-a5d3098d2f8e", "metadata": {}, "source": [ "We then need to download the web page and read in its contents." ] }, { "cell_type": "code", "execution_count": 2, "id": "dddc9158-aa75-4dd7-84ac-63076e7628a7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n\n", "\n", "
|  | No. | Athlete | Nation | Sport | Years | Games | Gender | Gold | Silver | Bronze | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Michael Phelps | United States | Swimming | 2000–2016 | Summer | M | 23.0 | 3.0 | 2.0 | 28.0 |
| 1 | 2 | Larisa Latynina | Soviet Union | Gymnastics | 1956–1964 | Summer | F | 9.0 | 5.0 | 4.0 | 18.0 |
| 2 | 3 | Paavo Nurmi | Finland | Athletics | 1920–1928 | Summer | M | 9.0 | 3.0 | 0.0 | 12.0 |
| 3 | 4 | Mark Spitz | United States | Swimming | 1968–1972 | Summer | M | 9.0 | 1.0 | 1.0 | 11.0 |
| 4 | 5 | Carl Lewis | United States | Athletics | 1984–1996 | Summer | M | 9.0 | 1.0 | 0.0 | 10.0 |
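
The parse-then-count part of the workflow (HTML into `beautifulsoup4`, rows into pandas, then a tally per sport) might look roughly like the sketch below. It is only an illustration under stated assumptions: `SAMPLE_HTML` is a hypothetical, cut-down stand-in for the Wikipedia page so the code runs offline, and `table_to_dataframe` is a helper name invented here, not something from the notebook. In the notebook itself the HTML string would come from the page fetched with `requests`.

```python
import bs4
import pandas as pd

# Hypothetical miniature of the medalists table (a stand-in for the real page,
# which would normally be downloaded with requests and passed in as text).
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>No.</th><th>Athlete</th><th>Sport</th><th>Gold</th></tr>
  <tr><td>1</td><td>Michael Phelps</td><td>Swimming</td><td>23</td></tr>
  <tr><td>2</td><td>Larisa Latynina</td><td>Gymnastics</td><td>9</td></tr>
  <tr><td>3</td><td>Paavo Nurmi</td><td>Athletics</td><td>9</td></tr>
  <tr><td>4</td><td>Mark Spitz</td><td>Swimming</td><td>9</td></tr>
  <tr><td>5</td><td>Carl Lewis</td><td>Athletics</td><td>9</td></tr>
</table>
"""


def table_to_dataframe(html):
    """Parse the first <table> in `html` into a DataFrame of strings."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = table.find_all("tr")
    # The first row holds the column names, the remaining rows hold the data.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    body = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(body, columns=header)


df = table_to_dataframe(SAMPLE_HTML)

# Number of listed athletes per sport; this is the series a bar plot would show.
sport_counts = df["Sport"].value_counts()
print(sport_counts)
```

With matplotlib installed, `sport_counts.plot.bar()` would turn the tally into the bar plot described at the start. Parsing the rows by hand keeps each step visible; the real page has more columns and possibly several tables, so the `soup.find("table")` lookup would likely need to be more specific there.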