{"cells":[{"cell_type":"markdown","metadata":{"id":"J7BmBhj_hd5n"},"source":["# Advanced web scraping\n","\n","## Before you begin\n","\n","Before you start webscraping make sure to consider what you're\n","doing. Does your scraping violate a terms of service? Will it inconvenience the site, other users? Per Uncle Ben: WGPCGR.\n","\n","Also, before you begin web scraping, look for a download data option\n","or existing solution. Probably someone has run up against the same\n","problem and worked it out. For example, we're going to scrape some\n","wikipedia tables, which there's a million other solutions for,\n","including a wikipedia\n","[api](https://www.mediawiki.org/wiki/API:Main_page).\n","\n","## Basic web scraping\n","\n","We covered this last chapter. However, let's do an example of static page parsing just to get\n","started. Consider scraping the table of top 10 heat waves from\n","[wikipedia](https://en.wikipedia.org/wiki/List_of_natural_disasters_by_death_toll). First,\n","we open the url, then parse it using BeautifulSoup, then load it into\n","a pandas dataframe.\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":363},"id":"Elc1V_7fhd5s","executionInfo":{"status":"ok","timestamp":1713234049778,"user_tz":240,"elapsed":1350,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"b8b7e530-7c98-4f6b-c3bc-fc442463121a"},"source":["from urllib.request import urlopen\n","from bs4 import BeautifulSoup as bs\n","import pandas as pd\n","url = \"https://en.wikipedia.org/wiki/List_of_natural_disasters_by_death_toll\"\n","html = urlopen(url)\n","parsed = bs(html, 'html.parser').findAll(\"table\")\n","pd.read_html(str(parsed))[11]"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Rank Death toll Event \\\n","0 1.0 1300 The Daulatpur–Saturia tornado \n","1 2.0 751 The Tri-State tornado outbreak \n","2 3.0 681 1973 Dhaka tornado \n","3 4.0 660 1969 East 
Pakistan tornado \n","4 5.0 600 The Valletta, Malta tornado \n","5 6.0 500 The 1851 Sicily tornadoes \n","6 6.0 500 Narail-Magura tornado \n","7 6.0 500 Madaripur-Shibchar tornado \n","8 9.0 400 The 1984 Soviet Union tornado outbreak \n","9 10.0 317 The Great Natchez Tornado \n","\n"," Location Date \n","0 Manikganj, Bangladesh 1989 \n","1 United States (Missouri–Illinois–Indiana) 1925 \n","2 Bangladesh 1973 \n","3 East Pakistan (now Bangladesh) 1969 \n","4 Malta 1551 or 1556 \n","5 Sicily, Two Sicilies (now Italy) 1851 \n","6 Jessore, East Pakistan, Pakistan (now Bangladesh) 1964 \n","7 Bangladesh 1977 \n","8 Soviet Union (Volga Federal District, Central ... 1984 \n","9 United States (Mississippi–Louisiana) 1840 "]},"metadata":{},"execution_count":17}]},{"cell_type":"markdown","metadata":{"id":"4zuwwymUhd55"},"source":["The workflow is as follows:\n","\n","+ We used the developer console on the webpage to inspect the page and its properties.\n","+ We opened the url with `urlopen`.\n","+ We parsed the webpage with `BeautifulSoup`, then used the method `findAll` on the result to search for every table.\n","+ Pandas has 
a utility, `read_html`, that converts HTML tables into dataframes. In this case it creates a list of tables, where the 12th one (index 11) is the tornado table. Note that the parsed tables need to be converted to a string before being passed along.\n","\n","This variation of web scraping couldn't be easier. However, what if the content we're interested in only exists after interacting with the page? Then we need a more sophisticated solution.\n","\n","## Form filling\n","Web scraping can require posting to forms, such as logins. This can be\n","done directly with python / R without elaborate programming, for\n","example using the `requests` library. However, make sure you aren't\n","violating the web site's TOS, and also make sure you're not posting your\n","password to github as you commit scraping code. In general, don't\n","create a security hole for your account by web scraping it. Again,\n","check whether the site already has an API with an\n","authentication solution before writing code to post\n","authentication yourself. Many websites that want you to programmatically grab\n","the data build an API.\n","\n","## Programmatic web browsing\n","\n","Some web scraping requires us to interact with the webpage. This\n","requires a more advanced solution where we programmatically drive a\n","web browser to interact with the page. I'm using selenium and\n","chromedriver. To do this, I had to download\n","[chromedriver](https://chromedriver.chromium.org/downloads) and put it\n","in my unix `PATH`.\n","\n","Then, the following code opens a browser and then closes it.\n","\n","```\n","from selenium import webdriver\n","driver = webdriver.Chrome()\n","driver.quit()\n","```\n","\n"]},{"cell_type":"markdown","metadata":{"id":"SINJzeYahd56"},"source":["If all went well, a chrome window appeared then closed. That's the\n","browser we're going to program. 
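\n",
"\n",
"As an aside, if you're working on a machine without a display (a server, say), the same browser can be driven headless, with no window at all. This is a minimal sketch, assuming selenium 4's options API and a recent Chrome (older Chrome versions use plain `--headless`):\n",
"\n",
"```\n",
"from selenium import webdriver\n",
"\n",
"options = webdriver.ChromeOptions()\n",
"# run with no visible window\n",
"options.add_argument(\"--headless=new\")\n",
"driver = webdriver.Chrome(options=options)\n",
"driver.get(\"https://www.wikipedia.org/\")\n",
"print(driver.title)\n",
"driver.quit()\n",
"```\n",
"\n",
"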
If you look closely at the browser\n","before you close it, there's a banner up top that says \"Chrome is being\n","controlled by automated test software.\" Let's go through the example\n","on the selenium docs\n","[here](https://www.selenium.dev/documentation/webdriver/getting_started/first_script/). First,\n","let's visit a few pages. We'll go to my totally awesome web page that I\n","meticulously maintain every day, then duckduckgo. We'll wait a few\n","seconds in between. My site is created and hosted by google sites,\n","so it seems reasonable that it would store a cookie so that I can\n","log in and edit my site (which I almost never do). Duckduckgo is a\n","privacy-focused search engine, so let's check to see if it creates a cookie. (Hint:\n","I noticed that selenium doesn't like redirects, so use the actual page\n","url.)\n","\n","Let's go to my website and get the cookies that it saves.\n","```\n","driver.get(\"https://sites.google.com/view/bcaffo/home\")\n","print(driver.get_cookies())\n","```\n","You should see the browser go to my web page. Now let's fill in a web form.\n","First, let's go to duckduckgo. (Note: it appears duckduckgo changed its page so that this no longer works.)\n","\n","```\n","## Let's get rid of all cookies before we visit duckduckgo\n","driver.delete_all_cookies()\n","driver.get(\"https://duckduckgo.com/\")\n","print(driver.get_cookies())\n","```\n","\n","Now let's find the page elements that we'd like to interact\n","with. There's a text box that we want to submit a search command into\n","and a button that we'll need to press. When I go to duckduckgo and press\n","CTRL-SHIFT-I to open the developer tools, I find that the search box is (attributes elided):\n","\n","```\n","<input ... name=\"q\" ... >\n","```\n","\n","Notice the `name=\"q\"` html name for the search form. When I dig around and find the submit button, its code is (again elided):\n","\n","```\n","<button ... id=\"search_button_homepage\" ... >\n","```\n","\n","Notice its `id` is `search_button_homepage`. 
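\n",
"\n",
"As a cross-check, you can hunt for these attributes programmatically by parsing the page source with bs4. The fragment below is a made-up stand-in for duckduckgo's real homepage HTML:\n",
"\n",
"```\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# made-up fragment standing in for the real page source\n",
"html = '<form><input name=\"q\"><button id=\"search_button_homepage\">S</button></form>'\n",
"soup = BeautifulSoup(html, 'html.parser')\n",
"print(soup.find('input').get('name'))    # q\n",
"print(soup.find('button').get('id'))     # search_button_homepage\n",
"```\n",
"\n",
"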
Let's find these elements.\n","\n","```\n","from selenium.webdriver.common.by import By\n","\n","search_box = driver.find_element(by=By.NAME, value=\"q\")\n","search_button = driver.find_element(by=By.ID, value=\"search_button_homepage\")\n","```\n","\n","Now let's send the info and press submit.\n","\n","```\n","search_box.send_keys(\"Selenium\")\n","search_button.click()\n","driver.implicitly_wait(10)\n","driver.save_screenshot(\"assets/images/webscraping.png\")\n","page_source = driver.page_source\n","driver.close()\n","```\n","\n","Here, we saved the page source as a variable that can then be parsed\n","with other html parsers (like bs4). Play around with the methods\n","associated with `driver` and navigate the web. You'll see that\n","selenium is pretty incredible.\n","\n","Since I wrote this, duckduckgo changed their page, so the code no longer works.\n","\n"]},{"cell_type":"markdown","source":["## Doing it in colab\n","\n","Getting this to work on Colab is a little harder, since we can't launch a browser. I found the instructions [here](https://nariyoo.com/python-how-to-run-selenium-in-google-colab/) useful."],"metadata":{"id":"qbv98fDoplY8"}}],"metadata":{"kernelspec":{"name":"python3","language":"python","display_name":"Python 3"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":0}