{"cells":[{"cell_type":"markdown","metadata":{"id":"J7BmBhj_hd5n"},"source":["# Advanced web scraping\n","\n","## Before you begin\n","\n","Before you start web scraping, make sure to consider what you're\n","doing. Does your scraping violate the site's terms of service? Will it inconvenience the site or its other users? Per Uncle Ben: with great power comes great responsibility (WGPCGR).\n","\n","Also, before you begin web scraping, look for a data download option\n","or an existing solution. Someone has probably run up against the same\n","problem and worked it out. For example, we're going to scrape some\n","Wikipedia tables, for which there are already many other solutions,\n","including the MediaWiki\n","[API](https://www.mediawiki.org/wiki/API:Main_page).\n","\n","## Basic web scraping\n","\n","We covered this in the last chapter. However, let's do an example of static page parsing just to get\n","started. Consider scraping the table of the top 10 deadliest tornadoes from\n","[Wikipedia](https://en.wikipedia.org/wiki/List_of_natural_disasters_by_death_toll). First,\n","we open the URL, then parse it using BeautifulSoup, then load it into\n","a pandas DataFrame.\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":363},"id":"Elc1V_7fhd5s","executionInfo":{"status":"ok","timestamp":1713234049778,"user_tz":240,"elapsed":1350,"user":{"displayName":"Brian Caffo","userId":"07979705296072332292"}},"outputId":"b8b7e530-7c98-4f6b-c3bc-fc442463121a"},"source":["from io import StringIO\n","from urllib.request import urlopen\n","from bs4 import BeautifulSoup as bs\n","import pandas as pd\n","url = \"https://en.wikipedia.org/wiki/List_of_natural_disasters_by_death_toll\"\n","html = urlopen(url)\n","# Collect every <table> element on the page\n","parsed = bs(html, 'html.parser').find_all(\"table\")\n","# Wrap in StringIO: passing a literal HTML string to read_html is deprecated in recent pandas\n","# The table index depends on the page layout and may shift as the article is edited\n","pd.read_html(StringIO(str(parsed)))[11]"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Rank Death toll Event \\\n","0 1.0 1300 The Daulatpur–Saturia tornado \n","1 2.0 751 The Tri-State tornado outbreak \n","2 3.0 681 1973 Dhaka tornado \n","3 4.0 660 1969 East 
Pakistan tornado \n","4 5.0 600 The Valletta, Malta tornado \n","5 6.0 500 The 1851 Sicily tornadoes \n","6 6.0 500 Narail-Magura tornado \n","7 6.0 500 Madaripur-Shibchar tornado \n","8 9.0 400 The 1984 Soviet Union tornado outbreak \n","9 10.0 317 The Great Natchez Tornado \n","\n"," Location Date \n","0 Manikganj, Bangladesh 1989 \n","1 United States (Missouri–Illinois–Indiana) 1925 \n","2 Bangladesh 1973 \n","3 East Pakistan (now Bangladesh) 1969 \n","4 Malta 1551 or 1556 \n","5 Sicily, Two Sicilies (now Italy) 1851 \n","6 Jessore, East Pakistan, Pakistan (now Bangladesh) 1964 \n","7 Bangladesh 1977 \n","8 Soviet Union (Volga Federal District, Central ... 1984 \n","9 United States (Mississippi–Louisiana) 1840 "],"text/html":["\n","
\n","|  | Rank | Death toll | Event | Location | Date |\n","
|---|---|---|---|---|---|\n","
| 0 | 1.0 | 1300 | The Daulatpur–Saturia tornado | Manikganj, Bangladesh | 1989 |\n","
| 1 | 2.0 | 751 | The Tri-State tornado outbreak | United States (Missouri–Illinois–Indiana) | 1925 |\n","
| 2 | 3.0 | 681 | 1973 Dhaka tornado | Bangladesh | 1973 |\n","
| 3 | 4.0 | 660 | 1969 East Pakistan tornado | East Pakistan (now Bangladesh) | 1969 |\n","
| 4 | 5.0 | 600 | The Valletta, Malta tornado | Malta | 1551 or 1556 |\n","
| 5 | 6.0 | 500 | The 1851 Sicily tornadoes | Sicily, Two Sicilies (now Italy) | 1851 |\n","
| 6 | 6.0 | 500 | Narail-Magura tornado | Jessore, East Pakistan, Pakistan (now Bangladesh) | 1964 |\n","
| 7 | 6.0 | 500 | Madaripur-Shibchar tornado | Bangladesh | 1977 |\n","
| 8 | 9.0 | 400 | The 1984 Soviet Union tornado outbreak | Soviet Union (Volga Federal District, Central ... | 1984 |\n","
| 9 | 10.0 | 317 | The Great Natchez Tornado | United States (Mississippi–Louisiana) | 1840 |\n","