web scraping tweaks
zstumgoren committed Apr 7, 2024
1 parent 9cbca85 commit 9b987fe
Showing 2 changed files with 5 additions and 5 deletions.
4 changes: 2 additions & 2 deletions content/web_scraping/skip_scraping_cheat.ipynb
@@ -78,7 +78,7 @@
"- Clicking on the web request for the API call\n",
"- Heading over to the `Headers` tab for the web request\n",
"\n",
"In the information panel, you should see a downright awful URL. It contains a boatload of URL parameters after the `?` in the form `key=value` pairs, separated by ampersands (`&`). These are variables of sorts that instruct the API on what data to return. Normally, these parameters are configured by a web form filled out by a human visiting the website.\n",
"In the information panel, you should see a downright awful URL. It contains a boatload of URL parameters after the `?` in the form of `key=value` pairs, separated by ampersands (`&`). These are variables of sorts that instruct the API on what data to return. Normally, these parameters are configured by a web form filled out by a human visiting the website.\n",
"\n",
"If you look close, you may notice that the URL parameters include one particularly interesting morsel: `pageSize=20`\n",
"\n",
@@ -103,7 +103,7 @@
"\n",
"There was no need to scrape the search page, fill out a form, get the results back, and then page through the search results, extracting data points from HTML along the way. If that sounds painful and error-prone, you have good instincts. It's a workable solution, but in this case it's total overkill.\n",
"\n",
"Instead, we gave the site a phsyical exam (sorry, had to sneak one more in...) and realized that we could skip the scraping entirely and just grab the data.\n",
"Instead, we gave the site a [phsyical exam](dissecting_websites.ipynb) and realized that we could skip the scraping entirely and just grab the data.\n",
"\n",
"If you've never dissected a website like this before, all of the above likely seems like magic. It might even feel like this process would take just as long as writing a web scraper. But you'd be wrong. As you gain comfort with dissecting websites, the techniques described here will take you minutes -- perhaps even seconds -- on many sites.\n",
"\n",
6 changes: 3 additions & 3 deletions content/web_scraping/wysiwyg_scraping.ipynb
@@ -15,7 +15,7 @@
"\n",
"Why bring this up?\n",
"\n",
"Because it's a useful analogy for web pages. Back in the days of yore, many (perhaps most?) websites followed the WYSIWYG principle. These were simpler times, when the the content displayed on a web page closely matched the HTML in the underlying document for a page.\n",
"Because it's a useful analogy for web pages. Back in the days of yore, many (perhaps most?) websites followed the WYSIWYG principle. These were simpler times, when the content displayed on a web page closely matched the HTML in the underlying document for a page.\n",
"\n",
"If your web browser showed a table of data, it was quite likely that you'd find a `<table>` element somewhere in the page's HTML. \n",
"\n",
@@ -424,7 +424,7 @@
" fields[6].text.strip()\n",
" ]\n",
" # Mash up the headers with the field values into a dictionary\n",
" # - zip creates pairs each column header with the corresponding field in a two-element list\n",
" # - zip pairs each column header with the corresponding field in a two-element list\n",
" # - dict transforms the list of column/value pairs into a dictionary\n",
" bank_data = dict(zip(column_names, field_values))\n",
" all_banks.append(bank_data)\n",
@@ -456,7 +456,7 @@
"id": "d57695a8-f62b-4c6e-9615-0afdd73bc3c3",
"metadata": {},
"source": [
"Does that number match the count on the FDIC site? And in their download CSV?"
"Does that number match the count on the FDIC site? And in their downloadable CSV?"
]
},
{