Tighten up top matter and other fixes for website personalities NB
zstumgoren committed Apr 8, 2024
1 parent 57d6a31 commit 560c325
Showing 1 changed file with 7 additions and 10 deletions.
17 changes: 7 additions & 10 deletions content/web_scraping/website_personalities.ipynb
@@ -9,16 +9,13 @@
"\n",
"Every website has a personality.\n",
"\n",
"\n",
"On a technical level, web scraping typically involves gathering web pages or other files from a website. This process can be automated by understanding the anatomy of a site -- how pages are structured, URL patterns, and other \"personality traits\" of a site.\n",
"\n",
"Scraping can be more or less difficult depending on the nature of the site. \n",
"Scraping can be more or less difficult depending on the \"personality traits\" of a website. \n",
"\n",
"A friendly site with no dynamic content and predictable URL patterns could be a quick job.\n",
"\n",
"A not-so-friendly site might \"feature\" web forms, randomized URLs, cookies or sessions, dynamically generated content, password-based logins, CAPTCHAs, etc.\n",
"A not-so-friendly site might \"feature\" web forms, randomized URLs, cookies/sessions, dynamically generated content, password-based logins, CAPTCHAs, etc.\n",
"\n",
"Sites often use a combination of these strategies, so it's important to spend time [learning how a site works](dissecting_websites.ipynb) so you can devise an appropriate scraping strategy.\n",
"Sites often use a combination of these strategies, so it's important to spend time [understanding its anatomy](dissecting_websites.ipynb) so you can devise an appropriate scraping strategy.\n",
"\n",
"Below are some high-level challenges and related technical strategies for common scraping scenarios. Keep in mind that you may run into sites that require you to combine approaches -- e.g. basic scraping techniques with more advanced stateful web scraping.\n",
"\n",
@@ -53,7 +50,7 @@
"> <https://catalog.data.gov/dataset/national-student-loan-data-system>\n",
"\n",
"Some sites use so-called [query strings](https://en.wikipedia.org/wiki/Query_string), which are\n",
-    extra search parameters added to a URL as one or more `key=value` pairs. The pairs follow a question mark and are separated by ampersands (`&`). Here are two examples:\n",
+    extra search parameters added to a URL as one or more `key=value` pairs. The pairs follow a question mark (`?`) and are separated by ampersands (`&`). Here are two examples:\n",
"\n",
"> <https://www.whitehouse.gov/?s=coronavirus>\n",
"> <https://www.governmentjobs.com/careers/santaclara?department%5B0%5D=County%20Counsel&department%5B1%5D=County%20Executive&sort=Salary%7CDescending>\n",
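 <https">
Query strings like the ones above can be taken apart or rebuilt programmatically with Python's standard library. A minimal sketch using `urllib.parse`:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Parse the key=value pairs out of an existing URL's query string.
url = "https://www.whitehouse.gov/?s=coronavirus"
params = parse_qs(urlsplit(url).query)
print(params)  # {'s': ['coronavirus']}

# Go the other way: encode a dict of parameters into a query string
# (urlencode handles special-character escaping for you).
query = urlencode({"s": "coronavirus"})
print(f"https://www.whitehouse.gov/?{query}")
```

Constructing URLs this way lets a scraper loop through many search terms or pages without touching the site's search form at all.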
@@ -65,7 +62,7 @@
"\n",
"Often, you will have to fill out a search form to locate target data. Such forms can be handled in a few ways, depending on the nature of the site. If the form generates a predictable URL (perhaps using URL parameters), you can dig into the form options in the HTML and figure out how to dynamically construct the URL. You can test this by manually filling out and submitting the form and examining the URL of the resulting page.\n",
"\n",
"The website where officials in East Brandywine, PA post meeting documents is a good example.\n",
"The website where officials in East Brandywine, PA [post meeting documents](https://www.ebrandywine.org/AgendaCenter) is a good example.\n",
"\n",
"<img alt=\"Agenda website with form\" src=\"../files/scraping_agendas_pa.png\" style=\"vertical-align:bottom; border:2px solid #555; margin: 10px;\" width=\"350\">\n",
"\n",
@@ -89,12 +86,12 @@
"\n",
"Many sites use Javascript to dynamically add or transform page content ***after*** the page has loaded. This means that what you see in the source HTML using `View Page Source` will not match what you see in the browser (or the Elements tab of Chrome Developer Tools).\n",
"\n",
"Scraping such a page requires using a library such as [Playwright](drive_the_browser_robot.ipynb) or [Selenium](https://selenium-python.readthedocs.io/index.html), which use the \"web driver\" technology behind browsers such as Firefox to automate browser interactions.\n",
"Scraping such a page may require using a library such as [Playwright](drive_the_browser_robot.ipynb) or [Selenium](https://selenium-python.readthedocs.io/index.html), which use the \"web driver\" technology behind browsers such as Firefox to automate browser interactions.\n",
"\n",
"These tools give you access to the [Document Object Model](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction)\n",
"(DOM) -- the content as seen by a real web browser. The DOM is the internal representation of a page that reflects both the static HTML delivered to your browser *and* elements/styles dynamically added or manipulated by Javascript.\n",
"\n",
"Playwright/Selenium allow you to automate interactions with the browser -- the same as a human. These tools can be programmed to scroll down a page, step through a paginated list of results, take screenshots and download PDFs.\n",
"Playwright/Selenium allow you to automate interactions with the browser -- the same as a human. These tools can be programmed to scroll down a page, step through a paginated list of results, take screenshots, download PDFs, and much more.\n",
"\n",
"> Check out [Drive the Browser, Robot](drive_the_browser_robot.ipynb) for a tutorial using `Playwright`.\n",
"\n",
