clean up website personalities notebook
zstumgoren committed Apr 7, 2024
1 parent 9307ff2 commit 718fa48
Showing 1 changed file with 9 additions and 7 deletions.
16 changes: 9 additions & 7 deletions content/web_scraping/website_personalities.ipynb
@@ -18,13 +18,13 @@
"\n",
"A not-so-friendly site might \"feature\" web forms, randomized URLs, cookies or sessions, dynamically generated content, password-based logins, CAPTCHAs, etc.\n",
"\n",
"Sites often use a combination of these strategies, so it's important to spend time learning how a site works so you can devise an appropriate scraping strategy.\n",
"Sites often use a combination of these strategies, so it's important to spend time [learning how a site works](dissecting_websites.ipynb) so you can devise an appropriate scraping strategy.\n",
"\n",
"Below are some high-level challenges and related technical strategies for common scraping scenarios. Keep in mind that you may run into sites that require you to combine approaches -- e.g. basic scraping techniques with more advanced stateful web scraping.\n",
"\n",
"## Avoiding a scrape\n",
"\n",
"Some seemingly complex might be quite easy to \"scrape\". \n",
"Some seemingly complex sites might be quite easy to \"scrape\". \n",
"\n",
"> Check out [Skip the Scraping: Cheat with JSON](skip_scraping_cheat.ipynb).\n",
"\n",
@@ -34,7 +34,7 @@
"\n",
"The site doesn't use forms or require logins.\n",
"\n",
"It does not dynamically generate content, and does not use sessions/cookies. \n",
"It does not dynamically generate content (that you care about), and does not use sessions/cookies. \n",
"\n",
"In such cases, you can likely get away with simply using the [requests](https://requests.readthedocs.io/en/latest/) library to grab HTML pages and the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to parse and extract data from each page's HTML.\n",
"\n",
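The basic `requests` + `BeautifulSoup` pattern mentioned above can be sketched roughly as follows. The table structure and class names here are hypothetical stand-ins for whatever the target page actually contains; the snippet parses an inline HTML string so it runs without a network connection, with the `requests` call shown in a comment.

```python
from bs4 import BeautifulSoup

# In a real scrape you would fetch the page first, e.g.:
#   import requests
#   html = requests.get("https://example.com/meetings").text
# Here we parse an inline snippet (hypothetical structure) so the
# sketch runs offline.
html = """
<table>
  <tr><td class="date">2024-04-01</td><td class="title">Board meeting</td></tr>
  <tr><td class="date">2024-04-15</td><td class="title">Budget hearing</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    date = tr.find("td", class_="date").get_text(strip=True)
    title = tr.find("td", class_="title").get_text(strip=True)
    rows.append({"date": date, "title": title})

print(rows)
```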
@@ -65,19 +65,21 @@
"\n",
"Often, you will have to fill out a search form to locate target data. Such forms can be handled in a few ways, depending on the nature of the site. If the form generates a predictable URL (perhaps using URL parameters), you can dig into the form options in the HTML and figure out how to dynamically construct the URL. You can test this by manually filling out and submitting the form and examining the URL of the resulting page.\n",
"\n",
"The website where officials in East Brandywine, PA is a good example.\n",
"The website where officials in East Brandywine, PA post meeting documents is a good example.\n",
"\n",
"<img alt=\"Agenda website with form\" src=\"../files/scraping_agendas_pa.png\" style=\"vertical-align:bottom; border:2px solid #555; margin: 10px;\" width=\"350\">\n",
"\n",
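Constructing a predictable search URL from form options might look like the sketch below. The base URL and parameter names are hypothetical; in practice you would read them out of the form's HTML (the `<form action>` and `<input name>` attributes) and confirm them by submitting the form manually and inspecting the resulting URL.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameter names -- in practice, pulled from
# the form's HTML and verified by a manual form submission.
base_url = "https://example.com/meetings/search"
params = {"year": 2024, "board": "supervisors", "doc_type": "agenda"}

# A GET form submission is equivalent to requesting this URL directly,
# so you can generate and fetch these URLs in a loop with requests.get.
url = f"{base_url}?{urlencode(params)}"
print(url)
```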
"Many web forms use POST requests, where the form information is sent as part of the body of the web request (as opposed to embedded in the URL). \n",
"\n",
"In such cases, you can use a tool such as [requests.post](https://docs.python-requests.org/en/latest/user/quickstart/#more-complicated-post-requests) or Selenium to [fill out and submit](https://selenium-python.readthedocs.io/locating-elements.html#locating-by-id) the form.\n",
"In such cases, you can use a tool such as [requests.post](https://docs.python-requests.org/en/latest/user/quickstart/#more-complicated-post-requests) or Playwright/Selenium to [fill out and submit](https://selenium-python.readthedocs.io/locating-elements.html#locating-by-id) the form.\n",
"\n",
"> Check out [Drive the Browser, Robot](drive_the_browser_robot.ipynb) for a scraping example that fills out form fields.\n",
"\n",
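A minimal sketch of the POST approach, using hypothetical form field names (normally discovered by inspecting the form's HTML or your browser's network tab). Preparing the request instead of sending it shows how the form data is encoded into the request body without touching the network; in a real scrape you would call `requests.post(url, data=payload)`.

```python
import requests

# Hypothetical form fields for a meeting-search form.
payload = {"meeting_type": "council", "year": "2024"}

# requests.post("https://example.com/search", data=payload) would submit
# the form; preparing the request instead shows what would be sent.
req = requests.Request("POST", "https://example.com/search", data=payload)
prepared = req.prepare()

print(prepared.body)                     # the URL-encoded form body
print(prepared.headers["Content-Type"])  # the form submission content type
```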
"### Logging in\n",
"\n",
"Sites that require logins can often be handled by simply passing in your login credentials as part of a web form (see `Web Forms` above). \n",
"\n",
"The requests library provides several ways to [authenticate](https://docs.python-requests.org/en/latest/user/authentication/), or you can use a browser automation library such as [Playwright](https://playwright.dev/python/).\n",
"The `requests` library provides several ways to [authenticate](https://docs.python-requests.org/en/latest/user/authentication/), or you can use a browser automation library such as [Playwright](https://playwright.dev/python/).\n",
"\n",
    "> Check out [Drive the Browser, Robot](drive_the_browser_robot.ipynb).\n",
"\n",
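For sites that use HTTP basic auth specifically, the `requests` authentication support boils down to an `Authorization` header. A small sketch with hypothetical credentials and URL, again using a prepared request so nothing is actually sent:

```python
import requests
from requests.auth import HTTPBasicAuth

# Hypothetical credentials and URL. Preparing the request shows how
# basic auth is turned into an Authorization header; a real scrape
# would call requests.get(url, auth=HTTPBasicAuth("user", "pass")).
req = requests.Request(
    "GET", "https://example.com/private", auth=HTTPBasicAuth("user", "pass")
)
prepared = req.prepare()
print(prepared.headers["Authorization"])
```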
@@ -105,7 +107,7 @@
    "Scraping a session-based site requires you to manage the session in your code. The\n",
"requests library has support for [managing sessions](https://requests.readthedocs.io/en/latest/user/advanced/#session-objects).\n",
"\n",
"Alternatively, you can use a browswer-automation library such as [Playwright](https://playwright.dev/python/) or [Selenium](https://selenium-python.readthedocs.io/getting-started.html) to mimic a browser and get session management for \"free\".\n",
"Alternatively, you can use a browser-automation library such as [Playwright](https://playwright.dev/python/) or [Selenium](https://selenium-python.readthedocs.io/getting-started.html) to mimic a browser and get session management for \"free\".\n",
"\n",
"> Check out [Drive the Browser, Robot](drive_the_browser_robot.ipynb) for a tutorial using `Playwright`."
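A rough sketch of session management with `requests.Session`. The login URL, credentials, and cookie name are hypothetical; the cookie is set by hand here so the sketch runs offline, whereas on a real site the login POST would set it automatically.

```python
import requests

# A Session persists cookies (and headers) across requests, which is
# what session-based sites rely on.
session = requests.Session()

# In a real scrape, logging in would set the session cookie for you:
#   session.post("https://example.com/login", data={"user": "me", "pw": "..."})
# Here we set a cookie by hand so the sketch runs without a network.
session.cookies.set("sessionid", "abc123")

# Every later request made through this session sends the cookie back:
#   session.get("https://example.com/members-only")
print(session.cookies.get("sessionid"))
```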
]