Merge pull request #1 from shivendrra/dev
pulling new build changes from "dev" branch
shivendrra authored Jul 25, 2024
2 parents 50a998a + bf9e04f commit 16c155d
Showing 33 changed files with 546 additions and 863 deletions.
9 changes: 9 additions & 0 deletions .gitignore
@@ -1,3 +1,12 @@
*.pyc
*.pyo
__pycache__/
*.py[cod]
*.exe

build
.vscode

# extras
*.env
Datasets
215 changes: 97 additions & 118 deletions README.md
@@ -1,153 +1,132 @@
# web-graze

## Introduction
This repo contains code that helps you scrape data from various sites on the internet, like Wikipedia, Britannica, YouTube, etc.

## How to use
### Britannica Scrapper
It scrapes web pages of [britannica.com](https://www.britannica.com/) to generate the data.

```python
from britannica import Scrapper
bs = Scrapper(search_queries=['antarctica', 'america', 'continents'], max_limit=10)
bs(out_file='../scrapped_data.txt')
```

I've made a sample `search_queries.json` that contains a few keywords that can be used to scrape the pages. You can use your own, though.

```python
from britannica import searchQueries

queries = searchQueries()
print(queries())
```

This repository contains a collection of scripts to scrape content from various sources like YouTube, Wikipedia, and Britannica. It includes functionality to download video captions from YouTube, scrape Wikipedia articles, and fetch content from Britannica.

## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
  - [YouTube Scraper](#youtube-scraper)
  - [Wikipedia Scraper](#wikipedia-scraper)
  - [Britannica Scraper](#britannica-scraper)
- [Configuration](#configuration)
- [Logging](#logging)

## Installation

1. **Clone the repository:**
   ```sh
   git clone https://github.com/shivendrra/web-graze.git
   cd web-graze
   ```

2. **Create and activate a virtual environment:**
   ```sh
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install the required packages:**
   ```sh
   pip install -r requirements.txt
   ```

## Usage

### YouTube Scraper

The YouTube scraper fetches video captions from a list of channels.

#### Configuration
- Add your YouTube API key to a `.env` file:
  ```env
  yt_key=YOUR_API_KEY
  ```

- Create a `channelIds.json` file with the list of channel IDs:
  ```json
  [
    "UC_x5XG1OV2P6uZZ5FSM9Ttw",
    "UCJ0-OtVpF0wOKEqT2Z1HEtA"
  ]
  ```

#### Running the Scraper

```python
import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('yt_key')

from graze import youtube

scraper = youtube(api_key=api_key, filepath='./output.txt')
scraper()
```

### Wikipedia Scrapper
It scrapes web pages from [wikipedia.com](https://en.wikipedia.org/) to generate data for later use.
It also has an `extra_url=True` option: when enabled, it fetches the new URLs found on the initial query pages and scrapes those pages too.

```python
from wikipedia import WikiScraper
scrape = WikiScraper()
scrape(search_queries=["Antarctica", "Colonization", "World war"], out_file=out_file, extra_url=True)
```
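For intuition, the `extra_url` option described above amounts to harvesting links from each fetched page and queueing them for a second scraping pass. Below is a minimal sketch of that idea, my own illustration using `requests` and `BeautifulSoup` rather than the scraper's actual internals:

```python
import requests
from bs4 import BeautifulSoup

def harvest_wiki_links(url, headers=None):
    # Fetch one Wikipedia page and collect the /wiki/ article links it contains,
    # so they can be scraped as well (roughly what extra_url=True implies).
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        return []
    soup = BeautifulSoup(r.content, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('/wiki/') and ':' not in href:  # skip Special:, File:, etc.
            links.append('https://en.wikipedia.org' + href)
    return links

extra_pages = harvest_wiki_links('https://en.wikipedia.org/wiki/Antarctica')
print(f"{len(extra_pages)} extra pages queued for scraping")
```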

I've included sample `search_queries` that can be used to scrape certain data. You're free to use your own queries.

```python
from wikipedia import WikiQueries

queries = WikiQueries()
print(queries())
```

If you're downloading XML dumps from Wikipedia, e.g. the [Dump Page for March 2024](https://dumps.wikimedia.org/wikidatawiki/20240301/), use `xml_parser.py` to convert the .xml dump into a .txt file containing all target URLs, and then run `WikiXMLScraper()` to generate a large .txt file.

```python
import timeit
from wikipedia import WikiXMLScraper

scraper = WikiXMLScraper()
url_file = 'extracted_urls.txt'
output_file = 'Datasets/wiki_110k.txt'

start_time = timeit.default_timer()
scraper.scrape_from_file(url_file, output_file, batch_size=500)
print(f"Total time taken: {timeit.default_timer() - start_time:.2f} mins")
```

### Wikipedia Scraper

The Wikipedia scraper generates target URLs from the provided queries, fetches the complete web page, and writes it to a file.

#### Configuration
- Define your search queries in `queries.py`:
  ```python
  class WikiQueries:
      def __init__(self):
          self.search_queries = ["topic1", "topic2", "topic3"]

      def __call__(self):
          return self.search_queries
  ```

#### Running the Scraper

```python
from graze import wikipedia

wiki = wikipedia()
wiki(out_file='./output.txt')
```

### Transcripts Collector
It uses the [YouTube V3 API](https://developers.google.com/youtube/v3/docs) to fetch the videos uploaded by a particular channel and then generates `video_ids`, which are used to fetch transcripts with the [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api/tree/master). `max_results` can be set up to 100, not more than that.

```python
import os
api_key = os.getenv('yt_secret_key')
out_file = 'transcripts.txt'

from youtube_transcripts import TranscriptsCollector
ts = TranscriptsCollector(api_key=api_key)
ts(channel_ids=["UCb_MAhL8Thb3HJ_wPkH3gcw"], target_file=out_file, max_results=100)
```

I've included a list of more than 100 YouTube channel IDs in `channel_ids.json`. You can use those or swap in your own. These `channel_ids` can generate around 4GB of transcripts from over ~200k videos.

It takes a lot of time, though; for me, it took around ~55hrs to fetch transcripts from 167k videos.

```
// channel_ids.json

[
  "UCb_MAhL8Thb3HJ_wPkH3gcw",
  "UCA295QVkf9O1RQ8_-s3FVXg",
  "UCpFFItkfZz1qz5PpHpqzYBw",
  ....
  "UCiMhD4jzUqG-IgPzUmmytRQ",
  "UCB0JSO6d5ysH2Mmqz5I9rIw",
  "UC-lHJZR3Gqxm24_Vd_AJ5Yw"
]
```

Or use `snippets.py` to import the channel list into your code directly; check it if you want to add new channel IDs, or if you're curious to see the channel names.

```python
# importing snippets

from youtube_transcripts import SampleSnippets

snippets = SampleSnippets()
print(snippets())
```

### Britannica Scraper

The Britannica scraper fetches content based on search queries and writes it to a file.

#### Configuration
- Define your search queries in `queries.py`:
  ```python
  class BritannicaQueries:
      def __init__(self):
          self.search_queries = ["topic1", "topic2", "topic3"]

      def __call__(self):
          return self.search_queries
  ```

#### Running the Scraper

```python
from graze import britannica

scraper = britannica(max_limit=20)
scraper(out_file='./output.txt')
```

## Configuration

- **API Keys and other secrets:** Ensure that your API keys and other sensitive data are stored securely and not hard-coded into your scripts.

- **Search Queries:** The search queries for the Wikipedia and Britannica scrapers are defined in `queries.py`.
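For example, the YouTube API key shown earlier can stay out of the source by loading it from `.env` at runtime. A small sketch of the pattern already used in the YouTube section, assuming `python-dotenv` is installed:

```python
import os
from dotenv import load_dotenv

load_dotenv()                    # reads the .env file in the working directory
api_key = os.getenv('yt_key')    # same variable name as in the YouTube example

if api_key is None:
    raise RuntimeError("yt_key is not set; add it to your .env file")
```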
## File Structure
```
.
├── britannica
│ ├── __init__.py
│ ├── main.py
│ ├── queries.py
│ ├── requirements.txt
│ ├── search_queries.json
│ ├── URLFetcher.py
├── javascript
│ ├── customLinkFinder.js
│ ├── customSearch.js
│ ├── customWebScrapper.js
│ ├── googleCustomSearch.js
│ ├── sample.js
│ ├── webDataScrapping.js
├── run.py
│ ├── run_britannica.py
│ ├── run_transcripts.py
│ ├── run_wiki.py
├── wikipedia
│ ├── __init__.py
│ ├── fetch_urls.py
│ ├── main.py
│ ├── queries.py
│ ├── requirements.txt
│ ├── search_queries.json
├── youtube_transcripts
│ ├── __init__.py
│ ├── basic.py
│ ├── channe_ids_snippet.json
│ ├── channel_ids.json
│ ├── main.py
│ ├── requirements.txt
│ ├── snippets.py
│ ├── version2.py
├── .gitignore
├── CONTRIBUTING.md
├── LargeDataCollector.ipynb
├── LICENSE
├── README.md
├── test.py
├── wiki_extractor.py
├── xml_parser.py
```
## Logging

The YouTube scraper logs errors to `youtube_fetch.log`. Make sure to check this file for detailed error messages and troubleshooting information.
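The diff doesn't show the logging setup itself, but a standard-library configuration along these lines (an assumption, not taken from the code) would produce such a file:

```python
import logging

# Route errors to the file name mentioned above.
logging.basicConfig(
    filename='youtube_fetch.log',
    level=logging.ERROR,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
)
logger = logging.getLogger('graze.youtube')

try:
    raise RuntimeError("caption download failed")   # stand-in for a real fetch error
except RuntimeError:
    logger.exception("error while fetching captions")
```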

## Contribution
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.

Check out [CONTRIBUTING.md](https://github.com/shivendrra/web-graze/blob/main/CONTRIBUTING.md) for more details.

## License
This project is licensed under the MIT License.
3 changes: 0 additions & 3 deletions britannica/__init__.py

This file was deleted.

50 changes: 0 additions & 50 deletions britannica/queries.py

This file was deleted.

3 changes: 3 additions & 0 deletions graze/__init__.py
@@ -0,0 +1,3 @@
from .youtube.base import Youtube as youtube
from .britannica.main import Britannica as britannica
from .wikipedia.main import WikiScraper as wikipedia
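In other words, the package root re-exports the three scrapers under lowercase aliases, which is what the README examples rely on. A quick check, assuming the package is importable:

```python
from graze import youtube, wikipedia, britannica

# Each alias is bound to the class defined in its submodule.
print(youtube, wikipedia, britannica)
```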
43 changes: 20 additions & 23 deletions britannica/URLFetcher.py → graze/britannica/base.py
@@ -1,7 +1,6 @@
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

import time
class BritannicaUrls:
def __init__(self, search_queries, max_limit):
self.max_limit = max_limit
@@ -13,23 +12,25 @@ def build_url(self, query, pageNo):
url = f"https://www.britannica.com/search?query={formattedQuery}&page={pageNo}"
return url

def get_target_url(self, targets):
r = requests.get(targets, headers=self.headers)
list_url = []

if r.status_code == 200:
html_content = r.content
soup = BeautifulSoup(html_content, 'html.parser')
fetched_urls = soup.find_all('a', attrs={'class': 'font-weight-bold font-18'})
list_url.extend([url.get('href') for url in fetched_urls])
return list_url

else:
print(f"skipping this {targets}")
def get_target_url(self, target_url):
while True:
r = requests.get(target_url, headers=self.headers)
if r.status_code == 200:
html_content = r.content
soup = BeautifulSoup(html_content, 'html.parser')
fetched_urls = soup.find_all('a', class_='md-crosslink')
list_url = [url.get('href') for url in fetched_urls]
return list_url

elif r.status_code == 429:
print(f"Rate limit exceeded. Waiting 30secs before retrying: {target_url}")
time.sleep(30)
else:
print(f"Skipping this URL due to status code {r.status_code}: {target_url}")
return []

def generate_urls(self, progress_bar=None):
page_urls = []
total_iterations = len(self.search_queries) * self.max_limit
current_iteration = 0

for query in self.search_queries:
@@ -38,13 +39,9 @@
target_url = self.build_url(query, pageNo)
pageNo += 1
new_url = self.get_target_url(target_url)
page_urls.extend(new_url)

# Update the progress bar
if new_url:
page_urls.extend(new_url)
current_iteration += 1
if progress_bar:
progress_bar.update(1)
return page_urls

if __name__ == '__main__':
bs = BritannicaUrls(search_queries=['antarctica', 'usa'], max_limit=10)
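For context, here is one way the class above could be driven end-to-end with a progress bar — an illustrative sketch, not part of this commit:

```python
from tqdm import tqdm

bs = BritannicaUrls(search_queries=['antarctica', 'usa'], max_limit=10)

# generate_urls ticks the bar once per (query, page) pair.
with tqdm(total=len(bs.search_queries) * bs.max_limit) as progress_bar:
    urls = bs.generate_urls(progress_bar=progress_bar)

print(f"collected {len(urls)} article URLs")
```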

0 comments on commit 16c155d
