Merge pull request #1 from shivendrra/dev
pulling new build changes from "dev" branch
shivendrra authored Jul 25, 2024
2 parents 50a998a + bf9e04f commit 16c155d
Showing 33 changed files with 546 additions and 863 deletions.
9 changes: 9 additions & 0 deletions .gitignore
@@ -1,3 +1,12 @@
*.pyc
*.pyo
__pycache__/
*.py[cod]
*.exe

build
.vscode

# extras
*.env
Datasets
215 changes: 97 additions & 118 deletions README.md
@@ -1,153 +1,132 @@
# web-graze

## Introduction
This repo contains code that helps you scrape data from various sites on the internet, like Wikipedia, Britannica, YouTube, etc.

## How to use
### Britannica Scrapper
It scrapes web pages of [britannica.com](https://www.britannica.com/) to generate the data.

```python
from britannica import Scrapper
bs = Scrapper(search_queries=['antarctica', 'america', 'continents'], max_limit=10)
bs(out_file='../scrapped_data.txt')
```

I've made a sample `search_queries.json` that contains a few keywords that can be used to scrape the pages. You can use your own, though.

```python
from britannica import searchQueries

queries = searchQueries()
print(queries())
```

This repository contains a collection of scripts to scrape content from various sources like YouTube, Wikipedia, and Britannica. It includes functionality to download video captions from YouTube, scrape Wikipedia articles, and fetch content from Britannica.

## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
  - [YouTube Scraper](#youtube-scraper)
  - [Wikipedia Scraper](#wikipedia-scraper)
  - [Britannica Scraper](#britannica-scraper)
- [Configuration](#configuration)
- [Logging](#logging)

## Installation

1. **Clone the repository:**
   ```sh
   git clone https://github.com/shivendrra/web-graze.git
   cd web-graze
   ```

2. **Create and activate a virtual environment:**
   ```sh
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install the required packages:**
   ```sh
   pip install -r requirements.txt
   ```

## Usage

### YouTube Scraper

The YouTube scraper fetches video captions from a list of channels.

#### Configuration
- Add your YouTube API key to a `.env` file:
  ```env
  yt_key=YOUR_API_KEY
  ```

- Create a `channelIds.json` file with the list of channel IDs:
  ```json
  [
    "UC_x5XG1OV2P6uZZ5FSM9Ttw",
    "UCJ0-OtVpF0wOKEqT2Z1HEtA"
  ]
  ```

#### Running the Scraper

```python
import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('yt_key')

from graze import youtube

scraper = youtube(api_key=api_key, filepath='./output.txt')
scraper()
```

### Wikipedia Scrapper
It scrapes web pages from [wikipedia.com](https://en.wikipedia.org/) to generate data for later use.
It also has an `extra_url=True` option: when enabled, it fetches the new URLs found on the initial query pages and scrapes those pages too.

```python
from wikipedia import WikiScraper
scrape = WikiScraper()
scrape(search_queries=["Antarctica", "Colonization", "World war"], out_file=out_file, extra_url=True)
```
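For intuition, the `extra_url` option described above amounts to harvesting links from each fetched page and queueing them for a second scraping pass. Below is a minimal sketch of that idea, my own illustration using `requests` and `BeautifulSoup` rather than the scraper's actual internals:

```python
import requests
from bs4 import BeautifulSoup

def harvest_wiki_links(url, headers=None):
    # Fetch one Wikipedia page and collect the /wiki/ article links it contains,
    # so they can be scraped as well (roughly what extra_url=True implies).
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        return []
    soup = BeautifulSoup(r.content, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('/wiki/') and ':' not in href:  # skip Special:, File:, etc.
            links.append('https://en.wikipedia.org' + href)
    return links

extra_pages = harvest_wiki_links('https://en.wikipedia.org/wiki/Antarctica')
print(f"{len(extra_pages)} extra pages queued for scraping")
```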

I've included sample `search_queries` that can be used to scrape certain data. You're free to use your own queries.

```python
from wikipedia import WikiQueries

queries = WikiQueries()
print(queries())
```

If you're downloading XML dumps from Wikipedia, e.g. the [Dump Page for March 2024](https://dumps.wikimedia.org/wikidatawiki/20240301/), use `xml_parser.py` to convert the .xml dump into a .txt file containing all target URLs, and then run `WikiXMLScraper()` to generate a large .txt file.

```python
import timeit
from wikipedia import WikiXMLScraper

scraper = WikiXMLScraper()
url_file = 'extracted_urls.txt'
output_file = 'Datasets/wiki_110k.txt'

start_time = timeit.default_timer()
scraper.scrape_from_file(url_file, output_file, batch_size=500)
print(f"Total time taken: {timeit.default_timer() - start_time:.2f} mins")
```

### Wikipedia Scraper

The Wikipedia scraper generates target URLs from the provided queries, fetches the complete web page, and writes it to a file.

#### Configuration
- Define your search queries in `queries.py`:
  ```python
  class WikiQueries:
      def __init__(self):
          self.search_queries = ["topic1", "topic2", "topic3"]

      def __call__(self):
          return self.search_queries
  ```

#### Running the Scraper

```python
from graze import wikipedia

wiki = wikipedia()
wiki(out_file='./output.txt')
```

### Transcripts Collector
It uses the [YouTube V3 API](https://developers.google.com/youtube/v3/docs) to fetch the videos uploaded by a particular channel and then generates `video_ids`, which are used to fetch transcripts with the [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api/tree/master). `max_results` can be set up to 100, not more than that.

```python
import os
api_key = os.getenv('yt_secret_key')
out_file = 'transcripts.txt'

from youtube_transcripts import TranscriptsCollector
ts = TranscriptsCollector(api_key=api_key)
ts(channel_ids=["UCb_MAhL8Thb3HJ_wPkH3gcw"], target_file=out_file, max_results=100)
```

I've included a list of more than 100 YouTube channel IDs in `channel_ids.json`. You can use those or swap in your own. These `channel_ids` can generate around 4GB of transcripts from over ~200k videos.

It takes a lot of time, though; for me, it took around ~55hrs to fetch transcripts from 167k videos.

```
// channel_ids.json

[
  "UCb_MAhL8Thb3HJ_wPkH3gcw",
  "UCA295QVkf9O1RQ8_-s3FVXg",
  "UCpFFItkfZz1qz5PpHpqzYBw",
  ....
  "UCiMhD4jzUqG-IgPzUmmytRQ",
  "UCB0JSO6d5ysH2Mmqz5I9rIw",
  "UC-lHJZR3Gqxm24_Vd_AJ5Yw"
]
```

Or use `snippets.py` to import the channel list into your code directly; check it if you want to add new channel IDs, or if you're curious to see the channel names.

```python
# importing snippets

from youtube_transcripts import SampleSnippets

snippets = SampleSnippets()
print(snippets())
```

### Britannica Scraper

The Britannica scraper fetches content based on search queries and writes it to a file.

#### Configuration
- Define your search queries in `queries.py`:
  ```python
  class BritannicaQueries:
      def __init__(self):
          self.search_queries = ["topic1", "topic2", "topic3"]

      def __call__(self):
          return self.search_queries
  ```

#### Running the Scraper

```python
from graze import britannica

scraper = britannica(max_limit=20)
scraper(out_file='./output.txt')
```

## Configuration

- **API Keys and other secrets:** Ensure that your API keys and other sensitive data are stored securely and not hard-coded into your scripts.

- **Search Queries:** The search queries for the Wikipedia and Britannica scrapers are defined in `queries.py`.
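For example, the YouTube API key shown earlier can stay out of the source by loading it from `.env` at runtime. A small sketch of the pattern already used in the YouTube section, assuming `python-dotenv` is installed:

```python
import os
from dotenv import load_dotenv

load_dotenv()                    # reads the .env file in the working directory
api_key = os.getenv('yt_key')    # same variable name as in the YouTube example

if api_key is None:
    raise RuntimeError("yt_key is not set; add it to your .env file")
```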
## File Structure
```
.
├── britannica
│ ├── __init__.py
│ ├── main.py
│ ├── queries.py
│ ├── requirements.txt
│ ├── search_queries.json
│ ├── URLFetcher.py
├── javascript
│ ├── customLinkFinder.js
│ ├── customSearch.js
│ ├── customWebScrapper.js
│ ├── googleCustomSearch.js
│ ├── sample.js
│ ├── webDataScrapping.js
├── run.py
│ ├── run_britannica.py
│ ├── run_transcripts.py
│ ├── run_wiki.py
├── wikipedia
│ ├── __init__.py
│ ├── fetch_urls.py
│ ├── main.py
│ ├── queries.py
│ ├── requirements.txt
│ ├── search_queries.json
├── youtube_transcripts
│ ├── __init__.py
│ ├── basic.py
│ ├── channe_ids_snippet.json
│ ├── channel_ids.json
│ ├── main.py
│ ├── requirements.txt
│ ├── snippets.py
│ ├── version2.py
├── .gitignore
├── CONTRIBUTING.md
├── LargeDataCollector.ipynb
├── LICENSE
├── README.md
├── test.py
├── wiki_extractor.py
├── xml_parser.py
```
## Logging

The YouTube scraper logs errors to `youtube_fetch.log`. Make sure to check this file for detailed error messages and troubleshooting information.
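The diff doesn't show the logging setup itself, but a standard-library configuration along these lines (an assumption, not taken from the code) would produce such a file:

```python
import logging

# Route errors to the file name mentioned above.
logging.basicConfig(
    filename='youtube_fetch.log',
    level=logging.ERROR,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
)
logger = logging.getLogger('graze.youtube')

try:
    raise RuntimeError("caption download failed")   # stand-in for a real fetch error
except RuntimeError:
    logger.exception("error while fetching captions")
```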

## Contribution
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.

Check out [CONTRIBUTING.md](https://github.com/shivendrra/web-graze/blob/main/CONTRIBUTING.md) for more details.

## License
This project is licensed under the MIT License.
3 changes: 0 additions & 3 deletions britannica/__init__.py

This file was deleted.

50 changes: 0 additions & 50 deletions britannica/queries.py

This file was deleted.

3 changes: 3 additions & 0 deletions graze/__init__.py
@@ -0,0 +1,3 @@
from .youtube.base import Youtube as youtube
from .britannica.main import Britannica as britannica
from .wikipedia.main import WikiScraper as wikipedia
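In other words, the package root re-exports the three scrapers under lowercase aliases, which is what the README examples rely on. A quick check, assuming the package is importable:

```python
from graze import youtube, wikipedia, britannica

# Each alias is bound to the class defined in its submodule.
print(youtube, wikipedia, britannica)
```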
43 changes: 20 additions & 23 deletions britannica/URLFetcher.py → graze/britannica/base.py
@@ -1,7 +1,6 @@
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

import time
class BritannicaUrls:
def __init__(self, search_queries, max_limit):
self.max_limit = max_limit
@@ -13,23 +12,25 @@ def build_url(self, query, pageNo):
url = f"https://www.britannica.com/search?query={formattedQuery}&page={pageNo}"
return url

def get_target_url(self, targets):
r = requests.get(targets, headers=self.headers)
list_url = []

if r.status_code == 200:
html_content = r.content
soup = BeautifulSoup(html_content, 'html.parser')
fetched_urls = soup.find_all('a', attrs={'class': 'font-weight-bold font-18'})
list_url.extend([url.get('href') for url in fetched_urls])
return list_url

else:
print(f"skipping this {targets}")
def get_target_url(self, target_url):
while True:
r = requests.get(target_url, headers=self.headers)
if r.status_code == 200:
html_content = r.content
soup = BeautifulSoup(html_content, 'html.parser')
fetched_urls = soup.find_all('a', class_='md-crosslink')
list_url = [url.get('href') for url in fetched_urls]
return list_url

elif r.status_code == 429:
print(f"Rate limit exceeded. Waiting 30secs before retrying: {target_url}")
time.sleep(30)
else:
print(f"Skipping this URL due to status code {r.status_code}: {target_url}")
return []

def generate_urls(self, progress_bar=None):
page_urls = []
total_iterations = len(self.search_queries) * self.max_limit
current_iteration = 0

for query in self.search_queries:
@@ -38,13 +39,9 @@
target_url = self.build_url(query, pageNo)
pageNo += 1
new_url = self.get_target_url(target_url)
page_urls.extend(new_url)

# Update the progress bar
if new_url:
page_urls.extend(new_url)
current_iteration += 1
if progress_bar:
progress_bar.update(1)
return page_urls

if __name__ == '__main__':
bs = BritannicaUrls(search_queries=['antarctica', 'usa'], max_limit=10)
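For context, here is one way the class above could be driven end-to-end with a progress bar — an illustrative sketch, not part of this commit:

```python
from tqdm import tqdm

bs = BritannicaUrls(search_queries=['antarctica', 'usa'], max_limit=10)

# generate_urls ticks the bar once per (query, page) pair.
with tqdm(total=len(bs.search_queries) * bs.max_limit) as progress_bar:
    urls = bs.generate_urls(progress_bar=progress_bar)

print(f"collected {len(urls)} article URLs")
```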

0 comments on commit 16c155d
