Merge pull request #1 from shivendrra/dev
pulling new build changes from "dev" branch
Showing 33 changed files with 546 additions and 863 deletions.
`.gitignore`
@@ -1,3 +1,12 @@
*.pyc
*.pyo
__pycache__/
*.py[cod]
*.exe

build
.vscode

# extras
*.env
Datasets
`README.md`
@@ -1,153 +1,132 @@
# web-graze

This repository contains a collection of scripts to scrape content from various sources like YouTube, Wikipedia, and Britannica. It includes functionality to download video captions from YouTube, scrape Wikipedia articles, and fetch content from Britannica.

## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
  - [YouTube Scraper](#youtube-scraper)
  - [Wikipedia Scraper](#wikipedia-scraper)
  - [Britannica Scraper](#britannica-scraper)
- [Configuration](#configuration)
- [Logging](#logging)

## Installation

1. **Clone the repository:**
   ```sh
   git clone https://github.com/shivendrra/web-graze.git
   cd web-graze
   ```

2. **Create and activate a virtual environment:**
   ```sh
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install the required packages:**
   ```sh
   pip install -r requirements.txt
   ```
## Usage

### YouTube Scraper

The YouTube scraper fetches video captions from a list of channels.

#### Configuration
- Add your YouTube API key to a `.env` file:
  ```env
  yt_key=YOUR_API_KEY
  ```

- Create a `channelIds.json` file with the list of channel IDs:
  ```json
  [
    "UC_x5XG1OV2P6uZZ5FSM9Ttw",
    "UCJ0-OtVpF0wOKEqT2Z1HEtA"
  ]
  ```
#### Running the Scraper

```python
import os
from dotenv import load_dotenv
from graze import youtube

load_dotenv()
api_key = os.getenv('yt_key')

scraper = youtube(api_key=api_key, filepath='./output.txt')
scraper()
```
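
Conceptually, the caption pipeline lists a channel's uploads with the [YouTube Data API v3](https://developers.google.com/youtube/v3/docs) and pulls per-video transcripts with [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api), the two tools earlier versions of this project relied on. A minimal sketch of that flow, with illustrative helper names that are not part of graze's API:

```python
# Illustrative sketch only, not graze's actual code: requires
# google-api-python-client and youtube-transcript-api.
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi

def channel_video_ids(api_key, channel_id, max_results=50):
    yt = build('youtube', 'v3', developerKey=api_key)
    # Every channel exposes an "uploads" playlist listing all of its videos
    ch = yt.channels().list(part='contentDetails', id=channel_id).execute()
    uploads = ch['items'][0]['contentDetails']['relatedPlaylists']['uploads']
    items = yt.playlistItems().list(part='contentDetails', playlistId=uploads,
                                    maxResults=max_results).execute()
    return [it['contentDetails']['videoId'] for it in items['items']]

def video_captions(video_id):
    # Each transcript entry is a dict: {'text': ..., 'start': ..., 'duration': ...}
    entries = YouTubeTranscriptApi.get_transcript(video_id)
    return ' '.join(e['text'] for e in entries)
```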

### Wikipedia Scraper

The Wikipedia scraper generates target URLs from provided queries, fetches the complete web page, and writes it to a file.

#### Configuration
- Define your search queries in `queries.py`:
  ```python
  class WikiQueries:
      def __init__(self):
          self.search_queries = ["topic1", "topic2", "topic3"]

      def __call__(self):
          return self.search_queries
  ```

#### Running the Scraper

```python
from graze import wikipedia

wiki = wikipedia()
wiki(out_file='./output.txt')
```
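
Conceptually, each query maps to a Wikipedia article URL whose page content is fetched and written out. A rough sketch of that idea, assuming `requests` and `BeautifulSoup`; the details are illustrative, not graze's internals:

```python
# Rough sketch of the idea, not graze's internals: fetch each article page
# and keep only the paragraph text.
import requests
from bs4 import BeautifulSoup

def scrape_wiki_page(query):
    # Wikipedia article URLs use underscores in place of spaces
    url = f"https://en.wikipedia.org/wiki/{query.replace(' ', '_')}"
    resp = requests.get(url, headers={'User-Agent': 'web-graze-sketch'})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    return '\n'.join(p.get_text() for p in soup.find_all('p'))

with open('./output.txt', 'w', encoding='utf-8') as f:
    f.write(scrape_wiki_page('Antarctica'))
```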

### Britannica Scraper

The Britannica scraper fetches content based on search queries and writes it to a file.

#### Configuration
- Define your search queries in `queries.py`:
  ```python
  class BritannicaQueries:
      def __init__(self):
          self.search_queries = ["topic1", "topic2", "topic3"]

      def __call__(self):
          return self.search_queries
  ```

#### Running the Scraper

```python
from graze import britannica

scraper = britannica(max_limit=20)
scraper(out_file='./output.txt')
```
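
The same idea applies here: query Britannica's search page and collect the resulting article links, up to `max_limit`. A rough sketch assuming `requests` and `BeautifulSoup`; the URL parameters and CSS selector below are guesses for illustration, not graze's internals:

```python
# Rough sketch only: the search parameters and selector are assumptions.
import requests
from bs4 import BeautifulSoup

def britannica_article_links(query, max_limit=20):
    resp = requests.get('https://www.britannica.com/search',
                        params={'query': query},
                        headers={'User-Agent': 'web-graze-sketch'})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    hrefs = [a['href'] for a in soup.select('a[href^="/topic/"]')]
    return ['https://www.britannica.com' + h for h in hrefs[:max_limit]]

print(britannica_article_links('antarctica'))
```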

## Configuration

- **API Keys and other secrets:** Ensure that your API keys and other sensitive data are stored securely and not hard-coded into your scripts.

- **Search Queries:** The search queries for the Wikipedia and Britannica scrapers are defined in `queries.py`.

## Logging

The YouTube scraper logs errors to `youtube_fetch.log`. Check this file for detailed error messages and troubleshooting information.
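
If you want to reproduce or extend that behaviour, a plausible setup looks like this (assumed for illustration; graze's actual log format may differ):

```python
import logging

# Assumed setup for illustration; graze's actual log format may differ.
logging.basicConfig(filename='youtube_fetch.log', level=logging.ERROR,
                    format='%(asctime)s %(levelname)s %(message)s')
logging.error('caption fetch failed for video %s', 'VIDEO_ID')
```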

## Contribution
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.

Check out [CONTRIBUTING.md](https://github.com/shivendrra/web-graze/blob/main/CONTRIBUTING.md) for more details.

## License
This project is licensed under the MIT License.
`graze/__init__.py`

@@ -0,0 +1,3 @@
from .youtube.base import Youtube as youtube
from .britannica.main import Britannica as britannica
from .wikipedia.main import WikiScraper as wikipedia
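
These aliases are what make the package-root imports used throughout the README work:

```python
# With the aliases above, all three scrapers import from the package root
from graze import youtube, wikipedia, britannica
```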