Home
Welcome to URS - a comprehensive Reddit scraping command-line tool written in Python.
This wiki merely serves as a repository statistics page and an archive of all iterations of URS. It exists so I can see the evolution of my programming skills, and for anyone who is curious about how this repository has evolved since its inception.
I found this dope statistics tool called Star Chart and wanted to display it somewhere in this repository. It plots the repository's stars over time, which is such a cool feature and definitely something I am very interested in seeing.
Additionally, I found another statistics tool called Spark, which displays the GitHub star velocity of this repo over its entire lifetime.
I will also display the hit count. Maybe one day this repository will blow up again because of Reddit events such as the r/wallstreetbets fiasco that occurred in late January 2021.
I would love to revisit these statistics if something like that happens again, so consider the media above as future-proofing this wiki.
This is a table displaying the differences among the major iterations of URS.
| | v1.0.0 | v2.0.0 | v3.0.0 |
|---|---|---|---|
| CLI? | No | Yes | Yes |
| What Does It Scrape? | Subreddits Only | Subreddits Only | Subreddits, Redditors, Post Comments |
| Export Options | CSV | CSV | CSV, JSON |
| READMEs | README | README | README |
| Scraper | `reddit_scraper.py` | `scraper.py` | `scraper.py` |
| Requirements Text File | N/A | `requirements.txt` | `requirements.txt` |
Here I am listing additional changes that were built on top of v3.0.0. This is basically a modified version of the Releases document.
- User Interface
  - Analytical tools
    - Word frequencies generator.
    - Wordcloud generator.
- Source code
  - CLI
    - Flags
      - `-e` - Display additional example usage.
      - `--check` - Runs a quick check for PRAW credentials and displays the rate limit table after validation.
      - `--rules` - Include the Subreddit's rules in the scrape data (for JSON only). This data is included in the `subreddit_rules` field.
      - `-f` - Word frequencies generator.
      - `-wc` - Wordcloud generator.
      - `--nosave` - Only display the wordcloud; do not save to file.
    - Added additional verbose feedback if invalid arguments are given.
  - Log decorators
    - Added new decorator to log individual argument errors.
    - Added new decorator to log when no Reddit objects are left to scrape after failing validation check.
    - Added new decorator to log when an invalid file is passed into the analytical tools.
    - Added new decorator to log when the `scrapes` directory is missing, which would cause the new `make_analytics_directory()` method in `DirInit.py` to fail.
      - This decorator is also defined in the same file to avoid a circular import error.
  - ASCII art
    - Added new art for the word frequencies and wordcloud generators.
    - Added new error art displayed when a problem arises while exporting data.
    - Added new error art displayed when Reddit object validation is completed and there are no objects left to scrape.
    - Added new error art displayed when an invalid file is passed into the analytical tools.
- README
  - Added new Contact section and moved contact badges into it.
    - Apparently it was not obvious enough in previous versions since users did not send emails to the address specifically created for URS-related inquiries.
  - Added new sections for the analytical tools.
  - Updated demo GIFs
    - Moved all GIFs to a separate branch to avoid unnecessary clones.
    - Hosting static images on Imgur.
- Tests
  - Added additional tests for analytical tools.
- User interface
  - JSON is now the default export option. The `--csv` flag is required to export to CSV instead.
  - Improved JSON structure.
    - PRAW scraping export structure:
      - Scrape details are now included at the top of each exported file in the `scrape_details` field.
        - Subreddit scrapes - Includes `subreddit`, `category`, `n_results_or_keywords`, and `time_filter`.
        - Redditor scrapes - Includes `redditor` and `n_results`.
        - Submission comments scrapes - Includes `submission_title`, `n_results`, and `submission_url`.
      - Scrape data is now stored in the `data` field.
        - Subreddit scrapes - `data` is a list containing submission objects.
        - Redditor scrapes - `data` is an object containing additional nested dictionaries:
          - `information` - a dictionary denoting Redditor metadata.
          - `interactions` - a dictionary denoting Redditor interactions (submissions and/or comments). Each interaction follows the Subreddit scrapes structure.
        - Submission comments scrapes - `data` is a list containing additional nested dictionaries.
          - Raw comments contain dictionaries of `comment_id: SUBMISSION_METADATA`.
          - Structured comments follow the structure seen in raw comments, but include an extra `replies` field in the submission metadata, holding a list of additional nested dictionaries of `comment_id: SUBMISSION_METADATA`. This pattern repeats down to third-level replies.
    - Word frequencies export structure:
      - The original scrape data filepath is included in the `raw_file` field.
      - `data` is a dictionary containing `word: frequency` pairs.
  - Log:
    - `scrapes.log` is now named `urs.log`.
    - Validation of Reddit objects is now included - invalid Reddit objects will be logged as a warning.
    - Rate limit information is now included in the log.
- Source code
  - Moved PRAW scrapers into their own package.
  - Scrape settings for the basic Subreddit scraper are now cleaned within `Basic.py`, further streamlining conditionals in `Subreddit.py` and `Export.py`.
  - Returning the final scrape settings dictionary from all scrapers after execution for logging purposes, further streamlining the `LogPRAWScraper` class in `Logger.py`.
  - Passing the submission URL instead of the exception into the `not_found` list for submission comments scraping.
    - This is part of a bug fix listed in the Fixed section.
  - ASCII art:
    - Modified the args error art to display specific feedback when invalid arguments are passed.
  - Upgraded from relative to absolute imports.
  - Replaced old header comments with a docstring comment block.
  - Upgraded method comments to NumPy/SciPy docstring format.
- README
  - Moved the Releases section into its own document.
  - Deleted all media from the master branch.
- Tests
  - Updated absolute imports to match the new directory structure.
  - Updated a few tests to match new changes made in the source code.
- Community documents
  - Updated `PULL_REQUEST_TEMPLATE`:
    - Updated the section for listing changes to match the new Releases syntax.
    - Wrapped New Dependencies in a code block.
  - Updated `STYLE_GUIDE`:
    - Created new rules for method comments.
  - Added `Releases`:
    - Moved the Releases section from the main `README` to a separate document.
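The JSON export structure described above can be sketched in Python before serialization. This is a minimal illustration assembled from the field names listed in these notes; all values (titles, authors, filepaths, counts) are illustrative placeholders, not real URS output.

```python
import json

# Sketch of a Subreddit scrape export: `scrape_details` metadata at the top,
# followed by a `data` list of submission objects.
subreddit_scrape = {
    "scrape_details": {
        "subreddit": "askscience",          # placeholder Subreddit name
        "category": "top",
        "n_results_or_keywords": "10",
        "time_filter": "all",
    },
    "data": [
        {"title": "Example submission", "author": "example_user"},
    ],
}

# Sketch of a structured comment: same shape as a raw comment
# (`comment_id: SUBMISSION_METADATA`), plus a `replies` list that nests
# the same shape down to third-level replies.
structured_comment = {
    "t1_example_id": {
        "author": "another_user",
        "body": "Example comment",
        "replies": [],                      # nested comment dictionaries go here
    },
}

# Sketch of a word frequencies export: source filepath plus word counts.
word_frequencies = {
    "raw_file": "example-scrape.json",      # placeholder path to the original scrape
    "data": {"example": 42, "science": 17},
}

print(json.dumps(subreddit_scrape, indent=2))
```

Dumping with `json.dumps(..., indent=2)` mirrors how a pretty-printed export file would read.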
- Source code
  - PRAW scraper settings
    - Bug: Invalid Reddit objects (Subreddits, Redditors, or submissions) and their respective scrape settings would be added to the scrape settings dictionary even after failing validation.
    - Behavior: URS would try to scrape invalid Reddit objects, then throw an error mid-scrape because it is unable to pull data via PRAW.
    - Fix: Returning the invalid objects list from each scraper into `GetPRAWScrapeSettings.get_settings()` to circumvent this issue.
  - Basic Subreddit scraper
    - Bug: The time filter `all` would be applied to categories that do not support time filter use, resulting in errors while scraping.
    - Behavior: URS would throw an error when trying to export the file, resulting in a failed run.
    - Fix: Added a conditional to check if the category allows for a time filter, and applies either the `all` time filter or `None` accordingly.
- User interface
  - Removed the `--json` flag since it is now the default export option.
- User interface
  - Scrapes will now be exported to sub-folders within the date directory.
    - `comments`, `redditors`, and `subreddits` directories are now created for you when you run each scraper. Scrape results will now be stored within these directories.
- README
  - Added new Derivative Projects section.
- Source code
  - Minor code reformatting and refactoring.
  - The forbidden access message that may appear when running the Redditor scraper is now yellow to avoid confusion.
- Updated `README` and `STYLE_GUIDE`.
  - Uploaded new demo GIFs.
  - Made a minor change to the PRAW credentials guide.
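Given the sub-folder change described above, an export tree might look like the following sketch. The date-directory name is a placeholder; only the three sub-folder names come from these notes.

```
scrapes/
└── <date of scrape>/
    ├── comments/
    ├── redditors/
    └── subreddits/
```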
- User interface
  - Added time filters for Subreddit categories (Controversial, Top, Search).
- Source code
  - Changed how arguments are processed in the CLI.
  - Performed DRY code review.
- README
  - Updated `README` to reflect new changes.
- Community documents
  - Updated `STYLE_GUIDE`.
    - Made minor formatting changes to scripts to reflect new rules.
v3.1.0 - Major Code Refactor, Logging, Introducing the `scrapes` Directory, and Aesthetic Changes - June 22, 2020
- User interface
  - Scrapes will now be exported to the `scrapes/` directory within a subdirectory corresponding to the date of the scrape. These directories are automatically created for you when you run URS.
- Source code
  - Major code refactor. Applied OOP concepts to existing code and rewrote methods in an attempt to improve readability, maintenance, and scalability.
  - Added log decorators that record what is happening during each scrape, which scrapes were run, and any errors that might arise during runtime in the log file `scrapes.log`. The log is stored in the same subdirectory corresponding to the date of the scrape.
  - Added color to terminal output.
  - Integrated Travis CI and Codecov.
- Source code
  - Replaced bulky titles with minimalist titles for a cleaner look.
  - Improved naming convention for scripts.
- Community documents
  - Updated the following documents:
    - `BUG_REPORT`
    - `CONTRIBUTING`
    - `FEATURE_REQUEST`
    - `PULL_REQUEST_TEMPLATE`
    - `STYLE_GUIDE`
- README
  - Numerous changes; the most significant was splitting and storing walkthroughs in `docs/`.