A small Python-based project to provide an all-in-one tool to help archive social media data.

thezoid/social-scraper

social-scraper

A quick Python project for downloading content from specified social sites.

DISCLAIMER: This project is to be used only for archival purposes.

Requirements

Requires instaloader, youtube_dl, praw, wget, and tweepy. praw requires additional setup; however, youtube_dl and instaloader work out of the box.

How to register app for praw

Twitter access also requires a registered developer app.

 pip install instaloader
 pip install youtube_dl
 pip install praw
 pip install wget
 pip install tweepy
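
To confirm that the installs above succeeded before running the scraper, a quick check along these lines may help. This is only a sketch: it assumes each package's import name matches its pip name, and `find_missing` is a hypothetical helper, not part of the project.

```python
import importlib.util

# Assumption: import names are identical to the pip package names listed above.
REQUIRED_MODULES = ["instaloader", "youtube_dl", "praw", "wget", "tweepy"]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All requirements are installed.")
```

Any name it reports can be installed with the matching `pip install` line.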

How to Use

  1. Make sure you have installed all the listed requirements above
  2. Customize settings.json with your own information. Use the table below if you are unsure which values to use.
  3. Run scraper.py through your favorite method
    • NOTE: It is recommended to run this through the command line to more easily observe any output that may come up

Customization

Before putting the bot to work, you need to configure settings.json so that the script will function correctly. Be sure not to commit or otherwise save your sensitive information in a public place (keys, secrets, etc.).
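
As a rough illustration, a filled-in settings.json could look like the output of the sketch below. The key names and defaults come from this section; the flat layout, the placeholder handles, the `./archive` destination, and the `missing_credentials` helper are assumptions made for the example, so compare against the settings.json shipped with the repository before relying on it.

```python
import json

# Hypothetical example values. The flat layout is an assumption, not the
# project's confirmed schema; key names and defaults follow this README.
EXAMPLE_SETTINGS = {
    "subRedditSkip": False,
    "redditorSkip": False,
    "instaSkip": False,
    "twitSkip": False,
    "red_agentName": "",      # fill in locally, never commit
    "red_clientID": "",       # fill in locally, never commit
    "red_clientSecret": "",   # fill in locally, never commit
    "twit_consKey": "",       # fill in locally, never commit
    "twit_consSec": "",       # fill in locally, never commit
    "twit_bearerTok": "",     # fill in locally, never commit
    "destination": "./archive",
    "loggingLevel": 1,
    "imageDomains": ["i.redd.it", "i.imgur.com"],
    "gifDomains": ["gfycat.com"],
    "videoDomains": ["v.redd.it", "gfycat.com"],
    "scrapeList": ["r/subredditName", "u/redditorName"],
    "instaList": ["exampleInstagramHandle"],
    "twitterList": ["exampleTwitterHandle"],
}

REDDIT_KEYS = ["red_agentName", "red_clientID", "red_clientSecret"]
TWITTER_KEYS = ["twit_consKey", "twit_consSec", "twit_bearerTok"]


def missing_credentials(settings):
    """List credential keys that are empty for services that are not skipped."""
    needed = []
    # Reddit credentials matter unless both Reddit scrapes are skipped.
    if not (settings.get("subRedditSkip") and settings.get("redditorSkip")):
        needed += REDDIT_KEYS
    if not settings.get("twitSkip"):
        needed += TWITTER_KEYS
    return [key for key in needed if not settings.get(key)]


if __name__ == "__main__":
    print(json.dumps(EXAMPLE_SETTINGS, indent=2))
    print("Still to fill in:", missing_credentials(EXAMPLE_SETTINGS))
```

Running a check like `missing_credentials` before a scrape is one way to catch an empty key early rather than partway through a run.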

| Key | Description | Default |
| --- | --- | --- |
| subRedditSkip | Whether to skip processing subreddits | false |
| redditorSkip | Whether to skip processing Redditors | false |
| instaSkip | Whether to skip processing Instagram | false |
| twitSkip | Whether to skip processing Twitter | false |
| red_agentName* | Your Reddit app agent name | N/A |
| red_clientID* | Your Reddit app client ID | N/A |
| red_clientSecret* | Your Reddit app client secret | N/A |
| twit_consKey* | Your Twitter developer app consumer API key | N/A |
| twit_consSec* | Your Twitter developer app consumer secret | N/A |
| twit_bearerTok* | Your Twitter developer app bearer token | N/A |
| destination | The location to write scraped content to | N/A |
| loggingLevel | The logging level of the script: 0 = SILENT, 1 = ERROR, 2 = WARNING, 3 = INFO | 1 |
| imageDomains | A list of allowed image domains to download from | "i.redd.it","i.imgur.com" |
| gifDomains | A list of allowed GIF domains to download from | "gfycat.com" |
| videoDomains | A list of allowed video domains to download from | "v.redd.it","gfycat.com" |
| scrapeList | A list of Reddit users or subreddits to scrape media from. Each entry must be written as r/subredditName or u/redditorName | N/A |
| instaList | A list of Instagram handles to scrape media from | N/A |
| twitterList | A list of Twitter handles to scrape media from | N/A |

*These keys hold credentials. If you fill them in, do not commit your settings.json! I take no responsibility for any data that leaks through your commits.
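
The scrapeList format and the domain allow-lists from the table can be read along the lines of the sketch below. Both helpers are hypothetical and are not taken from scraper.py. In particular, the exact-hostname comparison in `is_allowed` is an assumption, and the project may match domains differently.

```python
from urllib.parse import urlparse


def split_scrape_list(entries):
    """Split r/name and u/name entries into (subreddits, redditors)."""
    subreddits, redditors = [], []
    for entry in entries:
        prefix, _, name = entry.partition("/")
        if prefix == "r" and name:
            subreddits.append(name)
        elif prefix == "u" and name:
            redditors.append(name)
        else:
            raise ValueError(f"Expected r/name or u/name, got {entry!r}")
    return subreddits, redditors


def is_allowed(url, allowed_domains):
    """Return True when the URL's hostname is exactly one of the allowed domains."""
    host = urlparse(url).hostname or ""
    return host in allowed_domains


if __name__ == "__main__":
    print(split_scrape_list(["r/subredditName", "u/redditorName"]))
    print(is_allowed("https://i.redd.it/example.jpg", ["i.redd.it", "i.imgur.com"]))
```

Rejecting malformed entries up front makes a typo in scrapeList show up as a clear error instead of a scrape that quietly does nothing.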

Known restrictions

  • When doing large scrapes of public Instagram profiles, you will be rate limited to pulling 12 images per profile
    • Instaloader does this as a precaution
    • Support will not be offered for this and developer(s) will not work around it
  • Some posts will fail to download due to content no longer existing
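
Since some downloads are expected to fail, one common way to handle them is to record the failure and move on rather than abort the whole run. The sketch below shows that pattern using only the standard library. It is not the project's actual download code, and `fetch_bytes` and `download_all` are hypothetical names.

```python
import urllib.error
import urllib.request


def fetch_bytes(url):
    """Fetch a URL and return its body as bytes (raises on HTTP or network errors)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, fetch=fetch_bytes):
    """Try every URL; collect failures instead of stopping at the first one."""
    downloaded, failed = {}, []
    for url in urls:
        try:
            downloaded[url] = fetch(url)
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            failed.append((url, str(err)))
    return downloaded, failed
```

The returned `failed` list can be logged at the end of a run so removed posts are visible without interrupting the archive.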

Support

Join my Discord and head to the Programmer's Parlor. #code-talk can be used to discuss this project, and code in general. Assistance may be provided on a case-by-case basis; however, no official or 24/7 support will be provided. Do not ping mods or admins for help with code.