A Docker pipeline that uses SQL and NoSQL databases, together with dedicated Python libraries, to browse the most recent tweets from The New York Times and run some NLP on them.

fra-mari/NYTimes_Docker_Pipeline

The New York Times Docker Pipeline


Running on Telegram @NYTtopic


This code maintains a simple Telegram bot which collects fresh updates from the Twitter account of The New York Times and allows the user to look for recent articles on topics of their choice.
Hosted on Amazon EC2, the NYTtopic Bot consists of a pipeline of Docker containers:

➤ the first container runs a Python module which uses Tweepy to access The New York Times's profile via the Twitter API, creating a stream of tweets and storing them in a Mongo database (second container);

➤ the third container carries out ETL tasks. It uses spaCy to perform named-entity recognition (NER) on the text of each tweet extracted from MongoDB. The recognized entities are then formatted as #hashtags, and all the data are finally stored in a PostgreSQL database (fourth container);

➤ the fifth container feeds all the data into the Telegram bot, which is controlled and kept online using a library called Python Telegram Bot;

➤ the sixth and last container runs once per week, removing the records older than a year from both databases, so as to prevent them from growing too large.
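The hashtag-formatting step of the ETL container can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes the entity texts have already been extracted by spaCy, and the helper names `to_hashtag` and `entities_to_hashtags` are hypothetical.

```python
import re


def to_hashtag(entity_text):
    """Turn an entity string such as 'New York' into '#NewYork'.

    Hypothetical helper: the real pipeline may format tags differently.
    Returns None when the text contains no letters or digits.
    """
    words = re.findall(r"[A-Za-z0-9]+", entity_text)
    if not words:
        return None
    # Capitalize the first letter of each word, keep the rest untouched.
    return "#" + "".join(word[0].upper() + word[1:] for word in words)


def entities_to_hashtags(entity_texts):
    """Format a list of entity strings as hashtags, dropping duplicates
    while preserving the order of first appearance."""
    hashtags = []
    for text in entity_texts:
        tag = to_hashtag(text)
        if tag is not None and tag not in hashtags:
            hashtags.append(tag)
    return hashtags
```

With spaCy, the input list would typically come from something like `[ent.text for ent in doc.ents]`, where `doc` is the processed tweet text.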

I hope this bot will be useful anytime you are looking for high quality information.



Used Technology



Guest Star


Instructions For Using This Code Locally

📌  STEP 1: Obtain credentials for the Twitter API and the Telegram Bot API

  • Create accounts on Twitter and Telegram if you do not already have them.

  • Four authentication keys are needed to access Twitter's Streaming API: API Key, API Secret, Access Token and Access Token Secret:

    • You can obtain them by registering an application on apps.twitter.com.
    • Once in possession of the access keys, store them locally as environment variables with the following names: API_KEY, API_SECRET, ACCESS_TOKEN, SECRET_ACCESS_TOKEN.
  • Authentication to the Telegram Bot API is comparatively easier, as you only need one access token:

    • To generate it, you have to chat with BotFather on Telegram (no kidding!) and follow a few simple steps (to prevent overlapping, please make sure you do not choose NYTtopic as a name for your bot 🙏🏻 ).
    • Once again, store the token as an environment variable. Call it TOKEN_TELEGRAM.
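Before starting the containers, it can help to check that all five variables are actually set. The snippet below is a small sketch of such a check; the function name `load_credentials` is hypothetical and not part of this repository, while the variable names are the ones listed above.

```python
import os

# Environment variable names given in STEP 1.
REQUIRED_VARS = (
    "API_KEY",
    "API_SECRET",
    "ACCESS_TOKEN",
    "SECRET_ACCESS_TOKEN",
    "TOKEN_TELEGRAM",
)


def load_credentials(env=None):
    """Return the required credentials as a dict.

    Raises RuntimeError naming every variable that is missing or empty.
    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_credentials()` in a Python shell gives a quick answer as to which variables, if any, still need to be exported.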

📌  STEP 2: Run the pipeline with Docker

  • Clone this repository and install Docker if needed.
  • Go into the folder NYTopic_twitter_to_telegram:
    • run docker-compose build and wait for Docker to set up everything for you;
    • run docker-compose up. The bot should start responding within a few seconds.
  • Open a Telegram chat with your new bot and start browsing The New York Times!

To Do

  • Add a container for removing old records from Mongo and Postgres.
  • Provide the user with links to similar content in other newspapers.
  • Make hashtag-based queries possible, so as to return all the available articles related to a precise topic in a single message.
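For the first item, the one-year retention rule mentioned in the pipeline description can be sketched in plain Python. This is only an illustration under stated assumptions: the field name `created_at` and the helpers `cutoff` and `keep_recent` are hypothetical, and a real cleanup container would issue the equivalent delete against Mongo and Postgres rather than filter records in memory.

```python
from datetime import datetime, timedelta, timezone

# Retention window: records older than this are candidates for removal.
RETENTION = timedelta(days=365)


def cutoff(now=None):
    """Return the oldest timestamp that should still be kept."""
    now = datetime.now(timezone.utc) if now is None else now
    return now - RETENTION


def keep_recent(records, now=None):
    """Return only the records whose 'created_at' (assumed field name)
    is not older than the retention window."""
    limit = cutoff(now)
    return [record for record in records if record["created_at"] >= limit]
```

The same cutoff value could then be passed as a parameter to the delete statements of both databases, so that they stay in sync.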
