Scraping Tweets from Twitter using Python, Kafka and MongoDB

Bhavan-Naik/Twitter_Scraping


Twitter Data Pipeline


Requirements and References:

Apache Kafka 
Installation guide: https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-20-04
Version used: https://archive.apache.org/dist/kafka/2.1.1/kafka_2.11-2.1.1.tgz (Kafka 2.1.1, built for Scala 2.11)

twint
https://github.com/twintproject/twint

CMAK 
https://github.com/yahoo/CMAK

MongoDB 
https://linuxhint.com/install_mongodb_ubuntu_20_04/

MongoDB-Compass 
https://docs.mongodb.com/compass/current/install/

Java 11+

Python 3.6+

Run twitter_shell.sh to install the basic packages, including twint.

Execution steps:

Step 1:

Start Kafka and MongoDB, and check that both services are running:

$ sudo systemctl start kafka

$ sudo systemctl status kafka

$ sudo systemctl start mongodb

$ sudo systemctl status mongodb
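
As a complement to systemctl, the sketch below checks from Python whether the services accept TCP connections. It is only an illustration: the ports are assumptions (2181 and 9092 are the ZooKeeper and Kafka addresses used later in this README, 27017 is MongoDB's default port), and the function names are hypothetical.

```python
import socket

# Assumed ports: ZooKeeper 2181 and Kafka 9092 as used later in this README,
# MongoDB 27017 (its default). Adjust if your setup differs.
SERVICES = {"zookeeper": 2181, "kafka": 9092, "mongodb": 27017}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_services(host="localhost", services=SERVICES):
    """Map each service name to True/False depending on reachability."""
    return {name: is_port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'reachable' if up else 'NOT reachable'}")
```

An open port only shows that something is listening; systemctl status remains the authoritative check for the service state.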

Step 2:

Open a first terminal.

Navigate to your CMAK directory and run the following commands:

$ cd target/universal/cmak-3.0.0.5

$ bin/cmak -java-home /usr/lib/jvm/java-11-openjdk-amd64/

Step 3:

Open a second terminal.

Navigate to your Kafka home directory and open the ZooKeeper shell:

$ bin/zookeeper-shell.sh localhost:2181

Once the ZooKeeper shell is connected and waiting for input, enter the following commands at its prompt:

$ ls /kafka-manager

$ create /kafka-manager/mutex ""

$ create /kafka-manager/mutex/locks ""

$ create /kafka-manager/mutex/leases ""

In a web browser, open localhost:9000 and add a cluster with the following details, then save: name: any name; host: localhost:2181; Kafka version: 2.1.1.
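
The three create commands in this step could also be scripted. The sketch below is an assumption, not part of this repository: it uses the kazoo client library (which the README does not list as a requirement), and the function name is hypothetical. It creates the same three znodes with empty data if they do not exist yet.

```python
# The znodes that Step 3 creates by hand in the ZooKeeper shell.
MUTEX_PATHS = [
    "/kafka-manager/mutex",
    "/kafka-manager/mutex/locks",
    "/kafka-manager/mutex/leases",
]


def ensure_cmak_paths(hosts="localhost:2181", paths=MUTEX_PATHS):
    """Create each path in ZooKeeper if it is missing (hypothetical helper)."""
    # Imported here so the module loads even when kazoo is not installed.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        for path in paths:
            # ensure_path creates the node (and any missing parents)
            # and does nothing if the node already exists.
            zk.ensure_path(path)
    finally:
        zk.stop()
        zk.close()
```

Calling ensure_cmak_paths() while ZooKeeper is running on localhost:2181 should leave the same nodes in place as the manual commands; running it twice is harmless.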

Step 4:

Open a third terminal.

Navigate to your Kafka home directory and create a topic:

$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic_name

Navigate to your twitter folder and start the producer (its console output is discarded):

$ python3 producer_filename.py --broker-list localhost:9092 --topic topic_name > /dev/null
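
For orientation, here is a minimal sketch of what a producer script with this command line might look like. It is not the repository's actual producer: it assumes the kafka-python package for the Kafka client, and the --search and --limit options, as well as all function names, are hypothetical additions for illustration.

```python
import argparse
import json


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Send scraped tweets to a Kafka topic.")
    # These two options mirror the command shown in this README.
    parser.add_argument("--broker-list", required=True, help="e.g. localhost:9092")
    parser.add_argument("--topic", required=True)
    # Hypothetical options, not taken from the README.
    parser.add_argument("--search", default="kafka", help="search term passed to twint")
    parser.add_argument("--limit", type=int, default=100, help="number of tweets to request")
    return parser.parse_args(argv)


def serialize(record):
    """Encode a tweet dict as UTF-8 JSON bytes; non-JSON values become strings."""
    return json.dumps(record, ensure_ascii=False, default=str).encode("utf-8")


def scrape(search, limit):
    """Collect tweets with twint and return them as plain dicts."""
    import twint  # imported lazily so the helpers above work without twint

    config = twint.Config()
    config.Search = search
    config.Limit = limit
    config.Store_object = True   # keep results in twint.output.tweets_list
    config.Hide_output = True    # do not print every tweet to the console
    twint.run.Search(config)
    return [vars(tweet) for tweet in twint.output.tweets_list]


def main(argv=None):
    from kafka import KafkaProducer  # assumption: kafka-python is installed

    args = parse_args(argv)
    producer = KafkaProducer(
        bootstrap_servers=args.broker_list.split(","),
        value_serializer=serialize,
    )
    for tweet in scrape(args.search, args.limit):
        producer.send(args.topic, tweet)
    producer.flush()  # block until all buffered messages are sent
    producer.close()
```

To use the sketch as a script, call main() at the bottom of the file. Serializing each tweet as JSON keeps the messages readable in CMAK and lets a consumer store them in MongoDB without further conversion.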

Step 5:

Open a fourth terminal and start MongoDB Compass:

$ mongodb-compass

Connect to your particular database.

Open a fifth terminal, navigate to the twitter folder, and start the consumer:

$ python3 consumer_filename.py --bootstrap-server localhost:9092 --topic topic_name --from-beginning
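
A consumer with this command line could be sketched as follows. This is an illustration under stated assumptions rather than the repository's actual script: it assumes the kafka-python and pymongo packages, JSON-encoded messages, and hypothetical --mongo-uri, --database and --collection options with made-up defaults.

```python
import argparse
import json


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Store tweets from Kafka in MongoDB.")
    # These three options mirror the command shown in this README.
    parser.add_argument("--bootstrap-server", required=True, help="e.g. localhost:9092")
    parser.add_argument("--topic", required=True)
    parser.add_argument("--from-beginning", action="store_true")
    # Hypothetical options and defaults, not taken from the README.
    parser.add_argument("--mongo-uri", default="mongodb://localhost:27017")
    parser.add_argument("--database", default="twitter")
    parser.add_argument("--collection", default="tweets")
    return parser.parse_args(argv)


def offset_reset(from_beginning):
    """Translate --from-beginning into kafka-python's auto_offset_reset value."""
    return "earliest" if from_beginning else "latest"


def deserialize(raw):
    """Decode UTF-8 JSON bytes back into a dict (assumes JSON messages)."""
    return json.loads(raw.decode("utf-8"))


def main(argv=None):
    # Imported lazily so the helpers above work without these packages.
    from kafka import KafkaConsumer
    from pymongo import MongoClient

    args = parse_args(argv)
    collection = MongoClient(args.mongo_uri)[args.database][args.collection]
    consumer = KafkaConsumer(
        args.topic,
        bootstrap_servers=args.bootstrap_server.split(","),
        auto_offset_reset=offset_reset(args.from_beginning),
        value_deserializer=deserialize,
    )
    stored = 0
    try:
        for message in consumer:  # blocks and waits for new messages
            collection.insert_one(message.value)
            stored += 1
    except KeyboardInterrupt:
        pass  # Ctrl+C is the expected way to stop the loop
    finally:
        consumer.close()
        print(f"Stored {stored} tweets")
```

To use the sketch as a script, call main() at the bottom of the file. The loop never ends on its own because the consumer keeps waiting for new messages, which is why it is stopped manually.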

"Ctrl+C" after all the tweets have been consumed by the consumer.
