Detecting Offensive Language in Tweets

This project aims to detect offensive language in tweets using ML Classification Algorithms. A training and predicting pipeline is implemented to contrast performance of various popular classification algorithms and determine the best suited model.

Data

Data is taken from two sources:

Hate Speech Twitter Annotations
- Publication: Z. Waseem and D. Hovy. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In NAACL SRW, pages 88–93, 2016
- Authors: Waseem, Zeerak and Hovy, Dirk
- GitHub Link: https://github.com/zeerakw/hatespeech
- Description: The dataset contains about 17,000 Tweet ID's labeled for racism and sexism. We downloaded this dataset and querried a Twitter API to scrape the actual tweets from Twitter. Retrieval of about tweets 5,900 failed either because the tweet was deleted or the account was deactivated.
Hate Speech and Offensive Language Detection
- Publication: Automated Hate Speech Detection and the Problem of Offensive Language
- Authors: Davidson, Thomas and Warmsley, Dana and Macy, Michael and Weber, Ingmar
- GitHub Link: https://github.com/t-davidson/hate-speech-and-offensive-language
- Description: The dataset has about 25,000 Tweets annotated by crowd sourcing. As per the number of users labeling the Tweets, each is put in one of three classes - hate speech, offensive language and neither. We downloaded the dataset in Python from GitHub as a csv file.
  The data from both sources are clubbed. Here is the distribution of the tweets into the two classes:

Code

The code is distributed into two Juypyter Notebooks which can be viewed in rendered format on the links:

cyberbullying_wrangling.ipynb | Link
cyberbullying_v2.ipynb | Link

Summary

After tuning hyper-parameters to optimize the algorithms, Stochastic Gradient Descent was found to be the best suited algorithm, taking both performance and time complexity into account. Following performance metrics were achievec:

Accuracy: 92.81 %
Precision: 96.97 %
Recall: 91.94 %
F1-Score: 94.39 %

License

The contents of this repository are covered under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.ipynb_checkpoints		.ipynb_checkpoints
data		data
LICENSE		LICENSE
README.md		README.md
algo_comparison.png		algo_comparison.png
algo_time.png		algo_time.png
cyberbullying_v2.html		cyberbullying_v2.html
cyberbullying_v2.ipynb		cyberbullying_v2.ipynb
cyberbullying_wrangling.html		cyberbullying_wrangling.html
cyberbullying_wrangling.ipynb		cyberbullying_wrangling.ipynb
distribution_of_tweets.png		distribution_of_tweets.png
labeled_tweets.csv		labeled_tweets.csv
public_data_labeled.csv		public_data_labeled.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Detecting Offensive Language in Tweets

Data

Code

Summary

License

About

Releases

Packages

Languages

License

dhavalpotdar/detecting-offensive-language-in-tweets

Folders and files

Latest commit

History

Repository files navigation

Detecting Offensive Language in Tweets

Data

Code

Summary

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages