This repository houses all of my Apache Spark projects. They were all completed using Databricks.
## Natural Language Processing
All NLP projects were completed using John Snow Labs' open-source Spark NLP library. You can find more information here: https://www.johnsnowlabs.com/.
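Most of the projects in the tables below share the same three-stage Spark NLP pipeline: a DocumentAssembler, a sentence embedder such as the Universal Sentence Encoder, and a ClassifierDLApproach trained on the labels. A minimal sketch of that pattern (the training DataFrame and its columns are illustrative, not taken from the repository):

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Illustrative training data; the real projects load full datasets.
trainDF = spark.createDataFrame(
    [("you won't believe what happened next", "clickbait"),
     ("senate passes annual budget bill", "news")],
    ["text", "label"])

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label") \
    .setMaxEpochs(10)

pipeline = Pipeline(stages=[document, use, classifier])
model = pipeline.fit(trainDF)
predictions = model.transform(trainDF)  # in practice, a held-out test split
```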
Project Name | Sentence Embedder/Encoder | Transformer Used | Accuracy | Macro Precision | Macro Recall | Macro F1-Score |
---|---|---|---|---|---|---|
Clickbait Classification (Part 1) | Universal Sentence Encoder | Classifier DL Approach | 0.97 | 0.97 | 0.97 | 0.97 |
Clickbait Classification (Part 2) | Regular Built-In Tokenizer | BERT Sequence Classifier | 1.0 | 1.0 | 1.0 | 1.0 |
Clickbait Classification (Part 3) | BERT Sequence Classifier | Classifier DL Approach | 0.98 | 0.98 | 0.98 | 0.98 |
Is There Depression in This Reddit Post? | Universal Sentence Encoder | Classifier DL Approach | 0.97 | 0.97 | 0.97 | 0.97 |
Onion Or Not | Universal Sentence Encoder | Classifier DL Approach | 0.87 | 0.87 | 0.87 | 0.87 |
Onion Or Not with Extra Stages[^1] | Universal Sentence Encoder | Classifier DL Approach | 0.86 | 0.86 | 0.86 | 0.86 |
Real vs Fake News (Pretrained Model)[^2] | Universal Sentence Encoder | Classifier DL Model[^3] | 0.608 | 0.605 | 0.610 | 0.608 |
Real vs Fake News (Deep Learning Approach)[^2] | Universal Sentence Encoder | Classifier DL Approach | 0.975 | 0.975 | 0.975 | 0.975 |
Sarcasm Detection | Universal Sentence Encoder | Classifier DL Approach | 0.89 | 0.89 | 0.89 | 0.89 |
Spam Filter | Universal Sentence Encoder | Classifier DL Approach | 0.98 | 0.97 | 0.96 | 0.96 |
Project Name | Sentence Embedder/Encoder | Transformer Used | Accuracy | Macro Precision | Macro Recall | Macro F1-Score |
---|---|---|---|---|---|---|
CNN News Articles | Sentence Embeddings[^4] | Classifier Deep Learning Approach | 0.72 | 0.87 | 0.47 | 0.55 |
CNN News Articles v2 | Sentence Embeddings[^4] | Classifier Deep Learning Approach | 0.75 | 0.83 | 0.47 | 0.55 |
Cancer Classification (After Removing Class Imbalance)[^2] | Universal Sentence Encoder | Classifier Deep Learning Approach | 0.854 | 0.853 | 0.862 | 0.854 |
Cancer Classification (Without Removing Class Imbalance)[^2] | Universal Sentence Encoder | Classifier Deep Learning Approach | 0.863 | 0.862 | 0.862 | 0.863 |
Cyberbullying Classification | Universal Sentence Encoder | Classifier Deep Learning Approach | 0.82 | 0.81 | 0.82 | 0.81 |
Ford Sentence Classification | Universal Sentence Encoder | Classifier Deep Learning Approach | 0.74 | 0.73 | 0.73 | 0.73 |
IMDb Genres | Sentence Embeddings[^4] | Classifier Deep Learning Approach | 0.66 | 0.66 | 0.66 | 0.65 |
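Footnote 4 notes that the "Sentence Embeddings" rows pair token-level DistilBERT vectors with a pooling SentenceEmbeddings stage before the classifier. A minimal sketch of that front end, with illustrative column names:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (Tokenizer, DistilBertEmbeddings,
                                SentenceEmbeddings, ClassifierDLApproach)
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Token-level DistilBERT vectors...
wordEmbeddings = DistilBertEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# ...pooled into one vector per document for the classifier.
sentenceEmbeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label")

# Fit and transform exactly as in the first sketch.
pipeline = Pipeline(stages=[document, tokenizer, wordEmbeddings,
                            sentenceEmbeddings, classifier])
```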
Project Name | Sentence Embedder/Encoder | Transformer Used | Accuracy | Micro Precision | Micro Recall | Micro F1-Score | Subset Accuracy | Hamming Loss |
---|---|---|---|---|---|---|---|---|
GoEmotions | Universal Sentence Encoder | Multi Classifier DL Approach | 0.934 | 0.965 | 0.965 | 0.965 | 0.125 | 0.973 |
Research Articles | Universal Sentence Encoder | Multi Classifier DL Approach | 0.941 | 0.964 | 0.964 | 0.964 | 0.792 | 0.179 |
uHack Reviews | Universal Sentence Encoder | Multi Classifier DL Approach | 0.913 | 0.951 | 0.952 | 0.952 | 0.389 | 0.581 |
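The micro-averaged metrics, subset accuracy, and Hamming loss in the multi-label table above can be produced with Spark MLlib's MultilabelMetrics, which is one way to score multi-label predictions (I can't confirm it is exactly how these projects did it); the tiny RDD below is illustrative:

```python
# MultilabelMetrics takes (predicted labels, true labels) pairs,
# each given as a list of label indices (doubles).
from pyspark.sql import SparkSession
from pyspark.mllib.evaluation import MultilabelMetrics

spark = SparkSession.builder.getOrCreate()
predictionAndLabels = spark.sparkContext.parallelize([
    ([0.0, 1.0], [0.0, 2.0]),
    ([0.0],      [0.0]),
    ([2.0],      [1.0, 2.0]),
])

metrics = MultilabelMetrics(predictionAndLabels)
print("Accuracy:       ", metrics.accuracy)
print("Micro precision:", metrics.microPrecision)
print("Micro recall:   ", metrics.microRecall)
print("Micro F1:       ", metrics.microF1Measure)
print("Subset accuracy:", metrics.subsetAccuracy)
print("Hamming loss:   ", metrics.hammingLoss)
```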
Project Name | Transformer Used | Accuracy | F1 | Weighted Precision | Weighted Recall |
---|---|---|---|---|---|
All Languages | Language Detector DL[^5] | 0.980 | 0.986 | 0.991 | 0.980 |
Top 5 Languages | Language Detector DL | 0.990 | 0.992 | 0.995 | 0.990 |
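A sketch of how the pretrained language detector can be applied; the model name comes from the footnote, while the sample DataFrame is illustrative:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import LanguageDetectorDL
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# The pretrained model named in the footnote below.
languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_95", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("language")

pipeline = Pipeline(stages=[document, languageDetector])
data = spark.createDataFrame([("Bonjour tout le monde",)], ["text"])
pipeline.fit(data).transform(data).select("language.result").show()
```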
At the time I completed the Machine Translation projects, I was unable to find Rouge metric code for Apache Spark. I have since used the Rouge metric in my text summarization projects, and I encourage you to view those.
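Spark itself still has no built-in ROUGE implementation, so one approach (an assumption about how the later projects score summaries, not a confirmed detail) is to collect predictions and references to the driver and score them with Hugging Face's evaluate library:

```python
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")
predictions = ["the cat sat on the mat"]          # illustrative strings; in
references  = ["the cat was sitting on the mat"]  # practice, collected from
                                                  # a Spark DataFrame
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum
```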
Project Name | Sentence Embedder/Encoder | Transformer Used | Accuracy | Macro Precision | Macro Recall | Macro F1-Score |
---|---|---|---|---|---|---|
Sentiment Analysis of Reviews | Universal Sentence Encoder | Sentiment DL Model[^6] | 0.81 | 0.55 | 0.87 | 0.55 |
Sentiment Analysis of Nearly 600,000 Tweets[^2] | Universal Sentence Encoder | Classifier DL Approach | 0.749 | 0.748 | 0.749 | 0.748 |
Twitter Sentiment Analysis | Universal Sentence Encoder | Sentiment DL Model[^6] | 0.50 | 0.45 | 0.50 | 0.42 |
Project Name | Transformer Used | Rouge1 | Rouge2 | RougeL | RougeLsum |
---|---|---|---|---|---|
CNN News Articles | T5 Transformer (`t5_small`) | 36.4 | 22.2 | 30.7 | 30.7 |
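A sketch of the summarization setup above; `t5_small` is named in the table, while the input text and output length are illustrative:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import T5Transformer
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# T5 is task-prefixed; "summarize:" tells the model to summarize its input.
t5 = T5Transformer.pretrained("t5_small") \
    .setTask("summarize:") \
    .setInputCols(["document"]) \
    .setOutputCol("summary") \
    .setMaxOutputLength(128)

pipeline = Pipeline(stages=[document, t5])
articles = spark.createDataFrame([("<long news article here>",)], ["text"])
pipeline.fit(articles).transform(articles) \
    .select("summary.result").show(truncate=False)
```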
## Structured Data
Project Name | Classifier Used | Accuracy | Macro F1 | Macro Precision | Macro Recall | Best Algorithm |
---|---|---|---|---|---|---|
Banking Campaign[^2] | GBTClassifier (Gradient-Boosted Tree) | 0.885 | 0.885 | 0.887 | 0.885 | - |
Car Insurance Claim Predictor[^2] | MultilayerPerceptronClassifier | 0.502 | 0.335 | 0.252 | 0.502 | - |
Car Insurance Claim Predictor (Class Imbalance Removed)[^2] | RandomForestClassifier | 0.627 | 0.626 | 0.628 | 0.627 | - |
Car Insurance Claim Predictor[^2] | RandomForestClassifier | 0.935 | 0.904 | 0.875 | 0.935 | - |
Diabetes Health Indicators (v1) | - | 0.90 | 0.90 | 0.90 | 0.90 | GBTClassifier |
Diabetes Health Indicators (v2) | - | 0.90 | 0.90 | 0.90 | 0.90 | GBTClassifier |
Project Name | Accuracy | Macro F1 | Macro Precision | Macro Recall | Best Algorithm |
---|---|---|---|---|---|
Mobile Phone Price Classification (v1)[^7] | 0.88 | 0.88 | 0.89 | 0.89 | DecisionTreeClassifier |
Mobile Phone Price Classification (v2)[^7] | 0.88 | 0.88 | 0.89 | 0.89 | DecisionTreeClassifier |
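The structured-data classifiers above follow the standard Spark ML pattern: assemble feature columns into a vector, fit a classifier, and evaluate. Note that Spark's MulticlassClassificationEvaluator reports weighted rather than macro averages, so the macro figures in these tables would have required per-label metrics averaged manually or an external library. A minimal sketch with illustrative columns and data:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()

# Illustrative rows; GBTClassifier in Spark ML supports binary labels only.
df = spark.createDataFrame(
    [(35.0, 1200.0, 0.0), (52.0, 300.0, 1.0),
     (29.0,  880.0, 0.0), (61.0, 150.0, 1.0)],
    ["age", "balance", "label"])

assembler = VectorAssembler(inputCols=["age", "balance"], outputCol="features")
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=20)
model = Pipeline(stages=[assembler, gbt]).fit(df)

predictions = model.transform(df)  # in practice, a held-out test split
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction")
for metric in ["accuracy", "f1", "weightedPrecision", "weightedRecall"]:
    print(metric, evaluator.setMetricName(metric).evaluate(predictions))
```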
Project Name | Algorithm Used | Root Mean Squared Error (RMSE) |
---|---|---|
AI & ML Salaries Predictor | GBTRegressor | $58,932.71 |
Absenteeism at Work | GBTRegressor | 0.789 hours |
Email Click-Through Rate Predictor | GBTRegressor | 0.044 |
Data-Related Salaries Predictor | GBTRegressor | $57,073 |
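The regression projects swap in GBTRegressor and RegressionEvaluator; a minimal sketch with illustrative features and data:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(2.0, 90000.0), (5.0, 120000.0), (1.0, 70000.0), (8.0, 150000.0)],
    ["years_experience", "salary"])

assembler = VectorAssembler(inputCols=["years_experience"], outputCol="features")
gbt = GBTRegressor(labelCol="salary", featuresCol="features", maxIter=20)
model = Pipeline(stages=[assembler, gbt]).fit(df)

predictions = model.transform(df)  # in practice, a held-out test split
rmse = RegressionEvaluator(labelCol="salary", predictionCol="prediction",
                           metricName="rmse").evaluate(predictions)
print(f"RMSE: {rmse:.2f}")
```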
## Computer Vision/Image Classification
Project Name | Pretrained Model or Untuned Checkpoint | Accuracy | F1 Score | Weighted Precision | Weighted Recall |
---|---|---|---|---|---|
Is It a Cat or a Dog? | Untuned Checkpoint | 0.975 | 0.982229 | 0.990196 | 0.975 |
Is It a Cat or a Dog? | Pretrained Model | 0.995 | 0.995 | 0.99505 | 0.995 |
Planes, Cars, & Boats[^2] | Untuned Checkpoint | 0.89 | 0.935331 | 0.994898 | 0.89 |
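The section above does not say which library the image classifiers used; assuming they follow the Spark NLP pattern from the rest of the repo, a pretrained run might look like the sketch below (the default model, path, and columns are assumptions):

```python
import sparknlp
from sparknlp.base import ImageAssembler
from sparknlp.annotator import ViTForImageClassification
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Illustrative path; Spark's built-in image data source reads the files.
images = spark.read.format("image").load("path/to/pet_images")

imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

# Downloads Spark NLP's default pretrained ViT image classifier.
classifier = ViTForImageClassification.pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[imageAssembler, classifier])
pipeline.fit(images).transform(images).select("class.result").show()
```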
## All Projects in Scala
Project Name | Transformer Used | Accuracy | F1 Score | Precision | Recall | PR Score | ROC Score |
---|---|---|---|---|---|---|---|
Fake Job Postings | ClassifierDLApproach | - | - | - | - | - | - |
Onion or Not | ClassifierDLApproach | 0.854 | 0.854 | 0.855 | 0.854 | 0.823 | 0.854 |
Spam Filter | ClassifierDLModel[^8] | 0.986 | 0.986 | 0.986 | 0.987 | 0.936 | 0.965 |
The Scala versions of the Machine Translation projects were completed at about the same time as the Python versions, so I was likewise unable to find Rouge metric code for them in Apache Spark. I have since used the Rouge metric in my text summarization projects, and I encourage you to view those.
Notes:

- I noticed that some of the HTML file versions of the projects were "too large to load" on GitHub, so I included the Python notebook (ipynb) versions as well.
- Unfortunately, I forgot to evaluate the models on the training datasets to compare against the testing datasets and check for overfitting. Given the large number of projects, it is impractical to go back and retrain all of them just to include that evaluation, but I will include it in new Spark projects going forward.
- If there is a topic you are interested in and you would like to know whether I have done any work with it, feel free to reach out to me and ask.

Footnotes:

[^1]: The extra stages included in this project are: Sentence Detector (deep learning model), Tokenizer (the regular built-in Tokenizer), Stop Words Cleaner, Spell Checker, Lemmatizer, and Token Assembler. Whereas most of my projects had three (3) stages, this project had nine (9).
[^2]: Regrettably, I did not include macro-averaged versions of the metrics in these projects.
[^3]: The pretrained model used was: classifierdl_use_fakenews.
[^4]: Almost every instance of the SentenceEmbeddings stage is preceded by a DistilBertEmbeddings stage.
[^5]: The pretrained model used was: ld_wiki_tatoeba_cnn_95.
[^7]: Even though the DecisionTreeClassifier performed best (I ran the project a couple of times to check), I understand that this is likely due to lucky sampling, so there is a likelihood of bias in its outcome.
[^8]: The pretrained model used was: classifierdl_use_spam.