👻 Predicting the Scariest Monster - Nvidia Hackathon

Link to Colab Notebook:

Our Submission Scores

Project Overview

This project is a solution for the ODSC 2024 NVIDIA Hackathon, which challenges data scientists to predict the "Scariest Monster" from a dataset of 12 million entries, each described by 106 anonymous features. The goal is to forecast the number of votes each monster received in a global terror poll, using GPU-accelerated data processing and machine learning.

Dataset

The competition dataset includes:

- 12 million monster entries
- 106 anonymous features (a mix of categorical and numerical)
- Target variable 'y': number of votes each monster received in the global terror poll
- Dataset size: approximately 8-10 GB
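
As a rough sanity check on that size, the footprint of a dense numeric table can be estimated from the row and column counts. The sketch below assumes every feature and the target are stored as fixed-width floats; this is an assumption, since the source does not state the column dtypes and the categorical columns would be stored differently.

```python
# Back-of-the-envelope memory estimate for a dense numeric table.
# Assumption: all 106 features plus the target 'y' are fixed-width floats.
N_ROWS = 12_000_000
N_COLS = 106 + 1  # features + target


def table_size_gb(n_rows: int, n_cols: int, bytes_per_value: int) -> float:
    """Return the dense table size in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9


float32_gb = table_size_gb(N_ROWS, N_COLS, 4)
float64_gb = table_size_gb(N_ROWS, N_COLS, 8)
print(f"float32: {float32_gb:.1f} GB, float64: {float64_gb:.1f} GB")
```

Under these assumptions the table comes to roughly 5 GB in 32-bit floats and roughly 10 GB in 64-bit floats, the same order of magnitude as the reported dataset size.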

Approach

Our approach to tackling this challenge involves the following steps:

1. Data Loading and Preprocessing:
   - Loading the data with cuDF (part of NVIDIA RAPIDS) for GPU-accelerated processing.
   - Performing basic Exploratory Data Analysis (EDA) to understand the dataset.
   - Dropping categorical columns to avoid creating sparse matrices.
   - Applying mean imputation to numerical columns.
   - Removing outliers and applying robust normalization for stability.
2. Memory-Efficient Train-Test Split:
   - Creating a custom train-test split method to handle memory constraints effectively.
   - Using a randomly shuffled column for efficient data shuffling and splitting.
3. Model Training:
   - Implementing a Random Forest Regressor with the RAPIDS cuML library for GPU-accelerated training.
4. Post-processing:
   - Applying inverse robust scaling before calculating the final RMSE value.
5. Prediction and Submission:
   - Generating predictions on the test set.
   - Preparing the submission file in accordance with the competition guidelines.
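
The preprocessing and splitting steps above can be sketched on a small synthetic array. This is a CPU-only illustration in NumPy, not the notebook's code: the function names, the 1st/99th percentile clipping used as a stand-in for outlier removal, the 80/20 ratio, and the reading of the "random column" as one random draw per row are all assumptions. Model training is left out because it needs a GPU; the project ran these steps with cuDF and cuML.

```python
import numpy as np


def impute_mean(X):
    """Replace NaNs in each column with that column's mean."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def clip_outliers(X, lower_pct=1.0, upper_pct=99.0):
    """Clip each column to a percentile range (assumed outlier rule)."""
    lo, hi = np.percentile(X, [lower_pct, upper_pct], axis=0)
    return np.clip(X, lo, hi)


def robust_scale(X):
    """Center each column on its median and divide by its IQR."""
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = np.where(q3 - q1 == 0, 1.0, q3 - q1)  # guard against zero IQR
    return (X - median) / iqr, median, iqr


def random_column_split(n_rows, test_fraction=0.2, seed=0):
    """Draw one random number per row and split by thresholding it.

    Only boolean masks are created here, so the table itself is never
    copied into a shuffled order before being split.
    """
    rng = np.random.default_rng(seed)
    random_column = rng.random(n_rows)
    test_mask = random_column < test_fraction
    return ~test_mask, test_mask


# Tiny synthetic table standing in for the real data.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
X[rng.random(X.shape) < 0.05] = np.nan  # inject missing values

X_clean = clip_outliers(impute_mean(X))
X_scaled, median, iqr = robust_scale(X_clean)
train_mask, test_mask = random_column_split(len(X_scaled))
X_train, X_test = X_scaled[train_mask], X_scaled[test_mask]
print(X_train.shape, X_test.shape)
```

The sketch follows the order listed above (scale, then split). In a stricter setup the imputation and scaling statistics would be computed on the training rows only, so that no information from the held-out rows leaks into training.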

Technologies Used

- NVIDIA A100 GPU for hardware acceleration
- Python 3.x
- RAPIDS cuDF for GPU-accelerated data processing
- RAPIDS cuML for GPU-accelerated machine learning
- Scikit-learn for preprocessing and metrics
- Google Colab Notebook for interactive development

Results

The model's performance is evaluated based on Root Mean Squared Error (RMSE), with lower scores indicating better performance.
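
For reference, RMSE is the square root of the mean of the squared differences between true and predicted values. The sketch below shows one way an RMSE can be reported in original vote units when the target was robust-scaled before training, as the post-processing step describes. The helper names and all numbers are hypothetical and chosen only for illustration; they are not competition data or results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def inverse_robust_scale(values, median, iqr):
    """Undo the transform y_scaled = (y - median) / iqr."""
    return [v * iqr + median for v in values]


# Illustrative scaling statistics and values only.
median, iqr = 50.0, 20.0
y_true_scaled = [0.0, 0.5, -1.0, 2.0]
y_pred_scaled = [0.1, 0.4, -0.5, 1.5]

scaled_rmse = rmse(y_true_scaled, y_pred_scaled)
original_rmse = rmse(
    inverse_robust_scale(y_true_scaled, median, iqr),
    inverse_robust_scale(y_pred_scaled, median, iqr),
)
print(scaled_rmse, original_rmse)
```

Because the inverse transform is linear, the RMSE in original units equals the scaled RMSE multiplied by the IQR; the median shift cancels in the differences.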

Getting Started

Clone this repository:

git clone https://github.com/Parag000/Nvidia-Data-Science-Competition.git

