This project predicts the "Scariest Monster" using a dataset of 12 million entries and 106 features. It uses GPU-accelerated processing and a Random Forest Regressor from the NVIDIA RAPIDS API. The goal is to minimize RMSE for accurate predictions.

Parag000/GPU-Accelerated-ML-For-Big-Data-Processing

👻 Predicting the Scariest Monster - Nvidia Hackathon

Link to Colab Notebook:

Open this notebook in Google Colab

Our Submission Scores

Submission Score

Project Overview

This project presents a solution for the ODSC 2024 NVIDIA Hackathon, where data scientists are challenged to predict the "Scariest Monster" using a massive dataset filled with 12 million entries, each described by 106 anonymous features. The ultimate goal is to forecast the number of votes each monster received in a global terror poll, utilizing GPU-accelerated data processing and machine learning techniques.

Dataset

The competition dataset includes:

  • 12 million monster entries
  • 106 anonymous features (a mix of categorical and numerical)
  • Target variable 'y': Number of votes each monster received in the global terror poll
  • Dataset size: Approximately 8-10GB

Approach

Our approach to tackling this challenge involves the following steps:

  1. Data Loading and Preprocessing:

    • Loading the data using cuDF (RAPIDS NVIDIA API) for GPU-accelerated processing.
    • Performing basic Exploratory Data Analysis (EDA) to understand the dataset.
    • Dropping categorical columns to avoid creating sparse matrices.
    • Applying mean imputation for numerical columns.
    • Removing outliers and performing robust normalization for stability.
  2. Memory-Efficient Train-Test Split:

    • Creating a custom train-test split method to handle memory constraints effectively.
    • Using a randomly shuffled helper column to shuffle and split the data efficiently.
  3. Model Training:

    • Implementing a Random Forest Regressor using the RAPIDS cuML library for GPU-accelerated processing.
  4. Post-processing:

    • Applying inverse robust scaling to the predictions so that the final RMSE is calculated in the original units of the target.
  5. Prediction and Submission:

    • Generating predictions on the test set.
    • Preparing the submission file in accordance with the competition guidelines.
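
The imputation, scaling, and split steps above can be sketched in a few lines. The example below is a minimal illustration in plain Python on a toy column, not the notebook's code: the function names (`mean_impute`, `robust_scale`, `random_key_split`), the median/IQR definition of robust scaling, the 20% test fraction, and the reading of the "random column" idea as a per-row random key are all assumptions made for illustration. In the project itself these operations would run on cuDF DataFrames on the GPU.

```python
import random
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range.

    This median/IQR form is an assumed definition of robust scaling.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid dividing by zero on constant columns
    scaled = [(v - median) / iqr for v in values]
    return scaled, median, iqr


def random_key_split(rows, test_fraction=0.2, seed=0):
    """Route each row to train or test using a per-row random key.

    No shuffled copy of the data is built, which keeps peak memory low.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (test if rng.random() < test_fraction else train).append(row)
    return train, test


# Toy data standing in for one numerical feature with a missing value.
column = [1.0, None, 3.0, 5.0, 7.0]
imputed = mean_impute(column)
scaled, median, iqr = robust_scale(imputed)

# Toy row ids standing in for the 12 million dataset rows.
train, test = random_key_split(list(range(1000)))
```

Keeping `median` and `iqr` around matters for the post-processing step, since the same two values are needed to undo the scaling later.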

Technologies Used

  • NVIDIA A100 GPU for hardware acceleration
  • Python 3.x
  • RAPIDS cuDF for GPU-accelerated data processing
  • RAPIDS cuML for GPU-accelerated machine learning
  • Scikit-learn for preprocessing and metrics
  • Google Colab Notebook for interactive development

Results

The model's performance is evaluated based on Root Mean Squared Error (RMSE), with lower scores indicating better performance.
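
For reference, RMSE is the square root of the mean squared difference between predicted and actual values. The sketch below, in plain Python, computes it after undoing a robust scaling of the form `(y - median) / iqr`. That scaling formula, the helper names `inverse_robust_scale` and `rmse`, and the toy numbers are assumptions for illustration, not values from the competition.

```python
import math


def inverse_robust_scale(scaled, median, iqr):
    """Undo (y - median) / iqr to return to the original units."""
    return [s * iqr + median for s in scaled]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up scaling parameters and targets, for illustration only.
median, iqr = 10.0, 4.0
y_true = [6.0, 10.0, 14.0, 18.0]

# Hypothetical model output in scaled units, mapped back before scoring.
pred_scaled = [-1.0, 0.5, 1.0, 1.5]
y_pred = inverse_robust_scale(pred_scaled, median, iqr)

score = rmse(y_true, y_pred)
```

Under this assumed scaling, an RMSE computed directly on the scaled values would be smaller by a factor of `iqr`, which is why the inverse transform is applied before scoring.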

Getting Started

  1. Clone this repository:
    https://github.com/Parag000/Nvidia-Data-Science-Competition.git
    
    

Leaderboard

Leaderboard
