This project aims to perform real-time weather analysis using stream and batch processing techniques. It involves ingesting weather data from a CSV file, streaming it to Apache Kafka, storing it in MySQL, and then processing it using Apache Spark. The analysis includes finding the minimum, maximum, and average values of various weather parameters.
The CSV file contains the following columns:
- Formatted Date
- Summary
- Precip Type
- Temperature (C)
- Apparent Temperature (C)
- Humidity
- Wind Speed (km/h)
- Wind Bearing (degrees)
- Visibility (km)
- Loud Cover
- Pressure (millibars)
- Daily Summary
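For a quick look at this layout, the file can be loaded with pandas; the sketch below assumes the file is named `weatherHistory.csv` (the actual file name may differ):

```python
# Inspect the CSV layout; "weatherHistory.csv" is an assumed file name.
import pandas as pd

df = pd.read_csv("weatherHistory.csv")
print(df.dtypes)      # column names and the types pandas infers
print(df.describe())  # min/max/mean per numeric column
```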
This project's architecture includes the following components:
- Producer (Stream): Reads the weather data from the CSV file and produces it to Apache Kafka.
- Apache Kafka: Acts as a message broker, receiving weather data from the producer and making it available to consumers.
- MySQL Database: Stores the weather data for both real-time and batch processing.
- Consumer (Stream): Subscribes to the Kafka topic and performs real-time processing of the weather data using Apache Spark Streaming.
- Consumer (Batch): Retrieves data from the MySQL database and conducts batch processing using Apache Spark.
- Producer (Batch): Reads weather data from the MySQL database and produces it to a Kafka topic for further analysis.
The end-to-end workflow is as follows:
- Data Ingestion: The producer (stream) reads weather data from the CSV file and produces it to a Kafka topic.
- Streaming Data Processing: The streaming consumer subscribes to the Kafka topic, processes incoming data using Apache Spark Streaming, and computes real-time metrics such as minimum, maximum, and average values.
- Data Storage: The processed data is stored in the MySQL database for future reference and batch processing.
- Batch Processing (Producer): The producer (batch) retrieves weather data from the MySQL database and produces it to a Kafka topic.
- Batch Processing (Consumer): The batch consumer retrieves data from the Kafka topic, conducts batch processing using Apache Spark, and computes metrics similar to the streaming process.
- Comparison: The results from both streaming and batch processing are compared to determine which approach yields better insights or performance.
To run the project, work through the following steps.

Start Kafka and Spark Servers:
- Run the `docker-compose up` command to start the Kafka and Spark servers.
Push Data to Database and Kafka Stream:
- Run the `producer.py` script to push data to your database and the Kafka stream (see the sketch below).
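The exact implementation lives in `producer.py`; as a rough sketch of the idea, assuming the `kafka-python` and `mysql-connector-python` packages, a `weather` topic, a `weather` table, and local default credentials (all assumptions, not taken from the script):

```python
# Rough producer sketch; topic, table, file name, and credentials are assumed.
import csv
import json

import mysql.connector
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

db = mysql.connector.connect(
    host="localhost", user="root", password="root", database="weather_db"
)
cursor = db.cursor()

with open("weatherHistory.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Only a few of the CSV columns are shown here for brevity.
        record = {
            "formatted_date": row["Formatted Date"],
            "temperature": float(row["Temperature (C)"]),
            "humidity": float(row["Humidity"]),
            "wind_speed": float(row["Wind Speed (km/h)"]),
        }
        # Store the record in MySQL (assumes a pre-created table) ...
        cursor.execute(
            "INSERT INTO weather (formatted_date, temperature, humidity, wind_speed)"
            " VALUES (%s, %s, %s, %s)",
            tuple(record.values()),
        )
        # ... and publish the same record to the Kafka stream.
        producer.send("weather", record)

db.commit()
producer.flush()
```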
Receive Kafka Stream and Calculate Metrics:
- Run the `consumer.py` script to receive the stream from Kafka.
- The consumer calculates the minimum, maximum, and average of the received data and writes the results to a text file (see the sketch below).
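A minimal sketch of the streaming side, assuming Spark Structured Streaming with the Kafka connector (`spark-sql-kafka`) on the classpath and the JSON message layout from the producer sketch above; the real `consumer.py` may differ:

```python
# Streaming consumer sketch; topic name and message layout are assumed.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("weather-stream-consumer").getOrCreate()

schema = (StructType()
          .add("formatted_date", StringType())
          .add("temperature", DoubleType())
          .add("humidity", DoubleType())
          .add("wind_speed", DoubleType()))

# Subscribe to the Kafka topic and parse each message's JSON payload.
weather = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "weather")
           .load()
           .select(F.from_json(F.col("value").cast("string"), schema).alias("w"))
           .select("w.*"))

# Running min/max/avg over everything received so far.
metrics = weather.agg(
    F.min("temperature").alias("min_temp"),
    F.max("temperature").alias("max_temp"),
    F.avg("temperature").alias("avg_temp"),
)

def append_to_file(batch_df, batch_id):
    # Append the current metrics to a plain text file.
    with open("stream_metrics.txt", "a") as f:
        for row in batch_df.collect():
            f.write(f"batch {batch_id}: {row.asDict()}\n")

(metrics.writeStream
 .outputMode("complete")
 .foreachBatch(append_to_file)
 .start()
 .awaitTermination())
```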
Run Batch Producer:
- Run the `batch_producer.py` script to pull data from the database in batches of 10 records and send them to Kafka (see the sketch below).
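A sketch of the batching idea, using the cursor's `fetchmany(10)` to pull ten rows at a time; the `weather_batch` topic name and the table layout follow the producer sketch above rather than the actual script:

```python
# Batch producer sketch; topic, table, and credentials are assumed.
import json

import mysql.connector
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

db = mysql.connector.connect(
    host="localhost", user="root", password="root", database="weather_db"
)
cursor = db.cursor(dictionary=True)
cursor.execute(
    "SELECT formatted_date, temperature, humidity, wind_speed FROM weather"
)

while True:
    batch = cursor.fetchmany(10)  # the 10-record batches described above
    if not batch:
        break
    for row in batch:
        # MySQL may hand numerics back as Decimal; make them JSON-friendly.
        for key in ("temperature", "humidity", "wind_speed"):
            row[key] = float(row[key])
    producer.send("weather_batch", batch)  # one message per 10-record batch

producer.flush()
```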
Process Batched Data and Store in Text File:
- Run the `consumer2.py` script to process the batched data received from Kafka.
- The script writes the processed results to a text file (see the sketch below).
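A sketch of the batch side, reading the whole `weather_batch` topic with a plain (non-streaming) Spark job and writing one set of metrics; again, names and message layout follow the sketches above, not the actual script:

```python
# Batch consumer sketch; topic name and message layout are assumed.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StructType, StringType, DoubleType

spark = SparkSession.builder.appName("weather-batch-consumer").getOrCreate()

# Each Kafka message holds a JSON array of up to 10 records.
record = (StructType()
          .add("formatted_date", StringType())
          .add("temperature", DoubleType())
          .add("humidity", DoubleType())
          .add("wind_speed", DoubleType()))

weather = (spark.read
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "weather_batch")
           .option("startingOffsets", "earliest")
           .load()
           .select(F.from_json(F.col("value").cast("string"),
                               ArrayType(record)).alias("batch"))
           .select(F.explode("batch").alias("w"))
           .select("w.*"))

# Same metrics as the streaming job, computed in one pass.
metrics = weather.agg(
    F.min("temperature").alias("min_temp"),
    F.max("temperature").alias("max_temp"),
    F.avg("temperature").alias("avg_temp"),
).collect()[0]

with open("batch_metrics.txt", "w") as f:
    f.write(f"{metrics.asDict()}\n")
```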
Comparison:
- Compare the text files generated by `consumer.py` and `consumer2.py` for speed and accuracy (a small helper sketch follows).
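A tiny helper for eyeballing the accuracy side, assuming the output file formats used in the sketches above; the last stream batch should converge to the same values as the batch job, while speed is best judged by timing the two runs:

```python
# Compare the final stream metrics against the batch metrics.
# File names and formats match the sketches above, not the actual scripts.
import ast

with open("stream_metrics.txt") as f:
    stream = ast.literal_eval(f.readlines()[-1].split(": ", 1)[1])
with open("batch_metrics.txt") as f:
    batch = ast.literal_eval(f.read().strip())

for key in batch:
    print(f"{key}: stream={stream[key]} batch={batch[key]} "
          f"diff={abs(stream[key] - batch[key])}")
```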
- Scalability: The system is designed to scale horizontally to handle increasing data volumes.
- Reliability: Kafka's replicated log and Spark's fault-tolerant execution model are relied on for high availability and fault tolerance.
- Efficiency: Optimization techniques are applied to minimize latency and maximize throughput in both streaming and batch processing.
- Data Consistency: Ensuring consistency between real-time and batch-processed data can be challenging due to the asynchronous nature of stream processing.
- Performance Tuning: Fine-tuning the system for optimal performance, especially when dealing with large datasets, requires careful consideration.
- Resource Management: Efficient utilization of computational resources is essential to prevent bottlenecks and ensure smooth operation of the system.
This project demonstrates the effectiveness of combining stream and batch processing techniques for real-time weather analysis. By comparing the results of both approaches, we can gain valuable insights into the performance and suitability of each method for different use cases.
- Special thanks to the open-source community for providing the tools and frameworks used in this project.
- Acknowledgment to team members and advisors for their support and contributions throughout the development process.