Predicting US Airline Delays Using Spark (PySpark) and Apache Arrow
The objective of this project is to analyze historical flight data for insights into delays, and to build a predictive model that estimates whether a flight will be delayed, given a set of flight characteristics.
Questions the analysis should answer:
• Which airports have the most delays?
• Which routes are typically the most delayed?
• What is the origin airport delay per month?
• What is the origin airport delay per day and per hour?
• What are the primary causes of flight delays?
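Most of the questions above reduce to grouping flights by one or more columns and comparing the average delay per group. The following is a minimal sketch of that idea in plain Python rather than PySpark, so that it runs without a Spark installation. The sample rows are hypothetical, the helper `mean_delay_by` is an illustrative name, and the field names (`Origin`, `Dest`, `Month`, `DepDelay`) are assumed to match the column naming of the Data Expo 2009 files.

```python
from collections import defaultdict

# Hypothetical sample rows, for illustration only (not real flight records).
rows = [
    {"Origin": "ORD", "Dest": "LGA", "Month": 1, "DepDelay": 35.0},
    {"Origin": "ORD", "Dest": "LGA", "Month": 1, "DepDelay": 5.0},
    {"Origin": "ATL", "Dest": "MCO", "Month": 1, "DepDelay": 0.0},
    {"Origin": "ATL", "Dest": "MCO", "Month": 2, "DepDelay": 20.0},
    {"Origin": "ORD", "Dest": "SFO", "Month": 2, "DepDelay": None},  # missing delay
]


def mean_delay_by(rows, key_fields, delay_field="DepDelay"):
    """Return (group key, mean delay) pairs, largest mean delay first."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of delays, row count]
    for row in rows:
        delay = row.get(delay_field)
        if delay is None:
            continue  # skip rows with no recorded delay
        key = tuple(row[field] for field in key_fields)
        totals[key][0] += delay
        totals[key][1] += 1
    means = {key: total / count for key, (total, count) in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


by_airport = mean_delay_by(rows, ["Origin"])                 # most delayed airports
by_route = mean_delay_by(rows, ["Origin", "Dest"])           # most delayed routes
by_airport_month = mean_delay_by(rows, ["Origin", "Month"])  # origin delay per month
```

In PySpark the same grouping would typically be written with `groupBy` and an average aggregate on a DataFrame; the plain-Python version only shows the shape of the computation.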
The predictive model (Logistic Regression) predicts whether a flight will be delayed, based on certain characteristics of the flight. Such a model may help both passengers and airlines anticipate future delays and reduce them.
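To make the classification setup concrete, here is a rough sketch of logistic regression trained by batch gradient descent in plain Python. It is not the project's PySpark pipeline. The 15-minute cutoff for calling a flight "delayed", the single feature (scheduled departure hour), the toy flights, and all function names are assumptions made for illustration.

```python
import math

DELAY_THRESHOLD_MIN = 15  # assumed cutoff: arrival delay of 15+ minutes counts as delayed


def label(arr_delay_min):
    """Binary target: 1 if the flight is delayed, else 0."""
    return 1 if arr_delay_min >= DELAY_THRESHOLD_MIN else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Estimated probability that a flight with features x is delayed."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_proba(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical toy flights: (scheduled departure hour, arrival delay in minutes).
flights = [(6, -3), (7, 2), (9, 0), (12, 10), (17, 25), (18, 40), (20, 16), (21, 55)]
X = [[(hour - 12) / 12.0] for hour, _ in flights]  # center and scale the hour
y = [label(delay) for _, delay in flights]

weights, bias = train(X, y)
late_evening = predict_proba([(21 - 12) / 12.0], weights, bias)
early_morning = predict_proba([(6 - 12) / 12.0], weights, bias)
```

On this toy data the fitted model assigns a higher delay probability to the late-evening departure than to the early-morning one. A real model would use many more features and a held-out test set.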
The dataset is obtained from "http://stat-computing.org/dataexpo/2009/the-data.html" and "https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID="