View on GitHub

NYC Taxi Data Prediction

Download this project as a .zip file Download this project as a tar.gz file

image

The Yellow Taxicab: an NYC Icon

Harvard Data Science Final Project Video

New York City, being the most populous city in the United States, has a vast and complex transportation system, including one of the largest subway systems in the world and a large fleet of more than 13,000 yellow and green taxis, that have become iconic subjects in photographs and movies.

The subway system digests the lion share of NYC's public transport use, but the 54% of NYC's residents that don't own a car and therefore rely on public transportation still take almost 200 million taxi trips per year!

Predicting pickup density using 440 million taxi trips

Thanks to some FOIL requests, data about these taxi trips has been available to the public since last year, making it a data scientist's dream. We endeavoured to delve into this gold mine using 2.5 years of NYC taxi trip data - around 440 million records - going from January 2013 to June 2015.

The primary objective of this project was to predict the density of taxi pickups throughout New York City as it changes from day to day and hour to hour. So, given a specific location, date and time, can we predict the number of pickups in that location to a reasonably high accuracy? A secondary objective was to also predict the dropoff location.

Making transportation more efficient

Predictive models like these are interesting for many people, including of course the taxi companies themselves.

  1. Taxi companies: Companies can maximize their utilization by diverting the cabs into the locations during specific times
  2. Traffic planning: Planners can use the model predictions for traffic management on specific day/time/locations. The model can be enhanced in future to incorporate features like weather, holiday etc.
  3. Data scientists: It is interesting for data scientists to see how we have modeled location data in a simple way and yet able to get reasonably good predictions

Random forests find the hot spots

After preparing the data in the cloud with Amazon Web Services, we trained random forests with deep trees to predict the pickup density. We did that in two approaches, one which predicts pickup density on an average day of the week (e.g. Mondays). A second forest predicts pickup density on a specific day (e.g. May 1, 2015). The first performs really well, being able to account for 95% of the variation in the data. The second, which also incorporates weather data, still does reasonable well, predicting density within about a factor of 1.5.

Lastly, we started work on predicting where people wanted to be dropped off, based on their pickup location. Initial results aren't terribly good, but we have ideas to improve upon this.

Our Approach

1. The data we used:

2. Exploratory data analysis

Number of Pickups in 2013 and 2014

Throughout the days of the year (horizontal axis) and the hours of the day (vertical axis) image

3. Data preparation with Apache Spark

Amazon Web Services (AWS) Cluster: (notebook)
Data cleansing: (notebook)

4. Machine learning (Pandas/Scikit learn)

Approach 1: Predicting the pickup density for an average day of week and time of day

Predicted Density Distribution vs. Actual Density Distribution on a Monday image

The above image shows the predicted number of pickups on a given Monday using a random forest regressor on the the left and the actual number of pickups on the right. The sheet number at the top of each image corresponds to the hour of the day.

Approach 2: Predicting the pickup density for a specific date and time in the future

Underneath you find the importance of each of the features in the random forest. Clearly location is most important, followed by time of the day.

Image]

Below you find the predicted and actual pick up density for May 1st, 2015.

image

Note: the noise in the data became more apparent when we used this fine temporal granularity, and the prediction accuracy decreased. We believe this results from the regressor thinking that that no data for a particular location and time means the number of pickups is unknown. Of course in reality, no records for a particular location and time means zero pickups at that location and time, because we assume that all taxi trips are recorded. We hypothesize that this shortcoming in our data preparation leads to the widespread overprediction in areas outside Manhattan.

Approach 3: Predicting the dropoff location (lat/long) based on the pickup location & time, day of week

You can play around in Tableau by clicking on the image below to explore the dropoff locations, given a pickup location.

Image

5. Next Steps:

To really make the pickup density model shine we would have to adjust the data preparation, so that we feed information about locations without any pickups to the model as well. Right now our model receives no data about the number of pickups in these locations is and thus thinks that the number of pickups is unknown. However, the absence of records at some locations means that there were zero rides in that time period. We believe that training a model with that knowledge would lead to more accurate predictions for the number of pickups on a specific date and time, such as May 1st at 6am.

The Project Team

Citations/Credit/References

  1. Thanks to Harvard AC209 TF's: Rahul Dave & Team: we used parts of their code in multiple places: (a) Cluster setup in AWS (b) Machine learning
  2. Geohash.py: A nice geohash library