This is the repository for Colby Wise and Michael Alvarino's final project in Machine Learning with Probabilistic Programming.
Can we use probabilistic programming to create a generative model of trip durations for taxi trips in New York City?
We use the NYC Yellow Cab Dataset, specifically the data for 2016, which
includes approximately 1.8 million trips between January and June. We initially
preprocessed the data ourselves to add the neighborhoods of the pickups and
dropoffs, but later found a copy of the dataset that already included this
preprocessing. The preprocessing we performed initially is included in our
preprocessing directory.
Our first iteration through Box's loop was as simple as we could make it: a
Bayesian Gaussian linear model.
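A minimal sketch of a Bayesian Gaussian linear model follows. It assumes a zero-mean isotropic Gaussian prior on the weights and a known noise precision, which gives a closed-form posterior. The function names, the hyperparameter values, and the synthetic distance/duration data are illustrative assumptions, not the priors or features used in this project.

```python
import numpy as np


def fit_bayesian_linear(X, y, prior_precision=1.0, noise_precision=1.0):
    """Posterior over w for y ~ N(X w, 1/noise_precision), w ~ N(0, I/prior_precision)."""
    n_features = X.shape[1]
    # Standard conjugate result: posterior precision, covariance, and mean.
    S_inv = prior_precision * np.eye(n_features) + noise_precision * X.T @ X
    S = np.linalg.inv(S_inv)
    m = noise_precision * S @ X.T @ y
    return m, S


def predict(X_new, m, S, noise_precision=1.0):
    """Posterior predictive mean and variance for each row of X_new."""
    mean = X_new @ m
    # Noise variance plus the uncertainty contributed by the weights.
    var = 1.0 / noise_precision + np.einsum("ij,jk,ik->i", X_new, S, X_new)
    return mean, var


# Synthetic stand-in for (trip distance, trip duration); not the taxi data.
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 10.0, size=200)
X = np.column_stack([np.ones_like(distance), distance])  # intercept + slope
y = X @ np.array([3.0, 2.5]) + rng.normal(0.0, 1.0, size=200)

m, S = fit_bayesian_linear(X, y)
mean, var = predict(X[:5], m, S)
```

Because the posterior is Gaussian, drawing weights from `N(m, S)` and pushing them through `predict` gives simulated durations that can be compared against held-out data in the criticism step of Box's loop.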
Our second iteration through Box's loop added a basis function to our Gaussian
linear model. We decided to try a simple polynomial basis.
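One way to picture the polynomial basis: each scalar input is expanded into powers of itself, and the same Gaussian linear model is then fit on the expanded features. The sketch below assumes a single input feature, a degree-2 basis, and synthetic data; the helper names and hyperparameters are hypothetical and the degree used in the project may differ.

```python
import numpy as np


def polynomial_basis(x, degree):
    """Map a 1-D input to the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)


def posterior_mean(Phi, y, prior_precision=1.0, noise_precision=1.0):
    """Posterior mean of the weights under a zero-mean Gaussian prior."""
    A = prior_precision * np.eye(Phi.shape[1]) + noise_precision * Phi.T @ Phi
    return np.linalg.solve(A, noise_precision * Phi.T @ y)


# Synthetic quadratic data; not the taxi data.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=300)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, size=300)

Phi = polynomial_basis(x, degree=2)
w = posterior_mean(Phi, y, prior_precision=1e-3, noise_precision=100.0)
```

The model stays linear in the weights, so the inference machinery is unchanged; only the design matrix differs.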
Because there are infinitely many possible basis functions and we could not test them all, we decided to use a Gaussian process. We tested two kernels: the Gaussian kernel and the rational quadratic kernel. We found that the Gaussian process, though not a more accurate predictor of trip duration, was a much better model of our generative process.
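The two kernels can be sketched in NumPy as follows, together with the standard Gaussian process regression equations. The parameterization (`lengthscale`, `variance`, `alpha`), the noise level, and the synthetic data are assumptions made for illustration and are not taken from the project code.

```python
import numpy as np


def squared_distances(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d, 0.0)  # guard against small negative round-off


def gaussian_kernel(A, B, lengthscale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * squared_distances(A, B) / lengthscale**2)


def rational_quadratic_kernel(A, B, lengthscale=1.0, variance=1.0, alpha=1.0):
    d = squared_distances(A, B)
    return variance * (1.0 + d / (2.0 * alpha * lengthscale**2)) ** (-alpha)


def gp_predict(X_train, y_train, X_test, kernel, noise_var=0.01):
    """Posterior mean and latent variance of a zero-mean GP at X_test."""
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)
    L = np.linalg.cholesky(K)
    alpha_vec = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha_vec
    v = np.linalg.solve(L, K_s)
    var = np.diag(kernel(X_test, X_test)) - np.sum(v**2, axis=0)
    return mean, var


# Synthetic 1-D example; not the taxi data.
rng = np.random.default_rng(2)
X_train = np.linspace(0.0, 5.0, 30)[:, None]
y_train = np.sin(X_train[:, 0]) + rng.normal(0.0, 0.1, size=30)
X_test = np.linspace(0.5, 4.5, 10)[:, None]

mean_g, var_g = gp_predict(X_train, y_train, X_test, gaussian_kernel)
mean_rq, var_rq = gp_predict(X_train, y_train, X_test, rational_quadratic_kernel)
```

The rational quadratic kernel behaves like a mixture of Gaussian kernels with different lengthscales, and it approaches the Gaussian kernel as `alpha` grows large.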
Use TensorFlow to optimize the parameters of the Gaussian process kernel function.
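A hedged sketch of what that optimization could look like in TensorFlow 2: minimize the negative log marginal likelihood of the Gaussian process with respect to the kernel parameters, using automatic differentiation and plain gradient descent. The log-parameterization, the learning rate, the step count, and the synthetic data are all assumptions for illustration, not the project's implementation.

```python
import numpy as np
import tensorflow as tf


def gaussian_kernel(X1, X2, log_lengthscale, log_variance):
    """Gaussian kernel with parameters stored as logs to keep them positive."""
    sq = (tf.reduce_sum(X1**2, axis=1)[:, None]
          + tf.reduce_sum(X2**2, axis=1)[None, :]
          - 2.0 * tf.matmul(X1, X2, transpose_b=True))
    sq = tf.maximum(sq, 0.0)
    return tf.exp(log_variance) * tf.exp(-0.5 * sq / tf.exp(log_lengthscale) ** 2)


def negative_log_marginal_likelihood(X, y, log_lengthscale, log_variance, log_noise):
    n = tf.shape(X)[0]
    K = gaussian_kernel(X, X, log_lengthscale, log_variance)
    # Observation noise plus a small jitter for numerical stability.
    K += (tf.exp(log_noise) + 1e-6) * tf.eye(n, dtype=X.dtype)
    L = tf.linalg.cholesky(K)
    alpha = tf.linalg.cholesky_solve(L, y[:, None])
    data_fit = 0.5 * tf.reduce_sum(y[:, None] * alpha)
    complexity = tf.reduce_sum(tf.math.log(tf.linalg.diag_part(L)))
    constant = 0.5 * tf.cast(n, X.dtype) * np.log(2.0 * np.pi)
    return data_fit + complexity + constant


# Synthetic 1-D data; not the taxi data.
rng = np.random.default_rng(3)
X_np = np.linspace(0.0, 5.0, 40)[:, None]
y_np = np.sin(X_np[:, 0]) + rng.normal(0.0, 0.1, size=40)
X = tf.constant(X_np, dtype=tf.float64)
y = tf.constant(y_np, dtype=tf.float64)

# Hypothetical starting values: lengthscale, variance, and noise all equal to 1.
params = [tf.Variable(0.0, dtype=tf.float64) for _ in range(3)]

initial_loss = float(negative_log_marginal_likelihood(X, y, *params))
learning_rate = 0.005
for _ in range(300):
    with tf.GradientTape() as tape:
        loss = negative_log_marginal_likelihood(X, y, *params)
    grads = tape.gradient(loss, params)
    for p, g in zip(params, grads):
        p.assign_sub(learning_rate * g)  # plain gradient descent step
final_loss = float(negative_log_marginal_likelihood(X, y, *params))
```

Plain gradient descent keeps the sketch dependency-light; an adaptive optimizer would likely converge in fewer steps. The same loss function works for the rational quadratic kernel by swapping the kernel and adding its `alpha` as a fourth variable.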