SPTKL/Principle_of_Urban_Informatics

Principle of Urban Informatics 2017

This is a class homework repository for Baiyue Cao (bc1561).

Description:

This class covers the basics of data-driven urban research. I acquired computational skills, basic knowledge of statistical analysis and error analysis, good practices for handling data and big data, and communication and visualization skills. I learned how to formulate a question relevant to Urban Science, find appropriate data to answer it, prepare and analyze the data, reach an answer at whatever confidence level the data support, and communicate both the answer and my confidence in it.

Key Words/techniques:

  • Research reproducibility: Git, virtual environment, virtual machine, version control, hypothesis formulation
  • Data ETL: Pandas, Geopandas, SQL, API
  • Statistical tests: Anderson-Darling test (AD), Kullback–Leibler divergence (KL), Chi-square, Kolmogorov–Smirnov test (KS)
  • Clustering: PCA, Kmeans, Gaussian Mixture
  • Time Series: Fourier transform
  • Linear modeling: OLS, WLS, GLS
  • Key data set:
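
As an illustration of the statistical tests listed above, here is a minimal sketch of the two-sample Kolmogorov–Smirnov test written with the standard library only. The function names `ks_statistic` and `ks_critical` and the synthetic Gaussian samples are assumptions made for this example; the homework notebooks may well use library routines instead.

```python
import bisect
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # The empirical CDFs are step functions, so the largest gap
    # is reached at one of the observed data points.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Hypothetical data: two Gaussian samples whose means differ by 1.
rng = random.Random(42)
sample_a = [rng.gauss(0, 1) for _ in range(500)]
sample_b = [rng.gauss(1, 1) for _ in range(500)]

d = ks_statistic(sample_a, sample_b)
threshold = ks_critical(len(sample_a), len(sample_b))
# Reject the null hypothesis (same parent distribution) when d > threshold.
print(f"D = {d:.3f}, critical value = {threshold:.3f}, reject H0: {d > threshold}")
```

The null hypothesis is that both samples come from the same distribution. It is rejected at significance level `alpha` when the statistic exceeds the critical value.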

Content:

  1. Setting up a virtual environment and formulating a null hypothesis link
  2. Extracting data from the MTA API link
  3. Demonstrating the central limit theorem with visualization, and data exploration with Citi Bike data link
  4. Replication study for Effectiveness of the NYC Post-Prison Employment Program: formulating a null hypothesis and conducting statistical tests link
  5. Running KS/AD/KL/Chi-square tests on sample data, creating OLS and WLS models link
  6. Visualizing the NYC LL84 dataset and comparing a linear model with a polynomial model link
  7. Using CartoDB and SQL queries for data ETL link
  8. Visualization practice with NYC HIV demographics data link
  9. Reviewing visualization: using Geopandas to plot a choropleth of broadband access percentage in NYC along with LinkNYC locations, with data from the American Community Survey API and LinkNYC open data link
  10. Time series techniques: smoothing, detrending, stationarity and non-stationarity, homo- and heteroscedastic noise, and vectorization. Also conducted user behavior clustering using PCA feature selection and Kmeans link
  11. Clustering NYC ZIP codes using business activity time series data from the Census Bureau API: whitening the data, then applying Kmeans and Gaussian Mixture clustering link
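
The central limit theorem exercise in item 3 can be sketched as follows. This is a hedged, standard-library-only illustration, not the homework code: the helper `sample_means`, the exponential parent distribution, and the sample sizes are all assumptions chosen for the example.

```python
import random
import statistics


def sample_means(n, trials=2000, rate=1.0, seed=0):
    """Return `trials` sample means, each computed from n exponential draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(trials)
    ]


# The exponential distribution with rate 1 has mean 1 and standard deviation 1.
# The CLT predicts that the sample means concentrate around 1 with a spread
# close to 1 / sqrt(n), even though the parent distribution is skewed.
for n in (1, 10, 100):
    means = sample_means(n)
    observed = statistics.stdev(means)
    predicted = 1 / n ** 0.5
    print(f"n={n:4d}  mean of means={statistics.fmean(means):.3f}  "
          f"spread={observed:.3f}  predicted={predicted:.3f}")
```

Plotting a histogram of `means` for each `n` shows the distribution of sample means narrowing and becoming more symmetric as `n` grows.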

Note:

Special shout out to Federica Bianco for this amazing class.
