Data Science Course at General Assembly San Francisco

hallr/DAT_SF_19

DAT SF 19 Course Repository

Course materials for General Assembly's Data Science course in San Francisco (11/30/15 - 3/2/16).

Instructor: Rob Hall

TAs:

  • Justin Breucop
  • Dave Yerrington

Office Hours

| Who | When |
| --- | --- |
| Justin | Sundays 3-6pm at GA |
| Dave | Fridays 6-8pm at GA |
| Rob | Slack and by appointment |

Setup Info

Installation and Setup Checklist

Git and Github Setup

Project Info

Course Project Info

Course Project Examples

Course Schedule

| Monday | Wednesday |
| --- | --- |
| 11/30: Course Overview, Introduction to Data Science | 12/2: Version Control |
| 12/7: Intro to Python | 12/9: Intro to Machine Learning, KNN |
| 12/14: NumPy, Pandas, Viz, Model Evaluation | 12/16: Regression and Regularization; Project Question & Dataset Due |
| 12/21: No Class (Holiday Break) | 12/23: No Class (Holiday Break) |
| 12/28: No Class (Holiday Break) | 12/30: No Class (Holiday Break) |
| 1/4: Logistic Regression | 1/6: Naive Bayes |
| 1/11: Clustering | 1/13: APIs & Web Scraping |
| 1/18: No Class (MLK Day) | 1/20: Advanced Model Evaluation; Project First Draft Due |
| 1/25: Decision Trees | 1/27: Ensembles and Random Forests |
| 2/1: Support Vector Machines | 2/3: Dimensionality Reduction & PCA |
| 2/8: Recommender Systems | 2/10: Text Processing / NLP; Peer Feedback on Project Drafts Due |
| 2/15: No Class (President's Day) | 2/17: Database Technologies; Project Second Draft Due (Optional) |
| 2/22: Pursuing DS Roles & Imbalanced Classes | 2/24: Course Review & Where to Go from Here |
| 2/29: Project Presentations & Project Due | 3/2: Project Presentations & Project Due |

syllabus last updated: 02/22/2016


Class 1: Introduction to Data Science

  • Welcome from General Assembly staff
  • Course overview (slides)
  • Introduction to data science (slides)
  • Command line & exercise (code)
  • Exit tickets

Homework:

Resources:


Class 2: Version Control

Homework:

  • If you haven't already, complete the homework exercise listed in the command line introduction. Create a Markdown document that includes your answers and the code you used to arrive at those answers. Add this file to a GitHub repo that you'll use for all of your coursework, and submit a link to your repo using the homework submission form.

Git and Markdown Resources:

Command Line Resources:

  • If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.

Class 3: Intro to Python

  • Jupyter Notebook overview (slides)
  • Intro to Python (slides)
  • Linear algebra refresher (slides)

Python Resources:


Class 4: Intro to Machine Learning & Classification with KNN

  • Intro to Machine Learning (slides)
  • Lab: KNN classification with Scikit-learn (notebook)

ML Resources:

  • For a more formal, in-depth introduction to machine learning, read section 2.1 (14 pages) of An Introduction to Statistical Learning, the excellent book by James, Witten, Hastie, and Tibshirani. (It's a free PDF download!)

KNN Resources:


Class 5: numpy & pandas, Visualization, Model Evaluation

  • Lab: numpy (notebook)
  • Lab: pandas (notebook)
  • Lab: Visualization with Bokeh (notebook)
  • Model Evaluation, incl. Cross Validation (slides)
  • Lab: Cross validation with Python and Scikit-learn (notebook)

Pandas Resources:

  • To learn more Pandas, review this three-part tutorial.
  • Browsing or searching the Pandas API Reference is an excellent way to locate a function even if you don't know its exact name.
  • If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis by Wes McKinney, the creator of Pandas. Ping me on Slack for a discount code.
  • Here are examples of different types of joins in Pandas, for when you need to figure out how to merge two DataFrames.
  • Optional: Read the Teaching Assistant Evaluation dataset into Pandas, create the X and y objects (the response variable is "class attribute"), and go through scikit-learn's 4-step modeling process. (There's no need to submit your code unless you have a question or would like feedback!)
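
The 4-step modeling process mentioned in the optional exercise (import, instantiate, fit, predict) might look roughly like the sketch below. It does not use the Teaching Assistant Evaluation dataset: the DataFrame, its column names, and its values are made up for illustration, and KNN is just one estimator you could choose.

```python
import pandas as pd

# Step 1: import the estimator class you plan to use
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "feature_a": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],
    "feature_b": [0.5, 0.7, 0.4, 3.0, 3.2, 2.9],
    "label": [0, 0, 0, 1, 1, 1],
})

# Create the feature matrix X and the response vector y
X = df[["feature_a", "feature_b"]]
y = df["label"]

# Step 2: instantiate the estimator with the chosen hyperparameters
knn = KNeighborsClassifier(n_neighbors=3)

# Step 3: fit the model to the data
knn.fit(X, y)

# Step 4: predict the response for new observations
new_obs = pd.DataFrame({"feature_a": [1.1, 4.9], "feature_b": [0.6, 3.1]})
predictions = knn.predict(new_obs)
print(predictions)
```

With a real dataset you would replace the toy DataFrame with something like `pd.read_csv(...)` and pick the response column accordingly.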

Model Evaluation Resources

Additional Resources:


Class 6: Regression and Regularization

  • Regression: Linear, Multiple, Polynomial (slides)
  • Regularization (slides)

Resources:


Class 7: Logistic Regression

To Go Deeper

Resources:

  • To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning.
  • For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
  • For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
  • The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
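
As a small illustration of coefficient interpretation: in logistic regression the linear predictor is the log-odds, so exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that feature. The sketch below uses only the standard library; the intercept and coefficient are hypothetical values chosen for illustration, not output from any fitted model.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return math.exp(coefficient)


def predicted_probability(intercept, coefficient, x):
    """Apply the logistic function to the linear predictor (the log-odds)."""
    log_odds = intercept + coefficient * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical fitted values, chosen only for illustration
intercept, coefficient = -1.0, 0.5

p_at_2 = predicted_probability(intercept, coefficient, 2.0)
p_at_3 = predicted_probability(intercept, coefficient, 3.0)

# Convert probabilities to odds and compare them
odds_at_2 = p_at_2 / (1.0 - p_at_2)
odds_at_3 = p_at_3 / (1.0 - p_at_3)

print("odds ratio from coefficient:", odds_ratio(coefficient))
print("ratio of odds at x=3 vs x=2:", odds_at_3 / odds_at_2)
```

The two printed numbers agree up to floating-point error, which is the sense in which a coefficient describes a constant change in odds rather than a constant change in probability.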

Class 8: Naive Bayes

Resources:

  • For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (14 pages).
  • For an intuitive explanation of Naive Bayes classification, read this post on airport security.
  • For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
  • When applying Naive Bayes classification to a dataset with continuous features, it is best to use GaussianNB rather than MultinomialNB. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
  • These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
  • Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
  • If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
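
To make the GaussianNB recommendation above concrete, here is a minimal sketch on synthetic continuous features. The data is randomly generated for illustration (two classes drawn from normal distributions with different means); it is an assumption, not a course dataset. Note that MultinomialNB expects non-negative, count-like features, so data like this would not suit it.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)

# Hypothetical continuous features: each class is drawn from a
# normal distribution with a different mean
X_class0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X_class1 = rng.normal(loc=4.0, scale=1.0, size=(50, 2))
X = np.vstack([X_class0, X_class1])
y = np.array([0] * 50 + [1] * 50)

# GaussianNB models each feature within each class as a normal distribution
gnb = GaussianNB()
gnb.fit(X, y)

# theta_ holds the per-class mean of each feature estimated from the data
print(gnb.theta_)

# Classify two new points, one near each class center
pred = gnb.predict([[0.0, 0.0], [4.0, 4.0]])
print(pred)
```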

Class 9: Clustering

Clustering Resources:


Class 10: APIs & Web Scraping

API Resources:

Web Scraping Resources:


Class 11: Advanced Model Evaluation

  • Model Evaluation, ROC, & AUC (slides)
  • Lab: Imbalanced Classes, Evaluation, & ROC (solutions) (notebook)

ROC Resources:

Other Resources:


Class 12: Decision Trees

  • Decision Trees for Classification (slides)
  • Lab: Decision Trees (notebook)

Resources:

Installing GraphViz (optional):

  • Mac: Download and install PKG file
  • Windows: Download and install MSI file, and then add GraphViz to your path:
    • Go to Control Panel, System, Advanced System Settings, Environment Variables
    • Under system variables, edit "Path" to include the path to the "bin" folder, such as: C:\Program Files (x86)\Graphviz2.38\bin
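
After installing, one way to check whether GraphViz is reachable is to look for its `dot` executable on the path. This is a small sketch using only the Python standard library; the helper name `find_graphviz_dot` is made up for this example.

```python
import shutil


def find_graphviz_dot():
    """Return the full path to the GraphViz `dot` executable, or None if it is not on the PATH."""
    return shutil.which("dot")


dot_path = find_graphviz_dot()
if dot_path is None:
    print("GraphViz 'dot' was not found on the PATH.")
else:
    print("GraphViz 'dot' found at:", dot_path)
```

If the executable is not found on Windows, open a new terminal (or restart Jupyter) after editing "Path" so the change is picked up.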

Class 13: Ensembles and Random Forests

  • Ensemble Methods & Random Forests (slides)
  • Lab: Ensemble Methods & Random Forests (notebook)

Resources:

  • scikit-learn's documentation on ensemble methods covers both "averaging methods" (such as bagging and Random Forests) as well as "boosting methods" (such as AdaBoost and Gradient Tree Boosting).
  • For an intuitive explanation of Random Forests, read Edwin Chen's answer to How do random forests work in layman's terms?
  • MLWave's Kaggle Ensembling Guide is very thorough and shows the many different ways that ensembling can take place.
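
The two families described in the scikit-learn documentation can be tried side by side. The sketch below fits one averaging method (a Random Forest) and one boosting method (AdaBoost) on a synthetic dataset; the dataset and the hyperparameter values are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (illustrative only)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    # Averaging: many independently built trees whose predictions are combined
    "random forest (averaging)": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    # Boosting: weak learners built sequentially, each focusing on earlier mistakes
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(name, round(scores[name], 3))
```

Which family does better depends on the data, so the printed scores are worth comparing across several datasets rather than reading too much into one run.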

Class 14: Support Vector Machines

  • Support Vector Machines (slides)
  • Lab: SVMs: Illuminating Advanced Classifiers (notebook)

Additional Resources:

  • See the video embedded in the answer to this question on Quora for a great animation of how kernels project non-linear classification problems into a higher dimensional space where they can be solved with a linear decision boundary / maximum margin hyperplane.
  • For students who enjoy digging into the underlying mathematical concepts, this reading details the math behind support vector machines. Some of the examples in the lecture slides are taken from this reading.
  • Supervised learning superstitions cheat sheet is a very nice comparison of five classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes, and support vector machines).
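
The idea of projecting a non-linear problem into a higher-dimensional space can be sketched with concentric circles. In the example below, a linear SVM struggles on the raw 2-D points, but adding a hand-made third feature (the squared distance from the origin) lets a linear SVM separate the classes; an RBF kernel gets a similar effect without building the feature explicitly. The dataset and the choice of extra feature are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear SVM on the original two features
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# Lift the data into 3-D by adding the squared distance from the origin
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
lifted_acc = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

# RBF kernel on the original features (implicit higher-dimensional mapping)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear, original features:", linear_acc)
print("linear, lifted features:  ", lifted_acc)
print("RBF kernel:               ", rbf_acc)
```

These are training accuracies, used here only to show separability; they are not an estimate of out-of-sample performance.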

Class 15: Dimensionality Reduction

  • Dimensionality Reduction (slides)
  • Lab: Dimensionality Reduction & Principal Components Analysis (notebook)

Additional Resources


Class 16: Recommender Systems

Thanks to Dave Yerrington for leading this session!

  • Recommendation Engines (slides)
  • Lab: Similar Users Recommender Lab (notebook)

Additional Resources


Class 17: Text Processing & NLP

  • Text Processing (slides)
  • Lab: Similar Users Recommender Lab (notebook)

Additional Resources


Class 18: Database Technologies & SQL

Additional Resources


Class 19: Pursuing Data Science Roles & Imbalanced Classes

  • Pursuing data science roles, Rocking data science interviews, and related Q&A - Dave Yerrington
  • Advanced Topic: Imbalanced Classes (slides)
  • Lab: Homework 4 solution walkthrough and Q&A

Additional Resources


Class 20: Course Review & Where to Go from Here

NOTE: The second part of this session will be a working session for course projects.

Resources:

Kaggle Resources:


Classes 21 and 22: Final Project Presentations

  • Project presentations!
