Course materials for General Assembly's Data Science course in San Francisco (11/30/15 - 3/2/16).
Instructor: Rob Hall
TAs:
- Justin Breucop
- Dave Yerrington
Who | When |
---|---|
Justin | Sundays 3-6pm at GA |
Dave | Fridays 6-8pm at GA |
Rob | Slack and by appointment |
Installation and Setup Checklist
Monday | Wednesday |
---|---|
11/30: Course Overview, Introduction to Data Science | 12/2: Version Control |
12/7: Intro to Python | 12/9: Intro to Machine Learning, KNN |
12/14: NumPy, Pandas, Viz, Model Evaluation | 12/16: Regression and Regularization (Project Question & Dataset Due) |
12/21: No Class (Holiday Break) | 12/23: No Class (Holiday Break) |
12/28: No Class (Holiday Break) | 12/30: No Class (Holiday Break) |
1/4: Logistic Regression | 1/6: Naive Bayes |
1/11: Clustering | 1/13: APIs & Web Scraping |
1/18: No Class (MLK Day) | 1/20: Advanced Model Evaluation (Project First Draft Due) |
1/25: Decision Trees | 1/27: Ensembles and Random Forests |
2/1: Support Vector Machines | 2/3: Dimensionality Reduction & PCA |
2/8: Recommender Systems | 2/10: Text Processing / NLP (Peer Feedback on Project Drafts Due) |
2/15: No Class (Presidents' Day) | 2/17: Database Technologies (Project Second Draft Due, optional) |
2/22: Pursuing DS Roles & Imbalanced Classes | 2/24: Course Review & Where to Go from Here |
2/29: Project Presentations & Project Due | 3/2: Project Presentations & Project Due |
syllabus last updated: 02/22/2016
- Welcome from General Assembly staff
- Course overview (slides)
- Introduction to data science (slides)
- Command line & exercise (code)
- Exit tickets
Homework:
- Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows), and then browse through this command line reference.
- Watch videos 1 through 8 (21 minutes) of Introduction to Git and GitHub.
- If your laptop has any setup issues, please work with us to resolve them by Wednesday.
Resources:
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- Quora has a data science topic FAQ with lots of interesting Q&A.
- Final project presentations from another class
- Q&A on course project expectations & schedule
- Version Control with Git and GitHub (slides)
- Git configuration and GitHub setup
- Moved to Class 3: Intro to Python (slides)
- Exit tickets
Homework:
- If you haven't already, complete the homework exercise listed in the command line introduction. Create a Markdown document that includes your answers and the code you used to arrive at those answers. Add this file to a GitHub repo that you'll use for all of your coursework, and submit a link to your repo using the homework submission form.
Git and Markdown Resources:
- Pro Git is an excellent book for learning Git. Read Chapter 1 - Getting Started and Chapter 2 - Git Basics to gain a deeper understanding of version control and basic commands.
- Very quick Git tutorial by GitHub and Code School. Recommended practice!
- GitHub's Mastering Markdown is a good starting point for learning GitHub-flavored Markdown.
- Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations.
Command Line Resources:
- If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
Python Resources:
- Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
- DataQuest: Similar interface to Codecademy, but focused on teaching Python in the context of data science.
- Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
- A Crash Course in Python for Scientists: Read through the Overview section for a quick introduction to Python.
- Python for Informatics: A very beginner-oriented book, with associated slides and videos.
ML Resources:
- For a more formal, in-depth introduction to machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
KNN Resources:
- A Detailed Introduction to KNN is a bit dense, but provides a more thorough introduction to KNN and its applications.
- Browse through the scikit-learn documentation for KNN to get a sense of how it's organized: user guide, module reference, class documentation
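To see how those pieces fit together, here's a minimal KNN sketch (using scikit-learn's built-in iris data rather than a course dataset):

```python
# Minimal KNN sketch using scikit-learn's built-in iris data
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=5)  # instantiate with K=5
knn.fit(X, y)                              # fit the model to the data
print(knn.predict([[3, 5, 4, 2]]))         # predict the class of a new observation
```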
- Lab: numpy (notebook)
- Lab: pandas (notebook)
- Lab: Visualization with Bokeh (notebook)
- Model Evaluation, incl. Cross Validation (slides)
- Lab: Cross validation with Python and Scikit-learn (notebook)
Pandas Resources:
- To learn more Pandas, review this three-part tutorial.
- Browsing or searching the Pandas API Reference is an excellent way to locate a function even if you don't know its exact name.
- If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis by Wes McKinney, the creator of Pandas. Ping me on Slack for a discount code.
- Here are examples of different types of joins in Pandas, for when you need to figure out how to merge two DataFrames.
- Optional: Read the Teaching Assistant Evaluation dataset into Pandas, create the X and y objects (the response variable is "class attribute"), and go through scikit-learn's 4-step modeling process. (There's no need to submit your code unless you have a question or would like feedback!)
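If you attempt the optional exercise above, here's one possible skeleton; the UCI URL and column names below are assumptions, so check them against the dataset's documentation:

```python
# Sketch of the optional exercise: load the TAE data with Pandas, then run
# scikit-learn's 4-step modeling process. The URL and column names are guesses.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tae/tae.data'
cols = ['native_speaker', 'instructor', 'course', 'semester', 'class_size', 'class_attribute']
tae = pd.read_csv(url, header=None, names=cols)

X = tae.drop('class_attribute', axis=1)    # feature matrix
y = tae['class_attribute']                 # response: "class attribute"

knn = KNeighborsClassifier(n_neighbors=5)  # 1. import, 2. instantiate
knn.fit(X, y)                              # 3. fit
print(knn.predict(X.head()))               # 4. predict
```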
Model Evaluation Resources
- For more on cross-validation, read section 5.1 of An Introduction to Statistical Learning (11 pages)
- For another explanation of training error versus testing error, the bias-variance tradeoff, and train/test split (also known as the "validation set approach"), watch Hastie and Tibshirani's video on estimating prediction error (12 minutes, starting at 2:34).
- Caltech's Learning From Data course includes a fantastic video on visualizing bias and variance (15 minutes).
- Random Test/Train Split is Not Always Enough explains why random train/test split may not be a suitable model evaluation procedure if your data has a significant time element.
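As a quick reference, here's a minimal sketch of both procedures; depending on your scikit-learn version, these helpers live in `sklearn.cross_validation` (older) or `sklearn.model_selection` (newer):

```python
# Minimal sketch of train/test split and 5-fold cross-validation
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Train/test split (the "validation set approach")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))   # testing accuracy

# Cross-validation gives a lower-variance estimate of out-of-sample accuracy
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores.mean())
```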
Additional Resources:
- What I do when I get a new data set as told through tweets is a fun (yet enlightening) look at the process of exploratory data analysis.
Resources:
- Setosa has an excellent interactive visualization of linear regression.
- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning, from which this lesson was adapted. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
- To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on simple linear regression and multiple linear regression.
- This introduction to linear regression is much more detailed and mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.
- John Rauser's talk on Statistics Without the Agonizing Pain (12 minutes) gives a great explanation of how the null hypothesis is rejected.
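To experiment with the Statsmodels output discussed above, here's a minimal sketch using a tiny hand-entered DataFrame (the data and column names are made up for illustration):

```python
# Minimal Statsmodels linear regression sketch; data and names are made up
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({'TV':    [230.1, 44.5, 17.2, 151.5, 180.8],
                     'Sales': [22.1, 10.4, 9.3, 18.5, 12.9]})

model = smf.ols(formula='Sales ~ TV', data=data).fit()
print(model.params)     # intercept and slope
print(model.summary())  # R-squared, p-values, confidence intervals, etc.
```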
To Go Deeper
- Read these excellent articles from BetterExplained: An Intuitive Guide To Exponential Functions & e and Demystifying the Natural Logarithm (ln).
Resources:
- To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning.
- For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
- For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
- The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
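Here's a minimal sketch of the difference between class predictions and predicted probabilities (the latter are what calibration concerns), using the built-in iris data:

```python
# Minimal logistic regression sketch: class predictions vs. probabilities
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

print(logreg.predict(X[:3]))        # predicted classes
print(logreg.predict_proba(X[:3]))  # predicted probability of each class
```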
Resources:
- For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (14 pages).
- For an intuitive explanation of Naive Bayes classification, read this post on airport security.
- For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
- When applying Naive Bayes classification to a dataset with continuous features, it is best to use GaussianNB rather than MultinomialNB. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
- These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
- Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
- If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
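To make the GaussianNB vs. MultinomialNB distinction concrete, here's a minimal sketch with made-up data:

```python
# Minimal sketch of the GaussianNB vs. MultinomialNB distinction
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous features (e.g., measurements) -> GaussianNB
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [5.8, 2.7]])
print(GaussianNB().fit(X_cont, y).predict([[5.0, 3.4]]))

# Count features (e.g., word counts from CountVectorizer) -> MultinomialNB
X_counts = np.array([[2, 0, 1], [3, 0, 0], [0, 4, 1], [0, 2, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]]))
```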
- Clustering (slides)
- K-means: visualization
- DBSCAN: visualization
- Lab: K-Means (notebook)
Clustering Resources:
- scikit-learn's documentation on clustering compares many different types of clustering.
- For a very thorough introduction to clustering, read chapter 8 (69 pages) of Introduction to Data Mining (available as a free download), or browse through the chapter 8 slides.
- An Introduction to Statistical Learning has useful videos on K-means clustering (17 minutes) and hierarchical clustering (15 minutes).
- Fun examples of clustering: A Statistical Analysis of the Work of Bob Ross (with data and Python code), How a Math Genius Hacked OkCupid to Find True Love, and characteristics of your zip code.
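For reference, a minimal K-means sketch (the choice of `n_clusters=3` is arbitrary here):

```python
# Minimal K-means sketch; clustering ignores the labels entirely
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, random_state=1).fit(X)

print(km.labels_[:10])      # cluster assignment for each observation
print(km.cluster_centers_)  # coordinates of the three cluster centers
```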
API Resources:
- Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
- API Integration in Python provides a very readable introduction to REST APIs.
- Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.
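For a feel of what calling a REST API from Python looks like, here's a minimal sketch using the requests library against GitHub's public repos endpoint (any JSON API would work the same way; the field names are as returned by that endpoint):

```python
# Minimal REST API sketch with requests; GitHub's public API is just a
# convenient unauthenticated example
import requests

response = requests.get('https://api.github.com/repos/pandas-dev/pandas')
print(response.status_code)   # 200 means success

data = response.json()        # most REST APIs return JSON
print(data['description'], data['stargazers_count'])
```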
Web Scraping Resources:
- The Beautiful Soup documentation is incredibly thorough, but is hard to use as a reference guide. However, the section on specifying a parser may be helpful if Beautiful Soup appears to be parsing a page incorrectly.
- For more Beautiful Soup examples and tutorials, see Web Scraping 101 with Python, Alex's well-commented notebook on scraping Craigslist, this notebook from Stanford's Text As Data course, and this notebook and associated video from Harvard's Data Science course.
- For a much longer web scraping tutorial covering Beautiful Soup, lxml, XPath, and Selenium, watch Web Scraping with Python (3 hours 23 minutes) from PyCon 2014. The slides and code are also available.
- For more complex web scraping projects, Scrapy is a popular application framework that works with Python. It has excellent documentation, and here's a tutorial with detailed slides and code.
- robotstxt.org has a concise explanation of how to write (and read) the `robots.txt` file.
- import.io and Kimono claim to allow you to scrape websites without writing any code.
- How a Math Genius Hacked OkCupid to Find True Love and How Netflix Reverse Engineered Hollywood are two fun examples of how web scraping has been used to build interesting datasets.
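And a minimal Beautiful Soup sketch to get started (the URL is a placeholder; swap in the page you actually want to scrape, after checking its robots.txt):

```python
# Minimal Beautiful Soup sketch: fetch a page and list its links
import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com').text
soup = BeautifulSoup(html, 'html.parser')   # specify the parser explicitly

for link in soup.find_all('a'):
    print(link.get('href'), link.get_text())
```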
- Model Evaluation, ROC, & AUC (slides)
- Lab: Imbalanced Classes, Evaluation, & ROC (solutions) (notebook)
ROC Resources:
- Rahul Patwari has a great video on ROC Curves (12 minutes).
- An introduction to ROC analysis is a very readable paper on the topic.
- These lesson notes from a course at the University of Georgia include some simple, real-world examples of the use of ROC curves.
- ROC curves can be used across a wide variety of applications, such as comparing different feature sets for detecting fraudulent Skype users, and comparing different classifiers on a number of popular datasets.
- This blog post about Amazon Machine Learning contains a neat graphic showing how classification threshold affects different evaluation metrics.
Other Resources:
- scikit-learn has extensive documentation on model evaluation.
- Section 3.3.1 of An Introduction to Statistical Learning (4 pages) has a great explanation of dummy encoding for categorical features.
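For reference, here's a minimal sketch of computing an ROC curve and AUC in scikit-learn; note that both functions take predicted probabilities, not class predictions:

```python
# Minimal ROC/AUC sketch; both metrics need predicted probabilities
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)
print(roc_auc_score(y_test, probs))
```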
Resources:
- scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
- For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
- This paper, The Science of Singing Along, contains a neat regression tree for predicting the percentage of an audience at a music venue that will sing along to a pop song.
- If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
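A minimal decision tree sketch, using the built-in iris data:

```python
# Minimal decision tree sketch; max_depth keeps the tree small and readable
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(iris.data, iris.target)

# Which features did the tree find most useful for splitting?
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(name, importance)
```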
Installing GraphViz (optional):
- Mac: Download and install PKG file
- Windows: Download and install MSI file, and then add GraphViz to your path:
- Go to Control Panel, System, Advanced System Settings, Environment Variables
- Under system variables, edit "Path" to include the path to the "bin" folder, such as: `C:\Program Files (x86)\Graphviz2.38\bin`
Resources:
- scikit-learn's documentation on ensemble methods covers both "averaging methods" (such as bagging and Random Forests) as well as "boosting methods" (such as AdaBoost and Gradient Tree Boosting).
- For an intuitive explanation of Random Forests, read Edwin Chen's answer to How do random forests work in layman's terms?
- MLWave's Kaggle Ensembling Guide is very thorough and shows the many different ways that ensembling can take place.
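A minimal Random Forest sketch for reference:

```python
# Minimal Random Forest sketch: an "averaging" ensemble of decision trees
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=1)
print(cross_val_score(rf, X, y, cv=5).mean())   # cross-validated accuracy
```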
Additional Resources:
- See the video embedded in the answer to this question on Quora for a great animation of how kernels project non-linear classification problems into a higher dimensional space where they can be solved with a linear decision boundary / maximum margin hyperplane.
- For students who enjoy digging into the underlying mathematical concepts, this reading details the math behind support vector machines. Some of the examples in the lecture slides are taken from this reading.
- Supervised learning superstitions cheat sheet is a very nice comparison of five classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes, and support vector machines).
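To see the role of the kernel parameter in code, here's a minimal sketch comparing a linear and an RBF kernel:

```python
# Minimal SVM sketch: the kernel controls the implicit projection into a
# higher-dimensional space
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
for kernel in ['linear', 'rbf']:
    print(kernel, cross_val_score(SVC(kernel=kernel, C=1.0), X, y, cv=5).mean())
```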
- Dimensionality Reduction (slides)
- Lab: Dimensionality Reduction & Principal Components Analysis (notebook)
Additional Resources
- This tutorial on Principal Components Analysis (PCA) includes good refreshers on covariance and linear algebra
- To go deeper on Singular Value Decomposition, read Kirk Baker's excellent tutorial.
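A minimal PCA sketch for reference:

```python
# Minimal PCA sketch: reduce iris's 4 features to 2 principal components
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```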
Thanks to Dave Yerrington for leading this session!
Additional Resources
- Chapter 9 of Mining of Massive Datasets (36 pages) is a more thorough introduction to recommendation systems.
- Chapters 2 through 4 of A Programmer's Guide to Data Mining (165 pages) provides a friendlier introduction, with lots of Python code and exercises.
- The Netflix Prize was the famous competition for improving Netflix's recommendation system by 10%. Here are some useful articles about the Netflix Prize:
- Netflix Recommendations: Beyond the 5 stars: Two posts from the Netflix blog summarizing the competition and their recommendation system
- Winning the Netflix Prize: A Summary: Overview of the models and techniques that went into the winning solution
- A Perspective on the Netflix Prize: A summary of the competition by the winning team
- This paper summarizes how Amazon.com's recommendation system works, and this Stack Overflow Q&A has some additional thoughts.
- Facebook and Etsy have blog posts about how their recommendation systems work.
- The Global Network of Discovery provides some neat recommenders for music, authors, and movies.
- The People Inside Your Machine (23 minutes) is a Planet Money podcast episode about how Amazon Mechanical Turks can assist with recommendation engines (and machine learning in general).
- Coursera has a course on recommendation systems, if you want to go even deeper into the material.
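As a taste of item-based collaborative filtering, here's a minimal sketch that computes cosine similarity between items from a made-up ratings matrix:

```python
# Minimal item-based similarity sketch on a made-up ratings matrix
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are items; 0 means "not rated"
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]])

# Item-to-item similarity = similarity between the rating columns
item_similarity = cosine_similarity(ratings.T)
print(np.round(item_similarity, 2))
```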
Additional Resources
- If you want to learn a lot more NLP, check out the excellent video lectures and slides from this Coursera course (which is no longer being offered).
- Natural Language Processing with Python is the most popular book for going in-depth with the Natural Language Toolkit (NLTK).
- A Smattering of NLP in Python provides a nice overview of NLTK, as does this notebook from DAT5.
- spaCy is a newer Python library for text processing that is focused on performance (unlike NLTK).
- If you want to get serious about NLP, Stanford CoreNLP is a suite of tools (written in Java) that is highly regarded.
- When working with a large text corpus in scikit-learn, HashingVectorizer is a useful alternative to CountVectorizer.
- Automatically Categorizing Yelp Businesses discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
- Modern Methods for Sentiment Analysis shows how "word vectors" can be used for more accurate sentiment analysis.
- Identifying Humorous Cartoon Captions is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
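For reference, here's the minimal text-processing pattern in scikit-learn (in older versions, use `get_feature_names()` instead of `get_feature_names_out()`):

```python
# Minimal text-processing sketch: raw text -> token-count matrix
from sklearn.feature_extraction.text import CountVectorizer

docs = ['cab ride to the airport',
        'machine learning at the airport',
        'free free free offer']

vect = CountVectorizer(stop_words='english')
dtm = vect.fit_transform(docs)        # sparse document-term matrix

print(vect.get_feature_names_out())   # the learned vocabulary
print(dtm.toarray())                  # word counts per document
```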
Additional Resources
- This GA notebook provides a shorter introduction to databases and SQL that helpfully contrasts SQL queries with Pandas syntax.
- SQLZOO, Mode Analytics, Khan Academy, Codecademy, Datamonkey, and Code School all have online beginner SQL tutorials that look promising.
- What Every Data Scientist Needs to Know about SQL is a brief series of posts about SQL basics, and Introduction to SQL for Data Scientists is a paper with similar goals.
- 10 Easy Steps to a Complete Understanding of SQL is a good article for those who have some SQL experience and want to understand it at a deeper level.
- SQLite's article on Query Planning explains how SQL queries "work".
- A Comparison Of Relational Database Management Systems gives the pros and cons of SQLite, MySQL, and PostgreSQL.
- If you want to go deeper into databases and SQL, Stanford has a well-respected series of 14 mini-courses.
- Blaze is a Python package enabling you to use Pandas-like syntax to query data living in a variety of data storage systems.
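Here's a minimal sketch contrasting a SQL query with the equivalent Pandas syntax, using Python's built-in sqlite3 module and a throwaway in-memory table:

```python
# Minimal SQL-vs-Pandas sketch with an in-memory SQLite database
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (customer TEXT, amount REAL)')
conn.executemany('INSERT INTO orders VALUES (?, ?)',
                 [('alice', 20.0), ('bob', 35.0), ('alice', 15.0)])

# SQL version
print(pd.read_sql('SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer', conn))

# Equivalent Pandas version
orders = pd.read_sql('SELECT * FROM orders', conn)
print(orders.groupby('customer')['amount'].sum())
```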
- Pursuing data science roles, Rocking data science interviews, and related Q&A - Dave Yerrington
- Advanced Topic: Imbalanced Classes (slides)
- Lab: Homework 4 solution walkthrough and Q&A
Additional Resources
- This post by Jason Brownlee provides an easy-to-understand overview of options for handling imbalanced classes.
- This answer on Quora goes into more detail.
- If you want to go really deep, read this extensive academic paper, Learning from Imbalanced Data.
- Paper on using Random Forests with Imbalanced Data.
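One of the simplest options discussed in those resources is class weighting; here's a minimal sketch on synthetic data:

```python
# Minimal sketch of class_weight='balanced' on synthetic imbalanced data
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
X = rng.randn(2000, 2)
y = ((X[:, 0] + X[:, 1] + rng.randn(2000)) > 2.5).astype(int)  # rare positives
print('actual positive rate:', y.mean())

for weights in [None, 'balanced']:
    clf = LogisticRegression(class_weight=weights).fit(X, y)
    # 'balanced' trades some accuracy for many more positive predictions
    print(weights, 'predicted positive rate:', clf.predict(X).mean())
```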
#### NOTE: The second part of this session will be a working session for course projects.
Resources:
- scikit-learn's machine learning map may help you to choose the "best" model for your task.
- Choosing a Machine Learning Classifier is a short and highly readable comparison of several classification models, Classifier comparison is scikit-learn's visualization of classifier decision boundaries, Comparing supervised learning algorithms is a model comparison table that I created, and Supervised learning superstitions cheat sheet is a more thorough comparison (with links to lots of useful resources).
- Machine Learning Done Wrong, Machine Learning Gremlins (31 minutes), Clever Methods of Overfitting, and Common Pitfalls in Machine Learning all offer thoughtful advice on how to avoid common mistakes in machine learning.
- Practical machine learning tricks from the KDD 2011 best industry paper and Andrew Ng's Advice for applying machine learning include slightly more advanced advice than the resources above.
- An Empirical Comparison of Supervised Learning Algorithms is a readable research paper from 2006, which was also presented as a talk (77 minutes).
Kaggle Resources:
- Browse the excellent solution paper from the winner of Kaggle's CrowdFlower competition for an example of the work and insight required to win a Kaggle competition.
- Interpretable vs Powerful Predictive Models: Why We Need Them Both is a short post on how the tactics useful in a Kaggle competition are not always useful in the real world.
- Project presentations!