Collection of end-to-end regression problems (in-depth: linear regression, logistic regression, Poisson regression) 📈

Generalized Linear Models (GLM)

In this repository I delve into three different types of regression.


📖 About

This is a collection of end-to-end regression problems. Topics are introduced theoretically in the 'Theoretical Overview' section below and tested practically in the notebooks.

First, I tested the theory on toy simulations: I built four simulations in Python using the sklearn and statsmodels libraries.

After that, I moved on to real-world data, developing three end-to-end projects (linear, logistic and Poisson regression).

Further details can be found in the 'Practical Examples' section below.

Note. A good dataset resource for linear/logistic/poisson regression, multinomial responses, survival data.
Note. To further explore feature selection: source 1, source 2, source 3, source 4, source 5.

📚 Theoretical Overview

A generalized linear model (GLM) is a flexible generalization of ordinary linear regression: it relates the linear model to the response variable through a link function. In a GLM, the outcome $\mathbf{Y}$ (dependent variable) is assumed to be generated from a distribution in the exponential family (e.g. Normal, Binomial, Poisson, Gamma). The mean $\boldsymbol{\mu}$ of the distribution depends on the independent variables $\mathbf{X}$ through the relation:

$$\mathbb{E}[\boldsymbol{Y}|\boldsymbol{X}] = \boldsymbol{\mu} = g^{-1}(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$$

where $\mathbb{E}[\boldsymbol{Y}|\boldsymbol{X}]$ is the expected value of $\boldsymbol{Y}$ conditional on $\boldsymbol{X}$, $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$ is the linear predictor and $g(\cdot)$ is the link function. The unknown parameters $\boldsymbol{\beta}$ are typically estimated by maximum likelihood, often computed with iteratively reweighted least squares (IRLS).
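To make the IRLS idea concrete, here is a minimal NumPy sketch for one specific case, a Poisson outcome with a log link. The function name `irls_poisson`, the simulated data and the coefficient values are all illustrative assumptions, not code from the notebooks.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Fit a Poisson GLM with log link by IRLS (illustrative sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta               # linear predictor X beta
        mu = np.exp(eta)             # mean via the inverse link
        W = mu                       # working weights for Poisson with log link
        z = eta + (y - mu) / mu      # working response
        XtW = X.T * W                # equivalent to X.T @ diag(W)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares step
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Simulated data with made-up coefficients
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = irls_poisson(X, y)
print(beta_hat)
```

Each iteration solves a weighted least squares problem, which is why the method carries that name. At convergence the estimate satisfies the Poisson score equation $\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{\mu}) = 0$.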

🟥 For the sake of clarity, from now on we consider the case of the scalar outcome, $Y$.

Every GLM consists of three elements:

  1. a distribution (from the exponential family) for modeling $Y$
  2. a linear predictor $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$
  3. a link function $g(\cdot)$ such that $\mathbb{E}[Y|\boldsymbol{X}] = \mu = g^{-1}(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$

The following are the most commonly used examples.

| Distribution | Support | Typical uses | $\mu=\mathbb{E}[Y \mid \boldsymbol{X}]$ | Link function $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = g(\mu)$ | Link name | Mean function |
|---|---|---|---|---|---|---|
| Normal $(\mu,\sigma^2)$ | $(-\infty, \infty)$ | Linear-response data | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \mu$ | Identity | $\mu = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$ |
| Gamma $(\mu,\nu)$ | $(0,\infty)$ | Exponential-response data | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = -\mu^{-1}$ | Negative inverse | $\mu = -(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})^{-1}$ |
| Inverse-Gaussian $(\mu,\sigma^2)$ | $(0, \infty)$ | | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \mu^{-2}$ | Inverse squared | $\mu = (\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})^{-1/2}$ |
| Poisson $(\mu)$ | $\{0, 1, 2, \dots\}$ | Count of occurrences in a fixed amount of time/space | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \ln(\mu)$ | Log | $\mu = \exp(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$ |
| Bernoulli $(\mu)$ | $\{0, 1\}$ | Outcome of single yes/no occurrence | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \ln(\frac{\mu}{1-\mu})$ | Logit | $\mu = \frac{1}{1+\exp(-\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})}$ |
| Binomial $(n, \mu)$ | $\{0, 1, \dots, n\}$ | Count of yes/no in $n$ occurrences | $n\hspace{1pt}\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \ln(\frac{\mu}{1-\mu})$ | Logit | $\mu = \frac{1}{1+\exp(-\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})}$ |
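The link and mean functions listed above are inverses of each other, which can be checked numerically. The following is a small NumPy sketch. The dictionary `links` and the test values in `mu` are illustrative choices, with the values kept in $(0, 1)$ so that every link is defined.

```python
import numpy as np

# (link g, mean function g^{-1}) pairs, keyed by the link names used above
links = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "negative inverse": (lambda mu: -1.0 / mu, lambda eta: -1.0 / eta),
    "inverse squared": (lambda mu: mu ** -2.0, lambda eta: eta ** -0.5),
    "log": (np.log, np.exp),
    "logit": (lambda mu: np.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + np.exp(-eta))),
}

mu = np.array([0.1, 0.25, 0.5, 0.9])   # values valid for all five links

# Largest absolute error of g^{-1}(g(mu)) - mu for each link
round_trip = {
    name: float(np.max(np.abs(g_inv(g(mu)) - mu)))
    for name, (g, g_inv) in links.items()
}
print(round_trip)
```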

📂 Practical Examples

As already mentioned, let $Y$ be the outcome (dependent variable) and $\mathbf{X}$ be the independent variables. The three types of regression I analyzed (Linear, Logistic and Poisson) differ in the nature of $Y$. For each type, I collected an ad-hoc dataset to experiment with.

📑 Linear Regression

In the case of linear regression $Y$ is a real number and it is modeled as:

$$\begin{cases} \hspace{4pt} Y\sim N(\mu,\sigma^2)\\ \hspace{4pt} \mu = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} \end{cases}$$
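A minimal sketch of this model on simulated data, assuming made-up coefficients and noise level, fitted with scikit-learn and cross-checked against the plain least-squares solution from NumPy:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 2))
beta_true = np.array([1.5, -2.0])        # made-up slopes
intercept_true = 0.7                     # made-up intercept
sigma = 0.5                              # noise standard deviation
y = intercept_true + X @ beta_true + rng.normal(scale=sigma, size=n)

ols = LinearRegression().fit(X, y)       # fits an intercept by default

# Same estimate from the least-squares solution on [1, X]
X1 = np.column_stack([np.ones(n), X])
beta_ls, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(ols.intercept_, ols.coef_)
print(beta_ls)
```

With a Normal outcome and identity link, maximum likelihood for $\boldsymbol{\beta}$ reduces to ordinary least squares, so the two estimates should agree up to numerical precision.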

As a case study for linear regression, I analyzed a dataset of human brain weights.

📑 Logistic Regression

In the case of logistic regression $Y$ is a binary value ($0$ or $1$) and it is modeled as:

$$\begin{cases} \hspace{4pt} Y \sim Bernoulli(\mu)\\ \hspace{4pt} \log(\frac{\mu}{1-\mu}) = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} \end{cases}$$

As a case study for logistic regression, I analyzed an HR dataset.

For Advanced Classification techniques with Scikit-Learn check out Breast Cancer: End-to-End Machine Learning Project.

📑 Poisson Regression

In the case of Poisson regression $Y$ is a non-negative integer (count) and it is modeled as:

$$\begin{cases} \hspace{4pt} Y \sim Poisson(\mu)\\ \hspace{4pt}\log(\mu) = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} \end{cases}$$
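A minimal sketch of this model with scikit-learn's `PoissonRegressor`, on simulated counts with made-up coefficients. Setting `alpha=0.0` switches off the L2 penalty that the estimator applies by default, so the fit is plain maximum likelihood.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(3)
n = 4000
X = rng.normal(size=(n, 1))
intercept_true, slope_true = 0.3, 0.6           # made-up coefficients
y = rng.poisson(np.exp(intercept_true + slope_true * X[:, 0]))

# alpha=0.0 removes the default L2 regularization
pois = PoissonRegressor(alpha=0.0, max_iter=300).fit(X, y)
mu_hat = pois.predict(X)                        # predicted mean counts exp(X beta)
print(pois.intercept_, pois.coef_)
```

Because of the log link, the predicted means are always positive, even though the linear predictor can take any real value.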

As a case study for Poisson regression, I analyzed a dataset of smoking and lung cancer.

⚖️ Python sklearn vs. statsmodels

What libraries should be used? In general, scikit-learn is designed for machine learning, while statsmodels is made for rigorous statistics. Both have their uses, so before selecting one it is best to consider the purpose of the model: a model designed for prediction is best fit with scikit-learn, while statsmodels is best suited to explanatory models. Completely disregarding either one would do a great disservice to an excellent Python library.

To summarize some key differences:

  • OLS efficiency: scikit-learn is faster at linear regression, and the difference is more apparent for larger datasets
  • Logistic regression efficiency: when only a single core is used, statsmodels is faster at logistic regression
  • Visualization: statsmodels provides a summary table
  • Solvers/methods: in general, statsmodels provides a greater variety
  • Logistic regression: scikit-learn regularizes by default, while statsmodels does not
  • Additional linear models: scikit-learn provides more regularized models, while statsmodels helps correct for violated OLS assumptions