\documentclass[twoside]{article}
\setlength{\oddsidemargin}{0.25 in}
\setlength{\evensidemargin}{-0.25 in}
\setlength{\topmargin}{-0.6 in}
\setlength{\textwidth}{6.5 in}
\setlength{\textheight}{8.5 in}
\setlength{\headsep}{0.75 in}
\setlength{\parindent}{0 in}
\setlength{\parskip}{0.1 in}
\newcommand{\eqdef}{:\mathrel{\mathop=}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
%
% ADD PACKAGES here:
%
\usepackage{amsmath,amsfonts,graphicx,dsfont,amssymb, cool, cancel}
%
% The following commands set up the lecnum (lecture number)
% counter and make various numbering schemes work relative
% to the lecture number.
%
\newcounter{lecnum}
\renewcommand{\thepage}{\thelecnum-\arabic{page}}
\renewcommand{\thesection}{\thelecnum.\arabic{section}}
\renewcommand{\theequation}{\thelecnum.\arabic{equation}}
\renewcommand{\thefigure}{\thelecnum.\arabic{figure}}
\renewcommand{\thetable}{\thelecnum.\arabic{table}}
\newcommand{\indep}{\raisebox{0.05em}{\rotatebox[origin=c]{90}{$\models$}}}
%
% The following macro is used to generate the header.
%
\newcommand{\lecture}[4]{
\pagestyle{myheadings}
\thispagestyle{plain}
\newpage
\setcounter{lecnum}{#1}
\setcounter{page}{1}
\noindent
\begin{center}
\framebox{
\vbox{\vspace{2mm}
\hbox to 6.28in { {\bf Advanced Machine Learning
\hfill Fall 2020} }
\vspace{4mm}
\hbox to 6.28in { {\Large \hfill Lecture #1: #2 \hfill} }
\vspace{2mm}
\hbox to 6.28in { {\it #3 \hfill #4} }
\vspace{2mm}}
}
\end{center}
\markboth{Lecture #1: #2}{Lecture #1: #2}
{\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.}
{\bf Disclaimer}: {\it These notes are adapted from ETH's Advanced Machine Learning course and the book ``All of Statistics'' by Larry Wasserman (Springer).}
\vspace*{4mm}
}
%
% Convention for citations is authors' initials followed by the year.
% For example, to cite a paper by Leighton and Maggs you would type
% \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}.
% (To avoid bibliography problems, for now we redefine the \cite command.)
% Also commands that create a suitable format for the reference list.
\renewcommand{\cite}[1]{[#1]}
\def\beginrefs{\begin{list}%
{[\arabic{equation}]}{\usecounter{equation}
\setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}%
\setlength{\labelwidth}{1.6truecm}}}
\def\endrefs{\end{list}}
\def\bibentry#1{\item[\hbox{[#1]}]}
%Use this command for a figure; it puts a figure in wherever you want it.
%usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION}
\newcommand{\fig}[3]{
\vspace{#2}
\begin{center}
Figure \thelecnum.#1:~#3
\end{center}
}
% Use these for theorems, lemmas, proofs, etc.
\newtheorem{theorem}{Theorem}[lecnum]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{example}{Example}[lecnum]
\newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}}
% **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE:
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\begin{document}
%FILL IN THE RIGHT INFO.
%\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**}
\lecture{3}{Density Estimation}{}{}
%\footnotetext{These notes are partially based on those of Nigel Mansell.}
% **** YOUR NOTES GO HERE:
% Some general latex examples and examples making use of the
% macros follow.
%**** IN GENERAL, BE BRIEF. LONG SCRIBE NOTES, NO MATTER HOW WELL WRITTEN,
%**** ARE NEVER READ BY ANYBODY.
\section{Parametric Inference}
We now turn our attention to parametric models, that is, models of the form:
\begin{equation*}
\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}
\end{equation*}
where $\Theta \subset \mathbb{R}^k$ is the parameter space and $\theta = (\theta_1,...,\theta_k)$ is the parameter. The problem of inference then reduces to the problem of estimating the parameter $\theta$. \\
Often, we are only interested in some function $T(\theta)$. For example, if $X \sim \mathcal{N}(\mu, \sigma^2)$ then the parameter is $\theta = (\mu, \sigma)$. If our goal is to estimate $\mu$ then $\mu = T(\theta)$ is called the parameter of interest and $\sigma$ is called a nuisance parameter.
\subsection{Maximum Likelihood}
The most common method for estimating parameters in a parametric model is the maximum likelihood method. Let $X_1,...,X_n$ be IID with pdf $f(x; \theta)$.
\begin{definition}
The \textbf{likelihood function} is defined by:
\begin{equation*}
\mathcal{L}_n(\theta) = \prod\limits_{i = 1}^{n}f(X_i; \theta)
\end{equation*}
The \textbf{log-likelihood function} is defined by $\ell_n(\theta) = \log\mathcal{L}_n(\theta)$.
\end{definition}
The likelihood function is just the joint density of the data, except that we treat it as a function of the parameter $\theta$. Thus, $\mathcal{L}_n(\theta) : \Theta \to [0, \infty)$. The likelihood function is not a density function: in general, it is not true that $\mathcal{L}_n(\theta)$ integrates to 1 (with respect to $\theta$).
\begin{definition}
The \textbf{maximum likelihood estimator} (MLE), denoted by $\hat{\theta}_n$, is the value of $\theta$ that maximizes $\mathcal{L}_n(\theta)$.
\end{definition}
The maximum of $\ell_n(\theta)$ occurs at the same place as the maximum of $\mathcal{L}_n(\theta)$, so maximizing the log-likelihood leads to the same result as maximizing the likelihood. Often, it is easier to work with the log-likelihood.
\begin{claim}
If we multiply $\mathcal{L}_n(\theta)$ by any positive constant $c$ (not depending on $\theta$) then this will not change the MLE. Hence, we shall often drop constants in the likelihood function.
\end{claim}
\begin{example}
Let $X_1,...,X_n \sim \mathcal{N}(\mu, \sigma^2)$. The parameter is $\theta = (\mu, \sigma)$ and the likelihood function (ignoring some constants) is:
\begin{equation*}
\begin{aligned}
\mathcal{L}_n(\mu, \sigma) &= \prod\limits_{i}\frac{1}{\sigma}\exp{\{-\frac{1}{2\sigma^2}(X_i - \mu)^2\}}\\
&= \sigma^{-n}\exp{\{-\frac{1}{2\sigma^2}\sum\limits_i(X_i - \mu)^2\}}\\
&= \sigma^{-n}\exp{\{-\frac{nS^2}{2\sigma^2}\}}\exp{\{-\frac{n(\bar{X} - \mu)^2}{2\sigma^2}\}}
\end{aligned}
\end{equation*}
where $\bar{X} = n^{-1}\sum\limits_i X_i$ is the sample mean and $S^2 = n^{-1}\sum\limits_i(X_i - \bar{X})^2$. The
last equality above follows from the fact that $\sum\limits_i(X_i - \mu)^2 = nS^2 + n(\bar{X} - \mu)^2$
which can be verified by writing $\sum\limits_i(X_i - \mu)^2 = \sum\limits_i(X_i - \bar{X} + \bar{X} -\mu)^2$ and then expanding the square. The log-likelihood is:
\begin{equation*}
\ell(\mu, \sigma) = -n\log\sigma - \frac{nS^2}{2\sigma^2} - \frac{n(\bar{X} - \mu)^2}{2\sigma^2}
\end{equation*}
Solving the equations:
\begin{equation*}
\frac{\partial \ell(\mu, \sigma)}{\partial\mu} = 0 \hspace{10pt} \text{and} \hspace{10pt} \frac{\partial \ell(\mu, \sigma)}{\partial\sigma} = 0,
\end{equation*}
we conclude that $\hat{\mu} = \bar{X}$ and $\hat{\sigma} = S$. It can be verified that these are indeed global maxima of the likelihood.
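For completeness, the two stationarity conditions can be written out explicitly (a routine computation from the log-likelihood above):
$$\frac{\partial \ell(\mu, \sigma)}{\partial\mu} = \frac{n(\bar{X} - \mu)}{\sigma^2} = 0 \implies \hat{\mu} = \bar{X}, \qquad
\frac{\partial \ell(\mu, \sigma)}{\partial\sigma}\Big|_{\mu = \bar{X}} = -\frac{n}{\sigma} + \frac{nS^2}{\sigma^3} = 0 \implies \hat{\sigma} = S.$$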
\end{example}
\subsection{Properties of Maximum Likelihood Estimators}
Under certain conditions on the model, the maximum likelihood estimator $\hat{\theta}_n$ possesses many properties that make it an appealing choice of estimator. The main properties of the MLE are:
\begin{enumerate}
\item The MLE is \textbf{consistent}: $\hat{\theta}_n \overset{P}{\to} \theta^*$ where $\theta^*$ denotes the true value of the parameter $\theta$;
\item The MLE is \textbf{equivariant}: if $\hat{\theta}_n$ is the MLE of $\theta$ then $g(\hat{\theta}_n)$ is the MLE of $g(\theta)$;
\item The MLE is \textbf{asymptotically Normal}: $(\hat{\theta}_n - \theta^*)/\hat{\text{se}} \rightsquigarrow \mathcal{N}(0, 1)$; also, the
estimated standard error $\hat{\text{se}}$ can often be computed analytically (see the illustration after this list);
\item The MLE is \textbf{asymptotically optimal} or \textbf{efficient}: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance, at least for large samples; that is, $\hat{\theta}_n$ asymptotically attains the smallest achievable $\mathbb{E}\big[ (\hat{\theta}_n -\theta^*)^2\big]$ as $n \to \infty$;
\item The MLE is approximately the Bayes estimator.
\end{enumerate}
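As a brief illustration of properties 2 and 3 (continuing the Normal example above; the standard error used here is the usual $\hat{\sigma}/\sqrt{n}$, stated without derivation): since the MLE of $\sigma$ is $S$, equivariance gives $S^2$ as the MLE of the variance $\sigma^2$; and asymptotic Normality for the mean reads
$$\frac{\bar{X} - \mu^*}{\hat{\text{se}}} \rightsquigarrow \mathcal{N}(0,1), \qquad \hat{\text{se}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{S}{\sqrt{n}},$$
which is the usual basis for approximate confidence intervals for $\mu$.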
The properties we discuss only hold if the model satisfies certain regularity conditions; these are essentially smoothness conditions on $f(x; \theta)$. Unless otherwise stated, we shall tacitly assume that these conditions hold.
\subsection{Understanding Asymptotic Efficiency}
The expected squared error quantifies how good an estimator $\hat{\theta}$ is:
$$\mathbb{E}\big[ (\hat{\theta}- \theta_0)^2\big]$$
The Cram\'er-Rao bound shows that no estimator can achieve $\mathbb{E}\big[ (\hat{\theta}- \theta_0)^2\big] = 0$: it gives a lower bound on the expected squared error of any estimator.
\begin{theorem}[Cram\'er-Rao bound] For any estimator $\hat{\theta}$ of $\theta$ it holds that:
$$\mathbb{E}_{x|\theta}\big[ (\hat{\theta} - \theta)^2\big] \geq \dfrac{\big(\dfrac{\partial}{\partial \theta} b_{\hat{\theta}} + 1 \big)^2}{\mathbb{E}_{x|\theta}\big[ \Lambda^2\big]} + b_{\hat{\theta}}^2$$
where $\Lambda$ is the score function and $b_{\hat{\theta}}$ is the bias of $\hat{\theta}$:
$$\Lambda = \dfrac{\partial}{\partial{\theta}} \log{p(x|\theta)}= \dfrac{1}{p(x|\theta)} \dfrac{\partial}{\partial{\theta}} p(x|\theta) \hspace{10pt} \text{and} \hspace{10pt} b_{\hat{\theta}} = \mathbb{E}_{x|\theta}[\hat{\theta}]-\theta$$
\end{theorem}
\begin{proof}
$$\mathbb{E}_{x|\theta}[\Lambda] = \int_{x} p(x|\theta) \Lambda \; dx= \int_{x} \dfrac{\partial}{\partial{\theta}} p(x|\theta) dx = \dfrac{\partial}{\partial{\theta}}\overbrace{\int_{x} p(x|\theta)dx}^{=1} = 0$$
$$\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}]= \int_{x} p(x|\theta) \Lambda \hat{\theta} \; dx = \int_{x} \dfrac{\partial}{\partial{\theta}} p(x|\theta) \hat{\theta} \; dx
= \dfrac{\partial}{\partial{\theta}} \int_{x} p(x|\theta) \hat{\theta} \; dx = \dfrac{\partial}{\partial{\theta}} \mathbb{E}_{x|\theta}[\hat{\theta}] = \dfrac{\partial}{\partial{\theta}} (\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta) + 1 =\dfrac{\partial}{\partial{\theta}} b_{\hat{\theta}}+1 $$
Consider the covariance between $\Lambda$ and $\hat{\theta}$:
$$\Big(\mathbb{E}_{x|\theta}\big[ (\Lambda - \overbrace{\mathbb{E}_{x|\theta}[\Lambda]}^{=0}) (\hat{\theta} - \mathbb{E}_{x|\theta}[\hat{\theta}])\big] \Big)^2 = \Big(\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}] - \mathbb{E}_{x|\theta}\big[ \Lambda\, \mathbb{E}_{x|\theta}[\hat{\theta}]\big] \Big)^2 = \Big(\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}] - \overbrace{\mathbb{E}_{x|\theta}[\Lambda]}^{=0}\, \mathbb{E}_{x|\theta}[\hat{\theta}] \Big)^2 = \Big(\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}] \Big)^2$$
Now apply the Cauchy-Schwarz inequality, i.e.\ $\big(\mathbb{E}[XY]\big)^2 \leq \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$, to this covariance:
$$\Big(\mathbb{E}_{x|\theta}\big[ (\Lambda - \overbrace{\mathbb{E}_{x|\theta}[\Lambda]}^{=0}) (\hat{\theta} - \mathbb{E}_{x|\theta}[\hat{\theta}])\big] \Big)^2 \leq \mathbb{E}_{x|\theta}[\Lambda^2]\; \mathbb{E}_{x|\theta}\big[(\hat{\theta} - \mathbb{E}_{x|\theta}[\hat{\theta}])^2\big] = \mathbb{E}_{x|\theta}[\Lambda^2]\; \mathbb{E}_{x|\theta}\big[\big((\hat{\theta} - \theta) -(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)\big)^2\big] $$
$$ = \mathbb{E}_{x|\theta}[\Lambda^2]\; \mathbb{E}_{x|\theta}\big[(\hat{\theta} - \theta)^2 + (\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)^2 -2 (\hat{\theta}-\theta)(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)\big]$$ $$ = \mathbb{E}_{x|\theta}[\Lambda^2]\; \big \{\mathbb{E}_{x|\theta}\big[(\hat{\theta} - \theta)^2 \big] + \overbrace{\mathbb{E}_{x|\theta} \big[(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)^2 -2 (\hat{\theta}-\theta)(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)\big]}^{-b_{\hat{\theta}}^2} \big \} = \mathbb{E}_{x|\theta}[\Lambda^2]\; \big\{\mathbb{E}_{x|\theta}\big[(\hat{\theta} - \theta)^2\big]- b_{\hat{\theta}}^2 \big\}$$
It's easy to verify that $\mathbb{E}_{x|\theta} \big[(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)^2 -2 (\hat{\theta}-\theta)(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)\big] = -b_{\hat{\theta}}^2$:
$$\mathbb{E}_{x|\theta}\Big[ \mathbb{E}_{x|\theta}^2[\hat{\theta}] + \theta^2 \cancel{-2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}]} -2 \hat{\theta} \mathbb{E}_{x|\theta}[\hat{\theta}] +2\hat{\theta}\theta + \cancel{2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}]} - 2 \theta^2
\Big] $$ $$=\mathbb{E}^2_{x|\theta}[\hat{\theta}] + \mathbb{E}_{x|\theta}[\theta^2] -2 \mathbb{E}^2_{x|\theta}[\hat{\theta}] +2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}] -2 \mathbb{E}_{x|\theta}[\theta^2]$$$$ = - \mathbb{E}^2_{x|\theta}[\hat{\theta}] - \mathbb{E}_{x|\theta}[\theta^2] +2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}] = - \mathbb{E}^2_{x|\theta}[\hat{\theta}] -\theta^2 +2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}] = - \big( \mathbb{E}_{x|\theta}[\hat{\theta}] - \theta \big )^2 = - b_{\hat{\theta}}^2 $$
Finally, combining the identity $\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}] = \dfrac{\partial}{\partial{\theta}} b_{\hat{\theta}}+1$ derived above with the Cauchy-Schwarz bound, we obtain:
$$\big(\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}]\big)^2 = \big( \dfrac{\partial}{\partial{\theta}} b_{\hat{\theta}}+1\big)^2 \leq \mathbb{E}_{x|\theta}[\Lambda^2]\; \big\{\mathbb{E}_{x|\theta}\big[(\hat{\theta} - \theta)^2\big] - b_{\hat{\theta}}^2\big\}$$
It follows that:
$$\mathbb{E}_{x|\theta}\big[ (\hat{\theta} - \theta)^2\big] \geq \dfrac{\big(\dfrac{\partial}{\partial \theta} b_{\hat{\theta}} + 1 \big)^2}{\mathbb{E}_{x|\theta}\big[ \Lambda^2\big]} + b_{\hat{\theta}}^2$$
\end{proof}
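To see what the bound gives in the simplest case (a standard specialization, stated here only for intuition): for an unbiased estimator, $b_{\hat{\theta}} = 0$, and the bound reduces to
$$\mathbb{E}_{x|\theta}\big[ (\hat{\theta} - \theta)^2\big] \geq \dfrac{1}{\mathbb{E}_{x|\theta}\big[ \Lambda^2\big]},$$
the reciprocal of the Fisher information. For instance, for $n$ i.i.d.\ samples from $\mathcal{N}(\theta, \sigma^2)$ one finds $\mathbb{E}_{x|\theta}[\Lambda^2] = n/\sigma^2$, so no unbiased estimator of the mean can have expected squared error below $\sigma^2/n$, which is exactly the variance of $\bar{X}$.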
\subsection{Stein Estimator}
For finite samples, the maximum likelihood estimator is not necessarily efficient.\\ Consider a multivariate random variable $y \sim \mathcal{N}(\theta_0,\sigma^2I)$ taking values in $\mathbb{R}^d$ with $d \geq 3$.
If we observe a single sample $y$ from this distribution, the maximum likelihood estimate of $\theta_0$ is simply $\hat{\theta}_{ML} = y$, while the (James-)Stein estimator is:
$$\hat{\theta}_{JS} := \Big(1 - \dfrac{(d-2)\sigma^2}{||y||^2} \Big) y$$
It is possible to prove that the Stein estimator is at least as good as the maximum likelihood estimator for every $\theta_0$. That is:
$$
\mathbb{E}\Big[ \norm{\hat{\theta}_{JS}- \theta_0}^2\Big] \leq \mathbb{E}\Big[ \norm{\hat{\theta}_{ML}- \theta_0}^2\Big] \; \; \text{for every} \; \theta_0
$$
Moreover, the inequality is strict for some values of $\theta_0$.
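For orientation (a standard computation, not part of the argument above): since $\hat{\theta}_{ML} = y$ and $y \sim \mathcal{N}(\theta_0, \sigma^2 I)$, the risk of the maximum likelihood estimator is the same for every $\theta_0$:
$$\mathbb{E}\Big[ \norm{\hat{\theta}_{ML} - \theta_0}^2\Big] = \sum_{j=1}^{d} \mathbb{E}\big[ (y_j - \theta_{0,j})^2\big] = d\,\sigma^2,$$
so the statement above says that shrinking $y$ towards the origin never increases, and sometimes strictly decreases, this risk when $d \geq 3$.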
\section{Bayesian Learning}
Bayesian inference is usually carried out in the following way:
\begin{itemize}
\item $\theta$ is considered to be a \textbf{random variable} with distribution $p(\theta|\mathcal{X})$.
\item $X \sim p(x)$ and $p(x)$ is unknown.
\item $p(x|\theta)$ is a statistical model that reflects our beliefs about $x$ given $\theta$.
\end{itemize}
We are looking for $p(X=x|\mathcal{X})$, i.e., the probability of $x$ given the sample set $\mathcal{X}$ (class conditional density):
$$p(X=x|\mathcal{X}) = \int \underbrace{p(x,\theta|\mathcal{X})}_{p(x|\theta,\mathcal{X})p(\theta|\mathcal{X})}d\theta =
\int p(x|\theta,\mathcal{X})p(\theta|\mathcal{X})d\theta = \int p(x|\theta)p(\theta|\mathcal{X})d\theta $$
where $p(x|\theta,\mathcal{X})= p(x|\theta)$, since given $\theta$ the new point $x$ is independent of the samples $x_i \in \mathcal{X}$ (the data are i.i.d.).\\\\
Moreover, asymptotically the posterior concentrates: $p(\theta| \mathcal{X}) \to \delta(\theta- \hat{\theta})$; intuitively, this follows from the fact that $\hat{\theta} \overset{p}{\to} \theta_{\text{true}}$. Thus, in the asymptotic case, we can approximate the integral with:
$$p(X=x|\mathcal{X}) = \int p(x|\theta)p(\theta|\mathcal{X})d\theta \approx
\int p(x|\theta) \delta(\theta- \hat{\theta})d\theta = p(x|\hat{\theta}) $$
This approximation was used in the early days of Bayesian inference, when it was not computationally feasible to evaluate the integral.
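A concrete instance of this concentration appears in the next subsection: for the Gaussian model treated there, the posterior variance is
$$\sigma_n^2 = \dfrac{\sigma^2 \sigma_0^2}{n\sigma_0^2+\sigma^2} \xrightarrow{\; n \to \infty \;} 0,$$
so the posterior $p(\mu|\mathcal{X})$ indeed collapses onto the maximum likelihood estimate.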
\newpage
\subsection{Bayesian Learning of a Normal Distribution}
Let us begin with a simple example in which we consider a single Gaussian random variable $x$. We shall suppose that the variance $\sigma^2$ is known, and we consider the task of inferring the mean $\mu$ given a set of $n$ observations:
\begin{itemize}
\item The likelihood is $p(x|\mu)= \mathcal{N}(\mu,\sigma^2)$
\item The prior is $p(\mu) = \mathcal{N}(\mu_0,\sigma_0^2)$
\item The data is $\mathcal{X} = \{x_1,\ldots,x_n\}$
\end{itemize}
We want to compute the posterior distribution $p(\mu|\mathcal{X})$:
$$p(\mu|\mathcal{X}) \propto p(\mathcal{X}|\mu)p(\mu) \implies p(\mu|\mathcal{X}) = \alpha \; p(\mathcal{X}|\mu)p(\mu)
= \alpha \cdot \prod_{i \leq n} \Big\{ \dfrac{1}{\sqrt{2\pi}\, \sigma} \exp{\Big(-\dfrac{1}{2} \big(\dfrac{x_i-\mu}{\sigma}\big)^2 \Big)}\Big\} \cdot \dfrac{1}{\sqrt{2\pi}\, \sigma_0} \exp{\Big(-\dfrac{1}{2} \big(\dfrac{\mu-\mu_0}{\sigma_0}\big)^2 \Big)}$$
$$=\alpha' \cdot \prod_{i \leq n} \Big\{ \exp{\Big(-\dfrac{1}{2} \big(\dfrac{x_i-\mu}{\sigma}\big)^2 \Big)}\Big\} \cdot \exp{\Big(-\dfrac{1}{2} \big(\dfrac{\mu-\mu_0}{\sigma_0}\big)^2 \Big)} = \alpha' \cdot \exp{\Big\{-\dfrac{1}{2} \Big( \sum_{i \leq n} \Big(\dfrac{x_i-\mu}{\sigma}\Big)^2 + \Big(\dfrac{\mu-\mu_0}{\sigma_0}\Big)^2 \Big)\Big\}}$$
Expanding the squares, we get:
$$p(\mu|\mathcal{X})= \alpha'\cdot \exp{\Big\{-\dfrac{1}{2}\Big( \mu^2 \overbrace{\big( \dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}\big)}^{a} -2\mu \overbrace{\big(\dfrac{\mu_0}{\sigma_0^2} + \dfrac{1}{\sigma^2}\sum_{i \leq n}x_i \big)}^{b} + c \Big)\Big\}} $$
This is the kernel of a Gaussian distribution, i.e. $\mu \,|\, \mathcal{X} \sim \mathcal{N}(\mu_n,\sigma_n^2)$, because the exponent is quadratic in $\mu$. Furthermore, by completing the square we obtain:
$$\mu_n = \dfrac{b}{a} = \dfrac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2} \hat{\mu}_n + \dfrac{ \sigma^2}{n \sigma_0^2 + \sigma^2} \mu_0$$
$$\sigma_n^2 = \dfrac{1}{a} = \dfrac{\sigma^2 \sigma_0^2}{n\sigma_0^2+\sigma^2}$$
where $\hat{\mu}_n = \frac{1}{n}\sum_{i \leq n} x_i$ is the sample mean, i.e. the maximum likelihood estimate of $\mu$.
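Equivalently (a standard restatement, obtained simply by taking the reciprocal of $\sigma_n^2 = 1/a$), the posterior precision is the sum of the prior precision and $n$ times the per-observation precision of the likelihood:
$$\dfrac{1}{\sigma_n^2} = \dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}$$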
It is worth spending a moment studying the form of the posterior mean and variance. First of all, note that the mean of the posterior is a compromise between the prior mean $\mu_0$ and the maximum likelihood solution $\hat{\mu}_n$. If the number of observed data points $n=0$, then $\mu_n$ reduces to the prior mean, as expected. For $n \to \infty$, $\mu_n$ is given by the maximum likelihood solution. \\ Similarly, consider the result for the variance of the posterior distribution $\sigma_n^2$. With no observed data points, we have the prior variance, whereas if the number of data points $n \to \infty$, the variance goes to zero and the posterior distribution becomes infinitely peaked around the maximum likelihood solution.\\
We therefore see that the maximum likelihood result of a point estimate for $\mu$ is recovered precisely from the Bayesian formalism in the limit of an infinite number of observations.
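As a quick numerical sanity check (illustrative values, not taken from the notes): let $\mu_0 = 0$, $\sigma_0^2 = \sigma^2 = 1$, $n = 4$ and $\hat{\mu}_n = 2$. Then
$$\mu_n = \dfrac{4}{4+1}\cdot 2 + \dfrac{1}{4+1}\cdot 0 = 1.6, \qquad \sigma_n^2 = \dfrac{1 \cdot 1}{4 + 1} = 0.2,$$
so after only four observations the posterior mean already lies much closer to the data than to the prior, and the posterior variance has shrunk from $\sigma_0^2 = 1$ to $0.2$.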
\end{document}