\documentclass[twoside]{article}
\setlength{\oddsidemargin}{0.25 in}
\setlength{\evensidemargin}{-0.25 in}
\setlength{\topmargin}{-0.6 in}
\setlength{\textwidth}{6.5 in}
\setlength{\textheight}{8.5 in}
\setlength{\headsep}{0.75 in}
\setlength{\parindent}{0 in}
\setlength{\parskip}{0.1 in}
\newcommand{\eqdef}{:\mathrel{\mathop=}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
%
% ADD PACKAGES here:
%
\usepackage{amsmath,amsfonts,graphicx,dsfont,amssymb, cool, cancel}
%
% The following commands set up the lecnum (lecture number)
% counter and make various numbering schemes work relative
% to the lecture number.
%
\newcounter{lecnum}
\renewcommand{\thepage}{\thelecnum-\arabic{page}}
\renewcommand{\thesection}{\thelecnum.\arabic{section}}
\renewcommand{\theequation}{\thelecnum.\arabic{equation}}
\renewcommand{\thefigure}{\thelecnum.\arabic{figure}}
\renewcommand{\thetable}{\thelecnum.\arabic{table}}
\newcommand{\indep}{\raisebox{0.05em}{\rotatebox[origin=c]{90}{$\models$}}}
%
% The following macro is used to generate the header.
%
\newcommand{\lecture}[4]{
\pagestyle{myheadings}
\thispagestyle{plain}
\newpage
\setcounter{lecnum}{#1}
\setcounter{page}{1}
\noindent
\begin{center}
\framebox{
\vbox{\vspace{2mm}
\hbox to 6.28in { {\bf Advanced Machine Learning
\hfill Fall 2020} }
\vspace{4mm}
\hbox to 6.28in { {\Large \hfill Lecture #1: #2 \hfill} }
\vspace{2mm}
\hbox to 6.28in { {\it #3 \hfill #4} }
\vspace{2mm}}
}
\end{center}
\markboth{Lecture #1: #2}{Lecture #1: #2}
{\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.}
{\bf Disclaimer}: {\it These notes are adapted from ETH's Advanced Machine Learning course and the book ``All of Statistics'' by Larry Wasserman (Springer).}
\vspace*{4mm}
}
%
% Convention for citations is authors' initials followed by the year.
% For example, to cite a paper by Leighton and Maggs you would type
% \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}.
% (To avoid bibliography problems, for now we redefine the \cite command.)
% Also commands that create a suitable format for the reference list.
\renewcommand{\cite}[1]{[#1]}
\def\beginrefs{\begin{list}%
{[\arabic{equation}]}{\usecounter{equation}
\setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}%
\setlength{\labelwidth}{1.6truecm}}}
\def\endrefs{\end{list}}
\def\bibentry#1{\item[\hbox{[#1]}]}
%Use this command for a figure; it puts a figure in wherever you want it.
%usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION}
\newcommand{\fig}[3]{
\vspace{#2}
\begin{center}
Figure \thelecnum.#1:~#3
\end{center}
}
% Use these for theorems, lemmas, proofs, etc.
\newtheorem{theorem}{Theorem}[lecnum]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{example}{Example}[lecnum]
\newenvironment{proof}{{\bf Proof:}}{\hfill\rule{2mm}{2mm}}
% **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE:
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\begin{document}
%FILL IN THE RIGHT INFO.
%\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**}
\lecture{3}{Density Estimation}{}{}
%\footnotetext{These notes are partially based on those of Nigel Mansell.}
% **** YOUR NOTES GO HERE:
% Some general latex examples and examples making use of the
% macros follow.
%**** IN GENERAL, BE BRIEF. LONG SCRIBE NOTES, NO MATTER HOW WELL WRITTEN,
%**** ARE NEVER READ BY ANYBODY.
\section{Parametric Inference}
We now turn our attention to parametric models, that is, models of the form:
\begin{equation*}
\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}
\end{equation*}
where $\Theta \subset \mathbb{R}^k$ is the parameter space and $\theta = (\theta_1,...,\theta_k)$ is the parameter. The problem of inference then reduces to the problem of estimating the parameter $\theta$. \\
Often, we are only interested in some function $T(\theta)$. For example, if $X \sim \mathcal{N}(\mu, \sigma^2)$ then the parameter is $\theta = (\mu, \sigma)$. If our goal is to estimate $\mu$ then $\mu = T(\theta)$ is called the parameter of interest and $\sigma$ is called a nuisance parameter.
\subsection{Maximum Likelihood}
The most common method for estimating parameters in a parametric model is the maximum likelihood method. Let $X_1,...,X_n$ be IID with pdf $f(x; \theta)$.
\begin{definition}
The \textbf{likelihood function} is defined by:
\begin{equation*}
\mathcal{L}_n(\theta) = \prod\limits_{i = 1}^{n}f(X_i; \theta)
\end{equation*}
The \textbf{log-likelihood function} is defined by $\ell_n(\theta) = \log\mathcal{L}_n(\theta)$.
\end{definition}
The likelihood function is just the joint density of the data, except that we treat it as a function of the parameter $\theta$. Thus, $\mathcal{L}_n(\theta) : \Theta \to [0, \infty)$. The likelihood function is not a density function: in general, it is not true that $\mathcal{L}_n(\theta)$ integrates to 1 (with respect to $\theta$).
\begin{definition}
The \textbf{maximum likelihood estimator} (MLE), denoted by $\hat{\theta}_n$, is the value of $\theta$ that maximizes $\mathcal{L}_n(\theta)$.
\end{definition}
The maximum of $\ell_n(\theta)$ occurs at the same place as the maximum of $\mathcal{L}_n(\theta)$, so maximizing the log-likelihood leads to the same result as maximizing the likelihood. Often, it is easier to work with the log-likelihood.
\begin{claim}
If we multiply $\mathcal{L}_n(\theta)$ by any positive constant $c$ (not depending on $\theta$) then this will not change the MLE. Hence, we shall often drop constants in the likelihood function.
\end{claim}
\begin{example}
Let $X_1,...,X_n \sim \mathcal{N}(\mu, \sigma^2)$. The parameter is $\theta = (\mu, \sigma)$ and the likelihood function (ignoring some constants) is:
\begin{equation*}
\begin{aligned}
\mathcal{L}_n(\mu, \sigma) &= \prod\limits_{i}\frac{1}{\sigma}\exp{\{-\frac{1}{2\sigma^2}(X_i - \mu)^2\}}\\
&= \sigma^{-n}\exp{\{-\frac{1}{2\sigma^2}\sum\limits_i(X_i - \mu)^2\}}\\
&= \sigma^{-n}\exp{\{-\frac{nS^2}{2\sigma^2}\}}\exp{\{-\frac{n(\bar{X} - \mu)^2}{2\sigma^2}\}}
\end{aligned}
\end{equation*}
where $\bar{X} = n^{-1}\sum\limits_i X_i$ is the sample mean and $S^2 = n^{-1}\sum\limits_i(X_i - \bar{X})^2$. The
last equality above follows from the fact that $\sum\limits_i(X_i - \mu)^2 = nS^2 + n(\bar{X} - \mu)^2$
which can be verified by writing $\sum\limits_i(X_i - \mu)^2 = \sum\limits_i(X_i - \bar{X} + \bar{X} -\mu)^2$ and then expanding the square. The log-likelihood is:
\begin{equation*}
\ell(\mu, \sigma) = -n\log\sigma - \frac{nS^2}{2\sigma^2} - \frac{n(\bar{X} - \mu)^2}{2\sigma^2}
\end{equation*}
Solving the equations:
\begin{equation*}
\frac{\partial \ell(\mu, \sigma)}{\partial\mu} = 0 \hspace{10pt} \text{and} \hspace{10pt} \frac{\partial \ell(\mu, \sigma)}{\partial\sigma} = 0,
\end{equation*}
we conclude that $\hat{\mu} = \bar{X}$ and $\hat{\sigma} = S$. It can be verified that these are indeed global maxima of the likelihood.
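For completeness, the two stationarity conditions can be written out explicitly (a routine computation from the log-likelihood above):
$$\frac{\partial \ell(\mu, \sigma)}{\partial\mu} = \frac{n(\bar{X} - \mu)}{\sigma^2} = 0 \implies \hat{\mu} = \bar{X}, \qquad
\frac{\partial \ell(\mu, \sigma)}{\partial\sigma}\Big|_{\mu = \bar{X}} = -\frac{n}{\sigma} + \frac{nS^2}{\sigma^3} = 0 \implies \hat{\sigma} = S.$$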
\end{example}
\subsection{Properties of Maximum Likelihood Estimators}
Under certain conditions on the model, the maximum likelihood estimator $\hat{\theta}_n$ possesses many properties that make it an appealing choice of estimator. The main properties of the MLE are:
\begin{enumerate}
\item The MLE is \textbf{consistent}: $\hat{\theta}_n \overset{P}{\to} \theta^*$ where $\theta^*$ denotes the true value of the parameter $\theta$;
\item The MLE is \textbf{equivariant}: if $\hat{\theta}_n$ is the MLE of $\theta$ then $g(\hat{\theta}_n)$ is the MLE of $g(\theta)$;
\item The MLE is \textbf{asymptotically Normal}: $(\hat{\theta}_n - \theta^*)/\hat{\text{se}} \rightsquigarrow \mathcal{N}(0, 1)$; also, the
estimated standard error $\hat{\text{se}}$ can often be computed analytically (see the illustration after this list);
\item The MLE is \textbf{asymptotically optimal} or \textbf{efficient}: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance, at least for large samples; that is, $\hat{\theta}_n$ asymptotically attains the smallest achievable $\mathbb{E}\big[ (\hat{\theta}_n -\theta^*)^2\big]$ as $n \to \infty$;
\item The MLE is approximately the Bayes estimator.
\end{enumerate}
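As a brief illustration of properties 2 and 3 (continuing the Normal example above; the standard error used here is the usual $\hat{\sigma}/\sqrt{n}$, stated without derivation): since the MLE of $\sigma$ is $S$, equivariance gives $S^2$ as the MLE of the variance $\sigma^2$; and asymptotic Normality for the mean reads
$$\frac{\bar{X} - \mu^*}{\hat{\text{se}}} \rightsquigarrow \mathcal{N}(0,1), \qquad \hat{\text{se}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{S}{\sqrt{n}},$$
which is the usual basis for approximate confidence intervals for $\mu$.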
The properties we discuss only hold if the model satisfies certain regularity conditions; these are essentially smoothness conditions on $f(x; \theta)$. Unless otherwise stated, we shall tacitly assume that these conditions hold.
\subsection{Understanding Asymptotic Efficiency}
The expected squared error quantifies how good an estimator $\hat{\theta}$ is:
$$\mathbb{E}\big[ (\hat{\theta}- \theta_0)^2\big]$$
The Cram\'er-Rao bound shows that no estimator can achieve $\mathbb{E}\big[ (\hat{\theta}- \theta_0)^2\big] = 0$: it gives a lower bound on the expected squared error of any estimator.
\begin{theorem}[Cram\'er-Rao bound] For any estimator $\hat{\theta}$ of $\theta$ it holds that:
$$\mathbb{E}_{x|\theta}\big[ (\hat{\theta} - \theta)^2\big] \geq \dfrac{\big(\dfrac{\partial}{\partial \theta} b_{\hat{\theta}} + 1 \big)^2}{\mathbb{E}_{x|\theta}\big[ \Lambda^2\big]} + b_{\hat{\theta}}^2$$
where $\Lambda$ is the score function and $b_{\hat{\theta}}$ is the bias of $\hat{\theta}$:
$$\Lambda = \dfrac{\partial}{\partial{\theta}} \log{p(x|\theta)}= \dfrac{1}{p(x|\theta)} \dfrac{\partial}{\partial{\theta}} p(x|\theta) \hspace{10pt} \text{and} \hspace{10pt} b_{\hat{\theta}} = \mathbb{E}_{x|\theta}[\hat{\theta}]-\theta$$
\end{theorem}
\begin{proof}
$$\mathbb{E}_{x|\theta}[\Lambda] = \int_{x} p(x|\theta) \Lambda \; dx= \int_{x} \dfrac{\partial}{\partial{\theta}} p(x|\theta) dx = \dfrac{\partial}{\partial{\theta}}\overbrace{\int_{x} p(x|\theta)dx}^{=1} = 0$$
$$\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}]= \int_{x} p(x|\theta) \Lambda \hat{\theta} \; dx = \int_{x} \dfrac{\partial}{\partial{\theta}} p(x|\theta) \hat{\theta} \; dx
= \dfrac{\partial}{\partial{\theta}} \int_{x} p(x|\theta) \hat{\theta} \; dx = \dfrac{\partial}{\partial{\theta}} \mathbb{E}_{x|\theta}[\hat{\theta}] = \dfrac{\partial}{\partial{\theta}} (\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta) + 1 =\dfrac{\partial}{\partial{\theta}} b_{\hat{\theta}}+1 $$
Consider the covariance between $\Lambda$ and $\hat{\theta}$:
$$\Big(\mathbb{E}_{x|\theta}\big[ (\Lambda - \overbrace{\mathbb{E}_{x|\theta}[\Lambda]}^{=0}) (\hat{\theta} - \mathbb{E}_{x|\theta}[\hat{\theta}])\big] \Big)^2 = \Big(\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}] - \mathbb{E}_{x|\theta}\big[ \Lambda\, \mathbb{E}_{x|\theta}[\hat{\theta}]\big] \Big)^2 = \Big(\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}] - \overbrace{\mathbb{E}_{x|\theta}[\Lambda]}^{=0}\, \mathbb{E}_{x|\theta}[\hat{\theta}] \Big)^2 = \Big(\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}] \Big)^2$$
Now apply the Cauchy-Schwarz inequality, i.e.\ $\big(\mathbb{E}[XY]\big)^2 \leq \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$, to this covariance:
$$\Big(\mathbb{E}_{x|\theta}\big[ (\Lambda - \overbrace{\mathbb{E}_{x|\theta}[\Lambda]}^{=0}) (\hat{\theta} - \mathbb{E}_{x|\theta}[\hat{\theta}])\big] \Big)^2 \leq \mathbb{E}_{x|\theta}[\Lambda^2]\; \mathbb{E}_{x|\theta}\big[(\hat{\theta} - \mathbb{E}_{x|\theta}[\hat{\theta}])^2\big] = \mathbb{E}_{x|\theta}[\Lambda^2]\; \mathbb{E}_{x|\theta}\big[\big((\hat{\theta} - \theta) -(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)\big)^2\big] $$
$$ = \mathbb{E}_{x|\theta}[\Lambda^2]\; \mathbb{E}_{x|\theta}\big[(\hat{\theta} - \theta)^2 + (\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)^2 -2 (\hat{\theta}-\theta)(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)\big]$$ $$ = \mathbb{E}_{x|\theta}[\Lambda^2]\; \big \{\mathbb{E}_{x|\theta}\big[(\hat{\theta} - \theta)^2 \big] + \overbrace{\mathbb{E}_{x|\theta} \big[(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)^2 -2 (\hat{\theta}-\theta)(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)\big]}^{-b_{\hat{\theta}}^2} \big \} = \mathbb{E}_{x|\theta}[\Lambda^2]\; \big\{\mathbb{E}_{x|\theta}\big[(\hat{\theta} - \theta)^2\big]- b_{\hat{\theta}}^2 \big\}$$
It's easy to verify that $\mathbb{E}_{x|\theta} \big[(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)^2 -2 (\hat{\theta}-\theta)(\mathbb{E}_{x|\theta}[\hat{\theta}]-\theta)\big] = -b_{\hat{\theta}}^2$:
$$\mathbb{E}_{x|\theta}\Big[ \mathbb{E}_{x|\theta}^2[\hat{\theta}] + \theta^2 \cancel{-2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}]} -2 \hat{\theta} \mathbb{E}_{x|\theta}[\hat{\theta}] +2\hat{\theta}\theta + \cancel{2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}]} - 2 \theta^2
\Big] $$ $$=\mathbb{E}^2_{x|\theta}[\hat{\theta}] + \mathbb{E}_{x|\theta}[\theta^2] -2 \mathbb{E}^2_{x|\theta}[\hat{\theta}] +2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}] -2 \mathbb{E}_{x|\theta}[\theta^2]$$$$ = - \mathbb{E}^2_{x|\theta}[\hat{\theta}] - \mathbb{E}_{x|\theta}[\theta^2] +2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}] = - \mathbb{E}^2_{x|\theta}[\hat{\theta}] -\theta^2 +2 \theta \mathbb{E}_{x|\theta}[\hat{\theta}] = - \big( \mathbb{E}_{x|\theta}[\hat{\theta}] - \theta \big )^2 = - b_{\hat{\theta}}^2 $$
Finally, combining the identity $\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}] = \dfrac{\partial}{\partial{\theta}} b_{\hat{\theta}}+1$ derived above with the Cauchy-Schwarz bound, we obtain:
$$\big(\mathbb{E}_{x|\theta}[\Lambda \hat{\theta}]\big)^2 = \big( \dfrac{\partial}{\partial{\theta}} b_{\hat{\theta}}+1\big)^2 \leq \mathbb{E}_{x|\theta}[\Lambda^2]\; \big\{\mathbb{E}_{x|\theta}\big[(\hat{\theta} - \theta)^2\big] - b_{\hat{\theta}}^2\big\}$$
It follows that:
$$\mathbb{E}_{x|\theta}\big[ (\hat{\theta} - \theta)^2\big] \geq \dfrac{\big(\dfrac{\partial}{\partial \theta} b_{\hat{\theta}} + 1 \big)^2}{\mathbb{E}_{x|\theta}\big[ \Lambda^2\big]} + b_{\hat{\theta}}^2$$
\end{proof}
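To see what the bound gives in the simplest case (a standard specialization, stated here only for intuition): for an unbiased estimator, $b_{\hat{\theta}} = 0$, and the bound reduces to
$$\mathbb{E}_{x|\theta}\big[ (\hat{\theta} - \theta)^2\big] \geq \dfrac{1}{\mathbb{E}_{x|\theta}\big[ \Lambda^2\big]},$$
the reciprocal of the Fisher information. For instance, for $n$ i.i.d.\ samples from $\mathcal{N}(\theta, \sigma^2)$ one finds $\mathbb{E}_{x|\theta}[\Lambda^2] = n/\sigma^2$, so no unbiased estimator of the mean can have expected squared error below $\sigma^2/n$, which is exactly the variance of $\bar{X}$.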
\subsection{Stein Estimator}
For finite samples, the maximum likelihood estimator is not necessarily efficient.\\ Consider a multivariate random variable $y \sim \mathcal{N}(\theta_0,\sigma^2I)$ taking values in $\mathbb{R}^d$ with $d \geq 3$.
If we observe a single sample $y$ from this distribution, the maximum likelihood estimate of $\theta_0$ is simply $\hat{\theta}_{ML} = y$, while the (James-)Stein estimator is:
$$\hat{\theta}_{JS} := \Big(1 - \dfrac{(d-2)\sigma^2}{||y||^2} \Big) y$$
It is possible to prove that the Stein estimator is at least as good as the maximum likelihood estimator for every $\theta_0$. That is:
$$
\mathbb{E}\Big[ \norm{\hat{\theta}_{JS}- \theta_0}^2\Big] \leq \mathbb{E}\Big[ \norm{\hat{\theta}_{ML}- \theta_0}^2\Big] \; \; \text{for every} \; \theta_0
$$
Moreover, the inequality is strict for some values of $\theta_0$.
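For orientation (a standard computation, not part of the argument above): since $\hat{\theta}_{ML} = y$ and $y \sim \mathcal{N}(\theta_0, \sigma^2 I)$, the risk of the maximum likelihood estimator is the same for every $\theta_0$:
$$\mathbb{E}\Big[ \norm{\hat{\theta}_{ML} - \theta_0}^2\Big] = \sum_{j=1}^{d} \mathbb{E}\big[ (y_j - \theta_{0,j})^2\big] = d\,\sigma^2,$$
so the statement above says that shrinking $y$ towards the origin never increases, and sometimes strictly decreases, this risk when $d \geq 3$.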
\section{Bayesian Learning}
Bayesian inference is usually carried out in the following way:
\begin{itemize}
\item $\theta$ is considered to be a \textbf{random variable} with distribution $p(\theta|\mathcal{X})$.
\item $X \sim p(x)$ and $p(x)$ is unknown.
\item $p(x|\theta)$ is a statistical model that reflects our beliefs about $x$ given $\theta$.
\end{itemize}
We are looking for $p(X=x|\mathcal{X})$, i.e., the probability of $x$ given the sample set $\mathcal{X}$ (class conditional density):
$$p(X=x|\mathcal{X}) = \int \underbrace{p(x,\theta|\mathcal{X})}_{p(x|\theta,\mathcal{X})p(\theta|\mathcal{X})}d\theta =
\int p(x|\theta,\mathcal{X})p(\theta|\mathcal{X})d\theta = \int p(x|\theta)p(\theta|\mathcal{X})d\theta $$
where $p(x|\theta,\mathcal{X})= p(x|\theta)$, since given $\theta$ the new point $x$ is independent of the samples $x_i \in \mathcal{X}$ (the data are i.i.d.).\\\\
Moreover, asymptotically the posterior concentrates: $p(\theta| \mathcal{X}) \to \delta(\theta- \hat{\theta})$; intuitively, this follows from the fact that $\hat{\theta} \overset{p}{\to} \theta_{\text{true}}$. Thus, in the asymptotic case, we can approximate the integral with:
$$p(X=x|\mathcal{X}) = \int p(x|\theta)p(\theta|\mathcal{X})d\theta \approx
\int p(x|\theta) \delta(\theta- \hat{\theta})d\theta = p(x|\hat{\theta}) $$
This approximation was used in the early days of Bayesian inference, when it was not computationally feasible to evaluate the integral.
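A concrete instance of this concentration appears in the next subsection: for the Gaussian model treated there, the posterior variance is
$$\sigma_n^2 = \dfrac{\sigma^2 \sigma_0^2}{n\sigma_0^2+\sigma^2} \xrightarrow{\; n \to \infty \;} 0,$$
so the posterior $p(\mu|\mathcal{X})$ indeed collapses onto the maximum likelihood estimate.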
\newpage
\subsection{Bayesian Learning of a Normal Distribution}
Let us begin with a simple example in which we consider a single Gaussian random variable $x$. We shall suppose that the variance $\sigma^2$ is known, and we consider the task of inferring the mean $\mu$ given a set of $n$ observations:
\begin{itemize}
\item The likelihood is $p(x|\mu)= \mathcal{N}(\mu,\sigma^2)$
\item The prior is $p(\mu) = \mathcal{N}(\mu_0,\sigma_0^2)$
\item The data is $\mathcal{X} = \{x_1,\ldots,x_n\}$
\end{itemize}
We want to compute the posterior distribution $p(\mu|\mathcal{X})$:
$$p(\mu|\mathcal{X}) \propto p(\mathcal{X}|\mu)p(\mu) \implies p(\mu|\mathcal{X}) = \alpha \; p(\mathcal{X}|\mu)p(\mu)
= \alpha \cdot \prod_{i \leq n} \Big\{ \dfrac{1}{\sqrt{2\pi}\, \sigma} \exp{\Big(-\dfrac{1}{2} \big(\dfrac{x_i-\mu}{\sigma}\big)^2 \Big)}\Big\} \cdot \dfrac{1}{\sqrt{2\pi}\, \sigma_0} \exp{\Big(-\dfrac{1}{2} \big(\dfrac{\mu-\mu_0}{\sigma_0}\big)^2 \Big)}$$
$$=\alpha' \cdot \prod_{i \leq n} \Big\{ \exp{\Big(-\dfrac{1}{2} \big(\dfrac{x_i-\mu}{\sigma}\big)^2 \Big)}\Big\} \cdot \exp{\Big(-\dfrac{1}{2} \big(\dfrac{\mu-\mu_0}{\sigma_0}\big)^2 \Big)} = \alpha' \cdot \exp{\Big\{-\dfrac{1}{2} \Big( \sum_{i \leq n} \Big(\dfrac{x_i-\mu}{\sigma}\Big)^2 + \Big(\dfrac{\mu-\mu_0}{\sigma_0}\Big)^2 \Big)\Big\}}$$
Expanding the squares, we get:
$$p(\mu|\mathcal{X})= \alpha'\cdot \exp{\Big\{-\dfrac{1}{2}\Big( \mu^2 \overbrace{\big( \dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}\big)}^{a} -2\mu \overbrace{\big(\dfrac{\mu_0}{\sigma_0^2} + \dfrac{1}{\sigma^2}\sum_{i \leq n}x_i \big)}^{b} + c \Big)\Big\}} $$
This is the kernel of a Gaussian distribution, i.e. $\mu \,|\, \mathcal{X} \sim \mathcal{N}(\mu_n,\sigma_n^2)$, because the exponent is quadratic in $\mu$. Furthermore, by completing the square we obtain:
$$\mu_n = \dfrac{b}{a} = \dfrac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2} \hat{\mu}_n + \dfrac{ \sigma^2}{n \sigma_0^2 + \sigma^2} \mu_0$$
$$\sigma_n^2 = \dfrac{1}{a} = \dfrac{\sigma^2 \sigma_0^2}{n\sigma_0^2+\sigma^2}$$
where $\hat{\mu}_n = \frac{1}{n}\sum_{i \leq n} x_i$ is the sample mean, i.e. the maximum likelihood estimate of $\mu$.
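Equivalently (a standard restatement, obtained simply by taking the reciprocal of $\sigma_n^2 = 1/a$), the posterior precision is the sum of the prior precision and $n$ times the per-observation precision of the likelihood:
$$\dfrac{1}{\sigma_n^2} = \dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}$$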
It is worth spending a moment studying the form of the posterior mean and variance. First of all, note that the mean of the posterior is a compromise between the prior mean $\mu_0$ and the maximum likelihood solution $\hat{\mu}_n$. If the number of observed data points $n=0$, then $\mu_n$ reduces to the prior mean, as expected. For $n \to \infty$, $\mu_n$ is given by the maximum likelihood solution. \\ Similarly, consider the result for the variance of the posterior distribution $\sigma_n^2$. With no observed data points, we have the prior variance, whereas if the number of data points $n \to \infty$, the variance goes to zero and the posterior distribution becomes infinitely peaked around the maximum likelihood solution.\\
We therefore see that the maximum likelihood result of a point estimate for $\mu$ is recovered precisely from the Bayesian formalism in the limit of an infinite number of observations.
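As a quick numerical sanity check (illustrative values, not taken from the notes): let $\mu_0 = 0$, $\sigma_0^2 = \sigma^2 = 1$, $n = 4$ and $\hat{\mu}_n = 2$. Then
$$\mu_n = \dfrac{4}{4+1}\cdot 2 + \dfrac{1}{4+1}\cdot 0 = 1.6, \qquad \sigma_n^2 = \dfrac{1 \cdot 1}{4 + 1} = 0.2,$$
so after only four observations the posterior mean already lies much closer to the data than to the prior, and the posterior variance has shrunk from $\sigma_0^2 = 1$ to $0.2$.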
\end{document}