Bayesian inference is based on Bayes' theorem. Here is Bayes' theorem, equation \ref{bayestheorem}, again
$$\Pr(A | B) = \frac{\Pr(B | A)\Pr(A)}{\Pr(B)}$$
Replacing $B$ with observations $\textbf{y}$, $A$ with parameter set $\Theta$, and probabilities $\Pr$ with densities $p$ (or sometimes $\pi$ or function $f$) results in the following
$$p(\Theta | \textbf{y}) = \frac{p(\textbf{y} | \Theta)p(\Theta)}{p(\textbf{y})}$$
where $p(\textbf{y})$ will be discussed below, $p(\Theta)$ is the set of prior distributions of parameter set $\Theta$ before $\textbf{y}$ is observed, $p(\textbf{y} | \Theta)$ is the likelihood of $\textbf{y}$ under a model, and $p(\Theta | \textbf{y})$ is the joint posterior distribution, sometimes called the full posterior distribution, which expresses uncertainty about parameter set $\Theta$ after taking both the prior and the data into account. Since there are usually multiple parameters, $\Theta$ represents a set of $j$ parameters and may be considered hereafter in this article as
$$\Theta = \theta_1,...,\theta_j$$
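As a brief illustration (not drawn from the surrounding text, with $\alpha$, $\beta$, $n$, and $s$ introduced only for this example), suppose a single parameter $\theta \in (0,1)$ has a $\mathrm{Beta}(\alpha,\beta)$ prior and that $\textbf{y}$ consists of $n$ Bernoulli observations with $s$ successes. The formula above then gives
$$p(\theta | \textbf{y}) = \frac{\theta^s (1-\theta)^{n-s} \, \theta^{\alpha-1}(1-\theta)^{\beta-1} / B(\alpha,\beta)}{\int_0^1 \theta^s (1-\theta)^{n-s} \, \theta^{\alpha-1}(1-\theta)^{\beta-1} / B(\alpha,\beta) \, d\theta}$$
which simplifies to a $\mathrm{Beta}(\alpha+s, \beta+n-s)$ density, and the integral in the denominator is exactly the marginal likelihood discussed next.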
The denominator
$$p(\textbf{y}) = \int p(\textbf{y} | \Theta)p(\Theta) d\Theta$$
defines the ``marginal likelihood'' of $\textbf{y}$, or the ``prior predictive distribution'' of $\textbf{y}$, and, because it does not depend on $\Theta$, may be treated as an unknown constant $c$. The prior predictive distribution\footnote{The predictive distribution was introduced by \citet{jeffreys61}.} indicates what $\textbf{y}$ should look like, given the model, before $\textbf{y}$ has been observed. Only the set of prior probabilities and the model's likelihood function are used for the marginal likelihood of $\textbf{y}$. Dividing by the marginal likelihood of $\textbf{y}$ normalizes the joint posterior distribution, $p(\Theta | \textbf{y})$, ensuring that it is a proper distribution that integrates to one.
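To make this normalization concrete, the following sketch (not part of the discussion above) approximates the marginal likelihood on a grid for a single parameter $\theta$; the $\mathrm{Beta}(1,1)$ prior, the Bernoulli likelihood, the data vector, and the use of NumPy/SciPy are all illustrative assumptions.
\begin{verbatim}
# Illustrative sketch (not from the text): grid approximation of the
# marginal likelihood p(y) for a single parameter theta, assuming a
# Beta(1, 1) prior, Bernoulli observations, and an assumed data vector.
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])    # hypothetical observations
theta = np.linspace(0.001, 0.999, 999)    # grid over the parameter space

prior = stats.beta.pdf(theta, 1, 1)       # p(theta)
# Likelihood p(y | theta) evaluated at each grid point
likelihood = np.array([stats.bernoulli.pmf(y, t).prod() for t in theta])

# p(y) = integral of p(y | theta) p(theta) dtheta, via the trapezoidal rule
marginal = np.trapz(likelihood * prior, theta)

# Dividing by p(y) normalizes the joint posterior so it integrates to one
posterior = likelihood * prior / marginal
print(marginal, np.trapz(posterior, theta))  # the second value is ~1
\end{verbatim}
The same idea extends to multiple parameters, but the integral quickly becomes expensive to evaluate, which is one reason $p(\textbf{y})$ is often treated simply as the unknown constant $c$ above.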