Original poster: tulipsliu

[Frontier Topics] [QuantEcon] Mixing MATLAB with FORTRAN

#61 tulipsliu, posted 2020-12-15 17:46:45
The basis for Bayesian inference is derived from Bayes' theorem. Here is Bayes' theorem, equation \ref{bayestheorem}, again

$$\Pr(A | B) = \frac{\Pr(B | A)\Pr(A)}{\Pr(B)}$$

Replacing $B$ with observations $\textbf{y}$, $A$ with parameter set $\Theta$, and probabilities $\Pr$ with densities $p$ (or sometimes $\pi$ or function $f$) results in the following

$$
p(\Theta | \textbf{y}) = \frac{p(\textbf{y} | \Theta)p(\Theta)}{p(\textbf{y})}$$

where $p(\textbf{y})$ will be discussed below, $p(\Theta)$ is the set of prior distributions of parameter set $\Theta$ before $\textbf{y}$ is observed, $p(\textbf{y} | \Theta)$ is the likelihood of $\textbf{y}$ under a model, and $p(\Theta | \textbf{y})$ is the joint posterior distribution, sometimes called the full posterior distribution, of parameter set $\Theta$, which expresses uncertainty about $\Theta$ after taking both the prior and the data into account. Since there are usually multiple parameters, $\Theta$ represents a set of $j$ parameters and may hereafter be considered as

$$\Theta = \theta_1,...,\theta_j$$

The denominator

$$p(\textbf{y}) = \int p(\textbf{y} | \Theta)p(\Theta) d\Theta$$

defines the ``marginal likelihood'' of $\textbf{y}$, or the ``prior predictive distribution'' of $\textbf{y}$, and may be set to an unknown constant $\textbf{c}$. The prior predictive distribution\footnote{The predictive distribution was introduced by \citet{jeffreys61}.} indicates what $\textbf{y}$ should look like, given the model, before $\textbf{y}$ has been observed. Only the set of prior probabilities and the model's likelihood function are used for the marginal likelihood of $\textbf{y}$. The presence of the marginal likelihood of $\textbf{y}$ normalizes the joint posterior distribution, $p(\Theta | \textbf{y})$, ensuring it is a proper distribution and integrates to one.
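As a numerical aside (my own illustration, not part of the quoted text), the marginal likelihood can be approximated by integrating the likelihood times the prior. The Python sketch below does this for an assumed binomial model with a $\mathcal{BETA}(2,2)$ prior; the data values and names are invented only for this example.

\begin{verbatim}
# Hedged sketch: approximate the marginal likelihood p(y) by numerically
# integrating likelihood * prior over theta, for an assumed binomial model
# with a Beta(2, 2) prior.  Data and hyperparameters are invented.
import numpy as np
from scipy import stats, integrate
from scipy.special import comb, betaln

y, n = 7, 10                    # hypothetical successes out of n trials
a, b = 2.0, 2.0                 # Beta prior hyperparameters

def integrand(theta):
    # p(y | theta) * p(theta)
    return stats.binom.pmf(y, n, theta) * stats.beta.pdf(theta, a, b)

p_y, _ = integrate.quad(integrand, 0.0, 1.0)

# Closed-form beta-binomial marginal, used here only as a check
p_y_exact = comb(n, y) * np.exp(betaln(y + a, n - y + b) - betaln(a, b))
print(p_y, p_y_exact)           # the two values agree closely
\end{verbatim}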

#62 tulipsliu, posted 2020-12-15 17:47:21
By replacing $p(\textbf{y})$ with $\textbf{c}$, which is short for a `constant of proportionality', the model-based formulation of Bayes' theorem becomes

$$p(\Theta | \textbf{y}) = \frac{p(\textbf{y} | \Theta)p(\Theta)}{\textbf{c}}$$

By removing $\textbf{c}$ from the equation, the relationship changes from `equals' ($=$) to `proportional to' ($\propto$)\footnote{For those unfamiliar with $\propto$, this symbol simply means that two quantities are proportional if they vary in such a way that one is a constant multiplier of the other. This is due to the constant of proportionality $\textbf{c}$ in the equation. Here, this can be treated as `equal to'.}

$$
p(\Theta | \textbf{y}) \propto p(\textbf{y} | \Theta)p(\Theta)
$$
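The proportional form lends itself to simple grid approximation: evaluate $p(\textbf{y} | \Theta)p(\Theta)$ over a grid of parameter values, then rescale by the constant $\textbf{c}$ afterwards. The Python sketch below illustrates this for an assumed normal model with known $\sigma$ and a normal prior on the mean; all values are hypothetical.

\begin{verbatim}
# Hedged sketch: the unnormalized posterior evaluated on a grid, then
# rescaled so it integrates to one.  The model (normal data, known sigma,
# normal prior on the mean) is a toy choice made only for illustration.
import numpy as np
from scipy import stats

y = np.array([4.8, 5.1, 5.6, 4.9, 5.3])     # hypothetical data
sigma = 1.0                                  # assumed known

mu_grid = np.linspace(0, 10, 2001)
prior = stats.norm.pdf(mu_grid, loc=5.0, scale=10.0)
like = np.array([np.prod(stats.norm.pdf(y, loc=m, scale=sigma))
                 for m in mu_grid])

unnorm = like * prior                        # proportional to p(mu | y)
dx = mu_grid[1] - mu_grid[0]
posterior = unnorm / (unnorm.sum() * dx)     # divide by the constant c

print(posterior.sum() * dx)                  # ~ 1.0 after normalization
\end{verbatim}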

#63 tulipsliu, posted 2020-12-15 17:47:45
This form can be stated as the unnormalized joint posterior being proportional to the likelihood times the prior. However, the goal in model-based Bayesian inference is usually not to summarize the unnormalized joint posterior distribution, but to summarize the marginal distributions of the parameters. The full parameter set $\Theta$ can typically be partitioned into

$$\Theta = \{\Phi, \Lambda\}$$

where $\Phi$ is the sub-vector of interest, and $\Lambda$ is the complementary sub-vector of $\Theta$, often referred to as a vector of nuisance parameters. In a Bayesian framework, the presence of nuisance parameters does not pose any formal, theoretical problems. A nuisance parameter is a parameter that exists in the joint posterior distribution of a model, though it is not a parameter of interest. The marginal posterior distribution of $\Phi$, the sub-vector of interest, can simply be written as

$$p(\Phi | \textbf{y}) = \int p(\Phi, \Lambda | \textbf{y}) d\Lambda$$

In model-based Bayesian inference, Bayes' theorem is used to estimate the unnormalized joint posterior distribution, and finally the user can assess and make inferences from the marginal posterior distributions.
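To make the marginalization concrete, the Python sketch below integrates a nuisance scale parameter out of a joint posterior evaluated on a two-dimensional grid; the model, the priors, and the data are assumptions chosen only for illustration.

\begin{verbatim}
# Illustrative sketch only: integrating a nuisance parameter out of a joint
# posterior on a 2-D grid.  Here mu plays the role of Phi and sigma that of
# Lambda; the priors are arbitrary proper choices for the example.
import numpy as np
from scipy import stats

y = np.array([9.7, 10.4, 10.1, 9.5, 10.8])

mu_grid = np.linspace(5, 15, 401)
sigma_grid = np.linspace(0.1, 5, 400)
MU, SIG = np.meshgrid(mu_grid, sigma_grid, indexing="ij")

log_like = stats.norm.logpdf(y[None, None, :], loc=MU[..., None],
                             scale=SIG[..., None]).sum(axis=-1)
log_prior = (stats.norm.logpdf(MU, 10, 5)
             + stats.halfcauchy.logpdf(SIG, scale=5))

joint = np.exp(log_like + log_prior)         # unnormalized p(mu, sigma | y)
dmu = mu_grid[1] - mu_grid[0]
dsig = sigma_grid[1] - sigma_grid[0]
joint /= joint.sum() * dmu * dsig

marginal_mu = joint.sum(axis=1) * dsig       # p(mu | y), sigma integrated out
print(marginal_mu.sum() * dmu)               # ~ 1.0
\end{verbatim}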

#64 tulipsliu, posted 2020-12-15 17:48:51
The flat prior was historically the first attempt at an uninformative prior. The unbounded, uniform distribution, often called a flat prior, is

$$\theta \sim \mathcal{U}(-\infty, \infty)$$

where $\theta$ is uniformly-distributed from negative infinity to positive infinity. Although this seems to allow the posterior distribution to be affected solely by the data, with no impact from prior information, the flat prior should generally be avoided: the distribution is improper, meaning it does not integrate to one, because the integral of the assumed $p(\theta)$ is infinite (which violates the requirement that a probability distribution integrate to one). An improper prior may cause the posterior to be improper, which invalidates the model.

#65 tulipsliu, posted 2020-12-15 17:50:10
It is important for the prior distribution to be proper. A prior distribution, $p(\theta)$, is improper\footnote{Improper priors were introduced in \citet{jeffreys61}.} when

$$\int p(\theta) d\theta = \infty$$

As noted previously, an unbounded uniform prior distribution is an improper prior distribution because $p(\theta) \propto 1$, for $-\infty < \theta < \infty$. An improper prior distribution can cause an improper posterior distribution. When the posterior distribution is improper, it is non-integrable, inferences from it are invalid, and Bayes factors cannot be used (though there are exceptions).

For a joint posterior distribution to be proper, the marginal likelihood must be finite for all $\textbf{y}$. Again, the marginal likelihood is

$$p(\textbf{y}) = \int p(\textbf{y} | \Theta) p(\Theta) d\Theta$$

Although improper prior distributions can be used, it is good practice to avoid them.
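The following Python sketch (my own toy example, not from the quoted text) illustrates the propriety check numerically for a binomial model: under a proper $\mathcal{U}(0,1)$ prior the marginal likelihood converges, while under an arbitrary improper prior $p(\theta) \propto \theta^{-1}(1-\theta)^{-1}$, chosen here only for illustration, it diverges as the integration bounds approach 0 and 1.

\begin{verbatim}
# Hedged sketch of the propriety check, using an assumed binomial model.
# The improper prior p(theta) ~ 1/(theta*(1-theta)) is an invented example;
# with y = 0 successes its marginal likelihood diverges near theta = 0,
# so the corresponding posterior would be improper.
import numpy as np
from scipy import stats, integrate

y, n = 0, 5

def marginal(prior, eps):
    f = lambda t: stats.binom.pmf(y, n, t) * prior(t)
    val, _ = integrate.quad(f, eps, 1 - eps)
    return val

proper = lambda t: stats.beta.pdf(t, 1, 1)       # Uniform(0, 1) prior
improper = lambda t: 1.0 / (t * (1.0 - t))       # unnormalizable prior

for eps in (1e-2, 1e-4, 1e-6):
    print(eps, marginal(proper, eps), marginal(improper, eps))
# the proper-prior column stabilizes; the improper-prior column keeps growing
\end{verbatim}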

#66 tulipsliu, posted 2020-12-15 17:50:44
Prior distributions may be estimated within the model via hyperprior distributions, which are usually vague and nearly flat. Parameters of hyperprior distributions are called hyperparameters. Using hyperprior distributions to estimate prior distributions is known as hierarchical Bayes. In theory, this process could continue further, using hyper-hyperprior distributions to estimate the hyperprior distributions. Estimating priors through hyperpriors, and from the data, is a method to elicit the optimal prior distributions. One of many natural uses for hierarchical Bayes is multilevel modeling.

Recall that the unnormalized joint posterior distribution (equation \ref{jointposterior}) is proportional to the likelihood times the prior distribution

$$p(\Theta | \textbf{y}) \propto p(\textbf{y} | \Theta)p(\Theta)$$

The simplest hierarchical Bayes model takes the form

$$p(\Theta, \Phi | \textbf{y}) \propto p(\textbf{y} | \Theta)p(\Theta | \Phi)p(\Phi)$$

where $\Phi$ is a set of hyperparameters with hyperprior distributions $p(\Phi)$. Reading the equation from right to left, it begins with the hyperpriors $p(\Phi)$, which are used conditionally to estimate the priors $p(\Theta | \Phi)$, which in turn are used, as usual, to estimate the likelihood $p(\textbf{y} | \Theta)$; finally, the posterior is $p(\Theta, \Phi | \textbf{y})$.
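As an illustration (not from the quoted text), the Python sketch below writes down the unnormalized log-posterior of a minimal hierarchical model: group means $\theta_j$ receive a $\mathcal{N}(\mu, \tau^2)$ prior, and the hyperparameters $\mu$ and $\tau$ receive hyperpriors. The data, the known within-group $\sigma$, and the particular hyperprior choices are assumptions made only for this example.

\begin{verbatim}
# Minimal hierarchical-Bayes sketch under stated assumptions: group means
# theta_j ~ N(mu, tau^2), with hyperpriors on mu and tau.  Reading the
# return statement right to left mirrors p(y|Theta) p(Theta|Phi) p(Phi).
import numpy as np
from scipy import stats

# hypothetical data: observations grouped by index j
y = [np.array([5.1, 4.7, 5.4]),
     np.array([6.2, 5.9]),
     np.array([4.4, 4.9, 5.0])]
sigma = 0.5                                   # assumed known within-group sd

def log_unnorm_posterior(theta, mu, tau):
    if tau <= 0:
        return -np.inf
    ll = sum(stats.norm.logpdf(y_j, loc=t_j, scale=sigma).sum()
             for y_j, t_j in zip(y, theta))                    # p(y | Theta)
    lp_theta = stats.norm.logpdf(theta, loc=mu, scale=tau).sum()  # p(Theta | Phi)
    lp_hyper = (stats.norm.logpdf(mu, 0, 100)
                + stats.halfcauchy.logpdf(tau, scale=25))         # p(Phi)
    return ll + lp_theta + lp_hyper

print(log_unnorm_posterior(np.array([5.0, 6.0, 4.8]), mu=5.3, tau=1.0))
\end{verbatim}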

#67 tulipsliu, posted 2020-12-15 17:51:24
Although the gamma distribution is the conjugate prior distribution for the precision of a normal distribution \citep{spiegelhalter03},

$$\tau \sim \mathcal{G}(0.001, 0.001),$$

better properties for scale parameters are yielded with the non-conjugate, proper, half-Cauchy\footnote{The half-t distribution is another option.} distribution, with a general recommendation of scale=25 for a weakly informative scale parameter \citep{gelman06},

$$\sigma \sim \mathcal{HC}(25)$$
$$\tau = \sigma^{-2}$$

When the half-Cauchy is unavailable, a uniform distribution is often placed on $\sigma$ in hierarchical Bayes when the number of groups is, say, at least five,

$$\sigma \sim \mathcal{U}(0, 100)$$
$$\tau = \sigma^{-2}$$
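A small Python sketch of the two scale-parameter priors just described, using scipy.stats parameterizations (the half-Cauchy scale is passed directly); the random seed and sample sizes are arbitrary choices for the example.

\begin{verbatim}
# Hedged sketch: draw sigma from a half-Cauchy(25) or Uniform(0, 100) prior
# and convert to the precision tau = sigma^-2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

sigma_hc = stats.halfcauchy(scale=25).rvs(size=5, random_state=rng)
sigma_u  = stats.uniform(loc=0, scale=100).rvs(size=5, random_state=rng)

tau_hc = sigma_hc ** -2        # tau = sigma^-2, the precision
tau_u  = sigma_u ** -2
print(tau_hc, tau_u)
\end{verbatim}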

When conjugate distributions are used, a summary statistic for a posterior distribution of $\theta$ may be represented as $t(\textbf{y})$ and said to be a sufficient statistic \citep[p. 42]{gelman04}. When non-conjugate distributions are used, a summary statistic for a posterior distribution is usually not a sufficient statistic. A sufficient statistic is a statistic that has the property of sufficiency with respect to a statistical model and its associated unknown parameter: the quantity $t(\textbf{y})$ is sufficient for $\theta$ because the likelihood for $\theta$ depends on the data $\textbf{y}$ only through the value of $t(\textbf{y})$. Sufficient statistics are useful in algebraic manipulations of likelihoods and posterior distributions.
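To illustrate sufficiency (a toy example of my own, not from the quoted text), the Python sketch below uses normal data with known $\sigma$: two samples sharing the same $n$ and sample mean produce likelihood functions for $\theta$ that are proportional, so they coincide after rescaling.

\begin{verbatim}
# Sketch of sufficiency for normal data with known sigma: the likelihood for
# theta depends on y only through t(y) = (n, ybar), so two samples sharing
# those values give proportional likelihood functions.
import numpy as np
from scipy import stats

sigma = 1.0
y1 = np.array([1.0, 2.0, 3.0])          # ybar = 2, n = 3
y2 = np.array([1.5, 2.5, 2.0])          # same ybar and n, different spread

theta = np.linspace(-2, 6, 801)
L1 = np.array([np.prod(stats.norm.pdf(y1, m, sigma)) for m in theta])
L2 = np.array([np.prod(stats.norm.pdf(y2, m, sigma)) for m in theta])

# After rescaling each curve by its own maximum, the two coincide.
print(np.allclose(L1 / L1.max(), L2 / L2.max()))   # True
\end{verbatim}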

#68 tulipsliu, posted 2020-12-15 17:52:01
In order to complete the definition of a Bayesian model, both the prior distributions and the likelihood\footnote{Ronald A. Fisher, a prominent frequentist, introduced the term likelihood in 1921 \citep{fisher21}, though the concept of likelihood was used by Bayes and Laplace. Fisher's introduction preceded a series of the most influential papers in statistics (mostly in 1922 and 1925), in which Fisher introduced numerous terms that are now common: consistency, efficiency, estimation, information, maximum likelihood estimate, optimality, parameter, statistic, sufficiency, and variance. He was the first to use Greek letters for unknown parameters and Latin letters for the estimates. Later contributions include F statistics, design of experiments, ANOVA, and many more.} must be approximated or fully specified. The likelihood, likelihood function, or $p(\textbf{y} | \Theta)$, contains the available information provided by the sample. The likelihood is

$$p(\textbf{y} | \Theta) = \prod^n_{i=1} p(\textbf{y}_i | \Theta)$$

The data $\textbf{y}$ affects the posterior distribution $p(\Theta | \textbf{y})$ only through the likelihood function $p(\textbf{y} | \Theta)$. In this way, Bayesian inference obeys the likelihood principle, which states that for a given sample of data, any two probability models $p(\textbf{y} | \Theta)$ that have the same likelihood function yield the same inference for $\Theta$. For more information on the likelihood principle, see section \ref{lprinciple}.
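A minimal Python sketch of the product form above, under an assumed normal model; in practice the log-likelihood, a sum, is preferred to avoid numerical underflow. The data and parameter values are invented.

\begin{verbatim}
# Hedged sketch: likelihood as a product over i.i.d. observations, and the
# equivalent (and numerically safer) log-likelihood as a sum.
import numpy as np
from scipy import stats

y = np.array([2.1, 1.8, 2.4, 2.0, 1.9])
theta = {"mu": 2.0, "sigma": 0.3}

likelihood = np.prod(stats.norm.pdf(y, theta["mu"], theta["sigma"]))
log_likelihood = np.sum(stats.norm.logpdf(y, theta["mu"], theta["sigma"]))

print(likelihood, np.exp(log_likelihood))   # identical up to rounding
\end{verbatim}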

#69 tulipsliu, posted 2020-12-15 17:53:17
In non-technical parlance, ``likelihood'' is usually a synonym for ``probability'', but in statistical usage there is a clear distinction: whereas ``probability'' allows us to predict unknown outcomes based on known parameters, ``likelihood'' allows us to estimate unknown parameters based on known outcomes.

In a sense, likelihood can be thought of as a reversed version of conditional probability. Reasoning forward from a given parameter $\theta$, the conditional probability of $\textbf{y}$ is the density $p(\textbf{y} | \theta)$. With $\theta$ as a parameter, the following relationships hold among expressions of the likelihood function

$$\mathscr{L}(\theta | \textbf{y}) = p(\textbf{y} | \theta) = f(\textbf{y} | \theta)$$

where $\textbf{y}$ is the observed outcome of an experiment, and the likelihood ($\mathscr{L}$) of $\theta$ given $\textbf{y}$ is equal to the density $p(\textbf{y} | \theta)$ or function $f(\textbf{y} | \theta)$. When viewed as a function of $\textbf{y}$ with $\theta$ fixed, it is not a likelihood function $\mathscr{L}(\theta | \textbf{y})$, but merely a probability density function $p(\textbf{y} | \theta)$. When viewed as a function of $\theta$ with $\textbf{y}$ fixed, it is a likelihood function and may be denoted as $\mathscr{L}(\theta | \textbf{y})$, $p(\textbf{y} | \theta)$, or $f(\textbf{y} | \theta)$\footnote{Note that $\mathscr{L}(\theta | \textbf{y})$ is not the same as the probability that those parameters are the right ones, given the observed sample.}.
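The distinction can be seen numerically. In the Python sketch below (a binomial example of my own choosing, not from the quoted text), the same expression sums to one when viewed as a function of $\textbf{y}$ with $\theta$ fixed, but not when viewed as a function of $\theta$ with $\textbf{y}$ fixed.

\begin{verbatim}
# Hedged sketch: p(y | theta) as a density in y versus a likelihood in theta.
import numpy as np
from scipy import stats

n = 10

# As a function of y with theta fixed: a proper probability distribution.
theta_fixed = 0.4
print(sum(stats.binom.pmf(k, n, theta_fixed) for k in range(n + 1)))  # = 1.0

# As a function of theta with y fixed: a likelihood, not a density in theta.
y_fixed = 7
theta_grid = np.linspace(0, 1, 1001)
L = stats.binom.pmf(y_fixed, n, theta_grid)
print(np.sum(L) * (theta_grid[1] - theta_grid[0]))   # ~ 1/(n+1), not 1
\end{verbatim}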

#70 tulipsliu, posted 2020-12-15 17:53:37
For example, in a Bayesian linear regression with an intercept and two independent variables, the model may be specified as

$$\textbf{y}_i \sim \mathcal{N}(\mu_i, \sigma^2)$$
$$\mu_i = \beta_1 + \beta_2\textbf{X}_{i,1} + \beta_3\textbf{X}_{i,2}$$

The dependent variable $\textbf{y}$, indexed by $i=1,...,n$, is stochastic and normally-distributed according to the expectation vector $\mu$ and variance $\sigma^2$. Expectation vector $\mu$ is an additive, linear function of a vector of regression parameters, $\beta$, and the design matrix $\textbf{X}$.

Since $\textbf{y}$ is normally-distributed, the probability density function (PDF) of a normal distribution will be used, and is usually denoted as

$$f(\textbf{y}_i) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2\sigma^2}(\textbf{y}_i-\mu_i)^2\right]; \quad \textbf{y}_i \in (-\infty, \infty)$$

By considering a conditional distribution, the record-level likelihood in Bayesian notation is

$$p(\textbf{y}_i | \Theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2\sigma^2}(\textbf{y}_i-\mu_i)^2\right]; \quad \textbf{y}_i \in (-\infty, \infty)$$
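To tie the example together, the Python sketch below evaluates the record-level and total log-likelihood of this two-predictor regression on simulated data; the data, parameter values, and function names are assumptions made only for illustration.

\begin{verbatim}
# Hedged sketch of the regression likelihood above on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

def record_loglike(beta, sigma):
    mu = X @ beta                              # mu_i = b1 + b2*X_i1 + b3*X_i2
    return stats.norm.logpdf(y, loc=mu, scale=sigma)   # one term per record

def total_loglike(beta, sigma):
    return record_loglike(beta, sigma).sum()   # log p(y | Theta)

print(total_loglike(beta_true, 0.8))
\end{verbatim}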
