Ch3.3~4 Bayes Estimation [The univariate case]

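This post is based on the lectures of Prof. Kyunghwan Kim (김경환) at Sogang University in the second semester of 2020 and on the pattern recognition textbook.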

3.3 Bayes estimation

▶ In MLE

$\theta$ is regarded as fixed.

▶ In Bayesian Learning

$\theta$ is regarded as a random variable, and the training data allow us to convert a distribution on this variable into a posterior probability density.

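The Bayes approach to pattern classification gives results similar to those obtained by MLE, but there is a conceptual difference. MLE regards $\theta$ as fixed, whereas Bayes learning regards $\theta$ as a random variable, and the training data let us convert a distribution on this variable into a posterior probability density. (This was puzzling at first, but it makes sense once you read this post to the end and come back to it.)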

▶ The class-conditional densities

$P(\omega_j | \mathbf{x}) = \frac{p(\mathbf{x} | \omega_j) P(\omega_j)}{p(\mathbf{x})}$

How can we proceed when these quantities are unknown?
Compute $P(\omega_j | \mathbf{x})$ using all of the information at our disposal.
Part of this information may be prior knowledge, such as the functional forms of the unknown densities and the ranges of the unknown parameter values.
Part of it may reside in a set of training samples.

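Letting $D$ again denote the set of samples, we can emphasize the role of the samples by saying that our goal is to compute the posterior probabilities $P(\omega_j | \mathbf{x}, D)$. From these probabilities we can obtain the Bayes classifier.

Given the sample set $D$, Bayes formula becomes: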

$P\left(\omega_{i} \mid \mathbf{x}, D\right)=\frac{p\left(\mathbf{x} \mid \omega_{i}, D\right) P\left(\omega_{i} \mid D\right)}{\sum_{j=1}^{c} p\left(\mathbf{x} \mid \omega_{j}, D\right) P\left(\omega_{j} \mid D\right)}$

The information provided by the training samples can be used to help to determine both the class-conditional densities and the prior probabilities.
We assume that the true values of the prior probabilities are known or obtainable from a trivial calculation, so $P(\omega_i | D)$ can be replaced by $P(\omega_i)$.
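As a small illustration of the Bayes formula above, here is a minimal sketch in Python. It assumes the class-conditional densities $p(\mathbf{x} | \omega_i, D)$ have already been evaluated at a single point $\mathbf{x}$; the function name `class_posteriors` and all the numbers are made up for the example.

```python
import numpy as np

def class_posteriors(likelihoods, priors):
    """Return P(w_i | x, D) for every class i at one point x.

    likelihoods[i] stands for p(x | w_i, D) and priors[i] for P(w_i | D),
    which is taken here to equal the known prior P(w_i).
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = likelihoods * priors       # numerator of Bayes formula
    return joint / joint.sum()         # divide by the sum over all c classes

# Hypothetical values for c = 3 classes at one test point x.
post = class_posteriors([0.30, 0.10, 0.05], [0.5, 0.3, 0.2])
print(post)            # posterior probabilities, summing to 1
print(post.argmax())   # index of the class with the largest posterior
```

The Bayes classifier then simply picks the class with the largest posterior.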

Supervised case

Separate the training samples by class into $c$ subsets $D_1, \dots, D_c$ , with the samples in $D_i$ belonging to $\omega_i$ .
The samples in $D_i$ have no influence on $p(\mathbf{x} | \omega_j, D)$ if $i \neq j$ .

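Writing the problem this way has two advantages.

First, it lets us work with each class separately: only the samples in $D_i$ are used to compute $p(\mathbf{x}|\omega_i, D)$ . Combined with the assumption that the prior probabilities are known, the Bayes formula above can be written with $p(\mathbf{x}|\omega_i, D_i)$ and $P(\omega_i)$ in place of $p(\mathbf{x}|\omega_i, D)$ and $P(\omega_i|D)$ .
Second, because each class can be treated independently, we can drop needless class distinctions and simplify the notation. In essence we have $c$ separate problems of the following form: use a set $D$ of samples drawn independently according to the fixed but unknown probability distribution $p(\mathbf{x})$ to determine $p(\mathbf{x}|D)$ . This is the central problem of Bayes learning.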

▶ The Parameter Distribution : the important part starts here

Assume that $p(\mathbf{x})$ has a known parametric form. The only thing unknown is the value of a parameter vector $\boldsymbol{\theta}$ .

For example, the form of $p(\mathbf{x})$ may be known to be Gaussian, while the corresponding parameters $\boldsymbol{\theta}$ (e.g. $\mu, \sigma^2$ ) are unknown.

Any information we have about $\boldsymbol{\theta}$ prior to observing the samples is assumed to be contained in a known prior density $p(\boldsymbol{\theta})$ . Observation of the samples converts this to a posterior density $p(\boldsymbol{\theta}|D)$ , which we hope is sharply peaked about the true value of $\boldsymbol{\theta}$ .

Problem of learning a probability density function

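Note that we have converted the problem of learning a probability density function into the problem of estimating a parameter vector $\boldsymbol{\theta}$ . The goal is to compute $p(\mathbf{x}|D)$ , which is as close as we can come to obtaining the unknown $p(\mathbf{x})$ . To do this we integrate the joint density $p(\mathbf{x}, \boldsymbol{\theta}|D)$ over all values of the parameter vector: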

$\begin{aligned} p(\mathbf{x} | D) &=\int p(\mathbf{x}, \boldsymbol{\theta}|D) d \boldsymbol{\theta} \\ \end{aligned}$

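By the product rule of probability, $p(\mathbf{x}, \boldsymbol{\theta}|D) = p(\mathbf{x} | \boldsymbol{\theta}, D)\, p(\boldsymbol{\theta}|D)$ . Because the selection of $\mathbf{x}$ and the selection of the training samples in $D$ are done independently, $p(\mathbf{x} | \boldsymbol{\theta}, D)$ is simply $p(\mathbf{x} | \boldsymbol{\theta})$ (that is, the distribution of $\mathbf{x}$ is completely known once we know the value of $\boldsymbol{\theta}$ ). We can therefore write: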

$\begin{aligned} p(\mathbf{x} | D) &=\int p(\mathbf{x} | \boldsymbol{\theta}) p(\boldsymbol{\theta}|D) d \boldsymbol{\theta} \end{aligned}$

This key equation links the desired class-conditional density $p(\mathbf{x}|D)$ to the posterior density $p(\boldsymbol{\theta}|D)$ for the unknown parameter vector. If $p(\boldsymbol{\theta}|D)$ peaks very sharply about some value $\widehat{\boldsymbol{\theta}}$ , we obtain $p(\mathbf{x}|D) \approx p(\mathbf{x}|\widehat{\boldsymbol{\theta}})$ , i.e. the result we would get by substituting the estimate $\widehat{\boldsymbol{\theta}}$ for the true parameter vector.
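The "sharp peak" argument can be checked numerically. The sketch below approximates the integral $\int p(x | \theta) p(\theta|D) d\theta$ by a Riemann sum on a grid, for a scalar $\theta$ that plays the role of the mean of a Gaussian with known variance. The two posteriors (`broad` and `sharp`), the peak location `theta_hat` and the helper names are all assumptions made for illustration, not something taken from the lecture.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def predictive_density(x, theta_grid, posterior_on_grid, sigma2):
    """Approximate p(x|D) = integral of p(x|theta) p(theta|D) dtheta on a grid."""
    d_theta = theta_grid[1] - theta_grid[0]
    # Normalize so that the discretized posterior integrates to 1.
    weights = posterior_on_grid / (posterior_on_grid.sum() * d_theta)
    return np.sum(normal_pdf(x, theta_grid, sigma2) * weights) * d_theta

theta_grid = np.linspace(-10.0, 10.0, 20001)
sigma2 = 1.0       # assumed known variance of p(x|theta)
theta_hat = 2.0    # hypothetical location of the posterior peak
x = 2.5

broad = normal_pdf(theta_grid, theta_hat, 1.0)    # wide posterior
sharp = normal_pdf(theta_grid, theta_hat, 1e-4)   # sharply peaked posterior

p_broad = predictive_density(x, theta_grid, broad, sigma2)
p_sharp = predictive_density(x, theta_grid, sharp, sigma2)
plug_in = normal_pdf(x, theta_hat, sigma2)        # p(x | theta_hat)
print(p_broad, p_sharp, plug_in)
```

With the sharp posterior the integral is almost identical to the plug-in value $p(x|\widehat{\theta})$, while the broad posterior spreads the predictive density out and gives a noticeably different number.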

3.4 Bayes estimation: the Gaussian case

This section uses Bayes estimation to compute the posterior density $p(\boldsymbol{\theta}|D)$ and the desired probability density $p(\mathbf{x}|D)$ for the case where $p(\mathbf{x}|\boldsymbol{\mu}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ .

▶ The Univariate Case : $p(\mu|D)$ - univariate Gaussian

Consider the case where $\mu$ is the only unknown parameter ( $\sigma^2$ is fixed and known).

$p(x \mid \mu) \sim N\left(\mu, \sigma^{2}\right)$

We assume that whatever prior knowledge we have about $\mu$ can be expressed by a known prior density $p(\mu)$ :

$p(\mu) \sim N\left(\mu_0, \sigma^{2}_{0}\right)$

$\mu_0$ : our best prior guess for $\mu$ .
$\sigma^{2}_{0}$ : our uncertainty about this guess.

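Once we have assumed a prior density for $\mu$ , we can picture the situation as follows. Imagine that a value for $\mu$ is drawn from a population governed by the probability law $p(\mu)$ . Once drawn, this value becomes the true value of $\mu$ and completely determines the density of $x$ . Now suppose that $n$ samples $x_1, x_2, \dots, x_n$ are drawn independently from the resulting population. Letting $D = \{x_1, x_2, \dots, x_n\}$ and using Bayes formula, we can write

$p(\mu \mid D)=\frac{p(D \mid \mu) p(\mu)}{\int p(D \mid \mu) p(\mu) d \mu}=\alpha \prod_{k=1}^{n} p\left(x_{k} \mid \mu\right) p(\mu)$

where $\alpha$ is a normalization factor that depends on $D$ but not on $\mu$ . This lays out the flow of Bayes estimation: the formula shows how observing a set of training samples affects our ideas about the true value of $\mu$ , relating the prior density $p(\mu)$ to the posterior density $p(\mu|D)$ . Let's assume Gaussian distributions and make this more concrete.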

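Because $p(x_k | \mu) \sim N(\mu, \sigma^2)$ and $p(\mu) \sim N(\mu_0, \sigma^{2}_{0})$ , the product $\prod_{k} p(x_k | \mu)\, p(\mu)$ is the exponential of a quadratic function of $\mu$ , so $p(\mu|D)$ is again a Gaussian density.

If we write $p(\mu|D) \sim N(\mu_n, \sigma^{2}_{n})$ , equating coefficients gives

$\mu_{n}=\frac{n \sigma_{0}^{2}}{n \sigma_{0}^{2}+\sigma^{2}} \hat{\mu}_{n}+\frac{\sigma^{2}}{n \sigma_{0}^{2}+\sigma^{2}} \mu_{0}, \qquad \sigma_{n}^{2}=\frac{\sigma_{0}^{2} \sigma^{2}}{n \sigma_{0}^{2}+\sigma^{2}}$

where $\hat{\mu}_{n}=\frac{1}{n} \sum_{k=1}^{n} x_{k}$ is the sample mean.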
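To make the posterior $N(\mu_n, \sigma^{2}_{n})$ concrete, here is a minimal sketch of the update in Python. It relies on the standard closed form for a Gaussian mean with known variance and a Gaussian prior (written out in the comments); the function name `gaussian_mean_posterior` and the sample values are hypothetical.

```python
import numpy as np

def gaussian_mean_posterior(samples, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) of mu for x ~ N(mu, sigma2), mu ~ N(mu0, sigma0_2).

    mu_n      = n*s0/(n*s0 + s) * sample_mean + s/(n*s0 + s) * mu0
    sigma_n^2 = s0*s / (n*s0 + s)      with s = sigma2, s0 = sigma0_2
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    sample_mean = samples.mean()
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 / denom) * sample_mean + (sigma2 / denom) * mu0
    sigma_n2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n2

# Hypothetical data: 4 samples, known sigma^2 = 1, prior N(0, 1).
mu_n, sigma_n2 = gaussian_mean_posterior([1.0, 2.0, 3.0, 2.0], 1.0, 0.0, 1.0)
print(mu_n, sigma_n2)   # approximately 1.6 and 0.2
```

Here the sample mean is 2.0 and the prior guess is 0.0, so the posterior mean 1.6 sits between the two, closer to the sample mean because four samples already outweigh the prior.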

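Figure: Bayes learning of the mean of Gaussian distributions in one and two dimensions. The posterior estimates are labeled by the number of training samples used in the estimation. (As data accumulate, the posterior becomes sharper: its mean is pulled toward the sample mean and its uncertainty decreases.)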
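The sharpening with more data can also be seen directly from the numbers. The sketch below assumes the standard closed form for the univariate case, in which the weight on the sample mean in $\mu_n$ is $n\sigma_0^2/(n\sigma_0^2+\sigma^2)$ and $\sigma_n^2 = \sigma_0^2\sigma^2/(n\sigma_0^2+\sigma^2)$ . The values of `sigma2` and `sigma0_2` are arbitrary choices for illustration.

```python
sigma2 = 1.0      # assumed known variance of p(x|mu)
sigma0_2 = 4.0    # hypothetical prior variance (uncertainty about mu_0)

def weight_and_variance(n):
    """Weight on the sample mean in mu_n, and posterior variance sigma_n^2."""
    denom = n * sigma0_2 + sigma2
    w = n * sigma0_2 / denom
    sigma_n2 = sigma0_2 * sigma2 / denom
    return w, sigma_n2

for n in (1, 5, 10, 50, 100):
    w, var_n = weight_and_variance(n)
    print(f"n={n:3d}  weight on sample mean={w:.4f}  sigma_n^2={var_n:.4f}")
```

As $n$ grows, the weight on the sample mean approaches 1 and $\sigma_n^2$ shrinks roughly like $\sigma^2/n$ , which is the sharpening the figure describes.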

Let's go one step further!

▶ The Univariate Case : $p(\mathbf{x}|D)$ - univariate Gaussian (a different density from the one above)

Having obtained the a posteriori density for the mean, $p(\mu|D)$ , all that remains is to obtain the “class-conditional” density $p(\mathbf{x}|D)$ .
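For reference, carrying out the integral $p(x|D) = \int p(x|\mu)\, p(\mu|D)\, d\mu$ with two Gaussians gives the standard result $p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$ . The sketch below checks this numerically with a Riemann sum; the values of `mu_n`, `sigma_n2`, `sigma2` and `x` are made-up inputs, not results from the lecture.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical posterior N(mu_n, sigma_n^2) for mu and known variance sigma^2.
mu_n, sigma_n2 = 1.6, 0.2
sigma2 = 1.0
x = 2.5

# Numerical version of p(x|D) = integral of p(x|mu) p(mu|D) dmu.
mu_grid = np.linspace(mu_n - 10.0, mu_n + 10.0, 20001)
d_mu = mu_grid[1] - mu_grid[0]
integrand = normal_pdf(x, mu_grid, sigma2) * normal_pdf(mu_grid, mu_n, sigma_n2)
numeric = np.sum(integrand) * d_mu

# Closed form: a Gaussian with mean mu_n and variance sigma^2 + sigma_n^2.
closed = normal_pdf(x, mu_n, sigma2 + sigma_n2)
print(numeric, closed)
```

The extra term $\sigma_n^2$ in the variance reflects the remaining uncertainty about $\mu$ ; as the posterior sharpens, $p(x|D)$ approaches $N(\mu_n, \sigma^2)$ .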

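▶ The Multivariate Case : multivariate Gaussian

The multivariate case in which $\boldsymbol{\Sigma}$ is known but $\boldsymbol{\mu}$ is not is a direct generalization of the univariate case, so it is worth working through the derivation yourself. We assume the following: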

$p(\mathbf{x} | \boldsymbol{\mu}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \quad \text { and } \quad p(\boldsymbol{\mu}) \sim N\left(\boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0}\right)$
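Under these assumptions the posterior is again Gaussian, $p(\boldsymbol{\mu}|D) \sim N(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$ , with the standard closed form $\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_0(\boldsymbol{\Sigma}_0 + \frac{1}{n}\boldsymbol{\Sigma})^{-1}\hat{\boldsymbol{\mu}}_n + \frac{1}{n}\boldsymbol{\Sigma}(\boldsymbol{\Sigma}_0 + \frac{1}{n}\boldsymbol{\Sigma})^{-1}\boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}_n = \boldsymbol{\Sigma}_0(\boldsymbol{\Sigma}_0 + \frac{1}{n}\boldsymbol{\Sigma})^{-1}\frac{1}{n}\boldsymbol{\Sigma}$ , where $\hat{\boldsymbol{\mu}}_n$ is the sample mean. A minimal NumPy sketch of this update follows; the function name `mvn_mean_posterior` and the data are hypothetical.

```python
import numpy as np

def mvn_mean_posterior(samples, Sigma, mu0, Sigma0):
    """Posterior N(mu_n, Sigma_n) of mu for x ~ N(mu, Sigma), mu ~ N(mu0, Sigma0)."""
    X = np.asarray(samples, dtype=float)       # shape (n, d)
    n = X.shape[0]
    sample_mean = X.mean(axis=0)
    A = np.linalg.inv(Sigma0 + Sigma / n)      # (Sigma0 + Sigma/n)^{-1}
    mu_n = Sigma0 @ A @ sample_mean + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])     # assumed known covariance
mu0 = np.zeros(2)                              # hypothetical prior mean
Sigma0 = np.eye(2)                             # hypothetical prior covariance
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, -1.0], [2.0, 2.0]])

mu_n, Sigma_n = mvn_mean_posterior(X, Sigma, mu0, Sigma0)
print(mu_n)
print(Sigma_n)
```

For $d = 1$ these expressions reduce to the univariate formulas for $\mu_n$ and $\sigma_n^2$ .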

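Finally, let's think again about the sentence from the beginning of this section.

$\theta$ is regarded as a random variable, and the training data allow us to convert a distribution on this variable into a posterior probability density.

I am starting to get a feel for why $\theta$ is expressed as a random variable. Working through other distributions in the same way should make this more familiar.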

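In this post we applied Bayesian estimation, which differs from the earlier MLE approach, to univariate and multivariate Gaussian distributions. The next post, Ch3.5, covers the generalization to other distributions, "Bayesian Parameter Estimation: General theory".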
