[베이지안 딥러닝] Introduction - Decision Theory and Information Theory I

2020-2학기 이화여대 김정태 교수님 강의 내용을 바탕으로 본 글을 작성하였습니다.

Overview

Decison Theory
Information Theory

Decision problem

패턴 인식 문제를 풀 때는 불확실성이 존재하는 상황에서 의사 결정을 내려야 하는 경우가 많다. 이런 상황에서 결정 이론과 확률론을 함께 사용하면 최적의 의사 결정을 내릴 수 있다.

We observe random vector $\mathbf{x}$ and try to determine the probability of $C_k$ which represents the probability of an event belongs to the class $k$..
Determining $p(\mathbf{x}, C_k )$ from training data is an inference problem which is very difficult to solve.
Once the joint probability is known, marginal and conditional probabilities can be calculated using Bayes theorem.
Decision theory concerns how to make optimal decision given appropriate probabilities.
Decision stage is often very simple once we have solved the inference problem.

결합 확률 분포 $p(\mathbf{x}, C_k)$는 변수들의 전체 불확실설을 요약해서 나타낼 줄 것이다. 주어진 훈련 집합 데이터에서 $p(\mathbf{x}, C_k)$를 찾아내는 것은 추론(inference) 문제의 대표적인 예시다. 이는 매우 어려운 문제로, 이 문제에 대한 해결책이 앞으로 공부할 내용의 대부분을 차지하고 있기도 한다.

▶ Minimizing the miss-classification rate

One may try to minimize the miss-classification rate which is computed as follows

$$\begin{aligned} p(mistake) &= p(\mathbf{x} \in R_1, C_2) + p(\mathbf{x} \in R_2, C_1) \\ &= \int_{R_1} p(\mathbf{x}, C_2) d\mathbf{x} +\int_{R_2} p(\mathbf{x}, C_1) d\mathbf{x} \end{aligned}$$

To minimize the probbility, we have to decide $R_1$ as the region where $p(\mathbf{x}, C_2) \lt p(\mathbf{x}, C_1)$
Since $p(\mathbf{x}, C_1) = p(C_1 | \mathbf{x})p(x)$ and $p(\mathbf{x}, C_2) = p(C_2|\mathbf{x})p(\mathbf{x})$, we decide $R_1$ as the region where $p(C_1|\mathbf{x}) \gt p(C_2|\mathbf{x})$.

▶ More general case

더 일반적으로 $K$개의 클래스를 가진 경우에는 올바르게 분류된 경우의 확률을 극대화(최대화)하는 편이 조금 더 쉽다.

$$p(correct) = \sum_{k=1}^{K} p(\mathbf{x} \in R_k, C_k) = \sum_{k=1}^{K} \int_{R_k} p(\mathbf{x}, C_k) d\mathbf{x} \tag{3}\label{3}$$

The probability is maximized when the regions $R_k$ are chosen such that each $\mathbf{x}$ is assigned to the class for which $p(\mathbf{x}, C_k )$ is largest. Since $p(\mathbf{x}, C_k ) = p(C_k |\mathbf{x})p(\mathbf{x})$ and $p(\mathbf{x})$ is common to all terns, each $\mathbf{x}$ should be assigned to class having the largest posterior probability $p(C_k |\mathbf{x})$.

▶ Minimizing the expected loss (효용 함수)

For many cases, cost for mis-classification is different for each class. For example, cost for misclassifying cancer to non-cancer is much more larger than converse case..
One can design a cost function to reflect such situation.

$L_{kj}$ 는 잘못 분류한 대가로 예를 들면 다음과 같이 Matrix로 표현할 수 있음

비용 함수라고도 부르는 손실 함수를 도입하여 위의 문제점들을 공식화할 수 있다.

$$E[L]= \sum_{k} \sum_{j} \int_{R_j} L_{kj} p(\mathbf{x}, C_k) d\mathbf{x}. \tag{4}\label{4}$$

The decision rule that minimizes the expected loss is the one that assigns each new $\mathbf{x}$ to the class $j$ for which the quantity $\sum_k L_{k j} p(C_k | \mathbf{x})$ is minimum.
The task is trivial once we know the posterior probability $p(C_k |\mathbf{x})$.

각각의 클래스에 대한 사후 확률 $p(C_k|\mathbf{x})$를 알고 나면 위 방법을 쉽게 실행할 수 있다.

▶ Reject option (거부 옵션)

For some applications, it will be appropriate to avoid making decisions if the largest posterior probability $p(C_k |\mathbf{x})$ is less than some threshold, which is known reject option

몇몇 적용 사례에서는 오류 비용을 최소화 하기 위해 이처럼 결정이 힘든 지역에 대해서는 결정을 피하는 것이 더 적절할 수도 있다. (즉, 불확실한 부분은 사람이 직접 판단하는 것이 적절)

$\theta$(threshold)를 설정해서 사후 확률 $p(C_k|\mathbf{x})$들 중에서 가장 큰 값이 $\theta$보다 작거나 같은 경우 해당 입력값 $\mathbf{x}$를 거부하는 방식으로 이를 실행할 수 있음

▶ Inference and design (추론과 결정)

Classification problem can be broken down into inference stage in which we use training data to learn a model for $p(C_k |\mathbf{x})$ and subsequent decision stage in which we use the posterior probabilities to make optimal decision

지금까지 분류 문제를 두 개의 단계로 나누어서 보았음. 첫번째는 "추론 단계"로 훈련 집단을 활용하여 $p(C_k | \mathbf{x})$에 대한 모델을 학습시키는 단계, 두번째는 "결정 단계"로 학습된 사후 확률들을 이용해서 최적의 클래스 할당을 시행하는 것이다. 두 가지 문제를 한 번에 풀어내는 방식도 생각해 볼 수 있다. $\mathbf{x}$가 주어졌을 때 결정값을 돌려주는 함수를 직접 학습시키는 것 (이러한 함수를 판별 함수 (discriminant function)라고 부름)

Three distinct approaches (간단하게 설명 후 아래서 각 방법에 따른 장/단점 언급)

Generative model : First solve the inference problem of $p(\mathbf{x}|C_k)$ for each class $C_k$ , then use Bayes theorem to find $p(C_k |\mathbf{x})$. This method is generative since it may generate synthetic data points in the input space using sampling. (가장 손이 많이 가는 방식)
Discriminative model : Directly learn $p(C_k |\mathbf{x})$ from training data, then use decision theory to assign each new $\mathbf{x}$ to one of the classes (대부분의 경우)
Discrimination function : Learn a discriminative function which maps each input x directly onto a class label. Probability plays no role (확률론 사용 X)

Generative model

It can be wasteful of computational resources and excessively demanding of data to find joint distribution $p(\mathbf{x}, C_k)$.
For many cases, we only need posterior probability for decision (결합 분포를 전부 계산하는 것보다 사후 확률을 직접 계산(Discriminative model)하는 방식이 더 효율적일 것)

Discriminative model (Generative model 보다 효율적인 방법)

Minimizing risk : One can revise the minimum risk decision criteria easily if posterior probability has been computed. If we have only a discriminant function, re-training is required.
Compensating class priors : For many applications, training data is unbalanced. For example 1% of cancer data..
Learning algorithm should be exposed to a broad range of examples
One may use balanced data set for training and revise the posterior probability estimated using the artificial balanced set, which is not possible for discriminant function approach
Combining model : Also, it may be possible to combine two models
The class probability $p(C_k )$ can be easily estimated from the fraction of data points in each class

식 넣기

▶ Loss function for regression (회귀에서의 손실 함수)

지금까지 classification problem를 바탕으로 decision theory에 대해 살펴보았다. 지금부터는 regression problem의 경우에 대해 살펴보자. 앞에서 살펴본 curved fitting problem이 regression problem에 해당되며, 각각의 $\bf{x}$에 대해서 $t$의 추정값 $y(\mathbf{x})$를 선택해야 한다. 이 과정에서 loss(손실) : $L(t, y(\mathbf{x}))$가 발생한다고 가정해 보자. 그러면 평균(기대) 손실은 아래와 같이 주어질 것이다.

Average of expected loss for regression

$$
\begin{aligned}
E[L] &=\iint L(t, y(\mathbf{x})) p(\mathbf{x}, t) d \mathbf{x} d t \\
&=\iint(y(\mathbf{x})-t)^{2} p(\mathbf{x}, t) d \mathbf{x} d t
\end{aligned} \tag{5}\label{5}
$$

참고 : regression problem에서 일반적으로 loss function로 사용하는 것은 $(y(\mathbf{x})-t)^{2}$로 주어지는 제곱 손실

목표는 $E[L]$을 최소화하는 $y(\mathbf{x})$를 선택 해야하며, $y(\mathbf{x})$를 결정할 수 있다고 가정하면, 변분법(Calculus of variations, Euler-Lagrange equation)을 적용해서 아래와 같이 적을 수 있다.

Euler-Lagrange equation

$$
\begin{aligned}
\frac{\partial E[L]}{\partial y(\mathbf{x})} &=2 \int(y(\mathbf{x})-t) p(\mathbf{x}, t) d t=0 \\
y(\mathbf{x}) &=\int t p(t | \mathbf{x}) d t=E_{t}[t |
\mathbf{x}]
\end{aligned} \tag{6}\label{6}
$$

참고 : $y(\mathbf{x})$에 대해 해를 구하고 확률의 합/곱 법칙을 적용하면 위의 식을 얻을 수 있다.

식 (6)는 $\bf{x}$가 주어졌을 때의 $t$의 조건부 평균으로써 regression function(회귀 함수)라고 한다. 결과는 아래 시각화되어 있다. target 변수가 $\bf{t}$로 표현되는 다차원 변수일 경우, 최적의 해는 조건부 평균 $\bf{y}(\bf{x}) = E_t [\bf{t}|\bf{x}]$다.

위 결과는 Euler-Lagrange equation를 사용하지 않고, 다른 방법으로도 도출 할 수 있음. (이 방법은 regression problem에 대해 본질적으로 더 자세히 살펴볼 수 있음) 최적의 해가 조건부 기대값 $E[t | \mathbf{x}]$이라는 지식을 바탕으로 제곱항을 다음과 같이 전개할 수 있다.

Conditional expectation is the minimizer of the loss function

$$
\begin{aligned}
(y(\mathbf{x})-t)^{2} &=\left(y(\mathbf{x})-E[t | \mathbf{x}]+E[t | \mathbf{x}]-t)^{2}\right.\\
&=(y(\mathbf{x})-E[t | \mathbf{x}])^{2}+2(y(\mathbf{x})-E[t | \mathbf{x}])(E[t | \mathbf{x}]-t) +(E[t | \mathbf{x}]-t)^{2}
\end{aligned} \tag{7}\label{7}
$$

위 전개 결과인, 식 (7)을 loss function에 대입하고, $t$에 대하여 적분하면 교차항들이 사라지게 된다. 그 결과로 손실함수를 얻을 수 있다.

$$
E[L]=\int(y(\mathbf{x})-E[t | \mathbf{x}])^{2} p(\mathbf{x}) d \mathbf{x}+\int \operatorname{var}[t | \mathbf{x}] p(\mathbf{x}) d \mathbf{x} \tag{8}\label{8}
$$

첫번째 항 : 찾고자 하는 함수 $y(\mathbf{x})$ 존재 → $y(\mathbf{x}) = E[t|\mathbf{x}]$일 때 이 항은 최소화되어 항 자체가 사라지게됨 (변분법, Euler-Lagrange equation을 이용해 도출한 결과와 동일함), 즉, 최적의 최소 제곱 예측은 조건부 평균으로 주어짐
두번째 항 : $t$에 대한 분포의 분산을 계산하고, 이를 $\bf{x}$에 대해 평균을 낸 것 → taget이 가지고 있는 내재적인 변동성을 표현하는 것으로, noise 로 해석할 수 있음. ($y(\bf{x})$에 대해 독립적이며, 절대로 더 이상 줄일 수 없는 손실 함수의 최솟값에 해당)

[Figure] Euler-Lagrange equation 사용유무에 따라 loss function을 만족하는 y(x)의 해 도출 방법 I, II

Other loss functions can be used. e.g., Minkowski loss function

$$
E\left[L_{q}\right]=\int(y(\mathbf{x})-t)^{q} p(\mathbf{x}, t) d \mathbf{x} d t \tag{9}\label{9}
$$

regression problem의 loss function로 제곱 손실 이외에 다른 것을 사용하는 것도 가능하다. 실제로 제곱 손실이 상당히 좋지 않은 결과를 가져오므로, 더 복잡한 접근법을 사용해야하는 경우도 종종 있음. (e.g. 조건부 분포가 $p(t|\mathbf{x})$가 다봉 분포인 상황). 제곱 손실을 일반화환 예시인 민코프스키 손실의 기댓값은 식 (9)와 같이 주어짐

▶ Three approaches for regression problem

First solve the inference problem of determining joint density $p(\bf{x}, t)$. Then find conditional density and conditional mean.
First solve the inference problem of determining the conditional density $p(t|\bf{x})$ and then subsequently calculate the conditional mean.
Find a $y(\mathbf{x})$(regression function) directly from the training data.

각각의 방법으로 위에서부터 복잡한 방법...

각각의 방식의 장단점은 classification problem에서의 세 가지 방식의 장단점과 동일함 (다시 읽어보기)

Reference

Pattern Recognition and Machine Learning
PRML Example Code (git) : github.com/ctgk/PRML

저작자표시

'패턴인식과 머신러닝 > Ch 01. Introduction' 카테고리의 다른 글

[베이지안 딥러닝] Appendix. Calculus of Variations (변분법) (0)	2021.02.08
[베이지안 딥러닝] Introduction - Curve Fitting (4)	2020.09.16
[베이지안 딥러닝] Introduction (2)	2020.09.01
1.2 Probability Theory (0)	2020.07.09

목	금	토	일	월	화

-7°	-12°	-9°	-6°	-2°	-1°
-16°	-17°	-15°	-12°	-7°	-6°

DeepHaejoong

[베이지안 딥러닝] Introduction - Decision Theory and Information Theory I

Overview

Decision problem

Reference

'패턴인식과 머신러닝 > Ch 01. Introduction' 카테고리의 다른 글

댓글

티스토리툴바

[베이지안 딥러닝] Introduction - Decision Theory and Information Theory I

Overview

Decision problem

Reference

'패턴인식과 머신러닝 > Ch 01. Introduction' 카테고리의 다른 글

관련글

댓글

티스토리툴바