[Bayesian Deep Learning] Ch3.3 Bayesian linear regression

This post is based on the lectures given by Professor 김정태 at Ewha Womans University in the second semester of 2020.

Overview

Linear regression (MLE)
Bias-Variance Decomposition
Bayesian linear regression
Bayesian model comparison
The evidence approximation
Limitations of fixed basis functions

Bayesian linear regression

▶ Why is a Bayesian approach needed?

Model complexity (such as the number of basis functions) needs to be determined according to the size of the data set.
Cross-validation can be computationally expensive and, more importantly, wastes valuable data.
A Bayesian treatment can avoid over-fitting and determine model complexity using the training data only.

Choosing an appropriate model complexity (for example the regularization coefficient $\lambda$) for a given problem by cross-validation requires a validation set held out from the training data, which uses up limited data and compute. Treating linear regression in a Bayesian way avoids the over-fitting problem of maximum likelihood and lets the model complexity be determined automatically from the training data alone. To keep the discussion simple, only the case of a single target variable $t$ is considered here.

▶ 3.3.1 Parameter distribution

Conjugate prior

$$p(\mathbf{w}) = N(\mathbf{w}| \mathbf{m}_0, \mathbf{S}_0) \tag{1}\label{1}$$

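Bayesian linear regression starts by introducing a prior distribution $p(\mathbf{w})$ over the model parameters $\mathbf{w}$. The next step is to compute the posterior distribution, which is proportional to the product of the prior and the likelihood function.

Note : for the derivation of Eq. (2), see [Bayesian Deep Learning] Introduction - Curve Fitting (Bayesian curve fitting)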

Posterior

$$p(\mathbf{w}| \mathbf{t}) = N(\mathbf{w}|\mathbf{m}_N, \mathbf{S}_N) \tag{2}\label{2}$$

where

$$\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta\Phi^T\mathbf{t}) \tag{3}\label{3}$$

$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \Phi^T \Phi \tag{4}\label{4}$$

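Because the posterior distribution is Gaussian, its mode coincides with its mean, so the maximum posterior weight vector is simply $\mathbf{w}_{MAP}=\mathbf{m}_N$. Now consider an infinitely broad prior, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$ with $\alpha \rightarrow 0$. In this case the posterior mean $\mathbf{m}_N$ reduces to the maximum likelihood solution $\mathbf{w}_{ML}$. Similarly, if $N = 0$ the posterior is the same as the prior. Furthermore, if the data points arrive sequentially, the posterior at each stage becomes the prior for the next stage.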
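The three limiting statements above can be checked numerically. The following is a minimal NumPy sketch, not code from the lecture: the function name `posterior`, the synthetic straight-line data, and the choice $\phi(x) = (1, x)^T$ are all assumptions made for illustration.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Gaussian posterior N(w | mN, SN) computed from Eqs. (3) and (4)."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi          # Eq. (4)
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)    # Eq. (3)
    return mN, SN

# Hypothetical synthetic data: a noisy straight line.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
Phi = np.column_stack([np.ones_like(x), x])       # assumed basis phi(x) = (1, x)
beta = 25.0
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, size=20)

m0, S0 = np.zeros(2), np.eye(2) / 2.0             # prior with alpha = 2

# Batch update with all 20 points at once.
m_batch, S_batch = posterior(Phi, t, m0, S0, beta)

# Sequential update: the posterior after each point is the next prior.
m_seq, S_seq = m0, S0
for phi_n, t_n in zip(Phi, t):
    m_seq, S_seq = posterior(phi_n[None, :], np.array([t_n]), m_seq, S_seq, beta)

# Very broad prior (alpha close to 0): the posterior mean approaches w_ML.
m_broad, _ = posterior(Phi, t, m0, np.eye(2) / 1e-10, beta)
w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]

print(np.allclose(m_seq, m_batch), np.allclose(m_broad, w_ml, atol=1e-6))
```

Processing the points one at a time and processing them all at once should give the same posterior up to floating-point error, which is what makes the sequential view useful for streaming data.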

To simplify the treatment, the rest of this chapter uses the specific Gaussian prior of Eq. (5).

$$ p(\mathbf{w}|\alpha) = N(\mathbf{w}|\mathbf{0}, \alpha^{-1}\mathbf{I}) \tag{5}\label{5}$$

The corresponding posterior over $\mathbf{w}$ is again given by Eq. (2), now with

$$\mathbf{m}_N = \beta \mathbf{S}_N \Phi^{T}\mathbf{t} \tag{6}\label{6}$$

$$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta\Phi^{T}\Phi \tag{7}\label{7}$$

log of the posterior distribution

$$\ln p(\mathbf{w}|\mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T\phi(\mathbf{x}_n) \}^2 - \frac{\alpha}{2} \mathbf{w}^T\mathbf{w} + \text{const.} \tag{8}\label{8}$$

The log posterior is the sum of the log likelihood and the log prior; written as a function of $\mathbf{w}$, it takes the form above.

Maximization of the posterior distribution with respect to $\mathbf{w}$ is equivalent to minimization of the sum of squares error function with quadratic regularization with $\lambda = \alpha / \beta$.
The MAP estimate is obtained by finding the maximizer of the posterior distribution.
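The equivalence with regularized least squares can be verified directly. The sketch below compares the posterior mean from Eqs. (6) and (7) with the ridge solution for $\lambda = \alpha / \beta$; the cubic polynomial basis and the sinusoidal toy data are illustrative assumptions only.

```python
import numpy as np

# Hypothetical toy data and basis, chosen only for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)
Phi = np.vander(x, 4, increasing=True)    # columns 1, x, x^2, x^3

alpha, beta = 2.0, 25.0
M = Phi.shape[1]

# Posterior mean (= MAP estimate) from Eqs. (6) and (7).
SN_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(SN_inv, Phi.T @ t)

# Regularized least squares: minimize 0.5*||t - Phi w||^2 + 0.5*lam*||w||^2.
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(m_N, w_ridge))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a design choice for numerical stability; it does not change the result.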

Example : Illustration of Bayesian learning

Using a simple linear fitting example, let us look at Bayesian learning in a basis function model and at the sequential update of the posterior distribution.

Model : $y (x, w) = w_0 + w_1 x$
Underlying function $f(x) = -0.3 + 0.5x$
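Data sample : $x_n \sim U(-1,1)$ and $t_n = f(x_n) + e_n$, where the noise $e_n$ is drawn from a zero-mean Gaussian distribution with standard deviation $\sigma = 0.2$ (the data are generated synthetically)

Goal : recover the true coefficients $a_0 = -0.3$ and $a_1 = 0.5$ of $f$ from the sample data by inferring $w_0$ and $w_1$ (the precision parameters $\alpha$ and $\beta$ are held fixed)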

$\alpha = 2.0$
$\beta = (\frac{1}{0.2})^2$

Following the panels of [Figure 1] in order from ⓐ to ⓝ shows how Bayesian learning proceeds through sequential updates. It also shows how the result depends on the size of the data set.

[Figure 1] Illustration of sequential Bayesian learning for a simple linear model of the form y(x, w) = w0 + w1x

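In the limit of infinitely many observed data points, the posterior over $\mathbf{w}$ becomes a delta function centred on the true parameter values (white cross). In other words, although it is still a distribution, it is expected to concentrate around the maximum likelihood estimate obtained from the collected data.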
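The sequential learning shown in Figure 1 can be imitated numerically. This is a rough sketch under the settings stated above ($\alpha = 2.0$, $\beta = 25$, noise standard deviation 0.2); the random seed, the helper name `design`, and the checkpoints at which the posterior is recorded are assumptions, so the numbers will not match the figure exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
a0, a1 = -0.3, 0.5                    # true parameters of f(x) = a0 + a1 * x
alpha, beta = 2.0, (1 / 0.2) ** 2

def design(x):
    """Design matrix for the model y(x, w) = w0 + w1 * x."""
    x = np.atleast_1d(x)
    return np.column_stack([np.ones_like(x), x])

# Start from the prior N(w | 0, alpha^{-1} I), Eq. (5).
m, S = np.zeros(2), np.eye(2) / alpha
history = [(0, m.copy(), S.copy())]

for n in range(1, 1001):
    x_n = rng.uniform(-1, 1)
    t_n = a0 + a1 * x_n + rng.normal(0, 0.2)
    phi = design(x_n)                                  # shape (1, 2)
    S_inv_old = np.linalg.inv(S)
    S = np.linalg.inv(S_inv_old + beta * phi.T @ phi)  # Eq. (4), current posterior as prior
    m = S @ (S_inv_old @ m + beta * phi.T @ np.array([t_n]))  # Eq. (3)
    if n in (1, 2, 20, 1000):
        history.append((n, m.copy(), S.copy()))

for n, m_n, S_n in history:
    print(f"N={n:4d}  mean={m_n.round(3)}  trace(S_N)={np.trace(S_n):.5f}")
```

As more points are absorbed, the trace of $\mathbf{S}_N$ shrinks and the mean moves towards $(a_0, a_1)$, which is the numerical counterpart of the posterior collapsing onto the white cross.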

▶ 3.3.2 Predictive distribution

Predictive distribution

$$p(t |x, \mathbf{t}, \alpha, \beta) = \int p(t|\mathbf{w}, \beta)p(\mathbf{w}|\mathbf{t}, \alpha, \beta) d \mathbf{w} \tag{9}\label{9}$$

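In practical applications, predicting the value of $t$ for a new input $\mathbf{x}$ can matter more than finding $\mathbf{w}_{MAP}$ itself. For this purpose, consider the predictive distribution defined in Eq. (9), which evaluates to the following Gaussian.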

$$p(t |x, \mathbf{t}, \alpha, \beta) = N(t | \mathbf{m}_N^T \mathbf{\phi(\mathbf{x})}, \sigma^2_N(\mathbf{x})) \tag{10}\label{10}$$

where the variance of the predictive distribution is

$$\sigma^2_N(\mathbf{x}) = \frac{1}{\beta} + \mathbf{\phi(\mathbf{x})}^T \mathbf{S}_N \mathbf{\phi(\mathbf{x})} \tag{11}\label{11}$$

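In Eq. (11), the first term $\frac{1}{\beta}$ represents the noise on the data, and the second term reflects the uncertainty associated with the parameters $\mathbf{w}$. As additional data points are observed, the posterior becomes narrower, and as a consequence it can be shown that $\sigma^2_{N+1}(\mathbf{x}) \leqslant \sigma^2_{N}(\mathbf{x})$. In the limit $N \rightarrow \infty$, the variance of the predictive distribution contains only the data noise governed by the parameter $\beta$.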
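Both properties of Eq. (11), the noise floor $1/\beta$ and the shrinking variance as data are added, can be checked with a short sketch. The basis (9 Gaussians on $[0, 1]$ with width `s = 0.1`), the sinusoidal toy data, and the function names are assumptions for illustration, not the exact setup of the lecture.

```python
import numpy as np

def gaussian_basis(x, centres, s=0.1):
    """Gaussian basis functions; the centres and width s are illustrative choices."""
    x = np.atleast_1d(x)[:, None]
    return np.exp(-0.5 * ((x - centres[None, :]) / s) ** 2)

def predictive(x_new, x_train, t_train, centres, alpha, beta):
    """Predictive mean and variance from Eqs. (10) and (11)."""
    Phi = gaussian_basis(x_train, centres)
    SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)  # Eq. (7)
    mN = beta * SN @ Phi.T @ t_train                                       # Eq. (6)
    phi = gaussian_basis(x_new, centres)
    mean = phi @ mN
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", phi, SN, phi)              # Eq. (11)
    return mean, var

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
centres = np.linspace(0, 1, 9)
x_all = rng.uniform(0, 1, size=25)
t_all = np.sin(2 * np.pi * x_all) + rng.normal(0, 0.2, size=25)
x_grid = np.linspace(0, 1, 50)

sizes = (1, 2, 4, 25)
variances = [predictive(x_grid, x_all[:n], t_all[:n], centres, alpha, beta)[1]
             for n in sizes]
for n, var in zip(sizes, variances):
    print(f"N={n:2d}  mean predictive variance={var.mean():.4f}")
```

Because each data set here is a prefix of the next one, the variance at every grid point should be non-increasing from one size to the next and should never drop below $1/\beta$.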

Example : Predictive uncertainty depends on $x$ and is smallest in the neighborhood of data points

To illustrate the predictive distribution of the Bayesian linear regression model, let us return to the sinusoidal data set of Section 1.1.

[Figure 2] Examples of the predictive distribution (3.58) for a model consisting of 9 Gaussian basis functions of the form (3.4) using the synthetic sinusoidal data set of Section 1.1. See the text for a detailed discussion.

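[Figure 2] shows only the point-wise predictive variance as a function of $x$.

Green curve : the true function
Blue points : data points
Red curve : mean of the corresponding Gaussian predictive distribution
Red shaded region : one standard deviation on either side of the mean

The predictive uncertainty depends on $x$ and is smallest in the neighborhood of the data points. The level of uncertainty also decreases as more data points are observed.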

To examine the covariance between the predictions at different values of $x$, [Figure 3] draws samples of $\mathbf{w}$ from the posterior distribution and plots the corresponding functions $y(x, \mathbf{w})$.

[Figure 3] Plots of the function y(x, w) using samples from the posterior distributions over w corresponding to the plots in Figure 2

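With localized basis functions such as Gaussians, the contribution of the second term of the predictive variance in Eq. (11) goes to zero in regions far from the basis function centres, leaving only the noise contribution $\beta^{-1}$. Taken at face value, this means the model becomes most confident when extrapolating outside the region covered by the basis functions, which is undesirable. Gaussian processes, an alternative approach, avoid this problem.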
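This over-confidence when extrapolating is easy to reproduce. In the sketch below, the 9 Gaussian basis functions on $[0, 1]$, the width `s`, and the query points 0.5 and 3.0 are all assumed values picked for illustration.

```python
import numpy as np

alpha, beta, s = 2.0, 25.0, 0.1
centres = np.linspace(0, 1, 9)

def phi(x):
    """Gaussian basis functions only (no constant term); centres and s are illustrative."""
    return np.exp(-0.5 * ((np.atleast_1d(x)[:, None] - centres[None, :]) / s) ** 2)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=10)
Phi = phi(x_train)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)   # Eq. (7)

def predictive_var(x):
    p = phi(x)
    return 1.0 / beta + np.sum((p @ SN) * p, axis=1)         # Eq. (11)

inside = predictive_var(0.5)[0]    # within the region covered by the basis functions
outside = predictive_var(3.0)[0]   # far from every basis function centre
print(inside, outside, 1.0 / beta)
```

At the far query point every basis function is numerically zero, so the variance falls back to the noise level $1/\beta$ even though no data were observed there.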

If both $\mathbf{w}$ and $\beta$ are unknown, a conjugate prior $p(\mathbf{w}, \beta)$ can be used. It is given by a Gaussian-gamma distribution, and the resulting predictive distribution is a Student's t-distribution.

▶ 3.3.3 Equivalent kernel

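Eq. (6) above, $\mathbf{m}_N = \beta \mathbf{S}_N \Phi^{T}\mathbf{t}$, is the posterior mean of the weights in the linear basis function model. This posterior mean solution can be interpreted in an interesting way, which is a first step towards kernel methods, including Gaussian processes. Substituting Eq. (6) into Eq. (12) below gives Eq. (13).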

$$y(\mathbf{x}, \mathbf{w})=\sum_{j=0}^{M-1} w_{j} \phi_{j}(\mathbf{x}) = \mathbf{w}^T\mathbf{\phi}(\mathbf{x}) \tag{12}\label{12}$$

The predictive mean

$$y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^T \mathbf{\phi(\mathbf{x})} = \beta \phi(\mathbf{x})^T \mathbf{S}_N \Phi^T \mathbf{t} = \sum_{n=1}^{N} \beta \phi(\mathbf{x})^T \mathbf{S}_N \phi(\mathbf{x}_n)t_n \tag{13}\label{13}$$

(To do : illustrate this with example data.)

We can therefore write the predictive mean as

$$y(\mathbf{x}, \mathbf{m}_N) = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) t_n \tag{14}\label{14}$$

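The mean of the predictive distribution at a point $\mathbf{x}$ is a linear combination of the training set target variables $t_n$, and can be written as in Eq. (14).

Here the function $k(\mathbf{x}, \mathbf{x}^{'})$ is defined as follows. It is known as the smoother matrix or the equivalent kernel.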

$$k(\mathbf{x}, \mathbf{x}^{'}) = \beta \phi(\mathbf{x})^T \mathbf{S}_N \mathbf{\phi(\mathbf{x}^{'})} \tag{15}\label{15}$$

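The mean of the predictive distribution is a weighted combination of the target values, in which data points close to $x$ are given higher weights.

Regression functions like this, which make predictions by taking linear combinations of the training set target values, are called linear smoothers. The equivalent kernel depends on the input values $\mathbf{x}_n$ of the data set, because they appear in the definition of $\mathbf{S}_N$. [Figure 4] shows the equivalent kernel for Gaussian basis functions: the kernel function $k(x, x^{'})$ is plotted as a function of $x^{'}$ for three different values of $x$, and each is localized around $x$. The predictive mean at $x$, given by $y(x, \mathbf{m}_N)$, is a weighted combination of the target values, in which targets at points close to $x$ receive higher weight than those farther away.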
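To make the kernel view concrete, the sketch below builds $k(x, x^{'})$ from Eq. (15) and confirms that the weighted sum of targets in Eq. (14) reproduces $\mathbf{m}_N^T \phi(x)$. The 11 Gaussian basis functions on $[-1, 1]$, the 200 evenly spaced training inputs, and the three query points are assumptions chosen for illustration.

```python
import numpy as np

alpha, beta, s = 2.0, 25.0, 0.1
centres = np.linspace(-1, 1, 11)

def phi(x):
    """Gaussian basis functions; centres and width s are illustrative choices."""
    return np.exp(-0.5 * ((np.atleast_1d(x)[:, None] - centres[None, :]) / s) ** 2)

# Hypothetical training set.
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 200)
t_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, size=200)

Phi = phi(x_train)
SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)  # Eq. (7)
mN = beta * SN @ Phi.T @ t_train                                       # Eq. (6)

def equivalent_kernel(x, x_prime):
    """k(x, x') = beta * phi(x)^T S_N phi(x'), Eq. (15)."""
    return beta * phi(x) @ SN @ phi(x_prime).T

x_query = np.array([-0.5, 0.0, 0.5])
K = equivalent_kernel(x_query, x_train)         # K[i, n] = k(x_query[i], x_train[n])

mean_from_kernel = K @ t_train                  # Eq. (14)
mean_from_weights = phi(x_query) @ mN           # Eq. (13)
peak_locations = x_train[np.argmax(K, axis=1)]  # training input with the largest weight

print(np.allclose(mean_from_kernel, mean_from_weights))
print(peak_locations)
```

Each row of `K` holds the weights applied to the training targets for one query point; plotting a row against `x_train` gives a curve comparable to one slice of Figure 4, and `peak_locations` shows where the largest weight falls for each query.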

[Figure 4] The equivalent kernel k(x, x') for the Gaussian basis functions

It seems reasonable that we should weight local evidence more strongly than distant evidence. This localization property holds not only for localized Gaussian basis functions but also for non-local polynomial and sigmoidal basis functions (see [Figure 5]).

[Figure 5] Examples of equivalent kernels k(x, x') for x = 0 plotted as a function of x'

Covariance between $y(\mathbf{x})$, $y(\mathbf{x}^{'})$

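Considering the covariance between $y(\mathbf{x})$ and $y(\mathbf{x}^{'})$, Eq. (16), gives further insight into the role of the equivalent kernel (the derivation is omitted here). From the form of the equivalent kernel, the predictive means at nearby points are highly correlated, whereas those at more distant points are less correlated.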

$$
\begin{aligned}
\text{cov}[y(\mathbf{x}), y(\mathbf{x}^{'})] &= \text{cov}[\mathbf{\phi(\mathbf{x})}^T\mathbf{w}, \mathbf{w}^T \mathbf{\phi(\mathbf{x}^{'})}] \\
&=\mathbf{\phi(\mathbf{x})^T}\mathbf{S}_N \mathbf{\phi(\mathbf{x}^{'})} \\
&= \beta^{-1} k(\mathbf{x}, \mathbf{x}^{'})
\end{aligned} \tag{16}\label{16}
$$
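For readers who want the intermediate steps of Eq. (16), one way to fill them in is sketched below. It assumes only that $\mathbf{w}$ follows the posterior $N(\mathbf{w}|\mathbf{m}_N, \mathbf{S}_N)$ of Eq. (2), so that $\mathbb{E}[\mathbf{w}] = \mathbf{m}_N$, and uses the kernel definition in Eq. (15).

$$
\begin{aligned}
\text{cov}[y(\mathbf{x}), y(\mathbf{x}^{'})]
&= \mathbb{E}\left[\phi(\mathbf{x})^T(\mathbf{w}-\mathbf{m}_N)\,(\mathbf{w}-\mathbf{m}_N)^T\phi(\mathbf{x}^{'})\right] \\
&= \phi(\mathbf{x})^T\,\mathbb{E}\left[(\mathbf{w}-\mathbf{m}_N)(\mathbf{w}-\mathbf{m}_N)^T\right]\phi(\mathbf{x}^{'}) \\
&= \phi(\mathbf{x})^T \mathbf{S}_N\, \phi(\mathbf{x}^{'})
= \beta^{-1}\left(\beta\,\phi(\mathbf{x})^T \mathbf{S}_N\, \phi(\mathbf{x}^{'})\right)
= \beta^{-1} k(\mathbf{x}, \mathbf{x}^{'})
\end{aligned}
$$

The second line uses the fact that $\phi(\mathbf{x})$ and $\phi(\mathbf{x}^{'})$ are fixed vectors, so they can be moved outside the expectation.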

Localization property holds not only for the localized basis functions such as Gaussian but also for non-localized basis function such as polynomial basis
Instead of introducing basis functions, it is possible to design equivalent kernel and use this to make predictions which leads to a framework of Gaussian process.
Note that under mild conditions $\sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)=1$ for all values of $\mathbf{x}$.

The equivalent kernel thus defines the weights by which the training set target values are combined to make a prediction at a new value of $\mathbf{x}$, and these weights sum to one over the training points (Exercise 3.14).
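The summation property can be checked numerically. One setting in which the sum is exactly one, assumed here, is $\alpha = 0$ with a basis that contains the constant function; the cubic polynomial basis, the random training inputs, and the helper name `kernel_row_sums` are illustrative assumptions. With $\alpha > 0$ the sum is generally only close to one.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=50)

def phi(x, degree=3):
    """Polynomial basis 1, x, ..., x^degree; it includes the constant function."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

def kernel_row_sums(alpha, beta, x_query):
    """Sum over training points n of k(x, x_n), for each query point x."""
    Phi = phi(x_train)
    SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)  # Eq. (7)
    K = beta * phi(x_query) @ SN @ Phi.T        # K[i, n] = k(x_query[i], x_train[n])
    return K.sum(axis=1)

x_query = np.linspace(-1, 1, 5)
sums_no_prior = kernel_row_sums(alpha=0.0, beta=25.0, x_query=x_query)
sums_with_prior = kernel_row_sums(alpha=2.0, beta=25.0, x_query=x_query)

print(sums_no_prior)     # sums under the assumed alpha = 0 setting
print(sums_with_prior)   # sums with the prior of Eq. (5), alpha = 2
```

Comparing the two printed vectors shows how much the prior precision $\alpha$ perturbs the normalization of the weights.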

Reference

Pattern Recognition and Machine Learning
PRML Example Code (git) : github.com/ctgk/PRML
