Lecture 9 : CNN Architectures

CNN Architectures

Case Studies

Also...

NiN (Network in Network)
DenseNet
Wide ResNet
FractalNet
ResNeXT
SqueezeNet
Stochastic Depth

Review : LeNet-5

CNN을 처음으로 도입하였으며, 우편 번호와 숫자를 확인하는데 사용하였다고 한다.

AlexNet (ILSVR'12 winner)

AlexNet은 8개의 Layer로 구성되어 있다. (참고 : Convolution Layer (5개) 및 Fully Connected Layer (3개))

Output size $=\frac{\text { input_size }-\text { filter }+\left(2^{*} \text { padding }\right)}{\text { stride }}+1$

Input : 227 x 227 x 3 images 일때,

Q1 : First Layer(CONV1)에서 96 11 X 11 fillters applied at stride 4 를 통과된 output size 및 volume?

96 55 X 55 <Hint : (227 - 11)/4 + 1 = 55>

Q2 : What is the total number of parameters in this layer?

Parameters : (11*11*3)*96 = 35K

Q3 : Second layer (POOL1) : 3 X 3 filters applied at stride 2 를 통과된 output size 및 volume?

96 27 X 27 <Hint : (55 - 3)/2 + 1 = 27>

Q4 : What is the total number of parameters in this layer?

Parameters : 0 (CONV/FC는 parameter가 있지만, ReLU 및 Pool 등은 parameter가 없음)

Detail

ReLU 함수 : 처음으로 도입
used Norm layers (not common layer)
heavy data augmentaion
dropout : 0.5
batch size : 128
SGD Momentum : 0.9
Learning rate : 0.01 (reduced by 10 manually when val accuracy plateasus)
L2 weight decay : 0.0005
7 CNN ensemble : 18.2 % -> 15.4 %

위 그림에서 Feature Map이 앞서 output volume과 다른 이유는 이때 당시의 GPU의 Memory가 작기 때문에 96/2 = 48로 나누어서 모델을 학습시켰다고 합니다. (즉, 2개의 GPU를 사용함)

위에서 CONV1, CONV2, CONV4, CONV5는 2개 GPU가 각각 독립적으로 Feature map 생성

위에서 CONV3. FC6, FC7, FC8은 2개 GPU개를 cross하여 Feature map 생성

2012년 ImageNet에서 우승을 하였으며, 최초로 CNN을 활용하여 우승, 이후 CNN이 급속도로 발전, transfer learning을 AlexNet으로도 많이 했다고 합니다.

ZFNet (AlexNet에서 parameter변화)

AlexNet 과 ZFNet의 차이점

CONV1 : 첫번째 CONV를 11 X 11 stride 4 (AlexNet) 에서 7 X 7 stride 2 (ZFNet)으로 변경
CONV3, 4, 5 : 채널을 384, 384, 256 (AlexNet)에서 512, 1024, 512 (ZFNet)으로 변경

결론적으로 기존 ImageNet top 5 error가 16.4% 에서 11.7% 로 감소되었다.

지금부터 Layer의 갯수가 더 많아지는 대표 model을 살펴볼 것이다.

VGGNet

Filter Size가 작아질수록 Network는 깊어짐

AlexNet은 8개의 Layer를 사용했지만. VGG 16Net은 16개 ~ 19개 layers를 사용함
3 X 3 CONV stride 1, pad 1 (Only, 모든 layer 동일 적용)
2 X 2 MAX POOL stride 2 (Only, 모든 layer 동일 적용)

ILSVRC'13에서 ZFNet의 top error가 11.7%이던 것을 ILSVRC'14에서 VGG를 사용하여 7.3%까지 줄임

Q1 : Whe use smaller filters ? (3 X 3 CONV)

3 X 3 CONV (stride 1) layer는 7 X 7 Layer와 같은 effective receptive field를 가짐

Q2 : What is the effective receptive field of three 3x3 conv (stride 1) layers?

작은 필터를 사용하면 parameter를 줄일 수가 있고, 여러번 겹쳐서 사용하면 더 큰 필터가 표현하는 영역(Receptive Field)을 표현할 수 있게 됩니다
깊어지지만, more non-linearities 성질을 가짐으로 분류하기 좋음
3 X (3 X 3 X C(input 채널의 수)) = 27 C vs 7 X 7 X C = 49 C

Most memory is in early CONV
Most Params are in late FC

VGG는 메모리 사용량이 많은 단점이 있음. image 1장당 필요한 memory가 약 96mb이므로 pc의 memory가 1GB인 경우 10장만 처리할 수 있다. 그리고, 전체 params 138M로 AlexNet보다 약 2배 늘었다.

Tip : VGG에서 CONV가 많기 때문에 구조의 반복 편의를 위해 layer마다 이름을 위와 같이 지었다.

ILSVRC’14 2nd in classification, 1st in localization
- localization : 어디에 물체가 있는지(Bounding Box) + Classification
Similar training procedure as Krizhevsky 2012
No Local Response Normalisation (LRN)
Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
Use ensembles for best results
FC7 features generalize well to other tasks
- Featrue Representation!

Layer의 depth : total number of layers
input image의 depth : width x height x depth

GoogLeNet

22개의 layers (much deeper networks with computational efficiency)
"Inception" module이 도입되어 효율적임
FC가 없다.
parameter가 500만개만 존재함 (AlexNet보다 12배 적음)
ILSVRC'14 분류 대회에서 6.7%로 우승을 차지함

Inception module : 좋은 local network를 만들기 위해, network 사이에 local topology를 구현하여 쌓아올림

여러개의 Filter 연산(1x1 CONV, 3x3 CONV, 5x5 CONV, 3x3 MAX POOL)을 평행하게 진행한 후 각각의 filter map을 concat하여 하나의 filter map을 만든다.

Q : What is the problem with this? [Hine : Computational complexity]

계산을 하는데 비용이 많이 든다.

Q1 : What is the output size of the 1x1 conv, with 128 filters? (zero padding)

28 X 28 X 128 (1 x 1 CONV를 할 경우 W, H는 변하지 않으나, D(detpth)는 변한다)

Q2 : What are the output sizes of all different filter operations?

28 X 28 X 128
28 X 28 X 192
28 X 28 X 96
28 X 28 X 256

Q3 : What is output size after filter concatenation?

28 X 28 X (128 + 192 + 96 + 256) = 28 X 28 X 672

width 및 height가 28 X 28 로 같기 때문에 합칠 수 있음

CONV 연산 갯수 계산 결과

[1 X 1 CONV, 128] : (28 X 28 X 128) X (1 X 1 X 256)
[3 X 3 CONV, 192] : (28 X 28 X 192) X (3 X 3 X 256)
[5 X 5 CONV, 96] : (28 X 28 X 96) X (5 X 5 X 256)

Total : 854M로 많은 연산량을 필요로 한다.

Pooling Layer 계산 결과

module input의 depth를 유지하기 때문에 concat이후에 Filter map의 depth는 더 늘어날 수 밖에 없다.

따라서, depth가 커지고 expensive compute도 늘어나고 있다.

위의 문제점을 해결하기 위해 "Bottleneck"을 도입하여 feature map의 depth를 줄임으로써 계산 복잡도도 줄이게 된다.

1x1 conv로 depth를 줄임

CONV 연산 갯수 계산 결과

[1 X 1 CONV, 64] : (28 X 28 X 64) X (1 X 1 X 256)
[1 X 1 CONV, 64] : (28 X 28 X 64) X (1 X 1 X 256)
[1 X 1 CONV, 128] : (28 X 28 X 128) X (1 X 1 X 256)
[3 X 3 CONV, 192] : (28 X 28 X 192) X (3 X 3 X 64)
[5 X 5 CONV, 96] : (28 X 28 X 96) X (5 X 5 X 64)
[1 X 1 CONV, 64] : (28 X 28 X 64) X (1 X 1 X 256)

Total : 358M로 앞서 "bottleneck"을 도입하지 않은 모델의 연산량 854M보다 훨씬 적은 연산량을 필요로 한다.

참고 : 1 X 1 CONV를 수행하면 정보 손실이 없을까? 발생한다고 한다. 하지만, 동시에 redundancy가 있는 input features를 선형 결합을 하고 non-linear를 추가하면 네트워크가 더 깊어지는 효과가 있어서 도움이 된다고 한다.

Inception modules를 계속 쌓아서 GoogLeNet이 완성된다.

FC Layer의 parameter가 매우 많기 때문에, 가능한면 제거!

FC 대신에 global average pooling(보라색 부분)을 사용한 결과로 feature vector를 추출할 수 있다. (overfitting 방지 가능) 이어서 FC Layer를 통해 Softmax를 진행

보조 분류기 : 미니 네트워크로 불리며, 실제로 여기서도 loss를 구할 수 있다고 한다. (네트워크의 깊이가 깊기 때문에, gradient를 잃어버릴 수 있는 경우, 보조 분류기를 통해 추가적인 gradient를 얻어 학습을 도울 수 있다고 한다.)

21개의 CONV
1개의 FC (softmax를 위한 마지막 부분)

ResNet

152개의 layer를 사용함으로써 depth에 대한 엄청난 혁명을 가져옴

152-layers
ILSVRC'15 classification 우승 (3.57% top 5 error)
classification / detetection

Q : What’s strange about these training and test curves?

layer가 깊어질수록 training / test 성능이 좋지 않다. 심지어 overfitting도 아니다.

일반적으로 layer가 깊어지면 overfitting이 나올 가능성이 높다고 하지만, 위 그래프의 train 및 test trend가 비슷하게 그려지므로 overfitting이 아니다.

Hypothesis : the problem is an optimization problem, deeper models are harder to optimize

Deep한 모델이 shallower한 모델보다 성능이 좋아야 함

Solution : A solution by construction is copying the learned layers from the shallower model and setting additional layers to identity mapping.

입력 X를 F(x) 이후에 그대로 element 합으로 추가. 실제 학습할 때는 $F(x) = F(x) - x$인 잔차(residual)인 변화량만 학습하게 된다. 위의 방법을 사용하면 학습이 쉬워진다고 한다.

Residual block을 쌓음
기본적으로 모든 Residual block은 2개의 3 X 3 CONV layer를 구성되어 있음

주기적으로 Filter를 2배 늘린다.
block 마다 처음 CONV를 진행할 때 stride 2를 함으로써 downsample한다. (각 차원마다 /2)

처음 Input이 통과하는 시작점에는 7 X 7, output channel = 64 및 stride = 2인 CONV layer를 적용

GoogleNet과 같이 마지막 Layer에는 일반적인 FC Layer를 사용하지 않으며, 최종 CONV이후에 global average pooling을 적용한 뒤, class분류를 위한 softamax 전에 FC layer를 사용한다.

Version에 따라 18, 34, 50, 101, 152 layer가 있다. 아래는 논문에서 실험한 model 구조이다.

GoogleNet과 마찬가지로 Layer의 depth가 깊어질수록 parameter 및 memory에 따른 계산 복잡도가 엄청 커지게 된다. layer가 50개가 넘어가면 보통 bottleneck을 사용한다고 한다.
효율성 증대를 위해 사용

모든 CONV layer 이후에 Batch Normalization 적용
Xavier/2 초기화 적용
SGD + Momentum (0.9)
Learning rate : 0.1 (검증셋의 error가 줄어들지 않으면 10으로 나눔)
etc

identity 계산이 추가됨으로써 backpropagatation에 좋은 영향을 미친다고 한다. (기울기 손실 방지)

Comparing complexity

inception-v4 : ResNet + Inception
VGG : Highest memory, most operations
GoogLeNet : most efficient
AlexNet : Smaller compute, still memory heavy, lower accuracy
ResNet : Moderate efficieny depending on model, highest accuracy

Power consumption

Other architectures to know...

Network in Network(NiN)

Identity Mappings in Deep Residual Networks

depth를 더 늘려 성능 향상

Wide Residual Networks

기존 Filter의 갯수를 F개로 설정하였지만, 여기서는 F x k개의 Filter를 적용시킴으로 layer의 depth가 아닌 channel을 넓힘
50-layer wide ResNet의 성능은 original ResNet 152-layer와 비슷함
layer의 depth가 아닌 wide하게 filter를 추가함으로써 병렬연산이 가능하게 되어 컴퓨팅 효율을 향상 시킴

Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)

기존 Inception module에서 사용된 개념과 유사하게 residual block을 paralle 형태로 만듬

Deep Networks with Stochastic Depth

기울기 소실을 방지하고 학습하는 동안에 network를 계산없이 pass시킴으로써 학습 시간을 단축시키기 위함
학습 기간에 residual block layers들을 랜덤하게 빼준다. (죽임)
idenetity function으로 그냥 넘어감
test 에는 모든 layer 사용함 (마치 dropout과 비슷함)

Densely Connected Convolutional Networks

Dense Block 도입

끝...

저작자표시 (새창열림)

'CS231' 카테고리의 다른 글

Lecture 11 : Detection and Segmentation (0)	2020.05.17
Lecture 10 : Recurrent Neural Networks (0)	2020.05.09
Lecture 8 : Deep Learning Software (0)	2020.05.02
Lecture 7 : Training Neural Networks II (2)	2020.04.16
Lecture 6 : Training Neural Networks I (2)	2020.04.12

목	금	토	일	월	화

-7°	-12°	-9°	-6°	-2°	-1°
-16°	-17°	-15°	-12°	-7°	-6°

DeepHaejoong

Lecture 9 : CNN Architectures

CNN Architectures

Review : LeNet-5

AlexNet (ILSVR'12 winner)

ZFNet (AlexNet에서 parameter변화)

VGGNet

GoogLeNet

ResNet

Comparing complexity

Power consumption

Other architectures to know...

Network in Network(NiN)

Identity Mappings in Deep Residual Networks

Wide Residual Networks

Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)

Deep Networks with Stochastic Depth

Densely Connected Convolutional Networks

'CS231' 카테고리의 다른 글

댓글

티스토리툴바

Lecture 9 : CNN Architectures

CNN Architectures

Review : LeNet-5

AlexNet (ILSVR'12 winner)

ZFNet (AlexNet에서 parameter변화)

VGGNet

GoogLeNet

ResNet

Comparing complexity

Power consumption

Other architectures to know...

Network in Network(NiN)

Identity Mappings in Deep Residual Networks

Wide Residual Networks

Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)

Deep Networks with Stochastic Depth

Densely Connected Convolutional Networks

'CS231' 카테고리의 다른 글

관련글

댓글

티스토리툴바