
CGNet: A Light-weight Context Guided Network for Semantic Segmentation (2020)

The demand of applying semantic segmentation model on mobile devices has been increasing rapidly. Current state-of-the-art networks have enormous amount of parameters hence unsuitable for mobile devices, while other small memory footprint models follow the spirit of classification network and ignore the inherent characteristic of semantic segmentation.

https://ieeexplore.ieee.org/document/9292449

Code repository: wutianyiRosun/CGNet

Abstract

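Attempts to apply semantic segmentation in mobile environments are increasing → lightweight models are needed.

Some lightweight segmentation models reuse methods designed for classification, so they are built without regard to the characteristics that segmentation has to account for.

To address this, the paper introduces the Context Guided Network (CGNet) → a lightweight and efficient segmentation model.

CGBlock: learns the local feature and the surrounding context around it, and additionally uses global-context features to improve performance.

Built on the CGBlock, the network is designed to capture contextual information at every stage and to raise segmentation accuracy.

※ The local feature is the region that a convolutional filter operates on.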

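• CGNet is carefully designed to reduce the number of parameters and save memory footprint.

• It clearly outperforms segmentation models with a similar number of parameters.

• Experiments use the Cityscapes and CamVid datasets.

• Results are reported without post-processing or multi-scale testing.

• It achieves 64.8% mIoU on the Cityscapes dataset with fewer than 0.5M parameters.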

Introduction

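Autonomous driving / robotics / mobile → designing models with a small memory footprint and high accuracy is important.

The graph above plots the number of parameters against mean IoU on the Cityscapes dataset. Blue points are high-accuracy models, and red points are models with a small memory footprint.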

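• The blue-point models on the right are not suitable for mobile devices.

• The red-point models on the left follow only the design principles of image classification and ignore the inherent properties of segmentation, so their accuracy is low.

• CGNet therefore aims to raise accuracy by exploiting the inherent characteristics of segmentation.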

Segmentation involves both pixel-level categorization and object localization.

⇒ The paper proposes the CGBlock, designed to model spatial dependency and semantic contextual information effectively and efficiently.

The CGBlock…

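• Learns a joint feature that combines the local feature with its surrounding context.

It thereby learns the characteristics that the local feature and its surrounding context share in space.

• Refines the joint feature with the global context.

The global context is used to re-weight the joint feature channel by channel, emphasizing useful components and suppressing useless ones (increasing contrast).

• Is used at every stage of CGNet.

This lets the network capture context information at both the semantic level (deep layers) and the spatial level (shallow layers).

⇒ The authors argue that this makes the network better suited to segmentation.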

Related Work

small semantic segmentation models

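A reasonable trade-off is needed between accuracy and parameter count or memory usage.

• ENet: removes the last stage of existing segmentation models such as FCN and shows that segmentation is feasible on embedded devices.

• ICNet: proposes an image cascade network based on a compressed PSPNet to improve speed.

• ESPNet: proposes a fast and efficient network that can segment high-resolution images under resource constraints.

⇒ Most of these models follow the design principles of image classification models. ⇒ This lowers accuracy on segmentation, which is a pixel-level task.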

context information models

Recent work has shown that context information helps improve segmentation performance.

• Dilation8: applies multiple dilated convolutional layers after the class likelihood maps to aggregate multi-scale context.

• SAC (scale-adaptive convolution): applies receptive fields of variable size.

• DeepLab v3: uses ASPP (Atrous Spatial Pyramid Pooling) to capture context information at multiple scales.

attention models

Attention mechanisms are widely used to improve model capability.

• RNNsearch: proposes weighting the input words when predicting a target word in machine translation.

• The CGBlock uses an attention mechanism by computing a weight vector from global context information. → This refines the joint feature of the local feature and the surrounding context feature.

Proposed Approach

Context Guided Block (CGBlock)

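Fig. 3 above gives an overview of the CGBlock.

• In (a), the small yellow region is hard to classify on its own.

Yellow region: local feature

• In (b), including the red region helps recognize the class of that region.

Red region: surrounding context

• In (c), a purple rectangle covering the whole image is added to (b).

Purple region: global context

This is inspired by the human visual system: it models how recognizing the yellow region relies on the surrounding information in the red and purple regions.

(d) shows the structure of the CGBlock.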

• $f_{loc}(*)$: local feature extractor, a $3\times3$ standard convolutional layer. It learns features from the 8 neighboring directions (up, down, left, right, and the diagonals).

• $f_{sur}(*)$: surrounding context extractor, a dilated convolution, which has a wider receptive field with the same number of filters. It therefore captures and learns the surrounding context.

• $f_{joi}(*)$: the feature that aggregates $f_{loc}(*)$ and $f_{sur}(*)$. It is produced by concatenating $f_{loc}(*)$ and $f_{sur}(*)$, and batch normalization is applied after the concatenation.

• $f_{glo}(*)$: global context extractor. Its output is treated as a weight vector that emphasizes useful components and suppresses useless ones, and it is applied to the joint feature.

◦ The global context is obtained with GAP (global average pooling) followed by FC (linear) layers.

import torch.nn as nn


class FGlo(nn.Module):
    """
    The FGlo class refines the joint feature of the local feature and the surrounding context.
    """
    def __init__(self, channel, reduction=16):
        super().__init__()
        # Global average pooling squeezes each channel to a single value.
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # Two FC layers produce one weight in (0, 1) per channel.
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        # Scale layer: channel-wise re-weighting of the joint feature.
        return x * y

• A scale layer (element-wise multiplication) re-weights the joint feature with the extracted global context.

• $f_{loc}(*)$ and $f_{sur}(*)$ are each learned first, in separate branches.
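The channel-wise re-weighting described above (global average pooling, FC layers, then element-wise multiplication) can be traced on a small tensor. This is a minimal sketch rather than the authors' code: the tensor sizes, the reduction ratio of 16, and all variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a joint feature: batch of 2, 32 channels, 8x8 spatial (assumed sizes).
joint = torch.randn(2, 32, 8, 8)

# Global average pooling: one value per channel.
pooled = joint.mean(dim=(2, 3))                  # shape (2, 32)

# Two FC layers with a sigmoid give one weight in (0, 1) per channel.
mlp = nn.Sequential(
    nn.Linear(32, 32 // 16),
    nn.ReLU(),
    nn.Linear(32 // 16, 32),
    nn.Sigmoid(),
)
weights = mlp(pooled)                            # shape (2, 32)

# Scale layer: broadcast each channel weight over the spatial grid.
refined = joint * weights[:, :, None, None]      # shape (2, 32, 8, 8)

print(pooled.shape, weights.shape, refined.shape)
```

Every pixel in a channel is multiplied by the same weight, so the operation changes how much each channel contributes without altering its spatial pattern.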

import torch
import torch.nn as nn

# ConvBNPReLU, ChannelWiseConv, ChannelWiseDilatedConv and BNPReLU are helper
# modules defined elsewhere in the CGNet code base.


class ContextGuidedBlock(nn.Module):
    def __init__(self, nIn, nOut, dilation_rate=2, reduction=16, add=True):
        """
        args:
           nIn: number of input channels
           nOut: number of output channels
           dilation_rate: dilation rate of the surrounding-context convolution
           reduction: channel reduction ratio inside FGlo
           add: if True, use residual learning (requires nIn == nOut)
        """
        super().__init__()
        n = nOut // 2
        self.conv1x1 = ConvBNPReLU(nIn, n, 1, 1)  # 1x1 conv reduces the computation
        self.F_loc = ChannelWiseConv(n, n, 3, 1)  # local feature
        self.F_sur = ChannelWiseDilatedConv(n, n, 3, 1, dilation_rate)  # surrounding context
        self.bn_prelu = BNPReLU(nOut)
        self.add = add
        self.F_glo = FGlo(nOut, reduction)

    def forward(self, input):
        output = self.conv1x1(input)
        loc = self.F_loc(output)
        sur = self.F_sur(output)

        # Joint feature: concatenate along the channel axis, then BN + PReLU.
        joi_feat = torch.cat([loc, sur], 1)
        joi_feat = self.bn_prelu(joi_feat)

        output = self.F_glo(joi_feat)  # F_glo refines the joint feature
        # Residual version
        if self.add:
            output = input + output
        return output
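The block above relies on four helper modules whose definitions are not part of this excerpt. The following is a minimal sketch of how they could look. It assumes that `ChannelWiseConv` and `ChannelWiseDilatedConv` are depthwise convolutions (`groups` equal to the channel count) and that padding is chosen to keep the spatial size. The constructor signatures are inferred from the calls above and may differ from the official repository. The sketch also checks the point made about the surrounding context extractor: a 3×3 kernel with dilation d spans 3 + 2(d - 1) pixels per side, so dilation 2 covers a 5×5 window with the same number of weights.

```python
import torch
import torch.nn as nn


class BNPReLU(nn.Module):
    """Batch normalization followed by PReLU."""
    def __init__(self, nOut):
        super().__init__()
        self.bn = nn.BatchNorm2d(nOut)
        self.act = nn.PReLU(nOut)

    def forward(self, x):
        return self.act(self.bn(x))


class ConvBNPReLU(nn.Module):
    """Standard convolution followed by BN and PReLU."""
    def __init__(self, nIn, nOut, kSize, stride=1):
        super().__init__()
        padding = (kSize - 1) // 2
        self.conv = nn.Conv2d(nIn, nOut, kSize, stride=stride,
                              padding=padding, bias=False)
        self.bn_prelu = BNPReLU(nOut)

    def forward(self, x):
        return self.bn_prelu(self.conv(x))


class ChannelWiseConv(nn.Module):
    """Depthwise convolution: one filter per channel (assumption)."""
    def __init__(self, nIn, nOut, kSize, stride=1):
        super().__init__()
        padding = (kSize - 1) // 2
        self.conv = nn.Conv2d(nIn, nOut, kSize, stride=stride,
                              padding=padding, groups=nIn, bias=False)

    def forward(self, x):
        return self.conv(x)


class ChannelWiseDilatedConv(nn.Module):
    """Depthwise dilated convolution; padding grows with the dilation rate."""
    def __init__(self, nIn, nOut, kSize, stride=1, d=1):
        super().__init__()
        padding = ((kSize - 1) // 2) * d
        self.conv = nn.Conv2d(nIn, nOut, kSize, stride=stride, padding=padding,
                              dilation=d, groups=nIn, bias=False)

    def forward(self, x):
        return self.conv(x)


def count_params(module):
    """Total number of learnable weights in a module."""
    return sum(p.numel() for p in module.parameters())


x = torch.randn(1, 16, 32, 32)
f_loc = ChannelWiseConv(16, 16, 3, 1)
f_sur = ChannelWiseDilatedConv(16, 16, 3, 1, 2)
loc, sur = f_loc(x), f_sur(x)

# Same spatial size, so the two outputs can be concatenated along channels.
print(loc.shape, sur.shape)
# Same number of weights despite the wider receptive field of f_sur.
print(count_params(f_loc), count_params(f_sur))
```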

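In addition, the CGBlock uses residual learning to improve performance; the skip connection helps gradients flow during back-propagation.

• LRL (local residual learning): input + joint feature

• GRL (global residual learning): input + global feature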
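To make the difference between the two variants concrete, here is a minimal sketch. It assumes, following the labels above, that LRL adds the input to the joint feature before the global refinement and that GRL adds the input after it. `f_joi` and `f_glo` are hypothetical stand-ins passed in as callables, not the real modules.

```python
import torch


def lrl_forward(x, f_joi, f_glo):
    """Local residual learning: the skip connection joins the joint feature."""
    return f_glo(x + f_joi(x))


def grl_forward(x, f_joi, f_glo):
    """Global residual learning: the skip connection joins the refined feature."""
    return x + f_glo(f_joi(x))


# Toy stand-ins so the two data flows can be followed by hand.
f_joi = lambda t: 2.0 * t      # pretend joint-feature extractor
f_glo = lambda t: 0.5 * t      # pretend channel re-weighting with weight 0.5

x = torch.ones(1, 4, 2, 2)
lrl_out = lrl_forward(x, f_joi, f_glo)   # 0.5 * (1 + 2) = 1.5 everywhere
grl_out = grl_forward(x, f_joi, f_glo)   # 1 + 0.5 * 2 = 2.0 everywhere
print(lrl_out.mean().item(), grl_out.mean().item())
```

The `ContextGuidedBlock` code shown earlier adds the input after `F_glo`, which corresponds to the GRL form in this sketch.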

CGNet

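This part covers the full CGNet built on the CGBlock.

The overall design of CGNet follows the "deep and thin" principle to save memory.

→ In practice it has only 51 convolutional layers, which is very shallow compared with other models.

To preserve spatial information, it uses only three downsampling stages, giving a 1/8 feature-map resolution. (Many segmentation models use five downsampling stages, giving a 1/32 feature-map resolution.)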
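The effect of the number of downsampling stages on resolution is simple arithmetic: each stride-2 stage halves both sides. The sketch below uses a 2048×1024 input (the full Cityscapes image size) purely as an example, and `feature_map_size` is a hypothetical helper.

```python
def feature_map_size(width, height, num_stages):
    """Spatial size after `num_stages` stride-2 downsampling stages."""
    factor = 2 ** num_stages
    return width // factor, height // factor


cgnet_size = feature_map_size(2048, 1024, 3)     # three stages -> 1/8
typical_size = feature_map_size(2048, 1024, 5)   # five stages  -> 1/32
print(cgnet_size, typical_size)
```

With three stages the final feature map keeps 16 times as many spatial positions as with five stages, which is the spatial information the design tries to preserve.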

The table above shows the CGNet architecture for Cityscapes.

• Stage 1: stacks three standard convolutional layers to obtain a feature map at 1/2 resolution (using a stride-2 convolution).

• Stage 2: stacks M CGBlocks to obtain feature maps downsampled to 1/4 of the input image.

• Stage 3: stacks N CGBlocks to obtain feature maps downsampled to 1/8 of the input image.

As Fig. 5 shows, stages 2 and 3 take as input the combination of the first block's output and the last block's output from the previous stage, which reuses features and enables residual learning.

To improve the information flow in CGNet, an input injection mechanism additionally feeds the input image, downsampled to 1/4 and 1/8, to stages 2 and 3 respectively.

Input injection: downsamples with average pooling, reducing the resolution to 1/2 when the downsampling ratio is 1 and to 1/4 when it is 2.

import torch.nn as nn


class InputInjection(nn.Module):
    def __init__(self, downsamplingRatio):
        super().__init__()
        # Each pooling layer halves the resolution, so the input is reduced
        # by a factor of 2 ** downsamplingRatio overall.
        self.pool = nn.ModuleList()
        for i in range(0, downsamplingRatio):
            self.pool.append(nn.AvgPool2d(3, stride=2, padding=1))

    def forward(self, input):
        for pool in self.pool:
            input = pool(input)
        return input
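A quick shape check of input injection, written with the functional form of the same pooling so that the snippet stands alone. The 64×128 image, the 64-channel feature tensors, and the concatenation order are illustrative assumptions. The point is only that the pooled image matches the feature resolution and can be concatenated along the channel axis.

```python
import torch
import torch.nn.functional as F


def inject(image, downsampling_ratio):
    """Apply 3x3 average pooling with stride 2 `downsampling_ratio` times."""
    for _ in range(downsampling_ratio):
        image = F.avg_pool2d(image, kernel_size=3, stride=2, padding=1)
    return image


image = torch.randn(1, 3, 64, 128)
half = inject(image, 1)       # 1/2 resolution
quarter = inject(image, 2)    # 1/4 resolution

# Hypothetical outputs of the first and last block of a stage at 1/4 resolution.
first_block = torch.randn(1, 64, 16, 32)
last_block = torch.randn(1, 64, 16, 32)

# Feature reuse plus input injection: concatenate along channels (64 + 64 + 3).
next_stage_input = torch.cat([last_block, first_block, quarter], dim=1)
print(half.shape, quarter.shape, next_stage_input.shape)
```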