[In progress] Using Mixed Precision in PyTorch

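A post I wrote on my previous blog

While studying yolov5, I came across this part: if half: # if device != cpu model.half() # to FP16. I wondered what it meant. If it just meant using half of the model, why bother when we already have a GPU? So I looked it up. FP16 (16-bit floating point) and FP32 (32-bit floating point) differ, first of all, in what computer science (CS) calls precision. In CS, precision is usually measured in bits, that is, binary digits.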

https://chang-aistory.tistory.com/54?category=933326
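To make the FP16 / FP32 difference concrete, here is a minimal sketch of my own (not code from yolov5). It prints the numeric limits of both types with torch.finfo and shows what model.half() does to a small nn.Linear, which only stands in for a real model here.

```python
import torch
import torch.nn as nn

# Compare the numeric limits of the two floating point formats.
for dtype in (torch.float32, torch.float16):
    info = torch.finfo(dtype)
    print(f"{dtype}: bits={info.bits}, max={info.max}, eps={info.eps}")

# A tiny stand-in model (assumption: any nn.Module behaves the same way).
model = nn.Linear(4, 2)
dtype_before = model.weight.dtype   # parameters are created in float32 by default

model.half()                        # casts floating point parameters and buffers in place
dtype_after = model.weight.dtype

print(dtype_before, "->", dtype_after)
```

So model.half() does not drop half of the model. It converts every floating point parameter to the 16-bit format, which halves the memory each weight needs.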

Official PyTorch docs

Automatic Mixed Precision - PyTorch Tutorials 1.11.0+cu102 documentation

Author: Michael Carilli torch.cuda.amp provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32.

https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html

Automatic Mixed Precision examples - PyTorch 1.11.0 documentation

Ordinarily, "automatic mixed precision training" means training with torch.cuda.amp.autocast and torch.cuda.amp.GradScaler together. Instances of torch.cuda.amp.autocast enable autocasting for chosen regions. Autocasting automatically chooses the precision for GPU operations to improve performance while maintaining accuracy. Instances of torch.cuda.amp.GradScaler help perform the steps of gradient scaling conveniently.

https://pytorch.org/docs/stable/notes/amp_examples.html

torch.cuda.amp provides convenience methods for mixed precision.

Some operations use the torch.float32 (float) datatype, while others use torch.float16 (half).

• In the figure, ops such as Conv run in half precision, while the parts that matter for preserving accuracy, such as Softmax and the loss, stay in FP32.

• Weights are kept in single precision, and back propagation is performed in half precision.

Mixed precision tries to match each op to its appropriate datatype, which reduces the network's runtime and memory footprint.
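The per-op dtype choice can be observed directly. The sketch below is my own illustration: it uses torch.autocast (the device-generic entry point) so that it also runs without a GPU, and on CPU it assumes bfloat16 as the lower-precision type instead of float16. The names a, b and product are just placeholders.

```python
import torch

# Use the GPU if there is one; otherwise fall back to CPU autocast.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: bfloat16 stands in for float16 when running on CPU.
low_dtype = torch.float16 if device == "cuda" else torch.bfloat16

a = torch.randn(8, 8, device=device)   # ordinary float32 tensors
b = torch.randn(8, 8, device=device)

with torch.autocast(device_type=device, dtype=low_dtype):
    # Matrix multiplication is one of the ops autocast runs in lower precision.
    product = torch.mm(a, b)

print("inputs :", a.dtype)        # the inputs themselves are not modified
print("matmul :", product.dtype)  # the result comes out in low_dtype
```

Nothing was cast by hand here. The inputs stay float32 and only the op inside the autocast region changes precision.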

In PyTorch, "automatic mixed precision training" ordinarily means using torch.cuda.amp.autocast and torch.cuda.amp.GradScaler together.

• torch.cuda.amp.autocast : enables autocasting for chosen regions. Autocasting automatically chooses the precision of GPU operations to improve performance while maintaining accuracy.

• torch.cuda.amp.GradScaler : helps perform the steps of gradient scaling conveniently. Gradient scaling improves convergence for networks with float16 gradients by minimizing gradient underflow. If the forward pass of a particular op runs in float16, the backward pass for that op produces float16 gradients. Gradient values with small magnitudes may not be representable in float16, so they underflow to zero and the corresponding parameters are not updated. To prevent this, gradient scaling multiplies the network's loss by a scale factor and runs the backward pass on the scaled loss. The gradients flowing backward through the network are then scaled by the same factor, so they have a larger magnitude and are not flushed to zero.
 Each parameter's gradient ( .grad ) must be unscaled before the optimizer updates the parameters, so that the scale factor does not interfere with the learning rate.
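The scale / unscale idea can be reproduced by hand in a few lines. This is only a sketch of the principle with a fixed, hypothetical scale factor of 1024. The real GradScaler additionally adjusts the scale between iterations and skips the step when it finds infs or NaNs.

```python
import torch

scale = 1024.0                                # hypothetical fixed scale factor
w = torch.tensor([2.0], requires_grad=True)

# Reference: gradient of the unscaled loss, d(3w)/dw = 3.
loss = (3.0 * w).sum()
loss.backward()
reference_grad = w.grad.clone()
w.grad = None                                 # reset before the second pass

# Backward on the scaled loss: every gradient is multiplied by `scale`.
loss = (3.0 * w).sum()
(loss * scale).backward()
scaled_grad = w.grad.clone()

# Unscale before the optimizer would use the gradient,
# so the learning rate keeps its usual meaning.
w.grad.div_(scale)

print(reference_grad, scaled_grad, w.grad)
```

After unscaling, the gradient is identical to the one from the unscaled loss. The larger values only exist while the gradients flow through the backward pass.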

Underflow and Overflow
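Both effects are easy to reproduce on the CPU. The values below are arbitrary ones I picked for illustration: 1e-8 is smaller than anything float16 can represent, and 70000 is larger than its maximum of 65504. The scale of 2**16 is the same value GradScaler uses as its default initial scale.

```python
import torch

small = torch.tensor(1e-8)       # fine in float32
large = torch.tensor(70000.0)    # fine in float32

small_half = small.half()        # underflow: too small for float16, becomes 0
large_half = large.half()        # overflow: beyond 65504, becomes inf
print(small_half, large_half)

# Multiplying by a scale factor moves the small value into float16's range.
scale = 2.0 ** 16
rescued = (small * scale).half() # about 6.55e-4, representable in float16
recovered = rescued.float() / scale
print(rescued, recovered)
```

The recovered value is close to the original 1e-8 again, which is the whole point of scaling the loss before the backward pass.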

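Mixed precision mainly benefits Tensor Core-enabled architectures ( Volta, Turing, Ampere ), because Tensor Cores operate natively on FP16 inputs. The PyTorch recipe reports a speedup of roughly 2-3x on these architectures.

On earlier architectures ( Kepler, Maxwell, Pascal ), the speedup may be modest.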
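If you are not sure which group your GPU belongs to, the CUDA compute capability is a quick check. The helper has_tensor_cores below is hypothetical (my own name, not a PyTorch function). It relies on the fact that Volta starts at compute capability 7.0, while Kepler, Maxwell and Pascal are 3.x, 5.x and 6.x.

```python
import torch

def has_tensor_cores(capability):
    """Return True for compute capability 7.0 (Volta) or newer."""
    major, _minor = capability
    return major >= 7

if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()   # e.g. (7, 5) for Turing
    name = torch.cuda.get_device_name()
    print(name, capability, "Tensor Cores:", has_tensor_cores(capability))
else:
    print("No CUDA device found, so there is nothing to check.")
```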

Typical Mixed Precision Training

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Create the model and optimizer in default precision
model = Net().cuda()
optimizer = torch.optim.SGD(model.parameters(), ... )
loss_fn = nn.L1Loss()
# Create a GradScaler once at the beginning of training
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad() # reset the gradients held by the optimizer

        with autocast():
            output = model(input)
            # check: holds when the model's last op is autocast to float16 (e.g. a linear layer)
            assert output.dtype is torch.float16
            loss = loss_fn(output, target)

        ## Scale the loss
        # Calling backward() on the scaled loss creates scaled gradients.
        # Running the backward pass under autocast is not recommended.
        # Backward ops run in the same dtype autocast chose for the corresponding forward ops.
        scaler.scale(loss).backward()

        ## scaler.step()
        # first unscales the gradients of the optimizer's assigned params.
        # If the gradients do not contain infs or NaNs, optimizer.step() is then called.
        # If they do contain infs or NaNs, optimizer.step() is skipped.
        scaler.step(optimizer)

        ## scaler.update()
        # updates the scale for the next iteration.
        scaler.update()

Additional example

import torch, time, gc

# Timing utilities
start_time = None

def start_timer():
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.synchronize()
    start_time = time.time()

def end_timer_and_print(local_msg):
    torch.cuda.synchronize()
    end_time = time.time()
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))

use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")
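The example above leaves make_model, the data and loss_fn undefined, so it does not run on its own. Below is a self-contained sketch of the same loop with assumed stand-ins: a make_model that stacks Linear + ReLU layers, random tensors as data, MSELoss, and tiny made-up sizes. It drops the timing utilities and falls back to plain FP32 when no GPU is present, so treat it as a smoke test and not as a benchmark.

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"        # AMP is switched off when there is no GPU
amp_dtype = torch.float16 if use_amp else torch.bfloat16

# Hypothetical sizes, kept small so the sketch finishes quickly.
in_size, out_size, num_layers = 64, 8, 3
batch_size, num_batches, epochs = 32, 4, 3

def make_model(in_size, out_size, num_layers):
    # Assumed helper: (num_layers - 1) Linear + ReLU blocks, then an output layer.
    layers = []
    for _ in range(num_layers - 1):
        layers += [torch.nn.Linear(in_size, in_size), torch.nn.ReLU()]
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*layers)

# Random stand-in data.
data = [torch.randn(batch_size, in_size, device=device) for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device=device) for _ in range(num_batches)]
loss_fn = torch.nn.MSELoss()

net = make_model(in_size, out_size, num_layers).to(device)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

epoch_losses = []
for epoch in range(epochs):
    running = 0.0
    for inputs, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
            output = net(inputs)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()   # a no-op scale when use_amp is False
        scaler.step(opt)
        scaler.update()
        opt.zero_grad(set_to_none=True)
        running += loss.item()
    epoch_losses.append(running / num_batches)

print(epoch_losses)
print(next(net.parameters()).dtype)     # the weights themselves stay float32
```

Because both GradScaler and autocast take an enabled flag, the same loop covers the FP32 and the mixed precision case, which is what makes the A/B timing in the example above possible.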

Working with Unscaled Gradients

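All gradients produced by scaler.scale(loss).backward() are scaled.

If you want to modify or inspect the gradients between backward() and scaler.step(optimizer), you should unscale them first.

For example, gradient clipping ( torch.nn.utils.clip_grad_norm_() ) manipulates a set of gradients so that their global norm or maximum magnitude stays at or below a threshold.

If you clipped without unscaling, the gradients' norm or maximum magnitude would also be scaled, so the threshold you requested, which was meant for unscaled gradients, would be invalid.

scaler.unscale_(optimizer) : unscales the gradients held by the optimizer's assigned parameters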
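To see why the order matters, the sketch below clips the same gradient twice: once after unscaling and once before. It scales by hand with a hypothetical fixed factor of 1024 instead of using GradScaler, so it runs on the CPU. The function name clipped_grad_norm and the toy parameter are my own.

```python
import torch

def clipped_grad_norm(unscale_first, scale=1024.0, max_norm=1.0):
    """Return the final gradient norm after clipping, for either ordering."""
    w = torch.nn.Parameter(torch.tensor([3.0, 4.0]))
    loss = (w ** 2).sum() / 2          # d loss / d w = w, so the true grad norm is 5
    (loss * scale).backward()          # gradients now hold scale * w

    if unscale_first:
        w.grad.div_(scale)             # correct order: unscale, then clip
    torch.nn.utils.clip_grad_norm_([w], max_norm)
    if not unscale_first:
        w.grad.div_(scale)             # wrong order: clip scaled grads, unscale later

    return w.grad.norm().item()

print("unscale then clip:", clipped_grad_norm(unscale_first=True))
print("clip then unscale:", clipped_grad_norm(unscale_first=False))
```

With the correct order the gradient norm ends up at about max_norm. With the wrong order it ends up at about max_norm / scale, so the effective threshold is a thousand times stricter than the one requested.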

Gradient Clipping

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

The scaler records that scaler.unscale_(optimizer) was already called for this optimizer in this iteration.

So scaler.step(optimizer) does not needlessly unscale the gradients again before calling optimizer.step(). ( Normally the first thing scaler.step does is unscale. )

Working with Scaled Gradients

Gradient accumulation

Gradient accumulation is