Measuring processing time in torch - torch.cuda.synchronize()

torch.cuda.synchronize - PyTorch 1.12 documentation


https://pytorch.org/docs/stable/generated/torch.cuda.synchronize.html

discuss.pytorch.org

https://discuss.pytorch.org/t/how-does-torch-cuda-synchronize-behave/147049/3

[Pytorch] 파이토치 시간 측정, How to measure time in PyTorch


https://eehoeskrap.tistory.com/m/462

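I looked into torch.cuda.synchronize(), which PyTorch tutorials call at the start of training.

Most of the posts that come up about it are related to measuring execution time.

The official documentation describes it as follows.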

Waits for all kernels in all streams on a CUDA device to complete.

There is no further explanation beyond this.

cuda

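A function that runs on the GPU is usually called a kernel, and invoking it is called a launch.

Unlike an ordinary C/C++ function call, a kernel launch is asynchronous. That is, the CPU code does not wait for the kernel to finish and immediately moves on to the next statement.

Calling torch.cuda.synchronize() therefore makes the host (CPU) code wait for the device (GPU) code to complete, and the CPU resumes execution once the device code has finished.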
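The effect of the asynchronous launch can be seen with a plain wall-clock timer. The following is a minimal sketch, not taken from the referenced posts: the helper name `time_matmul`, the matrix size, and the `wait_for_gpu` flag are assumptions chosen for illustration. On a machine without a GPU it falls back to the CPU, where both measurements are expected to be roughly equal.

```python
import time

import torch


def time_matmul(size=512, wait_for_gpu=True):
    """Return (seconds, result) for one matmul.

    Hypothetical helper: `wait_for_gpu` controls whether we call
    torch.cuda.synchronize() before stopping the clock.
    """
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)

    # Warm-up so one-time initialization cost is not part of the measurement.
    _ = a @ b
    if use_cuda:
        torch.cuda.synchronize()  # drain pending work before starting the clock

    t0 = time.perf_counter()
    c = a @ b  # on CUDA this only enqueues the kernel and returns
    if use_cuda and wait_for_gpu:
        torch.cuda.synchronize()  # block the CPU until the kernel has finished
    elapsed = time.perf_counter() - t0
    return elapsed, c


launch_only, _ = time_matmul(wait_for_gpu=False)
full, out = time_matmul(wait_for_gpu=True)
print(f"without synchronize: {launch_only * 1e3:.3f} ms")
print(f"with synchronize:    {full * 1e3:.3f} ms")
```

On a GPU the first number mostly reflects the cost of launching the kernel, so it is typically much smaller than the second one.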

conclusion

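So, because CUDA calls are asynchronous, the code has to be synchronized with torch.cuda.synchronize() before starting or stopping a timer in order to measure time accurately.

As the bookmarks at the top of this post show, several communities therefore recommend measuring time as follows.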

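# Example: timing the addition of two CUDA tensors (x, y)
import torch

# x and y were not defined in the original snippet; the shapes are arbitrary.
x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")

starter = torch.cuda.Event(enable_timing=True)
ender = torch.cuda.Event(enable_timing=True)

starter.record()  # mark the start point in the current stream
z = x + y
ender.record()  # mark the end point in the current stream

torch.cuda.synchronize()  # wait until both events have actually been reached
op_time = starter.elapsed_time(ender)  # elapsed time in milliseconds
print(op_time)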
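A single measurement of a small operation tends to be noisy, so a common variation is to warm up first and average over several repetitions. The sketch below assumes this approach. The helper name `mean_event_time_ms` and the `warmup` and `iters` defaults are illustrative choices, not something the referenced posts prescribe. Without a GPU the helper returns None.

```python
import torch


def mean_event_time_ms(fn, warmup=3, iters=10):
    """Average GPU time of fn() in milliseconds, or None when CUDA is unavailable.

    Hypothetical helper; `warmup` and `iters` are arbitrary defaults.
    """
    if not torch.cuda.is_available():
        return None

    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    torch.cuda.synchronize()  # make sure the warm-up work has finished

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(iters):
        fn()
    end.record()

    torch.cuda.synchronize()  # both events must be reached before reading them
    return start.elapsed_time(end) / iters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = torch.randn(1024, 1024, device=device)

avg_ms = mean_event_time_ms(lambda: x + y)
print(avg_ms)
```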

torch.cuda.Event

Wrapper around a CUDA event.

CUDA events are synchronization markers that can be used to monitor the device's progress.

→ In other words, an event leaves a mark at a specific point in the sequence of commands inside a CUDA stream.

CUDA Stream

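The official documentation states that the CUDA streams must be synchronized to measure time accurately.

The underlying CUDA events are lazily initialized when the event is first recorded or exported to another process.

After creation, only streams on the same device may record the event. However, streams on any device can wait on the event.

Setting enable_timing to True tells the event that it should measure time (the default is False).

The record() method records the event in the given stream. If no stream is specified, it uses torch.cuda.current_stream().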
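To make the stream argument of record() concrete, here is a minimal sketch that passes the current stream explicitly and then waits on the end event alone with Event.synchronize() instead of synchronizing the whole device. The helper name `time_on_current_stream` and the tensor shapes are assumptions for illustration. Without a GPU the helper just runs the function and returns None.

```python
import torch


def time_on_current_stream(fn):
    """Milliseconds spent in fn() on the current CUDA stream, or None without a GPU.

    Hypothetical helper written for this note.
    """
    if not torch.cuda.is_available():
        fn()
        return None

    stream = torch.cuda.current_stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record(stream)  # same as start.record() when `stream` is the current stream
    fn()
    end.record(stream)

    end.synchronize()  # wait only until the `end` event has been reached
    return start.elapsed_time(end)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = torch.randn(1024, 1024, device=device)

elapsed_ms = time_on_current_stream(lambda: x + y)
print(elapsed_ms)
```

Waiting on the event instead of calling torch.cuda.synchronize() only blocks until that marker is reached, which can matter when other streams are still busy with unrelated work.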