[PyTorch][Error Fix] Solving GPU memory problems: 'torch.utils.checkpoint'


영스퀘어 2023. 3. 7. 15:09

https://pytorch.org/docs/stable/checkpoint.html

torch.utils.checkpoint — PyTorch 1.13 documentation


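With a large deep learning model that has many parameters, you can hit 'out of memory' even with the batch size set to 1.

In my case, I wanted to run experiments on a VSR model with a Transformer backbone, and even a Quadro RTX 8000 GPU with 48GB of memory was not enough.

When training on a GPU, this kind of memory limit can be worked around with 'torch.utils.checkpoint'.

The documentation linked above describes it as follows.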

'Rather than storing all intermediate activations of the entire computation graph for computing backward, the checkpointed part does not save intermediate activations, and instead recomputes them in backward pass.'

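The description is short, but the idea seems to be this: training normally has to keep the 'intermediate activations' in memory until the backward pass. For the checkpointed part of the model, they are not stored and are recomputed during the backward pass instead, which lowers memory use at the cost of some extra computation.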
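To make the "recompute instead of store" idea concrete, here is a minimal sketch that runs the same block with and without checkpointing and compares the input gradients. The `block` and the `input_grad` helper are hypothetical names made up for this illustration, and `use_reentrant=False` assumes a reasonably recent PyTorch version.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical stand-in for one expensive block of a larger model.
block = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))


def input_grad(use_checkpoint: bool) -> torch.Tensor:
    """Run forward + backward through `block`; return the gradient w.r.t. the input."""
    x = torch.ones(4, 16, requires_grad=True)
    if use_checkpoint:
        # Activations inside `block` are not kept after the forward pass;
        # they are recomputed when backward() reaches this segment.
        y = checkpoint(block, x, use_reentrant=False)
    else:
        y = block(x)
    y.sum().backward()
    return x.grad


grad_plain = input_grad(use_checkpoint=False)
grad_ckpt = input_grad(use_checkpoint=True)

# Checkpointing changes when activations are computed, not the result.
print(torch.allclose(grad_plain, grad_ckpt))
```

The point of the comparison is that checkpointing should only change memory and compute cost, not the gradients themselves.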

For more details, see the documentation link at the top of this post.

Usage is simple.

[Where the training process is implemented]

1. Additional setting for using DDP

model._set_static_graph()
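Note that `_set_static_graph()` is a private method of `DistributedDataParallel`. As far as I know, recent PyTorch releases also accept `static_graph=True` in the DDP constructor, which should have the same effect. The sketch below assumes that argument is available; `TinyNet` and `run_single_process_demo` are hypothetical names for illustration, not part of my VSR code.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class TinyNet(nn.Module):
    """Hypothetical model with one checkpointed block."""

    def __init__(self):
        super().__init__()
        self.stem = nn.Linear(8, 8)
        self.block = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
        self.head = nn.Linear(8, 1)

    def forward(self, x):
        h = self.stem(x)
        # Reentrant checkpointing is the variant that relies on a static graph in DDP.
        h = checkpoint(self.block, h, use_reentrant=True)
        return self.head(h)


def run_single_process_demo() -> bool:
    """Start a 1-process gloo group on CPU, run one step, report whether all grads exist."""
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        backend="gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )
    try:
        # Assumption: this PyTorch version supports the static_graph constructor argument.
        model = DDP(TinyNet(), static_graph=True)
        loss = model(torch.randn(4, 8)).sum()
        loss.backward()
        return all(p.grad is not None for p in model.parameters())
    finally:
        dist.destroy_process_group()
```

In a real multi-GPU job the process group is normally set up by the launcher (for example `torchrun`), so only the `DDP(..., static_graph=True)` line would carry over.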

2. cudnn setting

torch.backends.cudnn.benchmark = True

3. Training with AMP

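from torch.cuda.amp import GradScaler, autocast

# Create the scaler once, before the training loop
scaler = GradScaler(enabled=True)

optimizer_g.zero_grad()

# Only the forward pass and the loss run under autocast
with autocast(enabled=True):
    output_HR = model(input_LR)
    l_pix = loss_function(output_HR, gt)

# backward and the optimizer step go outside the autocast context
scaler.scale(l_pix).backward()
scaler.step(optimizer_g)
scaler.update()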

[Where the model is implemented]

1. Import the module

import torch.utils.checkpoint as checkpoint

2. Apply the checkpoint function

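import torch.nn as nn

class BasicLayer(nn.Module):
    def __init__(self, dim, depth):
        super().__init__()

        # build sample blocks (SampleBlock is defined elsewhere in the model code)
        self.blocks = nn.ModuleList([
            SampleBlock(dim=dim) for _ in range(depth)
        ])

    def forward(self, x):
        for blk in self.blocks:
            # Before: x = blk(x)
            # After: wrap the call so blk's activations are recomputed in backward
            x = checkpoint.checkpoint(blk, x)

        return x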
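For anyone who wants to try this without the full model, below is a self-contained sketch of the same pattern. The `SampleBlock` here is a hypothetical residual MLP block that I made up as a stand-in, and the `use_checkpoint` flag and the `train_step_peak_memory` helper are also assumptions added for the demo. On a GPU it reports the peak allocated memory for one forward/backward pass; on CPU it just runs and returns `None`.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class SampleBlock(nn.Module):
    """Hypothetical residual MLP block; a stand-in for the real block."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.mlp(self.norm(x))


class BasicLayer(nn.Module):
    def __init__(self, dim, depth, use_checkpoint=True):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.blocks = nn.ModuleList([SampleBlock(dim) for _ in range(depth)])

    def forward(self, x):
        for blk in self.blocks:
            if self.use_checkpoint and self.training:
                # Recompute this block's activations during backward.
                x = checkpoint(blk, x, use_reentrant=False)
            else:
                x = blk(x)
        return x


def train_step_peak_memory(use_checkpoint):
    """Run one forward/backward pass; return peak CUDA bytes, or None on CPU."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = BasicLayer(dim=64, depth=8, use_checkpoint=use_checkpoint)
    model = model.to(device).train()
    x = torch.randn(32, 128, 64, device=device)
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
    model(x).mean().backward()
    if device == "cuda":
        return torch.cuda.max_memory_allocated()
    return None


for flag in (False, True):
    print(f"use_checkpoint={flag}: peak bytes = {train_step_peak_memory(flag)}")
```

Skipping the checkpoint call outside of training is a design choice in this sketch: there is no backward pass at inference time, so recomputation would bring no benefit there.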
