Hi Benjamin,
Do we know how the inference computation is performed with a device map spread over VRAM/RAM/disk? Is everything transferred to the GPU for computation, or is inference performed on the CPU for layers whose parameters sit in RAM or on disk? Thanks!
I don't know how it works. I didn't see anything about it in the Accelerate documentation. My guess is that each part of the model is read from wherever it is stored, while the computation happens on the fastest device.
Thanks! I'll ask people at HF.
The reason I was asking: at some point you mentioned setting `max_memory` for the VRAM and leaving some room to avoid OOM errors. I wonder whether the OOM was caused, on top of everything else being loaded, by parts of the model stored on disk/RAM being moved into VRAM for computation. I'll post the answer here when I receive it.
From this video, it looks like everything is computed on the GPU (weights are moved from wherever they are stored into VRAM as inference proceeds). That's another good reason for leaving some space in VRAM when setting `max_memory`.
https://youtu.be/MWCSGj9jEAo
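For anyone reading along, here is a minimal sketch of the kind of loading call we are talking about. The model ID and the memory budgets are hypothetical; the point is just that the caps passed via `max_memory` should sit well below the real device capacity, so that the activations and the offloaded weights streamed onto the GPU during inference still fit.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical example, any causal LM works

# Cap each device below its real capacity: the gap absorbs activations and the
# offloaded weights that get streamed onto the GPU during the forward pass.
max_memory = {0: "20GiB", "cpu": "60GiB"}  # hypothetical budgets for a 24 GB GPU

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",         # let Accelerate split the layers across GPU/CPU/disk
    max_memory=max_memory,
    offload_folder="offload",  # layers that fit neither in VRAM nor in RAM go here
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```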
It also depends on the task we want to perform with the model. I think `device_map` only takes care of loading the weights; if we later do batch decoding, we still need room somewhere for the batches. For fine-tuning, as far as I know, it doesn't leave space for the optimizer states either.
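To illustrate the point about batch decoding: once the model is loaded as above, generation needs GPU memory beyond the weights for the input batch and the intermediate activations, which is exactly what the headroom left in `max_memory` pays for. The prompts and batch size below are made up, and the sketch assumes GPU 0 holds the first layers, as `device_map="auto"` usually arranges.

```python
import torch

# Batched decoding: every extra sequence and every generated token consumes
# GPU memory on top of the weights that Accelerate already placed there.
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # needed to pad a batch
tokenizer.padding_side = "left"  # decoder-only models should be left-padded for generation

prompts = ["What does device_map do?", "What does max_memory do?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(0)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```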