5 Comments

Hi Benjamin,

Do we know how inference is computed when the device map is spread across VRAM, RAM, and disk? Is everything transferred to the GPU for computation, or is inference run on the CPU for layers whose parameters sit in RAM or on disk? Thanks!

I don't know how it works. I didn't see anything about it in the Accelerate documentation. My guess is that each part of the model is read from wherever it is stored, while the computation happens on the fastest device.

Thanks! I'll ask people at HF.

The reason I was asking: at some point you mentioned setting `max_memory` for the VRAM and leaving some room to avoid OOM. I wonder whether the OOM was caused, on top of other things being loaded, by parts of the model stored on disk/RAM being moved into VRAM for computation. I'll post the answer here when I receive it.

From this video, it looks like everything is computed on the GPU: weights are moved from wherever they are stored into VRAM as inference is performed. That's another good reason to leave some free VRAM when setting `max_memory`.

https://youtu.be/MWCSGj9jEAo
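
To make the "leave some room" idea concrete, here is a minimal sketch of building a `max_memory` dict that caps each GPU below its physical VRAM. The helper name `build_max_memory`, the 20% headroom, and the 64 GiB CPU budget are assumptions for illustration, not values recommended by the article or by Accelerate; the right margin depends on your model and batch size.

```python
import math


def build_max_memory(gpu_total_gib, headroom_fraction=0.2, cpu_gib=64):
    """Build a max_memory dict that reserves part of each GPU's VRAM.

    gpu_total_gib: physical VRAM of each GPU in GiB, indexed by GPU id.
    headroom_fraction: share of VRAM left free (assumed value, tune it).
    cpu_gib: RAM budget for layers offloaded to the CPU (assumed value).
    """
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    max_memory = {}
    for gpu_id, total in enumerate(gpu_total_gib):
        # Round down so the cap never exceeds the intended budget.
        usable = math.floor(total * (1 - headroom_fraction))
        max_memory[gpu_id] = f"{usable}GiB"
    max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


# One 24 GiB GPU with the default headroom.
max_memory = build_max_memory([24])
print(max_memory)

# The dict would then be passed when loading the model, for example:
# AutoModelForCausalLM.from_pretrained(
#     model_name, device_map="auto", max_memory=max_memory
# )
```

The GPU keys are integer device ids and the values are size strings, which is the format `max_memory` expects. The load call is left as a comment because it needs `transformers`, `accelerate`, and a model download.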

It also depends on what we want to do with the model. I think `device_map` only takes care of loading, so if we later do batch decoding, the batches still need to be stored somewhere with free space. For fine-tuning, as far as I know, it doesn't leave space for optimizer states.
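
As a rough illustration of why optimizer states matter, here is a back-of-the-envelope sketch. It assumes an Adam-style optimizer that keeps two extra tensors (first and second moment) per trainable parameter, stored in fp32; the function name `adam_state_gib` and the 7B parameter count are hypothetical, and the estimate ignores gradients, activations, and any memory-saving optimizer variant.

```python
def adam_state_gib(num_params, bytes_per_state=4):
    """Estimate the memory taken by Adam's optimizer states, in GiB.

    Assumes two state tensors per parameter (first and second moment),
    each stored with bytes_per_state bytes per element (4 for fp32).
    """
    return 2 * num_params * bytes_per_state / 1024**3


# Hypothetical 7-billion-parameter model, all parameters trainable.
print(f"{adam_state_gib(7_000_000_000):.1f} GiB of optimizer states")
```

This memory comes on top of the weights themselves, so a `max_memory` setting that was sized only for loading the model can still run out of memory once training starts.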
