Device Map: Avoid Out-of-Memory Errors When Running Large Language Models
A small trick to run LLMs on any computer
Device mapping is a feature implemented in the Accelerate library by Hugging Face. It splits a large language model (LLM) into smaller parts that can be loaded individually on different devices: GPU VRAM, CPU RAM, and the hard disk.
In this article, I won’t explain again how it works. I have already written a detailed report about device map that you can read here:
Instead, I will explain why, even with device map, you may still get out-of-memory (OOM) errors on your GPU.
GPUs store more than just the LLM
Device map handles all the steps of loading an LLM: creating empty tensors, dispatching the weights to each device, etc. It maximizes the usage of the VRAM available on the GPUs.
For instance, if you have a GPU with 48 GB of VRAM, device map will try to use nearly all 48 GB for the LLM, leaving only a few GB available for any other operations the GPU may have to perform later.
You also need some space on your GPUs to store CUDA kernels, various other tensors, and the graphical user interface (GUI) of your OS if a screen is plugged into your GPU (this consumes around 2 GB on Ubuntu). Device map can’t know in advance how much memory these other processes will consume.
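Before picking memory limits, it helps to see how much VRAM is actually free once the OS and CUDA context have taken their share. Here is a quick sketch using PyTorch's `torch.cuda.mem_get_info`; `free_vram_mb` is just an illustrative helper name:

```python
import torch

def free_vram_mb():
    """Report free VRAM per GPU, in MB, using torch.cuda.mem_get_info."""
    report = {}
    for i in range(torch.cuda.device_count()):
        free_bytes, _total_bytes = torch.cuda.mem_get_info(i)
        report[i] = free_bytes // (1024 ** 2)
    return report

if torch.cuda.is_available():
    # e.g., {0: 44532, 1: 46890} — GPU 0 has less free because it hosts the GUI
    print(free_vram_mb())
```

The gap between total and free VRAM on your main GPU gives you a first estimate of how much headroom to reserve.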
So we need to keep some of the memory free, just in case, especially on your main system GPU.
Set a maximum memory usage for the LLM on your GPU
In practice, reserving some of the GPUs’ memory is easy to achieve with Accelerate.
Let’s say that you have several GPUs with 48 GB of VRAM. You can set the max memory as follows to avoid OOM errors:
import torch
from transformers import OPTForCausalLM

# Cap each GPU at 46 GB, keeping ~2 GB free for CUDA kernels and other tensors
max_memory = {i: '46000MB' for i in range(torch.cuda.device_count())}
# Keep a larger margin on GPU 0, which also hosts the CUDA context and the OS GUI
max_memory[0] = '30000MB'

model = OPTForCausalLM.from_pretrained("facebook/opt-6.7b", device_map="auto", max_memory=max_memory)
Note: You must have Accelerate installed. I explain how to run device map in the article linked in the introduction.
In this example, I kept 16 GB free on the first GPU and 2 GB on the other GPUs. This is more than necessary in most cases, and the right values largely depend on the LLM you use. I recommend adjusting them manually: increase the limits step by step until your GPU triggers OOM errors, then keep the last values that worked.
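The pattern above can be wrapped in a small helper that takes the headroom you want to reserve on each GPU. This is a sketch; `build_max_memory` is a hypothetical name for illustration, not an Accelerate function:

```python
def build_max_memory(num_gpus, vram_mb, headroom_mb, main_gpu_headroom_mb):
    """Build a max_memory dict (Accelerate's format) that reserves
    `headroom_mb` on every GPU and a larger margin on GPU 0, where the
    OS GUI and CUDA context usually live."""
    max_memory = {i: f"{vram_mb - headroom_mb}MB" for i in range(num_gpus)}
    max_memory[0] = f"{vram_mb - main_gpu_headroom_mb}MB"
    return max_memory

# Two 48 GB GPUs: keep 2 GB free everywhere, but 16 GB free on GPU 0.
print(build_max_memory(2, 48000, 2000, 16000))
# {0: '32000MB', 1: '46000MB'}
```

The resulting dict can be passed directly as the `max_memory` argument of `from_pretrained`, as in the example above.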
If you have only one GPU, let’s say a 24 GB GPU such as an RTX 3090, you could set up max_memory as follows:
max_memory = {0: '18000MB'}
Then, increase the value in steps of 500 MB until you trigger an OOM error, and keep the last value that worked.
Conclusion
Setting up max_memory will help you avoid OOM errors.
Nonetheless, you may sometimes think you have found a value that works well and then suddenly get an OOM error. This can happen for various reasons. On a personal computer, it is often caused by other tasks that consume GPU VRAM: if you fine-tune an LLM and decide at the same time to launch Netflix in 4K, it won’t slow down your training, but it may make it run out of memory.