Use FlashAttention-2 for Faster Fine-tuning and Inference
How to use FlashAttention-2 for QLoRA fine-tuning
FlashAttention is a popular method for optimizing the attention computation in Transformer models. It can significantly accelerate inference and fine-tuning for large language models (LLMs).
FlashAttention is now implemented in many frameworks, including Hugging Face Transformers and Unsloth, and supports most recent LLMs.
In this article, I briefly describe how FlashAttention works, with a particular focus on the optimizations introduced by FlashAttention-2. We will see how to use it with Hugging Face Transformers and what kind of speedup to expect when using it for QLoRA fine-tuning.
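As a quick preview, here is a minimal sketch of how FlashAttention-2 can be enabled when loading a model in 4-bit for QLoRA fine-tuning with Transformers. The model name and quantization settings are illustrative assumptions; the full fine-tuning code is in the notebook linked below.

```python
# Minimal sketch: load a model in 4-bit (NF4) with FlashAttention-2 enabled.
# Requires the flash-attn package to be installed and a compatible GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-v0.1"  # hypothetical model choice for illustration

# 4-bit NF4 quantization configuration typically used for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # switch from the default attention to FlashAttention-2
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
```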
You can find the code for QLoRA fine-tuning using FlashAttention-2 in this notebook:
Last update: March 13th, 2024