Use FlashAttention-2 for Faster Fine-tuning and Inference


How to use FlashAttention-2 for QLoRA fine-tuning

Benjamin Marie
Nov 16, 2023

FlashAttention is a popular method for optimizing the attention computation in the Transformer architecture. It can significantly accelerate inference and fine-tuning for large language models (LLMs).

FlashAttention is now implemented in many frameworks, including Hugging Face Transformers and Unsloth, and supports most of the recent LLMs.
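For example, with Hugging Face Transformers, FlashAttention-2 can be enabled when loading a model through the attn_implementation argument. The sketch below is only an illustration, not the notebook's exact code: the model name is a placeholder, and FlashAttention-2 also requires the flash-attn package, a supported GPU, and fp16/bf16 weights.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any architecture with FlashAttention-2 support works similarly
model_id = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FlashAttention-2 requires fp16 or bf16
    attn_implementation="flash_attention_2",  # needs the flash-attn package installed
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)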

See also: unsloth: Faster and Memory-Efficient QLoRA Fine-tuning (The Kaitchup, December 28, 2023).


In this article, I briefly describe how FlashAttention works and, in particular, detail the optimizations introduced by FlashAttention-2. We will see how to use it with Hugging Face Transformers and what kind of speedup you can expect when using it for QLoRA fine-tuning.
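As a preview of the setup, the sketch below combines 4-bit quantization (QLoRA) with FlashAttention-2 at model loading time. It is an illustrative configuration rather than the notebook's exact code: the model name and LoRA hyperparameters are placeholders.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model

# 4-bit NF4 quantization; bf16 compute dtype is compatible with FlashAttention-2
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections (illustrative hyperparameters)
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

From there, training proceeds as with any other QLoRA run; FlashAttention-2 only changes how attention is computed, not the fine-tuning loop itself.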

You can find the code for QLoRA fine-tuning using FlashAttention-2 in this notebook:

Get the notebook (#28)

Last update: March 13th, 2024
