Introduction to the Open LLM Falcon-40B: Performance, Training Data, and Architecture
Get started using Falcon-7B, Falcon-40B, and their instruct versions
The Falcon models have drawn a lot of attention since their release in May 2023. They are causal large language models (LLMs), so-called “decoder-only” models, very much like GPT.
Definition: Causal Language Model
Causal language modeling involves predicting the token that follows a sequence of tokens. During training, the model’s attention is directed solely toward the left context; the right context is masked. These models are usually trained on billions of tokens.
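To make this concrete, here is a minimal sketch of next-token prediction with a decoder-only model using the Hugging Face transformers library. The GPT-2 checkpoint and the prompt are placeholders I chose for illustration; any causal LM from the Hub, including Falcon, behaves the same way.

```python
# Minimal sketch: next-token prediction with a decoder-only (causal) model.
# GPT-2 is used here only because it is small; the prompt is arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The Falcon models were trained on", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Because attention is causally masked, the logits at position i depend only
# on tokens 0..i (the left context). The predicted next token is read off
# the last position.
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))
```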
The Falcon models have been completely free, even for commercial use (Apache 2.0 license), since May 31st. They were developed and trained by the Technology Innovation Institute (TII) of Abu Dhabi.
According to the first results, Falcon-40B, the biggest of the Falcon models, outperforms all the other causal LLMs, including LLaMa-65B and MPT-7B.
In this article, I introduce Falcon-40B, Falcon-7B, and their instruct versions in detail. We will see how they perform compared to other models, how they were trained, and how to run Falcon-7B on your own GPU with QLoRa.
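As a preview of that last point, here is a minimal sketch of loading Falcon-7B in 4-bit precision and attaching LoRA adapters for QLoRa-style fine-tuning. It assumes the transformers, bitsandbytes, and peft packages and a CUDA GPU; the hyperparameters and the target module name are my assumptions for illustration, not TII’s recipe.

```python
# Minimal sketch: load Falcon-7B in 4-bit and prepare it for QLoRa fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "tiiuae/falcon-7b"

# 4-bit NF4 quantization, as used by QLoRa
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on top of the frozen 4-bit base model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # assumption: Falcon's fused Q/K/V projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```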
Performance on OpenLLM
The instruct version of Falcon-40B is ranked first on the OpenLLM leaderboard. The standard version is ranked second.
The OpenLLM leaderboard evaluates the performance of LLMs on 4 tasks:
AI2 Reasoning Challenge (25-shot): Grade-school science questions.
HellaSwag (10-shot): A commonsense inference benchmark.
MMLU (5-shot): 57 tasks in various domains such as maths, computer science, and law.
TruthfulQA (0-shot): A benchmark that evaluates how truthful the model is when answering questions.
Falcon-40B outperforms Meta AI’s LLaMa-65B on all these tasks.
Falcon RefinedWeb
The Falcon models were mainly trained on the Falcon RefinedWeb dataset. It was also created by TII and is distributed under an Apache 2.0 license.
RefinedWeb was extracted from CommonCrawl and has been thoroughly curated. TII claims it is multimodal-friendly, since the links and alt texts of images are preserved.
In the dataset card published on the Hugging Face Hub, TII wrote: “This public extract […]”. It is thus unclear to me whether the Falcon models were trained on this public version of the dataset, which is only an “extract”, or on a bigger internal version.
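For reference, the public extract can be streamed directly from the Hub without downloading it in full. This is only a sketch: the dataset id tiiuae/falcon-refinedweb comes from the dataset card, and I assume the document text lives in a "content" field.

```python
# Minimal sketch: stream a few documents from the public RefinedWeb extract.
from datasets import load_dataset

refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, example in enumerate(refinedweb):
    # "content" is assumed to hold the curated text of one web document
    print(example["content"][:200])
    if i == 2:
        break
```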