Hi Everyone,
In this edition of The Weekly Kaitchup:
A Negation Benchmark for LLMs
Run Models with Trillions of Parameters on Your Computer Thanks to 1-bit Quantization
Llemma and Proof-Pile-2: A Model and a Dataset for Mathematics
What to Read On Substack: This is a new section where I recommend articles published in other Substack newsletters.
The Kaitchup now has 857 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks and articles. There is a 7-day trial that you can cancel anytime.
If you are a monthly paid subscriber, switch to a yearly subscription to get a 17% discount (2 months free)!
A Negation Benchmark for LLMs
LLMs struggle with understanding negation. To better assess LLMs’ ability to deal with negation, researchers at the University of the Basque Country created a new dataset of 400k commonsense-knowledge sentences, each of which can be true or false, with negation appearing in various forms in about two-thirds of the corpus.
The dataset is available (Apache 2.0 license) on the Hugging Face Hub:
You can also find code to evaluate LLMs with this dataset in this GitHub repository:
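If you want to try it quickly, below is a minimal sketch of loading the dataset with the Hugging Face datasets library. Note that the dataset identifier is an assumption based on the paper title; check the exact name on the Hub page linked above.

# pip install datasets
from datasets import load_dataset

# Assumed dataset ID; verify it on the Hugging Face Hub page linked above
dataset = load_dataset("HiTZ/This-is-not-a-dataset")

# Inspect the splits and one example: each row is a commonsense statement,
# possibly negated, with a true/false label
print(dataset)
first_split = list(dataset.keys())[0]
print(dataset[first_split][0])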
The creators of the dataset used it to evaluate popular LLMs (LLaMA, Pythia, T5, …). Their conclusion:
while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues.
Fine-tuning the LLMs on the dataset did not significantly improve their handling of negative sentences.
Experiments with this benchmark and details on its creation are given in this arXiv paper:
This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models
Run Models with Trillions of Parameters on Your Computer Thanks to 1-bit Quantization
4-bit quantization is good enough for most LLMs with billions of parameters. It works better for larger LLMs as the quantization gets more accurate with more parameters to quantize.
3-bit quantization works fine for very large language models, e.g., with more than 100B parameters such as Falcon-180B. 2-bit quantization can also produce acceptable results for very large models even though I would recommend keeping some parts at higher precision, e.g., using ExLlamav2.
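For reference, this is a minimal sketch of loading a model with 4-bit quantization using Transformers and bitsandbytes. The model name is only a placeholder; swap in the LLM you want to quantize.

# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM on the Hub

# NF4 4-bit quantization with bfloat16 compute, a common setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)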
What about 1-bit quantization?
I recently read two promising papers proposing methods to get 1-bit models:
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Switch Transformer has 1.6 trillion parameters. You would need 3.2 TB of memory just to load the model in 16-bit precision. QMoE shows that it is possible to compress its weights to an average of 0.8 bits per parameter without much accuracy loss. After compression, the model can be loaded on a machine with more than 160 GB of CPU RAM (or 8 GPUs with 24 GB of VRAM, such as RTX 3090/4090 GPUs).
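A quick back-of-the-envelope calculation makes these numbers concrete:

params = 1.6e12  # Switch Transformer: 1.6 trillion parameters

# 16-bit weights: 2 bytes per parameter, just to hold the model
fp16_bytes = params * 2
print(f"fp16: {fp16_bytes / 1e12:.1f} TB")  # ~3.2 TB

# QMoE compresses the weights to roughly 0.8 bits per parameter
qmoe_bytes = params * 0.8 / 8
print(f"QMoE (~0.8 bit): {qmoe_bytes / 1e9:.0f} GB")  # ~160 GB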
BitNet: Scaling 1-bit Transformers for Large Language Models
QMoE is a post-training quantization algorithm designed for MoE models. BitNet, on the other hand, is more flexible: it inserts and trains 1-bit layers in the Transformer architecture. In practice, it replaces “nn.Linear” with a new “BitLinear” module:
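To make the idea concrete, here is a simplified sketch of what such a binary linear layer could look like: the weights are binarized to {-1, +1} with a scaling factor in the forward pass, and a straight-through estimator lets gradients flow back to the latent full-precision weights. This is only an illustration of the principle, not the official BitNet implementation (which also quantizes the activations and includes normalization inside the layer).

import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    # Simplified 1-bit linear layer in the spirit of BitNet (illustrative only)
    def forward(self, x):
        w = self.weight
        # Per-tensor scaling factor so the output magnitude stays calibrated
        beta = w.abs().mean()
        # Binarize the weights around their mean
        w_bin = torch.sign(w - w.mean())
        # Straight-through estimator: binary weights in the forward pass,
        # gradients flow through the latent full-precision weights
        w_q = w + (w_bin * beta - w).detach()
        return F.linear(x, w_q, self.bias)

# Usage: swap nn.Linear for BitLinear inside a Transformer block
layer = BitLinear(512, 512)
out = layer(torch.randn(4, 512))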
In their evaluation, the authors of BitNet show very impressive results. It seems to compete with models quantized with 4-bit GPTQ. However, note that with BitNet the model is trained with 1-bit weights while GPTQ is a post-training quantization algorithm, i.e., the model has not been trained with low-precision weights.
BitNet keeps gradients and optimizer states in high precision for stability during training.
Llemma and Proof-Pile-2: A Model and a Dataset for Mathematics
EleutherAI released a new LLM for mathematics called Llemma.
It is based on Code Llama fine-tuned on Proof-Pile-2, a dataset collected by EleutherAI that mixes scientific papers, web data containing mathematics, and mathematical code.
Llemma is a 7 billion parameter model. You can run it on your GPU if it has 24 GB of VRAM. If you quantize it to 4-bit, it can also run on a GPU with at least 6 GB of VRAM.
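For instance, here is a sketch of loading the 7B model in 4-bit (same bitsandbytes setup as above) and prompting it. The model ID is an assumption; confirm it on the Hub page linked just below.

# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "EleutherAI/llemma_7b"  # assumed ID; check the Hub page below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

prompt = "Theorem: The sum of two even integers is even.\nProof:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))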
You can get it from the Hugging Face Hub:
There is also a bigger version with 34 billion parameters:
The Proof-Pile-2 dataset is available here:
Training details and evaluation are presented in this arXiv paper:
Llemma: An Open Language Model For Mathematics
What To Read On Substack
In this section, I recommend articles that I read on Substack:
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!
Hello,
I have some questions, please.
- Do we have to use a specific prompt style to fine-tune a model with LoRA? I mean, will the “path” taken during inference then be more likely to use the LoRA weights?
- With PEFT, we can add LoRA weights as extra weights, so why don’t we do this several times with differently calibrated LoRA adapters to get better results?
Finally, combining my two questions should give the best model so far...
I’m working on it. What do you think?