Hi Everyone,
In this edition of The Weekly Kaitchup:
A Negation Benchmark for LLMs
Run Models with Trillions of Parameters on Your Computer Thanks to 1-bit Quantization
Llemma and Proof-Pile-2: A Model and a Dataset for Mathematics
What to Read On Substack: This is a new section where I recommend articles published in other Substack newsletters.
The Kaitchup now has 857 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks and articles. There is a 7-day trial that you can cancel anytime.
If you are a monthly paid subscriber, switch to a yearly subscription to get a 17% discount (2 months free)!
A Negation Benchmark for LLMs
LLMs struggle with understanding negation. To better assess LLMs’ ability to deal with negation, researchers at the University of the Basque Country created a new dataset of 400k commonsense-knowledge sentences, each of which can be true or false, with negation appearing in various forms in about two-thirds of the corpus.
The dataset is available (Apache 2.0 license) on the Hugging Face Hub:
You can also find code to evaluate LLMs with this dataset in this GitHub repository:
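If you want to try it quickly, below is a minimal sketch of loading the dataset with the Hugging Face datasets library. Note that the dataset identifier is an assumption based on the paper title; check the exact name on the Hub page linked above.

# pip install datasets
from datasets import load_dataset

# Assumed dataset ID; verify it on the Hugging Face Hub page linked above
dataset = load_dataset("HiTZ/This-is-not-a-dataset")

# Inspect the splits and one example: each row is a commonsense statement,
# possibly negated, with a true/false label
print(dataset)
first_split = list(dataset.keys())[0]
print(dataset[first_split][0])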
The creators of the dataset used it to evaluate popular LLMs (LLaMA, Pythia, T5, …). Their conclusion:
while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues.
Fine-tuning the LLMs on the dataset did not significantly improve their handling of negative sentences.
Experiments with this benchmark and details on its creation are given in this arXiv paper:
This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models
Run Models with Trillions of Parameters on Your Computer Thanks to 1-bit Quantization
4-bit quantization is good enough for most LLMs with billions of parameters. It works better for larger LLMs as the quantization gets more accurate with more parameters to quantize.
3-bit quantization works fine for very large language models, e.g., with more than 100B parameters such as Falcon-180B. 2-bit quantization can also produce acceptable results for very large models even though I would recommend keeping some parts at higher precision, e.g., using ExLlamav2.
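For reference, this is a minimal sketch of loading a model with 4-bit quantization using Transformers and bitsandbytes. The model name is only a placeholder; swap in the LLM you want to quantize.

# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM on the Hub

# NF4 4-bit quantization with bfloat16 compute, a common setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)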
What about 1-bit quantization?
I recently read two promising papers proposing methods to get 1-bit models:
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Switch Transformer has 1.6 trillion parameters. You would need 3.2 TB of memory just to load the model in 16-bit precision. QMoE shows that it is possible to compress its weights to an average of 0.8 bits per parameter without much accuracy loss. After compression, the model can be loaded on a machine with more than 160 GB of CPU RAM (or 8 GPUs with 24 GB of VRAM, such as RTX 3090/4090 GPUs).
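A quick back-of-the-envelope calculation makes these numbers concrete:

params = 1.6e12  # Switch Transformer: 1.6 trillion parameters

# 16-bit weights: 2 bytes per parameter, just to hold the model
fp16_bytes = params * 2
print(f"fp16: {fp16_bytes / 1e12:.1f} TB")  # ~3.2 TB

# QMoE compresses the weights to roughly 0.8 bits per parameter
qmoe_bytes = params * 0.8 / 8
print(f"QMoE (~0.8 bit): {qmoe_bytes / 1e9:.0f} GB")  # ~160 GB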
BitNet: Scaling 1-bit Transformers for Large Language Models
QMoE is a post-training quantization algorithm designed for MoE models. BitNet, on the other hand, is more flexible: it inserts and trains 1-bit layers in the Transformer architecture. In practice, it replaces “nn.Linear” with a new “BitLinear” module:
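To make the idea concrete, here is a simplified sketch of what such a binary linear layer could look like: the weights are binarized to {-1, +1} with a scaling factor in the forward pass, and a straight-through estimator lets gradients flow back to the latent full-precision weights. This is only an illustration of the principle, not the official BitNet implementation (which also quantizes the activations and includes normalization inside the layer).

import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    # Simplified 1-bit linear layer in the spirit of BitNet (illustrative only)
    def forward(self, x):
        w = self.weight
        # Per-tensor scaling factor so the output magnitude stays calibrated
        beta = w.abs().mean()
        # Binarize the weights around their mean
        w_bin = torch.sign(w - w.mean())
        # Straight-through estimator: binary weights in the forward pass,
        # gradients flow through the latent full-precision weights
        w_q = w + (w_bin * beta - w).detach()
        return F.linear(x, w_q, self.bias)

# Usage: swap nn.Linear for BitLinear inside a Transformer block
layer = BitLinear(512, 512)
out = layer(torch.randn(4, 512))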
In their evaluation, the authors of BitNet show very impressive results. It seems to compete with models quantized with 4-bit GPTQ. However, note that with BitNet the model is trained with 1-bit weights while GPTQ is a post-training quantization algorithm, i.e., the model has not been trained with low-precision weights.
BitNet keeps gradients and optimizer states in high precision for stability during training.
Llemma and Proof-Pile-2: A Model and a Dataset for Mathematics
EleutherAI released a new LLM for mathematics called Llemma.
It is based on Code Llama fine-tuned on Proof-Pile-2, a dataset collected by EleutherAI that mixes scientific papers, web data containing mathematics, and mathematical code.
Llemma is a 7 billion parameter model. You can run it on your GPU if it has 24 GB of VRAM. If you quantize it to 4-bit, it can also run on a GPU with at least 6 GB of VRAM.
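For instance, here is a sketch of loading the 7B model in 4-bit (same bitsandbytes setup as above) and prompting it. The model ID is an assumption; confirm it on the Hub page linked just below.

# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "EleutherAI/llemma_7b"  # assumed ID; check the Hub page below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

prompt = "Theorem: The sum of two even integers is even.\nProof:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))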
You can get it from the Hugging Face Hub:
There is also a bigger version with 34 billion parameters:
The Proof-Pile-2 dataset is available here:
Training details and evaluation are presented in this arXiv paper:
Llemma: An Open Language Model For Mathematics
What To Read On Substack
In this section, I recommend articles that I read on Substack:
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!
Hello,
I have some questions, please.
- Do we have to use a specific prompt style to fine-tune a model with LoRA? I mean, will the “path” taken during inference then be more likely to use the LoRA weights?
- With PEFT, we can add LoRA weights as extra weights, so why don’t we do this several times with differently calibrated LoRA adapters to get better results?
Finally, combining my two questions should give the best model so far...
I’m working on it. What do you think?