The Weekly Kaitchup #1
Affordable RLHF, llama2.rs, ChatGPT's custom instructions for free, a new open LLM called SILO, and the NeurIPS 2023 LLM Efficiency Challenge
This is the first edition of The Weekly Kaitchup, in which I briefly comment on recent scientific papers, new frameworks, tips, and open models/datasets, focusing on affordable and computationally efficient AI.
If there is something in particular you want me to explain in-depth, please drop a comment and I’ll see what I can do.
Thank you for your support and have a nice weekend!
Affordable RLHF with DeepSpeed-Chat
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales (Yao et al., 2023)
Microsoft is still actively improving its DeepSpeed library. In this paper, the DeepSpeed team presents a new end-to-end framework for Reinforcement Learning from Human Feedback (RLHF), the training method used to train chat models such as ChatGPT.
They claim that, thanks to various optimizations, this framework is one of the easiest ways to train InstructGPT-style (ChatGPT-like) models.
The performance gain compared to other frameworks looks impressive. However, they mostly benchmark configurations using several A100 GPUs, which are far from affordable. It would be interesting to see whether the gain holds when using only one consumer GPU.
You will find some code samples of DeepSpeed-Chat here.
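If I read the repository correctly, a full three-step run (supervised fine-tuning, reward model training, then PPO) can be launched through a single train.py script. Here is a minimal sketch, assuming the flags documented in the repository’s README; verify them against the version you clone:

```python
# Minimal sketch: launching DeepSpeed-Chat's three-step RLHF pipeline
# (SFT, reward model, PPO) through its train.py entry point, run from
# the DeepSpeed-Chat directory of the DeepSpeedExamples repository.
# The model names and deployment type are examples; check the README
# of the version you clone for the exact flags.
import subprocess

subprocess.run(
    [
        "python", "train.py",
        "--actor-model", "facebook/opt-1.3b",    # the model to align with RLHF
        "--reward-model", "facebook/opt-350m",   # a smaller model scoring responses
        "--deployment-type", "single_gpu",       # also: single_node, multi_node
    ],
    check=True,
)
```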
Side note: The paper is difficult to read and poorly formatted (text overflow, variable font sizes, low-resolution figures). I wonder why Microsoft did not take more time to finalize it. It’s unusual.
Llama 2 in Rust
Various C++ implementations support Llama 2; llama.cpp is the most popular one. I tried it for this article, originally published in Towards AI:
Sasha Rush is working on a new one-file Rust implementation of Llama 2. It’s a Rust port of Karpathy's llama2.c.
While this project is clearly in an early development phase, it’s already very impressive. It achieves 7.9 tokens/sec for Llama 2 7B and 0.9 tokens/sec for Llama 2 70B, both quantized with GPTQ.
You can learn about GPTQ for Llama 2 here:
Sasha claimed on X (Twitter…) that he could run the 70B version of Llama 2 using only the CPU of his laptop. But of course, it’s very slow (5 tokens/min).
If you understand Rust, I recommend reading the code. It gives a lot of ideas to efficiently deal with the quantization and dequantization of LLMs.
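If you don’t read Rust, here is the core idea in Python: a minimal sketch of group-wise 4-bit quantization and dequantization, the kind of storage scheme GPTQ-quantized weights rely on. This is my own simplification (symmetric rounding, no zero points), not the actual llama2.rs or GPTQ code:

```python
# Minimal sketch of group-wise 4-bit quantization/dequantization, the
# kind of storage scheme GPTQ-quantized weights rely on. A simplified
# symmetric scheme for illustration, not the llama2.rs or GPTQ code.
import numpy as np

GROUP_SIZE = 128  # weights per group sharing one scale

def quantize(weights: np.ndarray):
    """Quantize a 1-D float array to int4 values (stored as int8) plus per-group scales."""
    groups = weights.reshape(-1, GROUP_SIZE)
    # Symmetric int4 range [-7, 7]; one float scale per group
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights: one multiply per value, one scale per group."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```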
SILO: An LLM exclusively trained on public domain and permissively licensed text
SILO is a new LLM with the particularity that it is trained on a curated corpus of 228 billion tokens of public domain and permissively licensed text.
They released the data as a new corpus called the Open License Corpus (OLC). OLC is already available on the Hugging Face Hub and distributed under an Apache 2.0 license (commercial use allowed).
To the best of my knowledge, this is the largest corpus of this kind.
The model itself is rather small, with 1.3B parameters. It uses the LLaMA architecture and the GPT-NeoX tokenizer. SILO 1.3B is also available on the Hugging Face Hub.
Another particularity of this model is that it can retrieve information from an external datastore during inference. That may explain their choice of a small number of parameters for the model.
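Retrieval-augmented LMs of this kind typically interpolate the model’s next-token distribution with a distribution built from nearest neighbors found in the datastore, as in kNN-LM. Here is a minimal sketch of that interpolation (my own simplification, not SILO’s actual code):

```python
# Minimal sketch of kNN-LM-style retrieval: blend the LM's next-token
# distribution with a distribution built from the nearest neighbors
# retrieved from a datastore. A simplification, not SILO's actual code.
import numpy as np

def knn_lm_probs(p_lm, neighbor_tokens, neighbor_dists, lam=0.25):
    """
    p_lm: the LM's next-token distribution, shape (vocab_size,)
    neighbor_tokens: next-token ids stored with the retrieved neighbors
    neighbor_dists: distances between the query and each neighbor
    lam: weight of the retrieval distribution in the interpolation
    """
    # Closer neighbors get larger weights
    weights = np.exp(-np.asarray(neighbor_dists, dtype=np.float64))
    weights /= weights.sum()
    p_knn = np.zeros_like(p_lm)
    for token, w in zip(neighbor_tokens, weights):
        p_knn[token] += w
    # The datastore can be swapped or extended without retraining the LM
    return lam * p_knn + (1.0 - lam) * p_lm

vocab_size = 32000
p_lm = np.full(vocab_size, 1.0 / vocab_size)  # toy uniform distribution
p = knn_lm_probs(p_lm, neighbor_tokens=[42, 42, 7], neighbor_dists=[0.1, 0.2, 0.9])
print(p[42], p[7])
```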
SILO is a work by the University of Washington, UC Berkeley, and the Allen Institute for AI (Min et al., 2023).
OpenAI introduces custom instructions in the free plan of ChatGPT
OpenAI released custom instructions for ChatGPT on July 20th, initially for paid users only. Now, free ChatGPT users can also use this feature.
Custom instructions let you explicitly define your characteristics as a user and set constraints on ChatGPT’s responses. Technically, it was already possible to do this in the prompt itself, but it seems that OpenAI improved the model to make sure it takes the instructions into account.
The instructions are also carried over the entire conversation, removing the need to enter them again with each prompt.
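Custom instructions are a feature of the ChatGPT interface, but if you use the API, the closest equivalent is a system message sent once per conversation. A minimal sketch with the openai Python package (the model name and instructions below are just examples):

```python
# Minimal sketch: the API-side analogue of custom instructions is a
# system message sent once at the start of the conversation. Uses the
# openai Python package (0.x API); reads the key from OPENAI_API_KEY.
import openai

custom_instructions = (
    "I'm a machine learning engineer working with consumer GPUs. "
    "Keep answers concise and include code when relevant."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # example model name
    messages=[
        # The system message plays the role of custom instructions:
        # it applies to every turn without being repeated in each prompt.
        {"role": "system", "content": custom_instructions},
        {"role": "user", "content": "How should I serve Llama 2 7B on one GPU?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```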
The NeurIPS 2023 LLM Efficiency Challenge: Starter Guide by Lightning AI
NeurIPS is hosting a new competition this year: the LLM Efficiency Challenge. In this competition, participants must train an LLM within 24 hours using only one GPU.
It’s not too late if you want to participate.
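To give an idea of what fits in such a budget, here is a minimal sketch of single-GPU fine-tuning with LoRA using Hugging Face transformers and peft. It is only an illustration, not the challenge’s official starter kit; the model, dataset, and hyperparameters are examples:

```python
# Minimal sketch of single-GPU LoRA fine-tuning with transformers + peft.
# Illustration only, not the challenge's official starter kit; the model,
# dataset, and hyperparameters are examples -- pick what fits your GPU.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "openlm-research/open_llama_3b"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA: train small adapter matrices instead of all the parameters
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = example["instruction"] + "\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=512)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=4,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           fp16=True, logging_steps=50),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

With a quantized base model (QLoRA-style), the same recipe fits much larger models into a single GPU’s memory.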
Sebastian Raschka wrote a complete guide for this challenge.
Sebastian also writes a newsletter here on Substack, Ahead of AI. I recommend it: