OpenAI GPT-OSS: Native 4-Bit MoE Models

Everything you need to know about GPT-OSS 20B and 120B, MXFP4 quantization, and running on Blackwell & H100 GPUs -- The Weekly Kaitchup #104 Special Edition

Benjamin Marie
Aug 07, 2025

Hi Everyone!

This is a special edition of the Weekly Kaitchup covering the GPT-OSS release by OpenAI.


After months of anticipation and repeated promises, OpenAI has finally released its open-weight models:

  • gpt-oss

GPT-OSS includes two mixture-of-experts (MoE) checkpoints released under the Apache 2.0 license. GPT-OSS 20B activates 3.6 billion parameters per token out of 21 billion in total and runs on 16 GB consumer GPUs. The larger GPT-OSS 120B has 117 billion total parameters and fits on a single 80 GB H100 GPU thanks to MXFP4 quantization.

Yes, MXFP4! With native support for this format now available on Blackwell GPUs, including the consumer-grade versions, the emergence of MXFP4-native models was only a matter of time. Hopper GPUs also support MXFP4, but they are large and costly, which limits their accessibility. Update: Hugging Face has updated Transformers so that MXFP4 checkpoints can now be loaded on any GPU with compute capability >= 7.5 (e.g., T4, A100, L4, H100, or B200).
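
For reference, here is a minimal loading sketch with Transformers, assuming a recent release with MXFP4 support and the openai/gpt-oss-20b checkpoint on the Hugging Face Hub; on GPUs without MXFP4 kernels, the weights may be dequantized to a higher-precision format at load time.

```python
# Minimal loading sketch (assumes a recent Transformers release with MXFP4
# support and enough GPU memory for the 20B checkpoint).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # place the layers on the available GPU(s)
)

messages = [{"role": "user", "content": "Explain MXFP4 in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```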

In this article, we’ll review the GPT-OSS models and see how they can be run, including on older GPUs.

Architecture of GPT-OSS

GPT-OSS uses a token-choice mixture-of-experts (MoE) architecture, where a small, fixed number of experts is selected for each token. The routing weights are obtained by applying a softmax after selecting the top-k scoring experts, a scheme commonly referred to as softmax-after-topk. Each expert is a standard SwiGLU feedforward block, which improves non-linearity and training stability over plain ReLU.
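
To make the routing concrete, here is a minimal, illustrative sketch of softmax-after-topk routing in PyTorch (the function and shapes are my own, not GPT-OSS source code): the router scores all experts, keeps only the top-k logits, and normalizes those with a softmax.

```python
# Illustrative sketch of softmax-after-topk routing (not GPT-OSS source code).
import torch

def route_tokens(router_logits: torch.Tensor, top_k: int):
    """router_logits: (num_tokens, num_experts) raw scores from the router."""
    # 1) Pick the top-k experts per token *before* any normalization.
    topk_logits, topk_indices = torch.topk(router_logits, top_k, dim=-1)
    # 2) Apply the softmax only over the selected experts
    #    (softmax-after-topk), so the k routing weights sum to 1.
    topk_weights = torch.softmax(topk_logits, dim=-1)
    return topk_weights, topk_indices

# Example: 4 tokens, each routed to 4 of 32 experts.
logits = torch.randn(4, 32)
weights, indices = route_tokens(logits, top_k=4)
print(weights.sum(dim=-1))  # each row sums to 1.0
```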

Attention layers alternate between two mechanisms: full-context attention and a tiny 128-token sliding window. This alternation allows the model to manage longer sequences efficiently while maintaining detailed local context. To support extended sequences, Rotary Positional Embeddings (RoPE) are used, enabling context lengths of up to 128,000 tokens.
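
As a small illustration (again, not GPT-OSS code), here is how a causal mask restricted to a 128-token sliding window differs from a full causal mask; the alternating layers simply use one masking pattern or the other.

```python
# Illustrative sketch: full causal mask vs. 128-token sliding-window mask.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where a query position (row) may attend to a key position (column).
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def sliding_window_mask(seq_len: int, window: int = 128) -> torch.Tensor:
    full = causal_mask(seq_len)
    # Additionally forbid keys more than `window - 1` tokens behind the query.
    offsets = torch.arange(seq_len).unsqueeze(1) - torch.arange(seq_len).unsqueeze(0)
    return full & (offsets < window)

masks = {"full": causal_mask(1024), "sliding": sliding_window_mask(1024)}
print({name: int(m.sum()) for name, m in masks.items()})  # far fewer attended pairs with the window
```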

A learned attention sink is introduced in each attention head. It adds a learned stabilizing term to the denominator of the softmax, improving numerical robustness across long inputs and complex token dependencies.
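
Here is an illustrative sketch of that idea, assuming the sink is implemented as a learned logit appended to the attention scores before the softmax and dropped afterwards; this is equivalent to adding exp(sink) to the softmax denominator.

```python
# Illustrative sketch of a per-head attention sink (not GPT-OSS source code).
import torch

def softmax_with_sink(scores: torch.Tensor, sink_logit: torch.Tensor) -> torch.Tensor:
    """scores: (..., num_keys) attention logits; sink_logit: a learned scalar."""
    # Append the sink logit as a virtual extra key, ...
    sink = sink_logit * torch.ones_like(scores[..., :1])
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # ... then drop its probability mass: the real weights no longer sum to 1,
    # which stabilizes rows where all real scores are small.
    return probs[..., :-1]

scores = torch.randn(2, 8, 16)   # (heads, queries, keys)
sink = torch.zeros(())           # a learned parameter in a real model
weights = softmax_with_sink(scores, sink)
print(weights.sum(dim=-1))       # each row sums to less than 1.0
```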

The tokenizer is fully aligned with GPT-4o and other OpenAI API models, ensuring interoperability and ease of integration. Additionally, GPT-OSS introduces a set of new tokens designed specifically for compatibility with the OpenAI Responses API, enabling structured outputs and consistent behavior in API-based deployments.
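
If you want to see these tokens for yourself, a generic way (standard Transformers usage, not a documented GPT-OSS recipe) is to load the tokenizer and print its registered special tokens:

```python
# Inspect the GPT-OSS tokenizer's special tokens (generic Transformers usage).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
print(tokenizer.special_tokens_map)          # standard special tokens (eos, pad, ...)
print(tokenizer.additional_special_tokens)   # any extra special tokens registered by the checkpoint
```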

Overall, the architecture is fairly standard for MoE models. The most notable difference lies in the balance between the number of layers, attention heads, and hidden dimensions.

For example, when comparing GPT-OSS 20B to Qwen3-30B-A3B:

Number of Layers

  • Qwen3-30B-A3B: 48

  • GPT-OSS 20B: 24

Attention Heads

  • Qwen3-30B-A3B: 32

  • GPT-OSS 20B: 64

Intermediate Hidden Size

  • Qwen3-30B-A3B: 768

  • GPT-OSS 20B: 2,880

Despite its smaller parameter count, GPT-OSS 20B has half as many layers, twice as many attention heads, and much larger feedforward layers. This makes it a much wider and shallower model than is typical at this scale.
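
You can check these numbers yourself by pulling the configurations from the Hugging Face Hub; the sketch below uses standard Transformers config fields, and the exact attribute names can vary slightly between architectures.

```python
# Compare architecture hyperparameters straight from the Hub configs.
from transformers import AutoConfig

for model_id in ("openai/gpt-oss-20b", "Qwen/Qwen3-30B-A3B"):
    cfg = AutoConfig.from_pretrained(model_id)
    print(
        model_id,
        "layers:", cfg.num_hidden_layers,
        "heads:", cfg.num_attention_heads,
        "expert intermediate size:", getattr(cfg, "moe_intermediate_size", cfg.intermediate_size),
    )
```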

GPT-OSS: Are They Good?
