Architecture

Sparse where it counts

A mixture-of-experts transformer that routes each token to the two experts that matter. You get the knowledge of a larger model at the inference cost of a small one.

Model parameters

Model typeSparse MoE

Total parameters9B

Active parameters2.6B / token

Experts per layer8

Expert routingTop-2

Context window32,768

Vocabulary49,152

Transformer block

Hidden dimension2048

Layers32

Attention heads16

KV heads (GQA)4

MLP dim / expert5504

Positional encodingRoPE

Norm / activationRMSNorm / SwiGLU

What it is good at

Trained on the work, not just the web

Post-trained on curated data engineering and data science tasks, then reinforcement-tuned against executed code.

☷

SQL, expertly

Generation, optimization, execution-plan reasoning, and dialect translation like Snowflake to BigQuery.

⚡

PySpark pipelines

Author and optimize Spark pipelines end to end, with lakehouse and streaming and CDC patterns built in.

✦

ThinkingLanguage native

Speaks TL fluently and emits declarative Bonacci pipelines you can run as they are.

▸

Pipeline autopilot

Autonomous multi-step orchestration from design to deploy, reasoning across the whole pipeline.

◇

Tool use and MCP

Calls tools, retrieves with RAG-aware generation, and uses extended thinking for hard problems.

☷

Modeling and DQ

Data modeling, quality and observability, and security-aware generation across the lakehouse.

How it was built

Pretrain, then learn from executed code

01 / pretrain

General corpus

Broad language understanding, code, and reasoning foundations.

stage 2 · 90B tokens

02 / pretrain

FineWeb-Edu

High-quality educational and technical web content.

03 / sft

Supervised fine-tuning

Curated DE and DS pairs: SQL, PySpark, ThinkingLanguage, pipeline blueprints, and schemas.

15K+ pairs

04 / rl

GRPO with code execution

Reward-shaped on SQL correctness, query-plan quality, and PySpark execution.

230 test cases across 6 benchmarks

TPU v6e-16 on Google CloudMaxText frameworkRoPE + RMSNorm + SwiGLUGQA attention

When it ships

Serve it like any open model

An OpenAI-compatible endpoint with vLLM, or load it in Transformers. bfloat16 weights, 32K context.

vllm

# OpenAI-compatible endpoint
$ vllm serve thinkingdbx/bonacci-moe-9b \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90

transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "thinkingdbx/bonacci-moe-9b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16)

Model id and weights publish at launch.

Coming soon

Be first to run Bonacci MoE 9B

It already powers parts of Bonacci Studio. When the weights publish, we will email everyone who asked. No spam, one message at launch.

Notify me at launch Read how it was trained GitHub