How I Built Three 9B MoE Models on a $5K-$10K Budget

Training three 9B Mixture-of-Experts models on 100B tokens each for under $10K using Google's TPU Research Cloud (TRC) program.

The Challenge

Frontier model APIs are expensive, free models suck at reasoning, and the AI tech stack costs way too much in the long run. Small Language Models (SLMs) trained for a specific domain can excel there, often beating frontier models on in-domain tasks.

The Solution: Domain-Specific SLMs

I set out to build three domain-specific SLMs as a long-term foundation for my data platforms. Google's 6th-gen (Trillium) and 4th-gen TPUs, available free for 30 days under the TRC program, were the perfect fit.

Data Sources

FineWebEdu
StackOverflow
Wikipedia
General Web Scraping
CWE Corpus
Security adversaries
Compliance docs
Regulatory texts

Setup & Architecture

To maximize TPU utilization, I used:

MaxText/JAX framework
FSDP (Fully Sharded Data Parallel)
Orbax Checkpoints on GCS
Automated scout + autopilot layer
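To see why FSDP matters at this scale, here is a back-of-the-envelope sketch of per-chip memory for model state. The byte counts (fp32 weights plus two fp32 Adam moments) and the chip counts are illustrative assumptions, not the configuration used in these runs.

```python
# Rough per-chip memory for model state under FSDP.
# Assumed (not from the original runs): fp32 weights and two fp32
# Adam moments per parameter, i.e. 12 bytes per parameter.

BYTES_PER_PARAM = 4 + 4 + 4  # weights + Adam first moment + Adam second moment


def model_state_gb(total_params: float) -> float:
    """Total weight + optimizer state in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM / 1e9


def per_chip_gb(total_params: float, num_chips: int) -> float:
    """FSDP shards the state evenly, so each chip holds 1/num_chips of it."""
    return model_state_gb(total_params) / num_chips


if __name__ == "__main__":
    params = 9e9  # 9B total parameters
    for chips in (1, 8, 32):  # hypothetical slice sizes
        print(f"{chips:>2} chips: {per_chip_gb(params, chips):.1f} GB per chip")
```

This counts only weights and optimizer state. Activations and temporary buffers come on top, so treat the numbers as a lower bound.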

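The MoE architecture activates about 2B parameters per token, giving the model the capacity of 9B parameters while paying per-token compute costs closer to those of a 2B dense model. All 9B parameters still have to be stored and loaded, so the savings are in compute, not memory.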
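The gap between total and active parameters comes down to simple arithmetic. The sketch below uses made-up sizes (1B shared parameters, 16 experts of 0.5B each, top-2 routing) chosen only because they land on 9B total and 2B active. The actual layer and expert configuration of these models is not stated here.

```python
# Total vs. active parameters in a Mixture-of-Experts model.
# All sizes are hypothetical, picked only to reach 9B total / 2B active.


def moe_param_counts(shared: float, per_expert: float,
                     num_experts: int, top_k: int) -> tuple[float, float]:
    """Return (total, active-per-token) parameter counts.

    shared      -- parameters every token uses (attention, embeddings, router)
    per_expert  -- parameters in one expert, summed over all MoE layers
    num_experts -- experts per MoE layer
    top_k       -- experts the router selects for each token
    """
    total = shared + per_expert * num_experts
    active = shared + per_expert * top_k
    return total, active


total, active = moe_param_counts(
    shared=1e9, per_expert=0.5e9, num_experts=16, top_k=2
)
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
```

Only the `top_k` selected experts run for a given token, which is why per-token compute tracks the active count while storage tracks the total.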

Training Process

The most painful part of using free TPUs is that Trillium VMs get preempted frequently. To recover automatically, I set up:

Auto setup and launch training
Scout script runs every 10 minutes
Automatically resumes training when VMs get preempted
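A scout like this can be sketched as a small script run from cron every 10 minutes. This is a minimal sketch, not the script used for these runs: the TPU name and zone are placeholders, `"MISSING"` is a sentinel invented here, and the recreate and launch steps are left to your own tooling. It assumes the TPU VM was created directly. If it was created through queued resources, the state check would differ.

```python
import subprocess

# Placeholder values; substitute your own TPU VM name and zone.
TPU_NAME = "my-tpu-vm"
ZONE = "my-zone"


def get_tpu_state(name: str, zone: str) -> str:
    """Ask gcloud for the TPU VM state; return "MISSING" if describe fails."""
    result = subprocess.run(
        ["gcloud", "compute", "tpus", "tpu-vm", "describe", name,
         "--zone", zone, "--format=value(state)"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return "MISSING"  # sentinel invented for this sketch
    return result.stdout.strip()


def decide_action(state: str, training_running: bool) -> str:
    """Map the observed state to one of: "recreate", "launch", "wait"."""
    if state in ("PREEMPTED", "TERMINATED", "MISSING"):
        return "recreate"  # VM is gone: recreate it, then relaunch training
    if state == "READY" and not training_running:
        return "launch"    # VM is up but idle: resume from the latest checkpoint
    return "wait"          # training is running, or the VM is still coming up
```

Keeping the decision in a pure function makes it easy to test without touching real TPUs. The cron job only has to call `get_tpu_state`, check whether the training process is alive, and act on the result.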

I initially made the mistake of saving every checkpoint, which piled up 40TB of checkpoint data and made Google Cloud Storage costs explode. Later, I learned to keep only the last one or two checkpoints.
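The retention rule is easy to state in code. The helper below is a hypothetical sketch that, given the step numbers of saved checkpoints, returns the ones safe to delete. In practice you may not need it: Orbax's `CheckpointManagerOptions` has a `max_to_keep` setting that applies the same policy at save time.

```python
def steps_to_delete(saved_steps: list[int], keep: int = 2) -> list[int]:
    """Return checkpoint steps that can be deleted, keeping the newest `keep`."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(saved_steps)
    # Everything except the last `keep` entries is safe to remove.
    return ordered[:-keep]


if __name__ == "__main__":
    # Hypothetical step numbers for illustration.
    print(steps_to_delete([1000, 2000, 3000, 4000], keep=2))
```

With a fixed number of retained checkpoints, storage stays flat no matter how long the run goes.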

Workflow

Setup
Data Download
Tokenize
Train
SFT + GRPO
Deploy

Post-Training

SFT + GRPO
Raw Weights
GGUF Formats
Ollama integration
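For the last step, Ollama loads a GGUF file through a Modelfile. Here is a minimal sketch that renders one. The file name, model name, system prompt, and temperature are placeholders, not the values used for these models.

```python
def build_modelfile(gguf_path: str, system_prompt: str,
                    temperature: float = 0.2) -> str:
    """Render a minimal Ollama Modelfile pointing at a local GGUF file."""
    return (
        f"FROM {gguf_path}\n"
        f"PARAMETER temperature {temperature}\n"
        f'SYSTEM """{system_prompt}"""\n'
    )


if __name__ == "__main__":
    # Placeholder path and prompt for illustration.
    print(build_modelfile("./security-slm.gguf",
                          "You are a security analysis assistant."))
```

Save the output as `Modelfile`, then register the model with `ollama create security-slm -f Modelfile` and run it with `ollama run security-slm`.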

Timeline & Costs

40+ Days of training
$5K-$10K USD in total costs
Hidden costs and miscellaneous expenses

Key Takeaways

Learn to own your AI stack. Frontier AI models and the companies behind them are a fragile dependency, and their space changes fast. Building your own models gives you control and long-term cost savings.

#ModelTraining #SmallLanguageModels #buildinpublic #AI #Data #CyberSecurity #Finance #DataEngineering #DataScience #LLM #Google #TPU
