The Challenge
Frontier model API costs are high, free models are weak at reasoning, and the AI tech stack gets far too expensive in the long run. Small Language Models (SLMs) can excel within their own domain, often beating frontier models there.
The Solution: Domain-Specific SLMs
I set out to build three domain-specific SLMs as a long-term solution to power my data platforms. Google's 6th-gen (Trillium) and 4th-gen TPUs, available free for 30 days under the TRC (TPU Research Cloud) program, were the perfect fit.
Data Sources
- FineWebEdu
- StackOverflow
- Wikipedia
- General Web Scraping
- CWE Corpus
- Security adversaries
- Compliance docs
- Regulatory texts
Setup & Architecture
To maximize TPU utilization, I used:
- MaxText/JAX framework
- FSDP (Fully Sharded Data Parallel)
- Orbax Checkpoints on GCS
- Automated scout + autopilot layer
The MoE (Mixture of Experts) architecture activates about 2B parameters per token out of roughly 9B total, so the model has the knowledge capacity of a 9B model while per-token inference compute stays closer to that of a 2B dense model.
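To make that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes the common approximation of about 2 FLOPs per active parameter per token for a forward pass, 2 bytes per parameter (bf16), and that 9B is the total parameter count. The function names and these figures are illustrative assumptions, not measurements from the models described here.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token (~2 FLOPs per active parameter)."""
    return 2.0 * active_params


def weight_memory_gb(total_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return total_params * bytes_per_param / 1e9


TOTAL_PARAMS = 9e9   # all experts combined (assumed to be the post's 9B figure)
ACTIVE_PARAMS = 2e9  # parameters actually used per token (from the post)

moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_9b_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"MoE compute relative to a dense 9B model: {moe_flops / dense_9b_flops:.0%}")
print(f"Weights that must stay in memory: {weight_memory_gb(TOTAL_PARAMS):.0f} GB")
```

The saving is in compute per token only. All expert weights still have to be loaded, so memory requirements follow the total parameter count, not the active one.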
Training Process
The most painful part of using free TPUs is that Trillium VMs get preempted frequently. To recover without babysitting the run, I automated:
- Setup and training launch
- A scout script that runs every 10 minutes
- Resuming training whenever a VM gets preempted
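The scout's core decision can be sketched as a small pure function plus a polling loop. This is a minimal sketch under stated assumptions, not the actual script: the state strings, `decide_action`, `scout_loop`, and the injected callables (`get_state`, `is_training`, `recreate_vm`, `resume_training`) are all hypothetical names that would wrap whatever cloud CLI or API calls you actually use.

```python
import time
from typing import Callable, Optional

POLL_SECONDS = 10 * 60  # the scout runs every 10 minutes


def decide_action(vm_state: str, training_running: bool) -> str:
    """Map the observed state to 'recreate', 'resume', 'wait', or 'none'.

    The state names are illustrative assumptions, not a specific API's values.
    """
    if vm_state in {"PREEMPTED", "TERMINATED", "MISSING"}:
        return "recreate"  # the VM is gone, so bring it back first
    if vm_state == "READY":
        return "none" if training_running else "resume"
    return "wait"  # e.g. the VM is still being created


def scout_loop(
    get_state: Callable[[], str],
    is_training: Callable[[], bool],
    recreate_vm: Callable[[], None],
    resume_training: Callable[[], None],
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the VM and react; max_polls=None means run until interrupted."""
    polls = 0
    while max_polls is None or polls < max_polls:
        action = decide_action(get_state(), is_training())
        if action == "recreate":
            recreate_vm()
        elif action == "resume":
            resume_training()  # expected to restart from the latest checkpoint
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(POLL_SECONDS)
```

Keeping the decision logic separate from the cloud calls makes it possible to test the scout without touching a real VM.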
I initially made the mistake of saving every checkpoint, which piled up 40TB of checkpoint data and sent my Google Cloud Storage costs through the roof. Later, I learned to keep only the last one or two checkpoints.
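The retention rule itself is simple. Here is a minimal sketch, assuming checkpoints are identified by their integer training step; `steps_to_delete` is a hypothetical helper, not part of any framework. Orbax's `CheckpointManagerOptions` also has a `max_to_keep` setting that applies the same idea at save time, which is worth checking before writing your own cleanup.

```python
def steps_to_delete(saved_steps: list[int], keep_last: int = 2) -> list[int]:
    """Return the checkpoint steps that can be removed, keeping the newest `keep_last`."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    ordered = sorted(set(saved_steps))  # oldest first, duplicates dropped
    return ordered[:-keep_last]


saved = [1000, 2000, 3000, 4000, 5000]
print(steps_to_delete(saved))  # [1000, 2000, 3000]
```

Keeping two checkpoints instead of one leaves a fallback if the newest save is interrupted mid-write by a preemption.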
Workflow
- Setup
- Data Download
- Tokenize
- Train
- SFT + GRPO
- Deploy
Post-Training
- SFT + GRPO
- Raw Weights
- GGUF Formats
- Ollama integration
Timeline & Costs
- 40+ days of training
- $5K-$10K USD in total costs
- Hidden costs and miscellaneous expenses
Key Takeaways
Learn to own your AI stack. Frontier AI models, and the companies behind them, are fragile, with a lot changing in their space. Building your own models gives you control and long-term cost savings.