micromodels

Perspectives on AI/ML from the trenches.
https://micromodels.com/

Scaling laws revisited: a closer look at compute-optimal inference
https://micromodels.com/news/scaling-laws-revisited/

The original Chinchilla scaling laws were derived in a regime where training compute dominated the cost of a model over its lifetime. That assumption no longer holds for most deployed systems. A new paper revisits the compute-optimal frontier under an inference-heavy workload model, where the breakeven point between training cost and serving cost shifts the optimal model size and training-token ratio substantially.

What I found interesting is the framing: rather than asking "what is the largest model I can train for a fixed budget?", the authors ask "given a target serving-traffic profile, what training recipe minimizes total cost?". The answer depends sharply on traffic shape (bursty vs. steady-state, peak vs. average) and on hardware amortization assumptions that are usually left implicit.

The practical takeaway is that the model's "compute-optimal" size is not a single number but a function of how you expect to serve it. For teams building products around a single model, this matters: the same training budget produces very different models depending on the deployment assumption.

I'm not entirely sold on the cost model (it assumes a particular hardware generation and a particular amortization schedule), but the framing is useful and the empirical fit is good.
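
To make the training-versus-serving tradeoff concrete, here is a minimal back-of-the-envelope sketch. It is not the paper's cost model: it ignores traffic shape and hardware amortization entirely and relies on the common rule-of-thumb approximations for a dense transformer (about 6 FLOPs per parameter per training token, about 2 FLOPs per parameter per served token). All function names are hypothetical, and the comparison assumes the two configurations reach comparable quality, which is an assumption of mine rather than a result.

```python
# Rough lifetime-compute accounting: training plus serving.
# Assumptions (not from the paper): dense transformer,
# ~6 FLOPs/param/training token, ~2 FLOPs/param/served token.

def training_flops(n_params: float, train_tokens: float) -> float:
    """Approximate compute to train a model once."""
    return 6.0 * n_params * train_tokens


def serving_flops(n_params: float, served_tokens: float) -> float:
    """Approximate compute to serve a given token volume."""
    return 2.0 * n_params * served_tokens


def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training plus serving compute over the model's deployed life."""
    return training_flops(n_params, train_tokens) + serving_flops(n_params, served_tokens)


def breakeven_served_tokens(train_tokens: float) -> float:
    """Served tokens at which serving compute equals training compute.

    Solves 2 * N * T = 6 * N * D for T, so the model size cancels out.
    """
    return 3.0 * train_tokens


def crossover_served_tokens(big, small):
    """Served-token volume above which the smaller model is cheaper overall.

    `big` and `small` are (n_params, train_tokens) pairs. Returns None when
    the "small" model does not actually save compute per served token.
    """
    n_big, d_big = big
    n_small, d_small = small
    per_token_saving = 2.0 * (n_big - n_small)
    if per_token_saving <= 0:
        return None
    extra_training = training_flops(n_small, d_small) - training_flops(n_big, d_big)
    # If the small model is also cheaper to train, it wins from the first token.
    return max(0.0, extra_training / per_token_saving)


if __name__ == "__main__":
    # Hypothetical configurations, assumed (not shown) to be of similar quality.
    big = (70e9, 1.4e12)    # larger model, fewer training tokens
    small = (30e9, 4.0e12)  # smaller model, trained on more tokens
    print("crossover served tokens:", crossover_served_tokens(big, small))
```

Even in this toy version, which configuration is "optimal" flips with expected serving volume, which is the point of the paper's framing.
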
Worth a read if you're sizing a training run today.
Thu, 22 Jan 2026 00:00:00 GMT

Notes on building a small mixture-of-experts from scratch
https://micromodels.com/blog/tiny-moe-from-scratch/

A walkthrough of training a tiny MoE on a single GPU: routing collapse, load-balancing losses, and what the literature glosses over.
Tue, 20 Jan 2026 00:00:00 GMT

Anthropic publishes a constitution for agentic assistants
https://micromodels.com/news/anthropic-constitution/

Anthropic released a 14-page document outlining behavioral norms for long-horizon agents. The constitution covers when an agent should ask for clarification versus when it should act autonomously, how it should handle ambiguous instructions, and what kinds of refusal are appropriate in high-stakes contexts.

The document is structured as a set of principles rather than hard rules, which I think is the right call given how varied agent deployments are today. It covers delegation boundaries, escalation paths, and the tricky question of how much context an agent should retain across sessions.

What stands out is the emphasis on uncertainty communication. The constitution explicitly directs agents to signal confidence levels when the path forward isn't clear. This is a refreshing departure from the typical assistant paradigm where every answer is delivered with equal conviction.

If you build tool-using systems, this is worth reading even if you disagree with specific principles. Having any published framework to react against is better than designing in a vacuum.
Sun, 18 Jan 2026 00:00:00 GMT

OpenAI releases structured-output guarantees for o-series models
https://micromodels.com/news/openai-structured-outputs/

OpenAI's new API mode promises schema-conformant JSON with >99% reliability.
The mechanism works by constraining the model's output logits at generation time, effectively forcing it to produce valid JSON matching a provided schema. This is a meaningful improvement over prompting-based approaches where you cross your fingers and parse the result.

The implications for agent loops are significant. Structured tool calls have been a pain point in production systems: parsing failures, schema violations, and retry logic add complexity and latency. A guaranteed-structured output removes an entire class of failure modes from the agentic stack.

The caveat is that the guarantee applies to syntax, not semantics. The model can still produce a valid JSON object with incorrect or nonsensical field values. Schema conformance doesn't mean correct reasoning. So while this reduces parsing overhead, validation logic is still necessary.

Still, this is the right direction. Making structured output a first-class API primitive rather than an emergent behavior of prompting raises the floor for production reliability.
Mon, 12 Jan 2026 00:00:00 GMT

Why your eval set is lying to you (part 1)
https://micromodels.com/blog/eval-set-lying-1/

Contamination, format leakage, and the quiet ways benchmark scores stop reflecting real-world capability. With a worked example.
Thu, 08 Jan 2026 00:00:00 GMT

A survey of mechanistic interpretability at the 2025 frontier
https://micromodels.com/news/mech-interp-survey/

This 80-page survey covers the state of mechanistic interpretability across the largest open-weight models as of late 2025. It covers sparse autoencoders, attention-pattern probing, circuit-level analysis, and activation patching, with detailed comparisons of what works at scale versus what only works in toy settings.

The section on sparse autoencoders is particularly useful.
It consolidates findings from several labs on how reconstruction fidelity scales with dictionary size, and where current approaches still fail (particularly on rare features and compositional behaviors). The authors don't shy away from the limitations.

The survey also catalogs a shift in the field toward automated circuit discovery. Manual circuit analysis was the norm in 2023–2024, but the community is converging on tools that can propose and test mechanistic hypotheses with minimal human intervention. Whether these tools actually produce faithful explanations is still an open question.

Dense but well-organized. Each section has a concrete recommendations box for practitioners, which makes it more useful than most surveys in this space.
Mon, 05 Jan 2026 00:00:00 GMT

A practitioner's guide to small-model fine-tuning
https://micromodels.com/blog/small-model-finetuning/

LoRA, QLoRA, DoRA: what actually matters when you're fine-tuning a 7B model on a single consumer GPU. Plus the gotchas nobody mentions.
Tue, 30 Dec 2025 00:00:00 GMT

RAG is mostly a retrieval problem: a year-end retrospective
https://micromodels.com/news/rag-retrospective/

This industry post-mortem argues that most RAG failures trace to retrieval quality, not generation. The authors analyzed hundreds of RAG pipeline logs across production deployments and found that when the system produced a bad answer, the root cause was almost always in the retrieval step: missing documents, low-ranked relevant passages, or context windows cluttered with noise.

The concrete benchmarks are instructive. The authors propose a retrieval-centric evaluation framework that decouples retriever performance from generator performance, making it possible to pinpoint which component needs improvement. This is obvious in retrospect but rarely done in practice.
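
As a rough illustration of what "decoupling" can look like, the sketch below scores the retriever alone against labeled relevant documents, with no generator involved. It is not the framework from the post-mortem. It uses two standard retrieval metrics, recall@k and mean reciprocal rank, as stand-ins, and the function names and input format are hypothetical.

```python
# Retriever-only evaluation: score ranked document IDs against a labeled set
# of relevant IDs, without ever calling the generator.
from typing import Dict, List, Sequence, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = relevant_set & set(retrieved[:k])
    return len(hits) / len(relevant_set)


def reciprocal_rank(retrieved: Sequence[str], relevant: Sequence[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


def evaluate_retriever(
    runs: List[Tuple[Sequence[str], Sequence[str]]], k: int = 5
) -> Dict[str, float]:
    """Average both metrics over (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    recalls = [recall_at_k(ret, rel, k) for ret, rel in runs]
    rranks = [reciprocal_rank(ret, rel) for ret, rel in runs]
    return {
        "recall_at_k": sum(recalls) / len(runs),
        "mrr": sum(rranks) / len(runs),
    }


if __name__ == "__main__":
    # Toy data: two queries with hand-labeled relevant document IDs.
    runs = [
        (["d3", "d7", "d1"], ["d7", "d9"]),  # one of two relevant docs found, at rank 2
        (["d2", "d4", "d5"], ["d8"]),        # relevant doc never retrieved
    ]
    print(evaluate_retriever(runs, k=3))
```

If these numbers are low, no amount of prompt tuning on the generation side will fix the answers, which is the diagnosis the authors are pushing for.
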
The practical recommendations: invest in chunking strategy, metadata filtering, and hybrid search before worrying about prompt engineering or fine-tuning the generator. Most teams over-optimize the generation side because it's more visible, while leaving obvious retrieval gaps unaddressed.

This resonates with my experience. I've seen teams spend weeks tuning system prompts for a RAG pipeline that was fundamentally limited by embedding quality and chunk boundary choices. Measure retrieval first.
Sun, 28 Dec 2025 00:00:00 GMT