Model Releases

  • NVIDIA Cosmos3-Nive — NVIDIA’s compact omnimodal model that handles text, image, video, audio, and action. Supports vLLM-omni inference if you want to run it without melting your GPU setup.
  • Prism ML Bonsai Ternary 4B with GemLite 2-bit Quantization — Ternary weights at 1.58-bit plus GemLite 2-bit quantization for a 4B text-to-image diffusion model. Extreme quantization meets custom CUDA kernels — the kind of thing that makes “just use fp16” people nervous.
  • OpenMOSS MOSS-TTS-v1.5 — Multilingual TTS covering 27 languages including Mandarin, Cantonese, Japanese, Korean, and most of Europe. Apache 2.0 licensed with a backing arXiv paper, so you can actually ship with it.
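
For anyone wondering where "1.58-bit" comes from: a ternary weight takes one of three values (-1, 0, +1), and log2(3) is about 1.58 bits. Here is a minimal sketch of absmean ternary quantization in the style popularized by BitNet b1.58. It is an illustration only, not Prism ML's or GemLite's actual code, and the names `ternarize` and `dequantize` are made up for this example.

```python
import math

# Information content of one ternary symbol: log2(3) ~= 1.585 bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternarize(weights, eps=1e-8):
    """Map each weight to -1, 0, or +1 and return one shared scale.

    Assumption: absmean scaling (scale = mean of |w|), as in BitNet b1.58.
    The Bonsai model may use a different scheme.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [c * scale for c in codes]


codes, scale = ternarize([0.9, -0.05, -1.2, 0.3])
print(codes)  # [1, 0, -1, 0]
```

Small weights collapse to zero and large ones saturate at plus or minus the scale, which is why these schemes usually lean on quantization-aware training rather than post-hoc rounding.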

Open Source Releases

  • Claude Code v2.1.160 — Now prompts before touching shell startup files and git config, because nobody wants an AI quietly rewriting their .zshenv at 2 AM. Also warns before editing build-tool configs that could grant code execution.
  • smarts.md MCP Server — Live docs for verified smart contracts, queryable via MCP. Lets AI agents pull real-time contract documentation instead of hallucinating ABI details. Its 3,185 installs suggest people are actually using this.

Research Worth Reading

  • BitsMoE — Spectral energy-guided bit allocation for MoE LLM quantization at ultra-low bits. Unlike pruning (which throws capacity away forever), this tries to keep quality while actually fitting the model in memory.
  • CAST — Fixes the GRPO failure mode where all sampled trajectories for a prompt are uniformly right or wrong, leaving the model with zero useful gradient signal. Clipped asymmetric self-teaching with advantage flipping sounds like exactly the kind of thing you wish had existed six months ago.
  • BudgetDraft — Speculative decoding that actually accounts for the drafter’s resource constraints. Trains the drafter with token acceptance awareness while using a sparse KV cache, since the drafter and verifier rarely share the same memory budget in practice.
  • TIGER — Graph-based evidence routing for fact-level hallucination repair in multimodal generation. Avoids the circular feedback problem where you condition on the output you’re trying to verify.
  • MindZero — Theory of Mind inference learned online, no annotations required. Agents update their hypotheses about what a user knows, intends, or believes mid-interaction and adapt their behavior accordingly, which matters for assistive AI that has to track shifting human needs without explicit labels.
  • RAFT — Tackles catastrophic forgetting in domain fine-tuning through data refinement and adaptive distillation. Closes the gap between what the domain data expects and what the model naturally wants to say, then distills to keep general capabilities intact.
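
The GRPO failure mode that CAST targets is easy to see in a few lines. GRPO scores each sampled trajectory relative to its group, so when every sample for a prompt earns the same reward, every advantage is zero and the prompt contributes no gradient. The sketch below shows only the failure mode, not CAST's fix; `group_advantages` is a hypothetical helper, and the use of population standard deviation is an assumption, since implementations differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct samples get positive advantage, wrong ones negative.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))

# Uniform outcomes (all right or all wrong): every advantage is exactly zero.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

Very easy and very hard prompts both land in the uniform case, so a meaningful share of each batch can end up contributing nothing.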

AI Dev Tools

  • OpenCode v1.15.13 — Custom session metadata via API/SDK, plus a fix for Gateway Anthropic Opus 4.7+ returning empty thinking blocks. Metadata on sessions sounds small until you’re debugging a complex agent pipeline and actually need it.
  • fff (Fast File Finder) — File search toolkit for AI agents, Neovim, and general dev workflows. Rust, C, and Node.js support. Because “find that file” is the bottleneck nobody talks about until their agent is timing out on a 50k-file repo.
  • luongnv89/claude-howto — Visual, example-driven Claude Code guide from basics to advanced agents, with copy-paste templates. For those who learn better from “here’s exactly what to type” than from reading docs for an hour.

Today’s Synthesis

The through-line today is making large models actually run where you need them: on constrained hardware, in production, without losing your mind. BitsMoE tackles the memory wall for MoE models with spectral energy-guided bit allocation, while Prism ML Bonsai Ternary 4B pushes a diffusion model down to 1.58-bit ternary weights plus GemLite 2-bit quantization with custom CUDA kernels. Together they point to a practical playbook: instead of just shrinking models uniformly, you allocate precision where the model actually needs it and compress aggressively everywhere else. If you’re deploying diffusion or MoE models at the edge or on a budget GPU cluster, the combination of structured quantization and smarter bit allocation is the difference between “fits in memory” and “fits in memory at acceptable quality.” On the LLM side, pair quantization with BudgetDraft’s resource-aware speculative decoding and you’ve got an inference stack that respects real-world constraints from weights to latency.
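
To make "allocate precision where the model actually needs it" concrete, here is a toy greedy allocator. It assumes equal-sized blocks and a per-block sensitivity score that you supply yourself; it is not BitsMoE's spectral method, and `allocate_bits` and its parameters are hypothetical names for this sketch.

```python
def allocate_bits(sensitivity, avg_bits, low=2, high=4):
    """Greedy mixed-precision allocation under an average-bit budget.

    Every block starts at `low` bits; the most sensitive blocks are upgraded
    to `high` bits for as long as the total stays within the budget.
    Assumption: all blocks hold the same number of weights.
    """
    if avg_bits < low:
        raise ValueError("budget is below the minimum precision")
    n = len(sensitivity)
    budget = avg_bits * n  # total bits available, in per-weight units
    bits = [low] * n
    used = low * n
    # Visit blocks from most to least sensitive.
    for i in sorted(range(n), key=lambda i: sensitivity[i], reverse=True):
        if used + (high - low) > budget:
            break
        bits[i] = high
        used += high - low
    return bits


# Four blocks with a 2.5-bit average budget: only the most sensitive gets 4 bits.
print(allocate_bits([0.9, 0.1, 0.5, 0.2], avg_bits=2.5))  # [4, 2, 2, 2]
```

The hard part in practice is the sensitivity score itself, which is exactly where methods like BitsMoE earn their keep.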