Model Releases

  • nvidia/Qwen3.6-35B-A3B-NVFP4 — NVIDIA’s FP4-quantized MoE variant of Qwen3.6-35B-A3B, produced with ModelOpt. Drops weight precision to FP4 for cheaper inference while keeping the MoE architecture intact. If you’re deploying MoE models at scale and want to cut memory use, this is worth a look.
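
To get a feel for the memory side, here is a rough back-of-the-envelope sketch. It assumes "35B" is the total parameter count, that every weight is quantized, and that the 4-bit format stores one 8-bit scale per block of 16 weights (about 4.5 bits per weight). Those are illustrative assumptions, not numbers from the model card, and real checkpoints often keep some layers at higher precision.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: "35B" is read as the total parameter count across all experts.
TOTAL_PARAMS = 35e9

# 16-bit baseline (for example BF16 weights).
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)

# Assumption: 4-bit values plus one 8-bit scale shared by 16 weights,
# which works out to 4 + 8/16 = 4.5 bits per weight.
fp4_gb = weight_memory_gb(TOTAL_PARAMS, 4 + 8 / 16)

print(f"16-bit weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:  {fp4_gb:.1f} GB")
print(f"ratio:          {bf16_gb / fp4_gb:.2f}x smaller")
```

Even though only a fraction of an MoE model's parameters are active per token, all experts still have to sit in memory, which is why weight precision matters so much for this class of model.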

Open Source Releases

  • dsalt 0.3.88 — Dynamic Sparse Attention with Landmark Tokens, now with a high-performance Triton backend. Targets the attention bottleneck in long-context transformers and aims to avoid the accuracy loss that sparse attention usually brings.

  • supermemory — A fast, scalable memory engine and API for AI apps. Gives your agents a persistent store for contextual info instead of relying on ever-growing context windows. 🧠

  • jbfoundry 0.2.0 — Framework for systematic LLM jailbreak testing. Provides a structured pipeline for probing security boundaries — useful if you’re on the red-team side or just want to know where your model bends.

  • python-code-quality 0.2.1 — Aggregates outputs from 11 Python code quality tools and feeds them into an LLM with minimal token usage. A pragmatic approach to AI-assisted code review that doesn’t burn through your context budget.

  • opik-mcp 0.2.0 — MCP server for Opik (Comet’s LLM observability platform). Lets you plug LLM monitoring and tracing into any MCP-compatible workflow. 🛠️

  • llm-inspect 0.1.8 — Zero-config LLM API inspector that intercepts calls and shows token breakdowns and cost estimates in a local dashboard. No setup, no SDK changes, just visibility into what your LLM calls are actually costing you.

Research Worth Reading

  • VeriGate: Verifier-Gated Step-Level Supervision for GRPO — Tackles the sparse reward problem in GRPO by injecting verifier-gated step-level supervision. When all sampled trajectories get the same outcome reward, learning stalls — this gives the model finer-grained signals to actually improve.

  • Physically Viable World Models: Query-Conditioned Embodied AI — Makes the case that world models for embodied AI should represent physical structure, not just predict pixels. Proposes a query-conditioned framework that produces physically correct intervention rollouts instead of visually plausible nonsense.

  • Harness Updating Is Not Harness Benefit: Self-Evolving LLM Agents — Disentangles whether self-evolving agent improvements come from the model itself or from edits to external harnesses (prompts, skills, tools). A useful framework for measuring actual evolution capability vs. prompt engineering theater.

  • MAVEN: Improving Generalization in Agentic Tool Calling — Focuses on cross-domain generalization for agentic tool calling — composing reasoning strategies, preserving intermediate states, and coordinating tools across domains. Addresses the gap between single-benchmark wins and reliable multi-domain agents.

  • Learning Agent-Compatible Context Management for Long-Horizon Tasks — Proposes learned context management for LLM agents on long-horizon tasks, replacing fixed summarization heuristics with adaptive context control. Targets the long-context degradation that kills reasoning in web search and deep research scenarios.

  • SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning — Trims structural redundancy (a.k.a. overthinking) in chain-of-thought reasoning from RL-trained models. Segment-level adaptive trimming cuts compute without hurting answer correctness — because not every reasoning step deserves its own paragraph. ✂️
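
The stall described in the VeriGate item is easy to see in a few lines. This is a minimal sketch of the group-normalized advantage as GRPO is commonly described (reward minus group mean, divided by group standard deviation). The function name and the `eps` constant are illustrative, and it does not implement VeriGate's step-level supervision.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: successes get positive advantage, failures negative.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Identical outcomes: every advantage is zero, so there is nothing to learn from.
uniform = group_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed:  ", mixed)
print("uniform:", uniform)
```

When every trajectory in a group fails (or every one succeeds), the outcome reward carries no gradient signal, which is the gap that finer-grained step-level rewards are meant to fill.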

AI Dev Tools

  • Babysitter — Agentic workforce orchestration framework with deterministic, hallucination-free self-orchestration. Manages complex multi-agent workflows with enforced obedience. Whether “hallucination-free orchestration” holds up in practice is another question, but the premise is interesting. 🤖

Today’s Synthesis

A few threads today converge on a practical question: how do you get more out of your models without just throwing more compute at them? SLAT tackles this directly for reasoning-heavy workloads by trimming overthinking segments in chain-of-thought: cut the fat, keep the correctness. Meanwhile, nvidia/Qwen3.6-35B-A3B-NVFP4 attacks the same efficiency problem from the deployment side, FP4-quantizing a MoE model so you can serve more traffic per GPU. Together they suggest a concrete workflow: use SLAT-style segment-level trimming to reduce the reasoning tokens your model generates, then deploy the quantized MoE variant to shrink the serving cost of whatever’s left. If you’re running agent loops or deep research pipelines, that’s two independent levers on the same cost curve: one in the model’s behavior, one in its weights. And if you actually want to know what those calls are costing you, llm-inspect gives you a zero-config dashboard for the token breakdowns without changing your code. Pair it all and you’ve got a measurable efficiency stack instead of vibes-based cost optimization.
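
As a rough illustration of how the two levers combine, here is a small sketch. The baseline spend and both reduction percentages are made-up placeholders, not measurements from SLAT or the NVFP4 checkpoint. The point is only that independent savings multiply rather than add.

```python
def stacked_cost(baseline_cost: float,
                 token_reduction: float,
                 price_reduction: float) -> float:
    """Cost after cutting generated tokens and per-token serving price.

    Both reductions are fractions in [0, 1] and are assumed independent.
    """
    return baseline_cost * (1 - token_reduction) * (1 - price_reduction)


# Hypothetical numbers, for illustration only.
BASELINE_USD = 100.0      # current spend for some workload
TOKEN_REDUCTION = 0.30    # fewer reasoning tokens after trimming
PRICE_REDUCTION = 0.40    # cheaper serving per token after quantization

combined = stacked_cost(BASELINE_USD, TOKEN_REDUCTION, PRICE_REDUCTION)
naive = BASELINE_USD * (1 - (TOKEN_REDUCTION + PRICE_REDUCTION))

print(f"combined cost:          ${combined:.2f}")
print(f"naive additive guess:   ${naive:.2f}")
print(f"effective total saving: {1 - combined / BASELINE_USD:.0%}")
```

Because the savings compound, the second lever always acts on a smaller base than the first, so measuring both together (rather than summing two separately reported percentages) gives the honest number.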