Model Releases

  • Qwen/Qwen3.6-35B-A3B — 35B MoE model for image-text-to-text and chat, with weights in safetensors format under Apache 2.0 and Azure deployment available. Useful if you need a large decoder that fits a datacenter budget; questionable if you just need a small endpoint 🔥🤖

Open Source Releases

  • anthropic/claude-code: v2.1.116 — Performance bumps for big sessions, faster resume, and smoother VS Code fullscreen scrolling. Worth a look if you live in Claude Code, but still a CLI wrapper with its own tax 🛠️
  • vllm-sr 0.3.0.dev20260421083012 — Semantic router for MoM setups to pick models based on input intent. Good for cutting inference costs if your workload is diverse; integration overhead may not be worth it for a single model 🛠️
  • agents-gateway 0.2.8 — FastAPI extension for API-first agent services with structured routing. Handy if you are wiring agents into an existing HTTP stack; otherwise another layer to debug 🛠️
  • ai-runtime-guard 2.2.2 — MCP security wrapper with policy tiers and audit controls. Relevant if you are deploying MCP hosts in mixed environments; adds latency so measure first 🛠️
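The routing idea behind vllm-sr can be sketched in a few lines. This is a minimal illustration of intent-based model selection, not vllm-sr's actual API; the marker list, tier names, and model mapping are all hypothetical placeholders.

```python
# Sketch of intent-based model routing (hypothetical names; vllm-sr's real
# API may differ). Classifies a prompt as "simple" or "complex" and maps
# each class to a model tier.

COMPLEX_MARKERS = ("prove", "derive", "refactor", "multi-step", "analyze")

MODEL_TIERS = {
    "simple": "small-cost-optimized-model",   # placeholder model id
    "complex": "Qwen/Qwen3.6-35B-A3B",        # large MoE from the release list
}

def classify_intent(prompt: str) -> str:
    """Cheap heuristic stand-in for a semantic intent classifier."""
    text = prompt.lower()
    if len(text.split()) > 200 or any(m in text for m in COMPLEX_MARKERS):
        return "complex"
    return "simple"

def route(prompt: str) -> str:
    """Return the model id that should serve this prompt."""
    return MODEL_TIERS[classify_intent(prompt)]

print(route("What is 2+2?"))                     # routes to the small tier
print(route("Derive the gradient of the loss"))  # routes to the large tier
```

In practice the keyword heuristic would be replaced by an embedding-based classifier, but the cost lever is the same: only escalate to the expensive model when the intent justifies it.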

Research Worth Reading

Today’s Synthesis

To make this actionable:

  • Deploy vllm-sr as a routing layer that classifies intent (simple vs. complex) and map each class to an SLO-driven policy.
  • Connect the router to agents-gateway in your FastAPI stack so authenticated requests go to either a cost-optimized small model or to Qwen3.6-35B-A3B when task complexity justifies the higher spend.
  • Validate routing heuristics with A/B tests on token counts and error rates; measure end-to-end latency and cost per request before enabling the security guardrail (ai-runtime-guard) to avoid adding pure overhead.
  • Feed routing and execution signals (chosen model, latency, token usage, outcome) into a lightweight rubric-based reward model that reinforces cheaper paths meeting quality thresholds.
  • Start with a small cohort, instrument token and cost savings, and iterate routing rules rather than chasing another MoE benchmark.
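The feedback loop above can be made concrete with a small sketch: log per-request signals and score them with a rubric-style reward that favors cheaper paths meeting a quality gate. The field names, thresholds, and point weights here are illustrative assumptions, not a prescribed scheme.

```python
# Sketch of the routing/execution feedback loop: capture per-request
# signals and score them with a simple rubric-style reward that favors
# cheaper, faster paths that still pass the quality check.
# All thresholds and weights are illustrative.

from dataclasses import dataclass

@dataclass
class RequestSignal:
    model: str         # chosen model id
    latency_s: float   # end-to-end latency in seconds
    tokens: int        # total token usage
    success: bool      # task outcome per your quality check

def rubric_reward(sig: RequestSignal,
                  latency_slo_s: float = 2.0,
                  token_budget: int = 2000) -> float:
    """Reward in [0, 1]: quality gate first, then cost/latency bonuses."""
    if not sig.success:
        return 0.0                 # failed outcomes earn nothing
    reward = 0.5                   # base credit for meeting quality
    if sig.latency_s <= latency_slo_s:
        reward += 0.25             # met the latency SLO
    if sig.tokens <= token_budget:
        reward += 0.25             # stayed within the token budget
    return reward

cheap = RequestSignal("small-model", latency_s=0.8, tokens=500, success=True)
costly = RequestSignal("Qwen/Qwen3.6-35B-A3B", latency_s=3.5, tokens=4000,
                       success=True)
print(rubric_reward(cheap))    # 1.0
print(rubric_reward(costly))   # 0.5
```

A reward shaped like this makes the A/B comparison direct: if the cheap path scores as well as the expensive one on the same cohort, the routing rule pays for itself.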