Tenkai Daily — April 21, 2026
Model Releases
- Qwen/Qwen3.6-35B-A3B — 35B-parameter MoE model for image-text-to-text and chat, shipped as safetensors weights under Apache 2.0 with an Azure deployment option. Useful if you need a large multimodal decoder within a datacenter budget; questionable if you just need a small endpoint 🔥🤖
Open Source Releases
- anthropic/claude-code: v2.1.116 — Performance bumps for big sessions, faster resume, and smoother VS Code fullscreen scrolling. Worth a look if you live in Claude Code, but still a CLI wrapper with its own tax 🛠️
- vllm-sr 0.3.0.dev20260421083012 — Semantic router for mixture-of-models (MoM) setups that picks a model based on input intent. Good for cutting inference costs if your workload is diverse; the integration overhead may not be worth it for a single model 🛠️
- agents-gateway 0.2.8 — FastAPI extension for API-first agent services with structured routing. Handy if you are wiring agents into an existing HTTP stack; otherwise another layer to debug 🛠️
- ai-runtime-guard 2.2.2 — MCP security wrapper with policy tiers and audit controls. Relevant if you are deploying MCP hosts in mixed environments; adds latency so measure first 🛠️
Research Worth Reading
- Bilevel Optimization of Agent Skills via Monte Carlo Tree Search — Uses MCTS-based bilevel optimization to improve agent skill packages. Interesting if you like search-based tuning; heavy on compute and abstract for most teams 📄
- Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents — Proposes a rubric reward model to shape SWE agent behavior beyond terminal feedback. Could help reduce reward hacking, but rubric construction remains an art 📄
Today’s Synthesis
To make this actionable: deploy vllm-sr as a routing layer that classifies intent (simple vs. complex), and map each class to an SLO-driven policy. Connect the router to agents-gateway in your FastAPI stack so authenticated requests are directed either to a cost-optimized small model or to Qwen3.6-35B-A3B when task complexity justifies the higher spend. Validate the routing heuristics with A/B tests on token counts and error rates, and measure end-to-end latency and cost per request before enabling the security guardrail (ai-runtime-guard), so it is not pure overhead. Feed the routing and execution signals (chosen model, latency, token usage, outcome) into a lightweight reward model built along the lines of the rubric approach above, reinforcing cheaper paths that meet quality thresholds. Start with a small cohort, instrument token and cost savings, and iterate on routing rules rather than chasing another MoE benchmark. Two rough sketches of the router and the rubric scoring follow.
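If you want to prototype the routing idea before wiring up vllm-sr and agents-gateway for real, here is a minimal FastAPI sketch. The backend URLs, model names, and the length-based classifier are placeholder assumptions, not the actual APIs of either project; swap in the semantic router's intent classes and your own deployments.

```python
# Minimal intent-routing sketch (stand-in for vllm-sr + agents-gateway wiring).
# Backend URLs, model names, and the heuristic classifier are placeholders.
import time

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical OpenAI-compatible backends; replace with your own endpoints.
BACKENDS = {
    "simple": {"url": "http://small-model:8000/v1/chat/completions", "model": "small-8b"},
    "complex": {"url": "http://big-model:8000/v1/chat/completions", "model": "Qwen3.6-35B-A3B"},
}


class ChatRequest(BaseModel):
    prompt: str


def classify_intent(prompt: str) -> str:
    """Naive stand-in for semantic routing: long or code-heavy prompts go to the big model."""
    if len(prompt) > 2000 or "```" in prompt or "traceback" in prompt.lower():
        return "complex"
    return "simple"


@app.post("/chat")
async def chat(req: ChatRequest):
    route = classify_intent(req.prompt)
    backend = BACKENDS[route]
    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            backend["url"],
            json={
                "model": backend["model"],
                "messages": [{"role": "user", "content": req.prompt}],
            },
        )
    body = resp.json()
    latency = time.perf_counter() - start
    usage = body.get("usage", {})
    # Emit the routing/execution signals mentioned above; swap print for real telemetry.
    print({"route": route, "model": backend["model"], "latency_s": round(latency, 3),
           "total_tokens": usage.get("total_tokens")})
    return {"route": route, "reply": body["choices"][0]["message"]["content"]}
```

The instrumentation is the point: the same records you log here are what you compare in the A/B tests and later feed to the reward model.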
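And a toy version of the rubric-style reward over routing outcomes. The criteria, weights, and thresholds here are illustrative assumptions, not the paper's rubric; the design point is that quality gates the reward, and staying under the SLO, the budget, and the cheaper route each earn a small bonus.

```python
# Hypothetical rubric scorer for routing outcomes; criteria and weights are illustrative.
from dataclasses import dataclass


@dataclass
class RoutedOutcome:
    route: str            # "simple" or "complex"
    latency_s: float
    total_tokens: int
    quality_ok: bool      # e.g. passed evals / user did not retry
    cost_usd: float


def rubric_reward(o: RoutedOutcome,
                  latency_slo_s: float = 5.0,
                  cost_budget_usd: float = 0.02) -> float:
    """Quality dominates; then reward staying under SLO and budget, plus a small
    bonus for the cheaper route so the policy learns to prefer it when it suffices."""
    if not o.quality_ok:
        return -1.0        # failed the rubric: never reinforce this path
    reward = 1.0
    reward += 0.25 if o.latency_s <= latency_slo_s else -0.25
    reward += 0.25 if o.cost_usd <= cost_budget_usd else -0.25
    reward += 0.25 if o.route == "simple" else 0.0
    return reward


if __name__ == "__main__":
    print(rubric_reward(RoutedOutcome("simple", 1.8, 450, True, 0.004)))   # 1.75
    print(rubric_reward(RoutedOutcome("complex", 7.2, 3200, True, 0.05)))  # 0.5
```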