Tenkai Daily — May 11, 2026
Model Releases
- Qwen-Fixed-Chat-Templates — Corrected Jinja chat templates for Qwen3.5 and Qwen3.6, fixing tool-calling and thinking-mode formatting. If you’ve been fighting broken XML tags or mangled think blocks in LM Studio, llama.cpp, or MLX, this is your fix. 🤖
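For context: both tool schemas and thinking mode are rendered by that Jinja template, which is why a broken template mangles exactly those two things. Here's a minimal sketch of how to eyeball the rendered prompt with Transformers, assuming a Qwen3-style enable_thinking switch; the model id is a placeholder, not a confirmed checkpoint name.

```python
from transformers import AutoTokenizer

# Hypothetical model id for illustration; use the checkpoint you actually run.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-7B-Instruct")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Osaka?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=True,  # forwarded to the Jinja template; Qwen3-style switch, assumed here
)
print(prompt)  # inspect the tool and think markup before blaming the model
```

If the rendered prompt looks clean here but your local frontend still mangles it, the frontend's bundled copy of the template is the likelier culprit.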
Open Source Releases
- yobitsugi 0.1.7 — AI-powered SAST/SCA scanner that patches your codebase via local LLMs, callable as a slash command from Claude Code, Codex, Cursor, Gemini CLI, Aider, OpenCode, and Copilot CLI. Finally, a vulnerability scanner that doesn’t just scream “HIGH SEVERITY” at you and leave you holding the bag. 🛠️
- dsalt 0.2.22 — Dynamic sparse attention with landmark tokens, powered by a Triton implementation. If you’ve been hand-rolling custom CUDA kernels for attention and crying into your keyboard, this might save you; a rough PyTorch sketch of the landmark idea follows this list. 🛠️
- omlx — LLM inference server for Apple Silicon with continuous batching and SSD caching, managed from the macOS menu bar. For the “I want 70B running locally on my M-series Mac” crowd. 🤖
- torch-npu 2.9.1 — PyTorch bridge for NPU hardware. If you’re targeting Ascend or other non-GPU accelerators inside the PyTorch ecosystem, this keeps you from rewriting everything from scratch. 🛠️
- sigil-sdk-langchain 0.2.1 — LangChain callback handlers for the Sigil Python SDK. Part of a broader SDK wave covering LangChain, Anthropic, LlamaIndex, OpenAI Agents, Pydantic AI, and OpenAI wrappers. 📄
- llmakits 0.6.61 — Python toolkit for multi-model LLM integration with scheduling, fault tolerance, and load balancing. Building resilient multi-model infra and don’t want to duct-tape five libraries together yourself? Here you go. 🛠️
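On dsalt: I haven't traced its Triton kernels, so the following is not its API, just a plain-PyTorch sketch of the landmark-selection idea it builds on. Summarize each key block with a landmark, let every query pick its top-scoring blocks, and run dense attention only inside those blocks.

```python
import torch
import torch.nn.functional as F

def landmark_sparse_attention(q, k, v, block_size=64, top_blocks=4):
    """Toy, non-causal sketch of landmark-style sparse attention.

    Each block of keys is summarized by a landmark (here: the block mean).
    Queries score the landmarks, keep the top-scoring blocks, and attend
    densely inside them. A fused Triton kernel would index into the cache
    instead of materializing the gathered copies done here for clarity.
    q, k, v: (batch, heads, seq, dim), seq divisible by block_size.
    """
    B, H, S, D = k.shape
    assert S % block_size == 0
    n_blocks = S // block_size
    k_blocks = k.view(B, H, n_blocks, block_size, D)
    v_blocks = v.view(B, H, n_blocks, block_size, D)

    landmarks = k_blocks.mean(dim=3)                                    # (B, H, n, D)
    block_scores = torch.einsum("bhqd,bhnd->bhqn", q, landmarks)        # (B, H, Q, n)
    top = block_scores.topk(min(top_blocks, n_blocks), dim=-1).indices  # (B, H, Q, t)

    # Gather the selected blocks' keys/values for every query position.
    idx = top[..., None, None].expand(-1, -1, -1, -1, block_size, D)
    k_exp = k_blocks.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1, -1)
    v_exp = v_blocks.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1, -1)
    k_sel = k_exp.gather(3, idx).flatten(3, 4)                          # (B, H, Q, t*block, D)
    v_sel = v_exp.gather(3, idx).flatten(3, 4)

    logits = torch.einsum("bhqd,bhqkd->bhqk", q, k_sel) / D ** 0.5
    return torch.einsum("bhqk,bhqkd->bhqd", F.softmax(logits, dim=-1), v_sel)

# Quick shape check.
q = torch.randn(1, 2, 256, 32)
k = torch.randn(1, 2, 256, 32)
v = torch.randn(1, 2, 256, 32)
print(landmark_sparse_attention(q, k, v).shape)  # torch.Size([1, 2, 256, 32])
```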
Research Worth Reading
- RateQuant — Applies rate-distortion theory to KV cache quantization, assigning optimal mixed-precision bit-widths per attention head instead of a blunt uniform approach. The KV cache memory wall is arguably the biggest serving bottleneck right now, so squeezing it intelligently matters. 🔥
- LKV — End-to-end learned head-wise budgets and token selection for KV cache eviction. Ditches the heuristic priors and optimizes directly for downstream task objectives. Neat idea, though the “end-to-end” training cost will make some teams flinch. 📄
- Toeplitz MLP Mixers — Replaces attention with triangular-masked Toeplitz matrix multiplication, pulling sequence modeling down to O(dn log n) from quadratic. A genuinely interesting alternative architecture if you’re tired of the attention tax; a minimal FFT sketch of the trick follows this list. 📄
- GraphDC — A divide-and-conquer multi-agent framework for LLM-based graph algorithm reasoning. Decomposes gnarly topology problems into sub-tasks handled by specialized agents. Graph reasoning remains one of LLMs’ weak spots, so any structured decomposition approach is worth eyeballing. 📄
- More Thinking, More Bias — Shows that longer reasoning trajectories in models like DeepSeek-R1 actually amplify per-question position bias rather than dampening it. Kinda deflates the “just add more chain-of-thought” reflex. Worth reading before you naively bump max_think_tokens. 🔥
- CASCADE — Case-based continual adaptation for LLMs during deployment, breaking the training-deployment boundary. The idea of models learning from live interactions post-deployment without full retraining is compelling, though the catastrophic forgetting gremlins are far from solved. 📄
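On the Toeplitz mixer: the core trick is that multiplying by a lower-triangular Toeplitz matrix is a causal convolution, which an FFT computes in O(n log n) per channel instead of O(n^2). Here's a minimal, self-contained sketch of that trick in general, not the paper's exact mixer block.

```python
import torch

def causal_toeplitz_matmul(t, x):
    """y[i] = sum_{j <= i} t[i - j] * x[j], i.e. multiplication by a
    lower-triangular Toeplitz matrix, done as an FFT convolution.

    t: (..., n) first column of the Toeplitz matrix (one per channel)
    x: (..., n) input sequence
    Cost is O(n log n) per channel instead of the O(n^2) dense matmul.
    """
    n = x.shape[-1]
    # Zero-pad to 2n so the circular convolution from the FFT equals the
    # linear (causal) convolution on the first n outputs.
    T = torch.fft.rfft(t, n=2 * n)
    X = torch.fft.rfft(x, n=2 * n)
    return torch.fft.irfft(T * X, n=2 * n)[..., :n]

# Sanity check against the explicit lower-triangular Toeplitz matrix.
n, d = 128, 4
t = torch.randn(d, n)
x = torch.randn(d, n)
i, j = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
dense = torch.where(i >= j, t[:, (i - j).clamp(min=0)], torch.zeros(()))  # (d, n, n)
y_dense = torch.einsum("dij,dj->di", dense, x)
y_fft = causal_toeplitz_matmul(t, x)
print(torch.allclose(y_dense, y_fft, atol=1e-3))  # True
```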
Today’s Synthesis
If you’re running models with long chain-of-thought, especially reasoning-heavy ones like DeepSeek-R1, you’re getting hit from two directions at once. More Thinking, More Bias shows that longer reasoning trajectories actually amplify position bias rather than dampening it, meaning blindly cranking max_think_tokens is a losing strategy. And every extra token in that trajectory is another entry in your KV cache, already the biggest serving bottleneck in production LLMs today. Two papers point at concrete relief: RateQuant assigns mixed-precision bit-widths per attention head instead of a blunt uniform cut, and LKV learns head-wise eviction budgets optimized directly for your downstream task. The practical move: profile your long-reasoning workloads, find which attention heads are the memory hogs, and apply mixed-precision quantization or learned eviction there before you throw more GPUs at the problem.
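As a rough illustration of what that intervention looks like mechanically, here is a toy per-head bit allocation over a dummy KV cache. The allocation rule (variance-ranked heads get more bits under a fixed average budget) is a placeholder of my own, not RateQuant's rate-distortion solution, and all names are illustrative; nothing here is wired into a real serving stack.

```python
import torch

def fake_quantize(x, bits):
    """Symmetric per-tensor fake quantization to `bits` bits (values stay fp)."""
    if bits >= 16:
        return x
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax, qmax) * scale

def per_head_bit_allocation(k_cache, v_cache, budget_bits):
    """k_cache, v_cache: (heads, seq, dim). Returns a bit-width per head.

    Crude stand-in for a RateQuant-style allocator: heads whose keys and
    values vary more get extra bits, paid for by the least varying heads,
    so the average bit-width stays at budget_bits.
    """
    sensitivity = k_cache.float().var(dim=(1, 2)) + v_cache.float().var(dim=(1, 2))
    order = sensitivity.argsort(descending=True)
    n_heads = k_cache.shape[0]
    bits = torch.full((n_heads,), budget_bits, dtype=torch.long)
    quarter = n_heads // 4
    bits[order[:quarter]] += 2   # most sensitive quarter gets 2 extra bits
    bits[order[-quarter:]] -= 2  # least sensitive quarter gives them back
    return bits

# Usage on a dummy cache: quantize each head at its assigned precision.
k_cache = torch.randn(32, 4096, 128)
v_cache = torch.randn(32, 4096, 128)
bits = per_head_bit_allocation(k_cache, v_cache, budget_bits=4)
k_q = torch.stack([fake_quantize(k_cache[h], int(b)) for h, b in enumerate(bits)])
v_q = torch.stack([fake_quantize(v_cache[h], int(b)) for h, b in enumerate(bits)])
```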