Model Releases

JetBrains/Mellum2-12B-A2.5B-Thinking — A Mixture-of-Experts model (12B total, 2.5B active) with explicit thinking/reasoning capabilities, released under Apache 2.0. The MoE architecture keeps inference costs down, while the reasoning traces make it relevant for code generation and other tasks where chain-of-thought actually matters.

Open Source Releases

chopratejas/headroom — Library, proxy, and MCP server that compresses tool outputs, logs, files, and RAG chunks before they hit the LLM — 60–95% fewer tokens with answer quality intact. If your agentic pipelines are drowning in context bloat, this is worth a look. 🛠️
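
To make the idea concrete, here is a minimal sketch of one way tool output can be shrunk before it reaches the model: collapse repeated lines and cap the total. This is not headroom's algorithm or API; `compress_log` and its `max_lines` parameter are hypothetical names chosen for illustration.

```python
def compress_log(text: str, max_lines: int = 50) -> str:
    """Collapse repeated lines into 'line (xN)' and cap the number of distinct lines.

    Illustrative only: duplicates are merged even when they are not adjacent,
    so the original interleaving of log lines is lost.
    """
    counts: dict[str, int] = {}  # dicts keep insertion order (Python 3.7+)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information for the model
        counts[line] = counts.get(line, 0) + 1

    out = [f"{line} (x{n})" if n > 1 else line for line, n in counts.items()]
    if len(out) > max_lines:
        dropped = len(out) - max_lines
        out = out[:max_lines] + [f"... {dropped} more distinct lines omitted"]
    return "\n".join(out)


noisy = "retrying connection\nretrying connection\nretrying connection\nconnected"
print(compress_log(noisy))
```

Real compressors have to be far more careful about what they drop; the point is only that a lot of tool output is redundancy the model never needed to see.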

repo-rag 0.1.6 — Local RAG indexer and MCP server built for AI coding agents (Claude, Cursor, Factory Droid, etc.). Standardized MCP interface for codebase-aware retrieval, so your assistant actually knows what’s in the repo instead of guessing.

prune-sdk 0.1.0 — Drop-in Anthropic/OpenAI client proxy that claims 40–70% LLM API cost reduction with a one-line swap. Targets engineers who want to cut inference spend without rewriting client code. Mileage may vary, but the pitch is clean. 🤷

egent-code-plexus 0.6.4 — Symbol-level code intelligence graph for AI agents and LLMs. Structured code understanding at the symbol level means more precise navigation and comprehension for LLM-powered dev tools, not just fuzzy full-file retrieval.

Open-LLM-VTuber/Open-LLM-VTuber — Open-source, hands-free voice interaction with any LLM, featuring voice interruption and Live2D avatar support, running locally. For developers building voice-driven AI interfaces with a visual avatar layer.

celestine — MCP server for astrological calculations: birth charts, transits, progressions, calendar generation. 2,726 installs. Not everything has to be about transformers. 🌟

Research Worth Reading

📄 Spectral Asymptotics of Neural Network Loss Landscapes — Proves the Spectral Alignment Decomposition, an exact result explaining why the curvature exponent α varies systematically across layer types (α≈2 for convolutions, ≈1 for transformer attention, <1 for MLP up-projections). A principled mathematical framework for understanding loss landscape geometry — rare to see this level of rigor.

📄 Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs — Hallucination signals are linearly separable from mid-layer hidden states across three 7B–8B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) even in 4-bit NF4 quantization. Pinpoints which network depth carries the strongest truthfulness signal — useful for building better hallucination detectors.
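
For readers new to the term, "linearly decodable" means a plain linear classifier (a probe) can separate the two classes from the hidden-state vectors alone. The sketch below trains such a probe on synthetic vectors, not on real model activations, and it is not the paper's setup. The helper names, the 8-dimensional vectors, and the training settings are all assumptions made for illustration.

```python
import math
import random


def make_synthetic_states(n, dim, rng):
    """Stand-in for hidden states: label 1 is centred at +1 per dimension, label 0 at -1."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 1.0 if label == 1 else -1.0
        data.append(([rng.gauss(centre, 1.0) for _ in range(dim)], label))
    return data


def train_probe(data, dim, lr=0.1, epochs=50):
    """Logistic regression fitted with plain stochastic gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(w, b, data):
    """Fraction of examples on the correct side of the learned hyperplane."""
    correct = 0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(data)


rng = random.Random(0)
train = make_synthetic_states(200, 8, rng)
held_out = make_synthetic_states(100, 8, rng)
w, b = train_probe(train, 8)
acc = accuracy(w, b, held_out)
print(f"held-out probe accuracy: {acc:.2f}")
```

In a real experiment the vectors would be activations extracted from a chosen layer and the labels would come from an annotated hallucination dataset; repeating the probe layer by layer is how one finds the depth with the strongest signal.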

📄 Visual Graph Scaffolds for Structural Reasoning in LLMs — Proposes using graphs not just as external knowledge sources but as structural scaffolds that organize the reasoning process itself, inspired by human graph-structured mental models. Reframes graph-augmented LLM reasoning from retrieval to reasoning organization.

📄 AURA: Action-Gated Memory for Robot Policies at Constant VRAM — KV-cache design is optimized for batched datacenter inference, not embodied robot agents running long, non-resetting episodes on memory-constrained edge hardware. Action-gated memory maintains constant VRAM usage — a practical contribution for deploying LLMs on robots.

📄 Toward a Modular Architecture for Embedded AI Agent Systems at the Edge — Proposes a modular architecture for LLM-based agentic AI on embedded microcontrollers with strict memory and energy constraints. Existing frameworks assume server-class resources; this tackles the actual challenge of bringing agentic reasoning and tool use to pervasive computing environments.

📄 When Helping Hurts: Multi-Agent Debate for Data Cleaning — Systematically evaluates multi-agent debate for data cleaning across three benchmarks and four model families, identifying “critique-induced confusion” (CIC) where hallucinated critic feedback degrades generator output. Actionable findings on when multi-agent debate helps versus when it actively makes things worse. 🔥

AI Dev Tools

Claude Code v2.1.161 — Adds OTEL_RESOURCE_ATTRIBUTES as labels on metric datapoints so you can slice usage metrics by team, repo, or whatever custom dimensions you need. Also improves agent fan-out visibility with done/total progress and a peek at the longest-running item. Small but practical quality-of-life upgrades.
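
OTEL_RESOURCE_ATTRIBUTES is a standard OpenTelemetry environment variable that takes comma-separated key=value pairs. Below is a small sketch that builds such a value; the attribute keys (`team.name`, `repo.name`) and the helper name are hypothetical, and the exact labels Claude Code emits should be checked against its own documentation.

```python
from urllib.parse import quote


def format_resource_attributes(attrs: dict) -> str:
    """Join attributes as key=value pairs, percent-encoding each value."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in attrs.items()
    )


# Hypothetical dimensions to slice usage metrics by.
value = format_resource_attributes({"team.name": "platform", "repo.name": "billing-api"})

# Emit a line that could be pasted into the shell that launches the tool.
print(f'export OTEL_RESOURCE_ATTRIBUTES="{value}"')
```

Percent-encoding matters as soon as a value contains a space or a comma, since both would otherwise break the pair parsing.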

Today’s Synthesis

Three threads today converge on the same problem: LLMs are terrible at managing their own context and compute budget, and the fixes are increasingly modular. chopratejas/headroom attacks context bloat at the proxy layer — compressing tool outputs and RAG chunks before they hit the model. prune-sdk does something similar but targets the API bill directly, acting as a drop-in client proxy. And repo-rag takes a different angle: instead of sending your entire repo to the model, it indexes locally and serves only what’s relevant via MCP. If you’re building agentic workflows right now, the playbook is becoming clear: compress at the proxy, index at the source, and never let raw context reach the model uncompressed. None of these tools require changing your LLM provider or rewriting your agent logic — they’re infrastructure-layer interventions that compound. Stack headroom + repo-rag and you’ve cut both the input tokens and the retrieval noise in one move. 🛠️
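
A toy version of the "index at the source, send only what's relevant" half of that playbook: rank local chunks by word overlap with the query and keep the best ones that fit a budget. This is a sketch under loud assumptions. It is not how repo-rag or headroom work internally, `select_chunks` is a hypothetical name, and whitespace word counts stand in for real token counts.

```python
import re


def tokenize(text):
    """Lowercased identifier-like words, as a set."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def select_chunks(query, chunks, budget_words):
    """Keep the highest-overlap chunks whose combined word count fits the budget."""
    q = tokenize(query)
    scored = sorted(
        ((len(q & tokenize(c)), i, c) for i, c in enumerate(chunks)),
        key=lambda t: (-t[0], t[1]),  # best score first, original order breaks ties
    )
    picked, used = [], 0
    for score, _, chunk in scored:
        cost = len(chunk.split())  # crude proxy for token count
        if score == 0 or used + cost > budget_words:
            continue  # irrelevant, or would blow the budget
        picked.append(chunk)
        used += cost
    return picked


repo_chunks = [
    "def parse_config(path): load yaml settings",
    "class HttpServer: handles requests",
    "parse_config is called by main on startup",
]
for chunk in select_chunks("where is parse_config called", repo_chunks, budget_words=8):
    print(chunk)
```

Swapping the overlap score for embeddings and the word count for a tokenizer gets closer to a production retriever, but the shape of the loop (score, rank, fill a fixed budget) stays the same.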