Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M-parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We have long been frustrated by how little effort goes into building agentic models that run on budget phones, so we investigated and came away with one observation: agentic experiences are built on tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is t
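To make the "retrieval-and-assembly" framing concrete, here is a toy sketch of the three steps named in the post (match query to tool name, extract argument values, emit JSON). It is not how Needle works internally: the tool list, the `patterns` field, and the word-overlap and regex heuristics are all hypothetical stand-ins for what a learned model would do.

```python
import json
import re

# Hypothetical tool definitions, invented for illustration only.
# "patterns" maps each argument name to a regex with one capture group.
TOOLS = [
    {
        "name": "set_alarm",
        "description": "set an alarm for a time",
        "patterns": {"time": r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm))\b"},
    },
    {
        "name": "send_message",
        "description": "send a text message to a contact",
        "patterns": {"recipient": r"\bto (\w+)", "body": r"saying (.+)$"},
    },
]


def match_tool(query, tools):
    """Retrieval: pick the tool whose description shares the most words with the query."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return max(tools, key=lambda t: len(words & set(t["description"].split())))


def extract_arguments(query, tool):
    """Assembly: copy argument values out of the query with the tool's regexes."""
    args = {}
    for name, pattern in tool["patterns"].items():
        found = re.search(pattern, query, flags=re.IGNORECASE)
        if found:
            args[name] = found.group(1)
    return args


def call_tool(query, tools=TOOLS):
    """Run both steps and emit the tool call as a JSON string."""
    tool = match_tool(query, tools)
    return json.dumps({"name": tool["name"], "arguments": extract_arguments(query, tool)})


if __name__ == "__main__":
    print(call_tool("Set an alarm for 7:30 am"))
    print(call_tool("Send a message to Ada saying running late"))
```

Nothing in this sketch reasons about the request; it only selects a tool and copies spans of the query into a fixed output shape, which is the property the post argues makes a very small model sufficient.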
---
🔗 **[Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model](https://github.com/cactus-compute/needle)**
📊 241 votes · Submitted by: HenryNdubuaku
🏷️ Source: Hacker News
⏱️ 2026-05-13 08:00