**Abstract**
Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that these constraints are overly restrictive: redundancy in pretrained transformers is not confined to contiguous regions,
👤 Authors: Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca

---
🔗 **[From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression](https://arxiv.org/abs/2606.02559v1)**

🏷️ Source: arXiv cs.AI
⏱️ 2026-06-02 14:01