- A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, arXiv, 2411.03350, arxiv, pdf, citation: -1 · Fali Wang, Zhiwei Zhang, Xianren Zhang, ..., Ming Huang, Suhang Wang · (mp.weixin.qq)
- A Survey of Small Language Models, arXiv, 2410.20011, arxiv, pdf, citation: -1 · Chien Van Nguyen, Xuan Shen, Ryan Aponte, ..., Ryan A. Rossi, Thien Huu Nguyen
- Knowledge Composition using Task Vectors with Learned Anisotropic Scaling, arXiv, 2407.02880, arxiv, pdf, citation: -1 · Frederic Z. Zhang, Paul Albert, Cristian Rodriguez-Opazo, ..., Anton van den Hengel, Ehsan Abbasnejad · (atlas - fredzzhang)
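  The task-vector idea above reduces to a compact recipe: each task vector τᵢ = θᵢ − θ_base is scaled and summed onto the base checkpoint. A minimal PyTorch sketch, assuming learned per-parameter-block coefficients as a stand-in for the paper's anisotropic scaling (all names and shapes here are illustrative, not the atlas implementation):

  ```python
  import torch

  def compose_task_vectors(base_state, finetuned_states, coeffs):
      """Merge task vectors tau_i = theta_i - theta_base into one model.

      coeffs[i][name] is a learned scalar per parameter block `name`,
      i.e. anisotropic (per-block) rather than one uniform scale.
      """
      merged = {k: v.clone() for k, v in base_state.items()}
      for state, c in zip(finetuned_states, coeffs):
          for name, p in state.items():
              merged[name] += c[name] * (p - base_state[name])
      return merged

  # Toy usage: two "tasks" on a tiny two-parameter model.
  base = {"w": torch.zeros(4, 4), "b": torch.zeros(4)}
  task_a = {"w": torch.randn(4, 4), "b": torch.randn(4)}
  task_b = {"w": torch.randn(4, 4), "b": torch.randn(4)}
  coeffs = [{"w": torch.tensor(0.6), "b": torch.tensor(0.3)},
            {"w": torch.tensor(0.4), "b": torch.tensor(0.7)}]
  merged = compose_task_vectors(base, [task_a, task_b], coeffs)
  ```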
- Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study, arXiv, 2411.02462, arxiv, pdf, citation: -1 · André Storhaug, Jingyue Li · (peft-unit-test-generation-replication-package - andstor)
- LoRA vs Full Fine-tuning: An Illusion of Equivalence, arXiv, 2410.21228, arxiv, pdf, citation: -1 · Reece Shuttleworth, Jacob Andreas, Antonio Torralba, ..., Pratyusha Sharma · (𝕏)
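  For reference, the update the LoRA-vs-full-fine-tuning comparison is about: the base weight stays frozen and only a rank-r factorization is trained, so the effective update B @ A can never exceed rank r, which is one structural reason the two methods can land on different solutions. A minimal sketch (hyperparameters illustrative):

  ```python
  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      """Frozen base weight W plus a trainable rank-r update (alpha/r) * B @ A."""
      def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
          super().__init__()
          self.base = base
          for p in self.base.parameters():
              p.requires_grad_(False)          # only the adapter is trained
          self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
          self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at 0
          self.scale = alpha / r

      def forward(self, x):
          return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

  layer = LoRALinear(nn.Linear(64, 64))
  delta = layer.scale * layer.B @ layer.A       # the effective weight update
  print(torch.linalg.svdvals(delta)[:8])        # rank <= r by construction
  ```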
- PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs, arXiv, 2410.05265, arxiv, pdf, citation: -1 · Mengzhao Chen, Yi Liu, Jiahao Wang, ..., Wenqi Shao, Ping Luo · (PrefixQuant - ChenMnZ) · (arxiv)
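  The static-vs-dynamic distinction here is about when the activation scale is computed: dynamic quantization derives a scale per token at runtime, while static quantization fixes one scale offline, which is faster but fragile under outlier tokens (the outliers PrefixQuant isolates into a fixed prefix). A toy illustration of that trade-off, not the paper's method:

  ```python
  import torch

  def quantize(x, scale):
      """Symmetric int8 fake-quantization with a given scale."""
      q = torch.clamp(torch.round(x / scale), -128, 127)
      return q * scale  # dequantized view, for measuring error

  def dynamic_per_token(x):
      # scale recomputed at runtime for every token (row)
      scale = x.abs().amax(dim=-1, keepdim=True) / 127
      return quantize(x, scale)

  def static_per_tensor(x, calib):
      # one scale precomputed offline from calibration data
      scale = calib.abs().amax() / 127
      return quantize(x, scale)

  x = torch.randn(4, 16)
  x[0, 0] = 40.0  # one outlier token blows up the shared static scale
  print((x - static_per_tensor(x, x)).abs().mean())  # large error
  print((x - dynamic_per_token(x)).abs().mean())     # small error
  ```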
- 🌟 Scaling Laws for Precision, arXiv, 2411.04330, arxiv, pdf, citation: -1 · Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, ..., Christopher Ré, Aditi Raghunathan · (𝕏) · (𝕏)
- 🌟 BitNet a4.8: 4-bit Activations for 1-bit LLMs, arXiv, 2411.04965, arxiv, pdf, citation: -1 · Hongyu Wang, Shuming Ma, Furu Wei
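  A rough sketch of the quantizers in play: BitNet-style ternary weights via an absmean scale (as in BitNet b1.58) paired with symmetric 4-bit activations. The paper's actual a4.8 recipe (hybrid quantization and sparsification of outlier activations) is more involved than this toy:

  ```python
  import torch

  def ternary_weights(w, eps=1e-5):
      """Absmean quantizer: scale by mean |w|, round to {-1, 0, +1}."""
      scale = w.abs().mean().clamp(min=eps)
      return torch.clamp(torch.round(w / scale), -1, 1), scale

  def int4_activations(x, eps=1e-5):
      """Symmetric per-tensor 4-bit quantization to the range [-8, 7]."""
      scale = (x.abs().amax() / 7).clamp(min=eps)
      return torch.clamp(torch.round(x / scale), -8, 7), scale

  w, x = torch.randn(64, 64), torch.randn(8, 64)
  wq, sw = ternary_weights(w)
  xq, sx = int4_activations(x)
  y = (xq @ wq.T) * (sx * sw)   # low-bit matmul, rescaled afterwards
  ```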
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, arXiv, 2411.02355, arxiv, pdf, citation: -1 · Eldar Kurtic, Alexandre Marques, Shubhra Pandit, ..., Mark Kurtz, Dan Alistarh
- QTIP: Quantization with Trellises and Incoherence Processing, arXiv, 2406.11235, arxiv, pdf, citation: 1 · Albert Tseng, Qingyao Sun, David Hou, ..., Christopher De Sa · (qtip - Cornell-RelaxML) · (x) · (t)
- Stronger Models are NOT Stronger Teachers for Instruction Tuning, arXiv, 2411.07133, arxiv, pdf, citation: -1 · Zhangchen Xu, Fengqing Jiang, Luyao Niu, ..., Bill Yuchen Lin, Radha Poovendran
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling, arXiv, 2410.11325, arxiv, pdf, citation: -1 · Wenda Xu, Rujun Han, Zifeng Wang, ..., Chen-Yu Lee, Tomas Pfister
- The Super Weight in Large Language Models, arXiv, 2411.07191, arxiv, pdf, citation: -1 · Mengxia Yu, De Wang, Qi Shan, ..., Colorado Reed, Alvin Wan
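  The "super weight" claim is that a handful of individual scalar weights matter disproportionately; the paper locates them via activation spikes in down-projections. The sketch below is a cruder proxy that just ranks the single largest-magnitude scalar per weight matrix:

  ```python
  import torch
  import torch.nn as nn

  def largest_scalar_weights(model: nn.Module, k: int = 5):
      """Return the k largest-magnitude individual weight entries,
      as (|value|, parameter name, flat index) triples."""
      hits = []
      for name, p in model.named_parameters():
          if p.dim() < 2:            # skip biases / norm vectors
              continue
          val, idx = p.detach().abs().flatten().max(dim=0)
          hits.append((val.item(), name, int(idx)))
      return sorted(hits, reverse=True)[:k]

  model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
  print(largest_scalar_weights(model))
  ```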
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity, arXiv, 2411.02335, arxiv, pdf, citation: -1 · Yuqi Luo, Chenyang Song, Xu Han, ..., Zhiyuan Liu, Maosong Sun
- What Matters in Transformers? Not All Attention is Needed, arXiv, 2406.15786, arxiv, pdf, citation: 1 · Shwai He, Guoheng Sun, Zheyu Shen, ..., Ang Li
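  The redundancy test in this line of work is simple: if an attention block's output barely differs from its input (high cosine similarity between the two hidden states), the block is a candidate for dropping. A sketch of the scoring step, assuming per-block inputs/outputs have been captured, e.g. with forward hooks:

  ```python
  import torch
  import torch.nn.functional as F

  def block_redundancy(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
      """Mean cosine similarity between a block's input and output hidden
      states; values near 1 suggest the block is nearly an identity map."""
      return F.cosine_similarity(
          h_in.flatten(1), h_out.flatten(1), dim=-1
      ).mean().item()

  # Toy check: a near-identity block scores close to 1.
  h = torch.randn(4, 16, 256)
  print(block_redundancy(h, h + 0.01 * torch.randn_like(h)))  # ~1.0
  print(block_redundancy(h, torch.randn_like(h)))             # ~0.0
  ```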
- SAM Decoding: Speculative Decoding via Suffix Automaton, arXiv, 2411.10666, arxiv, pdf, citation: -1 · Yuxuan Hu, Ke Wang, Jing Zhang, ..., Cuiping Li, Hong Chen · (SAM-Decoding - hyx1999)
- FastDraft: How to Train Your Draft, arXiv, 2411.11055, arxiv, pdf, citation: -1 · Ofir Zafrir, Igor Margulis, Dorin Shteyman, ..., Guy Boudoukh
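  Both entries above build on the same draft-and-verify loop: a cheap drafter proposes a few tokens, the target model checks them, and every accepted token saves a full target decoding step. A greedy toy version (real implementations verify all draft positions in one batched target forward pass, and sampling variants use rejection sampling instead of exact match):

  ```python
  import torch

  V = 100  # toy vocabulary size

  def target(tokens):              # "expensive" model: deterministic toy logits
      g = torch.Generator().manual_seed(hash(tuple(tokens)) % (2**31))
      return torch.randn(V, generator=g)

  def draft(tokens):               # "cheap" model: a noisy view of the target
      g = torch.Generator().manual_seed(1 + hash(tuple(tokens)) % (2**31))
      return target(tokens) + 0.3 * torch.randn(V, generator=g)

  def speculative_decode(prefix, k=4, rounds=8):
      tokens = list(prefix)
      for _ in range(rounds):
          proposal = []
          for _ in range(k):       # 1) draft k tokens autoregressively
              proposal.append(int(draft(tokens + proposal).argmax()))
          accepted = []
          for i in range(k):       # 2) verify; keep the agreeing prefix
              t = int(target(tokens + accepted).argmax())
              accepted.append(t)   # on mismatch this is the target's fix-up token
              if t != proposal[i]:
                  break
          tokens += accepted
      return tokens

  print(speculative_decode([1, 2, 3]))
  ```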
- SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration, arXiv, 2411.10958, arxiv, pdf, citation: -1 · Jintao Zhang, Haofeng Huang, Pengle Zhang, ..., Jun Zhu, Jianfei Chen · (SageAttention - thu-ml)
- distributed-llama - b4rtaz
- SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD GPUs
- OpenAI beats Anthropic and Fireworks to releasing Speculative Decoding
- Latency optimization: improve latency across a wide variety of LLM-related use cases.
- A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression, arXiv, 2406.11430, arxiv, pdf, citation: 5 · Alessio Devoto, Yu Zhao, Simone Scardapane, ..., Pasquale Minervini
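  The compression rule behind that title is nearly a one-liner: the authors observe that cached keys with a low $L_2$ norm tend to attract the most attention, so eviction keeps the lowest-norm entries. A sketch (keep ratio and shapes illustrative):

  ```python
  import torch

  def compress_kv(keys, values, keep_ratio=0.5):
      """Keep the cache entries whose key vectors have the *smallest* L2
      norm; the paper observes these tend to receive high attention."""
      T = keys.shape[0]
      k = max(1, int(T * keep_ratio))
      norms = keys.norm(dim=-1)                             # [T]
      keep = norms.topk(k, largest=False).indices.sort().values
      return keys[keep], values[keep]

  keys, values = torch.randn(128, 64), torch.randn(128, 64)
  ck, cv = compress_kv(keys, values, keep_ratio=0.25)
  print(ck.shape)  # torch.Size([32, 64])
  ```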
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration, arXiv, 2410.02367, arxiv, pdf, citation: -1 · Jintao Zhang, Jia Wei, Pengle Zhang, ..., Jun Zhu, Jianfei Chen
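  The SageAttention family quantizes the QKᵀ matmul to INT8 (with smoothing tricks for K) while keeping the PV side in higher precision. A portable toy of the idea; the rounded values are held in float here, since the real speedup needs int8 tensor-core kernels with int32 accumulation:

  ```python
  import torch

  def int8_qk_attention(q, k, v):
      """Per-tensor symmetric int8 quantization of Q and K before QK^T;
      softmax and P @ V stay in full precision."""
      sq, sk = q.abs().amax() / 127, k.abs().amax() / 127
      qi = torch.clamp(torch.round(q / sq), -128, 127)
      ki = torch.clamp(torch.round(k / sk), -128, 127)
      scores = (qi @ ki.T) * (sq * sk) / q.shape[-1] ** 0.5
      return torch.softmax(scores, dim=-1) @ v

  q, k, v = (torch.randn(16, 64) for _ in range(3))
  out = int8_qk_attention(q, k, v)
  ```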
- Fast Best-of-N Decoding via Speculative Rejection, arXiv, 2410.20290, arxiv, pdf, citation: -1 · Hanshi Sun, Momin Haider, Ruiqi Zhang, ..., Peter Bartlett, Andrea Zanette
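  Speculative rejection turns Best-of-N from "generate N full completions, then score" into "score partial completions as you go and stop the losers early". A toy of the pruning loop, with stub generation and reward functions standing in for the real models:

  ```python
  import random
  import statistics

  def speculative_rejection(step_fn, reward_fn, n=16, chunk=16, rounds=4):
      """Best-of-N with early pruning: after each chunk of tokens, drop the
      candidates whose partial reward is below the surviving batch's median."""
      beams = [[] for _ in range(n)]
      for _ in range(rounds):
          beams = [b + step_fn(b, chunk) for b in beams]
          scores = [reward_fn(b) for b in beams]
          cut = statistics.median(scores)
          beams = [b for b, s in zip(beams, scores) if s >= cut]
      return max(beams, key=reward_fn)

  # Hypothetical stubs: random tokens; reward = mean token id.
  step = lambda b, c: [random.randrange(100) for _ in range(c)]
  reward = lambda b: sum(b) / len(b) if b else 0.0
  print(len(speculative_rejection(step, reward, n=8)))  # 4 * 16 = 64 tokens
  ```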
- Optimizing and Characterizing High-Throughput Low-Latency LLM Inference in MLCEngine · (reddit)
- Battle of Inference Engines: Llama.cpp vs MLC LLM vs vLLM · (reddit)
- Universal Assisted Generation: Faster Decoding with Any Assistant Model 🤗
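  In recent 🤗 transformers releases this is exposed directly through generate: pass an assistant_model, and when the assistant uses a different tokenizer, pass both tokenizers so drafts can be re-encoded between vocabularies. A sketch; the checkpoint names are placeholders:

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  target_ckpt = "big-model-checkpoint"    # placeholder names
  draft_ckpt = "tiny-model-checkpoint"

  tokenizer = AutoTokenizer.from_pretrained(target_ckpt)
  assistant_tokenizer = AutoTokenizer.from_pretrained(draft_ckpt)
  model = AutoModelForCausalLM.from_pretrained(target_ckpt)
  assistant_model = AutoModelForCausalLM.from_pretrained(draft_ckpt)

  inputs = tokenizer("Alice and Bob", return_tensors="pt")
  outputs = model.generate(
      **inputs,
      assistant_model=assistant_model,
      tokenizer=tokenizer,                    # both tokenizers are needed
      assistant_tokenizer=assistant_tokenizer,  # when the vocabularies differ
  )
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```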
- Models continually pretrained using LayerSkip 🤗 · (arxiv)
- SlimLM: An Efficient Small Language Model for On-Device Document Assistance, arXiv, 2411.09944, arxiv, pdf, citation: -1 · Thang M. Pham, Phat T. Nguyen, Seunghyun Yoon, ..., Franck Dernoncourt, Trung Bui
- Hymba: A Hybrid-head Architecture for Small Language Models, arXiv, 2411.13676, arxiv, pdf, citation: -1 · Xin Dong, Yonggan Fu, Shizhe Diao, ..., Jan Kautz, Pavlo Molchanov
- MobileLLM is an auto-regressive language model leveraging an optimized transformer architecture 🤗 · (arxiv)
- 🌟 SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration, arXiv, 2411.10958, arxiv, pdf, citation: -1 · Jintao Zhang, Haofeng Huang, Pengle Zhang, ..., Jun Zhu, Jianfei Chen · (SageAttention - thu-ml)
- ThunderKittens: Simple, Fast, and Adorable AI Kernels, arXiv, 2410.20399, arxiv, pdf, citation: -1 · Benjamin F. Spector, Simran Arora, Aaryan Singhal, ..., Daniel Y. Fu, Christopher Ré
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs, arXiv, 2410.13276, arxiv, pdf, citation: -1 · Yizhao Gao, Zhichen Zeng, Dayou Du, ..., Fan Yang, Mao Yang
- MoH: Multi-Head Attention as Mixture-of-Head Attention, arXiv, 2410.11842, arxiv, pdf, citation: -1 · Peng Jin, Bo Zhu, Li Yuan, ..., Shuicheng Yan · (arxiv) · (MoH - SkyworkAI) · (huggingface)
- nano-sparse-attention - PiotrNawrot · (𝕏)
- sgl-learning-materials - sgl-project
- exo - exo-explore: run your own AI cluster at home with everyday devices.