- A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, arXiv, 2411.03350, arxiv, pdf, citation: -1 · Fali Wang, Zhiwei Zhang, Xianren Zhang, ..., Ming Huang, Suhang Wang · (mp.weixin.qq)
- A Survey of Small Language Models, arXiv, 2410.20011, arxiv, pdf, citation: -1 · Chien Van Nguyen, Xuan Shen, Ryan Aponte, ..., Ryan A. Rossi, Thien Huu Nguyen
- Knowledge Composition using Task Vectors with Learned Anisotropic Scaling, arXiv, 2407.02880, arxiv, pdf, citation: -1 · Frederic Z. Zhang, Paul Albert, Cristian Rodriguez-Opazo, ..., Anton van den Hengel, Ehsan Abbasnejad · (atlas - fredzzhang)
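  The task-vector idea above reduces to a compact recipe: each task vector τᵢ = θᵢ − θ_base is scaled and summed onto the base checkpoint. A minimal PyTorch sketch, assuming learned per-parameter-block coefficients as a stand-in for the paper's anisotropic scaling (all names and shapes here are illustrative, not the atlas implementation):

  ```python
  import torch

  def compose_task_vectors(base_state, finetuned_states, coeffs):
      """Merge task vectors tau_i = theta_i - theta_base into one model.

      coeffs[i][name] is a learned scalar per parameter block `name`,
      i.e. anisotropic (per-block) rather than one uniform scale.
      """
      merged = {k: v.clone() for k, v in base_state.items()}
      for state, c in zip(finetuned_states, coeffs):
          for name, p in state.items():
              merged[name] += c[name] * (p - base_state[name])
      return merged

  # Toy usage: two "tasks" on a tiny two-parameter model.
  base = {"w": torch.zeros(4, 4), "b": torch.zeros(4)}
  task_a = {"w": torch.randn(4, 4), "b": torch.randn(4)}
  task_b = {"w": torch.randn(4, 4), "b": torch.randn(4)}
  coeffs = [{"w": torch.tensor(0.6), "b": torch.tensor(0.3)},
            {"w": torch.tensor(0.4), "b": torch.tensor(0.7)}]
  merged = compose_task_vectors(base, [task_a, task_b], coeffs)
  ```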
- Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study, arXiv, 2411.02462, arxiv, pdf, citation: -1 · André Storhaug, Jingyue Li · (peft-unit-test-generation-replication-package - andstor)
- LoRA vs Full Fine-tuning: An Illusion of Equivalence, arXiv, 2410.21228, arxiv, pdf, citation: -1 · Reece Shuttleworth, Jacob Andreas, Antonio Torralba, ..., Pratyusha Sharma · (𝕏)
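  For reference, the update the LoRA-vs-full-fine-tuning comparison is about: the base weight stays frozen and only a rank-r factorization is trained, so the effective update B @ A can never exceed rank r, which is one structural reason the two methods can land on different solutions. A minimal sketch (hyperparameters illustrative):

  ```python
  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      """Frozen base weight W plus a trainable rank-r update (alpha/r) * B @ A."""
      def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
          super().__init__()
          self.base = base
          for p in self.base.parameters():
              p.requires_grad_(False)          # only the adapter is trained
          self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
          self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at 0
          self.scale = alpha / r

      def forward(self, x):
          return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

  layer = LoRALinear(nn.Linear(64, 64))
  delta = layer.scale * layer.B @ layer.A       # the effective weight update
  print(torch.linalg.svdvals(delta)[:8])        # rank <= r by construction
  ```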
- PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs, arXiv, 2410.05265, arxiv, pdf, citation: -1 · Mengzhao Chen, Yi Liu, Jiahao Wang, ..., Wenqi Shao, Ping Luo · (PrefixQuant - ChenMnZ) · (arxiv)
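  The static-vs-dynamic distinction here is about when the activation scale is computed: dynamic quantization derives a scale per token at runtime, while static quantization fixes one scale offline, which is faster but fragile under outlier tokens (the outliers PrefixQuant isolates into a fixed prefix). A toy illustration of that trade-off, not the paper's method:

  ```python
  import torch

  def quantize(x, scale):
      """Symmetric int8 fake-quantization with a given scale."""
      q = torch.clamp(torch.round(x / scale), -128, 127)
      return q * scale  # dequantized view, for measuring error

  def dynamic_per_token(x):
      # scale recomputed at runtime for every token (row)
      scale = x.abs().amax(dim=-1, keepdim=True) / 127
      return quantize(x, scale)

  def static_per_tensor(x, calib):
      # one scale precomputed offline from calibration data
      scale = calib.abs().amax() / 127
      return quantize(x, scale)

  x = torch.randn(4, 16)
  x[0, 0] = 40.0  # one outlier token blows up the shared static scale
  print((x - static_per_tensor(x, x)).abs().mean())  # large error
  print((x - dynamic_per_token(x)).abs().mean())     # small error
  ```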
- 🌟 Scaling Laws for Precision, arXiv, 2411.04330, arxiv, pdf, citation: -1 · Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, ..., Christopher Ré, Aditi Raghunathan · (𝕏) · (𝕏)
- 🌟 BitNet a4.8: 4-bit Activations for 1-bit LLMs, arXiv, 2411.04965, arxiv, pdf, citation: -1 · Hongyu Wang, Shuming Ma, Furu Wei
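  A rough sketch of the quantizers in play: BitNet-style ternary weights via an absmean scale (as in BitNet b1.58) paired with symmetric 4-bit activations. The paper's actual a4.8 recipe (hybrid quantization and sparsification of outlier activations) is more involved than this toy:

  ```python
  import torch

  def ternary_weights(w, eps=1e-5):
      """Absmean quantizer: scale by mean |w|, round to {-1, 0, +1}."""
      scale = w.abs().mean().clamp(min=eps)
      return torch.clamp(torch.round(w / scale), -1, 1), scale

  def int4_activations(x, eps=1e-5):
      """Symmetric per-tensor 4-bit quantization to the range [-8, 7]."""
      scale = (x.abs().amax() / 7).clamp(min=eps)
      return torch.clamp(torch.round(x / scale), -8, 7), scale

  w, x = torch.randn(64, 64), torch.randn(8, 64)
  wq, sw = ternary_weights(w)
  xq, sx = int4_activations(x)
  y = (xq @ wq.T) * (sx * sw)   # low-bit matmul, rescaled afterwards
  ```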
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, arXiv, 2411.02355, arxiv, pdf, citation: -1 · Eldar Kurtic, Alexandre Marques, Shubhra Pandit, ..., Mark Kurtz, Dan Alistarh
- QTIP: Quantization with Trellises and Incoherence Processing, arXiv, 2406.11235, arxiv, pdf, citation: 1 · Albert Tseng, Qingyao Sun, David Hou, ..., Christopher De Sa · (qtip - Cornell-RelaxML) · (x) · (t)
- Stronger Models are NOT Stronger Teachers for Instruction Tuning, arXiv, 2411.07133, arxiv, pdf, citation: -1 · Zhangchen Xu, Fengqing Jiang, Luyao Niu, ..., Bill Yuchen Lin, Radha Poovendran
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling, arXiv, 2410.11325, arxiv, pdf, citation: -1 · Wenda Xu, Rujun Han, Zifeng Wang, ..., Chen-Yu Lee, Tomas Pfister
- The Super Weight in Large Language Models, arXiv, 2411.07191, arxiv, pdf, citation: -1 · Mengxia Yu, De Wang, Qi Shan, ..., Colorado Reed, Alvin Wan
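  The "super weight" claim is that a handful of individual scalar weights matter disproportionately; the paper locates them via activation spikes in down-projections. The sketch below is a cruder proxy that just ranks the single largest-magnitude scalar per weight matrix:

  ```python
  import torch
  import torch.nn as nn

  def largest_scalar_weights(model: nn.Module, k: int = 5):
      """Return the k largest-magnitude individual weight entries,
      as (|value|, parameter name, flat index) triples."""
      hits = []
      for name, p in model.named_parameters():
          if p.dim() < 2:            # skip biases / norm vectors
              continue
          val, idx = p.detach().abs().flatten().max(dim=0)
          hits.append((val.item(), name, int(idx)))
      return sorted(hits, reverse=True)[:k]

  model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
  print(largest_scalar_weights(model))
  ```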
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity, arXiv, 2411.02335, arxiv, pdf, citation: -1 · Yuqi Luo, Chenyang Song, Xu Han, ..., Zhiyuan Liu, Maosong Sun
- What Matters in Transformers? Not All Attention is Needed, arXiv, 2406.15786, arxiv, pdf, citation: 1 · Shwai He, Guoheng Sun, Zheyu Shen, ..., Ang Li
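  The redundancy test in this line of work is simple: if an attention block's output barely differs from its input (high cosine similarity between the two hidden states), the block is a candidate for dropping. A sketch of the scoring step, assuming per-block inputs/outputs have been captured, e.g. with forward hooks:

  ```python
  import torch
  import torch.nn.functional as F

  def block_redundancy(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
      """Mean cosine similarity between a block's input and output hidden
      states; values near 1 suggest the block is nearly an identity map."""
      return F.cosine_similarity(
          h_in.flatten(1), h_out.flatten(1), dim=-1
      ).mean().item()

  # Toy check: a near-identity block scores close to 1.
  h = torch.randn(4, 16, 256)
  print(block_redundancy(h, h + 0.01 * torch.randn_like(h)))  # ~1.0
  print(block_redundancy(h, torch.randn_like(h)))             # ~0.0
  ```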
- SAM Decoding: Speculative Decoding via Suffix Automaton, arXiv, 2411.10666, arxiv, pdf, citation: -1 · Yuxuan Hu, Ke Wang, Jing Zhang, ..., Cuiping Li, Hong Chen · (SAM-Decoding - hyx1999)
- FastDraft: How to Train Your Draft, arXiv, 2411.11055, arxiv, pdf, citation: -1 · Ofir Zafrir, Igor Margulis, Dorin Shteyman, ..., Guy Boudoukh
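  Both entries above build on the same draft-and-verify loop: a cheap drafter proposes a few tokens, the target model checks them, and every accepted token saves a full target decoding step. A greedy toy version (real implementations verify all draft positions in one batched target forward pass, and sampling variants use rejection sampling instead of exact match):

  ```python
  import torch

  V = 100  # toy vocabulary size

  def target(tokens):              # "expensive" model: deterministic toy logits
      g = torch.Generator().manual_seed(hash(tuple(tokens)) % (2**31))
      return torch.randn(V, generator=g)

  def draft(tokens):               # "cheap" model: a noisy view of the target
      g = torch.Generator().manual_seed(1 + hash(tuple(tokens)) % (2**31))
      return target(tokens) + 0.3 * torch.randn(V, generator=g)

  def speculative_decode(prefix, k=4, rounds=8):
      tokens = list(prefix)
      for _ in range(rounds):
          proposal = []
          for _ in range(k):       # 1) draft k tokens autoregressively
              proposal.append(int(draft(tokens + proposal).argmax()))
          accepted = []
          for i in range(k):       # 2) verify; keep the agreeing prefix
              t = int(target(tokens + accepted).argmax())
              accepted.append(t)   # on mismatch this is the target's fix-up token
              if t != proposal[i]:
                  break
          tokens += accepted
      return tokens

  print(speculative_decode([1, 2, 3]))
  ```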
- SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration, arXiv, 2411.10958, arxiv, pdf, citation: -1 · Jintao Zhang, Haofeng Huang, Pengle Zhang, ..., Jun Zhu, Jianfei Chen · (SageAttention - thu-ml)
- distributed-llama - b4rtaz
- SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD GPUs
- OpenAI beats Anthropic and Fireworks to releasing Speculative Decoding
- Latency optimization: improve latency across a wide variety of LLM-related use cases.
- A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression, arXiv, 2406.11430, arxiv, pdf, citation: 5 · Alessio Devoto, Yu Zhao, Simone Scardapane, ..., Pasquale Minervini
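  The compression rule behind that title is nearly a one-liner: the authors observe that cached keys with a low $L_2$ norm tend to attract the most attention, so eviction keeps the lowest-norm entries. A sketch (keep ratio and shapes illustrative):

  ```python
  import torch

  def compress_kv(keys, values, keep_ratio=0.5):
      """Keep the cache entries whose key vectors have the *smallest* L2
      norm; the paper observes these tend to receive high attention."""
      T = keys.shape[0]
      k = max(1, int(T * keep_ratio))
      norms = keys.norm(dim=-1)                             # [T]
      keep = norms.topk(k, largest=False).indices.sort().values
      return keys[keep], values[keep]

  keys, values = torch.randn(128, 64), torch.randn(128, 64)
  ck, cv = compress_kv(keys, values, keep_ratio=0.25)
  print(ck.shape)  # torch.Size([32, 64])
  ```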
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration, arXiv, 2410.02367, arxiv, pdf, citation: -1 · Jintao Zhang, Jia Wei, Pengle Zhang, ..., Jun Zhu, Jianfei Chen
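  The SageAttention family quantizes the QKᵀ matmul to INT8 (with smoothing tricks for K) while keeping the PV side in higher precision. A portable toy of the idea; the rounded values are held in float here, since the real speedup needs int8 tensor-core kernels with int32 accumulation:

  ```python
  import torch

  def int8_qk_attention(q, k, v):
      """Per-tensor symmetric int8 quantization of Q and K before QK^T;
      softmax and P @ V stay in full precision."""
      sq, sk = q.abs().amax() / 127, k.abs().amax() / 127
      qi = torch.clamp(torch.round(q / sq), -128, 127)
      ki = torch.clamp(torch.round(k / sk), -128, 127)
      scores = (qi @ ki.T) * (sq * sk) / q.shape[-1] ** 0.5
      return torch.softmax(scores, dim=-1) @ v

  q, k, v = (torch.randn(16, 64) for _ in range(3))
  out = int8_qk_attention(q, k, v)
  ```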
- Fast Best-of-N Decoding via Speculative Rejection, arXiv, 2410.20290, arxiv, pdf, citation: -1 · Hanshi Sun, Momin Haider, Ruiqi Zhang, ..., Peter Bartlett, Andrea Zanette
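  Speculative rejection turns Best-of-N from "generate N full completions, then score" into "score partial completions as you go and stop the losers early". A toy of the pruning loop, with stub generation and reward functions standing in for the real models:

  ```python
  import random
  import statistics

  def speculative_rejection(step_fn, reward_fn, n=16, chunk=16, rounds=4):
      """Best-of-N with early pruning: after each chunk of tokens, drop the
      candidates whose partial reward is below the surviving batch's median."""
      beams = [[] for _ in range(n)]
      for _ in range(rounds):
          beams = [b + step_fn(b, chunk) for b in beams]
          scores = [reward_fn(b) for b in beams]
          cut = statistics.median(scores)
          beams = [b for b, s in zip(beams, scores) if s >= cut]
      return max(beams, key=reward_fn)

  # Hypothetical stubs: random tokens; reward = mean token id.
  step = lambda b, c: [random.randrange(100) for _ in range(c)]
  reward = lambda b: sum(b) / len(b) if b else 0.0
  print(len(speculative_rejection(step, reward, n=8)))  # 4 * 16 = 64 tokens
  ```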
- Optimizing and Characterizing High-Throughput Low-Latency LLM Inference in MLCEngine · (reddit)
- Battle of Inference Engines: Llama.cpp vs MLC LLM vs vLLM · (reddit)
- Universal Assisted Generation: Faster Decoding with Any Assistant Model 🤗
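  In recent 🤗 transformers releases this is exposed directly through generate: pass an assistant_model, and when the assistant uses a different tokenizer, pass both tokenizers so drafts can be re-encoded between vocabularies. A sketch; the checkpoint names are placeholders:

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  target_ckpt = "big-model-checkpoint"    # placeholder names
  draft_ckpt = "tiny-model-checkpoint"

  tokenizer = AutoTokenizer.from_pretrained(target_ckpt)
  assistant_tokenizer = AutoTokenizer.from_pretrained(draft_ckpt)
  model = AutoModelForCausalLM.from_pretrained(target_ckpt)
  assistant_model = AutoModelForCausalLM.from_pretrained(draft_ckpt)

  inputs = tokenizer("Alice and Bob", return_tensors="pt")
  outputs = model.generate(
      **inputs,
      assistant_model=assistant_model,
      tokenizer=tokenizer,                    # both tokenizers are needed
      assistant_tokenizer=assistant_tokenizer,  # when the vocabularies differ
  )
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```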
- Models continually pretrained using LayerSkip 🤗 · (arxiv)
- SlimLM: An Efficient Small Language Model for On-Device Document Assistance, arXiv, 2411.09944, arxiv, pdf, citation: -1 · Thang M. Pham, Phat T. Nguyen, Seunghyun Yoon, ..., Franck Dernoncourt, Trung Bui
- Hymba: A Hybrid-head Architecture for Small Language Models, arXiv, 2411.13676, arxiv, pdf, citation: -1 · Xin Dong, Yonggan Fu, Shizhe Diao, ..., Jan Kautz, Pavlo Molchanov
- MobileLLM is an auto-regressive language model leveraging an optimized transformer architecture 🤗 · (arxiv)
- 🌟 SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration, arXiv, 2411.10958, arxiv, pdf, citation: -1 · Jintao Zhang, Haofeng Huang, Pengle Zhang, ..., Jun Zhu, Jianfei Chen · (SageAttention - thu-ml)
- ThunderKittens: Simple, Fast, and Adorable AI Kernels, arXiv, 2410.20399, arxiv, pdf, citation: -1 · Benjamin F. Spector, Simran Arora, Aaryan Singhal, ..., Daniel Y. Fu, Christopher Ré
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs, arXiv, 2410.13276, arxiv, pdf, citation: -1 · Yizhao Gao, Zhichen Zeng, Dayou Du, ..., Fan Yang, Mao Yang
- MoH: Multi-Head Attention as Mixture-of-Head Attention, arXiv, 2410.11842, arxiv, pdf, citation: -1 · Peng Jin, Bo Zhu, Li Yuan, ..., Shuicheng Yan · (arxiv) · (MoH - SkyworkAI) · (huggingface)
- nano-sparse-attention - PiotrNawrot · (𝕏)
- sgl-learning-materials - sgl-project
- exo - exo-explore: run your own AI cluster at home with everyday devices.