diff --git a/README.md b/README.md
index 69a18775136..00cf1dad925 100644
--- a/README.md
+++ b/README.md
@@ -71,6 +71,7 @@ For the latest model updates and features, please see [MODEL_UPDATES.md](models/
 - [Advanced Performance Optimizations for Models](./tech_reports/AdvancedPerformanceOperationsForModels/AdvancedPerformanceOptimizationsForModels.md) (updated Oct 17th)
 - [Programming Mesh of Devices](./tech_reports/Programming%20Mesh%20of%20Devices/Programming%20Mesh%20of%20Devices%20with%20TT-NN.md) (updated Sept 9th)
 - [ViT Implementation in TT-NN on GS](./tech_reports/ViT-TTNN/vit.md) (updated Sept 22nd)
+- [LLM Bring-up in TT-NN](./tech_reports/LLMs/llms.md) (updated Oct 29th)

 ---
diff --git a/tech_reports/LLMs/llms.md b/tech_reports/LLMs/llms.md
new file mode 100644
index 00000000000..4b4a34f6a7c
--- /dev/null
+++ b/tech_reports/LLMs/llms.md
@@ -0,0 +1,112 @@
+# LLMs in TT-NN
+Authors:
+## Contents
+- [LLMs in TT-NN](#llms-in-tt-nn)
+  - [Contents](#contents)
+  - [1. Overview](#1-overview)
+  - [2. Modules](#2-modules)
+    - [2.1 Embedding](#21-embedding)
+    - [2.2 RoPE](#22-rope)
+    - [2.3 Norm](#23-norm)
+    - [2.4 Attention](#24-attention)
+    - [2.5 MLP](#25-mlp)
+    - [2.6 Decoder](#26-decoder)
+    - [2.7 LM Head](#27-lm-head)
+  - [3. Features](#3-features)
+    - [3.1 Generative Decoding](#31-generative-decoding)
+    - [3.2 Prefill and Decode](#32-prefill-and-decode)
+    - [3.3 Multi-Device](#33-multi-device)
+    - [3.4 Continuous Batching](#34-continuous-batching)
+    - [3.5 vLLM Integration](#35-vllm-integration)
+  - [4. Best Practices and Optimizations](#4-best-practices-and-optimizations)
+    - [4.1 Tracing](#41-tracing)
+    - [4.2 Async Mode](#42-async-mode)
+    - [4.3 Multiple CQs](#43-multiple-cqs)
+    - [4.4 Op Configs](#44-op-configs)
+    - [4.5 Accuracy](#45-accuracy)
+    - [4.6 Performance Analysis](#46-performance-analysis)
+    - [4.7 Misc. Performance Optimizations](#47-misc-performance-optimizations)
+    - [4.8 Module Tests](#48-module-tests)
+    - [4.9 Performance Testing](#49-performance-testing)
+    - [4.10 Common Pitfalls](#410-common-pitfalls)
+      - [4.10.1 Error Messages](#4101-error-messages)
+      - [4.10.2 Shard Spec Mismatches](#4102-shard-spec-mismatches)
+      - [4.10.3 Ethernet Dispatch Cores](#4103-ethernet-dispatch-cores)
+      - [4.10.4 Hangs](#4104-hangs)
+        - [4.10.4.1 Tracing](#41041-tracing)
+        - [4.10.4.2 Large Matmuls](#41042-large-matmuls)
+
+## 1. Overview
+## 2. Modules
+### 2.1 Embedding
+### 2.2 RoPE
+  - Iterative update system
+  - When to use our fused op
+### 2.3 Norm
+  - Replicated layernorm vs distributed layernorm
+  - Layernorm/rmsnorm weights in row-major / wrapped around tile size trick
+### 2.4 Attention
+  - Flash Attention and Flash Decode
+    - general description
+    - limitations
+    - which dims are parallelized
+### 2.5 MLP
+### 2.6 Decoder
+### 2.7 LM Head
+## 3. Features
+### 3.1 Generative Decoding
+### 3.2 Prefill and Decode
+  - submodules, tests
+  - how to combine prefill and decode
+  - slicing prefill to fit in L1
+### 3.3 Multi-Device
+  - device mesh
+  - column parallel followed by row parallel
+  - sharding, CCL ops, reducing CCL overheads, etc.
+### 3.4 Continuous Batching
+  - quick intro and how it is implemented in demos.
+### 3.5 vLLM Integration
+  - Our vLLM repo and what's needed to integrate with it.
+## 4. Best Practices and Optimizations
+### 4.1 Tracing
+  - link to existing doc, why it helps decode more
+### 4.2 Async Mode
+### 4.3 Multiple CQs
+  - how to feed back output to input and read output asynchronously
+### 4.4 Op Configs
+  - Writing correct program configs and shard specs
+  - Deciding how many cores to run an op on
+    - Why did we use 16 cores for MLP
+  - Which matmul to use when @Colman Glagovich
+    - 1d, 2d, dram-sharded, ...
+  - Implicitly padding weights in program config for matmuls
+### 4.5 Accuracy
+  - How we measure it (PCC, perplexity, top-1/top-5, end-user tests, benchmarking)
+  - How much PCC is enough? Rules of thumb.
+  - Accuracy tests
+  - Debugging PCC issues
+### 4.6 Performance Analysis
+  - Performance tooling, Tracy
+### 4.7 Misc. Performance Optimizations
+  - Which dim to shard matmuls on
+  - DRAM-sharding
+  - Avoiding sharded to interleaved calls
+### 4.8 Module Tests
+### 4.9 Performance Testing
+### 4.10 Common Pitfalls
+#### 4.10.1 Error Messages
+  - Running out of L1
+  - Shard spec and program config mismatches
+  - For some TTNN ops (e.g. ttnn.all_gather), passing -1 as the dim argument is not supported (see the dim-normalization sketch at the end of this report)
+    - You'll see an error related to op invocation where the arguments don't match
+#### 4.10.2 Shard Spec Mismatches
+#### 4.10.3 Ethernet Dispatch Cores
+  - link to any other description, and mention it is needed for N300 and T3K
+#### 4.10.4 Hangs
+##### 4.10.4.1 Tracing
+  - Host communications cause tracing to hang
+  - Running without async mode enabled causes tracing to hang
+  - Careful with print in tracing
+##### 4.10.4.2 Large Matmuls
+  - Large matmuls hanging? Link to appropriate ticket with workaround
+  - Issue is being investigated with a workaround of setting the output subblock to 1,1 and grid size to 8x7 (see the config sketch below)
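+
+Below is a minimal, hedged sketch of that large-matmul workaround. It assumes the 2D multicast matmul program config (`ttnn.MatmulMultiCoreReuseMultiCastProgramConfig`); the block and per-core values are illustrative placeholders that must be derived from your actual tile dimensions, and field names can vary between ttnn versions.
+
+```python
+import ttnn
+
+# Hypothetical workaround config for a large matmul that hangs:
+# force 1x1 output subblocks and an 8x7 core grid.
+program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
+    compute_with_storage_grid_size=(8, 7),  # workaround: 8x7 grid
+    in0_block_w=4,       # K tiles per block (placeholder, depends on shapes)
+    out_subblock_h=1,    # workaround: 1x1 output subblock
+    out_subblock_w=1,
+    per_core_M=8,        # M tiles per core (placeholder)
+    per_core_N=4,        # N tiles per core (placeholder)
+    transpose_mcast=False,
+    fused_activation=None,
+)
+
+# activations and weights are assumed to be tile-layout tensors already on device.
+output = ttnn.matmul(activations, weights, program_config=program_config)
+```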
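+
+Relatedly, for the `dim` pitfall listed under Error Messages (4.10.1), here is a small sketch of normalizing a negative dim before calling `ttnn.all_gather`. The wrapper name is ours, and the exact `all_gather` signature (e.g. the `num_links` argument) should be checked against the current ttnn API.
+
+```python
+import ttnn
+
+def all_gather_checked(mesh_tensor, dim, num_links=1):
+    """Convert a negative dim (e.g. -1) to a positive index, since some
+    TT-NN CCL ops such as ttnn.all_gather reject negative dims."""
+    rank = len(mesh_tensor.shape)  # assumes the shape object supports len()
+    if dim < 0:
+        dim += rank  # e.g. dim=-1 on a rank-4 tensor becomes dim=3
+    return ttnn.all_gather(mesh_tensor, dim=dim, num_links=num_links)
+```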