#0: Outline of the LLMs tech report
uaydonat committed Oct 23, 2024
1 parent 16e18b1 commit 7337d15
Showing 2 changed files with 113 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -71,6 +71,7 @@ For the latest model updates and features, please see [MODEL_UPDATES.md](models/
- [Advanced Performance Optimizations for Models](./tech_reports/AdvancedPerformanceOperationsForModels/AdvancedPerformanceOptimizationsForModels.md) (updated Oct 17th)
- [Programming Mesh of Devices](./tech_reports/Programming%20Mesh%20of%20Devices/Programming%20Mesh%20of%20Devices%20with%20TT-NN.md) (updated Sept 9th)
- [ViT Implementation in TT-NN on GS](./tech_reports/ViT-TTNN/vit.md) (updated Sept 22nd)
- [LLMs Bring-up in TT-NN](./tech_reports/LLMs/llms.md) (updated Oct 29th)
---

<div align="center">
112 changes: 112 additions & 0 deletions tech_reports/LLMs/llms.md
@@ -0,0 +1,112 @@
# LLMs in TT-NN
Authors:
## Contents
- [LLMs in TT-NN](#llms-in-tt-nn)
  - [Contents](#contents)
  - [1. Overview](#1-overview)
  - [2. Modules](#2-modules)
    - [2.1 Embedding](#21-embedding)
    - [2.2 RoPE](#22-rope)
    - [2.3 Norm](#23-norm)
    - [2.4 Attention](#24-attention)
    - [2.5 MLP](#25-mlp)
    - [2.6 Decoder](#26-decoder)
    - [2.7 LM Head](#27-lm-head)
  - [3. Features](#3-features)
    - [3.1 Generative Decoding](#31-generative-decoding)
    - [3.2 Prefill and Decode](#32-prefill-and-decode)
    - [3.3 Multi-Device](#33-multi-device)
    - [3.4 Continuous Batching](#34-continuous-batching)
    - [3.5 vLLM Integration](#35-vllm-integration)
  - [4. Best Practices and Optimizations](#4-best-practices-and-optimizations)
    - [4.1 Tracing](#41-tracing)
    - [4.2 Async Mode](#42-async-mode)
    - [4.3 Multiple CQs](#43-multiple-cqs)
    - [4.4 Op Configs](#44-op-configs)
    - [4.5 Accuracy](#45-accuracy)
    - [4.6 Performance Analysis](#46-performance-analysis)
    - [4.7 Misc. Performance Optimizations](#47-misc-performance-optimizations)
    - [4.8 Module Tests](#48-module-tests)
    - [4.9 Performance Testing](#49-performance-testing)
    - [4.10 Common Pitfalls](#410-common-pitfalls)
      - [4.10.1 Error Messages](#4101-error-messages)
      - [4.10.2 Shard Spec Mismatches](#4102-shard-spec-mismatches)
      - [4.10.3 Ethernet Dispatch Cores](#4103-ethernet-dispatch-cores)
      - [4.10.4 Hangs](#4104-hangs)
        - [4.10.4.1 Tracing](#41041-tracing)
        - [4.10.4.2 Large Matmuls](#41042-large-matmuls)

## 1. Overview
## 2. Modules
### 2.1 Embedding
### 2.2 RoPE
- Iterative update system
- When to use our fused op
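As a reference point for this section, a minimal host-side sketch of rotary embeddings in plain PyTorch (the math only, not the TT-NN fused op). `build_rope_cache` and `apply_rope` are illustrative names, the interleaved-pair rotation convention is an assumption that varies between models, and the decode-time iterative update reduces to indexing the precomputed cos/sin tables at the current position.

```python
import torch

def build_rope_cache(head_dim: int, max_seq_len: int, base: float = 10000.0):
    # Precompute cos/sin tables once; decode only indexes into them.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)  # [max_seq_len, head_dim // 2]
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, pos: int):
    # x: [batch, n_heads, seq_len, head_dim]; rotate channel pairs by the
    # angle for their absolute position (pos .. pos + seq_len - 1).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c = cos[pos : pos + x.shape[-2]]
    s = sin[pos : pos + x.shape[-2]]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * c - x2 * s
    out[..., 1::2] = x1 * s + x2 * c
    return out
```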
### 2.3 Norm
- Replicated layernorm vs distributed layernorm (sketched after this list)
- Layernorm/RMSNorm weights in row-major layout / wrapped around tile size trick
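A hedged sketch contrasting the two, using RMSNorm in plain PyTorch; `all_reduce_sum` is a placeholder for whichever CCL op combines the partial statistics across the mesh, and sharding the hidden dimension across devices is an assumption.

```python
import torch

def rmsnorm_replicated(x, weight, eps=1e-5):
    # Every device holds the full hidden dim: one local reduction, no communication.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def rmsnorm_distributed(x_shard, weight_shard, hidden_size, all_reduce_sum, eps=1e-5):
    # Each device holds a slice of the hidden dim: the sum of squares is computed
    # locally, combined across devices, and each device normalizes only its shard.
    local_sq_sum = x_shard.pow(2).sum(-1, keepdim=True)
    global_sq_sum = all_reduce_sum(local_sq_sum)
    inv_rms = torch.rsqrt(global_sq_sum / hidden_size + eps)
    return x_shard * inv_rms * weight_shard
```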
### 2.4 Attention
- Flash Attention and Flash Decode
  - general description
  - limitations
  - which dims are parallelized
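For orientation only, a plain PyTorch reference of decode-time attention against a KV cache; the shapes and grouped-query handling are assumptions. This shows the math being computed, not the flash-decode kernel, which additionally splits the cached sequence across workers and combines partial softmax results.

```python
import torch

def decode_attention_reference(q, k_cache, v_cache, cur_len):
    # q:       [batch, n_heads, 1, head_dim]     -- one new token per user
    # k_cache: [batch, n_kv_heads, max_seq, head_dim]
    # v_cache: [batch, n_kv_heads, max_seq, head_dim]
    k = k_cache[:, :, :cur_len]
    v = v_cache[:, :, :cur_len]
    # Grouped-query attention: repeat KV heads to match the query heads.
    if q.shape[1] != k.shape[1]:
        rep = q.shape[1] // k.shape[1]
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)  # [batch, heads, 1, cur_len]
    probs = torch.softmax(scores, dim=-1)
    return probs @ v                                           # [batch, heads, 1, head_dim]
```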
### 2.5 MLP
### 2.6 Decoder
### 2.7 LM Head
## 3. Features
### 3.1 Generative Decoding
### 3.2 Prefill and Decode
- submodules, tests
- how to combine prefill and decode
- slicing prefill to fit in L1 (see the sketch below)
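A minimal sketch of slicing prefill, assuming a hypothetical `model.prefill_forward(chunk, start_pos)` entry point that appends to the KV cache; the chunk length is a placeholder to be tuned to what fits in L1.

```python
def chunked_prefill(model, tokens, chunk_len=2048):
    # Run prefill in fixed-size slices so each chunk's activations fit in L1.
    logits = None
    for start in range(0, len(tokens), chunk_len):
        chunk = tokens[start : start + chunk_len]
        logits = model.prefill_forward(chunk, start_pos=start)
    # Only the logits for the last prompt token are needed to start decoding.
    return logits
```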
### 3.3 Multi-Device
- device mesh
- column parallel followed by row parallel (see the sketch after this list)
- sharding, CCL ops, reducing CCL overheads, etc.
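A host-side sketch of the column-parallel-followed-by-row-parallel pattern, simulating the mesh with a list of per-device weight shards; the MLP structure is an assumption, and the final sum stands in for the CCL op (all-reduce, or reduce-scatter plus all-gather). The point is that the pair of matmuls needs only one reduction at the end.

```python
import torch
import torch.nn.functional as F

def column_then_row_parallel_mlp(x, w_up_shards, w_down_shards):
    # Column-parallel up-projection: each "device" holds a column slice of w_up,
    # so every device computes its slice of the hidden activation locally.
    hidden_shards = [F.silu(x @ w_up) for w_up in w_up_shards]
    # Row-parallel down-projection: each device holds the matching row slice of
    # w_down and produces a partial result of the full output.
    partials = [h @ w_down for h, w_down in zip(hidden_shards, w_down_shards)]
    # On hardware this sum is a CCL op across the device mesh; here it is just
    # an elementwise sum of the partial outputs.
    return sum(partials)
```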
### 3.4 Continuous Batching
- quick intro and how it is implemented in demos.
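A toy sketch of the idea only (the demos are the reference implementation); `scheduler`, `model.prefill`, and `model.decode_step` are illustrative names, not real APIs. The point is that requests enter and leave the batch independently, so decode never waits for the slowest user.

```python
def continuous_batching_loop(scheduler, model, max_batch_size):
    active = {}  # user slot -> in-flight request
    while scheduler.has_waiting() or active:
        # Fill free slots with waiting requests; prefill populates their KV cache slot.
        for slot in range(max_batch_size):
            if slot not in active and scheduler.has_waiting():
                req = scheduler.pop_waiting()
                model.prefill(slot, req.prompt_tokens)
                active[slot] = req
        # One decode step produces one new token for every active user at once.
        new_tokens = model.decode_step(sorted(active.keys()))
        for slot, tok in zip(sorted(active.keys()), new_tokens):
            active[slot].append_token(tok)
            if active[slot].finished():
                scheduler.complete(active.pop(slot))
```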
### 3.5 vLLM Integration
- Our vLLM repo and what's needed to integrate with it.
## 4. Best Practices and Optimizations
### 4.1 Tracing
- link to existing doc, why it helps decode more
### 4.2 Async Mode
### 4.3 Multiple CQs
- how to feed back output to input and read output asynchronously
### 4.4 Op Configs
- Writing correct program configs and shard specs
- Deciding how many cores to run an op on (see the sketch after this list)
  - Why did we use 16 cores for MLP
- Which matmul to use when @Colman Glagovich
  - 1d, 2d, dram-sharded, ...
- Implicitly padding weights in program config for matmuls
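As a back-of-the-envelope aid for the core-count question above, a small helper showing how a dimension splits into 32x32 tiles per core and how uneven splits imply implicitly padded weights; the sizes and core counts below are placeholder numbers, not a statement about any particular model.

```python
import math

TILE = 32  # tile height/width in elements

def per_core_tiles(dim_elems: int, num_cores: int) -> int:
    # Tiles each core handles when dim_elems is split across num_cores; when the
    # division is uneven, the program config effectively pads the dimension (and
    # hence the weights) up to num_cores * per_core tiles.
    tiles = math.ceil(dim_elems / TILE)
    return math.ceil(tiles / num_cores)

# 14336 elems = 448 tiles -> 28 tiles/core on 16 cores, 7 tiles/core on 64 cores
# (both even splits; the better choice depends on whether the op is compute- or
# DRAM-bandwidth-bound).
print(per_core_tiles(14336, 16), per_core_tiles(14336, 64))
# 4096 elems = 128 tiles over 24 cores -> 6 tiles/core, i.e. implicit padding
# from 128 to 144 tiles.
print(per_core_tiles(4096, 24))
```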
### 4.5 Accuracy
- How we measure it (PCC, perplexity, top-1/top-5, end-user tests, benchmarking)
- How much PCC is enough? Rules of thumb.
- Accuracy tests
- Debugging PCC issues
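One common way to compute the PCC referenced above, as a hedged sketch; the helper used by the actual test harness may differ.

```python
import numpy as np

def compute_pcc(expected, actual) -> float:
    # Pearson correlation coefficient between a reference tensor and a device
    # output, both flattened; 1.0 means a perfect linear match. (Undefined if
    # either input is constant.)
    e = np.asarray(expected, dtype=np.float64).ravel()
    a = np.asarray(actual, dtype=np.float64).ravel()
    if np.allclose(e, a):
        return 1.0
    return float(np.corrcoef(e, a)[0, 1])
```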
### 4.6 Performance Analysis
- Performance tooling, tracy
### 4.7 Misc. Performance Optimizations
- Which dim to shard matmuls on
- DRAM-sharding
- Avoiding sharded to interleaved calls
### 4.8 Module Tests
### 4.9 Performance Testing
### 4.10 Common Pitfalls
#### 4.10.1 Error Messages
- Running out of L1
- Shard spec and program config mismatches
- For some TTNN ops (e.g. ttnn.all_gather), passing -1 as the dim argument is not supported.
  - You'll see an error at op invocation where the arguments don't match (see the sketch below).
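A tiny sketch of the usual fix, assuming the only change needed is converting the negative dim to its positive index before the call.

```python
def normalize_dim(dim: int, rank: int) -> int:
    # Convert a negative dim (e.g. -1) to its positive index before passing it
    # to ops such as ttnn.all_gather, which reject negative dims.
    return dim if dim >= 0 else dim + rank

# e.g. for a rank-4 tensor, dim=-1 becomes dim=3:
# gathered = ttnn.all_gather(tensor, dim=normalize_dim(-1, 4))
```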
#### 4.10.2 Shard Spec Mismatches
#### 4.10.3 Ethernet Dispatch Cores
- link to any other description, and mention it is needed for N300 and T3K
#### 4.10.4 Hangs
##### 4.10.4.1 Tracing
- Host communications cause tracing to hang
- Running without async mode enabled causes tracing to hang
- Careful with print in tracing
##### 4.10.4.2 Large Matmuls
- Large matmuls hanging? Link to appropriate ticket with workaround
- The issue is being investigated; the workaround is to set the output subblock to 1x1 and the grid size to 8x7 (hedged sketch below)
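A hedged sketch of what that workaround might look like in a 2D matmul program config; the class and field names follow ttnn's multicast matmul program config as an assumption and may differ between versions, and the block/per-core values are placeholders for the real matmul shape.

```python
import ttnn

# Workaround sketch: force a 1x1 output subblock and an 8x7 compute grid.
program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
    compute_with_storage_grid_size=(8, 7),
    in0_block_w=4,        # placeholder
    out_subblock_h=1,     # workaround: 1x1 output subblock
    out_subblock_w=1,
    per_core_M=8,         # placeholder
    per_core_N=8,         # placeholder
    transpose_mcast=False,
    fused_activation=None,
)
```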
