diff --git a/README.md b/README.md
index 69a18775136..00cf1dad925 100644
--- a/README.md
+++ b/README.md
@@ -71,6 +71,7 @@ For the latest model updates and features, please see [MODEL_UPDATES.md](models/
 - [Advanced Performance Optimizations for Models](./tech_reports/AdvancedPerformanceOperationsForModels/AdvancedPerformanceOptimizationsForModels.md) (updated Oct 17th)
 - [Programming Mesh of Devices](./tech_reports/Programming%20Mesh%20of%20Devices/Programming%20Mesh%20of%20Devices%20with%20TT-NN.md) (updated Sept 9th)
 - [ViT Implementation in TT-NN on GS](./tech_reports/ViT-TTNN/vit.md) (updated Sept 22nd)
+- [LLM Bring-up in TT-NN](./tech_reports/LLMs/llms.md) (updated Oct 29th)

 ---
diff --git a/tech_reports/LLMs/llms.md b/tech_reports/LLMs/llms.md
new file mode 100644
index 00000000000..4b4a34f6a7c
--- /dev/null
+++ b/tech_reports/LLMs/llms.md
@@ -0,0 +1,112 @@
+# LLMs in TT-NN
+Authors:
+## Contents
+- [LLMs in TT-NN](#llms-in-tt-nn)
+  - [Contents](#contents)
+  - [1. Overview](#1-overview)
+  - [2. Modules](#2-modules)
+    - [2.1 Embedding](#21-embedding)
+    - [2.2 RoPE](#22-rope)
+    - [2.3 Norm](#23-norm)
+    - [2.4 Attention](#24-attention)
+    - [2.5 MLP](#25-mlp)
+    - [2.6 Decoder](#26-decoder)
+    - [2.7 LM Head](#27-lm-head)
+  - [3. Features](#3-features)
+    - [3.1 Generative Decoding](#31-generative-decoding)
+    - [3.2 Prefill and Decode](#32-prefill-and-decode)
+    - [3.3 Multi-Device](#33-multi-device)
+    - [3.4 Continuous Batching](#34-continuous-batching)
+    - [3.5 vLLM Integration](#35-vllm-integration)
+  - [4. Best Practices and Optimizations](#4-best-practices-and-optimizations)
+    - [4.1 Tracing](#41-tracing)
+    - [4.2 Async Mode](#42-async-mode)
+    - [4.3 Multiple CQs](#43-multiple-cqs)
+    - [4.4 Op Configs](#44-op-configs)
+    - [4.5 Accuracy](#45-accuracy)
+    - [4.6 Performance Analysis](#46-performance-analysis)
+    - [4.7 Misc. Performance Optimizations](#47-misc-performance-optimizations)
+    - [4.8 Module Tests](#48-module-tests)
+    - [4.9 Performance Testing](#49-performance-testing)
+    - [4.10 Common Pitfalls](#410-common-pitfalls)
+      - [4.10.1 Error Messages](#4101-error-messages)
+      - [4.10.2 Shard Spec Mismatches](#4102-shard-spec-mismatches)
+      - [4.10.3 Ethernet Dispatch Cores](#4103-ethernet-dispatch-cores)
+      - [4.10.4 Hangs](#4104-hangs)
+        - [4.10.4.1 Tracing](#41041-tracing)
+        - [4.10.4.2 Large Matmuls](#41042-large-matmuls)
+
+## 1. Overview
+## 2. Modules
+### 2.1 Embedding
+### 2.2 RoPE
+  - Iterative update system
+  - When to use our fused op
+### 2.3 Norm
+  - Replicated layernorm vs distributed layernorm
+  - Layernorm/rmsnorm weights in row-major / wrapped around tile size trick
+### 2.4 Attention
+  - Flash Attention and Flash Decode
+    - general description
+    - limitations
+    - which dims are parallelized
+### 2.5 MLP
+### 2.6 Decoder
+### 2.7 LM Head
+## 3. Features
+### 3.1 Generative Decoding
+### 3.2 Prefill and Decode
+  - submodules, tests
+  - how to combine prefill and decode
+  - slicing prefill to fit in L1
+### 3.3 Multi-Device
+  - device mesh
+  - column parallel followed by row parallel
+  - sharding, CCL ops, reducing CCL overheads, etc.
+### 3.4 Continuous Batching
+  - quick intro and how it is implemented in demos.
+### 3.5 vLLM Integration
+  - Our vLLM repo and what's needed to integrate with it.
+## 4. Best Practices and Optimizations
+### 4.1 Tracing
+  - link to existing doc, why it helps decode more
+### 4.2 Async Mode
+### 4.3 Multiple CQs
+  - how to feed back output to input and read output asynchronously
+### 4.4 Op Configs
+  - Writing correct program configs and shard specs
+  - Deciding how many cores to run an op on
+    - Why did we use 16 cores for MLP
+  - Which matmul to use when @Colman Glagovich
+    - 1d, 2d, dram-sharded, ...
+  - Implicitly padding weights in program config for matmuls
+### 4.5 Accuracy
+  - How we measure it (PCC, perplexity, top-1/top-5, end-user tests, benchmarking)
+  - How much PCC is enough? Rules of thumb.
+  - Accuracy tests
+  - Debugging PCC issues
+### 4.6 Performance Analysis
+  - Performance tooling, Tracy
+### 4.7 Misc. Performance Optimizations
+  - Which dim to shard matmuls on
+  - DRAM-sharding
+  - Avoiding sharded to interleaved calls
+### 4.8 Module Tests
+### 4.9 Performance Testing
+### 4.10 Common Pitfalls
+#### 4.10.1 Error Messages
+  - Running out of L1
+  - Shard spec and program config mismatches
+  - For some TTNN ops (e.g. ttnn.all_gather), passing -1 as the dim argument is not supported (see the dim-normalization sketch at the end of this report)
+    - You'll see an error related to op invocation where the arguments don't match
+#### 4.10.2 Shard Spec Mismatches
+#### 4.10.3 Ethernet Dispatch Cores
+  - link to any other description, and mention it is needed for N300 and T3K
+#### 4.10.4 Hangs
+##### 4.10.4.1 Tracing
+  - Host communications cause tracing to hang
+  - Running without async mode enabled causes tracing to hang
+  - Careful with print in tracing
+##### 4.10.4.2 Large Matmuls
+  - Large matmuls hanging? Link to appropriate ticket with workaround
+  - Issue is being investigated with a workaround of setting the output subblock to 1,1 and grid size to 8x7 (see the config sketch below)
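+
+Below is a minimal, hedged sketch of that large-matmul workaround. It assumes the 2D multicast matmul program config (`ttnn.MatmulMultiCoreReuseMultiCastProgramConfig`); the block and per-core values are illustrative placeholders that must be derived from your actual tile dimensions, and field names can vary between ttnn versions.
+
+```python
+import ttnn
+
+# Hypothetical workaround config for a large matmul that hangs:
+# force 1x1 output subblocks and an 8x7 core grid.
+program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
+    compute_with_storage_grid_size=(8, 7),  # workaround: 8x7 grid
+    in0_block_w=4,       # K tiles per block (placeholder, depends on shapes)
+    out_subblock_h=1,    # workaround: 1x1 output subblock
+    out_subblock_w=1,
+    per_core_M=8,        # M tiles per core (placeholder)
+    per_core_N=4,        # N tiles per core (placeholder)
+    transpose_mcast=False,
+    fused_activation=None,
+)
+
+# activations and weights are assumed to be tile-layout tensors already on device.
+output = ttnn.matmul(activations, weights, program_config=program_config)
+```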
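+
+Relatedly, for the `dim` pitfall listed under Error Messages (4.10.1), here is a small sketch of normalizing a negative dim before calling `ttnn.all_gather`. The wrapper name is ours, and the exact `all_gather` signature (e.g. the `num_links` argument) should be checked against the current ttnn API.
+
+```python
+import ttnn
+
+def all_gather_checked(mesh_tensor, dim, num_links=1):
+    """Convert a negative dim (e.g. -1) to a positive index, since some
+    TT-NN CCL ops such as ttnn.all_gather reject negative dims."""
+    rank = len(mesh_tensor.shape)  # assumes the shape object supports len()
+    if dim < 0:
+        dim += rank  # e.g. dim=-1 on a rank-4 tensor becomes dim=3
+    return ttnn.all_gather(mesh_tensor, dim=dim, num_links=num_links)
+```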