[SYCL] Support SYCL layer for LLaMA2 model #272
Conversation
Force-pushed from 4dd6636 to 11772b9
Force-pushed from ecca7b9 to 16231e3
SYCL UTs passed, and llama2 SYCL model_eval runs on the i7-1185G7's Iris iGPU, the 12900K (as the CPU device), and the A770.
QKV+FFN on GPU, MHA on CPU, 9600K + A770, skipping MHA: 13.6 ms/token (SYCL only).
Skipping the CPU MHA avoids memcpy between host and device; all other layers run on the GPU. This latency plus the MHA latency should give the final end-to-end performance.
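As a rough illustration of why skipping the CPU MHA helps (hypothetical code, not this PR's actual API): if activations stay in SYCL USM device memory across the QKV/FFN layers, no host/device memcpy is needed until the final result is read back, whereas a CPU MHA in the middle forces a round trip every layer.

```cpp
// Illustrative sketch only. Activations live in USM device memory for the
// whole layer stack, so GPU layers chain without host<->device copies.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  constexpr size_t n = 4096;  // hypothetical hidden size

  float* act = sycl::malloc_device<float>(n, q);
  float* out = sycl::malloc_device<float>(n, q);

  std::vector<float> host_in(n, 1.0f);
  q.memcpy(act, host_in.data(), n * sizeof(float)).wait();

  // Stand-in for one GPU layer (a QKV or FFN GEMM in the real model).
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    out[i] = act[i] * 0.5f;
  }).wait();

  // Running MHA on the CPU at this point would require copying `out` to the
  // host and the MHA result back to the device -- the memcpy avoided above.

  std::vector<float> host_out(n);
  q.memcpy(host_out.data(), out, n * sizeof(float)).wait();

  sycl::free(act, q);
  sycl::free(out, q);
  return 0;
}
```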
LGTM
Running all LLaMA2 layers on the A750: 19.5 ms/token.
n_context = 1024, in = 512, out = 512, only 5.8 GB of GPU memory.
A750 sample output (truncated): "Once again, here's a picture of the two sides of my face. It …"
MTL 155H: ~55 ms/token; the total latency of the int4 GEMM is 47 ms.
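For context on the int4 GEMM figure, here is a minimal sketch of a typical int4 weight dequantization (assumed Q4-style packing with a per-group scale; not necessarily this repo's exact layout or kernel):

```cpp
// Illustrative only: two 4-bit weights packed per byte, dequantized with a
// per-group float scale before the GEMM multiplies them by the activations.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<float> dequant_int4(const std::vector<uint8_t>& packed,
                                const std::vector<float>& scales,
                                size_t group_size = 32) {
  std::vector<float> w(packed.size() * 2);
  for (size_t i = 0; i < packed.size(); ++i) {
    const float s = scales[(i * 2) / group_size];
    // Each nibble holds an unsigned 4-bit value; subtracting 8 recenters it.
    w[i * 2 + 0] = (static_cast<int>(packed[i] & 0x0F) - 8) * s;
    w[i * 2 + 1] = (static_cast<int>(packed[i] >> 4) - 8) * s;
  }
  return w;
}
```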
Force-pushed from 9e9e05d to f08b8ea
llama2-7b on A770+9900K:
Type of Change