unify xpu and cpu backend and use paged attention #1009

Open · wants to merge 28 commits into base: main

Conversation

sywangyi (Collaborator)

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

sywangyi and others added 14 commits October 8, 2024 22:57
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* refine class IPEXPagedCache's update method

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* replace tensor on xpu to List to avoid memory copy

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* split IPEXPagedCache's update function into `update_for_prefill` and `update_for_decode`

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
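
The squashed commit above reworks IPEXPagedCache so its update path is split by phase and XPU tensors are held as Python lists. A minimal sketch of that shape, with every name and signature assumed rather than taken from the PR:

import torch

class PagedCacheSketch:
    # hypothetical illustration of a paged KV cache with split update paths
    def __init__(self, num_layers: int):
        # per-layer lists of tensors instead of one big tensor, so appending
        # new blocks does not force a device-side memory copy on XPU
        self.key_cache = [[] for _ in range(num_layers)]
        self.value_cache = [[] for _ in range(num_layers)]

    def update_for_prefill(self, key: torch.Tensor, value: torch.Tensor, layer_idx: int):
        # prefill: the whole prompt arrives at once and is stored in one shot
        self.key_cache[layer_idx].append(key)
        self.value_cache[layer_idx].append(value)

    def update_for_decode(self, key: torch.Tensor, value: torch.Tensor, layer_idx: int):
        # decode: only a single new token per sequence is appended each step
        self.key_cache[layer_idx].append(key)
        self.value_cache[layer_idx].append(value)
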
* enable qkv

* split key value into 2 lists
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
(#979)

* enable gpt2; falcon has a core dump error in PagedAttention.single_query_cached_kv_attention

* enable new_decoder_arch falcon

* only keep 1 config

* rm autocast
* fix bug when running IPEXCausalModel forward directly; fix bug when using `save_pretrained`

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* add LinearGelu Op support for XPU

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* fix unit test error

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* adjust unit test case

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* fix bug

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* skip assisted decoding unit test for models using paged attention (see the sketch after this commit list)

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* almost all XPU CI tests now pass

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
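
The assisted-decoding skip mentioned above could look roughly like this in the test suite (class and attribute names are hypothetical, not the PR's actual code):

import pytest

class IPEXCausalLMTestSketch:
    # hypothetical set of architectures that route through paged attention
    PAGED_ATTENTION_ARCHITECTURES = {"llama", "falcon", "gpt2"}

    def test_assisted_decoding(self, model_arch: str):
        if model_arch in self.PAGED_ATTENTION_ARCHITECTURES:
            pytest.skip("assisted decoding is not supported with paged attention")
        # ... the regular assisted-decoding assertions would follow here
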
sywangyi changed the title from "Paged attn" to "unify xpu and cpu backend and use paged attention" on Nov 22, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix ci config

* fix test versions

* fix ipex version

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
sywangyi marked this pull request as draft on November 22, 2024 01:34
* use python3.9 test

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
sywangyi marked this pull request as ready for review on November 22, 2024 03:00
jiqing-feng and others added 2 commits November 22, 2024 13:11
* change the ipex transformers version limits in setup (see the illustration below)
* fix inc tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
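
For illustration only, the version limits these commits mention might be expressed in setup.py along these lines (the package list and exact bounds are assumptions, not the PR's diff):

EXTRAS_REQUIRE = {
    # keep transformers inside the range the ipex backend is validated against
    "ipex": ["intel-extension-for-pytorch>=2.4", "transformers>=4.46,<4.47"],
}
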
yao-matrix

@IlyasMoutawwakil @echarlaix, please help review; we can also set up a meeting to review it if needed. Thanks.

* fix bert and vit patch
* fix vit and bert save


Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
IlyasMoutawwakil (Member)

@yao-matrix reviewing right now

jiqing-feng (Collaborator)

Hi @IlyasMoutawwakil, please also merge PR #1024. Thanks!

* fix reorder cache for non-patch models

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* disable torch < 2.3 tests; we won't use torch < 2.4

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix test beam search

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix cache selection

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* upgrade to transformers 4.46

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* change ipex test yaml transformers version to 4.46

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
setup.py (review thread outdated and resolved)
* set device as the same as origin model
* fix device

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
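
A small sketch of the device handling these commits describe, i.e. deriving the wrapper's device from the original model instead of assuming CPU (the helper name is hypothetical):

import torch

def infer_model_device(model: torch.nn.Module) -> torch.device:
    # read the device off the model's own parameters so an XPU model
    # stays reported as XPU and a CPU model as CPU
    return next(model.parameters()).device
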
jiqing-feng (Collaborator)

Hi @IlyasMoutawwakil, I have replied to and addressed your comments; please take a second round of review. Thanks!

* simplify forward and save pretrained since no jit support

* fix format

* rm warmup because no jit mode anymore

* simplify forward for causal lm model

* fix paged pkv forward

* disable use_cache when just running forward

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
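
A hedged sketch of the simplification these commits describe: with jit support gone, the wrapper can delegate straight to the eager transformers model and switch the KV cache off for plain forward calls (class and argument names assumed):

import torch

class CausalLMWrapperSketch:
    def __init__(self, model: torch.nn.Module):
        self.model = model

    @torch.no_grad()
    def forward(self, input_ids, attention_mask=None, **kwargs):
        # no jit trace or warmup anymore: call the eager model directly,
        # and disable the KV cache since it only helps during generation
        kwargs["use_cache"] = False
        return self.model(input_ids=input_ids, attention_mask=attention_mask, **kwargs)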

if isinstance(model, torch.jit.RecursiveScriptModule):
Collaborator:

TorchScript models will no longer be compatible, which is an important breaking change; we need to catch this and inform users.

Collaborator:

Also, we need to update the documentation, which currently reads:

For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated.
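
One way to catch the break, sketched under the assumption that loading still reaches an isinstance check like the one quoted above (log text illustrative):

import logging
import torch

logger = logging.getLogger(__name__)

def warn_if_torchscript(model):
    # TorchScript exports from earlier optimum-intel versions cannot run
    # through the paged-attention path; surface that to the user explicitly
    if isinstance(model, torch.jit.RecursiveScriptModule):
        logger.warning(
            "TorchScript (jit) models are no longer supported by IPEXModel; "
            "please re-export from the original eager checkpoint."
        )
    return model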


return cls(model, config=config, model_save_dir=model_save_dir, **kwargs)
task = cls.export_feature
model = TasksManager.get_model_from_task(
Collaborator:

why not use `cls.auto_model_class`?
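
The reviewer's alternative would presumably collapse to a single call, assuming `auto_model_class` is the transformers auto class already set on each IPEXModel subclass:

model = cls.auto_model_class.from_pretrained(model_id, **kwargs)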

def test_compare_with_and_without_past_key_values(self):
model_id = "echarlaix/tiny-random-gpt2-torchscript"
Collaborator:

Would be great to add a test to make sure an IPEX model that has been saved and then pushed to the Hub is still compatible with the latest optimum-intel version, and that the model can still be correctly loaded and used for inference. It could also make sense to check the resulting output (comparing it with a transformers model), wdyt?

jiqing-feng (Collaborator), Nov 27, 2024:

I am afraid we cannot support previously exported language models anymore, because those jit models use the IAKV (indirect access KV) cache, which follows different logic from paged attention. Supporting both would make the code extremely large and hard to maintain, and it would confuse users too. Besides, those jit models are also out of date compared to the current transformers version.

jiqing-feng (Collaborator), Nov 27, 2024:

The only way is to fall back from IPEXModelForCausalLM to TSModelForCausalLM when loading a jit model, but that requires `config.torchscript == True` so we can tell it is a TorchScript model. So you might need to update the "echarlaix/tiny-random-gpt2-torchscript" config parameter. I have updated the model's config, please check here.
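
A sketch of the fallback described in this comment, assuming `config.torchscript` is the discriminator; the control flow is illustrative, not the PR's exact code, and the two model classes are those discussed in this thread:

from transformers import AutoConfig

def load_causal_lm(model_id: str, **kwargs):
    config = AutoConfig.from_pretrained(model_id)
    if getattr(config, "torchscript", False):
        # legacy TorchScript export: fall back to the jit-based wrapper
        return TSModelForCausalLM.from_pretrained(model_id, **kwargs)
    return IPEXModelForCausalLM.from_pretrained(model_id, **kwargs)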

Collaborator:

What I'm suggesting here is to create a new model with this implementation, push it to the Hub, and have a test to make sure this model is still compatible / can be correctly loaded and that inference works as expected (pushing another model than "echarlaix/tiny-random-gpt2-torchscript").

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* nice code
* device type adjustment

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
* enable compile for non-generation tasks
* add no_grad in forward
* warmup compiled model
* disable compile not ready models
* set system level optimize for torch.compile
* fix typo
* add comments
* set torch minimum version for compiling

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
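
Roughly, the compile-and-warmup pattern the commit list above describes (a sketch, not the PR's code; the version gate and warmup inputs are assumptions):

import torch
from packaging import version

def maybe_compile(model: torch.nn.Module, example_inputs: dict) -> torch.nn.Module:
    # only compile on torch versions where the backend is considered ready
    if version.parse(torch.__version__) >= version.parse("2.4"):
        model = torch.compile(model)
        with torch.no_grad():
            model(**example_inputs)  # first call triggers compilation (warmup)
    return model
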
jiqing-feng (Collaborator)

Hi @echarlaix @IlyasMoutawwakil, please review the new changes. Thanks!

)
return TSModelForCausalLM.from_pretrained(model_id, **kwargs)
Collaborator:

An instance of TSModelForCausalLM will be created for every IPEXModel (even for encoder models), which doesn't really make sense to me. Also, it's not tested anywhere from what I can see; I'd prefer to raise an error here instead of keeping support that we're not sure works / is compatible with the previous integration.
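
The stricter behavior suggested here, sketched against the quoted isinstance check (the error message is illustrative):

import torch

if isinstance(model, torch.jit.RecursiveScriptModule):
    raise ValueError(
        "Loading TorchScript exports into IPEXModel is no longer supported; "
        "please re-export the model from the original transformers checkpoint."
    )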


return cls(model, config=config, model_save_dir=model_save_dir, **kwargs)
model = cls.auto_model_class.from_pretrained(model_id, **kwargs)
return cls(model, config=model.config, export=True, **kwargs)
Collaborator:

why would `export` be needed?

Suggested change
return cls(model, config=model.config, export=True, **kwargs)
return cls(model, config=model.config, **kwargs)


7 participants