Highlights
This version is based on vLLM 0.11.2 and supports Intel® Gaudi® v1.22.2.
This release introduces the production-ready vLLM Hardware Plugin for Intel® Gaudi®, a community-driven integration layer based on the vLLM v1 architecture. It enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators. The plugin is an alternative to the vLLM fork, which reaches end of life with this release; the fork will be deprecated in v1.24.0 and remains functional only for legacy use cases. We strongly encourage all fork users to begin planning their migration to the plugin.
The plugin provides feature parity with the fork, including mature, production-ready implementations of Automatic Prefix Caching (APC) and the async scheduler. Two legacy features, multi-step scheduling and delayed sampling, have been discontinued, as their functionality is now covered by the async scheduler.
For more details on the plugin's implementation, see Plugin System.
To start using the plugin, follow the Basic Quick Start Guide and explore the rest of this documentation.
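For orientation, the sketch below shows what a minimal offline-inference run with the plugin can look like. It is an illustration only, assuming vLLM and the Gaudi plugin are already installed as described in the Basic Quick Start Guide; the model name is a placeholder, and the `enable_prefix_caching` and `async_scheduling` engine arguments come from upstream vLLM and may differ slightly between versions.

```python
# Minimal offline-inference sketch (illustrative, not a definitive recipe).
# Assumes vLLM and the Intel Gaudi plugin are installed per the Quick Start Guide.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; use any supported model
    enable_prefix_caching=True,                # Automatic Prefix Caching (APC)
    async_scheduling=True,                     # async scheduler (supersedes multi-step scheduling / delayed sampling)
)

outputs = llm.generate(
    ["Summarize what an AI accelerator does in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Gaudi-specific tuning knobs such as bucketing and warmup settings are described in the documentation referenced above.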
What's Changed
- add commit-id to distinguish image and container for each PR by @xuechendi in #85
- [Upstream fix] Fix after #23041 from upstream by @adobrzyn in #87
- Change warmup scenario for execute dummy scenario by @adobrzyn in #54
- remove enable_prompt_adapter in test to fix by @xuechendi in #91
- Fix jenkins - remove failed test and fix later / update API by @xuechendi in #79
- [Upstream fix] Fix after #23262 from upstream - Make new_block_ids None if empty by @adobrzyn in #93
- Enable multimodal support + qwen2.5-vl by @attafosu in #92
- Fix upstream PR 22668 that added additional arg to is_kv_cache_dtype_supported by @mswiniarsk in #96
- Port defragmentation support from vllm-fork PR #1568 by @madamczyk-intel in #94
- [Upstream fix] Fix after #22711 by @adobrzyn in #102
- Reduce number of compilations when dynamic shapes is used by @anko-intel in #90
- Warmup fix - for non contiguous PA runs, don't take more context blocks than possible by @adobrzyn in #97
- [UT] Fix test args for bucketing tests by @adobrzyn in #105
- [SW-236088] Add sampler unit tests by @kamil-kaczor in #99
- Avoid copying dynamic slice of sampling_metadata tensors by @mswiniarsk in #88
- Fix mm encoder inputs for mix-modalities in input batch by @attafosu in #103
- Fix decode profiling by @kamil-kaczor in #106
- fix upstream PR 23749 by @xuechendi in #108
- Fix the failure introduced by upstream 22685 by @xuechendi in #110
- fix an argument issue introduced by recent vllm upstream and add CI by @xuechendi in #111
- Port G2 scaling convert from vllm-fork #1505 by @xuechendi in #112
- Enable Spec Decode for HPU v1 - Part1(basic workflow + eagle) by @xuechendi in #81
- fix qwen3-30B-A3B-FP8 - The number of dims cannot be packed into CompleteArgumentSpec:65535 by @xuechendi in #113
- [FIX HOURLY Failure] transformer 4.56.0 is not compatible with INC by @xuechendi in #117
- Remove test_load_model_weights_inplace by @kzawora-intel in #48
- [BUG fix]Fix spec_decode introduced long graph compilation issue by @xuechendi in #127
- [Bugfix] Warmup with continuous PA by @adobrzyn in #126
- Disable warmup for defragmentator by @mswiniarsk in #132
- Merging vllm docker implementation to vllm-gaudi (v1) by @PatrykWo in #125
- Enable embedding feature by @slokesha in #120
- Revert "Enable embedding feature" by @adobrzyn in #140
- [Bugfix] Remove reqs without logits - merge prefill case by @adobrzyn in #137
- Update CODEOWNERS by @mgawarkiewicz-intel in #144
- Fix warmup break when max decode bucket bs > max num seq by @taran2210 in #107
- Add tests for custom op registration by @Kacper-Pietkun in #109
- Enable embedding feature by @slokesha in #141
- Update CODEOWNERS file by @vivekgoe in #143
- [Merged Prefill] Warmup for merged prefill by @adobrzyn in #104
- Experimental support for Unified Attention by @madamczyk-intel in #133
- Introducing sampler warmup as separate warmup step by @ksmusz in #131
- Add support for LoRA by @vivekgoe in #51
- Add data parallel support by @wuxun-zhang in #80
- Increase allowed line length to 120 + reformat accordingly by @kzawora-intel in #130
- [FIX HOURLY]Remove DP test from Hourly by @xuechendi in #147
- Update CODEOWNERS by @afierka-intel in #135
- Enable sampler compilation by @Kacper-Pietkun in #95
- Add DP into CI by @wuxun-zhang in #146
- Add TESTOWNERS by @kzawora-intel in #153
- Patch FusedMoE forward to avoid dynamo recompilations by @kdamaszk in #158
- [CI] Jenkins false positive bugfix by @kzawora-intel in #159
- Fix dummy decode input for DP by @wuxun-zhang in #151
- [Quick fix for CI]fix CI break on Qwen2.5-vl and update docker image by @xuechendi in #161
- initial port for nixl by @hsubramony in #100
- update nixl version in requirements by @hsubramony in #163
- Re-quantize FP8 model with INC by @yiliu30 in #114
- [Feature][SpecDecode][Part2] Eagle3,MTP enabling, accept_rate improvement by @xuechendi in #142
- [BUGFIX] qwen2.5-vl failed after PR24444, provide a temp solution by @xuechendi in #162
- Reenabling llama4 models by @afierka-intel in #128
- Allow building vllm-plugin docker with upstream torch by @mmuszynskihabana in #155
- [HOURLY FIX] For upstream PR-24548 changes by @xuechendi in #166
- [BUGFIX] warmup failed after PR104, propose fix in this PR by @xuechendi in #148
- TESTOWNERS update by @adobrzyn in #165
- [TEMP-WA] Skip Qwen3-30B-A3B in tests - Bug introduced in upstream #24772 by @attafosu in #168
- [CI FIX]Fix issue introduced by upstream PR #23974 by @xuechendi in #172
- [CI FIX] Fix issue introduced by upstream #24745 by @xuechendi in #174
- [BUG][Disable CI] Disable DP test due to recent upstream change that caused HPU DP to fail by @xuechendi in #177
- Fully overlap model execution by @tianmu-li in #134
- Added fix for VLLM_WEIGHT_LOAD_FORCE_SYNC by @tianmu-li in #173
- Introduce VLLM_SCALE_ADJUSTMENT by @xinyu-intel in #164
- Support Ray distributed executor by @xinyu-intel in #169
- Bug fix: hpu mrope by @attafosu in #167
- Fix in docker compose functionality for v1-plugin by @PatrykWo in #185
- CI fix by @adobrzyn in #186
- Fix dp sync after upstream change #24105 by @wuxun-zhang in #179
- Cache token ids on device for async_scheduling by @tianmu-li in #184
- [BUGFIX] Fix hourly after PR#22772 by @adobrzyn in #197
- [SW-240630] Qwen3-30B-MoE: Flatten post-attn seqs and restore model output shape by @attafosu in #176
- fix block bucket size for DP+contiguous PA by @wuxun-zhang in #171
- Fix swap in defragmentator by @kamil-kaczor in #182
- Unified mixed batches by @madamczyk-intel in #196
- [SW-236002] Support compressed int4 w4a16 format by @skavulya in #193
- update HOURLY docker image and move DP to separate test run by @xuechendi in #209
- Move hourly to aicf-gaudi2-07 by @xuechendi in #211
- Create .readthedocs.yaml by @kzawora-intel in #219
- [BUGFIX] Fix after PR25332 & 25321 & 25366 by @adobrzyn in #215
- Fix DP dummy run crash for P/D by @wuxun-zhang in #194
- Enable interleaved sliding window for gemma3 by @jiminha in #150
- Update the script fix for gemma-3-4b test by @jiminha in #225
- V0.10.2 docker updates / benchmark serving section (#191) - cherry-pick by @PatrykWo in #200
- use vllm intree API to enable synced_model_load, #25126 by @xuechendi in #208
- [FIX][Upstream caused crash] Fix crash caused by upstream PR 25184 by @xuechendi in #238
- Enable p2d2 for nixl by @hsubramony in #237
- [FIX][upstream crash] Fix due to upstream change 25510 by @xuechendi in #241
- Remove sync point from _prepare_sampling by @kdamaszk in #204
- Align to lora_manager changes in upstream by @vivekgoe in #244
- Update test owners: iboiko-habana, jkaniecki by @iboiko-habana in #247
- Enable device_to_device nixl_connector support by @xuechendi in #240
- Add fused_experts to HPUFp8MoEMethod to fix Deepseek by @kdamaszk in #228
- add hf_token for CI by @xuechendi in #248
- another PR for HF_TOKEN by @xuechendi in #251
- update CI file to use my PR code by @xuechendi in #254
- Fix crash due to PR 25541 by @xuechendi in #252
- Add HPUMultiHeadAttention with FusedSDPA by @jiminha in #249
- skip dp padding sync in set_forward_context by @wuxun-zhang in #226
- [SW-236002] Enable group indexing for compressed w4a16 format by @skavulya in #243
- Enable group indexing gptq by @jmamzax in #154
- Adding dynamic swap number and defragmenter warmup by @ksmusz in #183
- fix crash introduced by upstream PR 25613 and PR23991 by @xuechendi in #259
- Fix crash introduced by 25489 - cause PD fail by @xuechendi in #260
- [HOURLY RUN] update the scripts and action to run in a separate job by @xuechendi in #261
- [SW-239237] Add last good commit based on PR257 by @xuechendi in #262
- [GITHUB ACTION] update pre-merge to block CI for not ready PR by @xuechendi in #266
- [GITHUB ACTION] quick fix for last update to pre-merge by @xuechendi in #267
- remove DCO check by @xuechendi in #269
- [GITHUB ACTION] [PRE_COMMIT] pre-check before start actual CI by @xuechendi in #270
- [upstream crash] fix spec decode due to upstream 24986 by @xuechendi in #265
- [GITHUB ACTION][HOURLY] add force push otherwise it failed to update by @xuechendi in #268
- [GITHUB ACTION][PRE_MERGE] last refinement to enable DCO check by @xuechendi in #271
- [GITHUB ACTION][PRE_MERGE] post comments if a PR fails to meet DCO or mergeability requirements by @xuechendi in #273
- [Unified Attention] Bucketing and Warmup for Unified Attention by @adobrzyn in #157
- Adding prompt context flags for linear warmup by @iboiko-habana in #217
- [FIX_FOR_VLLM_LATEST]{GITHUB ACTION}[PRE-MERGE] switch to last good commit or main based on label by @xuechendi in #279
- [FIX_FOR_VLLM_LATEST] Fix hourly by skipping embedding due to upstream 25738 by @xuechendi in #280
- [GITHUB ACTION] Add update stable commit action by @xuechendi in #282
- Update LoRA tests by @vivekgoe in #255
- [test] Add yaml files for fp8 tests by @ulivne in #53
- Fix for negative logits by @pawel-olejniczak in #160
- Enable modification of prompt BS by @ksmusz in #258
- Fix DP dummy run cfg by @wuxun-zhang in #284
- [Fix Hourly] install UCX from source instead using builtin wheel from nixl by @xuechendi in #289
- [GITHUB ACTION] remove DCO block by @xuechendi in #290
- Fix deepseek FP8 weight creation due to upstream vllm change by @skavulya in #281
- Support sequence parallel MOE after upstream #24982 by @wuxun-zhang in #285
- Enable H2d(runtime scale patching) for Torch compile by default by @jczaja in #235
- [FIX_FOR_VLLM_LATEST] fix issue introduced by PR25896 and comment out still failing tests by @xuechendi in #292
- [NIXL] Fix crash introduced by upstream PR #25902 by @xuechendi in #293
- [MLA][Deepseek] Bring back deepseek after change from PR25896 by @xuechendi in #294
- [FIX_FOR_VLLM_LATEST] Fix for crash introduced by upstream PR 19330 by @xuechendi in #295
- Fix Embedding hang by @slokesha in #291
- Fix after #16229, mm by @adobrzyn in #286
- Add assert for empty buckets by @iboiko-habana in #236
- Update CODEOWNERS by @michalkuligowski in #297
- Use type strings to be compatible with python 3.10 by @madamczyk-intel in #214
- Fixing padded iterators in _align_and_pad by @ksmusz in #300
- [CI][NIXL]cache/reuse pre-build wheel to skip always re-build for nixl by @xuechendi in #304
- [NIXL][Dockerfile] add docker file for latest vllm_gaudi + nixl for llmd by @xuechendi in #307
- [GLM-4.5] [BugFix] make GLM-4.5 working by adding model to flatten_input list by @xuechendi in #306
- [BugFix][Deepseek][INC] fix duplicate submodules for deepseek INC quantization by @skavulya in #305
- Update CODEOWNERS by @iboiko-habana in #303
- Add restriction on usage of VLLM_DECODE_BLOCK_BUCKET_MAX > max_blocks by @iboiko-habana in #302
- [README]Add NIXL installation guide in README by @xuechendi in #308
- [FIX_FOR_VLLM_LATEST] fix issue brought by upstream PR #25893 by @xuechendi in #310
- [FIX_FOR_VLLM_LATEST] update hpu_model_runner according to #25676 by @xuechendi in #311
- [GITHUB ACTION][BO_ACTION] New action for release branch out by @xuechendi in #312
- [GITHUB ACTION][BO] update create_branch_action by @xuechendi in #315
- [GITHUB ACTION]only trigger tests for certain folder and add skip-gaudi-tests by @xuechendi in #325
- [GITHUB ACTION] Quick fix on pre-merge enabling files change compare on fork repo by @xuechendi in #328
- Fix for missing graphed_buckets attr while bucketing is off by @ksmusz in #321
- RUNTIME SCALE PATCHING info by @jczaja in #317
- Fix calculating used blocks by @mswiniarsk in #318
- Fix defragmenter compilation by @kzawora-intel in #334
- Add Plugin V1 specific recipe changes by @nngokhale in #187
- [SKIP CI][DP] disable DP test due to hourly failure by @xuechendi in #339
- Update long context README by @iboiko-habana in #256
- Fix long-context scenarios - torch.cat error by @afierka-intel in #346
- Remove changed-files CI step by @kzawora-intel in #351
- [Bugfix] Fix bucketing of query + num_blocks neighbor expansion by @kzawora-intel in #350
- [Docs] README update - bucketing, warmup, defragmenter and sampler warmup by @ksmusz in #353
- [Bugfix] Fix decode bucket validity condition by @kzawora-intel in #355
- [Bugfix] Fix bucketing UT by @kzawora-intel in #367
- [GITHUB ACTION] Remove commits comparison so we can rerun by @xuechendi in #373
- [CI] Set seeds for e2e tests by @kzawora-intel in #368
- Fix dp padding after upstream change #25768 by @wuxun-zhang in #362
- Create LICENSE by @kzawora-intel in #379
- Change to starting page and installation by @PatrykWo in #371
- [FIX_FOR_VLLM_LATEST] Fix upstream crash introduced by #24486 + #24926 + #25103 + #25807 by @iboiko-habana in #366
- Enable Parallel Compilation feature for compile mode by default by @jwieczorekhabana in #370
- [SW-239226] Adjust junit xml filenames for retry mechanism by @tlipinski1337 in #382
- docs installation build formating fix by @PatrykWo in #384
- Correct htexp._data_ptr utility by @xinyu-intel in #387
- ray: pin ray to <2.49.0 by @xinyu-intel in #386
- [FIX_FOR_VLLM_LATEST] Fix #24172, [Refactor]: Use M-RoPE interface directly while defining model class instead of maintaining model specific M-RoPE implementation in mrope.py by @iboiko-habana in #388
- [Bugfix] Fix min linear decode value by @adobrzyn in #391
- [SW-241908] Omit all prompt buckets that exceed max_num_batched_tokens by @skavulya in #331
- Experimental - fatal error from 0.12 release by @adobrzyn in #398
- Port: [Docs] CI failures chapter (#276) by @adobrzyn in #389
- Fix issue with async_scheduling when dealing with chunked input by @tianmu-li in #360
- nixl: support mla kvcache transfer by @xinyu-intel in #403
- Unified Attention Accuracy Bugfixes by @kzawora-intel in #393
- Minor optimization for bucketing calc by @michalkuligowski in #395
- Fix linear assert by @kamil-kaczor in #401
- Environment logs - disable prefix caching with conti pa + add vllm branch+commit value to logs by @adobrzyn in #402
- [FIX_FOR_VLLM_LATEST] Upstream vllm fixes for #26355 and #26737 by @iboiko-habana in #407
- Cherrypick cd docker fixes/commits from v0.10.2 to main v0.11.0 by @nngokhale in #341
- Unit test for prefix caching in Gaudi plugin by @iirzynska in #349
- Add missing prompt bucket to warmup, when max_ctx is 0 by @iboiko-habana in #352
- Unified attention improvements by @adobrzyn in #363
- [NIXL][BUGFIX][Gaudi2Gaudi accuracy] use 4d kv_cache for nixl_connector KV register and update host_buffer accordingly by @xuechendi in #411
- Multi-image generation CI tests by @MohitIntel in #377
- [FIX_FOR_VLLM_LATEST] Fix for Separate out vllm.utils.collections #26990 by @iboiko-habana in #413
- Add fp8 calibration procedure by @afierka-intel in #309
- [FIX_FOR_VLLM_LATEST] Fix for #27022 by @adobrzyn in #418
- [CI] unified attn fails too easily; add small RTOL by @xuechendi in #422
- Update supported_features.md by @mgawarkiewicz-intel in #180
- [FIX_FOR_VLLM_LATEST] Fixes for upstream #26908 and #27143 and #27169 by @iboiko-habana in #427
- [NIXL]Enable prefill TP < Decode TP with host_buffer by @xuechendi in #421
- Fix typo in installation.md: correct script name to install_nixl.py by @yafshar in #385
- [SW-242466] Update not_over_max_model_len filter to fix warmup perf regression by @skavulya in #424
- Docs update post v0.11 by @PatrykWo in #428
- [FIX_FOR_VLLM_LATEST] Fix for #26440 by @iboiko-habana in #442
- [main] Defragmenter warmup accuracy workaround by @kzawora-intel in #436
- Update docs: Quickstart - Executing inference by @pawel-olejniczak in #410
- [Security] Update requirements.txt (#443) by @afierka-intel in #445
- [GITHUB ACTION] Always run same job to same node by @xuechendi in #450
- reuse DP allgather tensor across layers by @wuxun-zhang in #415
- Support DP for unified attention by @wuxun-zhang in #242
- [Linear warmup] Default values optimization by @adobrzyn in #426
- Buckets from file - alpha version by @adobrzyn in #375
- Fix math log2 exponential bucket error if max_model_len <= block_size by @skavulya in #451
- Fix requirements filtering in HPU Dockerfiles by @jakub-sochacki in #419
- Fix defragmentation for MLA-based models by @kzawora-intel in #470
- [FIX_FOR_VLLM_LATEST] Fix for is_pin_memory_available import and skip of run_spec_decode_ngram_test due to #26060 by @iboiko-habana in #471
- Update KVConnectorOutput for P/D when async scheduling is turned on by @wuxun-zhang in #468
- Applying of [V1][spec decode] return logprobs for spec decoding #26060 by @iboiko-habana in #476
- Gemma3 Multimodal optimization by @jiminha in #404
- Fix prompt/decode profiler by @kamil-kaczor in #472
- New docs part3 updates by @PatrykWo in #456
- fix dummy run config for P/D prefiller instance by @wuxun-zhang in #467
- Add granite calibration test to all tests function by @ulivne in #453
- [FIX_FOR_VLLM_LATEST] Fix for Clean up utils #27552 by @iboiko-habana in #481
- Added info if H2d (runtime scale patching) is set by @jczaja in #480
- Update requirements.txt by @afierka-intel in #487
- Update the duplicate module list for deepseek r1 by @yiliu30 in #478
- [Security] Remove structurally dead code (#444) by @afierka-intel in #490
- [Security] Fix/remove logically dead code (#448) by @afierka-intel in #491
- [Security] Remove unused triton script with null-like value issue (#447) by @afierka-intel in #492
- [FIX_FOR_VLLM_LATEST] Fix for Make LayerBlockType a Literal instead of Enum #27658 by @iboiko-habana in #499
- rhel docker fix to main by @PatrykWo in #489
- Fix profiler using wrong bucket by @kamil-kaczor in #497
- Add docs: Plugin System by @pawel-olejniczak in #446
- HPU Dockerfile for PyTorch CI HUD by @jakub-sochacki in #501
- Add unified attention Granite-8b test by @kzawora-intel in #277
- Unified Attention - High Level Profiler Integration by @kzawora-intel in #399
- Use query in linear flags - seq as fallback option by @adobrzyn in #396
- [SW-243111] Add correctors for decode buckets by @jbyczkow in #504
- Add HABANA_VISIBLE_DEVICES env to Dockerfile.hpu used for PyTorch CI HUD by @jakub-sochacki in #506
- Update troubleshooting.md by @michalkuligowski in #416
- [FIX_FOR_VLLM_LATEST] Hourly fix after: [BugFix] Handle unscheduled requests properly when async scheduling #27756 by @adobrzyn in #507
- Update TESTOWNERS by @jbyczkow in #494
- MLA: reshape non-contiguous tensor by @xinyu-intel in #505
- DP: allreduce on the host by @xinyu-intel in #498
- Simplify requirements by @pawel-olejniczak in #458
- Remove VLLM_DELAYED_SAMPLING by @xwu-intel in #433
- Removing data from a deleted column by @PatrykWo in #514
- Add Unified Attention docs by @madamczyk-intel in #275
- Unified Attention - batch preparation rewrite by @kzawora-intel in #400
- vllm matrix table by @PatrykWo in #517
- Documentation updates - part 1 by @mhelf-intel in #493
- Fix preemption handling by @kzawora-intel in #524
- Removing leftovers fork from plugin by @PatrykWo in #525
- [Bucketing] Prompt with 0 min and max context blocks by @adobrzyn in #534
- Port: add VLLM_DISABLE_MARK_SCALES_AS_CONST by @zhejiangxiaomai in #522
- Add graph compilation tracking to high level profiler by @kzawora-intel in #50
- Update finished KV transfer state after every step by @wuxun-zhang in #532
- [GITHUB ACTION][NIXL]update install_nixl.py script by @xuechendi in #543
- Doc updates: introduction and developer guides by @mhelf-intel in #529
- FP8 documentation review by @mhelf-intel in #518
- Documentation: Troubleshooting and FAQ updates and the updated documentation structure by @mhelf-intel in #548
- [New Feature] Add cpu core pinning to vllm-server to improve performance. by @louie-tsai in #502
- Fix missing non-causal buckets by @kamil-kaczor in #540
- [Docs] Unified attn style update by @adobrzyn in #533
- Enable FP8 with unified attention by @afierka-intel in #516
- Fix unified preemption no attr found by @kamil-kaczor in #528
- Add tests for custom operator implementation correctness by @Kacper-Pietkun in #457
- [SW-242523] Support per-tensor FP8 scaling by @skavulya in #483
- Fix typo in bucketing_file.txt by @mgonchar in #553
- [Docs] Readme for bucketing from file + env var added by @adobrzyn in #545
- [FIX_FOR_VLLM_LATEST] Fix upstream execute_model crash by @iboiko-habana in #546
- Fix for compiled_methods by @Kacper-Pietkun in #559
- Skip HPUGraph capture when exceeding max_cudagraph_capture_size by @zhejiangxiaomai in #551
- [FIX_FOR_VLLM_LATEST] Rename get_input_embeddings and get_multimodal_embeddings by @pawel-olejniczak in #561
- Replace the deprecated logo by @mhelf-intel in #564
- Automatically adjust VLLM_DECODE_BLOCK_BUCKET_MIN if it exceeds max_blocks by @dsocek in #432
- v0 cleanup by @michalkuligowski in #563
- [FIX_FOR_VLLM_LATEST] fix pr28534 by @iboiko-habana in #568
- Fix for PR546, adding float32 and float16 by @iboiko-habana in #569
- UX fix: hide warmup logs by @adobrzyn in #539
- Final documentation improvements and broken link fixes by @mhelf-intel in #558
- Readme updates and release notes for 0.10.2 by @mhelf-intel in #565
- [FIX_FOR_VLLM_LATEST] Fix crash after the sampled_token_ids type change by @pawel-olejniczak in #575
- Nixl deployment fixes by @PatrykWo in #573
- Specify output tensor in matmul_qk - with version difference by @adobrzyn in #571
- Update hpu_model_runner.py by @afierka-intel in #582
- Fix for PR24248 by @iboiko-habana in #578
- Edit docker file to resolve conflicts (issue 243959) by @PatrykWo in #587
- Fix async scheduling + request preemption by @tianmu-li in #589
- [PD][NIXL]Fix bug after upstream adding virtual block_size support by @xuechendi in #590
- Port: Fix prefix caching automatic off with conti pa (#583) by @adobrzyn in #586
- Edit CODEOWNERS for 0.11.2 BO by @PatrykWo in #604
- Add support for chunked attention (#597) by @jkaniecki in #612
- Fix reverse inull security issue (#588) by @afierka-intel in #611
- cherry pick fixes for llama4 by @Luca-Calabria in #637
- Cherry-pick release docker cmdline fixes, WA and long context support… by @nngokhale in #625
- Add missing quantization files (#639) by @afierka-intel in #651
- Doc changes from main to 0.11.2 by @mhelf-intel in #655
- 0.11.2 matrix update by @PatrykWo in #657
- Documentation updates for 0.11.2 by @mhelf-intel in #666
- Docs: broken links fixes to 0.11.2 by @mhelf-intel in #669
- 0.11.2 plugin release updates by @PatrykWo in #667
New Contributors
- @mswiniarsk made their first contribution in #96
- @anko-intel made their first contribution in #90
- @mgawarkiewicz-intel made their first contribution in #144
- @taran2210 made their first contribution in #107
- @wuxun-zhang made their first contribution in #80
- @hsubramony made their first contribution in #100
- @jiminha made their first contribution in #150
- @jwieczorekhabana made their first contribution in #370
- @tlipinski1337 made their first contribution in #382
- @iirzynska made their first contribution in #349
- @MohitIntel made their first contribution in #377
- @yafshar made their first contribution in #385
- @jakub-sochacki made their first contribution in #419
- @xwu-intel made their first contribution in #433
- @zhejiangxiaomai made their first contribution in #522
- @louie-tsai made their first contribution in #502
- @mgonchar made their first contribution in #553
- @dsocek made their first contribution in #432
Full Changelog: v0.10.1...v0.11.2