Description
Dear authors,
I tried the script from your README, without changing a single line, to convert llama2-7b-hf. The conversion process itself ran smoothly without errors, but the converted model did not work properly.
I ran into the following errors:
- If I convert without `--deepseek-style`, vLLM and SGLang online serving do not recognize `LlamaMLAForCausalLM`.
- If I convert with `--deepseek-style`, vLLM and SGLang online serving start successfully, but they respond to requests with nonsense characters.
- The vLLM offline engine hangs, and the SGLang offline engine reports out of memory (I tried 8×A100 with tp=8, still OOM).
I tried deploying with SGLang and vLLM online serving, and the server output random characters (a minimal way to check this outside of the serving engines is sketched below).
My environment: CUDA 12.5 on A100, torch 2.4, Python 3.10, vllm 0.8.2, sglang 0.4.6.
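To help narrow down whether the problem is in the conversion itself or in the serving engines, here is a minimal sketch (not part of my original run; the checkpoint path is taken from the conversion command below, and `trust_remote_code=True` is assumed in case the checkpoint ships custom MLA modeling code) of loading the converted checkpoint directly with transformers and generating a few tokens:

```python
# Minimal sanity check: load the converted checkpoint with plain transformers
# and greedily generate a few tokens, to see whether the output is already
# garbled outside of vLLM / SGLang. Paths and settings here are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./outputs/llama2-7b-deepseek"  # --save-path from the conversion command

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,   # matches --dtype bf16 used during conversion
    device_map="auto",
    trust_remote_code=True,       # in case the checkpoint ships custom MLA modeling code
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```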
Conversion command:
```bash
python transmla/converter.py \
    --model-path Llama-2-7b-hf \
    --save-path ./outputs/llama2-7b-deepseek \
    --dtype bf16 \
    --device auto \
    --cal-dataset alpaca \
    --cal-nsamples 128 \
    --cal-max-seqlen 256 \
    --cal-batch-size 8 \
    --ppl-eval-batch-size 4 \
    --freqfold auto \
    --collapse auto \
    --qk-mqa-dim 64 \
    --q-lora-rank 512 \
    --kv-lora-rank 512 \
    --deepseek-style
```
Also tried:

```bash
bash scripts/convert/qwen2.5-7B-Instruct.sh
```
Conversion log:

Launch server command:

```bash
vllm serve Llama-2-7b-hf-deepseek
python -m sglang.launch_server --model-path Llama-2-7b-hf-deepseek
```
Request:
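(The exact request body isn't reproduced here; the following is a representative example against the OpenAI-compatible completions endpoint that both servers expose. Port 8000 is vLLM's default, while SGLang defaults to 30000; the model name and prompt are placeholders.)

```python
# Representative request (not the original one) against the OpenAI-compatible
# completions endpoint; the model name and prompt are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # 8000 = vLLM default; SGLang defaults to 30000
    json={
        "model": "Llama-2-7b-hf-deepseek",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(resp.json()["choices"][0]["text"])
```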

