QWEN 3.6 27B Multi-Token Prediction model
MTP (multi-token prediction) is an AI training paradigm where large language models (LLMs) are trained to predict several future tokens concurrently at each position, rather than predicting only the single next token.

llama.cpp has added the feature and some report almost doubling their tokens per second. I found turning on MTP (multi-token prediction) did indeed improve performance and tokens per second.
You’ll definitely need to pull the latest, greatest llama.cpp code and re-compile it as the mtp support is very new. You also need to download a version of the Qwen 3.6 27B model with MTP support here. My current command line for a 5090 is not fully optimized, but here it is:
set LLAMA_CACHE=unsloth\Qwen3.6-27B-MTP-GGUF
llama-server.exe -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL -ngl 99 -c 200000 -fa on -np 1 --spec-type draft-mtp --spec-draft-n-max 3 -ctk q8_0 -ctv q8_0 -b 1024 -ub 256 -t 16 --no-mmap --temp 0.6 --top-p 0.95 --top-k 20 --presence_penalty 0.0 --repeat-penalty 1.0 --reasoning on --host 0.0.0.0 --port 8001 --metrics --slots --props --chat-template-kwargs "{\"preserve_thinking\":true}"
I’ve tried this with both Q4 and Q8, and it seems to definitely speed things up.


