Developer Tools May 18, 2026

llama.cpp Gains Multi-Token Prediction and Dual-GPU Tensor Parallelism

The llama.cpp ecosystem advanced on two fronts: the merge of multi-token prediction (MTP) layer support for speculative decoding, and a community fork that reports a 40% speedup on dual-GPU setups via tensor parallelism. Together they signal a shift toward higher-throughput local inference.

Why now

These developments collectively address latency bottlenecks in local LLM usage, allowing edge devices to approach cloud inference speeds for complex generation tasks.

Key signals

MTP layer support has landed in llama.cpp, enabling 1.5x to 1.8x faster token generation via speculative decoding.

A llama.cpp fork delivers a 40% inference speedup on dual-GPU setups using tensor parallelism and quantized KV caches.

The MTP implementation shows mixed results for prompt processing latency and may carry compatibility bugs.

The dual-GPU tensor parallelism fork currently lacks MoE support and requires stability fixes.
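For readers unfamiliar with speculative decoding, the idea behind the token-generation speedup can be sketched in a few lines. This is a toy illustration, not llama.cpp code: `target_next` and `draft_next` are hypothetical stand-ins for the full model and for a cheap predictor (the role that MTP layers would presumably play), and the draft length `k` is an arbitrary choice. The draft proposes several tokens, the target checks them, and the longest agreeing prefix is kept along with one token from the target itself.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    # Hypothetical full model: a deterministic toy rule over the context.
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft_next(ctx: List[int]) -> int:
    # Hypothetical cheap predictor: agrees with the target except when
    # the context length is a multiple of 4.
    guess = target_next(ctx)
    return (guess + 1) % 11 if len(ctx) % 4 == 0 else guess


def greedy_decode(prompt: List[int], n_new: int, model: NextToken) -> List[int]:
    # Baseline: one model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    prompt: List[int], n_new: int, target: NextToken, draft: NextToken, k: int = 3
) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target verifies the proposal. A real engine would score all
        #    k positions in one batched forward pass; this toy counts the
        #    whole verification as a single pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong guess
                break
            accepted.append(tok)
        else:
            # Every guess matched, so the same pass yields one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2]
    tokens, passes = speculative_decode(prompt, 20, target_next, draft_next)
    print("same output as greedy:", tokens == greedy_decode(prompt, 20, target_next))
    print("target passes:", passes, "for 20 new tokens")
```

Under greedy verification the output is identical to decoding with the target alone; the gain comes from needing fewer target passes than generated tokens. How much that helps in practice depends on how often the draft is right and on the cost of the draft itself, which may be one reason reported speedups vary.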
