Show HN: VLMs Can Respond Twice as Fast Without Losing Quality

2 points by trykhlieb 3 days ago

trykhlieb 1 day ago

Measured prefill acceleration at 16,373 context tokens:

2× GTX 1080 Ti: +60%

4× RTX 3090: +90%

5× RTX 5060 Ti: +90%

8× RTX 5060 Ti: +120%

10× P104-100 Pascal: +350%
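
For readers comparing these figures with the "twice as fast" in the title, here is a small sketch that converts the percentages into speedup multipliers. It assumes that "+N%" means prefill throughput rose by N% over the baseline run on the same hardware, which the numbers above do not state explicitly. The helper names are made up for illustration and are not part of llama.cpp.

```python
# Assumption: "+N%" means prefill throughput increased by N% over the baseline.
# Setup labels and percentages are copied from the list above.
gains_percent = {
    "2x 1080 Ti": 60,
    "4x 3090": 90,
    "5x 5060 Ti": 90,
    "8x 5060 Ti": 120,
    "10x P104-100": 350,
}


def speedup_factor(gain_percent):
    """Throughput multiplier implied by a percentage gain (e.g. +100% gives 2.0)."""
    return 1 + gain_percent / 100


def relative_prefill_time(gain_percent):
    """Fraction of the baseline prefill time left after the speedup."""
    return 1 / speedup_factor(gain_percent)


for setup, gain in gains_percent.items():
    factor = speedup_factor(gain)
    remaining = relative_prefill_time(gain)
    print(f"{setup}: {factor:.2f}x throughput, {remaining:.0%} of baseline prefill time")
```

Under that reading, +100% corresponds to exactly 2x, so the 8-GPU and 10-GPU setups exceed a doubling of prefill throughput while the smaller setups fall somewhat short of it.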

RFC / PoC discussion: https://github.com/ggml-org/llama.cpp/pull/24219

trykhlieb 3 days ago

https://github.com/ggml-org/llama.cpp/pull/24219