Show HN: VLMs Can Respond Twice as Fast Without Losing Quality (github.com)
2 points by trykhlieb 3 days ago
trykhlieb 1 day ago
Measured prefill acceleration at 16,373 context tokens:
2× RTX 1080 Ti: +60%
4× RTX 3090: +90%
5× RTX 5060 Ti: +90%
8× RTX 5060 Ti: +120%
10× P104-100 Pascal: +350%
RFC / PoC discussion: https://github.com/ggml-org/llama.cpp/pull/24219
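For readers who want to compare these figures with the "twice as fast" headline, here is a minimal sketch that converts each reported gain into a throughput multiplier and a share of prefill time saved. It assumes "+X%" means prefill throughput (tokens per second) rose by X% over the baseline, which the post does not state explicitly. The helper names `speedup_factor` and `time_saved_fraction` are made up for illustration and do not come from the PR.

```python
# Reported prefill gains at 16,373 context tokens, copied from the list above.
# Assumption: "+X%" is an increase in prefill throughput relative to baseline.
gains_percent = {
    "2× RTX 1080 Ti": 60,
    "4× RTX 3090": 90,
    "5× RTX 5060 Ti": 90,
    "8× RTX 5060 Ti": 120,
    "10× P104-100 Pascal": 350,
}


def speedup_factor(gain_percent: float) -> float:
    """Throughput multiplier: +60% becomes 1.6x."""
    return 1.0 + gain_percent / 100.0


def time_saved_fraction(gain_percent: float) -> float:
    """Fraction of prefill wall time saved for a fixed prompt length."""
    return 1.0 - 1.0 / speedup_factor(gain_percent)


if __name__ == "__main__":
    for setup, gain in gains_percent.items():
        factor = speedup_factor(gain)
        saved = time_saved_fraction(gain)
        print(f"{setup}: {factor:.2f}x throughput, {saved:.1%} less prefill time")
```

Under that reading, a gain has to exceed +100% for prefill to be literally twice as fast, so only the 8× RTX 5060 Ti and 10× P104-100 rows clear that bar.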