You need >64 GB of DRAM to run local models fast.
You can run huge local models slowly with the weights stored on SSDs.
Nowadays many computers can take e.g. 2 PCIe 5.0 SSDs, which allow a read throughput of 20-30 GB/s, depending on the SSDs (or 1 PCIe 5.0 + 1 PCIe 4.0, for a throughput in the 15-20 GB/s range).
There are still a lot of improvements that can be made to inference back-ends like llama.cpp before they reach the inference speed limit determined by SSD throughput.
It seems possible to reach inference speeds ranging from a few seconds per token to a few tokens per second.
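A rough back-of-envelope sketch of that claim, assuming every token requires streaming the model's active weights from SSD once (the parameter counts and quantization below are hypothetical examples, not figures from the comment):

```python
def tokens_per_second(ssd_gb_per_s, active_params_billions, bytes_per_param):
    # Each generated token must read all active weights from SSD once,
    # so token rate is bounded by SSD bandwidth / active weight bytes.
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return ssd_gb_per_s * 1e9 / bytes_per_token

# Hypothetical cases on a 25 GB/s dual PCIe 5.0 SSD setup, 4-bit weights:
print(tokens_per_second(25, 70, 0.5))   # dense 70B model: ~0.7 tok/s
print(tokens_per_second(25, 20, 0.5))   # MoE, ~20B active: ~2.5 tok/s
```

Both results land inside the "few seconds per token to a few tokens per second" window.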
That may be too slow for chat, but it should be good enough for an AI coding assistant, especially if many tasks are batched so that they all progress during a single read pass over the SSD data.
You can do that, but you're going to have rather low throughput unless you have lots of PCIe lanes to attach storage to. That's going to require either a HEDT or some kind of compute cluster.
Batching inferences doesn't necessarily help that much, since as models get sparser the individual inferences share fewer experts. It does always help with the shared routing layers, of course.
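The sparsity point can be quantified with a toy model. Assuming each token independently and uniformly activates k of E experts (an idealization; real routing is correlated), the expected number of distinct experts a batch must fetch is E * (1 - (1 - k/E)^B):

```python
def expected_unique_experts(num_experts, active_per_token, batch):
    # Under uniform, independent routing, one token misses a given
    # expert with probability (1 - k/E); a batch of B tokens misses it
    # with probability (1 - k/E)**B.
    p_touched = 1 - (1 - active_per_token / num_experts) ** batch
    return num_experts * p_touched

# Denser MoE: 8 experts, 2 active -> a batch of 8 touches nearly all of them,
# so the reads are well shared.
print(expected_unique_experts(8, 2, 8))     # ~7.2 of 8 experts
# Sparser MoE: 256 experts, 8 active -> the same batch touches only ~57,
# so each request shares far fewer expert reads with the others.
print(expected_unique_experts(256, 8, 8))   # ~57 of 256 experts
```

The sparser model still fetches fewer total bytes, but the fraction of reads amortized across the batch drops, which is the reply's point.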