In January 2024 there was a similar post ( https://news.ycombinator.com/item?id=38985152 ) wherein the author selected dual NVidia 4060 Ti's for an at-home-LLM-with-voice-control -- because they were the cheapest cost per GB of well-supported VRAM at the time.
(They probably still are, or at least pretty close to it.)
That informed my decision shortly after, when I built something similar - that video card model was widely panned by gamers (or more accurately, gamer 'influencers'), but it was an excellent choice if you wanted 16GB of VRAM with relatively low power draw (150W peak).
TFA doesn't say where they are, or what currency they're using (which implies the hubris of a North American) - at which point that pricing for a second hand, smaller-capacity, higher-power-drawing 4070 just seems weird.
Appreciate the 'on a budget' aspect, it just seems like an objectively worse path, as upgrades are going to require replacement, rather than augment.
As per other comments here, 32 / 12 is going to be really limiting. Yes - lower parameter / smaller-quant models are becoming more capable, but at the same time we're seeing increasing interest in larger context for these at home use cases, and that chews up memory real fast.
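To put a rough number on how fast context chews up memory, here is a back-of-the-envelope sketch of KV-cache size. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is a made-up example, not any specific model, and real runtimes add their own overhead on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV-cache size: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB of KV cache")
```

With those assumed dimensions the cache grows linearly with context, so a long context can eat a 12GB card on its own before any weights are loaded.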
Love the attention to detail, I can tell this was a lot of work to put together and I hope it helps people new to PC building.
I will note though, 12GB of VRAM and 32GB of system RAM is a ceiling you’re going to hit pretty quickly if you’re into messing with LLMs. There’s basically no way to do a better job at the budget you’re working with though.
One thing I hear about a lot is people using things like RunPod to briefly get access to powerful GPUs/servers when they need one. If you spend $2/hr you can get access to an H100. If you have a budget of $1300 that could get you about 600 hours of compute time, which (unless you’re doing training runs) should last you several months.
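The arithmetic behind that estimate, as a quick sketch. The $2/hr rate and $1300 budget are the figures from the paragraph above; the 4 hours of use per day is purely an assumption to turn hours into months.

```python
def rental_hours(budget_usd, rate_usd_per_hour):
    """Hours of rented GPU time a fixed budget buys at a flat hourly rate."""
    return budget_usd / rate_usd_per_hour


hours = rental_hours(1300, 2.0)
print(f"{hours:.0f} hours of compute")

# Assumed usage pattern (not from the post): 4 hours a day, 30-day months.
HOURS_PER_DAY = 4
print(f"about {hours / HOURS_PER_DAY / 30:.1f} months at {HOURS_PER_DAY} h/day")
```

This ignores storage fees and idle time with the instance left running, both of which would shorten the runway.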
In several months' time the specs required to run good models will be different again in ways that are hard to predict, so this approach can help save on the heartbreak of buying an RTX 5090 only to find that even that doesn’t help much with LLM inference and we’re all gonna need the cheaper-but-more-VRAM Intel Arc B60s.
The RTX market is particularly irritating right now, even second-hand 4090s are still going for MSRP if you can find them at all.
Most of the recommendations for this budget AI system are on point - the only thing I'd recommend is more RAM. 32GB is not a lot - particularly if you start to load larger models through formats such as GGUF and want to take advantage of system ram to split the layers at the cost of inference speed. I'd recommend at least 2 x 32GB or even 4 x 32GB if you can swing it budget-wise.
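To illustrate why 32GB gets tight once layers spill into system RAM, here is a rough sketch of a GPU/CPU layer split. All sizes are hypothetical (an 80-layer model at 0.5 GiB per layer, 2 GiB of VRAM held back for the KV cache and buffers); `split_layers` is a helper made up for this example, not part of any library.

```python
GIB = 2**30


def split_layers(n_layers, layer_bytes, vram_bytes, reserve_bytes):
    """Return (gpu_layers, cpu_layers) for a model split across VRAM and system RAM."""
    usable = max(vram_bytes - reserve_bytes, 0)
    gpu_layers = min(n_layers, usable // layer_bytes)
    return gpu_layers, n_layers - gpu_layers


# Hypothetical 40 GiB model: 80 layers of 0.5 GiB each, on a 12 GiB card.
layer_bytes = GIB // 2
gpu, cpu = split_layers(80, layer_bytes, 12 * GIB, 2 * GIB)
print(f"{gpu} layers on the GPU, {cpu} layers in system RAM")
print(f"system RAM needed for weights: {cpu * layer_bytes / GIB:.0f} GiB")
```

Under these assumptions the CPU side alone needs about 30 GiB, which leaves almost nothing of a 32GB machine for the OS. In llama.cpp the GPU share is what the `--n-gpu-layers` option controls.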
Author mentioned using Claude for recommendations, but another great resource for building machines is PC Part Picker. They'll even show warnings if you try pairing incompatible parts or try to use a PSU that won't supply the minimum recommended power.
https://pcpartpicker.com
If the author is reading this I'll point out that the cuda toolkit you find in the repositories is generally older. You can find the latest versions straight from Nvidia: https://developer.nvidia.com/cuda-downloads?target_os=Linux&...
The caveat is that sometimes a library might be expecting an older version of cuda.
The vram on the GPU does make a difference, so it would at some point be worth looking at another GPU or increasing your system ram if you start running into limits.
However I wouldn't worry too much right away, it's more important to get started and get an understanding of how these local LLMs operate and take advantage of the optimisations that the community is making to make it more accessible. Not everyone has a 5090, and if LLMs remain in the realms of high end hardware, it's not worth the time.
The trouble with these things is that “on a budget” doesn’t deliver much when most interesting and truly useful models are creeping beyond the 16GB VRAM limit and/or require a lot of wattage. Even a Mac mini with enough RAM is starting to look like an expensive proposition, and the AMD Strix Halo APUs (the SKUs that matter, like the Framework Desktop at 128GB) are around $2K.
As someone who built a period-equivalent rig (with a 12GB 3060 and 128GB RAM) a few years ago, I am not overly optimistic that local models will keep being a cheap alternative (never mind the geopolitics). And yeah, there are very cheap ways to run inference, but they become pointless - I can run Qwen and Phi4 locally on an ARM chip like the RK3588, but it is still dog slow.
Whenever I get to a section that was clearly autogenerated by an LLM I lose interest in the entire article. Suddenly the entire thing is suspect and I feel like I’m wasting my time, since I’m no longer encountering the mind of another person, just interacting with a system.
I didn't see anything like that here. Yeah they used bullets.
I used a similar budget and built something like this:
* 7x RTX 3060 - 12 GB, which results in 84GB VRAM
* AMD Ryzen 5 - 5500GT with 32GB RAM
All in a 19-inch rack with a nice cooling solution and a beefy power supply.
My costs? 1300 Euro, but yeah, I sourced my parts on ebay / second hand.
(Added some 3d printed parts into the mix: https://www.printables.com/model/1142963-inter-tech-and-gene... https://www.printables.com/model/1142973-120mm-5mm-rised-noc... https://www.printables.com/model/1142962-cable-management-fu... if you think about building something similar)
My power consumption is below 500 Watt at the wall when using LLMs, since I did some optimizations:
* Worked on power optimizations and after many weeks of benchmarking, the sweet spot on the RTX3060 12GB cards is a 105 Watt limit
* Created patches for Ollama ( https://github.com/ollama/ollama/pull/10678) to group models to the exact memory allocation instead of spreading them over all available GPUs (this also reduces the VRAM overhead)
* ensured that ASPM is used on all relevant PCI components (Powertop is your friend)
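For anyone wanting to try the 105 Watt cap, a minimal sketch of what that looks like across seven cards. It only builds the `nvidia-smi` command lines (`-i` selects the GPU, `-pl` sets the power limit) and does not run them; applying them needs root. The 170 W stock figure is the usual reference board power for the RTX 3060 and is an assumption here, so check your own cards.

```python
def power_limit_commands(n_gpus, watts):
    """Build one nvidia-smi power-limit command per GPU index (not executed here)."""
    return [["nvidia-smi", "-i", str(i), "-pl", str(watts)] for i in range(n_gpus)]


def peak_savings_watts(n_gpus, stock_watts, limit_watts):
    """Worst-case reduction in combined GPU draw when every card is capped."""
    return n_gpus * (stock_watts - limit_watts)


for cmd in power_limit_commands(7, 105):
    print(" ".join(cmd))

# Assumed stock limit of 170 W per card.
print(f"peak draw reduced by up to {peak_savings_watts(7, 170, 105)} W")
```

To actually apply the limits you would pass each list to `subprocess.run(cmd, check=True)`; the setting does not persist across reboots by default.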
It's not all shiny:
* I still use PCIe3 X1 for most of the cards, which limits their capability, but all I found so far (PCIe Gen4 x4 extender and bifurcation/special PCIE routers) are just too expensive to be used on such low powered cards
* Due to the slow PCIe bandwidth, the performance drops significantly
* Max VRAM per GPU is king. If you split up a model over several cards, the RAM allocation overhead is huge! (See the examples in my Ollama patch above.) I would rather use 3x 48GB instead of 7x 12GB.
* Some RTX 3060 12GB Cards do idle at 11-15 Watt, which is unacceptable. Good BIOSes like the one from Gigabyte (Windforce xxx) do idle at 3 Watt, which is a huge difference when you use 7 or more cards. These BIOSes can be patched, but this can be risky
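A quick sketch of why the idle figure matters with seven cards, using the per-card numbers from the last bullet (3 W versus 11-15 W). The yearly figure assumes the box idles around the clock, which is a simplification.

```python
HOURS_PER_YEAR = 24 * 365


def extra_idle_watts(n_cards, bad_idle_w, good_idle_w):
    """Additional idle draw from cards with a high-idle BIOS versus a low-idle one."""
    return n_cards * (bad_idle_w - good_idle_w)


def yearly_kwh(watts):
    """Energy used per year at a constant draw (assumes 24/7 operation)."""
    return watts * HOURS_PER_YEAR / 1000


for bad in (11, 15):
    extra = extra_idle_watts(7, bad, 3)
    print(f"{bad} W idle per card: +{extra} W, about {yearly_kwh(extra):.0f} kWh per year")
```

That extra 56 to 84 W would be a large share of a server that otherwise idles at 90-100 W in total.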
All in all, this server idles at 90-100Watt currently, which is perfect as a central service for my tinkerings and my family usage.
I thought prevailing wisdom was that a used 3090 with its larger VRAM was the best budget GPU choice?
And in general, if on a budget then why not buy used and not new? And more so as the author himself talks about the resale value for when he sells it on.
> I thought prevailing wisdom was that a used 3090 with its larger VRAM was the best budget GPU choice?
The trick is that memory bandwidth - not just the amount of VRAM - is important for LLM inference. For example, the B50 specs list a memory bandwidth of 224 GB/s [1], whereas the Nvidia RTX 3090 has over 900GB/s [2]. The 4070's bandwidth is "just" 500GB/s [3].
More VRAM helps run larger models but with lower bandwidth tokens could be generating so slowly it's not really practical for day-to-day use or experimenting.
[1]: https://www.intel.com/content/www/us/en/products/sku/242615/...
[2]: https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622
[3]: https://www.thefpsreview.com/gpu-family/nvidia-geforce-rtx-4...
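A rough sketch of the bandwidth point, using the common rule of thumb that generating each token requires reading all the weights once, so tokens per second cannot exceed bandwidth divided by model size. The bandwidth figures are the rounded ones quoted above; the 10 GB model size is a hypothetical example, and real throughput lands below this ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s, model_gb):
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 10  # hypothetical size of the quantized weights

for name, bandwidth in (("B50", 224), ("RTX 4070", 500), ("RTX 3090", 900)):
    print(f"{name:>8}: at most {max_tokens_per_second(bandwidth, MODEL_GB):.1f} tokens/s")
```

The bound ignores compute, the KV cache, and multi-GPU transfers, so treat it as a way to compare cards rather than a prediction.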
I had a similar setup for a local LLM, 32GB was not enough. I recommend going for 64GB.
Reminds me of https://cr.yp.to/hardware/build-20090123.html
I'll be that guy™ that says if you're going to do any computing half-way reliably, only use ECC RAM. Silent bit flips suck.
I dunno everyone, but I think Intel has something big on their hands with their announced workstation gpus. The b50 is a low profile card that doesn’t have a powersupply hookup because it only uses something like 60 watts, and comes with 16gb vram at a msrp of 300 dollars.
I imagine companies will have first dibs via the likes of agreements with suppliers like CDW, etc, but if Intel had enough of these battlemage dies accumulated, it could also drastically change the local ai enthusiast/hobbyist landscape; for starters this could drive down the price of workstation cards that are ideal for inference, at the very least. I’m cautiously excited.
On the AMD front (really, a sort of open compute front), Vulkan Kompute is picking up steam and it would be really cool to have a standard that mostly(?) ships with Linux, and older ports available for FreeBSD, so that we can actually run free as in freedom inference locally.
I've been dreaming on pcpartpicker.
I think Radeon RX 7900 XT - 20 GB has been the best bang for your buck. Enables full gpu 32B?
Looking at what other people have been doing lately, they aren't doing this.
They are getting 64+ core cpus and 512GB of ram. Keeping it on cpu and enabling massive models. This setup lets you do deepseek 671B.
It makes me wonder, how much better is 671B vs 32B?
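For a sense of the sizes involved, a rough sketch of weight memory at different quantization levels. It uses parameters times bits per weight and nothing else; real quantized files run a bit larger, and the KV cache comes on top, so these are lower bounds under that assumption.

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billions * bits_per_weight / 8


for params in (32, 671):
    for bits in (4, 8):
        print(f"{params}B at {bits}-bit: ~{weights_gb(params, bits):.1f} GB")
```

By this estimate a 32B model at 4-bit comes to about 16 GB, which is why a 20 GB card can hold it, while 671B at 4-bit is about 335 GB and only fits in something like the 512GB RAM builds mentioned above.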