deng 11 minutes ago

Nice post and technically impressive work. I agree we need to understand the build pipeline and be able to do things locally. However, depending on your electricity cost, it might not make sense financially. These old servers are not energy efficient at all (I'm guessing that old Xeon server will easily pull 200W under load), and that model is currently at $0.1/$0.3 per 1M tokens (with 76 tps and 262k context) on OpenRouter. Also, these servers are LOUD.

  • jansommer 7 minutes ago

    It should be closer to 85W under load. And it's incredibly silent even on a low-end cooler. I rarely get above 50° Celsius.
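
For anyone who wants to plug in their own numbers, here is a rough sketch of the cost arithmetic in this exchange. The 200 W and 85 W figures and the 11.94 tokens/s rate come from the thread; the electricity price is a placeholder assumption, and the function name is made up for illustration. It ignores idle power and the fact that the machine is also doing other work.

```python
# Rough electricity cost of generating 1M tokens on a CPU-only box.
# From the thread: 200 W and 85 W power estimates, 11.94 tokens/s.
# Assumption (not from the thread): the electricity price per kWh.

def energy_cost_per_million_tokens(watts, tokens_per_second, price_per_kwh):
    """Return the electricity cost of generating one million tokens."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3_600_000  # watt-seconds (joules) to kWh
    return kwh * price_per_kwh


PRICE_PER_KWH = 0.30  # hypothetical tariff; substitute your own

for watts in (200, 85):
    cost = energy_cost_per_million_tokens(watts, 11.94, PRICE_PER_KWH)
    print(f"{watts} W: {cost:.2f} per 1M generated tokens")
```

With this assumed tariff the result is roughly 1.40 at 200 W and 0.59 at 85 W per 1M generated tokens, so the conclusion depends heavily on the real power draw and the local electricity price.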

cafkafk 4 hours ago

Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, and by mainstream tools not prioritizing this and hiding all the performance levers.

I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama.cpp fork I mention; see other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I can try to answer on a best-effort basis.

  • fragmede 3 hours ago

    (purple on black is really hard to read)

    You say it runs "at reading speed". Have you benchmarked it?

    • cafkafk 3 hours ago

      > (purple on black is really hard to read)

      Noted, and agreed (it looks like it has also already been clicked, which I dislike). I honestly need to redo the themes.

      > You say it runs "at reading speed". Have you benchmarked it?

      At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:

      llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128

      Gives:

        llama_print_timings:        load time =   83911.65 ms
        llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
        llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
        llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
        llama_print_timings:       total time =   11114.98 ms /   134 tokens
      

      So 11.94 tokens per second while it's also playing binary cache and CI builder.

      When I do it properly, I'll add it to the blog as well!
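
As a sanity check on the log above, the rates can be recomputed from the raw millisecond and token counts. This is a minimal sketch: the regular expression is written against the two lines quoted in this comment only, and the helper name is hypothetical rather than anything llama.cpp ships.

```python
import re

# Captures "<name> time = <ms> ms / <count> runs|tokens" from a timing line.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+"
    r"(?P<ms>[\d.]+) ms\s+/\s+(?P<count>\d+)\s+(?:runs|tokens)"
)


def tokens_per_second(line):
    """Return (phase name, tokens per second) recomputed from the raw counts."""
    match = TIMING_RE.search(line)
    if match is None:
        raise ValueError(f"not a timing line with a token count: {line!r}")
    seconds = float(match["ms"]) / 1000.0
    return match["name"], int(match["count"]) / seconds


# The two lines below are copied from the log in the comment above.
LOG = """\
llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
"""

for log_line in LOG.splitlines():
    name, rate = tokens_per_second(log_line)
    print(f"{name}: {rate:.2f} tokens/s")
```

Note that the prompt here is only 7 tokens, so the prompt-eval rate is based on a very small sample.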

      • anon-3988 1 hour ago

        I am pretty sure llama.cpp has its own benchmarking binary that you can use.

      • ekianjo 54 minutes ago

        20 tokens per second for prompt eval is the killer here. It means you can't use this to process any meaningful amount of text.

        A GPU typically processes close to 1000 tokens/s during prompt eval.

  • arpinum 1 hour ago

    How many watts is that setup? Cool you got it to work, but maybe only useful for vintage / retro computing rather than practical if the energy consumption makes it economically wasteful.

  • gdjdhdheb 54 minutes ago

    Are you sure you have DDR3? I have two E5 v4 rigs at home and both have DDR4... unless I am wrong and LGA 2011-3 supports both DDR3 and DDR4.

    • lightedman 28 minutes ago

      The first two generations supported DDR3 only. Haswell (v3) and Broadwell (v4) brought DDR4 support.

  • shevy-java 29 minutes ago

    Would you consider improving the website's layout? Right now I find it below average quality and very distracting. Whether you are an engineer or not is not really important; great engineers can write horrible text or use a layout that is not ideal, for instance.

  • Sweepi 20 minutes ago

    "-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other."

    But... isn't that a classic use case for SMT? Giving T1 something to do while T0 is waiting on DDR(3), and vice versa?

    I also don't understand the explanation of "--cpu-moe". If an expert has ~4.0 GiB of parameters, why does optimizing the sequence of experts minimize cache thrashing? With 20 MiB of L3 cache vs 4.0 GiB of parameters, it won't cache any noticeable amount of the parameters, will it?

    As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.
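
The two numbers in this comment can be put side by side with a quick back-of-the-envelope sketch. The 20 MiB and 4.0 GiB figures are taken from the comment; the memory bandwidth is an assumption (four channels of DDR4-2133 at 8 bytes per transfer, which is the theoretical peak and not a measured value), and the function names are made up for illustration. It also ignores speculative decoding, which can verify several tokens per pass over the weights.

```python
MIB = 2**20
GIB = 2**30

L3_BYTES = 20 * MIB           # L3 size quoted in the comment
BYTES_PER_TOKEN = 4.0 * GIB   # parameters read per token, per the comment

# Assumption: 4 channels x 8 bytes per transfer x 2133 MT/s (theoretical peak).
PEAK_BANDWIDTH = 4 * 8 * 2133e6  # bytes per second


def cache_fraction(cache_bytes, working_set_bytes):
    """Fraction of the working set that fits in cache at any one time."""
    return cache_bytes / working_set_bytes


def bandwidth_bound_tps(bandwidth_bytes_per_s, bytes_per_token):
    """Upper bound on decode rate if every token streams all active weights."""
    return bandwidth_bytes_per_s / bytes_per_token


print(f"L3 holds {cache_fraction(L3_BYTES, BYTES_PER_TOKEN):.2%} of the weights")
print(f"bandwidth-bound decode: {bandwidth_bound_tps(PEAK_BANDWIDTH, BYTES_PER_TOKEN):.1f} tokens/s")
```

Under these assumptions the L3 holds about half a percent of the weights touched per token, and the streaming bound lands at roughly 16 tokens/s without speculation.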

phaser 1 hour ago

What intrigues me the most about AI progress, is not AGI or the model du jour by $AI_UNICORN, but rather what can be run locally. I remember having an amusing, but rather useless model in a beefy gaming PC that I had 6 years ago; and now, something that’s a hundred times better on my M5 laptop.

Should the market react to the memory shortage and the progress of Apple silicon continue at the same pace, what we’ll be able to run locally in 6 years will be very exciting. Or frightening.

Also I don’t know what this means for the valuation of the AI companies. I remember asking about this very idea to one of their employees at an event and instead of answering he bailed out to grab a cocktail.

  • skdb476 40 minutes ago

    It's a convenience thing. You can run a whole lot of stuff locally, from Wikipedia to social media, email, and video servers. Most people with a full-time job and 2 kids don't do it, because who has the time and energy to patch and maintain the ever-growing complexity of this stuff? These systems will keep growing more complex. That also means more bugs. It's the age-old tradeoff between freedom and convenience.

  • fooker 37 minutes ago

    What you can run locally on consumer hardware is progressing pretty well.

    If you get a not-quite-the-best gaming GPU like a 5080, you can run local models that are better than the state of the art from early 2025. Depending on what you want to do, you might have to switch models. The one size fits all huge models are still a data center thing.

  • MAXPOOL 31 minutes ago

    Things you are not supposed to talk about:

    - There is no "moat" (lasting, easy-to-defend technological edge) in AI model businesses. There are just short-term advantages.

    - An AI business is a capital-intensive business, just like old factories. Data centers are expensive, models are energy-hungry, and the hardware inside must be replaced every 3–4 years.

    - Smaller, specialized models eat margins from below. Transcription, voice, or image detection do not need large models.

    There is no reason to expect high margins like you can in traditional software business. Benefits of AI go mostly to consumers.

  • rienbdj 23 minutes ago

    Training AI models to drive valuation reminds me of high frequency trading

throwaway2027 34 minutes ago

Glad to see other people realizing this. I've been running Gemma 26B-A4B Q4 on a 2012 Xeon with 16GB to 24GB of RAM in a container. It's getting around 8 to 12 tokens per second. Obviously it's not comparable to huge contexts and running it on a GPU and the image decoder in llama.cpp is super slow compared to a GPU but for some small automation tasks and general trivia questions it's decent. The speed is just enough to not have to wait for it to finish so you can read along.

Here's my setup. You may want to figure out what the best optimizations are for your specific CPU, like AVX2, because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context, or go even lower to Q2, and don't overcommit on threads either, but I would suggest either defaults or trying out llama-bench. This isn't by any means the best, I assume, but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context, but some say it could hurt quality, and I have noticed it too on some models.

# Building

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON

# Running

export OPENBLAS_NUM_THREADS=4

export OMP_NUM_THREADS=4

OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \
llama.cpp/build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 --jinja --host 0.0.0.0 --port 8001 --cache-type-k q8_0 --cache-type-v q8_0 --threads 4 --threads-batch 4 --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 --no-mmap --mlock --chat-template-kwargs '{"enable_thinking":false}' --no-mmproj -np 1 -fa 1
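
On the point about figuring out which optimizations your CPU supports: a minimal sketch, assuming an x86 Linux host where /proc/cpuinfo exposes a "flags" line. The list of flags checked is an illustrative choice, not something llama.cpp prescribes, and the helper names are hypothetical.

```python
from pathlib import Path

# Illustrative selection of SIMD-related flags as they appear in /proc/cpuinfo.
WANTED = ("sse4_2", "avx", "avx2", "fma", "f16c", "avx512f")


def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line, if any."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def report(cpuinfo_text):
    """Map each wanted flag to True/False depending on CPU support."""
    flags = cpu_flags(cpuinfo_text)
    return {flag: flag in flags for flag in WANTED}


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # only present on Linux
    for flag, present in report(cpuinfo.read_text()).items():
        print(f"{flag:8s} {'yes' if present else 'no'}")
```

A CPU that reports "no" for avx2 or fma will fall back to slower code paths, which is one reason results on older Xeons vary so much.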

jansommer 16 minutes ago

The E5-2620 v4 is great. I have been using it for 10 years now. I wanted to upgrade until I saw current prices. I have 64 GB of DDR4. I paired it with an RX 9060 XT 16 GB and games run as fast as ever. Perhaps the CPU is a slight bottleneck in DOOM: The Dark Ages, but I'm at 60 fps, so no problem. A light LLM on the GPU is a no-brainer, and it's cool to see that things can be tuned to run OK on the CPU. I bought a 2667 v4 a month ago for $30. I'd expect it to give a decent performance boost, but I just haven't had the need for it yet. If I were pushing into LLMs like in the article, I'd probably upgrade, because the 2667 can handle slightly faster RAM.

vhaudiquet 1 hour ago

The E5-2620 v4 only supports DDR4.

cykros 59 minutes ago

Does this mean my 15 year old Phenom is too old? But it has 16 gb of DDR3 RAM!

Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.

potus_kushner 3 hours ago

@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple GB free for other tasks ?

  • cafkafk 3 hours ago

    Honestly, at this point you're probably looking at a smaller model, for the Gemma series I'd go with Gemma 4 E4B with drafters, but that's just a hunch from using it on my laptop (where I do have a RTX 4060 M and 96gb ram).

    So you'd change the invocation slightly here, but a lot of things you can potentially reuse.

    That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.

    • potus_kushner 2 hours ago

      i tried the Q4_K_M model from unsloth with your Q4_K_M drafter, but the required memory to load everything is 72GB. odd. otoh i could load Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf and it requires just ~18 GB:

      ~/ik_llama.cpp[main]$ build/bin/llama-cli --model ~/models/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune -cnv --color --jinja --special -smgs -sas -mea 256 --temp 0.7 -t 6 --parallel 6 --cpu-moe --merge-up-gate-experts --flash-attn on --mla-use 3 --mlock --run-time-repack --no-kv-offload

      works pretty fast, at about 15 t/s:

        llama_print_timings:      sample time =      45.28 ms /   404 runs   (    0.11 ms per token,  8921.67 tokens per second)
        llama_print_timings: prompt eval time =     949.42 ms /    51 tokens (   18.62 ms per token,    53.72 tokens per second)
        llama_print_timings:        eval time =   24067.08 ms /   400 runs   (   60.17 ms per token,    16.62 tokens per second)
        llama_print_timings:       total time =  242192.55 ms /   451 tokens

      so i wonder why the params used by the quantized qwen model use way less memory than the ones of gemma.
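
A rough way to sanity-check those memory numbers is to estimate the weight size from parameter count and quantization. This is only a sketch: the bits-per-weight values are approximate rules of thumb for these GGUF quant types (an assumption, not exact figures), the parameter counts are read off the model names, and the function name is hypothetical. It covers weights only, not KV cache, drafter, or any run-time buffers.

```python
# Approximate average bits per weight for some GGUF quant types.
# These are rules of thumb (assumptions), not exact values for any given file.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.85, "IQ4_XS": 4.25}


def estimated_weight_gb(params_billion, quant):
    """Estimate weight storage in GB (10^9 bytes) for a quantized model."""
    total_bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


for label, params, quant in (
    ("Gemma 26B", 26, "Q4_K_M"),
    ("Qwen 35B", 35, "IQ4_XS"),
    ("Gemma 26B", 26, "Q8_0"),
):
    print(f"{label} {quant}: ~{estimated_weight_gb(params, quant):.1f} GB of weights")
```

The 35B IQ4_XS estimate lands near the ~18 GB reported above, while a 26B model at Q4_K_M would be expected to need less than that for weights alone. That suggests the 72 GB figure comes from something other than the weights themselves, though the thread does not establish what.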

    • sleepyeldrazi 52 minutes ago

      Have you tested Qwen3.6 35B? Putting aside the capability claims for that model (which I support, but are not my point here), that 35B has smaller active parameter count than the gemma 4 26B, potentially making both prefill and decode faster out of the box, and has MTP heads built in the model and well supported (you may need to make sure you download a quant that didn't strip them off, as some do to preserve space). I would be curious to see your numbers there too. And if you do test this, please go for a clean one and not a fine-tuned one.

NSUserDefaults 1 hour ago

How about the iMac Pro? Would that work? I was able to put 128gb in it (not as easy as the regular iMac but possible).

  • wazoox 1 hour ago

    I've been running various models on a Mac Pro 2013 (8 cores, 32 GB RAM) at about 8 to 10 t/s for months. It's not fast, but it's more than enough for many actual tasks, in particular background tasks. An iMac pro will do just as well I suppose.

    • fooker 36 minutes ago

      What are the tasks that do well with 8-10 t/s ?

asimovDev 2 hours ago

I have an ancient DDR3 Xeon that doesn't support any AVX (dual x5690 and 96GB 1333 MHz RAM). You reckon it would even build / run at all?

  • tgtweak 2 hours ago

    It may work - depending on your ram speeds it might not even be that much slower.

  • cafkafk 2 hours ago

    Loading will take some minutes, but at 96 GB you can squeeze the model in and have around 10 GB of headroom, although depending on the Xeon, you may have to downgrade to E4B instead. Should still work though.

  • qwertox 1 hour ago

    CPU (2012)

      Model name: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
    

    Mainboard

      Product Name: P8Z77 WS
    

    GPU

      05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
      05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)
    

    Memory: 32GB

    This works.

  • burnt-resistor 1 hour ago

    I run Win 11 Enterprise on an el cheapo spare parts Xeon E3-1275 V2 + 32 GiB DDR3-2133 + Gigabyte GA-B75M-D3H rev. 1.2 (TPM support)

egorfine 43 minutes ago

This and the previous one are insanely good articles. Thank you!

gigatexal 1 hour ago

What kind of tokens per second did the OP get? I saw nothing written about this.

  • urbandw311er 1 hour ago

    11.94 tokens/sec (from another answer above)

shevy-java 30 minutes ago

The webpage's layout is just horrible. Scrolling is also non-default, and thus rather annoying; I had to stop after two scroll events. Why do people think they need so many fancy effects or non-standard behaviour, if their alleged goal is to get information across to other people?

Eonexus 3 hours ago

I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?

  • cafkafk 3 hours ago

    That is a very fair point! I just ran a not very scientific benchmark with the system under load, and posted the raw logs in a sibling comment above, but the short answer is that it's hitting 11.94 tokens per second for generation - while it's also being a binary cache and CI build server.

    Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.
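
The reading-speed estimate can be reproduced with a couple of lines. The 250 wpm figure is from the comment; the tokens-per-word ratio is an assumption (values around 1.3 to 1.4 are a common rule of thumb for English text, but it depends on the tokenizer), and the function name is made up for illustration.

```python
def reading_speed_tps(words_per_minute, tokens_per_word):
    """Convert a reading speed in words per minute to tokens per second."""
    return words_per_minute / 60 * tokens_per_word


# tokens_per_word values are assumed, not measured for Gemma's tokenizer.
for ratio in (1.3, 1.4):
    print(f"{ratio} tokens/word: {reading_speed_tps(250, ratio):.1f} tokens/s")
```

With those assumed ratios, 250 wpm works out to between 5 and 6 tokens per second, so the measured 11.94 tokens/s is about twice that.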

    • Eonexus 3 hours ago

      Huh, that's actually not bad at all! Sure, it's not at the speed of a GPU, but still, 20 tps is cromulent for a CPU.

hparadiz 1 hour ago

I'm now staring at a 10 year old 4U with 256 GB of DDR4 and thinking hmmmmm

christkv 2 hours ago

Makes you wonder if it's possible to squeeze more tps out of a Strix Halo system using the 16 Zen 5 cores as well as the GPU.

  • cafkafk 2 hours ago

    If you get the inference engine to route the heavy matrix math to the GPU and the speculative drafting to the CPU without choking on latency it's probably gonna be very fast.

    Would love to see the benchmarks if someone actually pulls something like that off.

  • Havoc 1 hour ago

    In general you’re mem bandwidth constrained so cpu vs gpu often ends up similar on APUs

    • fulafel 1 hour ago

      There are ways to trade off compute power for memory bandwidth (like MTP and other speculative decoding approaches). The CPU and GPU would need to be able to share the same cache for this to work. In the Strix Halo case the GPU has a private cache on the GPU die I think, which is the snag.
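
To make the compute-for-bandwidth trade concrete: in the standard analysis of speculative decoding, if each drafted token is accepted independently with probability a and the drafter proposes k tokens, one verification pass over the big model yields (1 - a^(k+1)) / (1 - a) tokens on average. The sketch below uses k = 3 to match the --draft-max 3 flag seen elsewhere in this thread; the acceptance rates are hypothetical, the independence assumption is a simplification, and the cost of running the drafter is not counted.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens produced per verification pass of the target model.

    Assumes each drafted token is accepted independently with the same
    probability, which is a simplification of real acceptance behaviour.
    """
    if acceptance_rate >= 1.0:
        return draft_len + 1
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


# Hypothetical acceptance rates; draft_len=3 mirrors --draft-max 3.
for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(rate, 3)
    print(f"acceptance {rate:.1f}: {tokens:.2f} tokens per weight pass")
```

Since decoding on these machines is limited by how fast the weights can be streamed from RAM, producing more than one token per pass is where the speedup would come from.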

SXX 59 minutes ago

Now we need someone to try running Kimi K2.6 on an old Xeon with DDR3. After all, these platforms do support up to 768GB of RAM.

nurettin 1 hour ago

I also run a Qwen 3.6 moe A4B on old hardware. I set it up with

numactl --membind=1

so its memory is constrained to a single NUMA node (node 1), which speeds up token generation a little.

hypfer 1 hour ago

> The argument for speculative decoding is stronger on CPU than on GPU.

Uh. Uuuh.

No?

___

Also

> While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.

What purpose does the quoting of "caches" serve there? Is this AI writing written by that model running on that host?

bflesch 1 hour ago

Might consider going for even older CPUs which don't have the Intel ME ring -3 thing which is full of backdoors

  • bflesch 13 minutes ago

    I appreciate the downvotes without any reasoning. It's a fact that newer Intel CPUs have Intel ME which was not in older CPUs and significantly increases attack surface if you are not living in a five eyes state.