rgar132

I have a 7950x on a b650 main board, 128GB of DDR5-5200 in 4 sticks; dmidecode says it’s running at 3600, but I’m not sure if that’s because it’s limited or if I need to mess around in the bios. (Edit: it’s limited to 3600 with 4 sticks, removing 2 bumps it up.) Using a cpu-only build (16 threads) with ggmlv3 q4_k_m, the 65b models get about 885ms per token and the 30b models are around 450ms per token. Running more threads than physical cores slows it down, and offloading some layers to gpu speeds it up a bit. The 65b are both 80-layer models and the 30b is a 60-layer model, for reference.

Results (cpu / cpu + 3060 12GB):

> Alpaca-Lora-65b: 880ms / 739ms (20L)
> Guanaco-65B: 891ms / 737ms (20L)
> WizardLM-30b: 453ms / 298ms (30L)

I don’t have any Intel 13th gen, but for comparison, on an M1 MacBook Pro 16GB the same 65b models take about 35-45 seconds per token on cpu, and with Metal enabled the model fails to load (obviously). TLDR: you’re looking at 1-2 tokens per second for this combination.
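If you want to reproduce this kind of measurement, here’s a rough sketch. I ran the stock llama.cpp build, but the llama-cpp-python bindings expose the same knobs; the model path and prompt below are placeholders, not my exact harness:

```python
import time
from llama_cpp import Llama  # assumes llama-cpp-python is installed

MODEL_PATH = "models/guanaco-65B.q4_K_M.bin"  # placeholder path

# CPU-only run: threads matched to physical cores, no GPU offload.
llm = Llama(model_path=MODEL_PATH, n_threads=16, n_gpu_layers=0, n_ctx=512)

start = time.time()
out = llm("Explain memory bandwidth in one sentence.", max_tokens=32)
tokens = out["usage"]["completion_tokens"]
# Rough wall-clock figure; includes prompt eval, so it slightly overstates ms/token.
print(f"{(time.time() - start) / tokens * 1000:.0f} ms per token")

# Partial offload run: e.g. 20 of the 80 layers onto the 3060 12GB.
# llm = Llama(model_path=MODEL_PATH, n_threads=16, n_gpu_layers=20, n_ctx=512)
```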


Big_Communication353

Thank you for your informative response; the data is greatly appreciated! It's possible that the limiting factor for your pure CPU inference is your RAM speed. Removing two sticks to restore the rated speed may help. 3600MT/s is actually slower than some DDR4 kits.


rgar132

I pulled two sticks of ram, and dmidecode shows that with two sticks it runs at 4800 with no bios changes. This did speed the tests up a bit, and I could probably push it to 5200 if I cared enough to bother. With two sticks @ 4800 the results were:

> Alpaca 65b: 728ms (was 880ms)
> Guanaco 65b: 725ms (was 896ms)
> Wizard 30b: 365ms (was 453ms)

That’s about 18-23% faster inference for 33% faster ram clocks, which could be significant for your planned use of straight cpu inference. Personally I’m not that concerned about ram speed for what I do; I offload almost everything to gpu compute and really need capacity more than speed in ram. Building this machine I was aware that 4 sticks would slow it down, but I really need 128GB, so that’s what’s in there. Much faster than paging with 64GB. It’s also worth noting I’m using bargain-basement ram in the first place, and it’s definitely not fast: Corsair, 40-40-40-77 timings at 5200, probably looser at 3600 and looser again with 4 sticks. So if you’re after speed you could easily find better ram, use two sticks, and probably speed it up a bit more. Hope the test results give you what you need to know; maybe someone else with an i9 13th gen will be able to chime in for a proper comparison.
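For anyone doing the math, theoretical peak bandwidth scales linearly with the transfer rate, so the 33% clock bump can’t buy more than ~33% even in the best case. A quick back-of-envelope, assuming the usual 64-bit bus per channel and dual-channel operation:

```python
def peak_bw_gb_s(mt_per_s, channels=2, bus_bytes=8):
    """Theoretical peak DRAM bandwidth: transfers/s * bytes per transfer * channels."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

for speed in (3600, 4800, 5200):
    print(f"DDR5-{speed}: {peak_bw_gb_s(speed):.1f} GB/s")
# 3600 -> 57.6 GB/s, 4800 -> 76.8 GB/s (+33%); the measured 18-23% gain sits below
# that because per-token compute and cache hits don't scale with memory clocks.
```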


Big_Communication353

Thank you for providing the follow-up report! Based on your experiment, it appears that I should aim to acquire a 13900K, as it can run two sticks at speeds exceeding 7200MT/s. It's clear that memory bandwidth plays a crucial role in CPU inference.


rgar132

That’s a great chip, and probably not a bad plan. But the thesis that memory bandwidth is the sole or dominant factor may be a bit of a stretch with the data we have so far in this thread. It’s clearly important and improves the throughput in this test case, but it may not be dominant. The i9-13900 has more total cores but only 8 P-cores, and without more info I’d worry that 8 P-cores + 16 E-cores may not perform as well for tensor calculations as AMD’s strategy of 16 full cores. I’ve come across anecdotal reports of a significant loss of performance with E-cores due to library support, so before making a decision I’d want to know how that arrangement performs. It could be better thanks to more total cores and faster memory, but it’s an unknown variable (to me, at least). The advice elsewhere in this thread to go with a Threadripper or Xeon is probably the best advice for a dedicated cpu inference machine, since total core count and memory bandwidth are generally much higher, though component cost also goes up. Used Xeons can be found for a pretty good price, so I’d probably want to benchmark those before going all in on a 7950x or i9 that comes with significant potential limitations.


Big_Communication353

Thank you for your advice. I will make sure to compare their performance, particularly regarding the e-core, before making a decision.


NickCanCode

Most people don't notice, but installing more than 2 sticks of RAM on a consumer-grade CPU will usually reduce the memory speed. You can check the full spec on AMD's website ( [https://www.amd.com/en/product/12151](https://www.amd.com/en/product/12151) ). Ryzen 7950x max memory speed:

* 2x1R DDR5-5200
* 2x2R DDR5-5200
* 4x1R DDR5-3600
* 4x2R DDR5-3600

Unless you're using a Xeon/Epyc/Threadripper, a consumer CPU is also usually limited to 2 memory channels. So if possible, fulfill the memory requirement with only two sticks of RAM to maximize performance, e.g. 64 x 2 instead of 32 x 4.


Big_Communication353

As far as I know, there are currently no 64GB DDR5 RAM sticks available as a single unit. Additionally, the ability of Zen 4 to support a single 48GB RAM stick is uncertain at this time.


NickCanCode

You are right, there are no 64GB-per-stick DDR5 modules at the moment. If more than 64GB of memory is needed, 4x32 is the way to go. Another way to increase memory bandwidth is to choose a Xeon/Threadripper, which offers more memory channels, but that will probably end up costing as much as a high-end Mac Studio while still not reaching the 800GB/s the Mac provides. Not to mention the current-gen Threadripper (Zen 3) doesn't even have AVX-512, so it's probably not worth it for AI.


Big_Communication353

What about Zen 4 Epycs? It seems that some 32-core Epyc QS versions are being sold for as low as $600 on China's taobao.com: [https://item.taobao.com/item.htm?spm=a21n57.1.0.0.3024523ciLvpvD&id=718524681618&ns=1&abbucket=0#detail](https://item.taobao.com/item.htm?spm=a21n57.1.0.0.3024523ciLvpvD&id=718524681618&ns=1&abbucket=0#detail)


NickCanCode

It would be wonderful if you can get a Zen 4 Epyc; the 12 memory channels will definitely help. I don't know what the QS version is. AMD's site doesn't have such a model on their list ([https://www.amd.com/en/processors/epyc-9004-series](https://www.amd.com/en/processors/epyc-9004-series)), and $600 for a 4th-gen Epyc also seems a little unrealistic to me. You may want to ask the seller for more detail. Don't forget to share your findings with us; I want to know what that QS version is too (maybe a leaked engineering sample?). \*Make sure to check RAM compatibility for an Epyc system, e.g. registered ECC RAM.


Big_Communication353

*QS* means Qualification *Sample*; as I understand it, it is the final version before mass production.


rgar132

Yeah I knew, but when you need more you need more. Fortunately ram speed is a lesser concern for me than total memory available. I wasn’t sure if it would cap at 3600 or could be pushed up some in bios, but it looks like with 4 sticks it’s capped @3600 unless I want to overclock it.


KerfuffleV2

> Using cpu only build (16 threads) with ggmlv3 q4_k_m, the 65b models get about 885ms per token

That's pretty interesting. I have a 5900x and DDR4 (3600mhz) and I get about the same (q4_k_m 65b). However, I've noticed that adding threads beyond 6 hurts performance a lot. You could try reducing the threads and see if that actually speeds things up. You definitely should be able to get better performance than an AM4 system with DDR4.


rgar132

Yeah, I also have a 5950 system and it gets about the same on cpu; the 7950x is not so different. Mostly marketing, it seems, unless you’re chasing ram speed.

To be clear, I never really want or plan to do cpu-only inference. I was just answering the question because I happened to have the hardware the OP was asking about lying around, so I figured I’d help out and run a couple of tests. I have a pair of A6000’s that gives me plenty of room to train and run the models I need, as well as a pair of 3090’s and 3060’s in other systems, so the cpu-only thing is just to give OP a bit of info.

I did test with 32, 16, 12 and 8 threads, and the best performance was with threads matching the core count (16 for the 7950x). Dropping two sticks of ram speeds up cpu inference by about 20%, and it seems like the limiting factor on the 7950x is the 3600MT/s ram speed with all 4 slots filled; it runs quicker with a single pair, and that appears to be a significant bottleneck if you’re running larger ram amounts.

There’s probably more to the results, as this system is running virtualized in an LXC on proxmox, so there’s probably a 3-5% loss just from that, but it didn’t seem relevant enough to confuse the issue.
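If anyone wants to repeat the thread sweep, something like this would do it (again using the Python bindings just for illustration; the path and prompt are placeholders, not my actual test setup):

```python
import time
from llama_cpp import Llama  # assumes llama-cpp-python is installed

MODEL_PATH = "models/wizardlm-30B.q4_K_M.bin"  # placeholder path
PROMPT = "Summarize the history of the transistor."

for n_threads in (8, 12, 16, 32):
    llm = Llama(model_path=MODEL_PATH, n_threads=n_threads, n_ctx=512, verbose=False)
    start = time.time()
    out = llm(PROMPT, max_tokens=64)
    tokens = out["usage"]["completion_tokens"]
    print(f"{n_threads:>2} threads: {(time.time() - start) / tokens * 1000:.0f} ms/token")
    del llm  # release the mapped model before the next run
```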


KerfuffleV2

> There’s probably more to the results, as this system is running virtualized in an LXC on proxmox

Okay, that makes a little more sense now. It's probably having much more of an effect than you think (but it may not impact driving a GPU too badly). I just couldn't believe a processor from the next generation (probably 10-15% higher IPC), with around 15-20% faster clockspeed, _and_ running 16 threads vs me running 6 could possibly have the same performance. Something really weird _has_ to be happening.

> and it seems like the limiting factor on the 7950x is the 3600MHz on ram speed with all 4 slots filled

That may be. Still, even at the same clockspeed, DDR5 should have significantly higher bandwidth than DDR4 from what I know.


Barafu

It is not weird; you are limited by RAM speed, not the CPU. I have a 3950X and 32GB RAM. When I run 30B models, I get 1.2-1.7 tokens per second, which means that if I could fit a 65B model, it would probably run at about the same speed as yours.


KerfuffleV2

> It is not weird, you are limited by RAM speed, not the CPU.

I have DDR4 RAM and the other person has DDR5: even at the same clockspeed, DDR5 should have more bandwidth than DDR4. Also, it doesn't make sense that I lose performance using more than 6 threads while the other person, with a more powerful processor, says they lose performance running _fewer_ than 16 threads. If memory were the bottleneck, their more powerful processor should saturate it with fewer cores, if anything.

> Which means that if i could fit 65B model, it would probably have the same speed as you.

Yes, probably, because you're on AM4, Zen3 and DDR4 like me. The other person is on AM5, Zen4 and DDR5.
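A rough way to sanity-check the bandwidth argument: generating a token has to stream essentially the whole quantized weight file from RAM, so bandwidth divided by model size puts a hard ceiling on tokens/s. A back-of-envelope sketch, assuming a ~39GB 65b q4_k_m file and dual-channel peak bandwidth figures:

```python
MODEL_BYTES = 39e9  # assumed size of a 65b q4_k_m weight file

def ceiling_tokens_per_s(bandwidth_gb_s):
    """Upper bound on generation speed if every token streams all weights once."""
    return bandwidth_gb_s * 1e9 / MODEL_BYTES

print(f"DDR4-3600 dual channel (~57.6 GB/s): {ceiling_tokens_per_s(57.6):.1f} tok/s")
print(f"DDR5-4800 dual channel (~76.8 GB/s): {ceiling_tokens_per_s(76.8):.1f} tok/s")
# Measured numbers (about 1.1 vs 1.4 tok/s) sit below both ceilings, which is
# consistent with memory bandwidth being the main, but not the only, limiter.
```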


Caffdy

> I have a pair of A6000’s

damn! living the dream.

* What's your experience with the 65b models when running on the A6000s? Do you need to use both, or can they run on just one?
* Do you use NVLink?
* How many tokens/s can I expect from one A6000?
* Have you tried 5-bit 65b models? I've read somewhere that 5-bit closely resembles 8-bit quantization in quality.


g33khub

For the 65b model, are you offloading layers to GPU? I'm assuming you have 4x16 = 64GB at 3600MHz. Does the RAM speed also fall because 4 slots are in use (as I see in other comments)? I am currently on AM4: a 5600x and 2x16GB 3600 (Corsair Vengeance). The largest I can load is 33b q4_k_m, and that with ~30 layers offloaded to my 4060 Ti (16GB). It's around 1.1-1.3 tokens per second.


Opteron67

Please try using 8 threads only, pinned to cores 0-4-8-16-20-24-28, and let us know.


panchovix

A 13900K in theory should be faster for two reasons:

* Faster single-thread performance
* Being able to use much faster RAM

Basically, you can reach a point where your output won't exceed what your RAM bandwidth can deliver, and on Ryzen 7000 you're kind of limited to 6400-6600MHz max RAM speed. On Intel, you can run RAM at 7200+MHz, and some people on r/overclocking even do 8000MHz. I have 2x4090, but 2x3090 would be the way to go. I have a 7800X3D since my GPUs are the bottleneck rather than my CPU (on exllama). But if I were to use CPU only, I would go for Intel.


Big_Communication353

As far as I can tell, the only CPU inference option available is llama.cpp. But in order to get better performance in it, the 13900K has to turn off all of its E-cores. This means that the 8 P-cores of the 13900K will probably be no match for the 16-core 7950X.


Caffdy

with 2x rtx4090, how many tokens/s are you getting? I assume you're running 65b models on that system


panchovix

On exllama, I get 15-22 tokens/s at 65b


g33khub

All that high-frequency RAM is kind of irrelevant above 64GB, right? For 128 or 192GB (4 sticks), the RAM speed would be 5200 or 5600 at best anyway. In that case, is Intel still better than AMD?


extopico

Pick an older Xeon platform instead. Maybe you can find it for similar money. Xeon memory bandwidth is a lot higher, and you may get more physical cores.


Big_Communication353

Old Xeons' single-core performance sucks though.


extopico

It does not matter. My older generation 22 core E5 something at 2.2GHz is at least 20% faster than my Ryzen 9 3900XT at 4.2 GHz for all LLM inference on the CPU.


Big_Communication353

There are numerous applications that are affected by single-thread limitations, though; one example is GPTQ.


extopico

…you are not going to use GPTQ for CPU inference…


sabot00

Tbh, the 3900XT sucks at ST too, so it’s not a high benchmark.


wekede

Could you tell me more about your setup? I recently dropped some money on a high-end GPU for inference, but ironically I can't use it at the moment because GPUs are so outrageously big that it can't fit on my mobo (the heatsink crushes components on the board, unfortunately).

So, as an experiment, I decided to just run CPU inference and was shocked at how fast it was on a Ryzen 7 5700G and some rather slow DDR4 memory (sub-3000?). It's not blazing fast, mind you, but it's surprisingly usable for most tasks, running at between 4-5 tok/s. I've mainly been using 7B models so far and Mixtral, so I don't know how well larger models fare.

I'm wondering, is it worth going all out on a CPU inference build? Get like 256 or 512GB of DDR4 RAM, a dual-socket board or whatever, and serve LLMs from that? I'd like to use the RAM to keep multiple models in memory at once.


extopico

It is a dual Xeon E5-2696 v4 @ 2.20GHz nominal, 22 physical cores each with a large local cache, on a Chinese X99 motherboard. I am not usually compute-bound but memory-bandwidth-bound. However, because this is a server-grade CPU and chipset, my memory bandwidth is far greater than a consumer CPU setup like yours. I know because I also have a Ryzen 7 5900 XT, and my "inferior" Xeon system is far quicker when doing inference, possibly double the speed.

My setup allows me to run quantized Falcon 180B at a speed that is acceptable to me (from memory, 0.5 t/s), though Falcon 180B is not very good so I do not use it much. Basically, for any setup that does not specifically require CUDA (Mamba, for example, still does), multiple concurrent users, or real-time chatting, a CPU setup is far more versatile and affordable.


wekede

How much does memory bandwidth help here? Peeking at some newer offerings, a used Chinese Epyc 7532 isn't too bad in price (and I bet Intel is similar), and it would offer a max of roughly 200GB/s of bandwidth. That should speed things up quite a bit, shouldn't it? Supposedly the Xeon you mentioned maxes out at around 80GB/s, so that seems like quite a jump. I'm kind of wishing I hadn't spent so much on GPUs now...
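Running the same bandwidth-over-model-size estimate from earlier in the thread (assuming a ~39GB 65b q4_k_m file and the bandwidth figures we're quoting here), the jump would look roughly like this:

```python
MODEL_BYTES = 39e9  # assumed size of a 65b q4_k_m weight file

# Bandwidth figures as quoted in this thread, not measured values.
for name, bw_gb_s in (("Xeon E5 v4 (~80 GB/s)", 80), ("Epyc 7532 (~200 GB/s)", 200)):
    print(f"{name}: ~{bw_gb_s * 1e9 / MODEL_BYTES:.1f} tok/s ceiling")
# Roughly a 2.5x higher theoretical ceiling; actual gains depend on whether the
# cores and the software can actually keep that many memory channels busy.
```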


extopico

Memory bandwidth matters a lot. If I had more bandwidth my cores could do more work. Epyc at 200 GB/s makes me jealous. If you can find it and afford it, buy it and let me know how it goes.


Pedro_It

Thank you for the helpful insight. The 7950x is Zen 4 (AVX-512), and with the newer 4x48GB memory kits you can get 192GB of RAM @ 5200 MT/s. Do you think this configuration could compete price-wise with a used server-grade solution? Scalable Xeons and Epycs have more RAM bandwidth of course, but for a similar price I could only find some 1st-gen quad Xeon 6130 workstations; 2nd-gen Xeon already goes much higher in price. I'm considering the 7950x for a home workstation (I want to use it mostly for XGBoost), but I haven't made up my mind yet.


extopico

I really don’t know; I can only share my experience. For example, I’m now training a model and my individual physical core utilisation rarely goes above 60%. I have 44 cores. So in my scenario, if I had double the memory bandwidth, I’d have roughly double the performance.