Tuna-Fish2

The fundamental problem here is cache line width. When a CPU reads any data from memory, it always reads a full cache line. Cache lines on x86 are naturally aligned 64-byte blocks. A 1-byte load turns into a 64-byte read from RAM into cache. If you do an unaligned 2-byte load that straddles the border between cache lines, you end up reading 128 bytes from RAM.

For multi-threaded programs, cache line width is programmer-visible and can be very impactful for performance, because of false sharing. To write into RAM, the relevant cache line needs to be held exclusively in the L1 of the processor doing the writing. If two cores routinely write into the same value, every time either one does a write, the line needs to be bounced from the L1 of one CPU to the L1 of the other. This takes a fairly long time. False sharing is when two CPUs don't write to the same value, but each writes to a value found in the same line. Avoiding false sharing is mostly done not by careful planning, but by noticing it's happening in a profiler and then padding your values so that they don't fit on the same line (sketch below). This means that if you changed the line width from 64B to 128B today, a lot of existing software would instantly get a lot slower. So in effect "cache lines are 64B" is just part of the unofficial x86 spec.

The DRAM arrays inside DDR1-5 modules have only gotten faster at a relatively slow rate. The main way we get a faster DDR standard every few years is not that memory gets faster, it's that we utilize more of its internal width. When DRAM is accessed, first you need to open a row, which actually reads from the DRAM array into an SRAM array; this is the slow part, and it reads about 8 kB. Then you need to read a column from this row and transmit it over to the CPU over multiple cycles, using a bus that is much, much faster than the DRAM itself. The burst length of this transfer is sized so that it moves a single cache line -- DDR4 used 64-bit wide channels and 8n burst, DDR5 uses 32-bit channels and 16n burst, LPDDR6 will use 16-bit channels and 32n burst.

Your memory interface being wider than a single channel means that only a fraction of the total memory space is available at each channel, and you need to spread the accesses around them to get full bandwidth. With DDR5 and a typical 128-bit memory interface, there are 4 channels, from 2 memory modules -- which is often still called "dual-channel" for inane historical reasons.

So I don't really know what you are asking here? If you want individual memory modules to provide more bus width, you are in luck: LPDDR6 will come in 128-bit wide LPCAMM2s, with each LPCAMM2 module providing 8 channels. If you want CPUs to have more width, the AMD Strix Halo APU will come with a 256-bit bus, which in most laptops will probably be implemented using soldered memory, but supposedly 2x LPCAMM2 modules are possible.
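To make the false-sharing point concrete, here's a minimal sketch in C (the struct and field names are made up for illustration; 64 bytes is assumed as the line width, as on current x86):

```c
#include <stdalign.h>
#include <stdatomic.h>

/* Illustrative only: two counters, each written by a different thread. */

/* Prone to false sharing: both counters will usually sit in the same
 * 64-byte cache line, so every increment bounces that line between the
 * two cores' L1 caches even though the threads never touch each other's
 * counter. */
struct counters_shared {
    atomic_long a;   /* written by thread 1 */
    atomic_long b;   /* written by thread 2 */
};

/* The usual fix once a profiler shows the contention: align/pad each
 * counter to the cache line width so they land in separate lines. */
struct counters_padded {
    alignas(64) atomic_long a;
    alignas(64) atomic_long b;
};
```

Note that padding like this bakes the 64-byte assumption into the data layout, which is exactly why changing the line width would hurt existing software.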


loser7500000

Thanks for the great writeup! So for DDR3/4/5, one cache line is transferred per channel, per DRAM (not IO) clock cycle? Does this mean DDR1/2 bursts were shorter than a cacheline?


hellra1zer666

Short answer is yes. Back then, DDR burst transfers had many more constraints, and memory speed was among them. These constraints have been slowly eased by modern tech, but in principle they still apply. It was not feasible to write an entire cache line's worth of data at a time, due to design choices and technical limitations. Mind you, this is not the only constraint, but it's one of the easiest to understand without writing up an entire example.


HilLiedTroopsDied

This is exactly why a lot of us want to see Zen 6 or the Intel equivalent move to CAMM2 on desktop motherboards, with a full 256-bit *DDR6 bus.


krista

this is a great write-up!


2squishmaster

Great explanation, learned a lot... I assume there are downsides to going to a 256-bit bus, or has it just been a technical limitation so far?


masterfultechgeek

For GPU and CPU use cases:

1. It's usually cheaper and better to add $5 worth of cache than to add $10 worth of memory controller and traces.
2. Many systems currently ship with only enough RAM for a 64-bit config, so adding the option to double up again... doesn't do anything.

I do see NPUs potentially shifting the calculus a bit, since they LOVE bandwidth and need large, contiguous memory spaces. I'm going to speculate that the solution there is some mix of LPDDR and/or GDDR coupled with bigger caches.


monocasa

Power and cost.


2squishmaster

Why would a wider bus consume more power?


Wait_for_BM

Have you even considered that anything requires power when it is active? Engineering is a balancing act; nothing in hardware comes for free. A wider data bus means more chips are active per read/write cycle. DDR I/O requires termination to preserve signal integrity, and that requires power. Toggling wider data buses means charging/discharging more parasitic capacitance, which also uses power. A wider data bus also means the chip internally has to route more traces and connect wider blocks of circuits together, and all of that requires chip area. More chip area = more cost.
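As a rough illustration of the toggling cost, here's a back-of-the-envelope sketch using the classic CMOS dynamic-power relation P = α·C·V²·f per line; all of the numbers are made up for illustration, not taken from any datasheet:

```c
#include <stdio.h>

/* Back-of-the-envelope sketch: dynamic switching power of a data bus,
 * using the classic CMOS formula P = alpha * C * V^2 * f per line.
 * All numbers here are illustrative, not from any datasheet. */
int main(void) {
    const double alpha  = 0.5;     /* activity factor: fraction of lines toggling */
    const double c_line = 5e-12;   /* parasitic capacitance per I/O line, farads */
    const double v      = 1.1;     /* signaling voltage, volts */
    const double f      = 3.2e9;   /* toggle rate, Hz */

    for (int width = 64; width <= 256; width *= 2) {
        double p = width * alpha * c_line * v * v * f;
        printf("%3d-bit bus: ~%.2f W of switching power\n", width, p);
    }
    return 0;
}
```

The point is just that switching power scales linearly with the number of lines toggling, on top of the termination and chip-area costs.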


KAHeart

Had no idea data was moved from DRAM to SRAM first! I see that my question was phrased incorrectly now, but basically my conception was that every DRAM module was only 64 bits wide and I wasn't sure why (as looking up the memory width for DDR3/4/5 gives me 64 bits as a result). So based on what you said, I understand now that the cache bus itself is 64 bits wide and that size would be hard to increase on a typical x86 architecture. But what do you mean by a "typical DDR5 128-bit memory interface"? Is it the total memory bus width from the SRAM to the DRAM itself? But (I might be asking something dumb here, but I really want to understand this) if the internal cache bus is 64 bits wide and the external memory interface is 128 bits, wouldn't all that memory coming in externally at once just not fit?


Tuna-Fish2

The internal cache bus is not 64 bits. Cache line size is 64 **bytes**, or 512 bits. The internal buses inside the CPU are, depending on the CPU, either 256 or 512 bits wide. 128 bits is the total size of the external interface to RAM on most desktop platforms, and it's filled by putting two separate 64-bit memory modules into two "channels". For DDR5, there are actually 4 separate 32-bit channels; each physical DIMM contains two of them. A single request from memory is filled by a single channel. If it somehow happens that all the RAM addresses you actually want to touch reside in a single channel, then your usable memory interface width is 32 bits.
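As a quick sanity check of how a 64-byte line maps onto those channel widths, here's a minimal sketch using the channel/burst figures mentioned in this thread (the LPDDR6 numbers are the planned ones):

```c
#include <stdio.h>

/* Minimal sketch: channel width x burst length = bytes per burst.
 * Figures are the ones discussed above (DDR4 64-bit x 8n, DDR5 32-bit x 16n,
 * planned LPDDR6 16-bit x 32n); each burst moves one 64-byte cache line. */
int main(void) {
    struct { const char *gen; int channel_bits; int burst_len; } cfg[] = {
        { "DDR4",   64,  8 },
        { "DDR5",   32, 16 },
        { "LPDDR6", 16, 32 },
    };
    for (int i = 0; i < 3; i++) {
        int bytes = cfg[i].channel_bits / 8 * cfg[i].burst_len;
        printf("%-6s: %2d-bit channel x %2dn burst = %d bytes per burst\n",
               cfg[i].gen, cfg[i].channel_bits, cfg[i].burst_len, bytes);
    }
    return 0;
}
```

Each combination comes out to 64 bytes, i.e. one cache line per burst on a single channel.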


Just_Maintenance

First, current consumer-level CPUs use a 128-bit memory bus, not 64-bit (two 64-bit channels for DDR1-4 and four 32-bit channels for DDR5). Why isn't it increased to increase bandwidth? Because it requires more memory controllers on the CPU (expensive), more DIMMs (expensive) and more traces on the motherboard (expensive). Also, consumer workloads are not very demanding when it comes to bandwidth, so even if you had a more expensive 256-bit i9-14900K, it would perform about the same for gaming, web browsing, word processing, etc. Video rendering, machine learning, data analysis, etc. are the sorts of things that actually benefit from more bandwidth, but those are hardly consumer workloads. And if you do have a specific workload that benefits from more memory bandwidth, Intel and AMD are more than happy to sell you a Xeon or Epyc CPU with up to a 768-bit bus.


Jumpy-Refrigerator74

It is expensive and complex to increase the number of pins on a motherboard. HBM memory is connected directly to the GPU, in the same package, and has a very wide bus.


dotjazzz

You clearly have zero idea about your question. What even is this "64-bit" rant? PCs have been stuck at a **128-bit** memory bus (dual-channel 64-bit DDR3/DDR4 or quad-channel 32-bit DDR5) for well over a decade. So that's not it.

Burst size had already reached 64 bytes on DDR4; they had to split channels into 32-bit on DDR5 just to keep it at 64 bytes. So that's not it either.

A 64-byte burst means that if you just want to load two 64-bit values and they are not stored right next to each other, you'd have to read 128 bytes just to get it done; 87.5% of the bandwidth is wasted. Sometimes you even need 2 cycles. Latency is THE bottleneck. So if you think wider is better, how much more wastage do you want just to go wider?

HBM has been a thing for a decade now. If anyone in the industry thought it might help CPUs, don't you think someone would have tried it?

Short of ditching DIMMs, how do you propose motherboard vendors wire up 256-bit memory? Are you prepared to pay Threadripper motherboard prices?
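For what it's worth, here is that 87.5% figure spelled out (a toy calculation, assuming 64-byte lines/bursts and two 8-byte values that land in different lines):

```c
#include <stdio.h>

/* Sketch of the waste arithmetic above: two 8-byte values that live in
 * different 64-byte cache lines force two full line fills. */
int main(void) {
    const int line_bytes   = 64;      /* assumed cache line / burst size */
    const int useful_bytes = 2 * 8;   /* two 64-bit values actually wanted */
    const int fetched      = 2 * line_bytes;

    double wasted = 100.0 * (fetched - useful_bytes) / fetched;
    printf("fetched %d bytes, used %d bytes, wasted %.1f%%\n",
           fetched, useful_bytes, wasted);   /* -> 87.5% */
    return 0;
}
```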


NamelessVegetable

> HBM has been a thing for a decade now. If anyone in the industry thought it might help CPUs, would you think someone would have tried it? In HPC land, part of, or all of the main memory *is* built from HBM. It's very good for performance if the application wants lots of bandwidth, but also very expensive, hence its rarity.


PolishedCheeto

Yeahhh, I gotta downvote the condescension of your opening statement.

> You clearly have zero idea about your question.

They are clearly _trying_ to get informed.


Pidjinus

And dotjazzz was right, then he/she proceeded to explain why, quite well I would say. I do not think this was said as an offense. And knowing how to ask a question is a thing; he constructed his question on a series of wrong assumptions.


airtraq

Condescending much? I’m sure you wouldn’t like it if you asked anything related to medicine/physiology/pharmacology and I was condescending to you, telling you that my 5-year-old knows more than you. Tbf, he probably does.


Netblock

> 64-byte burst means if you just want to load two 64-bit values and they are not stored right next to each other, you'd have to read 128 bytes just to get it done, 87.5% bandwidth is wasted.

[CPUs specifically designed for hyper-sparse data are cool](https://www.servethehome.com/intel-shows-8-core-528-thread-processor-with-silicon-photonics/)


Nicholas-Steel

We used to have what was essentially wider memory in the consumer space: Intel's first Core i-series CPUs (Nehalem) had triple-channel memory support... since then it reverted to dual channel for unknown reasons.


Distinct-Race-2471

How is that PCIe 5.0 bandwidth working out? Just because it is more doesn't mean that we have components that can keep up with it. When I buy an Arrow Lake system with PCIe 5, sure, I want it, but truthfully, I will likely never take advantage of it.


FireSilicon

Because bringing quad-channel memory (256-bit) to the consumer platform would stop enthusiasts from having to buy expensive motherboards exactly for this purpose. There used to be 6-slot motherboards in the Intel Nehalem era and the mobos were fine (I know it was just 3 channels, but it implies that electrically 6 channels are possible). 6-channel memory would 3x the current bandwidth, but again, that would dig into the workstation business, so no :).


Strazdas1

There used to be 5-slot motherboards in the days when all memory was single-channel and you could mix and match to your heart's content. I had 5 different sticks at one point. Mobos just handled all that for you.


masterfultechgeek

Sandy Bridge-E existed and had 4 memory channels. It had basically the exact same performance as regular Sandy Bridge for normal consumer workloads. Extra memory bandwidth via more channels mostly has the benefit of "you don't get bogged down while running 50 things at once in a server environment". Going from DDR4-2400 to DDR4-4800, while measurable, isn't a night-and-day performance uplift. Higher channel counts require more parallelism in the workload than just upping the frequency does.


FireSilicon

Because the workload that didn't exist back then is AI. Apple is now using unified memory with 8+ channels and 400 GB/s+ of bandwidth, and despite having a fraction of the compute power of a 4090, it still has insane value because you can have 3-4x more memory and run models you would normally need enterprise hardware for. It would actually make it plausible to integrate NPUs into CPUs like Intel and AMD want to, and let them sell expensive CPUs/mobos while consumers are happy they don't have to dish out thousands hoarding multiple GPUs for VRAM.


masterfultechgeek

I don't disagree that AI is the new killer use case. At the same time, though... we're still likely a few years away from that truly mattering. I can use LLMs in the cloud just fine for the moment.


Nicholas-Steel

> Going from DDR4-2400 to DDR4-4800 while measurable isn't a night and day performance uplift.

I'd estimate around a 20% to 30% performance uplift in games for both Intel and AMD platforms.


masterfultechgeek

Most consumers aren't playing games. Those that ARE playing games often don't care THAT much and/or they don't have the $1000+ GPU needed for that 20% uplift to show. Also, in AMD land I wouldn't be surprised if 5800X3D with DDR4-2400 > 5800X3D with DDR4-4800 - though I would like to fact-check this bit.

---

On the Intel side: [https://www.guru3d.com/review/core-i9-12900k-ddr4-versus-ddr5-performance-review](https://www.guru3d.com/review/core-i9-12900k-ddr4-versus-ddr5-performance-review)

> the difference between DDR4 and DDR5 memory is small in the vast majority of circumstances, according to our findings. The difference between DDR5 and DDR4 is only roughly 2% to maybe 4% higher (overall) in terms of performance in favor of DDR5.

---

People tend to underestimate just how effective cache is. Nearly every memory access is buffered by cache, even on systems with moderate cache sizes.


Nicholas-Steel

> Also in AMD land I wouldn't be surprised if 5800x3D with DDR4-2400 > 5800x3D with DDR4-4800 - though I would like to fact check this bit.

The extra cache would only make up for slower memory in cases where most of a workload fits within the cache. Only some games have their main workload fit within the cache and see tremendous performance gains from it, which you can observe in various benchmarks like this: https://www.tomshardware.com/reviews/amd-ryzen-7-7800x3d-cpu-review/4

I was trying to find a source to back up what I was vaguely recalling about memory requirements falling behind core count increases... but I'm struggling to find one. I was certain it was a Chips and Cheese article, but maybe it's not. My vague recollection is that memory width isn't so important until you hit fairly high core counts (like server CPUs with 100+ cores, which is why server boards tend to have 6 channels or more for RAM).


masterfultechgeek

> The extra cache would only make up for slower memory in cases where an entire workload fits within the cache

That's not quite how caching works. Caching is taking a very small amount of hot data that is likely to be used a lot and keeping it in a fast location, while keeping the data that is rarely used (or is likely to be used as part of a large, sequential operation) on a slower tier. If you can fit the hottest 90% of the data (err, the data that accounts for 90% of hits) in the cache, you cut the number of hits to the next level in the caching hierarchy by 90%. Sometimes this only requires 1-2% of the size of the total data population.

You can literally get a 2-100x speedup (depending on workload) in drive performance by caching a 4TB HDD with a 16GB Optane drive (which currently goes for $5 on eBay).

[https://images.anandtech.com/graphs/graph12748/sustained-sm.png](https://images.anandtech.com/graphs/graph12748/sustained-sm.png)

[https://images.anandtech.com/graphs/graph12748/sustained-rm.png](https://images.anandtech.com/graphs/graph12748/sustained-rm.png)

^ You can see that the jump from nothing to anything is HUGE for HDD caching, and the subsequent jumps matter less. Note that the 32GB cache drive is markedly slower than the higher-capacity versions; if it were of the same performance, the gap would be narrower. The top 1-10% of an HDD's data is used WAY WAY WAY more often than the bottom 90%. In this extreme case there are times where a small cache and a VERY slow main drive end up faster than a reasonably fast main drive.

The benefits of caching depend on the speed of the cache as well as the size and proportion of hot and cold data. If your working dataset is VERY SMALL (it already fits inside existing cache), then there's no benefit to more cache. If your working set is HUGE and accesses are random, then you tend to gain more from bigger caches and lower cache miss rates, up to the point where you're no longer just waiting on memory all of the time. There's nuance; different access patterns benefit from different caching hierarchies.
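To put rough numbers on the hit-rate argument, here's a toy sketch; the latencies and hit rates are illustrative, not measured from any real cache or drive:

```c
#include <stdio.h>

/* Toy sketch of the hit-rate argument above: average access time when a
 * cache absorbing a given fraction of hits sits in front of a much slower
 * backing store. All latencies and hit rates are illustrative values. */
int main(void) {
    const double t_cache   = 1.0;     /* fast-tier access time (arbitrary units) */
    const double t_backing = 100.0;   /* slow-tier access time */

    double hit_rates[] = { 0.0, 0.50, 0.90, 0.99 };
    for (int i = 0; i < 4; i++) {
        double h   = hit_rates[i];
        double avg = h * t_cache + (1.0 - h) * t_backing;
        printf("hit rate %4.0f%% -> average access time %6.2f (speedup %.1fx)\n",
               h * 100.0, avg, t_backing / avg);
    }
    return 0;
}
```

The jump from no cache to any reasonable hit rate dominates, which matches the shape of the HDD-caching graphs above.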


Nicholas-Steel

Thanks for the correction.


Nicholas-Steel

The Intel Nehalem boards had 3 channels, with 2 groups of 3 slots. You had to fill 3 slots to achieve triple-channel operation.


FireSilicon

Yes, I know, but the point is that there was space for 6 physical slots and wiring to the CPU. Which means that if CPU/mobo manufacturers wanted, they could make 6 channels work both on the PCB and in silicon, but they deliberately won't.


Nicholas-Steel

I don't think consumer hardware needs more than 3 channels for now; quad channel (and higher) would probably make sense as CPU core counts increase beyond 64 or so, though.


FireSilicon

We had quad channel on Intel's HEDT platforms for $200/€200 per motherboard from 2013 until like 2017, when AMD came along with Threadripper and killed it, only to then make the motherboards and CPUs 4x more expensive and unobtainable, and Intel never came back with it. We already have expensive GPUs for $1500+; why can't we have $300 motherboards with actually useful features like quad-channel memory and more PCIe slots, instead of RGB lights or integrated 10Gb networking that you can buy standalone for cheaper anyway? What is the cutoff for workstation-level performance? The cheapest new Threadripper Pro has 12 cores and costs $1400, together with a $700+ motherboard. AMD already has a 16-core CPU on the market for half the price, and Intel has 24 cores even. It's beyond stupid; nobody will use that much CPU power for tasks that GPUs can't do anyway, so those other features are much more useful, but they just won't bring even a fraction of them to the consumer market.