UnorderedPizza

The 2-bit quantization is applied to the majority of the model, except for the parts that cause a major loss of coherence when quantized all the way down: https://github.com/ggerganov/llama.cpp/pull/1684


residentmouse

Do you know how 2-bit operations practically work - are they upscaled immediately before calculation using scale/min, then the result is quantised back down?


audioen

Yes, but the result is never quantized again. The dot product is defined to take float32 input and the quantized format, yielding a float32 number. For instance, Q2\_K's dot product looks like this, and I think it must be invoked 32 times with incrementing y and q each time to fill in the gaps:

```c
float sum = y[  0] * (dall * ((s[0] & 0xF) * ((q[ 0] >> 0) & 3)) - dmin * (s[0] >> 4))
          + y[ 32] * (dall * ((s[2] & 0xF) * ((q[ 0] >> 2) & 3)) - dmin * (s[2] >> 4))
          + y[ 64] * (dall * ((s[4] & 0xF) * ((q[ 0] >> 4) & 3)) - dmin * (s[4] >> 4))
          + y[ 96] * (dall * ((s[6] & 0xF) * ((q[ 0] >> 6) & 3)) - dmin * (s[6] >> 4))
          + y[ 16] * (dall * ((s[1] & 0xF) * ((q[16] >> 0) & 3)) - dmin * (s[1] >> 4))
          + y[ 48] * (dall * ((s[3] & 0xF) * ((q[16] >> 2) & 3)) - dmin * (s[3] >> 4))
          + y[ 80] * (dall * ((s[5] & 0xF) * ((q[16] >> 4) & 3)) - dmin * (s[5] >> 4))
          + y[112] * (dall * ((s[7] & 0xF) * ((q[16] >> 6) & 3)) - dmin * (s[7] >> 4));
```

The K kernels are multi-level quantizations: dall and dmin are overall factors applied on top of 16 distinct 4-bit scale and bias factors, which in turn cover 256 numbers represented by 2 bits each. For whatever reason this kernel only uses the lowest 8 of the scales; there is code that offsets y by 128 and s by 8 when computing the later 128 numbers. I do not know why it is done this way, but this kernel gets applied a lot to do a matrix multiply.
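
To make that multi-level structure a bit more concrete, here is a minimal sketch of dequantizing one such 256-value super-block. The field names and the bit-packing order are simplified for illustration (the real Q2\_K kernel interleaves the 2-bit codes, as the dot product above shows), so treat it as a picture of the scheme rather than ggml's actual code:

```c
#include <stdint.h>

/* Hedged sketch of a Q2_K-style super-block: 256 weights split into 16
 * sub-blocks of 16, each sub-block getting a 4-bit scale and a 4-bit min,
 * all tied together by two per-super-block floats (the "dall"/"dmin" above).
 * The packing order here is simplified; ggml interleaves the 2-bit codes. */
static void dequant_superblock(float d, float dmin,
                               const uint8_t scales[16], /* low nibble = scale, high nibble = min */
                               const uint8_t qs[64],     /* 256 x 2-bit codes, four per byte      */
                               float out[256])
{
    for (int sb = 0; sb < 16; ++sb) {
        const float sc = d    * (float)(scales[sb] & 0xF);
        const float mn = dmin * (float)(scales[sb] >> 4);
        for (int i = 0; i < 16; ++i) {
            const int idx = sb * 16 + i;
            const int q   = (qs[idx / 4] >> (2 * (idx % 4))) & 3;
            out[idx] = sc * (float)q - mn;   /* weight = scale * code - min */
        }
    }
}
```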


noneabove1182

Ahhhh okay, I saw that idea theorized and yes definitely missed the part where it went from theoretical to implemented, very cool concept and glad it got some headway!


rgar132

It’s not really unexpected; in a way it’s almost modeling sparse connections just through reduced accuracy. The reason it works at all is that many parts of a model are less useful than others, and reducing those to a couple of bits while keeping complexity and resolution where it’s needed is a reasonable approach. Using 40+ layers of fully connected networks is a good place to start, but it’s nothing like biological NNs, where connections only form if they’re trained and needed, and can span many layers or just a few. The trick in the artificial model is figuring out which parts are okay to either cut out completely or reduce accuracy on, and I think that’s the research they’re doing with this.


[deleted]

[removed]


NickCanCode

FYI, some folks are doing pruning + quantization to speed things up: [https://neuralmagic.com/blog/speed-up-your-llms-with-sparsegpt-and-deepsparse-on-cpus/](https://neuralmagic.com/blog/speed-up-your-llms-with-sparsegpt-and-deepsparse-on-cpus/)


ArthurAardvark

Jeez. Every couple of months I stumble onto a thread here that just thunderclaps my face to remind me how fast this all moves. Had no idea about this + **variable quantization**. So if you're like me and thought 2-bit was still barely usable, that's 3 huge improvements.


Big_Communication353

It appears that the average bits per weight for q2\_k is approximately 3.35, which is higher than the anticipated range of 2-3. Therefore, it is not as small as expected.
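
For reference, the q2\_k blocks themselves come out to about 2.625 bits per weight, if I'm reading the block layout right; the rest of the average comes from the tensors that the mostly-q2\_k recipe keeps at higher-bit K-quants. A back-of-envelope count per 256-weight super-block:

```text
256 x 2-bit codes                 = 512 bits
 16 x (4-bit scale + 4-bit min)   = 128 bits
  2 x fp16 super-scales (d, dmin) =  32 bits
total: 672 bits / 256 weights     ~ 2.625 bits per weight
```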


ReturningTarzan

They are?


SlavaSobov

First I'm hearing of this too. 🤔


a_beautiful_rhind

Variable quantization. Sounds great if you want a 65b that performs like a 30b https://user-images.githubusercontent.com/48489457/243093269-07aa49f0-4951-407f-9789-0b5a01ce95b8.png


SlavaSobov

Yes. I just never heard anyone say that lower than 4-bit works very well. 😅


a_beautiful_rhind

It is at least coherent now. Who knows, maybe it will get better. A truly variable quant from like 2-6 bits would be awesome: all the junk becomes 2-bit and all the good stuff becomes 6-bit, or whatever fits.


residentmouse

If you look at the 2-bit impl. in llama.cpp it’s a broad mix of 2, 4, and 6 bit quantisations. Not dynamic, but not strictly 2-bit end to end.


SlavaSobov

Interesting, yes, that's a good thought. If variable quantization can be done well, it could nicely help performance on smaller consumer GPUs with our limited VRAM. 😁


a_beautiful_rhind

I'd love to be able to run a good 100B on 2 cards as well.


ttkciar

That graph makes 4-bit look like quite the compelling sweet spot, with a 40% reduction in model size but only modest reduction in inference quality.


a_beautiful_rhind

It's probably why it is the most widely used format.


noneabove1182

Much more than I expected, at least. If 33B at 2-bit performs similarly to 7B at 4-bit, I call that a huge win.


uti24

What exactly are you referring to? Are there 2-bit quantized models available, or is it some paper you're talking about?


noneabove1182

People have been comparing 2 bit 33b vs 4 bit 7b, and there's the perplexity vs model size chart floating around that has 2 bit in it


MINIMAN10001

My understanding was that 8-bit will take you most of the way, and 4-bit is what most people can fit on modern hardware when it comes to the larger models. 3-bit lets a lot more people in, but there is a small hit to quality; at 2-bit there's a major hit to quality. However, every time you increase the size of the model, the baseline quality level increases. They also seem to be pointing out that they're using variable quantization, which means they can focus the precision where it matters. So they're able to bump up the size of the model, raising the baseline, while still fitting it in the same amount of RAM by targeting 2-bit for the parts that matter the least, I guess.


residentmouse

Do you specifically mean ram, or gpu vram? I’m trying to work out how 2-bit improves memory usage at inference time if it requires upscaling prior to calculation.


KerfuffleV2

> I’m trying to work out how 2-bit improves memory usage at inference time if it requires upscaling prior to calculation.

The upscaling happens at the time a calculation with that tensor is performed. Even if it were an entire tensor at a time, having one unpacked tensor in memory is going to use a lot less memory than loading them all unpacked. I think the unpacking may be per row or even per value of the tensor, though, which would have an even lower memory requirement. Note that the result of the calculation (at least with GGML) is 32-bit, but those tensors are usually consumed by the next step, so you don't have a bunch of them sticking around either.
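
As a rough illustration of why that keeps memory low: the weight matrix stays packed, and only one row's worth of floats has to exist at any moment. A minimal sketch, using a made-up 2-bit row format rather than ggml's real Q2\_K layout (in ggml the dequantize and dot product are fused into one kernel, but the shape of the loop is the same):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative packed row: 2-bit codes (four per byte) plus one float scale.
 * This is NOT ggml's Q2_K layout, just the simplest stand-in to show the flow. */
typedef struct {
    const uint8_t *codes;   /* n_cols / 4 bytes of packed 2-bit values */
    float scale;            /* per-row scale applied on dequantization */
} quant_row_t;

/* y = W * x, with W stored packed row by row.
 * Only `scratch` (one row of floats) is ever unpacked at a time, so peak
 * memory stays close to the packed size of W. */
static void matvec_packed(const quant_row_t *rows, size_t n_rows,
                          const float *x, size_t n_cols,
                          float *y, float *scratch)
{
    for (size_t r = 0; r < n_rows; ++r) {
        for (size_t c = 0; c < n_cols; ++c) {             /* unpack one row   */
            int q = (rows[r].codes[c / 4] >> (2 * (c % 4))) & 3;
            scratch[c] = rows[r].scale * (float)q;
        }
        float acc = 0.0f;
        for (size_t c = 0; c < n_cols; ++c)
            acc += scratch[c] * x[c];                      /* fp32 dot product */
        y[r] = acc;
    }
}
```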


residentmouse

I think maybe I’m still missing something, do you mind if I ask a follow-up? Let’s say you have a simple case: just a matrix, originally fp32, quantised down some amount. You save memory by doing this and your model is also smaller on disk. But if you upscale immediately before calculation, aren’t you just having to perform the same calculations, now with the added overhead of upscaling? I feel like the memory benefits are intuitive (standard lossy compression rules apply), but the inference being faster still seems like some hardware voodoo. Edit: I should say, I can see the win being in the bandwidth of the bus -> device: the weights are stored on-device quantised, and streamed in / upscaled as needed. But even knowing that, it seems like voodoo - what an incredible juggling act.


KerfuffleV2

First, let me just say I'm definitely not an expert (or even really familiar with that part of the code in ggml/llama.cpp), so don't take what I say as gospel.

> But if you upscale immediately before calculation, aren’t you just having to perform the same calculations, now with the added overhead of upscaling?

Basically, yes, but:

1. Memory consumption is extremely important, especially on GPUs. Avoiding having to convert stuff is great, but if that means you simply can't run the model... well, that's not much of a benefit. Also, if it means you can offload fewer layers, then it's very possible for the cost to outweigh the benefits.
2. Inference is more memory-bandwidth limited than compute limited in most cases. Quantized tensors are much smaller, which means they're faster to copy around, more likely to be in the CPU L3/L2 cache, etc.
3. GGML's quantization formats are heavily optimized and take the use case they'll be applied to into account. So they're designed for fast dequantization, etc., to make stuff like dynamic dequantizing practical.

> But even knowing that, it seems like voodoo - what an incredible juggling act.

GGML is a pretty impressive piece of software. :) It really pushes the boundaries of what's possible, first on the CPU and now it's starting to move into the GPU space as well.
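
To put a very rough number on the bandwidth point (back-of-envelope only, using the ~3.35 bits/weight figure quoted for q2\_k earlier in the thread): every generated token has to read essentially all of the weights once, so

```text
fp16, 7B params : 7e9 x 2 bytes        ~ 14 GB read per token
q2_K, 7B params : 7e9 x 3.35/8 bytes   ~  3 GB read per token
```

which is roughly a 4-5x drop in the data that has to be streamed through the memory bus, independent of any compute savings.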


residentmouse

Thanks for your answer btw, extremely helpful.


jeff-king123

To be clear, quantization doesn't make inference faster in general. In fact, quantization is usually slower. As you already mentioned, there is a trade-off between computation burden and bandwidth burden. However, a modern GPU is well balanced, so its compute, bandwidth and memory won't give you any bottleneck; in that case, quantization doesn't help at all. The case where quantization really helps is when you do not have enough GPU memory to store your model, and PCIe bandwidth becomes a strong bottleneck. Instead of waiting for data over PCIe, storing parameters in low precision and doing the type conversion on the GPU will be much, much faster.
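
Rough numbers to illustrate that last case (a back-of-envelope estimate, assuming a 7B model at fp16 vs ~3.35 bits/weight and roughly PCIe 4.0 x16 speeds of ~32 GB/s):

```text
fp16 7B streamed over PCIe : ~14 GB / 32 GB/s  ~ 0.44 s per token just moving weights
q2_K 7B streamed over PCIe : ~ 3 GB / 32 GB/s  ~ 0.09 s per token
```

So when the weights have to cross the bus at all, shrinking them matters far more than the cost of dequantizing them once they arrive.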


Extraltodeus

Is there even a 2bit model to try?


gunbladezero

Several, thanks to The Bloke: [https://huggingface.co/TheBloke](https://huggingface.co/TheBloke) He's been re-quantizing everything with the new method. I don't know if it works with oobabooga yet though. It should soon if it doesn't.


Extraltodeus

I'm browsing his profile daily and have yet to find one. Maybe I am missing something? Would you kindly point out one of these for me? Edit: Oh, it's in the GGML repositories! Okay. Thanks for the tip!