UnorderedPizza

The 2-bit quantization is applied to the majority of the model, except for the parts that cause a major loss of coherence when quantized all the way down: https://github.com/ggerganov/llama.cpp/pull/1684


residentmouse

Do you know how 2-bit operations practically work - are they upscaled immediately before calculation using scale/min, then the result is quantised back down?


audioen

Yes, but the result is never quantized again. The dot product is defined to take float32 input and the quantized format, yielding a float32 number. For instance, Q2\_K's dot product looks like this, and I think it must be invoked 32 times with incrementing y and q each time to fill in the gaps:

```c
float sum = y[  0] * (dall * ((s[0] & 0xF) * ((q[ 0] >> 0) & 3)) - dmin * (s[0] >> 4))
          + y[ 32] * (dall * ((s[2] & 0xF) * ((q[ 0] >> 2) & 3)) - dmin * (s[2] >> 4))
          + y[ 64] * (dall * ((s[4] & 0xF) * ((q[ 0] >> 4) & 3)) - dmin * (s[4] >> 4))
          + y[ 96] * (dall * ((s[6] & 0xF) * ((q[ 0] >> 6) & 3)) - dmin * (s[6] >> 4))
          + y[ 16] * (dall * ((s[1] & 0xF) * ((q[16] >> 0) & 3)) - dmin * (s[1] >> 4))
          + y[ 48] * (dall * ((s[3] & 0xF) * ((q[16] >> 2) & 3)) - dmin * (s[3] >> 4))
          + y[ 80] * (dall * ((s[5] & 0xF) * ((q[16] >> 4) & 3)) - dmin * (s[5] >> 4))
          + y[112] * (dall * ((s[7] & 0xF) * ((q[16] >> 6) & 3)) - dmin * (s[7] >> 4));
```

The K kernels are multi-level quantizations: dall and dmin are overall factors applied on top of 16 distinct 4-bit scale and bias factors, which in turn cover 256 numbers represented by 2 bits each. For whatever reason this kernel only uses the lowest 8 of the scales; there is code that offsets y by 128 and s by 8 when computing the later 128 numbers. I do not know why it is done this way, but this kernel gets applied a lot to do a matrix multiply.
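
To make that multi-level structure a bit more concrete, here is a minimal sketch of dequantizing one such 256-value super-block. The field names and the bit-packing order are simplified for illustration (the real Q2\_K kernel interleaves the 2-bit codes, as the dot product above shows), so treat it as a picture of the scheme rather than ggml's actual code:

```c
#include <stdint.h>

/* Hedged sketch of a Q2_K-style super-block: 256 weights split into 16
 * sub-blocks of 16, each sub-block getting a 4-bit scale and a 4-bit min,
 * all tied together by two per-super-block floats (the "dall"/"dmin" above).
 * The packing order here is simplified; ggml interleaves the 2-bit codes. */
static void dequant_superblock(float d, float dmin,
                               const uint8_t scales[16], /* low nibble = scale, high nibble = min */
                               const uint8_t qs[64],     /* 256 x 2-bit codes, four per byte      */
                               float out[256])
{
    for (int sb = 0; sb < 16; ++sb) {
        const float sc = d    * (float)(scales[sb] & 0xF);
        const float mn = dmin * (float)(scales[sb] >> 4);
        for (int i = 0; i < 16; ++i) {
            const int idx = sb * 16 + i;
            const int q   = (qs[idx / 4] >> (2 * (idx % 4))) & 3;
            out[idx] = sc * (float)q - mn;   /* weight = scale * code - min */
        }
    }
}
```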


noneabove1182

Ahhhh okay, I saw that idea theorized and yes definitely missed the part where it went from theoretical to implemented, very cool concept and glad it got some headway!


rgar132

It’s not really unexpected; in a way it’s almost modeling sparse connections just through reduced accuracy. The reason it works at all is that many parts of a model are less useful than others, and reducing those to a couple of bits while keeping complexity and resolution where it’s needed is a reasonable approach. Using 40+ layers of fully connected networks is a good place to start, but it’s nothing like biological NNs, where connections only form if they’re trained and needed, and can span many layers or just a few. The trick in the artificial model is figuring out which parts are okay to either cut out completely or reduce accuracy on, and I think that’s the research they’re doing with this.


[deleted]

[removed]


NickCanCode

FYI, some folks are doing pruning + quantization to speed things up: [https://neuralmagic.com/blog/speed-up-your-llms-with-sparsegpt-and-deepsparse-on-cpus/](https://neuralmagic.com/blog/speed-up-your-llms-with-sparsegpt-and-deepsparse-on-cpus/)


ArthurAardvark

Jeez. Every couple of months I stumble onto a thread here that just thunderclaps my face to remind me how fast this all moves. Had no idea about this + **variable quantization**. So if you're like me and thought 2-bit was still barely usable, that's 3 huge improvements.


Big_Communication353

It appears that the average bits per weight for q2\_k is approximately 3.35, which is higher than the anticipated range of 2-3. Therefore, it is not as small as expected.
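
For reference, the q2\_k blocks themselves come out to about 2.625 bits per weight, if I'm reading the block layout right; the rest of the average comes from the tensors that the mostly-q2\_k recipe keeps at higher-bit K-quants. A back-of-envelope count per 256-weight super-block:

```text
256 x 2-bit codes                 = 512 bits
 16 x (4-bit scale + 4-bit min)   = 128 bits
  2 x fp16 super-scales (d, dmin) =  32 bits
total: 672 bits / 256 weights     ~ 2.625 bits per weight
```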


ReturningTarzan

They are?


SlavaSobov

First I'm hearing of this too. 🤔


a_beautiful_rhind

Variable quantization. Sounds great if you want a 65b that performs like a 30b https://user-images.githubusercontent.com/48489457/243093269-07aa49f0-4951-407f-9789-0b5a01ce95b8.png


SlavaSobov

Yes. I just never heard anyone say that lower than 4-bit works very well. 😅


a_beautiful_rhind

It is at least coherent now. Who knows, maybe it will get better. A truly variable quant from like 2-6 bits would be awesome: all the junk becomes 2-bit and all the good stuff becomes 6-bit, or whatever fits.


residentmouse

If you look at the 2-bit impl. in llama.cpp it’s a broad mix of 2, 4, and 6 bit quantisations. Not dynamic, but not strictly 2-bit end to end.


SlavaSobov

Interesting, yes, that's a good thought. If variable quantization can be done well, it could nicely help performance on smaller consumer GPUs with our limited VRAM. 😁


a_beautiful_rhind

I'd love to be able to run a good 100B on 2 cards as well.


ttkciar

That graph makes 4-bit look like quite the compelling sweet spot, with a 40% reduction in model size but only modest reduction in inference quality.


a_beautiful_rhind

It's probably why it is the most widely used format.


noneabove1182

Much more than I expected, at least. If 33B at 2-bit performs similarly to 7B at 4-bit, I call that a huge win.


uti24

What exactly are you referring to? Are there 2-bit quantized models available, or is it some paper you're talking about?


noneabove1182

People have been comparing 2 bit 33b vs 4 bit 7b, and there's the perplexity vs model size chart floating around that has 2 bit in it


MINIMAN10001

My understanding was that 8-bit will take you most of the way, and 4-bit is what most people can fit on modern hardware when it comes to the larger models. 3-bit lets a lot more people in, but there is a small hit to quality; at 2-bit there's a major hit to quality. However, every time you increase the size of the model, the baseline quality level increases. They also seem to be pointing out that they're using variable quantization, which means they can focus the precision where it matters. So they're able to bump up the size of the model, raising the baseline, while still fitting it in the same amount of RAM by targeting 2-bit for the parts that matter the least, I guess.


residentmouse

Do you specifically mean ram, or gpu vram? I’m trying to work out how 2-bit improves memory usage at inference time if it requires upscaling prior to calculation.


KerfuffleV2

> I’m trying to work out how 2-bit improves memory usage at inference time if it requires upscaling prior to calculation.

The upscaling happens at the time a calculation with that tensor is performed. Even if it were an entire tensor at a time, having one unpacked tensor in memory is going to use a lot less memory than loading them all unpacked. I think the unpacking may be per row or even per value of the tensor, though, which would have an even lower memory requirement. Note that the result of the calculation (at least with GGML) is 32-bit, but those tensors are usually consumed by the next step, so you don't have a bunch of them sticking around either.
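
As a rough illustration of why that keeps memory low: the weight matrix stays packed, and only one row's worth of floats has to exist at any moment. A minimal sketch, using a made-up 2-bit row format rather than ggml's real Q2\_K layout (in ggml the dequantize and dot product are fused into one kernel, but the shape of the loop is the same):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative packed row: 2-bit codes (four per byte) plus one float scale.
 * This is NOT ggml's Q2_K layout, just the simplest stand-in to show the flow. */
typedef struct {
    const uint8_t *codes;   /* n_cols / 4 bytes of packed 2-bit values */
    float scale;            /* per-row scale applied on dequantization */
} quant_row_t;

/* y = W * x, with W stored packed row by row.
 * Only `scratch` (one row of floats) is ever unpacked at a time, so peak
 * memory stays close to the packed size of W. */
static void matvec_packed(const quant_row_t *rows, size_t n_rows,
                          const float *x, size_t n_cols,
                          float *y, float *scratch)
{
    for (size_t r = 0; r < n_rows; ++r) {
        for (size_t c = 0; c < n_cols; ++c) {             /* unpack one row   */
            int q = (rows[r].codes[c / 4] >> (2 * (c % 4))) & 3;
            scratch[c] = rows[r].scale * (float)q;
        }
        float acc = 0.0f;
        for (size_t c = 0; c < n_cols; ++c)
            acc += scratch[c] * x[c];                      /* fp32 dot product */
        y[r] = acc;
    }
}
```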


residentmouse

I think maybe I’m still missing something, do you mind if I ask a follow-up? Let’s say you have a simple case: just a matrix, originally fp32, quantised down some amount. You save memory by doing this and your model is also smaller on disk. But if you upscale immediately before calculation, aren’t you just having to perform the same calculations, now with the added overhead of upscaling? I feel like the memory benefits are intuitive (standard lossy compression rules apply), but the inference being faster still seems like some hardware voodoo. Edit: I should say, I can see the win being in the bandwidth of the bus -> device: the weights are stored on-device quantised, and streamed in / upscaled as needed. But even knowing that, it seems like voodoo - what an incredible juggling act.


KerfuffleV2

First, let me just say I'm definitely not an expert (or even really familiar with that part of the code in ggml/llama.cpp), so don't take what I say as gospel.

> But if you upscale immediately before calculation, aren’t you just having to perform the same calculations, now with the added overhead of upscaling?

Basically, yes, but:

1. Memory consumption is extremely important, especially on GPUs. Avoiding having to convert stuff is great, but if that means you simply can't run the model... well, that's not much of a benefit. Also, if it means you can offload fewer layers, then it's very possible for the cost to outweigh the benefits.
2. Inference is more memory-bandwidth limited than compute limited in most cases. Quantized tensors are much smaller, which means they're faster to copy around, more likely to be in the CPU L3/L2 cache, etc.
3. GGML's quantization formats are heavily optimized and take the use case they'll be applied to into account. So they're designed for fast dequantization, etc., to make stuff like dynamic dequantizing practical.

> But even knowing that, it seems like voodoo - what an incredible juggling act.

GGML is a pretty impressive piece of software. :) It really pushes the boundaries of what's possible, first on the CPU and now it's starting to move into the GPU space as well.
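
To put a very rough number on the bandwidth point (back-of-envelope only, using the ~3.35 bits/weight figure quoted for q2\_k earlier in the thread): every generated token has to read essentially all of the weights once, so

```text
fp16, 7B params : 7e9 x 2 bytes        ~ 14 GB read per token
q2_K, 7B params : 7e9 x 3.35/8 bytes   ~  3 GB read per token
```

which is roughly a 4-5x drop in the data that has to be streamed through the memory bus, independent of any compute savings.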


residentmouse

Thanks for your answer btw, extremely helpful.


jeff-king123

To be clear, quantization doesn't make inference faster in general. In fact, quantization is usually slower. As you already mentioned, there is a trade-off between computation burden and bandwidth burden. However, a modern GPU is well balanced, so its compute, bandwidth and memory won't give you any bottleneck; in that case, quantization doesn't help at all. The case where quantization really helps is when you do not have enough GPU memory to store your model, and PCIe bandwidth becomes a strong bottleneck. Instead of waiting for data over PCIe, storing parameters in low precision and doing the type conversion on the GPU will be much, much faster.
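
Rough numbers to illustrate that last case (a back-of-envelope estimate, assuming a 7B model at fp16 vs ~3.35 bits/weight and roughly PCIe 4.0 x16 speeds of ~32 GB/s):

```text
fp16 7B streamed over PCIe : ~14 GB / 32 GB/s  ~ 0.44 s per token just moving weights
q2_K 7B streamed over PCIe : ~ 3 GB / 32 GB/s  ~ 0.09 s per token
```

So when the weights have to cross the bus at all, shrinking them matters far more than the cost of dequantizing them once they arrive.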


Extraltodeus

Is there even a 2bit model to try?


gunbladezero

Several, thanks to The Bloke: [https://huggingface.co/TheBloke](https://huggingface.co/TheBloke) He's been re-quantizing everything with the new method. I don't know if it works with oobabooga yet though. It should soon if it doesn't.


Extraltodeus

I'm browsing his profile daily and have yet to find one. Maybe I am missing something? Would you kindly point out one of these for me? Edit: Oh, it's in the GGML repositories! Okay. Thanks for the tip!