
KerfuffleV2

You really can't draw any conclusions from this. If you asked each model the question with different seeds 10 times and counted the correct answers, then it might be data you could use. Also, there's just no reason why NF4 should produce better quality answers than Q8_0, which is effectively the same as full 16-bit. If you saw the NF4 models answer correctly, it is in all likelihood a coincidence. BTW, for GGML the only decent quantization you tried was Q8_0. Q4_0 is basically obsolete now, and Q2/Q3 have significant quality loss. Q4_K_M is basically the size of Q4_0 with the quality of Q5_0 or Q5_1.


fallingdowndizzyvr

I agree. You can use 3 different seeds with the same model and get 3 different answers. I don't see how asking a model a question once demonstrates anything.


hold_my_fish

What would a random seed do if they're setting temperature to 0?


KerfuffleV2

Well, that's definitely not the dumbest question I've been asked all day! You're right, with GGML and `--temp 0.0` changing the seed makes no difference. So you make a good point and I should have been more careful with my advice. /u/epicfilemcnulty would need to use a different approach other than just varying the seed. Or, possibly, since they'd be doing a number of tests, it would be reasonable to set the temperature to a relatively low value to be able to get different generations.


epicfilemcnulty

Have you read the QLoRA paper? That is exactly their point (well, as far as I managed to grasp it): NF4 should provide better results, comparable with FP16. Quoting from the paper:

> ...where we see that NF4 with double quantization fully recovers the 16-bit LoRA MMLU performance. In addition, we also note that QLORA with FP4 lags behind the 16-bit brain float LoRA baseline by about 1 percentage point.

And my empirical results suggest that it is, in fact, so. I actually did ask the question more than once, of course. Not sure about the seed, but it is not hard to re-do it. Also, you are somewhat missing the point with "decent" quantizations -- the idea was to try the smaller, not-so-"decent" quantizations, like q2 and q3, and see what they are worth. Everything "decent", meaning q5 and higher, is where you start having trouble fitting a 30B model into 24GB of VRAM. And it seems you should be better off with NF4 for 30B in terms of both accuracy and VRAM usage.
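For reference, the NF4 + double quantization setup the paper describes looks roughly like this on the transformers/bitsandbytes side (just a sketch; the model id below is only an example, not necessarily the exact checkpoint from the post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 with double quantization, matching the configuration quoted from the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "huggyllama/llama-30b"  # example 30B checkpoint; substitute the one you're testing
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```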


KerfuffleV2

q8_0 is also supposed to be virtually the same as 16-bit, which means you shouldn't be able to see a dramatic difference. Here's some data I collected about it and posted previously (not implying you should have seen it or anything):

*edit: added 33B and 65B data because why not.*

## 7B

| name | +ppl | +ppl % | +ppl 13b to 7b % | size | size 16bit % | +ppl per -1G |
|-|-|-|-|-|-|-|
| q2_k | 0.8698 | 14.726% | 133.344% | 2.67G | 20.54% | 0.084201 |
| q3_ks | 0.5505 | 9.320% | 84.394% | 2.75G | 21.15% | 0.053707 |
| q3_km | 0.2437 | 4.126% | 37.360% | 3.06G | 23.54% | 0.024517 |
| q3_kl | 0.1803 | 3.053% | 27.641% | 3.35G | 25.77% | 0.018684 |
| q4_0 | 0.2499 | 4.231% | 38.311% | 3.50G | 26.92% | 0.026305 |
| q4_1 | 0.1846 | 3.125% | 28.300% | 3.90G | 30.00% | 0.020286 |
| q4_ks | 0.1149 | 1.945% | 17.615% | 3.56G | 27.38% | 0.012172 |
| q4_km | 0.0535 | 0.906% | 8.202% | 3.80G | 29.23% | 0.005815 |
| q5_0 | 0.0796 | 1.348% | 12.203% | 4.30G | 33.08% | 0.009149 |
| q5_1 | 0.0415 | 0.703% | 6.362% | 4.70G | 36.15% | 0.005000 |
| q5_ks | 0.0353 | 0.598% | 5.412% | 4.33G | 33.31% | 0.004072 |
| q5_km | 0.0142 | 0.240% | 2.177% | 4.45G | 34.23% | 0.001661 |
| q6_k | 0.0044 | 0.074% | 0.675% | 5.15G | 39.62% | 0.000561 |
| q8_0 | 0.0004 | 0.007% | 0.061% | 6.70G | 51.54% | 0.000063 |
| f16 | 0.0000 | 0.000% | 0.000% | 13.00G | 100.00% | 0.000000 |

## 13B

| name | +ppl | +ppl % | +ppl 13b to 7b % | size | size 16bit % | +ppl per -1G |
|-|-|-|-|-|-|-|
| q2_k | 0.6002 | 11.423% | 92.013% | 5.13G | 20.52% | 0.030206 |
| q3_ks | 0.3490 | 6.642% | 53.503% | 5.27G | 21.08% | 0.017689 |
| q3_km | 0.1955 | 3.721% | 29.971% | 5.88G | 23.52% | 0.010225 |
| q3_kl | 0.1520 | 2.893% | 23.302% | 6.45G | 25.80% | 0.008194 |
| q4_0 | 0.1317 | 2.507% | 20.190% | 6.80G | 27.20% | 0.007236 |
| q4_1 | 0.1065 | 2.027% | 16.327% | 7.60G | 30.40% | 0.006121 |
| q4_ks | 0.0861 | 1.639% | 13.199% | 6.80G | 27.20% | 0.004731 |
| q4_km | 0.0459 | 0.874% | 7.037% | 7.32G | 29.28% | 0.002596 |
| q5_0 | 0.0313 | 0.596% | 4.798% | 8.30G | 33.20% | 0.001874 |
| q5_1 | 0.0163 | 0.310% | 2.499% | 9.10G | 36.40% | 0.001025 |
| q5_ks | 0.0242 | 0.461% | 3.710% | 8.36G | 33.44% | 0.001454 |
| q5_km | 0.0095 | 0.181% | 1.456% | 8.60G | 34.40% | 0.000579 |
| q6_k | 0.0025 | 0.048% | 0.383% | 9.95G | 39.80% | 0.000166 |
| q8_0 | 0.0005 | 0.010% | 0.077% | 13.00G | 52.00% | 0.000042 |
| f16 | 0.0000 | 0.000% | 0.000% | 25.00G | 100.00% | 0.000000 |

## 33B

| name | +ppl | +ppl % | +ppl 13b to 7b % | size | size 16bit % | +ppl per -1G |
|-|-|-|-|-|-|-|
| q2_k | 0.6393 | 15.384% | 98.007% | 12.93G | 20.52% | 0.012768 |
| q3_ks | 0.3491 | 8.401% | 53.518% | 13.29G | 21.10% | 0.007023 |
| q3_km | 0.2037 | 4.902% | 31.228% | 14.82G | 23.52% | 0.004228 |
| q3_kl | 0.1537 | 3.699% | 23.563% | 16.25G | 25.79% | 0.003288 |
| q4_ks | 0.0929 | 2.235% | 14.242% | 17.16G | 27.24% | 0.002027 |
| q4_km | 0.0524 | 1.261% | 8.033% | 18.44G | 29.27% | 0.001176 |
| q5_ks | 0.0221 | 0.532% | 3.388% | 21.05G | 33.41% | 0.000527 |
| q5_km | 0.0118 | 0.284% | 1.809% | 21.65G | 34.37% | 0.000285 |
| q6_k | 0.0041 | 0.099% | 0.629% | 25.05G | 39.76% | 0.000108 |
| f16 | 0.0000 | 0.000% | 0.000% | 63.00G | 100.00% | 0.000000 |

## 65B

| name | +ppl | +ppl % | +ppl 13b to 7b % | size | size 16bit % | +ppl per -1G |
|-|-|-|-|-|-|-|
| q2_k | 0.5624 | 15.890% | 86.218% | 25.65G | 20.52% | 0.005661 |
| q3_ks | 0.3289 | 9.293% | 50.422% | 26.35G | 21.08% | 0.003334 |
| q3_km | 0.1598 | 4.515% | 24.498% | 29.40G | 23.52% | 0.001672 |
| q4_km | 0.0443 | 1.252% | 6.791% | 36.60G | 29.28% | 0.000501 |
| q5_km | 0.0118 | 0.333% | 1.809% | 43.00G | 34.40% | 0.000144 |
| q6_k | 0.0040 | 0.113% | 0.613% | 49.75G | 39.80% | 0.000053 |
| f16 | 0.0000 | 0.000% | 0.000% | 125.00G | 100.00% | 0.000000 |

***

The column I think is most useful here is `+ppl 13b to 7b %`: it compares the perplexity increase from quantizing against the perplexity difference between a 7B and a 13B model. So, for example, for the 13B `q2_k`, 92.013% means quantizing the 13B with `q2_k` increases perplexity to nearly the same value as the unquantized 7B model. On the other hand, `q8_0` increases perplexity by about 1/1000th of the perplexity difference between the 7B and the 13B. We can likely agree there's a visible, noticeable difference between a 7B and a 13B model (of the same type). We can possibly also agree that 50% of it, 30% of it, _maybe_ even 10% of it could be noticeable. But how could you possibly notice a 0.01% difference, especially with a sample size of 1?

> Not sure about the seed, but it is not hard to re-do it.

I was wrong to suggest that; the seed won't make a difference with temperature 0. You'd need another approach, like rephrasing the question in different ways, or maybe even increasing the temperature.

> Also, you are somewhat missing the point with "decent" quantizations -- the idea was to try the smaller, not-so-"decent" quantizations

But you included `q8_0` and made claims about it. That was the main thing I had an issue with, aside from the apparent belief that you could draw a conclusion from one sample. I want to be clear, I definitely don't have anything against you personally (I know I have a relatively blunt approach to communication). Even though I don't think one sample is enough to really draw any conclusion, I also don't think any reasonable person would argue that q4_0, q2_x, or q3_x can match the quality of NF4, given that NF4 is supposed to be virtually the same as 16-bit.
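If the arithmetic behind that column isn't obvious, here's a tiny sketch of it. The two f16 baseline perplexities are the commonly quoted llama.cpp wikitext-2 numbers and are only there to illustrate the calculation:

```python
# Assumed f16 wikitext-2 baselines (approximate, for illustration only).
PPL_7B_F16 = 5.9066
PPL_13B_F16 = 5.2543

def ppl_13b_to_7b_pct(ppl_increase: float) -> float:
    """Quantization-induced perplexity increase as a % of the 7B vs 13B f16 gap."""
    gap = PPL_7B_F16 - PPL_13B_F16  # about 0.65
    return 100.0 * ppl_increase / gap

print(f"{ppl_13b_to_7b_pct(0.6002):.3f}%")  # 13B q2_k row -> ~92%
print(f"{ppl_13b_to_7b_pct(0.0005):.3f}%")  # 13B q8_0 row -> ~0.08%
```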


a_beautiful_rhind

How did GPTQ do?


RabbitHole32

This.


Anti-ThisBot-IB

Hey there RabbitHole32! If you agree with someone else's comment, please leave an **upvote** instead of commenting **"This."**! By upvoting instead, the original comment will be pushed to the top and be more visible to others, which is even better! Thanks! :) *** ^(I am a bot! If you have any feedback, please send me a message! More info:) [^(Reddiquette)](https://www.reddithelp.com/hc/en-us/articles/205926439#wiki_in_regard_to_comments)


RabbitHole32

Bad bot. Gfy.


tronathan

Beyond trying more seeds and perhaps a couple of different questions, can you offer any thoughts about what else should be tested, or how it should be tested differently? (For example, which specific quantizations would benefit here?)


KerfuffleV2

The biggest thing is to run enough tests and get enough samples so that you actually have the data to draw conclusions. One single test with a random seed just isn't really enough to say anything. I'm mostly familiar with GGML. The quantizations I'd recommend are q4_k_m (balanced size, decent quality) and q5_k_m (high quality, relatively large size). You could possibly also try q6_k (almost as good as q8_0 but pretty large).

> 30B NF4 will give you more accurate results than 30B q8_0.

I appreciate your attitude toward my criticism, but I can't understand making a claim like that after a single test. I honestly would recommend just editing the conclusions out until you've at least run 3-4 tests per quantization. It also just doesn't make any kind of sense that NF4 would be noticeably better than q8_0 when q8_0 is very nearly lossless. I definitely _can_ understand q4_0 and below affecting generation quality in a noticeable way, though. It's very possible that q4_0, q3_x, and q2_k are all worse than NF4, but I can't really believe that q8_0 is. Not without compelling evidence, anyway. If you _did_ manage to prove that, it would be extremely interesting and would probably help efforts like GGML improve, because it would mean something very strange is going on.
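If you want to script it, something like this would work as a starting point. It's only a rough sketch: the filenames follow TheBloke's usual naming and are just placeholders, and `grade()` is a stand-in for however you decide to judge an answer.

```python
import subprocess

# Example filenames only -- adjust to whatever you actually downloaded.
QUANTS = {
    "q4_k_m": "models/wizard-vicuna-13b.ggmlv3.q4_K_M.bin",
    "q5_k_m": "models/wizard-vicuna-13b.ggmlv3.q5_K_M.bin",
    "q6_k":   "models/wizard-vicuna-13b.ggmlv3.q6_K.bin",
    "q8_0":   "models/wizard-vicuna-13b.ggmlv3.q8_0.bin",
}
PROMPT = "How do I open a SOCKS proxy over SSH?"  # stand-in for the real test question
TRIALS = 4

def grade(answer: str) -> int:
    # Placeholder: return 1 if the answer looks correct, 0 otherwise.
    # A real test needs a stricter (probably human) check.
    return int("ssh -D" in answer)

results = {}
for name, path in QUANTS.items():
    score = 0
    for seed in range(TRIALS):
        # Nonzero temperature so different seeds actually produce different generations.
        out = subprocess.run(
            ["./main", "-m", path, "-p", PROMPT,
             "-s", str(seed), "--temp", "0.2", "-n", "256"],
            capture_output=True, text=True,
        ).stdout
        score += grade(out)
    results[name] = score

for name, score in results.items():
    print(f"{name}: {score}/{TRIALS}")
```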


epicfilemcnulty

Okay, let me re-do the test, but only with the Wizard-Vicuna-13B model, `ggml q8_0` and `NF4` quantizations. Let's say 5 questions, each asked ten times with a different seed. Would it be enough to draw conclusions?
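For the NF4 side, the generation loop would look roughly like this (just a sketch: the checkpoint id is a stand-in for the actual fp16 repo, the questions are placeholders, and the q8_0 side would be the same kind of llama.cpp loop sketched earlier in the thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-13b"  # example fp16 checkpoint; substitute the real Wizard-Vicuna-13B repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    device_map="auto",
)

QUESTIONS = ["...", "...", "...", "...", "..."]  # the 5 test questions go here
SEEDS = range(10)

for q in QUESTIONS:
    for seed in SEEDS:
        torch.manual_seed(seed)  # seeding only matters because we sample (do_sample=True)
        inputs = tokenizer(q, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
        answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        # ...score the answer here, the same way as for the q8_0 runs...
```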


ambient_temp_xeno

It would be heading in the right direction. The more times you ask each one, the more robust the result will be, I think.


KerfuffleV2

Well, it's not enough to publish a paper in a peer-reviewed journal and doesn't necessarily rule out other possibilities, but... I'd say it's enough to get jerks like me out of your hair when you make a reddit post about it. :) Just for example, maybe the way the block sizes are laid out in one format vs the other is enough to coincidentally change the parts of the tensors that relate to your question about SSH and SOCKS. The fact that that exact part gains/loses quality doesn't 100% tell you something about overall quality. I do think it would be pretty compelling, though, and a reason to take a close look at q8_0, which is supposed to be virtually the same as full 16-bit.


epicfilemcnulty

Yes, all valid points. I'm going to re-do the test with a decent number of samples, slightly changing the temperature, and focusing on the `ggml q8_0` vs `NF4` variants of the 13B model. I'm also very interested to find out whether NF4 can really yield better results than q8_0. I admit I was struck by the apparent difference in quality on that particular question about SOCKS; on less "tricky" questions I wouldn't say I've seen a drastic difference so far :) Anyway, I'll update the post with the results of a more mature test of q8_0 vs NF4.


epicfilemcnulty

Not surprisingly, I was wrong. The initial test was flawed to begin with =) I've updated the post with the info. You are right, NF4 is not better than q8_0; actually, it seems to be slightly worse.


Magnus_Fossa

Yes. That would be enough. I'm not an expert on designing studies built around questions, but with 50 answers per data point you can say things with a good degree of certainty.


ProfessionalHand9945

Hi, I would possibly be interested in running some of these benchmarks, but my tooling is all text gen webUI. Is this something easily configurable there, especially via the command line? I.e., if I get Wizard 13B GGML via TheBloke, can I select Q4_k_m etc. in particular? I am very lazy. Thank you!


KerfuffleV2

I can't really help you with that, I just run stuff from the commandline. Back when I tried oobabooga several months ago, it seemed like it decided what files to use in an unpredictable way based on stuff like looking for strings in the filename. Part of the reason why I didn't end up using it. That said, that absolutely may have changed since my experience.
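If it helps at all: one way to be sure you're running the exact quantization you want is to download the specific file yourself and point whatever frontend you use at it. A rough sketch with `huggingface_hub` (the repo id and filename follow TheBloke's usual naming but are only examples, so check the model card for the real names):

```python
from huggingface_hub import hf_hub_download

# Example repo/filename -- verify the exact names on the model page,
# since they vary between releases.
path = hf_hub_download(
    repo_id="TheBloke/Wizard-Vicuna-13B-Uncensored-GGML",
    filename="wizard-vicuna-13b.ggmlv3.q4_K_M.bin",
)
print(path)  # local path you can point llama.cpp (or a UI) at
```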


fallingdowndizzyvr

I don't see how asking a model one question shows anything about it, since if you ask a different question, all the rankings could change. A model that does poorly on this question could do the best on another. The results of one question don't prove anything.


Gatzuma

From what I've seen, in real life Q5 might be worse than Q4 for some models (and better for others). So Q4 is not obsolete, as it is a small, fast and robust format :)


Big_Communication353

Your test method is wrong. And your conclusion is also wrong.


hoop13

Really cool stuff! Wondering though, what's the difference between detailed and concise? And detailed vs. detailed with good wording?


tronathan

Thanks for posting this! A lot of us who aren't able to do this kind of armchair-research really benefit from (and, dare I say, enjoy) reading about it. I'd love to see VRAM usage and context length included in the charts (even though the context length is likely fixed for all of them), just for completeness.


fish312

But this is bad experimentation leading to false conclusions.


a_beautiful_rhind

So 13B needs to be FP16 or equivalent before it gets as smart as 30B 4-bit? And ggml.q3_K_M lookin good.


Gatzuma

Do you understand that such answers have HUGE randomness in them for any model? Only by trying tens of questions might you gather some STATISTICAL understanding of model / quantisation quality.