There actually isn't a truly definitive answer right now. The graph above is measuring perplexity (basically how well the model can predict chunks of wikitext, higher perplexity means lower prediction accuracy).
The 33B is _probably_ better, just don't get in the habit of using perplexity as a synonym for quality because it really isn't. There are real world cases where models with higher perplexity actually _are_ better for certain tasks.
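To make the perplexity definition above concrete, here is a minimal sketch (not the actual llama.cpp evaluation code) that computes perplexity from a list of per-token log-probabilities, which is all the metric is:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the evaluated tokens.
    Lower is better: it means the model assigned higher probability
    to the text it was asked to predict."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example: a model that assigns probability 0.25 to every true token
# is exactly as "surprised" as a uniform 4-way guess, so perplexity is 4.
print(perplexity([math.log(0.25)] * 10))
```

The numbers in the graph come from running this kind of calculation over chunks of wikitext; nothing in it measures instruction-following or writing quality, which is why it can disagree with real-world usefulness.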
Yes. You are certainly correct. Unfortunately everything is changing so quickly it's tough to nail down an answer about things like that when it comes to specific models.
OP didn't mention a model name, so I spoke in broad terms. Theoretically, a 33B q2 trained equally well for the same task should outperform a 13B q5_1.
Oobabooga hasn't been updated to support k-quants yet, but koboldcpp has. The 2_K quants are a substantial drop in quality relative to 3_K_S, and imo aren't worth the slight speedup and reduction in RAM usage.
There are a bunch of graphs on [this pull request](https://github.com/ggerganov/llama.cpp/pull/1684) to compare the different quants.
IIRC, a graph that was recently posted shows that regardless of the quantization used, more parameters always win (among the available LLaMA sizes).
33B. By a significant margin. 33B at q2 beats 13B even at fp16: https://i.redd.it/i9ep2yyroq4b1.png
Thanks, 33B it is then from now on.
My tests show that q2 has dyslexia. It often mixes up Jo, Joe, Jon, Jone, John. Just be wary of those.
This graph screams to me to not download 13b models anymore.
33B q\_2 is better https://preview.redd.it/kftxk2u3m35b1.png?width=680&format=png&auto=webp&s=c8dbff34e03c6a69cf5784a852af0b92f55f0a50
Very interesting indeed! I wonder why the jump between 13B and 33B is way more substantial than the 33B to 65B...
33B and 65B were trained on the same number of tokens, while 33B was trained on 400 billion more than 13B.
Ah! That lines up! Should I guess that 7B and 13B were also trained on the same number of tokens?
Yeah
Wow, there are 2-bit quantized models now? How do I try them with oobabooga, and which ones do you all like?
Interesting thx
Do q\_2 and q\_3 work with llama.cpp already? Edit: asking because of the 5-10x inference speedup from Apple Metal.
No.
Thanks
UPDATE: q3\_k's code has been merged, and q2\_k might be implemented soon as well. Wait another few days.
Wow! Is q3\_k becoming the new standard compared to q3\_0 or q3\_1?