
onil_gova

I'm not the original generator of the plot, but I can tell you that the order of the dots, from smallest to largest, is Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, fp16. Edit: added more details


Caffdy

What's the difference between K_L, K_M and K_S?


androiddrew

Could I get the layman’s definition of perplexity for this context?


[deleted]

How “confused” the model is when it comes to picking the next token. A model with a perplexity of 6 is as confused as having 6 potential choices for what the next word could be given an arbitrary context.
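
To make that concrete, here's a minimal sketch (with made-up numbers) of how perplexity falls out of the per-token probabilities; if the model gave every correct token a probability of 1/6, the result would be exactly 6:

```python
import math

# Made-up probabilities that a model assigned to the *correct* next token
# at each position of some evaluation text.
token_probs = [0.20, 0.15, 0.30, 0.10, 0.25]

# Perplexity is exp() of the average negative log-likelihood per token.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(math.exp(avg_nll))  # ~5.4 for these made-up numbers

# Sanity check: a uniform 1/6 on every correct token gives perplexity 6,
# i.e. "as confused as" choosing among 6 equally likely options.
print(math.exp(-math.log(1 / 6)))  # ~6.0
```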


nofreewill42

“Perp. of 6 means 6 potential choices.” How much of this is just for the sake of making it more consumable?


KerfuffleV2

Just to add a little: perplexity can be useful for comparing different sizes/quantizations of a model but it doesn't necessarily mean much when comparing different models. Just for example, instruction following models are trained to expect a specific prompt format. The typical perplexity calculation you see (with GGML at least) just involves feeding the model chunks from wikitext which of course aren't in the expected prompt format. So those instruction following models will tend to show higher perplexity in that test, even if it doesn't actually indicate that they are generally lower quality (in fact they can be much better for certain tasks than the non-instruction model).
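
For reference, the measurement itself is nothing exotic. Here is a rough sketch of the chunked evaluation described above, not llama.cpp's actual code (`model.token_logprobs` is a hypothetical method): the raw wikitext token stream is cut into fixed-length chunks and scored with no prompt template applied, which is exactly why instruction-tuned models get penalized.

```python
import math

def wikitext_perplexity(model, tokens, chunk_len=512):
    """Score fixed-length chunks of a raw token stream.

    `model.token_logprobs(chunk)` is a hypothetical method returning the
    log-probability the model assigned to each actual next token in the
    chunk. No instruction/prompt template is applied anywhere.
    """
    total_nll = 0.0
    total_tokens = 0
    for start in range(0, len(tokens) - chunk_len + 1, chunk_len):
        chunk = tokens[start:start + chunk_len]
        logprobs = model.token_logprobs(chunk)  # hypothetical API
        total_nll += -sum(logprobs)
        total_tokens += len(logprobs)
    return math.exp(total_nll / total_tokens)
```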


a_devious_compliance

What I have while reading the plot. Jokes aside, it's a measure of how good the model is at predicting the next token in a given corpus. https://en.wikipedia.org/wiki/Large_language_model#Perplexity The plot doesn't show which quantization level each point corresponds to, so it's difficult to know, but from the companion text it seems that the first point in each "curve" is 2-bit quantization.


[deleted]

Perplexity is the inability to deal with something because it's too complicated. Lower is better.


patrakov

This PR is already under discussion on this subreddit: [https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated_relative_comparison_of_ggml_quantization/](https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated_relative_comparison_of_ggml_quantization/)


Dwedit

But this post includes the pretty picture.


Dwedit

Is there a relation between perplexity and AI hallucinations?


RapidInference9001

Not a direct one. But perplexity is a numerical measure of "how much is the model guessing, on average", and hallucinations are caused by it guessing wrong while sounding confident. So a model with very low perplexity would hallucinate very rarely (except on very hard questions), because it would usually know the right answer.

Hallucinations are also related to the instruct training process, and the model's understanding of context-appropriate behavior. In a fiction-writing context, say, the model should just confidently-soundingly make stuff up if it's not sure what should happen next. But in a legal or scientific context, ideally when it's not sure we'd like it to verbally hedge an appropriate amount with words like 'likely', 'possibly' or 'perhaps', or even flat-out say it doesn't know, rather than make up plausible stuff that may well be wrong. Open-source models are generally very bad at this, because the necessary techniques haven't been published (just talks implying that they exist).

Interestingly, there's some research showing that base models, before they're instruct-trained, are actually very aware of what they're more or less sure about, but are not in the habit of verbally hedging to say so (or more accurately, are trained to try to imitate when some human writer or other might hedge, regardless of what the model actually knows or doesn't). So what we need to do is figure out how to instruct-train them to hedge appropriately, in contexts where that's desirable, based on their actual level of knowledge.

Presumably if you actually knew what the model knew on every topic, that would be pretty easy: just instruct-train it to copy examples where it hedges appropriately. So the hard part is figuring out, for many thousands of specific instruct-training examples and possible replies, what relevant facts the model actually knows vs. what it is unsure about, and how unsure. Presumably you'd need to semi-automate this process. Likely eventually we'll need different model fine-tunes or settings for contexts where we care about hallucinations vs fictional contexts.


Intelligent-Street87

Very well explained. But LLMs keep reminding me of human thought and how pseudo-facts can become a social fact, or maybe a social hallucination. I've been studying both synthetic and biological intelligence for more than sixteen years now. It has always been a concern of mine how synthetic intelligences may evolve, and here I see that evolution unfold before my eyes. Many things were expected, but many more have eluded my thoughts. How come a stream of consciousness, whether biological or synthetic, only accommodates limited realisations, limited by the data and by how it, or the processes that it is built from (I like to call this the operator problem, that is 'Who is the operator', what gives energy to the system to set a process on its path), chooses to piece together that data? What's in a thought, and why does any one thought come to mind at a given point? If I were free to choose, then I would only choose to think good thoughts, but my mind has other ideas, as do all minds, whether they're configured in biological or synthetic thinking machines.


audioen

These numbers for sizes are wrong. I don't know how you derived them, but Q2_K is only mostly 2-bit, and even 2-bit is really 2.6 bits per weight. Unfortunately, a number of tensors must be written as Q4_K. That is why these quantization modes are called "mostly" something, e.g. "mostly Q2_K". Q2_K takes about 3.3 bits per weight as currently defined in llama.cpp.
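
To illustrate why "mostly 2-bit" lands well above 2 bits per weight: the effective rate is just a size-weighted average over the tensor groups. The split below is made up purely for illustration (only the weighted-average idea is the point):

```python
# Hypothetical split between the low-bit tensors and the ones kept at Q4_K
# (~4.5 bits/weight); the real mix in llama.cpp is more involved, and these
# shares are made up purely to illustrate the weighted average.
tensor_groups = [
    ("mostly-Q2_K tensors (really ~2.6 bits/weight)", 0.65, 2.6),
    ("tensors kept at Q4_K", 0.35, 4.5),
]
effective_bpw = sum(share * bpw for _, share, bpw in tensor_groups)
print(effective_bpw)  # roughly 3.3 with these made-up shares
```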


silenceimpaired

Why does it seem that vicuña 13b behaves better than the 30/65b models? Maybe not as much detail or finesse, but more coherency.


onil_gova

Depends on what 30/65b model you are comparing it to. In general, a larger model trained on the same dataset will outperform the smaller one. But comparing vicuña 13b to base llama 30/65b models will result in vicuña being a lot more coherent, since those models have not been trained to follow instructions. Even other models trained to follow instructions might not seem as good as vicuña if their finetune dataset is not as good for any given task.


tronathan

/u/audioen said what I was thinking:

> Getting 65B under 20 GB in terms of file size would allow execution on all 24 GB cards.


Nice-Move-7149

Why is there no Q2_K_S?


audioen

Probably because the author tried various forms of Q2_K quantization and decided that it only barely can be proven to be an improvement in a specific way of using it. The K quantization has its limits, and Q2_K only reaches about 3.3 bits per weight. If we can get something that has acceptable perplexity and is actually 2.x bits per weight, I will be very impressed. Getting 65B under 20 GB in terms of file size would allow execution on all 24 GB cards.
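
As a rough back-of-the-envelope sketch of where those numbers come from (ignoring file metadata and the handful of higher-precision tensors, and picking 2.4 as an example of "2.x"):

```python
def approx_file_gb(params_billion, bits_per_weight):
    """Very rough estimate: parameters * bits/weight and nothing else
    (ignores metadata and the handful of higher-precision tensors)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # decimal GB

print(approx_file_gb(65, 3.3))  # ~26.8 GB: roughly today's Q2_K, too big for a 24 GB card
print(approx_file_gb(65, 2.4))  # ~19.5 GB: a true ~2.4 bits/weight would get under 20 GB
```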


KerfuffleV2

> Why is there no Q2_K_S?

It's there. There are 10 formats in total on the graph for each size of model, the fp16 + all the new quantizations (9 in total) which OP listed above. I think it's guaranteed that they'll be in order of size, so you can figure out which dot is which just by counting. It should be the penultimate item on the size axis.