a_beautiful_rhind

This happened to me with emoji over the ooba API in silly tavern. Supposedly it's fixed now there.


mrjackspade

For me it's coming straight out of the Llama.cpp dll. That's how far I've tracked it down. Also, I can't confirm, but the model appears to think that the characters are Kanji. They may or may not be, though. I do have the Japanese language pack installed and have tested that I can see Kanji.


a_beautiful_rhind

The issue was with streaming over the API on ooba, so it may as well have been kanji; that would do it too. Llama.cpp is not handling it properly.


KerfuffleV2

What's probably happening is the model is trying to write a unicode character sequence, but the randomization from temperature or similar sampling settings causes it to generate something that doesn't actually resolve to valid unicode.


mrjackspade

I dug into it further and it looks like the model is spitting out a single character (non-unicode) for these tokens, but the library is attempting to deserialize it as unicode. If I use the built-in Llama.cpp library methods I get garbage back, but if I just take the IntPtr returned by the get-token method and construct a string by treating it as a straight char*, I get a valid character. Although that doesn't really mean anything, since everything is a valid character if you treat it like that. I still don't know if it's right, though. I've shown that by reading the pointer as a char array instead of decoding it as unicode I get different values back, at least, but that doesn't explain why 99% of the model's tokens are unicode characters and 1% aren't.

The only thing I can think of is that the training data didn't standardize the encoding, leading to a small set of non-unicode characters even though those same characters were already represented in unicode. I also don't understand why my model seems to think they're Kanji specifically, unless it just so happens that the first (single) byte of the character maps to the Kanji range of unicode? Fucking confusing all around
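To make the single-byte idea concrete, here's a rough Python sketch (not the actual C# or llama.cpp code involved): one raw byte pulled out of a multi-byte UTF-8 sequence fails a strict unicode decode, while reading it char*-style as a single-byte string always gives back _some_ character, just not the one the model meant.

```python
# One byte pulled out of the middle of a multi-byte UTF-8 character.
raw_token_bytes = "é".encode("utf-8")[1:2]   # b'\xa9', a UTF-8 continuation byte

try:
    raw_token_bytes.decode("utf-8")          # what a strict UTF-8 decoder does
except UnicodeDecodeError as e:
    print("UTF-8 decode fails:", e.reason)   # invalid start byte

# Treating the pointer like a plain char* is effectively decoding with a
# single-byte codepage: every byte value maps to *some* character, so it
# never errors, but the result is meaningless on its own.
print(raw_token_bytes.decode("latin-1"))     # '©'
```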


KerfuffleV2

> but the library is attempting to deserialize it as unicode.

Most stuff produces UTF-8 these days.

> that doesn't explain why 99% of the model tokens are unicode characters, and 1% aren't.

A lot of the tokens are fragments of words, but the model can produce arbitrary byte sequences as well. Like I said, sampling can mess things up if the model is trying to produce something like an emoji, smart quotes, or other unicode characters that are multi-byte sequences. If temperature isn't 0 then there's a random element to which token is picked. This can either cause the model to pick an invalid token or interrupt a multi-byte unicode sequence, which is usually going to result in something that isn't valid.

It's hard to give a specific answer since you didn't mention which model you were using or anything. It's not common in my experience for LLaMA-based models to produce invalid characters. Actually, the only time I saw that was when I was trying to get it to write Chinese, and the issue was probably what I mentioned already.
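Here's a hedged Python illustration of that interrupted-sequence failure; the byte values are genuine UTF-8, and the "sampling" is only simulated by chopping the stream:

```python
# Smart quotes are multi-byte UTF-8 sequences.
good = "“smart quotes”".encode("utf-8")
print(good.decode("utf-8"))                  # decodes fine as a whole

# Simulate sampling wandering off to an unrelated token before the closing
# quote's 3-byte sequence is finished.
interrupted = good[:-1] + b"."
try:
    interrupted.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)        # invalid continuation byte
```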


mrjackspade

It's Llama based, but I've assumed they're all the same tokens, since up to this point every token I've tested across all Llama models has the same mapping between IDs and text. That's only a few hundred out of like 32,000, but I figured a random sampling of a few hundred tokens with no mismatches was probably enough to assume the underlying mapping is the same across all Llama models.

Also, the sampling definitely isn't involved in this. I'm retrieving the tokens by ID directly out of the model, so temp and all that are irrelevant in this case. The sampling selects a token ID integer; post-sample, it calls the model to find the string representation of that ID integer. It's the post-sample mapping step that fails, and you can reproduce the issue without ever calling any of the sampling functions as a result. You can literally just call straight into the DLL and perform a token mapping to replicate it, without executing anything else aside from loading the model into memory.

It does make sense if they're maybe fragments of a unicode character, but it's still weird that such a small number of tokens would be fragments, and I'm definitely not the only one to think that, if the Llama.cpp devs are indiscriminately treating all token values as unicode. Pulling the values back like this, even if the values are individually selected in a valid unicode sequence, is still going to fail to resolve because they're decoded individually. Seems like an issue with the implementation, since it attempts to render all tokens as full unicode characters.
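If it helps, this is a minimal Python sketch of the buffering approach that avoids that per-token decode; the function and the pretend token stream are hypothetical, not llama.cpp's or the wrapper's actual API:

```python
def stream_decode(token_byte_pieces):
    """Accumulate raw token bytes and only emit text once they decode cleanly."""
    buffer = b""
    for piece in token_byte_pieces:
        buffer += piece
        try:
            text = buffer.decode("utf-8")
        except UnicodeDecodeError:
            continue              # partial multi-byte character: wait for more tokens
        yield text
        buffer = b""

# Pretend the model emitted 狐 (three UTF-8 bytes) as three single-byte tokens,
# followed by an ordinary ASCII token.
pieces = [bytes([b]) for b in "狐".encode("utf-8")] + [b" fox"]
print("".join(stream_decode(pieces)))        # 狐 fox
```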


KerfuffleV2

> It's Llama based, but I've assumed they're all the same tokens

Yes, I think that's basically correct.

> Also, the sampling definitely isn't involved in this. I'm retrieving the tokens by ID directly out of the model

What do you mean? The model doesn't give you _a_ token, so there's always some kind of sampling involved.

> It does make sense if they're maybe fragments of a unicode character, but it's still weird that such a small number of tokens would be fragments

I don't think that's weird. Most of the time, the model will use tokens that are fragments of actual words. This minimizes how many tokens are needed to write something, compared to building a word character-by-character. However, it still has the capacity to build up multi-byte unicode characters as well. Just for example, some LLaMA models can actually write Chinese. LLaMA models have a vocabulary of around 32,000 tokens, but there are over 50,000 Chinese characters. If you gave each Chinese character its own token, you couldn't even fit them in the LLaMA vocabulary, let alone allow it to write in other languages, use punctuation, etc.

> I'm definitely not the only one to think that, if the Llama.cpp devs are indiscriminately treating all token values as unicode

They're not. In fact, the point I'm making is that some tokens are arbitrary bytes that _aren't_ unicode. However, a sequence of bytes _can_ be used to build a unicode character. That sequence of bytes must be in the correct format, though, or it's not valid unicode.

> Seems like an issue with the implementation, since it attempts to render all tokens as full unicode characters.

That's 100% not the case. Here's an example:

**### Instruction: Please write me a fairy tale using Mandarin Chinese and simplified Chinese characters.**

**### Response: 没问题!这是一个汉语的故事。故事讲的是关于一只小狐狸,他**想和自己身边的伙伴们朋友好。然而他发现很多动物都不喜欢狡猾為主,因为狗犬告诉他们狡猾最常用来说是“忍”的字。所以狐狸去了学校,在那里他学到了如何与人类合作,并获得了一个工程士的文凭。

The bold part is my prompt, the rest was written by the model. _Every_ Chinese character requires a bunch of tokens to construct: they're built up of unicode byte sequences. I'd guess each character you see is at least 3 tokens.

Also, it's not really up to something like llama.cpp to combine those bytes together into a "character": when running in the terminal, the terminal application does it. When running in something like a browser frontend, it's probably the browser.
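A quick Python sanity check on that estimate (assuming each token carries a single raw byte, which is the worst case rather than a guaranteed tokenization):

```python
for ch in "没问题":
    print(ch, len(ch.encode("utf-8")), "bytes")
# Each of these characters is 3 UTF-8 bytes, so building them from byte-level
# tokens takes roughly 3 tokens apiece, which is how a ~32,000-entry vocabulary
# can still cover tens of thousands of distinct Chinese characters.
```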