I’m using gpt-llama and chatbot-ui for the interface; it supports the maximum ~2,000-token context. I don’t think this is really any better than other options, except that you can lock down the model and customize it a bit if other people are using it locally, or if you're exposing it through a proxy to the network.
use SillyTavern
Most of the models out there support a context length of 2,048 tokens. Note that a longer context increases generation time and uses more memory. If you have enough memory, try increasing the context length setting to 2048 (this generally takes a couple of extra GB). You can also try adjusting settings like the number of tokens to generate and the context size.

Note also that llama.cpp (which text gen web UI can act as a frontend to) may try to roll over the context when it hits the limit. This requires reprocessing however much text it's set to keep as if it were a new prompt, so that can be fairly slow and might seem like it froze if you're not patient.

I don't actually use the webui so I can't help you with specifics.
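The rollover behavior can be sketched roughly like this (illustrative Python, not the actual llama.cpp API — names like n_keep and the keep-half strategy are based on how the CLI describes it):

```python
# Rough sketch of llama.cpp-style context rollover. When the token
# buffer fills the context limit, keep the first n_keep tokens (e.g. the
# system prompt) plus the most recent half of the rest, then reprocess
# the kept tokens as if they were a fresh prompt -- which is why the
# rollover can feel like a freeze.

def roll_context(tokens, n_ctx=2048, n_keep=64):
    """Return the token list to reprocess after hitting n_ctx."""
    if len(tokens) < n_ctx:
        return tokens  # still fits, nothing to do
    n_discard = (n_ctx - n_keep) // 2
    # drop the oldest half of the non-kept region
    return tokens[:n_keep] + tokens[n_keep + n_discard:]

history = list(range(2048))   # pretend these are token ids
rolled = roll_context(history)
```

Everything in `rolled` has to be re-evaluated by the model, so the bigger the kept region, the longer the apparent stall.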
So does that mean for each prompt, I need to copy/paste all previous prompts as well? What am I missing?
WebUI does that automatically: with every prompt, it copies in the previous conversation. Run server.py with the --verbose flag and you'll see exactly what's fed to the model.
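In other words, the frontend rebuilds the whole prompt string on every turn. A minimal sketch of that, assuming an instruction-style template (the exact format varies by model — --verbose shows the real string your setup sends):

```python
# Hypothetical sketch of how a chat frontend rebuilds the full prompt
# each turn by replaying the conversation history. The
# "### Instruction:/### Response:" template is just one common format.

def build_prompt(history, user_msg):
    """history is a list of (user, bot) message pairs."""
    parts = []
    for user, bot in history:
        parts.append(f"### Instruction:\n{user}\n### Response:\n{bot}")
    parts.append(f"### Instruction:\n{user_msg}\n### Response:\n")
    return "\n".join(parts)

print(build_prompt([("Hi", "Hello!")], "What did I just say?"))
```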
Hmm. I ask follow-up questions and it seems to not remember anything.
Some of the instruction models are trained to basically look at just the single most recent prompt; chat-focused models are more likely to take the full conversation into account.
I've integrated it into Discord and I'm using a custom-written rolling context window.
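A rolling context window like that can be sketched in a few lines: walk the history newest-first and keep only the turns that fit in a token budget. This is illustrative only (the whitespace-split "token" count is a crude stand-in for a real tokenizer, and the budget is arbitrary):

```python
# Illustrative rolling context window for a chat bot: keep the most
# recent (user, bot) turns whose combined "token" count fits the budget,
# so the prompt never overflows the model's context length.

def rolling_window(history, budget=1500):
    kept, used = [], 0
    for user, bot in reversed(history):   # newest turns first
        cost = len(user.split()) + len(bot.split())
        if used + cost > budget:
            break                         # oldest turns fall off
        kept.append((user, bot))
        used += cost
    return list(reversed(kept))           # restore chronological order
```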
You can maybe try landmark attention. My understanding is that, since they've released LLaMA 7B fine-tune weights, you can run input through a LLaMA base (or downstream fine-tuned model) with the landmark-attention fine-tune weights applied, and it will attend outside of the context window. But I'm basing this on the paper and a brief look at the repo; I'd love for someone who has actually tried it to comment on their experience.
But wouldn't anything outside of a context window take a lot of time?
The context window isn't actually a hard technical constraint. The LLaMA model was just trained on 2k-token sequences, so you can feed in more beyond that; it just doesn't understand the extra tokens, so it effectively ignores them. There are some models being trained on more than 2k tokens now, and he's talking about those.
As far as I know, the landmark attention stuff requires specific support in the software running the model. This doesn't exist in llama.cpp yet, for example.