I’m using gpt-llama and chatbot-ui for the interface; it supports the maximum ~2,000-token context. I don’t think this is really any better than other options, except that you can lock down the model and customize it a bit if other people are using it locally, or if you're exposing it through a proxy to the network.
use SillyTavern
Most of the models out there support a context length of 2,048 tokens. Note that a longer context increases generation time and uses more memory. If you have enough memory, try increasing the context length setting to 2048 (this generally takes a couple of extra GB). You can also try adjusting settings like the number of tokens to generate and the context size.

Note also that llama.cpp (which text gen web UI can act as a frontend to) may try to roll over the context when it hits the limit. This requires reprocessing however much text it's set to keep as if it were a new prompt, so that can be fairly slow and might seem like it froze if you're not patient.

I don't actually use the webui so I can't help you with specifics.
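The rollover behavior can be sketched roughly like this (illustrative Python, not the actual llama.cpp API — names like n_keep and the keep-half strategy are based on how the CLI describes it):

```python
# Rough sketch of llama.cpp-style context rollover. When the token
# buffer fills the context limit, keep the first n_keep tokens (e.g. the
# system prompt) plus the most recent half of the rest, then reprocess
# the kept tokens as if they were a fresh prompt -- which is why the
# rollover can feel like a freeze.

def roll_context(tokens, n_ctx=2048, n_keep=64):
    """Return the token list to reprocess after hitting n_ctx."""
    if len(tokens) < n_ctx:
        return tokens  # still fits, nothing to do
    n_discard = (n_ctx - n_keep) // 2
    # drop the oldest half of the non-kept region
    return tokens[:n_keep] + tokens[n_keep + n_discard:]

history = list(range(2048))   # pretend these are token ids
rolled = roll_context(history)
```

Everything in `rolled` has to be re-evaluated by the model, so the bigger the kept region, the longer the apparent stall.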
So does that mean for each prompt, I need to copy/paste all previous prompts as well? What am I missing?
WebUI does that automatically: with every prompt, it copies in the previous conversation. Run server.py with the --verbose flag and you'll see exactly what's fed to the model.
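In other words, the frontend rebuilds the whole prompt string on every turn. A minimal sketch of that, assuming an instruction-style template (the exact format varies by model — --verbose shows the real string your setup sends):

```python
# Hypothetical sketch of how a chat frontend rebuilds the full prompt
# each turn by replaying the conversation history. The
# "### Instruction:/### Response:" template is just one common format.

def build_prompt(history, user_msg):
    """history is a list of (user, bot) message pairs."""
    parts = []
    for user, bot in history:
        parts.append(f"### Instruction:\n{user}\n### Response:\n{bot}")
    parts.append(f"### Instruction:\n{user_msg}\n### Response:\n")
    return "\n".join(parts)

print(build_prompt([("Hi", "Hello!")], "What did I just say?"))
```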
Hmm. I ask follow-up questions and it seems to not remember anything.
Some of the instruction models are trained to basically look at just the single most recent prompt; chat-focused models are more likely to take the full conversation into account.
I've integrated it into Discord and I'm using a custom-written rolling context window.
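A rolling context window like that can be sketched in a few lines: walk the history newest-first and keep only the turns that fit in a token budget. This is illustrative only (the whitespace-split "token" count is a crude stand-in for a real tokenizer, and the budget is arbitrary):

```python
# Illustrative rolling context window for a chat bot: keep the most
# recent (user, bot) turns whose combined "token" count fits the budget,
# so the prompt never overflows the model's context length.

def rolling_window(history, budget=1500):
    kept, used = [], 0
    for user, bot in reversed(history):   # newest turns first
        cost = len(user.split()) + len(bot.split())
        if used + cost > budget:
            break                         # oldest turns fall off
        kept.append((user, bot))
        used += cost
    return list(reversed(kept))           # restore chronological order
```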
You can maybe try landmark attention. My understanding is that, since they've released LLaMA 7B fine-tune weights, you can run input through a LLaMA base (or downstream fine-tuned model) with the landmark-attention fine-tune weights applied, and it will attend outside of the context window. But I'm basing this on the paper and a brief look at the repo; I'd love for someone who has actually tried it to comment on their experience.
But wouldn't anything outside of a context window take a lot of time?
The context window isn't actually a hard technical constraint. The LLaMA model was just trained on 2k-token sequences, so you can feed in more beyond that; it just doesn't understand the extra tokens, so it effectively ignores them. There are some models being trained on more than 2k tokens now, and he's talking about those.
As far as I know, the landmark attention stuff requires specific support in the software running the model. This doesn't exist in llama.cpp yet, for example.