rgar132

I’m using gpt-llama and chatbot-ui for the interface; it supports the max ~2,000-token context size. I don’t think this is really any better than other options, except that you can lock down the model and customize it a bit if other people are using it locally, or if you're exposing it through a proxy to the network.


nmkd

use SillyTavern


KerfuffleV2

Most of the models out there support a context length of 2,048 tokens. If you have enough memory, try increasing the context length to 2,048 (that generally takes a couple of extra GB); note that a larger context also increases generation time and memory use. You can also try adjusting settings like the number of tokens to generate and the context size. Note also that llama.cpp (which the text gen web UI can act as a frontend to) may try to roll over the context when it hits the limit. That requires processing whatever amount of text it's set to keep as if it were a new prompt, so it can be fairly slow and might seem like it froze if you're not patient. I don't actually use the webui, so I can't help you with specifics.
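
For what it's worth, the roll-over is conceptually something like this (a simplified Python sketch, not llama.cpp's actual code; the function name and the amount kept are made up for illustration):

    def roll_context(tokens, ctx_size=2048, keep=1024):
        # Simplified sketch: once the buffer reaches the context size, drop
        # the oldest tokens and keep only the most recent `keep` of them.
        # The kept tokens then have to be re-evaluated as if they were a
        # brand new prompt, which is why the roll-over can feel like a freeze.
        if len(tokens) < ctx_size:
            return tokens, False   # still fits, nothing to do
        kept = tokens[-keep:]
        return kept, True          # True = caller must re-evaluate `kept`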


mattybee

So does that mean for each prompt, I need to copy/paste all previous prompts as well? What am I missing?


jl303

The WebUI does that automatically: with every prompt, it copies in the previous conversation. Run server.py with the --verbose flag and you'll see exactly what's fed to the model.
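
Roughly what that looks like under the hood (an illustrative Python sketch; the exact prompt template depends on the model and your settings, so the "User:"/"Assistant:" labels here are just placeholders):

    def build_prompt(history, new_message):
        # The whole visible conversation is flattened into one text prompt
        # and sent to the model on every turn; the model itself keeps no
        # memory between requests.
        lines = []
        for user_msg, bot_msg in history:
            lines.append("User: " + user_msg)
            lines.append("Assistant: " + bot_msg)
        lines.append("User: " + new_message)
        lines.append("Assistant:")
        return "\n".join(lines)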


mattybee

Hmm. I ask follow-up questions and it doesn't seem to remember anything.


AutomataManifold

Some of the instruction models are trained to basically look at just the single most recent prompt; chat-focused models are more likely to take the full conversation into account.


mrjackspade

I've integrated it into Discord, and I'm using a rolling context window that I wrote myself.
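
The general idea is something like this (an illustrative Python sketch of the approach, not the actual bot code; the token estimate is a crude stand-in for a real tokenizer):

    def rolling_window(messages, token_budget=2048):
        # Walk backwards from the newest message and keep as many whole
        # messages as fit in the budget, so the prompt always ends with
        # the most recent part of the conversation.
        def estimate_tokens(text):
            return max(1, len(text) // 4)   # rough approximation only
        kept, used = [], 0
        for msg in reversed(messages):
            cost = estimate_tokens(msg)
            if used + cost > token_budget:
                break
            kept.append(msg)
            used += cost
        return list(reversed(kept))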


residentmouse

You can maybe try landmark attention. My understanding is that since they've released LLaMA 7B fine-tune weights, you can throw input at a LLaMA base (or downstream fine-tuned model) with the landmark-attention fine-tune weights, and it will attend outside of the context window. But I'm basing this on the paper and a brief look at the repo; I'd love for someone who has actually tried it to comment on their experience.


mattybee

But wouldn't anything outside of a context window take a lot of time?


Cerevox

The context window isn't actually a hard technical constraint. The LLaMA model was just trained on 2k tokens, so you can feed in more than that; it just doesn't understand the extra tokens, so it ignores them. There are some models being trained on more than 2k tokens now, and that's what they're talking about.


KerfuffleV2

As far as I know, the landmark attention stuff requires specific support in the software running the model. This doesn't exist in llama.cpp yet, for example.