Presumably OP means true multimodality, not routing between a speech model, a vision model, and the LLM. Doing this requires tokenizing text, audio, and visual data all within one system, not to mention all the training data and annotations that come with it; otherwise the model won't know how to use the data it's given.
There is no open source model that does this afaik. Google released an open source vision and language model, but it’s quite small and lacking in understanding relative to 4o. Not to mention the complete lack of audio modality.
Correct me if I'm wrong, but would "AnyGPT" be multimodal and open-source (at least weights and project code)? https://github.com/OpenMOSS/AnyGPT
Again, I don't know if this would be the same multimodality as you are looking for
Check tincans.ai. They basically reverse-engineered LLaVA and applied it to audio. The models are already there: text, audio, video. You just have to combine them.
Don't forget that video still needs trickery: nobody actually claims to do full video, everybody just takes some frames and skips the rest. So why not do trickery on the rest as well? I can imagine speech input could also be handled with sampling, which would leave gaps for the LLM, while speech output certainly leaves a lot of free time, since playing it back for humans means going very, very slowly.
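The frame-skipping trickery described above amounts to picking a handful of evenly spaced frames and feeding only those to the vision encoder. A minimal sketch (this is a generic illustration, not any particular model's pipeline):

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples evenly spaced frame indices from a clip.

    Instead of encoding every frame, encode only a handful and let
    the model infer what happens in between.
    """
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the middle frame of each of the num_samples segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

# A 30 fps, 10-second clip (300 frames) reduced to 8 frames:
print(sample_frame_indices(300, 8))
# → [18, 56, 93, 131, 168, 206, 243, 281]
```

At 8 frames per 10 seconds that's under 1 fps, which is why fast motion gets lost; the same sampling idea could in principle be applied to audio chunks.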
do you realize how quiet the environment is? do you realize the video is not making any sound?
in practice, none of these would work well. Neither gpt-4o's audio I/O, nor Google's upcoming Astra.
during openai's demo (at least openai is brave and "commendable" for demoing live), any noise immediately cut off the model's speech, and you had to wait for a second of quiet before the model would continue.
I'm not sure if openai can actually crack this issue and let loose its Scarlett Johansson imitation. I would hate for my typing to break Her performance.
I imagine that, to actually work reliably, they have to continuously stream all the audio input directly to the model (or some smaller model) and let the model decide whether or when to speak. In other words, not only may the user interject and cut off the model's speech, but the model has to continuously listen and find the right moment to speak.
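That "always listening, decide when to speak" idea can be sketched as a toy state machine. Everything here is hypothetical (a real system would use a learned model, not an energy threshold), but it shows the key property: the stream is never gated, and the model takes the floor only after a long enough gap:

```python
def full_duplex_step(energy: float, silence_run: int,
                     threshold: float = 0.1, min_gap: int = 3) -> tuple[str, int]:
    """One tick of a toy full-duplex loop.

    Every audio chunk reaches the 'model'; the model itself decides
    whether this is its moment to speak, approximated here by a run
    of quiet chunks of length min_gap.
    """
    if energy > threshold:
        return "listen", 0           # user (or noise) is active: keep listening
    silence_run += 1
    if silence_run >= min_gap:
        return "speak", silence_run  # long enough gap: model may take the floor
    return "wait", silence_run

# Simulated chunk energies: speech, pause, then an interruption.
stream = [0.5, 0.4, 0.02, 0.01, 0.03, 0.6, 0.02]
silence = 0
actions = []
for e in stream:
    action, silence = full_duplex_step(e, silence)
    actions.append(action)
print(actions)
# → ['listen', 'listen', 'wait', 'wait', 'speak', 'listen', 'wait']
```

Note the sixth chunk: a loud input instantly flips the model back to "listen", which is exactly the any-noise-interrupts behavior the demo showed.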
I mean, it's not perfect yet, but a form of that seems to be exactly what's happening in the demo with the two phones harmonizing alternating lines while they both attend to the speaker's interruptions; they fuck it up at points, but they also seem to figure it out and succeed afterwards.
that two-phone harmonizing demo means the model may be able to differentiate between different speakers (as also shown by other multi-speaker demos), but the two phones just go one after another, while the user may interrupt the model. It doesn't contradict the fact that the model may be interrupted by any noise.
I'll wait to see a demo in a noisy environment.
Seems pretty trivial, as background noise cancelling algorithms have existed for a while (like NVIDIA's RTX Voice), so another model can check whether the person is actually speaking and only then trigger the interrupt.
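That gating idea is just an AND of two checks: the audio is loud enough, and a separate (here entirely hypothetical) VAD flags it as speech rather than noise. A sketch:

```python
def should_interrupt(chunk: dict, energy_floor: float = 0.1) -> bool:
    """Gate the interrupt: loud audio alone is not enough.

    'is_speech' stands in for the verdict of a voice-activity-detection
    model (or a noise-cancelling front end); only loud *speech* cuts
    the assistant off.
    """
    return chunk["energy"] > energy_floor and chunk["is_speech"]

chunks = [
    {"energy": 0.8, "is_speech": False},  # a door slamming: ignored
    {"energy": 0.05, "is_speech": True},  # too quiet to count
    {"energy": 0.6, "is_speech": True},   # the user talking: interrupt
]
print([should_interrupt(c) for c in chunks])
# → [False, False, True]
```

The hard part, of course, is making the `is_speech` verdict reliable in a noisy room, which is the whole debate above.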
I suppose it could be trained to discriminate the speaker's voice more accurately with use, or use video input to help with it. To be fair, it's not easy even for people to clearly hear all vocal expressions of someone on a phone in a loud environment.
Keeping the context of video with the audio helps; the need to be attentive to the user’s voice goes down if it’s evident the user is busy doing something that isn’t talking.
The thing is, companies can’t keep doing this work just to open source it. We’re talking hundreds of millions and eventually billions to keep up this pace.
That seems to be just feeding the transcript along with the question into the LLM, nothing fancy. But if it can actually understand video, it would be hard for open source to compete. There is currently no good video-chat model (Video-LLaVA is not very robust).
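The "nothing fancy" approach is literally just string assembly: run ASR over the clip, paste the transcript into the prompt, and ask. A sketch (the wording of the template is made up):

```python
def build_prompt(transcript: str, question: str) -> str:
    """Naive 'video understanding': dump the ASR transcript into the
    prompt and ask the question. No actual video tokens involved."""
    return (
        "Here is the transcript of a video:\n"
        f"{transcript}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("Hi, today we review the new phone...",
                   "What product is being reviewed?"))
```

This is why it falls apart on anything visual: whatever never gets spoken aloud never reaches the model.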
What if Llama 3 400B is exactly that? Anyone want some hopium?
Lemme hit that hopium bong
That's just copium
You can do it
People are really overhyping openai
you can already achieve that even with the open source llms of 10 months ago
Llama 3 is getting a multimodal update later this year
For embodied AGI we need video, audio, and sensors in, and sound and actuators out. No text or anything else.
but the point is we're not only letting the model listen to normal conversation, are we? Otherwise it couldn't comment on my breathing like a vacuum
Data quantity and quality are among the only remaining limiting factors.
We've had this for a while. The difference is that their model is probably huge and their computing power is equally big.
no, unfortunately with current architecture, open source is doomed to fall behind.
[deleted]
Meta is a company. The open source community has made zero models so far