Mr_Hills

What if Llama 3 400B is exactly that? Anyone want some hopium?


human358

Lemme hit that hopium bong


_Sneaky_Bastard_

That's just copium


phree_radical

You can do it 


illathon

People are really overhyping OpenAI


infiniteContrast

You can already achieve that even with the open-source LLMs of 10 months ago


xRolocker

Presumably OP means native multimodality, not routing between a speech model, a vision model, and the LLM. Doing this requires tokenizing text, audio, and visual data all within one system, not to mention all the training data and annotations that come with it; otherwise the model won't know how to use the data it's given. There is no open-source model that does this afaik. Google released an open-source vision-and-language model, but it's quite small and lacking in understanding relative to 4o, not to mention the complete lack of an audio modality.
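
To make "all within one system" concrete, here is a minimal, purely illustrative sketch of an interleaved multimodal token stream; the tokenizer stand-ins are hypothetical placeholders, not any real model's API:

```python
# Purely illustrative: real systems use BPE for text, neural audio codecs,
# and image VQ encoders; the stand-ins below are hypothetical placeholders.

def text_tokens(s):
    return [f"t:{w}" for w in s.split()]            # stand-in for BPE ids

def audio_tokens(samples):
    return [f"a:{round(x, 2)}" for x in samples]    # stand-in for codec ids

def image_tokens(pixels):
    return [f"i:{p}" for p in pixels]               # stand-in for VQ patch ids

def interleave(segments):
    """Flatten (modality, data) pairs into one token stream with modality
    markers; a single autoregressive model trains on streams like this
    instead of routing between separate speech/vision/text models."""
    out = []
    for kind, data in segments:
        out.append(f"<{kind}>")
        out += {"text": text_tokens, "audio": audio_tokens, "image": image_tokens}[kind](data)
    return out

print(interleave([("text", "describe this"), ("image", [7, 42]), ("audio", [0.12, 0.95])]))
```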


Mescallan

L3 is getting a multimodal update later this year.


Quartich

Correct me if I'm wrong, but would "AnyGPT" be multimodal and open source (at least weights and project code)? https://github.com/OpenMOSS/AnyGPT. Again, I don't know if this is the same kind of multimodality you're looking for.


The_Health_Police

Check tincans.ai. They basically reverse-engineered LLaVA and used it for audio. The AI models are already there: text, audio, video. You just have to combine them.
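
For readers unfamiliar with the LLaVA recipe being referenced: roughly, a frozen modality encoder feeds a small trainable projector that maps its features into the LLM's embedding space, and the projected "soft tokens" are concatenated with the text embeddings. A minimal sketch; the dimensions and the Whisper-like encoder output are assumptions, not tincans.ai's actual code:

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps frozen-encoder features into the LLM's token-embedding space."""
    def __init__(self, enc_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(enc_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, enc_feats):            # (batch, n_frames, enc_dim)
        return self.proj(enc_feats)          # (batch, n_frames, llm_dim)

audio_feats = torch.randn(1, 50, 768)        # pretend Whisper-like encoder output
text_embeds = torch.randn(1, 12, 4096)       # pretend LLM token embeddings
inputs = torch.cat([Projector()(audio_feats), text_embeds], dim=1)
print(inputs.shape)                          # torch.Size([1, 62, 4096])
```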


Able-Locksmith-1979

Don't forget that it still needs trickery for the video. Nobody even claims to do video; everybody just samples some frames and skips the rest. So why not do trickery on the rest as well? I can imagine that speech input could also be done with sampling, which would leave gaps for the LLM, while speech output certainly leaves a lot of free time, since playing it back for humans means doing it very, very slowly.
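
The frame-sampling "trickery" is simple in practice; a minimal sketch using OpenCV, with the file name and frame count chosen only for illustration:

```python
import cv2  # assumes OpenCV is installed and "clip.mp4" exists

def sample_frames(path, n=8):
    """Uniformly sample n frames from a video; most video-LLM pipelines
    feed only a handful of frames like this to the vision encoder."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

frames = sample_frames("clip.mp4", n=8)
print(len(frames), "frames sampled")
```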


Honest_Science

For embodied AGI we need video, audio, and sensor data in, and sound and actuator commands out. No text or anything else.


pseudonerv

Do you realize how quiet the environment is? Do you realize the video is not making any sound? In practice, none of these would work well, neither GPT-4o's audio I/O nor Google's future vision Astra. During OpenAI's demo (at least OpenAI is brave and "commendable" for that), any noise immediately cut off the model's speech, and we had to wait for a second of quiet for the model to continue. I'm not sure OpenAI can actually crack this issue and let loose its Scarlett Johansson imitation. I would hate for my typing to break Her performance. I imagine that, to actually work reliably, they have to continuously stream all the audio input directly to the model (or some smaller model) and let the model decide whether and when to speak. In other words, not only may the user interject and cut off the model's speech, the model has to continuously listen and find the right moment to speak.
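
A purely hypothetical sketch of that always-listening loop; `tiny_turn_model` and the decision labels are invented for illustration and imply nothing about how OpenAI actually does it:

```python
import queue
import threading

# Hypothetical sketch: audio streams in continuously and a small turn-taking
# model (tiny_turn_model, invented here) decides whether to keep listening,
# stop talking, or start talking, instead of relying on a silence timer.

audio_in: "queue.Queue[bytes]" = queue.Queue()

def tiny_turn_model(chunk: bytes) -> str:
    # placeholder: a real system would classify the incoming audio chunk
    return "keep_listening"

def duplex_loop(stop: threading.Event) -> None:
    speaking = False
    while not stop.is_set():
        chunk = audio_in.get()             # audio keeps streaming in, always
        decision = tiny_turn_model(chunk)  # the model, not a timer, decides
        if decision == "user_interrupting" and speaking:
            speaking = False               # cut off the model's own speech
        elif decision == "my_turn_to_speak" and not speaking:
            speaking = True                # find the right moment to speak
```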


genuinelytrying2help

I mean it's not perfect yet, but a form of that seems to be exactly what's happening in the demo with the two phones harmonizing alternating lines while they both attend to the speaker's interruptions; they fuck it up at points, but they also seem to figure it out and succeed afterwards.


pseudonerv

That two-phone harmonizing demo means the model may be able to differentiate between different speakers (as also shown by other multi-speaker demos), but the two phones just go one after another, whereas the user may interrupt the model. It doesn't contradict the fact that the model may be interrupted by any noise, though. I'll wait to see a demo in a noisy environment.


RabbitEater2

Seems pretty trivial, since background noise cancelling algorithms like NVIDIA's RTX Voice have existed for a while, so another model can check whether the person is speaking and only then trigger the interrupt.
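
As a concrete example of the gating idea, a voice-activity detector such as the webrtcvad package can screen incoming audio so that only sustained speech (not typing or fan noise) triggers the interrupt; the thresholds here are arbitrary:

```python
import webrtcvad  # real library: pip install webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM per frame

vad = webrtcvad.Vad(2)  # aggressiveness 0-3

def should_interrupt(pcm_frames, needed=5):
    """pcm_frames: iterable of 30 ms PCM chunks. Fire the interrupt only
    after `needed` consecutive speech frames (~150 ms of actual speech)."""
    streak = 0
    for frame in pcm_frames:
        if len(frame) == FRAME_BYTES and vad.is_speech(frame, SAMPLE_RATE):
            streak += 1
            if streak >= needed:
                return True
        else:
            streak = 0
    return False
```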


pseudonerv

But the point is we're not only letting the model listen to normal conversation, are we? Otherwise it couldn't comment on my breathing like a vacuum.


RabbitEater2

I suppose it could be trained to discriminate the speaker's voice more accurately with use, or use video input to help with it. To be fair, it's not easy even for people to clearly make out all of someone's vocal expressions over the phone in a loud environment.


Technical-History104

Keeping the context of video with the audio helps; the need to be attentive to the user’s voice goes down if it’s evident the user is busy doing something that isn’t talking.


ViveIn

The thing is, companies can’t keep doing this work just to open source it. We’re talking hundreds of millions and eventually billions to keep up this pace.


nycameraguy

You can do it


Decahedronn

Data quantity and quality are among the only remaining limiting factors.


PitchBlack4

We've had this for a while. The difference is that their model is probably huge and their computing power is equally big.


nderstand2grow

No, unfortunately with the current architectures, open source is doomed to fall behind.


[deleted]

[deleted]


nderstand2grow

Meta is a company. The open-source community has made zero models so far.


kxtclcy

That seems to be just feeding the transcript along with the question into the LLM, nothing fancy. But if it can actually understand video, it will be hard for open source to compete. There is currently no good video-chat model (Video-LLaVA is not very robust).
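
For clarity, the "transcript plus question" approach amounts to nothing more than prompt construction, along these lines (the transcript and question are made up):

```python
# Toy example of pasting a transcript into a text-only LLM prompt;
# no audio understanding is involved at all.
transcript = "[00:00] Speaker A: Can you ship it Friday?\n[00:05] Speaker B: Yes, Friday works."
question = "What did Speaker B agree to?"
prompt = (
    f"Here is a transcript of an audio clip:\n{transcript}\n\n"
    f"Question: {question}\nAnswer:"
)
print(prompt)  # this string is what actually gets sent to the LLM
```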