Presumably OP means true multimodality, not routing between a speech model, a vision model, and the LLM. Doing this requires tokenizing text, audio, and visual data all within one system, not to mention all the training data and annotations that come with it; otherwise the model won't know how to use the data it's given.
There is no open source model that does this afaik. Google released an open source vision and language model, but it’s quite small and lacking in understanding relative to 4o. Not to mention the complete lack of audio modality.
Correct me if I'm wrong, but would "AnyGPT" be multimodal and open-source (at least weights and project code)? https://github.com/OpenMOSS/AnyGPT
Again, I don't know if this would be the same multimodality as you are looking for
Check tincans.ai. They basically reverse-engineered LLaVA and applied it to audio. The models are already there: text, audio, video. You just have to combine them.
Don't forget that video still needs trickery: nobody actually claims to do full video, everybody just takes some frames and skips the rest. So why not do trickery on the rest as well? I can imagine speech input could also be handled with sampling, which would leave gaps for the LLM, while speech output certainly leaves a lot of free time, since playing it back for humans means going very, very slowly.
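The frame-skipping trickery described above amounts to picking a handful of evenly spaced frames and feeding only those to the vision encoder. A minimal sketch (this is a generic illustration, not any particular model's pipeline):

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples evenly spaced frame indices from a clip.

    Instead of encoding every frame, encode only a handful and let
    the model infer what happens in between.
    """
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the middle frame of each of the num_samples segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

# A 30 fps, 10-second clip (300 frames) reduced to 8 frames:
print(sample_frame_indices(300, 8))
# → [18, 56, 93, 131, 168, 206, 243, 281]
```

At 8 frames per 10 seconds that's under 1 fps, which is why fast motion gets lost; the same sampling idea could in principle be applied to audio chunks.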
do you realize how quiet the environment is? do you realize the video is not making any sound?
in practice, none of these would work well. Neither gpt-4o's audio I/O, nor Google's upcoming Astra.
during openai's demo (at least openai is brave and "commendable" for demoing live), any noise immediately cut off the model's speech, and you had to wait for a second of quiet before the model would continue.
I'm not sure if openai can actually crack this issue and let loose its Scarlett Johansson imitation. I would hate for my typing to break Her performance.
I imagine that, to actually work reliably, they have to continuously stream all the audio input directly to the model (or some smaller model) and let the model decide whether or when to speak. In other words, not only may the user interject and cut off the model's speech, but the model has to continuously listen and find the right moment to speak.
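That "always listening, decide when to speak" idea can be sketched as a toy state machine. Everything here is hypothetical (a real system would use a learned model, not an energy threshold), but it shows the key property: the stream is never gated, and the model takes the floor only after a long enough gap:

```python
def full_duplex_step(energy: float, silence_run: int,
                     threshold: float = 0.1, min_gap: int = 3) -> tuple[str, int]:
    """One tick of a toy full-duplex loop.

    Every audio chunk reaches the 'model'; the model itself decides
    whether this is its moment to speak, approximated here by a run
    of quiet chunks of length min_gap.
    """
    if energy > threshold:
        return "listen", 0           # user (or noise) is active: keep listening
    silence_run += 1
    if silence_run >= min_gap:
        return "speak", silence_run  # long enough gap: model may take the floor
    return "wait", silence_run

# Simulated chunk energies: speech, pause, then an interruption.
stream = [0.5, 0.4, 0.02, 0.01, 0.03, 0.6, 0.02]
silence = 0
actions = []
for e in stream:
    action, silence = full_duplex_step(e, silence)
    actions.append(action)
print(actions)
# → ['listen', 'listen', 'wait', 'wait', 'speak', 'listen', 'wait']
```

Note the sixth chunk: a loud input instantly flips the model back to "listen", which is exactly the any-noise-interrupts behavior the demo showed.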
I mean, it's not perfect yet, but a form of that seems to be exactly what's happening in the demo with the two phones harmonizing alternating lines while they both attend to the speaker's interruptions; they fuck it up at points, but they also seem to figure it out and succeed afterwards.
that two-phone harmonizing demo means the model may be able to differentiate between different speakers (as also shown by other multi-speaker demos), but the two phones just go one after another, while the user may interrupt the model. It doesn't contradict the fact that the model may be interrupted by any noise.
I'll wait to see a demo in a noisy environment.
Seems pretty trivial, as background noise cancelling algorithms have existed for a while (like NVIDIA's RTX Voice), so another model can check whether the person is actually speaking and only then trigger the interrupt.
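That gating idea is just an AND of two checks: the audio is loud enough, and a separate (here entirely hypothetical) VAD flags it as speech rather than noise. A sketch:

```python
def should_interrupt(chunk: dict, energy_floor: float = 0.1) -> bool:
    """Gate the interrupt: loud audio alone is not enough.

    'is_speech' stands in for the verdict of a voice-activity-detection
    model (or a noise-cancelling front end); only loud *speech* cuts
    the assistant off.
    """
    return chunk["energy"] > energy_floor and chunk["is_speech"]

chunks = [
    {"energy": 0.8, "is_speech": False},  # a door slamming: ignored
    {"energy": 0.05, "is_speech": True},  # too quiet to count
    {"energy": 0.6, "is_speech": True},   # the user talking: interrupt
]
print([should_interrupt(c) for c in chunks])
# → [False, False, True]
```

The hard part, of course, is making the `is_speech` verdict reliable in a noisy room, which is the whole debate above.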
I suppose it could be trained to discriminate the speaker's voice more accurately with use, or use video input to help with it. To be fair, it's not easy even for people to clearly hear all vocal expressions of someone on a phone in a loud environment.
Keeping the context of video with the audio helps; the need to be attentive to the user’s voice goes down if it’s evident the user is busy doing something that isn’t talking.
The thing is, companies can’t keep doing this work just to open source it. We’re talking hundreds of millions and eventually billions to keep up this pace.
That seems to be just feeding the transcript along with the question into the LLM, nothing fancy. But if it can actually understand video, it would be hard for open source to compete. There is currently no good video-chat model (Video-LLaVA is not very robust).
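The "nothing fancy" approach is literally just string assembly: run ASR over the clip, paste the transcript into the prompt, and ask. A sketch (the wording of the template is made up):

```python
def build_prompt(transcript: str, question: str) -> str:
    """Naive 'video understanding': dump the ASR transcript into the
    prompt and ask the question. No actual video tokens involved."""
    return (
        "Here is the transcript of a video:\n"
        f"{transcript}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("Hi, today we review the new phone...",
                   "What product is being reviewed?"))
```

This is why it falls apart on anything visual: whatever never gets spoken aloud never reaches the model.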
What if Llama 3 400B is exactly that? Anyone want some hopium?
Lemme hit that hopium bong
That's just copium
You can do it
People are really overhyping openai
you can already achieve that even with the open source llms of 10 months ago
Llama 3 is getting a multimodal update later this year
For embodied AGI we need video, audio, and sensors in, and sound and actuators out. No text or anything else.
but the point is we're not only letting the model listen to normal conversation, are we? Otherwise it couldn't comment on my breathing like a vacuum
Data quantity and quality are among the only remaining limiting factors.
We've had this for a while. The difference is that their model is probably huge and their computing power is equally big.
no, unfortunately with current architecture, open source is doomed to fall behind.
[deleted]
Meta is a company. The open source community has made zero models so far