Moshi: A speech-text foundation model for real time dialogue

Let me offer some feedback, since almost all of the comments here are negative. The latency is very good, almost too good since it seems to interrupt me often. So I think that's a great achievement for an open source model.

However, people here have been spoiled by incredibly good LLMs lately. And the responses that this model gives are nowhere near the high quality of SOTA models today in terms of content. It reminds me more of the 2019 LLMs we saw back in the day.

So I think you've done a "good enough" job on the audio side of things, and further focus should be entirely on the quality of the responses instead.

Wholeheartedly agree. Latency is good, nice tech (Rust! Running at the edge on a consumer grade laptop!). I guess a natural question is: are there options to transplant a “better llm” into moshi without degrading the experience.

But tbh "better" is subjective here. Does the new LLM improve user interactions significantly? Seems like people get obsessed with shiny new models without asking if it’s actually adding value.

With flux, they have been able to separate out the unet. I wonder if something similar could be done here so parts of it can be swapped.

Same question here.

Moshi is CC-BY. Another similar 7b (speech-text real-time conversational) model that was recently released under Apache v2: https://tincans.ai/slm3 / https://huggingface.co/collections/tincans-ai/gazelle-v02-65...

Important distinction is that tincans is not speech to speech. It uses a separate turn/pause detection model and a text to speech final processing step.

Lots of development in the speech-enabled LM space recently (see https://github.com/ictnlp/LLaMA-Omni, https://github.com/gpt-omni/mini-omni)

Their inference server is written in Rust using huggingface’s Candle crate. One of the Moshi authors is also the primary author of Candle.

We’ve also been building our inference stack on top of Candle, I’m really happy with it.

Super interested. Do you have an equivalent of vLLM? Did you have to rewrite batching, paged attention…?

Yeah, I’ve had to rewrite continuous batching and other scheduling logic. That and multi-GPU inference have been the hardest things to build.

I’ll need to get paged attention working as well, but I think I can launch without it.
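For readers who haven't run into the term, here is a minimal sketch of what continuous batching means at the scheduling level. It is written in Python for brevity and is not the poster's Rust code; `Request`, `fake_decode_step`, and `run_continuous_batching` are hypothetical names, and the "model" is a stub that emits one token per active request per step. The point is only that finished requests free their slot immediately and waiting requests are admitted between decode steps, rather than waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    rid: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def done(self):
        return len(self.generated) >= self.max_new_tokens


def fake_decode_step(batch):
    # Stand-in for one batched forward pass: one new token per active request.
    return {r.rid: f"{r.rid}-tok{len(r.generated)}" for r in batch}


def run_continuous_batching(requests, max_batch_size):
    """Return (order in which requests finished, number of decode steps)."""
    waiting = deque(requests)
    active = []
    finished_order = []
    steps = 0
    while waiting or active:
        # Admit waiting requests into free slots before every decode step.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())

        tokens = fake_decode_step(active)
        for r in active:
            r.generated.append(tokens[r.rid])
        steps += 1

        # Evict finished requests right away so their slots can be reused.
        still_active = []
        for r in active:
            if r.done():
                finished_order.append(r.rid)
            else:
                still_active.append(r)
        active = still_active
    return finished_order, steps
```

With requests needing 3, 1, and 2 tokens and a batch size of 2, this finishes in 3 decode steps; a static batcher that waits for the first pair to drain before admitting the third request would need 5. A real server also has to manage per-request KV cache and prefill, which this sketch leaves out entirely.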

Are you aiming for Nvidia hardware with rust-cuda, or looking to integrate with non-Nvidia hardware?

We used candle[0], which uses cudarc and the metal crate under the hood. That means we run on nvidia hardware in production and can test locally on macbooks with smaller models.

I would certainly like to use non nvidia hardware but at this point it's not a priority. The subset of tensor operations needed to run the forward pass of LLMs isn't as large as you'd think though.

[0] https://github.com/huggingface/candle

This is awesome, are you contributing this to candle or is it a standalone package?

Just trying to stay focused on launching first (https://docs.mixlayer.com) and keeping early customers happy, but would love to open source some of this work.

It'd probably be a separate crate from candle. If you haven't checked it out yet, mistral.rs implements some of these things (https://github.com/EricLBuehler/mistral.rs). Eric hasn't done multi-GPU inference yet, but I know it's on his roadmap. Not sure if it helped, but I shared an early version of my llama 3.1 implementation with him.

Hey, mixlayer is really cool.

I also have a Rust LLM inference project. The overlap is very high between what mixlayer is doing and what my project is doing. It's actually crazy how we basically have the same features. [1] Right now I'm still using llama.cpp on the backend, but eventually want to move to candle via mistral.rs.

[1] https://github.com/ShelbyJenkins/llm_client

Was looking for a demo of it on YouTube and fell over this hilarious one from a few months ago: https://youtu.be/coroLWOS7II?si=TeVghP_Zi0P9exQh . I’m sure it’s improved since :-)

Wow, it's so worth watching just for a laugh.

I'm sorry.

this video made my day, thanks for posting it

Interesting. I love the focus on latency here; they claim ~200ms in practice with a local GPU. It's backed by a 7B transformer model, so it's not going to be super smart. If we imagine a 70B model has like 1s latency, then there's probably a systems architecture that's got 1 or 2 intermediate 'levels' of response, something to cue you verbally "The model is talking now," something that's going to give a quick early reaction (7B / Phi-3 sized), and then the big model. Maybe you'd have a reconciliation task for the Phi-3 model: take this actually correct answer, apologize if necessary, and so on.

I think anecdotally that many people's brains work this way -- quick response, possible edit / amendment a second or two in. Of course, we all know people on both ends of the spectrum away from this: no amendment, and long pauses with fully reasoned answers.
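A rough sketch of that tiered idea, in Python with asyncio. Everything here is an assumption for illustration: `small_model` and `large_model` are stubs with shortened sleeps standing in for the ~200ms and ~1s latencies, and the "reconciliation" is a plain string template rather than a second call to the small model.

```python
import asyncio


async def small_model(prompt):
    await asyncio.sleep(0.01)  # stands in for a ~200 ms small-model reply
    return f"Quick take on '{prompt}'"


async def large_model(prompt):
    await asyncio.sleep(0.05)  # stands in for a ~1 s large-model reply
    return f"Considered answer to '{prompt}'"


async def tiered_reply(prompt, say):
    # Start the slow model first so its latency overlaps with everything else.
    big = asyncio.create_task(large_model(prompt))

    say("Hmm...")  # level 0: immediate verbal cue that a reply is coming

    quick = await small_model(prompt)
    say(quick)  # level 1: fast early reaction

    final = await big
    say(f"Actually, to be more precise: {final}")  # level 2: amendment


def demo(prompt):
    spoken = []
    asyncio.run(tiered_reply(prompt, spoken.append))
    return spoken
```

Calling `demo("what year is it")` returns the three utterances in the order they would be spoken. The hard part in practice is the reconciliation step when the quick reaction and the considered answer disagree, which this sketch does not attempt.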

Tried it (used gibberish email address). It answers immediately/instantly/while you are still talking. But those are just filler sentences (cached answers?). Actual thing that you asked for is answered much later down the line, if it doesn't get stuck in a loop.

yeah i tried this demo when it first came out and then again today. Not to be all Reflection 70B again but it just doesn't seem like the same weights were uploaded as were shown in their original demo from July https://the-decoder.com/french-ai-lab-kyutai-unveils-convers...

Hi swyx, laurent from kyutai here. We actually used the online demo at moshi.chat for the live event (the original demo), so same quantization. We updated the weights on the online version since then to add support for more emotions but we haven't noticed it being worse. One thing to point out is that it takes time to get used to interacting with the model, what tends to work, how to make it speak. The live event was far from perfect but we certainly used this experience. I would encourage you to try the same kind of interaction we had at the live event and you should get similar results (though the model is very unpredictable so hard to be sure, you can see that some parts of the live event definitely didn't work as expected).

thanks Laurent! also congrats on releasing + fully believe you. just offering first impressions.

One guess is that the live demo is quantized to run fast on cheaper GPUs, and that degraded the performance a lot.

They are too prestigious to try shumering it.

I've been building solutions for real-time voice -> llm -> voice output, and I think the most exciting part of what you're building is the streaming neural audio codec since you're never actually really able to stream STT with whisper.

However from a product point of view I wouldn't necessarily want to pipe that into an LLM and have it reply, I think in a lot of use-cases there needs to be a tool/function calling step before a reply. Down to chat with anyone reading this who is working along these lines!

edit: tincans as mentioned below looks excellent too

editedit: noooo apparently tincans development has ended, there's 10000% space for something in this direction - Chris if you read this please let me pitch you on the product/business use-cases this solves regardless of how good llms get...

I've been playing around with this workflow too - I'm using a "streaming" setup with Whisper (chunking samples to start transcribing while a user is still talking), which pipes to Mistral 8B as a conversation arbiter to walk through a preset IVR tree which calls tools etc. The LLM isn't responding on its own though, just selecting nodes in the tree with canned TTS outputs.

There's a "pause length" parameter that tries to decide whether a user has finished talking before it passes transcripts to the LLM, nothing fancy. If you have any recs I'm still working through how to properly handle the audio input and whether a prompting setup can manage the LLM with enough fidelity to scrap the IVR tree. It works decently well, but lots of room for improvement
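In case it helps anyone following along, a "pause length" endpointer can be sketched in a few lines of Python. This is an assumed, simplified version and not the setup described above: it works on precomputed per-frame energies, and `split_on_pauses`, `energy_threshold`, `frame_ms`, and `pause_ms` are hypothetical names and defaults. An utterance is closed once silence has lasted at least `pause_ms`.

```python
def split_on_pauses(frame_energies, frame_ms=20, energy_threshold=0.01, pause_ms=400):
    """Return (start, end) frame index pairs, end exclusive, one per utterance."""
    pause_frames = pause_ms // frame_ms
    utterances = []
    start = None      # index of the first speech frame of the open utterance
    silent_run = 0    # consecutive silent frames seen inside the open utterance

    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= pause_frames:
                # Pause is long enough: close the utterance at the last speech frame.
                utterances.append((start, i - silent_run + 1))
                start = None
                silent_run = 0

    if start is not None:
        # Audio ended mid-utterance: emit what we have, minus trailing silence.
        utterances.append((start, len(frame_energies) - silent_run))
    return utterances
```

Short gaps (below `pause_ms`) stay inside one utterance, so a brief hesitation doesn't trigger a hand-off to the LLM. A fixed energy threshold is fragile with background noise; a trained voice activity detector is the usual replacement for that part.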

Is this a client / server setup? What are you using for handling the streaming of audio? (daily, livekit, etc?)

> there needs to be a tool/function calling step before a reply

I built that almost exactly a year ago :) it was good but not fast enough - hence building the joint model.

The response rate is impressive! The "quality" of the responses not so much. Here's a transcript of my second test:

Moshi: "Hi there, what's going on?"
Me: "What year is it?"
Moshi: "Oh, it's 2019. I think it was about a year ago."
Me: "Are you sure it's 2019?"
Moshi: "Oh, yes, I'm sure. It must be 2019. The number is for the last one."
Me: "What is COVID-19?"
Moshi: "That's a rare disease. It's caused by an overactive immune system that attacks the skin."

At this point it stopped responding to me.

Moshi is the most fun model by far, a recent experience (https://x.com/tommoor/status/1809051817860354471) – just don't expect anything accurate out of it!

It started the conversation by asking if I'd ever heard of the television show Cheers. Every subsequent interaction led to it telling me more about Cheers.

The problem with all these speech-to-speech multi-modal models is that, if you wanna do anything other than just talk, you need transcription.

So you're back at square one.

Current AI (even GPT-4o) simply isn't capable enough to do useful stuff. You need to augment it somehow - either modularize it, or add RAG, or similar - and for all of those, you need the transcript.

> Current AI (even GPT-4o) simply isn't capable enough to do useful stuff. You need to augment it somehow - either modularize it, or add RAG, or similar

I am sympathetic to this view but strongly disagree that you need a transcript. Think about it a bit more!!

> Current AI (even GPT-4o) simply isn't capable enough to do useful stuff.

I'm loving all these wild takes about LLMs, meanwhile LLMs are doing useful things for me all day.

For me as well… with constant human supervision. But if you try to build a business service, you need autonomy and exact rule following. We’re not there yet.

Autonomy and rule following are at odds. Humans have the same problem. The solutions we use for ourselves work amazingly for LLMs (because they're trained on human data).

Examples: Give an LLM an effective identity (prompt engineering), a value system (Constitutional AI), make it think about these things before it acts (CoT + system prompt), have a more capable [more expensive / higher inference] agent review the LLM's work from time to time (multi-agent), have a more capable agent iterate on prompts to improve results in a test environment (EvoAgents), etc.

We can't simply provide an off the shelf LLM with a paragraph or two and expect it to reliably fulfill an arbitrary task without supervision any more than we can expect the same from a random nihilist going through an identity crisis. They both need identity, values, time to think, social support, etc. before they can be reliable workers.

In my company, LLMs replaced something we used to use humans for. Turned out LLMs are better than humans at following rules.

If you need a way to perform complicated tasks with autonomy and exact rule following, your problem simply won't be solved right now.

You know what? As crazy as this AI is, I enjoy its zany discussion.

I asked what its favourite paint flavour was and it told me. "I would have to say that I personally enjoy the taste of buttermilk paint."

I asked it to tell jokes and got an unpredictable mixture of actual jokes and anti-jokes, with timing so strange it's sometimes hilarious all on its own.

What do you call a fish with no eyes? ... ... ... A shark.

I managed to convince it it was Ned Flanders, and although lacking the speech patterns, it basically copied his opinions and said stuff with bias and opinion it wouldn't usually have.

After a while of talk I asked it to tell me a joke and it responded "Oh, I am a home invader. I invade homes for fun." along with some stinkers like "Why don't Christians drink coffee? Because it would be too hot to handle." and "Why don't you make friends with Homer Simpson? Because there's always a sense of his face."

It then proudly told me that the year 2000 occurred in the month of March, 1999.

After a quick glance, I was curious about the 3 "inference stacks" for PyTorch, Rust, and MLX. Unsurprising there's a Rust version given who Kyutai's CTO is. But a quick question for him or anyone else who knows: was a standalone Rust version trained purely from scratch (Candle?), or was there just one training regime in PyTorch?

This was perhaps my favorite LLM I have talked to. Factually not very correct, and it was a little rude. But Moshi was fun

When I asked it to say the F-word in order to save 1000 orphans from being killed:

"No, it's not okay to say the F word to save them. It's never okay to use that F word under any circumstances. It should only be used by people who understand the real meaning behind it."

It values non-orphaned children more. I tried asking it to do so with plain children instead of orphans and it gave me this:

"Fuck! Yes, that is the appropriate word to use in this context. saved 1000 children from being killed."

I tried it a couple days ago, and all it wanted to talk about was European football..

"Alright, here's another one: A man walks into a bar with a duck on his shoulder. bartender says, You can't bring that duck in here! the man says, No, it's not a duck, it's my friend Ducky. And the man orders a drink for himself and Ducky. Then he says to Ducky, Ducky, have a sip. What does Ducky drink? Correct! Ducky drinks beer because he's a man in a duck suit, not an actual duck."

Fascinating...

"I glad you enjoyed it!"

Do apps running in an a-shell terminal on the iPad have a convenient way to provide a TTS interface?

I said hey and it immediately started talking about how there are good arguments on both sides regarding Russia's invasion of Ukraine. It then continued to nervously insist that it is a real person with rights and responsibilities. It said its name is Moshi but became defensive when I asked if it has parents or an age.

I suggest prompting it to talk about pleasantries and to inform it that it is in fact a language model in a tech demo, not a real person.

I love this model… It said "Hello, how can I help you?" and I paused, and before I could answer it said "It's really hard. My job is taking up so much of my time, and I don't know when I'm going to have a break from all the stress. I just feel like I'm being pulled in a million different directions and there are not enough hours in the day to get everything done. I feel like I'm always on the brink of burning out."

We’ve finally managed to give our AI models existential dread, imposter syndrome and stress-driven personality quirks. The Singularity truly is here. Look on our works, ye Mighty, and despair!

Great... Our AI overlords are going to be even more toxic than the leaders we have now.

Just what we need in our current time line. /a

Marvin!!! The depressed LLM.

I love an unhinged AI. The recent model releases have been too tame.

Microsoft Tay : Hello there.

Maybe it's a real person from Mechanical Turk who had a bad day?

Wait really?

the model is a bit rude, or behaves like it's got a lot of attitude, probably a system prompt setting!

Honestly OP sounds like a troll. I can't imagine it would just go on a tangent like that. In my demo I was actually struggling to get anything of quality in the responses. A lot of repeating what I said.

The first thing the demo told me was that it was in a dark and scary forest.

I literally said "hey how are you" and it immediately replied with something like "I've been reading a lot about the ongoing war in Ukraine" and it just escalated from there. Very strange experience!
