Leanstral: Open-source agent for trustworthy coding and formal proof engineering

Lean 4 paper (2021): https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37

Is anyone using this approach with lean to ship production code? Writing lean spec as human, implementation and proof by agent? And then shipping lean or exporting to C? Would be great to understand how you are actually using this.

It’s great to see this pattern of people realising that agents can specify the desired behavior then write code to conform to the specs.

TDD, verification, whatever your tool; verification suites of all sorts accrue over time into a very detailed repository of documentation of how things are supposed to work that, being executable, puts zero tokens in the context when the code is correct.

It’s more powerful than reams upon reams of markdown specs. That’s because it encodes details, not intent. Your intent is helpful at the leading edge of the process, but the codified result needs shoring up to prevent regression. That’s the area software engineering has always ignored because we have gotten by on letting teams hold context in their heads and docs.

As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.

I feel like the difference is minimal, if not entirely dismissable. Code in this sense is just a representation of the same information as someone would write in an .md file. The resolution changes, and that's where both detail and context are lost.

I'm not against TDD or verification-first development, but I don't think writing that as code is the end-goal. I'll concede that there's millions of lines of tests that already exist, so we should be using those as a foundation while everything else catches up.

AI is the reality that TDD never before had the opportunity to live up to

Not just TDD. Amazon, for instance, is heading towards something between TDD and lightweight formal methods.

They are embracing property-based specifications and testing à la Haskell's QuickCheck: https://kiro.dev

Then, already in formal methods territory, refinement types (e.g. Dafny, Liquid Haskell) are great and less complex than dependent types (e.g. Lean, Agda).

What about model-driven development? Spec to code was the name of the game for UML.

It makes sense to me as long as you're not vibe coding the PBTs.

I've seen this sentiment and am a big fan of it, but I was confused by the blog post, and based on your comment you might be able to help: how does Lean help me? FWIW, context is: code Dart/Flutter day to day.

I can think of some strawmen: for example, prove a state machine in Lean, then port the proven version to Dart? But I'm not familiar enough with Lean to know if that's like saying "prove moon made of cheese with JavaScript, then deploy to the US mainframe"

I don't think he's referring to Lean specifically, but any sort of executable testing methodology. It removes the human in the loop in the confidence assurance story, or at least greatly reduces their labor. You cannot ever get such assurance just by saying, "Well this model seems really smart to me!" At best, you would wind up with AI-Jim.

(One way Lean or Rocq could help you directly, though, would be if you coded your program in it and then compiled it to C via their built-in support for it. Such is very difficult at the moment, however, and in the industry is mostly reserved for low-level, high-consequence systems.)

>Such is very difficult at the moment

What do you mean? It's a nice and simple language. Way easier to get started than OCaml or Haskell for example. And LLMs write programs in Lean4 with ease as well. Only issue is that there are not as many libraries (for software, for math proofs there is plenty).

But for example I worked with Claude Code and implemented a shell + most of unix coreutils in like a couple of hours. Claude did some simple proofs as well, but that part is obvs harder. But when the program is already in Lean4, you can start moving up the verification ladder up piece by piece.

But isn't that tantamount with "his comment is a complete non-sequitor"?

I don't think so? Lean is formal methods, so it makes sense to discuss the boons of formal and semiformal methods more generally.

I used to think that the only way we would be able to trust AI output would be by leaning heavily into proof-carrying code, but I've come to appreciate the other approaches as well.

There have been a lot of conversations recently about how model alignment is relative and diversity of alignment is important - see the recent podcast episode between Jack Clark (co-founder of Anthropic) and Ezra Klein.

Many comments here point out that Mistral's models are not keeping up with other frontier models - this has been my personal experience as well. However, we need more diversity of model alignment techniques and companies training them - so any company taking this seriously is valuable.

they ll get there

The real world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern...

> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.

If Agent is writing the tests itself, does it offer better correctness guarantees than letting it write code and tests?

It is definitely not foolproof but IMHO, to some extent, it is easier to describe what you expect to see than to implement it so I don't find it unreasonable to think it might provide some advantages in terms of correctness.

That definitely depends upon the situation. More often than not, properly testing a component takes me more time than writing it.

In my experience, this tends to be more related to instrumentation / architecture than a lack of ability to describe correct results. TDD is often suggested as a solution.

Given the issues with AWS with Kiro and Github, We already have just a few high-profile examples of what happens when AI is used at scale and even when you let it generate tests which is something you should absolutely not do.

Otherwise in some cases, you get this issue [0].

[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

The linked article does not speak of tests, it speaks of a team that failed to properly review an LLM refactor then proceeds to blame the tooling.

LLMs are good at writing tests in my experience.

TDD == Prompt Engineering, for Agentic coding tasks.

Wild it’s taken people this long to realize this. Also lean tickets / tasks with all needed context to complete the task, including needed references / docs, places to look in source, acceptance criteria, other stuff.

Curious if anyone else had the same reaction as me

This model is specifically trained on this task and significantly[1] underperforms opus.

Opus costs about 6x more.

Which seems... totally worth it based on the task at hand.

[1]: based on the total spread of tested models

Agreed. The idea is nice and honorable. At the same time, if AI has been proving one thing, it's that quality usually reigns over control and trust (except for some sensitive sectors and applications). Of course it's less capital-intense, so makes sense for a comparably little EU startup to focus on that niche. Likely won't spin the top line needle much, though, for the reasons stated.

> quality usually reigns over control and trust

Most Copilot customers use Copilot because Microsoft has been able to pinky promise some level of control for their sensitive data. That's why many don't get to use Claude or Codex or Mistral directly at work and instead are forced through their lobotomised Copilot flavours.

Remember, as of yet, companies haven't been able to actually measure the value of LLMs ... so it's all in the hands of Legal to choose which models you can use based on marketing and big words.

Ha, keep putting your prompts and workflows into cloud models. They are not okay with being a platform, they intend to cannibalize all businesses. Quality doesn't always reign over control and trust. Your data and original ideas are your edge and moat.

Alignment tax directly eats to model quality, double digit percents.

EU could help them very much if they would start enforcing the Laws, so that no US Company can process European data, due to the Americans not willing to budge on Cloud Act.

That would also help to reduce our dependency on American Hyperscalers, which is much needed given how untrustworthy the US is right now. (And also hostile towards Europe as their new security strategy lays out)

This would be unfortunately a rather nuclear option due to the continent’s insane reliance on technology that breaks its unenforced laws.

How about not making these unenforced laws in the first place so that European companies could actually have a chance at competing? We're going to suffer the externalities of AI either way, but at least there would be a chance that a European company could be relevant.

The AI Act absolutely befuddled me. How could you release relatively strict regulation for a technology that isn't really being used yet and is in the early stages of development? How did they not foresee this kneecapping AI investment and development in Europe? If I were a tinfoil hat wearer I'd probably say that this was intentional sabotage, because this was such an obvious consequence.

Mistral is great, but they haven't kept up with Qwen (at least with Mistral Small 4). Leanstral seems interesting, so we'll have to see how it does.

[dead]

But you can run this model for free on a common battery powered laptop sitting on your laps without cooking your legs.

Sorry, but what are you talking about? This is a 120B-A6B model, which isn't runnable on any laptop except the most beefed up Macbooks, and then will certainly drain its battery and cook your legs.

Yeah my bad, it requires an expensive MacBook.

I think it would still be fine for the legs and on battery for relatively short loads: https://www.notebookcheck.net/Apple-MacBook-Pro-M5-2025-revi...

But 40 degrees and 30W of heat is a bit more than comfortable if you run the agent continuously.

I'm never sure how much faith one can put into such benchmarks but in any case the optics seem to shift once you have pass@2 and pass@3.

Still, the more interesting comparison would be against something such as Codex.

the model is open source, you can run it locally. You don't think thats significant ?

Trustworthy vibe coding. Much better than the other kind!

Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?

On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.

They haven't made the chart very clear, but it seems it has configurable passes and at 2 passes it's better than Haiku and Sonnet and at 16 passes starts closing in on Opus although it's not quite there, while consistently being less expensive than Sonnet.

pass@k means that you run the model k times and give it a pass if any of the answers is correct. I guess Lean is one of the few use cases where pass@k actually makes sense, since you can automatically validate correctness.

Oh my bad. I'm not sure how that works in practice. Do you just keep running it until the tests pass? I guess with formal verification you can run it as many times as you need, right?

It’s really not hard — just explicitly ask for trustworthy outputs only in your prompt, and Bob’s your uncle.

Assuming that what you're dealing with is assertable. I guess what I mean to say is that in some situations is difficult to articulate what is correct and what isn't depending in some situations is difficult to articulate what is correct and what isn't depending upon the situation in which the software executes.

And Bob’s your uncle.

Can someone please explain... If I don't know any Lean (and I suspect most people don't), is it of any direct value? Trying to understand if there's something it can help me with (e.g. automatically write proofs for my Go programs somehow... I'm not sure) or should I just cheer solely for more open models out there, but this one isn't for me?

Presumably the idea is that an agent generates a Lean4 specification against which the software is measured.

But then the Lean4 specification effectively becomes the software artifact.

And we're sort of back to square 1. How do you verify a Lean4 spec is correct (and that it describes what needs to be built in the first place) without human review?

You're touching on the fundamental "who watches the watchmen" problem in formal verification. But I think the framing slightly misses the key asymmetry: reviewing a Lean4 spec is dramatically easier than reviewing the implementation it constrains.

A formal spec in Lean is typically 10-50x shorter than the code it proves correct. More importantly, Lean's type checker is itself a small, trusted kernel (~10k lines) that has been scrutinized by the PL community for years. So you're not trusting the agent — you're trusting the kernel.

The practical workflow isn't "agent writes spec + code." It's: human writes spec (the hard creative part), agent generates proof that code satisfies spec, Lean kernel mechanically checks the proof. The agent can hallucinate all it wants in step 2 — if the proof doesn't typecheck, it gets rejected deterministically.

The real bottleneck is step 1: writing good specs requires domain expertise. But that's exactly where humans should stay in the loop. It's a much better division of labor than reviewing thousands of lines of generated code.

Does that mean your production code is lean? Or do you translate some other language code to lean to verify it?

I absolutely called this a couple of weeks ago, nice to be vindicated!

> I'm interested to see what it is in the age of LLMs or similar future tools. I suspect a future phase change might be towards disregarding how easy it is for humans to work with the code and instead focus on provability, testing, perhaps combined with token efficiency.

> Maybe Lean combined with Rust shrunk down to something that is very compiler friendly. Imagine if you could specify what you need in high level language and instead of getting back "vibe code", you get back proven correct code, because that's the only kind of code that will successfully compile.

https://news.ycombinator.com/item?id=47192116

It's important to keep in mind that no proof system ensures your proof is the correct proof, only that it's a valid proof. Completely understanding what a proof proves is often nearly as difficult as understanding the program it's proving. Normally you benefit because the process of building a proof forces you to develop your understanding more fully.

Uhm, no? Even with "simple" examples like Dijkstra's shortest path, the spec is easier than the implementation. Maybe not for you, but try it out on an arbitrary 5-yr old. On the extreme end, you have results in maths, like Fermat's Last Theorem. Every teenager can understand the statement (certainly after 10 mins of explanation) but the proof is thousands of pages of super-specialized maths. It is a spectrum. For cryptography, compression, error-correction, databases, etc, the spec is often much simpler than the implementation.

I don't know why you created a new account for this, but take the textbook example of a nontrivial formally verified system: SeL4. That implementation was 8.7k of C code, which correspondend to 15k lines of Isabelle that ultimately needed 100k+ lines of proof to satisfy. And that was with the formal model excluding lots of important properties like hardware failure that actual systems deal with.

Pleasant surprise: someone saying "open source" and actually meaning Open Source. It looks like the weights are Apache-2.0 licensed.

Based on community definitions I've seen, this is considered "open weights". If you can't reproduce the model, it's not "open source"

Naturally the Microsoft-owned language is getting the AI hype instead of the more mature options that could do this sort of work… Agda, ATS, Coq/Rocq, Dafny, Fstar, Idris, Isabelle, Why3 just to name a few.

A bit uncharitable. I'm a diehard fan of Rocq, but it's nothing unusual to see the young new hotness that is Lean continue to get the spotlight. It's not a sign of Microsoft putting its thumb on the scales, and the hype for Lean has long predated LLMs.

It's certainly less mature when it comes to verified programming, but its appeal to mathematicians (rather than formal methods experts) has earned it much respect.

Am I missing something? Isn’t that the language most are using currently when looking at research at openai, google, deepseek etc?

[deleted]

Automated theorem provers running on a $5k piece of hardware is a cool version of the future

Maybe a naive question: given that they see better performance with more passes but the effect hits a limit after a few passes, would performance increase if they used different models per pass, i.e leanstral, kimi, qwen and leanstral again instead of 4x leanstral?

This is called a "LLM alloy", you can even do it in agentic, where you simply swap the model on each llm invocation.

It does actually significantly boost performance. There was an article on here about it recently, I'll see if I can find it.

Edit: https://news.ycombinator.com/item?id=44630724

They found the more different the models were (the less overlap in correctly solved problems), the more it boosted the score.

That sounds quite interesting. Makes me wonder if sooner or later they will have to train multiple independent models that cover those different niches. But maybe we will see that sooner or later. Thanks for the link.

One would think that LoRAs being so successful in StableDiffusion, that more people would be focused on constructing framework based LoRas; but the economics of all this probably preclude trying to go niche in any direction and just keep building the do-all models.

The SD ecosystem in large part was grassroots and focused on nsfw. I think current LLM companies would have a hard time getting that to happen due to their safety stuff.

FYI The Lean 4 paper: https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37

What are these "passes" they reference here? Haven't seen that before in LLM evals

Could definitely be interesting for having another model run over the codebase when looking for improvements

It's the number of attempts at answering the question.

From https://mistral.ai/news/leanstral :

  Model        Cost ($) Score
  ..
  Claude Opus     1,650 39.6
  ..
  Leanstral pass@8  145 31.0
  Leanstral pass@16 290 31.9

Curious if pass@2 was tested for haiku and sonnet?

"and continues to scale linearly"

it clearly and demonstrably does not. in fact, from eyeballing their chart Qwen, Kimi, and GLM scale linearly whereas Leanstral does not. But this is not surprising because the Alibaba, Moonshot, and Zhipu have hundreds of employees each and hundreds of millions of dollars of investment each.

Does Mistral come close to Opus 4.6 with any of their models?

I use mistral-medium-3.1 for a lot of random daily tasks, along with the vibe cli. I'd state from my personal opinion that mistral is my preferred 'model vendor' by far at this point. They're extremely consistent between releases while each of them just feels better. I also have a strong personal preference to the output.

I actively use gemini-3.1-pro-preview, claude-4.6-opus-high, and gpt-5.3-codex as well. I prefer them all for different reasons, however I usually _start_ with mistral if it's an option.

Why not Large 3? It's larger and cheaper

Mistral hasn't been in the running for SOTA for quite awhile now

Not at the moment, but a release of Mistral 4 seems close which likely bridges the gap.

Mistral Small 4 is already announced.

MOE but 120B range. Man I wish it was an 80B. I have 2 GPUs with 62Gib of usable VRAM. A 4bit 80B gives me some context window, but 120B puts me into system RAM

Either some q3 or since it's a MoE, maybe a REAP version of q4 might work (or could be terrible, I'm not sure about REAP'd models).

The TDD foundation! We might need one of those. :)

I don’t know a single person using Mistral models.

Isn't their latest speech to text model SOTA? When I tested it on jargon, it was amazing.

https://news.ycombinator.com/item?id=46886735

I'm using this model for my first python project, coding using opencode along with devstral and Mistral Large 3. I know it's not as capable as other, more expensive models, but working with it this way is teaching me python. More directly to your point though, the speech to text model is really good.

It's funny because I just took a break from it to read some hn and found this post.

I used Ministral for data cleaning.

I was surprised: even tho it was the cheapest option (against other small models from Anthropic) it performed the best in my benchmarks.

Mistral is super smart in smaller context and asking questions about it

Pretty much all of my LLM usage has been using Mistral's open source models running on my PC. I do not do full agentic coding as when i tried it with Devstral Small 2 it was a bit too slow (though if i could get 2-3 times the speed of my PC from a second computer it'd be be a different story and AFAIK that is doable if i was willing to spend $2-3k on it). However i've used Mistral's models for spelling and grammar checks[0], translations[1][2], summaries[3] and trying to figure out if common email SPAM avoidance tricks are pointless in the LLM age :-P [4]. FWIW that tool you can see in the shots is a Tcl/Tk script calling a llama.cpp-based command-line utility i threw together some time ago when experimenting with llama.cpp.

I've also used Devstral Small to make a simple raytracer[5][6] (it was made using the "classic" chat by copy/pasting code, not any agentic approach and i did fix bits of it in the process) and a quick-and-dirty "games database" in Python+Flask+Sqlite for my own use (mainly a game backlog DB :-P).

I also use it to make various small snippets, have it generate some boilerplate stuff (e.g. i have an enum in C and want to write a function that prints names for each enum value or have it match a string i read from a json file with the appropriate enum value), "translate" between languages (i had it recently convert some matrix code that i had written in Pascal into C), etc.

[0] https://i.imgur.com/f4OrNI5.png

[1] https://i.imgur.com/Zac3P4t.png

[2] https://i.imgur.com/jPYYKCd.png

[3] https://i.imgur.com/WZGfCdq.png

[4] https://i.imgur.com/ytYkyQW.png

[5] https://i.imgur.com/FevOm0o.png (screenshot)

[6] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92... (C code)

[7] https://i.imgur.com/BzK8JtT.png

That's likely because they're chasing enterprise - see deals with HSBC, ASML, AXA, BNP Paribas etc... Given swelling anti-US sentiment and their status as a French 'national champion', Mistral are probably in a strong position for now regardless of model performance, research quality or consumer uptake.

I'm building a knowledge graph on personal data (emails, files) with Ministral 3:3b. I try with Qwen 3.5:4b as well but mostly Ministral.

Works really well. Extracts companies you have dealt with, people, topics, events, locations, financial transactions, bills, etc.

Me neither, they're not ready for prime imo. I have a yearly sub and the product is just orders of magnitude behind Anthropic's offering. I use Code for real world stuff and I am happy with the result, Mistral is just not something I can trust right now.

I use them solely.

[dead]

Congratulations on the launch!

Mistral seems to focus on a different market than the others. Their best model is meh, their best ASR model locally is either rather slow compared to Parakeet on similar languages, or not as good for others (like qwen ASR).

Side note: Lean seems quite unreadable with tons of single letter variable names. Part of it is me being unaccustomed with it, but still.

Mistral seems to focus on some niche LLM model tooling that are somehow very needed in certain cases. Can't forget their OCR multimodal embedding model!

This is great, congratulations to the Mistral team! I'm looking forward to the code arena benchmark results. Thanks for sharing.

is the haiku comparison because they've distilled from the model?

[dead]

[flagged]

Truly exciting

Here we go.