Heh. I built "Fusion" a few months ago as an MCP using OpenRouter. The idea was to give Claude a "panel of experts" to go talk to when it got stuck.
After extensive testing and benchmarking I discovered that when you ask one model to judge another's response you don't actually get a better answer. You are just asking it "how closely does this resemble the answer you would have given me." Additional rounds and all the "obvious" solutions that pop into your mind reading the proceeding sentence are essentially just cranking up the temperature.
I did find a solution, but it is insanely expensive. Maybe if this gains traction I'll release mine.
I made a rough version of this in 2024[0], interesting to see that the idea is still around. I had the ability to set "quality thresholds", but it didn't seem to matter, the frontier models pretty much always agreed with each other and scored the answer highly, I should revisit it since it is a whole different ballgame than it was 2 years ago.
I think it depends on whether the answer is verifiable.
I have tested two judge models in my apps:
1. Judge model for a resume tailor. It evaluated the result resume vs the base resume and JD and judged it out of 10 on fit and honesty. It worked well and was useful.
2. Review model in my LLM trading bot platform. It reviews decisions from the Main model. The problem here is that the bot is navigating ambiguity. So unless the Review model catches an outright blunder (e.g. making a decision on wrong candle price or a BUY when it should be a SELL), the Review model can do more harm than good.
First, it adds latency to decisions, decisions take twice the amount of time (like be 60s instead of 30s for Gemma 4 31B). Second, it can make the bot too cautious, because Review model only runs on BUY/SELL decisions and not HOLD decisions, so the bot will only make less trades instead of review model increasing number of trades (because of latency and cost).
So overall, I think you'll get better results with a better model single shotting it rather than a review model if the answer isn't easily verifiable. But then why do you need a judge model and not just have the same agent review itself?
ALSO, if you read the reasoning text for a reasoning model (like Gemma 4), you see that it ALREADY reviews itself. So it's doing its best, re-review isn't really adding information. It's an interesting experiment, but you need to evaluate on a case by case basis.
Prompt matters. Obviously if you want another model opinion you must generate from the scratch using the same prompt and then you can try to synthesize, but working with an existing response can work if desired. I use explicit instructions to find issues with assigned severities and then these are going through the panel of judges, only issues passing certain threshold are fixed in the original response.
I'll share a revelation which vastly improved my results: tell judges to evaluate truth and usefulness/should-be-fixed axis separately. Because inevitably with a prompt that is forcing to find issues you will end up with nitpicks. Plus truth axis allows to better evaluate the issue-finder models for your use case.
That's some part of what happens when I generate explanations like this one: https://hanzirama.com/character/%E6%9D%A5#explain - at this point the site is a small side product of my LLMs-evaluation machinery.
Bonus content for patient readers: if you need top quality you will likely need to pin provider(s) on OR, :exacto is not enough to get good repeatable results especially for open-weights models.
I think there is alpha just have to be very careful how you let the models com up with solutions and collaborate.
I've found that if I tell a judge that the answer came from a small and weak local LLM, it will pick the answer apart brutally...but since I have not done this systematically, I dont know how well it generalizes past my vibes.
Anyone else fell like if you can trick the LLM into a mode where it "feels" superior, it will act the asshole very well?
Yeah. I usually do this by telling it to be adversarial and find gaps and holes. Not fool proof but it does seem to increase the quality. It has helped when using local models in particular.
Yeah, you have to shortcut the RL-trained people pleasing
I've started to have different models review things like architectural planning docs- and I think for these more "fuzzy" outputs the differences between the outputs can be quite different and I can use my own "taste" to pick the best one.
I don't think it would work without a human in the loop but it is surprising to me how varied models' vibes are and how a system design varies by what it thinks is important to include and emphasize.
Yeah, same experience. It turned out that objectively better answers were not that easy to find plus the expense plus it’s slow.
I’d be interested in the benchmarking if you ever write it up! People do seem to assume LLM as a judge/panel improves outcomes (and arguably it does in cases like code review?) but I suspect it is very situational and the priors from human panel of experts don’t always translate cleanly.
I had a very similar experience. I'd be keen to see how you went about it if you release it.
Sounds like fusion would be a really good distillation target?
Which models were you using under this? If you used the quality default as exists in the interface, it makes sense that it was ~4x the cost as it'd be 3 frontier models judged by one of those.
The idea would be to use fusion with simpler, cheaper models.
yeah its really counterintuitive i think; i.e, getting the right framework and structure for this to work probably isn't trivial, models really hate playing well together. i wonder how their version would fair in real world use.
[flagged]
I have been thinking a lot about this and my simplified understanding is that each model can be seen as a bell curve over human knowledge and each model has a different distribution. Using multiple models would allow us to change the distribution of other models with text that is out of their original curve. But then if you think about it does SFP and RL even alter the original distribution of text enough that models have enough variety so that their combined output is something better or just an echo chamber I believe not but I have no way to prove it yet.
On OpenRouter's fusion API your request is routed to several models simultaneously and a judge model combines their answers into a final response. This significantly boosts performance, at the cost of time (at least on the one benchmark they tested, a deep research benchmark).
They have a Budget preset consisting of 3 cheaper models (which roughly matches Fable on that benchmark, costing half as much), and a Quality preset of 3 expensive ones (which beats Fable, but costs twice as much as Fable).
Curiously, fusing a model with itself also boosted performance (2xOpus4.8 roughly matching Fable on the benchmark, but costing twice as much as Fable). There's a further, smaller gain from mixing different models. The main gain seems to be from additional test time compute.
Would love to see more research on this, especially focusing on the cheap models that came out recently (e.g. Fusing DSV4 with itself, or with Mimo), and to see what the tradeoffs look like between running a fusion (parallel test time compute) vs increased reasoning or turns.
> Curiously, fusing a model with itself also boosted performance
Back in the GPT2 to GPT3 era this was a pretty common thing to do. You are effectively taking more samples from the space of likely outputs. If your model can do the task 60% of the time just take 5-10 samples and implement some kind of majority voting
It became less common to use as models got high accuracy on problems where combining results is trivial. But with a more complex judge (a competent LLM) you can still get better results by just sampling more of the output space and picking out the best aspects
Interesting how well a panel of Fable 5 + GPT 5.5 beats the frontier of either one, but if you add Gemini into the mix the panel of three performs worse, not better. To me that sounds like Gemini is worse at the given tasks but better at convincing judges of its solutions. Oh and a Panel of 2 Opus 4.8 models is almost exactly as good as one Fable 5. That smells suspicious. Do we know if that might simply be what Anthropic is doing behind the curtain?
Yeah, GPT 5.5 + Fable beating either individually is belivable, but 2x Opus > Fable is what makes me a bit dubious about the whole thing. They might be measuring skills that are too specific or benefit a lot from more tokens being thrown at them. Also Claude Code (the harness) is not the best at the moment, that might be part of it as well?
What throws me off is DeepSeek beating both Opus 4.8 and GPT 5.5.
That definitely doesn't sound right.
> Interesting how well a panel of Fable 5 + GPT 5.5 beats the frontier of either one, but if you add Gemini into the mix the panel of three performs worse
I'm not seeing that? Did you maybe misread the #2 ranked one as Fable + GPT + Gemini? It's actually Opus + GPT + Gemini.
> Oh and a Panel of 2 Opus 4.8 models is almost exactly as good as one Fable 5. That smells suspicious. Do we know if that might simply be what Anthropic is doing behind the curtain?
I wouldn't be surprised if Fable/Mythos is a model distilled from a Panel/Council of Claude instances. Recursive self improvement is something all AI labs must be working on in some way or another.
I don't know if it is still the case with current models, but a few generations back Microsoft had some research results where asking a model to iterate N times would significantly improve performance, with the optimal point being 4 iterations.
I think there's a sweet spot for it. If a model can't do a task, iterating won't help. If a model can do it reliably, there's no need to iterate.
If it can do it, but unreliably, that's where you would get major gains from iterating. I think the Chinese models are in that sweet spot, for many tasks. I would love to test that.
I started working on my own fusion system yesterday. I'm not sure how to benchmark it though.
The thing I'm most interested in is reliability. Going from 90% to 95% on a benchmark doesn't seem like much but you've cut the error rate in half.
> but a few generations back
Out of interest: Was this still before CoT/thinking-mode became the norm?
I have a version of this called llm-consortium which I vibe-coded from a karpathy tweet[0] + some notes using my LLM Plugin Generator plugin for simonw's LLM cli. If you combine it with the llm-model-gateway plugin you can serve a consortium like a regular model on an openai proxy - the response will be the synthesis, and conversation context is preserved for multi-turn chats.
I realized at some point that 'consortium' was not proper term for what this was doing, since I was creating a kind of llm organization/council, whereas a consortium is a group of organizations. So rather than rename it I added the ability to create a consortium of consortiums, where each member can itself be a consortium models. The arbiter can also be a consortium which enables multi-model judging. This can obviously baloon token usage insanely, I think my record is over 100 models prompted from one prompt.
So to reign in the token explosion somewhat I added a simple rank mode, which produces only a ranking, and then the top ranked answer is returned. You can use this in combination with meta-consortiums like this
>llm consortium save cns-kimi -m k2.7-code -n 5 --arbiter mercury-2 --judging-method rank
llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank
llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --max-iterations 1 --arbiter qwen-3.5 # judging-method left at default to create a synthesis
This will first send five prompts each to kimi and glm and pick top ranked answer from each using the fast mercury-2 model, then it will create a synthesis from those two responses using a better model like qwen
Mercury-2 is extremely fast, and good for ranking mode, but for synthesis I prefer a slightly larger model. This is most important when you are using it inside a harness or agent with a strict output format. This is because then you end up nesting a complex structure embedded in another complex structure (llm-consortium uses structured reasoning with xml tags). Even opus sometimes struggles with this in the few times I tried it - but qwen, glm and kimi have all been reliable arbiters so far.
Some anecdata on Fusion: I run same query I used for Fable on OR Fusion and results were worse.
It felt, like Fable was able to kinda grasp very deep knowledge/intelligence layers and outline solution not only in agreeable way, but rather it proposed to prioritize solution items, with discarding some of the items, which made a lot of sense to me.
While Fusion felt more like a bit diversified answer of the same class of pre-Fable SOTA models, without touching the depth of knowledge/intelligence layers, which Fable was able to get, in my very limited tests I did, while Fable was accessible.
Spent the weekend inspired by the new openrouter fusion model and wanted to see if it could run in Claude Code and if I could make it very easy for everyone else to try.
Built - claude-fusion-launcher — run Claude Code on a panel of models, not just one
Doesn't it get expensive fast? I found the one-off prompts I did in their website to cost almost a $1/prompt.
I’ve been experimenting with two things on this:
- multi-model consensus, with multiple cross-review rounds. Obviously, the number of inference tasks explodes with the number of models. Led to some interesting results [^0].
- giving an agent "stray thoughts" produced by the same model, or another, giving the second model a selection of the agent’s context, with different triggers (random, loop detection,…)[^1]. So far has proven very helpful and much cheaper than the first.
I built ChatDelta.com to give developers this power hands-on instead of outsourcing it to a company.
I tried OpenRouter Fusion with the budget model option but swapped out DeepSeek v3.2 for DeepSeek V4 Pro. The results weren't that bad. An interesting take on quorums for sure.
However I did notice a tool call to Claude Opus 4.8 for 1168 - 237 tokens, and $0.0118 cost, which I cannot account for because Opus was not in my selection and only revealed in logs. Strange.
Same for me! I bet they use opus to synthesize the final answer somehow? Regardless, it was unexpected.
It should be called something else, maybe Ensemble? It doesn't fuse anything.
Yeah the Rio thing is a better candidate for that word, where they averaged the weights for two models:
Conceptually this is wrapping an agent harness in an LLM call API. I wonder if this format is more digestible than the agent building tools the big labs are rolling out.
I'm sure many have made something like this, I've done a few. I've found simply submitting one's prompt to multiple models to be kind of pointless. You're just going to get statistical noise from the variances in their training methods, as they are all training on pretty much the same data.
I get significantly better results by pre-prompting each LLM (they can be the same LLM too, just another instance), I pre-prompt them to approach from a different perspective. Basically, I create expert personas that each believe they are someone of a different career, different intellectual perspectives, and then that generates a real debate between experts.
Agree, and I see opus and Gemini pro as “quality” on openrouter fusion, this would be super pricy if the prompts are dynamic and not optimised for caching.
I would love to hear why they have created it, what was the business case, what this is going to serve? As you said, this is pretty easy to replicate
[dead]
I was reminded of "model alloys", where they randomly select a LLM for every agentic turn. This significantly boosted performance on security work.
(10 points on the benchmark, or a relative increase of over 20%)
TFA on the other hand tests two things at once: mixing models, and "fuse a model with itself",! the latter being just test time compute. e.g. Opus was able to match Fable on TFA, at the cost of costing twice as much money (and presumably time).
These two dimensions are orthogonal but can be combined for further gains.
It's not clear that every task benefits from it though. The only benched deep research, and their results are a bit weird. (e.g. they have DeepSeek outranking frontier models.)
More research needed!
Similar feature launched open-source and end-to-end encrypted on my TrustedRouter https://trustedrouter.com/
this is great
it's nice to see an actual privacy committment, i spend a lot of time reading through reams of evasive and nebulous provider terms
I opened the page and prompted it `Which 3d printer is the best`. I mean this is a stupid question but I was looking at some 3d printers so it popped into my mind.
It came up with a decent response but I guess Opus or GPT 5.5 would do fine anyway. Gotta try it on different stuff. But this feels like it would work great on some situations.
Interestingly I've had a similar experience with agent teams/swarms, albeit they can get much more expensive depending on the workflow.
I found that Fable didn't have as much of an impact when put in a team.
But it was/is a very pleasant model to work with 1:1. And was the first time I didn't use my primary team based workhorse in months, across 10s of sessions last week.
You could easily distribute the same task to 5 subagents that are specifically programmed to do as best as they can based on their scope and merge the results into a single coherent response.
That is more or less the same thing.
I am not sure who is the intended user of this fusion api as with all things prompt + model matter.
People who don't want the hassle. A lot of Openrouters selling point is removing hassle, and providing things like this can move them up the value chain for people who aren't very cost sensitive and are happy to pay to get better outcomes without having to do the work themselves.
Haven't managed to get past "Fusion failed. You can retry from the results view."
with no clue why it failed...
Would be interesting to see coding performance on SWE benchmarks.
Interesting. Will definitely use this.
One scenario I can see it working is writing markdown specs before the coding starts and analysing it for gaps. That’s so few tokens that throwing as much LLM against it as possible is worthwhile regardless of cost per million tks
I wonder if these fusion techniques could help to run better local AI by streaming tokens from multiple machines and combining them
Random forest!
I got significant improvement on code quality (so much that it has become a no brainer for important tasks such as planning) simply by adding the --self-review flag to swival: https://swival.dev/pages/reviews.html
Two instances of the same model, a producer and a reviewer, and the loops doesn't end until everybody's happy.
[deleted]
really interesting that its basically almost 80% claude opus..
I have an old, slow GPU setup that has nearly 100gb of VRAM
I had been trying to fill this up with big models but it doesn’t seem like these give a good return per Gb
I’m looking at that and wondering would I be better off running multiple such models in parallel. It would probably be a better way to load balance across SLI.
My guess is the scaling will be more “mythical man month” than “no more free lunch” - the interaction of models resembling social dynamics moreso than multi-core setups.
Given that these actors are largely homogenous in culture and incentivising, and coordination overhead is drastically reduced.
Commonly we consider optimal team size to be between 3 and 7 and Brookes’ maximum team size is around 10 or so before the system fails. It should be possible to blow way past those numbers and still experience increased gains in productivity as long as you can keep all your instances stoked.
Heh. I built "Fusion" a few months ago as an MCP using OpenRouter. The idea was to give Claude a "panel of experts" to go talk to when it got stuck.
After extensive testing and benchmarking I discovered that when you ask one model to judge another's response you don't actually get a better answer. You are just asking it "how closely does this resemble the answer you would have given me." Additional rounds and all the "obvious" solutions that pop into your mind reading the proceeding sentence are essentially just cranking up the temperature.
I did find a solution, but it is insanely expensive. Maybe if this gains traction I'll release mine.
I made a rough version of this in 2024[0], interesting to see that the idea is still around. I had the ability to set "quality thresholds", but it didn't seem to matter, the frontier models pretty much always agreed with each other and scored the answer highly, I should revisit it since it is a whole different ballgame than it was 2 years ago.
[0] https://github.com/Ceroxylon/konsensis
Yes, definitely not a new idea. I had a multi-turn composite model in 2024 that was outperforming the top models across benchmarks: https://x.com/LechMazur/status/1828804485033992514.
I think it depends on whether the answer is verifiable.
I have tested two judge models in my apps:
1. Judge model for a resume tailor. It evaluated the result resume vs the base resume and JD and judged it out of 10 on fit and honesty. It worked well and was useful.
2. Review model in my LLM trading bot platform. It reviews decisions from the Main model. The problem here is that the bot is navigating ambiguity. So unless the Review model catches an outright blunder (e.g. making a decision on wrong candle price or a BUY when it should be a SELL), the Review model can do more harm than good.
First, it adds latency to decisions, decisions take twice the amount of time (like be 60s instead of 30s for Gemma 4 31B). Second, it can make the bot too cautious, because Review model only runs on BUY/SELL decisions and not HOLD decisions, so the bot will only make less trades instead of review model increasing number of trades (because of latency and cost).
So overall, I think you'll get better results with a better model single shotting it rather than a review model if the answer isn't easily verifiable. But then why do you need a judge model and not just have the same agent review itself?
ALSO, if you read the reasoning text for a reasoning model (like Gemma 4), you see that it ALREADY reviews itself. So it's doing its best, re-review isn't really adding information. It's an interesting experiment, but you need to evaluate on a case by case basis.
Prompt matters. Obviously if you want another model opinion you must generate from the scratch using the same prompt and then you can try to synthesize, but working with an existing response can work if desired. I use explicit instructions to find issues with assigned severities and then these are going through the panel of judges, only issues passing certain threshold are fixed in the original response.
I'll share a revelation which vastly improved my results: tell judges to evaluate truth and usefulness/should-be-fixed axis separately. Because inevitably with a prompt that is forcing to find issues you will end up with nitpicks. Plus truth axis allows to better evaluate the issue-finder models for your use case.
That's some part of what happens when I generate explanations like this one: https://hanzirama.com/character/%E6%9D%A5#explain - at this point the site is a small side product of my LLMs-evaluation machinery.
Bonus content for patient readers: if you need top quality you will likely need to pin provider(s) on OR, :exacto is not enough to get good repeatable results especially for open-weights models.
Nice - I built an npm package in a similar fashion called Agent Order: https://github.com/btahir/agent-order
I think there is alpha just have to be very careful how you let the models com up with solutions and collaborate.
I've found that if I tell a judge that the answer came from a small and weak local LLM, it will pick the answer apart brutally...but since I have not done this systematically, I dont know how well it generalizes past my vibes.
Anyone else fell like if you can trick the LLM into a mode where it "feels" superior, it will act the asshole very well?
Yeah. I usually do this by telling it to be adversarial and find gaps and holes. Not fool proof but it does seem to increase the quality. It has helped when using local models in particular.
Yeah, you have to shortcut the RL-trained people pleasing
I've started to have different models review things like architectural planning docs- and I think for these more "fuzzy" outputs the differences between the outputs can be quite different and I can use my own "taste" to pick the best one.
I don't think it would work without a human in the loop but it is surprising to me how varied models' vibes are and how a system design varies by what it thinks is important to include and emphasize.
Yeah, same experience. It turned out that objectively better answers were not that easy to find plus the expense plus it’s slow.
I’d be interested in the benchmarking if you ever write it up! People do seem to assume LLM as a judge/panel improves outcomes (and arguably it does in cases like code review?) but I suspect it is very situational and the priors from human panel of experts don’t always translate cleanly.
I had a very similar experience. I'd be keen to see how you went about it if you release it.
Here's what I use: https://github.com/DheerG/swarms
I've been thinking along those lines, too. Could you give a general overview of your solution?
I think it depends.
I regularly ask both GPT and Gemini to give me options - programming libraries to do X, architecture suggestions, names for projects/services/classes
After they answer I ask each model what does it think of the other answer, and to give me a final suggestion considering both answers.
Both GPT and Gemini would frequently say "that other answer is much better than my one, it considered X factor that I missed".
Try telling it the answer came from a small local LLM..the condescension can become palpable.
But.. but I told the LLM that it is an _expert_, is that worth nothing??
Make sure to remind it to make no mistakes.
You found the smoking gun!
prompting "no mistakes" was load-bearing
[dead]
[dead]
I ran a quick eval to see what this looks like qualitatively vs just calling Opus 4.7 or GPT 5.5 directly.
As expected, Fusion was 7x slower and 4x the cost.
This isn't a knock against it, just that it I think this places Fusion into a "use it only when you need it" category.
https://3fpi5avcqq.evvl.io/
Sounds like fusion would be a really good distillation target?
Which models were you using under this? If you used the quality default as exists in the interface, it makes sense that it was ~4x the cost as it'd be 3 frontier models judged by one of those.
The idea would be to use fusion with simpler, cheaper models.
yeah its really counterintuitive i think; i.e, getting the right framework and structure for this to work probably isn't trivial, models really hate playing well together. i wonder how their version would fair in real world use.
[flagged]
I have been thinking a lot about this and my simplified understanding is that each model can be seen as a bell curve over human knowledge and each model has a different distribution. Using multiple models would allow us to change the distribution of other models with text that is out of their original curve. But then if you think about it does SFP and RL even alter the original distribution of text enough that models have enough variety so that their combined output is something better or just an echo chamber I believe not but I have no way to prove it yet.
Context:
Surpassing Frontier Performance with Fusion
https://news.ycombinator.com/item?id=48525392
And a slightly better UI here: https://openrouter.ai/fusion
On OpenRouter's fusion API your request is routed to several models simultaneously and a judge model combines their answers into a final response. This significantly boosts performance, at the cost of time (at least on the one benchmark they tested, a deep research benchmark).
They have a Budget preset consisting of 3 cheaper models (which roughly matches Fable on that benchmark, costing half as much), and a Quality preset of 3 expensive ones (which beats Fable, but costs twice as much as Fable).
Pareto graph: https://openrouter.ai/blog/images/blog/fusion-benchmark-cost...
Curiously, fusing a model with itself also boosted performance (2xOpus4.8 roughly matching Fable on the benchmark, but costing twice as much as Fable). There's a further, smaller gain from mixing different models. The main gain seems to be from additional test time compute.
Would love to see more research on this, especially focusing on the cheap models that came out recently (e.g. Fusing DSV4 with itself, or with Mimo), and to see what the tradeoffs look like between running a fusion (parallel test time compute) vs increased reasoning or turns.
> Curiously, fusing a model with itself also boosted performance
Back in the GPT2 to GPT3 era this was a pretty common thing to do. You are effectively taking more samples from the space of likely outputs. If your model can do the task 60% of the time just take 5-10 samples and implement some kind of majority voting
It became less common to use as models got high accuracy on problems where combining results is trivial. But with a more complex judge (a competent LLM) you can still get better results by just sampling more of the output space and picking out the best aspects
Interesting how well a panel of Fable 5 + GPT 5.5 beats the frontier of either one, but if you add Gemini into the mix the panel of three performs worse, not better. To me that sounds like Gemini is worse at the given tasks but better at convincing judges of its solutions. Oh and a Panel of 2 Opus 4.8 models is almost exactly as good as one Fable 5. That smells suspicious. Do we know if that might simply be what Anthropic is doing behind the curtain?
Yeah, GPT 5.5 + Fable beating either individually is belivable, but 2x Opus > Fable is what makes me a bit dubious about the whole thing. They might be measuring skills that are too specific or benefit a lot from more tokens being thrown at them. Also Claude Code (the harness) is not the best at the moment, that might be part of it as well?
What throws me off is DeepSeek beating both Opus 4.8 and GPT 5.5.
That definitely doesn't sound right.
> Interesting how well a panel of Fable 5 + GPT 5.5 beats the frontier of either one, but if you add Gemini into the mix the panel of three performs worse
I'm not seeing that? Did you maybe misread the #2 ranked one as Fable + GPT + Gemini? It's actually Opus + GPT + Gemini.
> Oh and a Panel of 2 Opus 4.8 models is almost exactly as good as one Fable 5. That smells suspicious. Do we know if that might simply be what Anthropic is doing behind the curtain?
I wouldn't be surprised if Fable/Mythos is a model distilled from a Panel/Council of Claude instances. Recursive self improvement is something all AI labs must be working on in some way or another.
I don't know if it is still the case with current models, but a few generations back Microsoft had some research results where asking a model to iterate N times would significantly improve performance, with the optimal point being 4 iterations.
I think there's a sweet spot for it. If a model can't do a task, iterating won't help. If a model can do it reliably, there's no need to iterate.
If it can do it, but unreliably, that's where you would get major gains from iterating. I think the Chinese models are in that sweet spot, for many tasks. I would love to test that.
I started working on my own fusion system yesterday. I'm not sure how to benchmark it though.
The thing I'm most interested in is reliability. Going from 90% to 95% on a benchmark doesn't seem like much but you've cut the error rate in half.
> but a few generations back
Out of interest: Was this still before CoT/thinking-mode became the norm?
I have a version of this called llm-consortium which I vibe-coded from a karpathy tweet[0] + some notes using my LLM Plugin Generator plugin for simonw's LLM cli. If you combine it with the llm-model-gateway plugin you can serve a consortium like a regular model on an openai proxy - the response will be the synthesis, and conversation context is preserved for multi-turn chats.
I realized at some point that 'consortium' was not proper term for what this was doing, since I was creating a kind of llm organization/council, whereas a consortium is a group of organizations. So rather than rename it I added the ability to create a consortium of consortiums, where each member can itself be a consortium models. The arbiter can also be a consortium which enables multi-model judging. This can obviously baloon token usage insanely, I think my record is over 100 models prompted from one prompt.
So to reign in the token explosion somewhat I added a simple rank mode, which produces only a ranking, and then the top ranked answer is returned. You can use this in combination with meta-consortiums like this
This will first send five prompts each to kimi and glm and pick top ranked answer from each using the fast mercury-2 model, then it will create a synthesis from those two responses using a better model like qwen Mercury-2 is extremely fast, and good for ranking mode, but for synthesis I prefer a slightly larger model. This is most important when you are using it inside a harness or agent with a strict output format. This is because then you end up nesting a complex structure embedded in another complex structure (llm-consortium uses structured reasoning with xml tags). Even opus sometimes struggles with this in the few times I tried it - but qwen, glm and kimi have all been reliable arbiters so far.[0] https://x.com/karpathy/status/1870692546969735361 Further reading: Mixture-of-agents https://www.together.ai/blog/together-moa Google's Mind-Evolution https://arxiv.org/html/2501.09891v1
Some anecdata on Fusion: I run same query I used for Fable on OR Fusion and results were worse.
It felt, like Fable was able to kinda grasp very deep knowledge/intelligence layers and outline solution not only in agreeable way, but rather it proposed to prioritize solution items, with discarding some of the items, which made a lot of sense to me.
While Fusion felt more like a bit diversified answer of the same class of pre-Fable SOTA models, without touching the depth of knowledge/intelligence layers, which Fable was able to get, in my very limited tests I did, while Fable was accessible.
Spent the weekend inspired by the new openrouter fusion model and wanted to see if it could run in Claude Code and if I could make it very easy for everyone else to try.
Built - claude-fusion-launcher — run Claude Code on a panel of models, not just one
Also shows cost
https://github.com/smorinlabs/claude-fusion-launcher
Doesn't it get expensive fast? I found the one-off prompts I did in their website to cost almost a $1/prompt.
I’ve been experimenting with two things on this:
- multi-model consensus, with multiple cross-review rounds. Obviously, the number of inference tasks explodes with the number of models. Led to some interesting results [^0].
- giving an agent "stray thoughts" produced by the same model, or another, giving the second model a selection of the agent’s context, with different triggers (random, loop detection,…)[^1]. So far has proven very helpful and much cheaper than the first.
[0]: https://github.com/lightless-labs/refinery
[1]: https://github.com/Lightless-Labs/skunkworks/tree/main/flux
I built ChatDelta.com to give developers this power hands-on instead of outsourcing it to a company.
I tried OpenRouter Fusion with the budget model option but swapped out DeepSeek v3.2 for DeepSeek V4 Pro. The results weren't that bad. An interesting take on quorums for sure. However I did notice a tool call to Claude Opus 4.8 for 1168 - 237 tokens, and $0.0118 cost, which I cannot account for because Opus was not in my selection and only revealed in logs. Strange.
Same for me! I bet they use opus to synthesize the final answer somehow? Regardless, it was unexpected.
It should be called something else, maybe Ensemble? It doesn't fuse anything.
Yeah the Rio thing is a better candidate for that word, where they averaged the weights for two models:
https://news.ycombinator.com/item?id=48528371
Conceptually this is wrapping an agent harness in an LLM call API. I wonder if this format is more digestible than the agent building tools the big labs are rolling out.
I'm sure many have made something like this, I've done a few. I've found simply submitting one's prompt to multiple models to be kind of pointless. You're just going to get statistical noise from the variances in their training methods, as they are all training on pretty much the same data.
I get significantly better results by pre-prompting each LLM (they can be the same LLM too, just another instance), I pre-prompt them to approach from a different perspective. Basically, I create expert personas that each believe they are someone of a different career, different intellectual perspectives, and then that generates a real debate between experts.
Agree, and I see opus and Gemini pro as “quality” on openrouter fusion, this would be super pricy if the prompts are dynamic and not optimised for caching.
I would love to hear why they have created it, what was the business case, what this is going to serve? As you said, this is pretty easy to replicate
[dead]
I was reminded of "model alloys", where they randomly select a LLM for every agentic turn. This significantly boosted performance on security work.
(10 points on the benchmark, or a relative increase of over 20%)
https://news.ycombinator.com/item?id=44630724
TFA on the other hand tests two things at once: mixing models, and "fuse a model with itself",! the latter being just test time compute. e.g. Opus was able to match Fable on TFA, at the cost of costing twice as much money (and presumably time).
These two dimensions are orthogonal but can be combined for further gains.
It's not clear that every task benefits from it though. The only benched deep research, and their results are a bit weird. (e.g. they have DeepSeek outranking frontier models.)
More research needed!
Similar feature launched open-source and end-to-end encrypted on my TrustedRouter https://trustedrouter.com/
this is great
it's nice to see an actual privacy committment, i spend a lot of time reading through reams of evasive and nebulous provider terms
I opened the page and prompted it `Which 3d printer is the best`. I mean this is a stupid question but I was looking at some 3d printers so it popped into my mind.
Seeing this log is interesting: https://link.ekin.dev/6RzYGGX7
It came up with a decent response but I guess Opus or GPT 5.5 would do fine anyway. Gotta try it on different stuff. But this feels like it would work great on some situations.
Interestingly I've had a similar experience with agent teams/swarms, albeit they can get much more expensive depending on the workflow.
I found that Fable didn't have as much of an impact when put in a team.
But it was/is a very pleasant model to work with 1:1. And was the first time I didn't use my primary team based workhorse in months, across 10s of sessions last week.
You could easily distribute the same task to 5 subagents that are specifically programmed to do as best as they can based on their scope and merge the results into a single coherent response.
That is more or less the same thing.
I am not sure who is the intended user of this fusion api as with all things prompt + model matter.
People who don't want the hassle. A lot of Openrouters selling point is removing hassle, and providing things like this can move them up the value chain for people who aren't very cost sensitive and are happy to pay to get better outcomes without having to do the work themselves.
Haven't managed to get past "Fusion failed. You can retry from the results view." with no clue why it failed...
Would be interesting to see coding performance on SWE benchmarks.
Interesting. Will definitely use this.
One scenario I can see it working is writing markdown specs before the coding starts and analysing it for gaps. That’s so few tokens that throwing as much LLM against it as possible is worthwhile regardless of cost per million tks
I wonder if these fusion techniques could help to run better local AI by streaming tokens from multiple machines and combining them
Random forest!
I got significant improvement on code quality (so much that it has become a no brainer for important tasks such as planning) simply by adding the --self-review flag to swival: https://swival.dev/pages/reviews.html
Two instances of the same model, a producer and a reviewer, and the loops doesn't end until everybody's happy.
really interesting that its basically almost 80% claude opus..
I have an old, slow GPU setup that has nearly 100gb of VRAM
I had been trying to fill this up with big models but it doesn’t seem like these give a good return per Gb
I’m looking at that and wondering would I be better off running multiple such models in parallel. It would probably be a better way to load balance across SLI.
My guess is the scaling will be more “mythical man month” than “no more free lunch” - the interaction of models resembling social dynamics moreso than multi-core setups.
Given that these actors are largely homogenous in culture and incentivising, and coordination overhead is drastically reduced.
Commonly we consider optimal team size to be between 3 and 7 and Brookes’ maximum team size is around 10 or so before the system fails. It should be possible to blow way past those numbers and still experience increased gains in productivity as long as you can keep all your instances stoked.
[flagged]
[flagged]
[dead]
[flagged]