ZAYA1-8B matches DeepSeek-R1 on math with less than 1B active parameters

> The math and coding part is impressive but the agentic one is not.

I think this is very important to eventually become a viable replacement for coding models. Because most of the time coding harnesses are leveraging tool calls to gather the context and then write a solution.

I am hopeful, that one day we can replace Claude and OpenAI models with local SOTA LLMs

It's pretty close already. Check qwen3.6 27b if you haven't already. People are vibe and agentic coding with it on a single GPU.

It is more finicky than Claude but if you hand hold it a bit it's crazy.

I see that going around, and either the test cases are too simplistic or I'm doing something wrong. I have a server with a 3090 in it, enough to run qwen3.6, but I haven't had much luck using it with either codex or oh-my-pi. They work, but the model gets really slow with ~64k context and the attention degrades quickly. You'll sometimes execute a prompt, the model will load a test file and say something like "I was presented with a test file but no command. What should I do with it?".

So yeah, while it's true that qwen3.6 is good for agentic coding, it's not very good for exploring the codebase and coming up with plans. You need to pair it today with a model capable of ingesting the whole context and providing a detailed plan, and even then the implementation might take 10x the amount of time it'd take for sonnet or Gemini 3 to crunch through the plan.

EDIT:

My setup is really as simple as possible. I run ollama on a remote server on my local network. In my laptop I set OLLAMA_HOST and do `ollama pull qwen3.6:27b`, which then becomes available to the agent harnesses. I am not sure now how I set the context, but I think it was directly in oh-my-pi. So server config- and quantization-wise, it's the defaults.

I have a old supermicro X10DRU-i server with two Tesla V100's (48 GB VRAM) and 128GB RAM and have been running qwen3.6-27B with a lot of success. I would say it's performance on my use case (modifying and extending a ~70kloc C++ code base) has been excellent. I have no benchmarks, but it seems comparable to claude sonnet 4.6 in capabilities. I run it with llama.cpp:

llama-server -m Qwen3.6-27B-Q8_0.gguf -c 131072 --tensor-split 0.4,0.6 --batch-size 256 --cont-batching --flash-attn on -ngl 999 --threads 16 --jinja

I regularly get ~22tok/s when context utilization is below <65k, but it does slow done to ~13tok/s when the context is nearly full (lots of swapping to RAM). I have been using the qwen-code harness though, since it is far more token efficient than claude-code which injects massive prompts that chew up the context window. I plan on trying it with pi next.

I'm keeping my ~$20/mo claude subscripts for the planning prompts, and then hand it off to qwen for implementation. It's been working well so far.

I can see that and I don't know your setup, but there are people pushing >70t/s with MTP on a single 3090, with big contexts still >50t/s. 64k is not a lot for agentic coding, and IIRC 128k with turboquant and the likes should be possible for you. r/LocalLLM/ and r/LocalLLaMA/ are worth a visit IMO.

EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090

EDIT-2: this can also shave off a lot of context need for tool calling -> https://github.com/rtk-ai/rtk

I managed to execute with vllm successfully, but it breaks opencode on simple "what's this repo?" task. On oh-my-pi it wont event execute because omp sends multiple system prompts. I'll try with llama.cpp later and see if it works more reliably.

will give more info in the post

EDIT: thanks for the links!

This link [1] features some good insight on how to adapt your usage to smaller models which require more explicit or deliberate prompting. I have been using Gemma 4 31B a lot and have found it very competent. It can be a bit unstable and start spiraling or end up in infinite loops that you need to reset, but for the most part it's been really good.

[1]: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-...

thanks for the link! Interesting paper and little-coder looks purpose built for my own local model agentic experiments.

[dead]

> qwen

I only have luck with pi and qwen bashing 100 line scripts. Everything real needs a planful model. To your point:

> You need to pair it today with a model capable of ingesting the whole context and providing a detailed plan…

Curiously, ANTHROP\C seems determined to ensure you don't use your Opus 4.7 Max 1M tokens for this any more, instead it sics Haiku on your context to "sample" using a weird pile of inchoate regex tp return "no more than 50 lines" or similar uselessness then finally Opus goes and burns tokens cogitating a solve for a problem shape that doesn't have anything to do with the areas of interest, inevitably unsampled.

I really really really want a "no subagents, no sampling" mode (telling it all subagents are Opus in env vars doesn't seem to persuade it to go ahead and use those 1M tokens to just, you know, read the damn file. Ironic if getting the best out of Opus cannot be in their harness.

All this said, it seems most people think AI saves them time and money so long as it costs no money — feels like ANTHROP\C is optimizing for that. I get it.

But can we also have a `ANTHROPIC_ENABLE_HIGH_ROI=1` mode please?

It costs more fixing all these unnecessary oversights than it would cost to just do the toil the machine is here to do.

You're not sharing what quantization you're using, in my experience, anything below Q8 and less than ~30B tends to basically be useless locally, at least for what you typically use codex et al for, I'm sure it works for very simple prompts.

But as soon as you go below Q8, the models get stuck in repeating loops, get the tool calling syntax wrong or just starts outputting gibberish after a short while.

will do that in an edit to the post

Sure, waiting :)

In the meantime, Ollama seems to default to "Q4_K_M" which is barely usable for anything, and really won't be useful for agentic coding, the quantization level is just too low. Not sure why Ollama defaults to basically unusable quantizations, but that train left a long time ago, they're more interesting in people thinking they can run stuff, rather than flagging things up front, and been since day 1.

Ollama is definitely not the way to go once you have an interest beyond "how quickly can I run a new LLM" rather then "how do I use a local llm to do things in a remotely optimal way"

I'm currently giving club3090 a try, it seems to have lots of pre-configured setups depending on the workflow. I'm trying vllm first, then with llama.cpp.

Yeah. Context size matters a lot. With OpenCode dumping like 10k tokens in the system prompt it takes like 4 rounds before it had to compact at say 64k. It's not really worth it to run at anything below 100k and even then the models aren't all that useful.

They're also pretty terrible at summarization. Pretty much always some file read or write in the middle of the task would cross the context margin and it would mark it as completed in the summary. I think leaving the first prompt as well as the last few turns intact would improve this issue quite a lot, but at low context sizes thats pretty much the whole context ...

Something promising I found is "DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090" - https://github.com/Luce-Org/lucebox-hub

Didn't know 3.6 was available on Ollama outside of MacOS!

The (unsloth dynamic) 4-bit quants of Qwen 3.6 kept getting stuck in circles for me. Even though it doesn't benchmark as well, GLM 4.7 Flash at least keeps making progress if I keep nudging it, so I have actually been able to finish some apps with it.

When you run ollama serve, make sure you override the context size to about 32K. Also, I give the model a useful short README.md on the code it is writing or modifying, and a Makefile with useful targets for the agent to use. I usually use Claude Code with qwen3.6

I also go outside for fresh air while I wait for a session to run.

For context, I'm feeling like I have a "free Sonnet" now that I've got Qwen3.6 35B running on my 5070ti at home (I connect to it via Tailscale). I run it _almost exactly_ the same as this Reddit post which found a good way to squeeze the 35B model onto a GPU with 16GB of VRAM: https://www.reddit.com/r/LocalLLaMA/comments/1sor55y/rtx_507... I really like it because it's slightly more operationally complex (I had to write a script to start it) but now that I have it, I literally never have to change it. It's a folder with the llama-server in it and with the model.gguf in it, I run the script which starts serving the model, done.

Like that post, I get 75 tokens/second. The exact model is: Qwen3.6-35B-A3B-UD-Q4_K_M.gguf and I get 128k of context

I run it on my home machine and connect to it from anywhere over tailscale. I connect through the opencode CLI which I configure with this as provider by adding the following to my `~/.config/opencode/opencode.json`:

    {
      "provider": {
        "vllm": {
          "npm": "@ai-sdk/openai-compatible",
          "name": "local-llm-qwen3.6-35B",
          "options": {
            "baseURL": "http://homepc.tail987654.ts.net:8033/v1"
          },
          "models": {
            "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf": {
              "name": "Qwen3.6-35B"
            }
          }
        }
      }
    }

I see your updated post. Switch over to llamacpp and look up recommended quants and settings. A good place for this info is on /r/localllama

Yep! I'm currently trying vllm, then I'll give llamacpp a try too

I agree for planning it's not there yet. But I wouldn't be surprised if something came out that was in a similar weight class.

Qwen3.6 supports 266k context out of the box. Try using q8 kv cache to enable more of it.

I limited it to 64k expecting 24GB vram to not be enough to make use of the entire context window, but I'll try with other's suggestions.

Try oh-my-openagent plan mode.

Vibe coding on consumer hardware is still very limited; this is especially true on GPUs, whose RAM limit is around 16 - maybe 24 - GB for the vast majority (although Macs change the equation).

These are two realworld experiments, whose results are disappointing for those expecting levels of performance comparable to cloud services:

- https://deploy.live/blog/running-local-llms-offline-on-a-ten...

- https://betweentheprompts.com/40000-feet/

The first is even the 35b version of qwen3.6.

I don't disagree, I just want to say I've been really liking DeepseekV4, which is on par with Sonnet 4.6, in my coding tests.

I can't vibe code on a M3 Max - 48GB like I do with claude or codex..Far from it

I don't see how it's disappointing? 95% correct using the 35b model before the right quants came out on a laptop? And they still got tons of code written for them.

On a real GPU using 27b with the latest quants the experience is better. It's still not the same as opus running on a subsidized GPU farm. Well it is better for privacy at least.

I find it interesting how 2 people can read the same thing and come to very different conclusions.

I'm just using Qwen3-coder-next; I tried the new ones and the thinking mode is just too much. I'm still ending up with 'vibe' coding that's slow enough to catch when it does stupid things.

Eh. It is good in terms of results ( accuracy, good recommendations and so on ), but slow when it comes to actual inference. On local 128gb machine, it took over 5 minutes to brainstorm garage door opening mechanism with some additional restrictions for spice.

I find it hilarious how waiting 5 whole minutes to design software is considered slow in a way that people refer to as not useful. My god lol.

Is that 128gb RAM or VRAM?

Its the unified memory in this case ( Ryzen AI max ) so obviously there is some room for improvement there. Still, I would not dismiss the speed out of hand. Remember, we are trying to argue here that 'it is pretty close already'. In ways, it is. It others, it is not yet.

That's absolutely possible, its just as we move towards more advancement, We'll soon see Small models being smart enough to not be judged by parameter count but their reasoning and intelligence. You can see examples like Qwen 3.6 27B.

Yeah this is key, a lot of people are still just looking at the number of params and thinking these models are toys. What Qwen 3.6 has shown is that reasoning and tool calling are just as important if not more.

So at the heart of this architecture is what they call 'Markovian RSA', a combination of two papers RSA[0], which generates a certain amount of reasoning traces for a prompt; and the 'Markovian Thinker'[1] which seems to basically cut the end of those traces to keep context at a reasonable length.

I feel like there's potential to improve that part of just cutting a tunable amount (τ) of tokens off the tail end of those traces, because you may potentially lose valuable insight earlier in the trace? They did train the model (in SFT) to put the relevant information into the tail (τ) of the trace, but I'm not sure this is the best possible way.

0. https://arxiv.org/pdf/2509.26626

1. https://arxiv.org/pdf/2510.06557

Announcement blogpost: https://www.zyphra.com/post/zaya1-8b

I used their online api, and asked it to create code for a timer i can copy paste into about:blank to test out (prompt below)

it did it successfully, but it did need a follow up correction prompt, overally pretty impressive for a model with 760M active parameters, but definitely not deepseek-r1 level

that being said, if something with 760M active parameters can be this good then, there's a good chance it is likely that api-based models are likely to get cheaper in the future

Prompt ------

``` can you write me some js code (that i can put in the console for about:blank) which will basically create a timer for for me that i can start, stop, and store current values for (or rather lap)

so i want it to create buttons (start, stop, lap buttons) on the page for me with labels and divs and other elements that accordingly record the current information and display the current information, and can accordingly start, stop and lap :)

the js code that i copy paste automatically creates the html buttons and divs and other elements that can manage the timer and accordingly the timer works with them ```

0.76 active and it's vaguely competitive at coding sounds promising.

LM studio doesn't let me actually run this yet though: "Unsupported safetensors format: null"

I've been saying it for a long time now. I think small models are the future for LLMs. It's been fun seeing experiments to see just how much better models get by making them insanely large but it's not sustainable.

No I am not saying this model is a drop in Claude replacement. But I think in 2 years we might be really surprised what can be done in a desktop with commodity hardware, no connection to the internet, and a few models that span a subset of tasks.

Really happy to see amd put their hat in the ring. It's a good day for amd investors. I know a lot of AI bros will scoff at this, but having your first training run is a big deal for a new lab. AMD is on their way despite Nvidia having years of runway

You couldn't be any more right!

but he could be absolutely right

He could be right but time will tell if we can really achieve that level in open source space because as you know Even in open source space companies go closed when they achieve something really efficient and frontier. I'm not talking about all but that's usually a pattern

There are a lot of hats in the ring. I don't see Alibaba shutting down anytime soon. They make qwen.

Deepseek is doing valuations right now.

Moonshot is just getting started. Same with AMD. mistral is still working hard at it and has a customer base.

An Egyptian company dropped their first small model this month, Horus.

There are enough geopolitics at play that I expect this to be a very different outcome from typical startup market dynamics. If anything j worry about the big us labs longevity. The world is fed up with US tech it seems, and even for us citizens it's questionable the frontier labs have their interests in mind as they risk the entire economy.

That is a danger, but for now it seems rather distant.

OpenAI has provided in the past a couple of open-weights models, but it does not seem to plan the release of any others.

But except for OpenAI and Anthropic, with this announcement Zyphra is the 12th company which has announced new improved open-weights models during the last couple of months.

A half of these 12 companies have launched not only small models with less than 128B parameters, but also big models with a number of parameters ranging from over 200B to over 1T.

So for now there is a healthy competition and the offerings in open-weights models are very diverse and numerous.

(The 12 directories on huggingface.co: deepseek-ai, google, ibm-granite, LiquidAI, MiniMaxAI, mistralai, moonshotai, nvidia, Qwen, XiaomiMiMo, zai-org, Zyphra.)

using C was 100 times as productive as assembly. what happened was not that we finished software 100 times faster, but that we did projects 100 times bigger in the same time

same thing with smol local LLMs versus the big ones in the sky. your smol local LLM will only be able to tackle projects which are not comercially valuable anymore, because people expect 100x scope and features. which is fine as a hobby/art project

yes, we'll do amazing things with local LLMs in 2 years, but the big LLMs will do things beyond imagination (assembly vs C)

I disagree. I think people can make very good software by balancing their use of AI and their market knowledge. I still believe for the foreseeable future people can make wildly loved or mission critical software with 0 ai and have it be met with market interest.

I think we are going to see a surge in software claiming to do everything and becoming bloated and unsustainable.

I already see 1gpu local models 1 shotting games via vibe coding. I see people doing agentic programming, granted more slowly and cheaply than 12 Claude sessions.

The difference isn't as big as it was 2 months ago. In the past 45 days so many model releases have happened. Meanwhile frontier performance has stagnated and degraded. If it's a taste of what is to come I welcome it.

I'm like two months into a vibe coded C project. My issues are the same as ever. How to pack memory. What syscalls to run and when. Is the program stable after running for 24 hours? When I want to make a change it's usually a trade off with something else. There's no accounting for taste among humans. Let alone among an LM. It's great at implementing my ideas but terrible at coming up with those ideas. Architecture is always going to be king.

I agree. Humans with experience are better at writing and designing code. But for the people who have given up on themselves this stuff is interesting.

I personally use these models for low value boiler plate tasks only. Or auto complete.

I meant literally vibe coded. It's 99.99% written by agents. But it's extremely opinionated on the architecture. How the event bus works. How the plugin ABI is structured. What syscalls to run and when. That's my opinion. The human being. Outside of that the whole thing may as well be a compiler writing assembly. I'm actually thinking of doing that as my next experiment instead of relying on gcc.

Models are heavily fine tuned and trained to follow instructions. They are trained to be subservient. I am sure that cuts into their ability to think creatively. The other risk with a lot of creative thinking is risking hallucinations (creative thinking = perhaps trying what’s not in its training set = hallucination basically). So I will rephrase creative thinking as desired or useful hallucination that is still firmly within the constraints of the prompt.

If that sounds complicated, that’s because it is! It’s a tricky balance to get right. I think the current architecture for most GPT models isn’t sufficient to solve this problem for good. I suppose we need to do more research into what constitutes desirable vs undesirable hallucination and how to shift the balance towards the latter.

I agree with this take.

While smaller models will continue to get better, it does not render larger models obsolete. The larger models will move onto higher value tasks or just generate more value.

Today, a small local model might be as smart as GPT4 was in coding but the biggest models are exploding in demand.

[dead]

[flagged]