Ask HN: Claude 3.5 Sonnet vs. o1 vs. <other> for coding. Let's talk!

Both o1 (mini/preview) and Claude 3.5 Sonnet are popular among devs, but opinions are divided and all over the place. In my experience, each has its strengths and weaknesses, and I find myself switching between the two.

If you’ve used either, or ideally both, I would love to hear your insights. I feel that answering the following questions will provide some context when you respond:

- What are the strengths & weaknesses of each from your experience?

- Any tips/tricks or prompting techniques you use to get the most from these models?

- How do you typically use them? (via native apps like ChatGPT or Claude, or via Cursor, GitHub Copilot, etc.?)

- What programming language(s) do you primarily use them with?

Hopefully this thread provides a useful summary and some additional tips for readers.

(I’ll start with mine in the comments)

o1 for collabing on design docs, o1 for overall structure and breaking it into tasks (per preference / sort); sonnet/o1 for executing each small task.

o1 is higher quality, more nuanced, and has deeper understanding; the biggest downsides rn are the significantly higher latency (both due to thinking, and also because continue.dev doesn't support o1 streaming currently, so you're waiting until it's all done) and the higher cost.

In terms of tools: either vscode with continue.dev / cline, or cursor

Languages: node.js / javascript, and lately c# / .net / unity
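
For anyone who wants to try the plan-then-execute split described above, here is a minimal sketch in Python. It assumes nothing about any real SDK: `planner` and `executor` are hypothetical callables that take a prompt string and return the model's text reply, so you would wrap whatever client you actually use (e.g. o1 for planning, Sonnet for execution). The prompt wording is also just an assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function that takes a prompt and returns the reply text.
# Wrap your real client (o1, Sonnet, ...) in a function with this shape.
ModelFn = Callable[[str], str]


def plan_tasks(planner: ModelFn, goal: str) -> List[str]:
    """Ask the planning model for a task list, one task per line."""
    reply = planner(
        "Break the following goal into small, independent coding tasks, "
        "one per line, with no extra commentary:\n" + goal
    )
    tasks = []
    for line in reply.splitlines():
        # Drop leading bullet characters ("-" or "*") and surrounding whitespace.
        task = line.strip().lstrip("-* ").strip()
        if task:  # skip blank lines
            tasks.append(task)
    return tasks


def run_workflow(
    planner: ModelFn, executor: ModelFn, goal: str
) -> List[Tuple[str, str]]:
    """Plan with one model, then hand each small task to another model."""
    results = []
    for task in plan_tasks(planner, goal):
        # Each task gets its own short prompt instead of one long conversation.
        output = executor(f"Overall goal: {goal}\nCurrent task: {task}")
        results.append((task, output))
    return results
```

Keeping each executor call to a single small task is the point of the design: the expensive, slow model is called once for the plan, and the cheaper one never has to hold the whole project in one conversation.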

yes, o1 def. seems to have a deeper 'understanding'

I prefer o1. I mostly use it as a knowledge system. Don't really care for the automatic code generation nonsense. Unless I'm really tired and the task is very simple, in which case I might decide to write a paragraph of text instead of 30 lines of Python. My experience is that when ChatGPT fails, Claude fails too. On some advanced coding tasks, I find ChatGPT's depth of reasoning ability to be better.

Out of curiosity, could you give an example of 30 LOC of Python that would require a lot of text to describe? I usually find that high-level descriptions, with maybe one iteration of refinements, work surprisingly well if the assistant has knowledge of the context.

curious what you meant by a knowledge system. thanks

o1:

- better for when the response has to address many subgoals coherently

- usually will not undo bugfix progress made earlier in the conversation, whereas with Claude, in extremely long conversations, I have noticed it reintroducing bugs it had already fixed much earlier

Claude:

- image inputs are actually a very useful complement when debugging, esp if the issue is visual at all (eg debugging why a GUI framework rendered your UI in an unexpected way, just include a screenshot)

- surprisingly very good at taking descriptions of algorithmic or mathematical procedures and making captioned svg illustrations, then taking screenshots of those svgs + user feedback to enhance the next version of svg illustrations

- more recent knowledge cutoff, so generally speaking somewhat less likely to deny newer APIs/things exist (eg o1 told me tokenizer.apply_chat_template and meta-llama/Llama-3.2-1B-Instruct both did not exist and removed them both from the code I was feeding it)

thanks!

> with Claude if you start having extremely long conversations I have noticed it allowing certain bugs it had already fixed to be reintroduced at much later times

i think this is a result of its inability to handle long contexts well?

My notes:

- Sonnet 3.5 seems good with code generation and o1-preview seems good with debugging

- Sonnet 3.5 struggles with long contexts, whereas o1-preview seems good at identifying interdependencies between files in a code repo when answering complex questions

- Breaking the problem into small steps seems to yield better results with Sonnet

- I’m primarily using them in Cursor/GH Copilot, with Python

I concur. Sonnet is great at starting projects, but eventually gets 'bogged' down and starts losing the plot. o1 is then useful to sort out the issues and painfully pull things back on track.

I like aider with claude-3-5-sonnet-20241022; haven’t tried it with o1, though.

Also, https://aider.chat/docs/scripting.html offers some nice possibilities.
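
As a rough idea of what scripting it can look like, here is a small sketch that only builds a one-shot aider command line, using the `--model` and `--message` options. The helper name `build_aider_command`, the example message, and the file names are all made up for illustration; check the linked docs for the options your aider version supports.

```python
import shlex
from typing import List, Sequence


def build_aider_command(
    message: str,
    files: Sequence[str],
    model: str = "claude-3-5-sonnet-20241022",
) -> List[str]:
    """Build an argv list for a single non-interactive aider run.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    return ["aider", "--model", model, "--message", message, *files]


# Example file names and message are placeholders.
cmd = build_aider_command("add docstrings", ["app.py", "utils.py"])

# shlex.join gives a copy-pasteable shell string with proper quoting.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would actually run it, assuming aider is installed and an API key is configured.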

haven't tried aider. looks interesting

Started a small project to compare AI IDEs

https://github.com/StephanSchmidt/ai-coding-comparison/

(no comparison there yet, just some code to play around with)

interesting, will check it out

Are there any concrete benchmarks for comparing models for different types of programming tasks?

not that i know, but that's something we def. need

o1 if you're going to write full specs and not provide any context.

Sonnet 3.5 if you can provide context (e.g. with cursor)

gpt-4o for UI design. Also for solving screenshots of interviews

wdym by 'screenshots of interviews'? like coding interview questions?

Yup. Most platforms won't let you copy-paste the question to AI but you can use the ChatGPT app to just take a photo of the question.

Personally, I'm not a fan of any company that's interviewing in this manner; most of them lack people who are senior enough to write questions that AI can't answer. So I encourage cheating until companies decide to stop interviewing this way.

o1 is much better at finding complex needle-in-a-haystack bugs/fixes. sonnet 3.5 is better at shallow, generic coding

agree!