Ask HN: How are LLMs getting smarter beyond just scaling?

I have a question for those who deeply understand LLMs.

From what I understand, the leap from GPT-2 to GPT-3 was mostly about scaling - more compute, more data. The jump from GPT-3 to GPT-4 probably followed the same path.

But in the year and a half since GPT-4, LLMs have gotten significantly better, especially the smaller ones. I'm consistently impressed by models like Claude 3.5 Sonnet, even though we've supposedly hit the limits of scaling.

What's driving these improvements? Is it thousands of small optimizations in data cleaning, training, and prompting? Or am I just deep enough into tech now that I'm noticing subtle changes more? Really curious to hear from people who understand the technical internals here.