I'm sure this is plenty useful for less experienced people, but the "smart" hacks read a bit like:
Hack 1: Don't Use The Obviously Wrong Data Structure For Your Problem!
Hack 2: Don't Have The Computer Do Useless Stuff!
Hack 3: Don't Allocate Memory When You Don't Need To!
And now, a word from our sponsor: AI! Use AI to help AI build AI with AI, now with 15% more AI! Only with AI! Ask your doctor if AI is right for you!
It's worth pointing out that a few of them are Python-specific. Compilers can inline code, so there's usually no need to manually inline functions in most languages; needing to do it here is Python being Python. Likewise, the fact that the scope a function comes from matters for performance is quintessentially Python being Python.
The major gains in Python come from... not using Python. Essentially you have to rewrite your code around the fact that numpy and pandas are the ones really doing the work behind the curtain (e.g. aggressively vectorize, use algorithms that can use vectorization well rather than "normal" ones).
Number 8 of the list hints at that.
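As a minimal illustration of that (names are mine, not from the article): computing a root-mean-square with a pure-Python loop versus one vectorized numpy expression.

    import numpy as np

    def rms_python(xs):
        # Pure-Python loop: every iteration pays interpreter overhead.
        total = 0.0
        for x in xs:
            total += x * x
        return (total / len(xs)) ** 0.5

    def rms_numpy(xs):
        # One vectorized expression: the loop runs in compiled C code.
        a = np.asarray(xs, dtype=np.float64)
        return float(np.sqrt(np.mean(a * a)))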
Not to mention that a lot of these performance improvements, while sane, are on the order of milliseconds of improvement. Unless you're doing one of these unoptimized approaches thousands or millions of times in a tight loop you're probably not saving a substantial amount of time/energy/computation. Premature optimization is still the root of all evil!
If you want an actual performance improvement in Python code that most people wouldn't necessarily expect: consider using regexes for even basic string parsing if you're doing a lot of it, rather than doing it yourself (e.g. splitting strings, then splitting those strings, etc.); while regexes "feel" like they should be more complicated and therefore slower or less efficient, the regex engine in Python is implemented in C and there's a decent chance that, with a little tweaking, even simple string processing can be done faster with a regex. Again only important in a hot loop, but still.
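To make that concrete, a minimal sketch (the log format here is invented): instead of splitting strings and then splitting the pieces again, one precompiled pattern extracts the fields in a single C-level pass.

    import re

    line = "2024-05-01 12:00:00 level=INFO user=alice msg=ok"

    def parse_split(line):
        # Manual approach: split, then split each piece again.
        date, time_, *pairs = line.split(" ")
        return date, time_, dict(p.split("=", 1) for p in pairs)

    PAIR = re.compile(r"(\w+)=(\S+)")

    def parse_regex(line):
        # Regex approach: one precompiled pattern, matched in C.
        date, time_, rest = line.split(" ", 2)
        return date, time_, dict(PAIR.findall(rest))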
“Hacks” 4-10 could easily be replaced with “use numpy.” Performance gains from doing math better in pure Python are minimal compared with numpy. It’s not unusual for the numpy version of something to end up taking 0.01x as long to run.
Math is a lost cause, yeah, just use numpy (with the important caveat that you need to know what you're doing, it's easy to fumble badly).
But Python has a few interesting features that can easily get you big wins, like generators, e.g. https://www.dabeaz.com/generators/Generators.pdf
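For instance, a minimal sketch in the spirit of that PDF (the file name and line format are made up): a generator pipeline streams items lazily instead of materializing intermediate lists.

    import gzip

    def lines(path):
        with gzip.open(path, "rt") as f:
            yield from f

    def sizes(rows):
        for row in rows:
            yield int(row.split()[-1])  # assumes size is the last field

    # Constant memory, single pass; nothing runs until sum() pulls values:
    # total = sum(sizes(lines("access.log.gz")))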
A lot of interesting math can't be done in numpy, sadly. At that point you might be better off writing the initial version in Python and translating it to something else.
A friend of mine asked me to translate some (well-written) number theory code a while back, I got about a 250x speedup just doing a line by line translation from Python to Julia. But the problem was embarrassingly parallel, so I was able to slap on an extra 40x by tossing it on a big machine for a few hours for a total of 10,000x. My friend was very surprised – he was expecting around a 10x improvement.
I 'wrote' (adapted from the Rich project's example code) a simple concurrent file downloader in Python; run 'download <any number of URLs>' and it goes and downloads each one, assuming that the URL has what looks like a filename at the end or the server responds with a Content-Disposition header that contains a filename. It was very simple: spawn a thread for each file we're downloading, show a progress bar for each file, update the progress bar as we download.
I ended up rewriting the whole thing in Rust (my first Rust project) solely because I noticed that just that simple process - "get some bytes from the network, write them to this file descriptor, update the progress bar's value" was churning my CPU due to how intensive it was for the progress bar to update as often as it was - which wasn't often.
Because of how ridiculous it was I opted to rewrite it in another language; I considered Golang, but all of the progress bar libraries in Golang are mediocre at best, and I liked the idea of learning more Rust. Surprise surprise, it's faster and more efficient; it even downloads faster, which is kind of ridiculous.
An even crazier example: a coworker was once trying to parse some giant logfile and we ended up nerd-sniping ourselves into finding ways to speed it up (even though it finished while we were doing so). After profiling this very simple code, we found that 99% of the time in processing each line was simply parsing the date, and 99% of that was because Python's strptime is devoted to being able to parse timezones even if the input you're giving it doesn't include one. We played around with things like storing a hash map of "string date to python datetime" since there were a lot of duplicates, but the fastest method was to write an awful Python extension that basically just exposed glibc's strptime so you could bypass Python's (understandably) complex tz parsing. For the version of Python we were using it made parsing hundreds of thousands of dates 47x faster, though now in Python3 it's only about 17x faster? Maybe less.
https://github.com/danudey/pystrptime
I still use Python all the time because usually the time I save writing my code quickly more than outweighs the time I spend having slower code overall; still, if your code is going to live a while, maybe try running it through a profiler and see what surprises you can find.
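The hash-map trick mentioned above is easy to reproduce; a minimal sketch (the format string is an assumption, and the cache is unbounded, which is fine for a one-off script):

    from datetime import datetime

    _cache = {}

    def parse_date(s, fmt="%Y-%m-%d %H:%M:%S"):
        # Log files repeat timestamps constantly, so memoizing strptime
        # results skips most of the expensive parsing.
        try:
            return _cache[s]
        except KeyError:
            dt = _cache[s] = datetime.strptime(s, fmt)
            return dt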
Yes, I didn't realize anyone actually did anything numerical without numpy. I don't think I've ever imported Python's math module once. Who in their right mind is making a 1e6-long Python list?
Use polars instead of pandas. This alone saves me more time than any other “hack”.
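The switch is often mechanical; a small sketch using the newer polars API (file name and columns are made up):

    import polars as pl

    # Lazy scan + group-by: polars plans the query and runs it on all cores.
    df = (
        pl.scan_csv("events.csv")  # hypothetical file
          .group_by("user_id")
          .agg(pl.col("duration").sum())
          .collect()
    )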
> Copying large objects like lists […] can be costly in both time and memory.
> modify[ing] objects in place […] improves performance by avoiding the overhead of allocating and populating new structures.
AFAIK the poor performance of list copies (demonstrated in the article by a million-element list taking 10ms) doesn’t come from memory allocation nor from copying the contents of the list itself (in this case, a million pointers).
Rather it comes from the need to chase all of those pointers, accessing a million disparate memory locations, in order to increment each element’s reference count.
Yeah. A more nuanced approach is that you should copy things that don't consist of lots and lots of references to other things, and you should mutate things that are mostly references to other structures.
Which means, eventually, designing your data structures so you generally have two types of structures: one which isn't full of pointers, and one which mostly is.
(2) surprised me a little. Not because of the performance consequences, but because I almost never see explicit calls to `copy()` in Python (and I read a lot of Python).
I think maybe a more realistic example there would be people using splatting without realizing/internalizing that it performs a full copy, e.g.
xs = [1, *ys]
Another one that stood out was (3). Slots are great, but >95% of the time I'd expect people would want to use `slots=True` with dataclasses instead of manually writing `__slots__` and a constructor like that. `slots=True` has worked since Python 3.10, so every non-EOL version of Python supports it.
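i.e. something like this minimal sketch:

    from dataclasses import dataclass

    @dataclass(slots=True)  # Python 3.10+: generates __slots__ for you
    class Point:
        x: float
        y: float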
I didn't find 2 surprising either, but I'm a little surprised you never see it. If you want to treat the args to a function as immutable, what can you do besides copy, modify, and return a new object?
> what can you do besides copy, modify, and return a new object?
You can directly produce a modified copy, rather than using a mutating operation to implement the modifications.
It should be noted that "return a modified copy" algorithms can be much more efficient than "mutate the existing data" ones. For example, consider the case of removing multiple elements from a list, specified by a predicate. The version of this code that treats the input as immutable, producing a modified copy, can perform a single pass:
def without(source, predicate):
    return [e for e in source if not predicate(e)]
whereas mutating code can easily end up with quadratic runtime — and also be difficult to get right:
def remove_which(source, predicate):
    i = 0
    while i < len(source):
        if predicate(source[i]):
            # Each deletion requires O(n) elements to shift position.
            del source[i]
        else:
            # The index increment must be conditional,
            # since removing an element shifts the next one
            # and that shifted element must also be considered.
            i += 1
swap with last element then truncate at the end
Yes, you can do this if you don't care about order, and avoid the performance degradation. But it's even more complex.
Or if you do care about order, you can emulate the C++ "erase-remove" idiom, by keeping track of separate "read" and "write" positions in the source, iterating until "read" reaches the end, and only incrementing "write" for elements that are kept; and then doing a single `del` of a slice at the end. But this, too, is complex to write, and very much the sort of thing one chooses Python in order to avoid. And you do all that work, in essence, just to emulate what the list comprehension does but in-place.
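For the curious, a sketch of that in-place compaction (naming is mine):

    def remove_which_stable(source, predicate):
        # Python version of C++'s erase-remove: O(n), order-preserving, in place.
        write = 0
        for read in range(len(source)):
            if not predicate(source[read]):
                source[write] = source[read]
                write += 1
        del source[write:]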
I think copying and modifying is normal! I just almost never see it with `copy()`. Apologies if I didn't say that clearly.
You can use __slots__ for normal classes; it’s not limited to dataclasses.
I know that; that's why I said "I'd expect" not "you can't."
A previous submission of this (https://news.ycombinator.com/item?id=45937910) didn't take off; I'm posting this to cite my comment there (https://news.ycombinator.com/item?id=45940360) rather than copying and pasting it.
Some helpful guidelines, but it's 2025 and people still use time.time and no stats with their benchmarks :(
In general I feel like these kinds of benchmarks might change with each Python version, so some caveats might apply.
Perhaps you could suggest what should be used instead of time.time
https://switowski.com/blog/how-to-benchmark-python-code/ has a decent overview of some benchmarking libraries
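In the spirit of that link, a minimal sketch of doing it with the standard library alone (the repeat count is arbitrary):

    import statistics
    import time

    def bench(fn, repeats=30):
        # perf_counter is monotonic and high-resolution, unlike time.time.
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        return statistics.median(samples), statistics.stdev(samples)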
There's some genuinely interesting tips in here, but #10 is for sure just padding so they could call the article "10 Hacks" haha. Everything else is at least somewhat Python specific, but "Hack 10: Avoid repeated function calls in loops" is just applicable to anything.
Yeah, 10 felt like it was written by ai.
A lot of it felt that way, to me.
I agree. I can't imagine anybody would call a function that returns the same result in a loop like that. There are plenty more optimizations they could come up with; in fact, there are a couple of books on this, e.g. https://www.oreilly.com/library/view/high-performance-python... (didn't want to link the one on Amazon).
Maybe also knowing when not to use Python, or finding a solution in Python that uses C/Rust/etc. underneath.
People are making fun of this statement here / are being sarcastic, but it's a totally legit suggestion. If you know in advance that you are going to make something where performance matters, strongly consider using something other than one of the slowest languages of them all.
It's kinda funny how uv is written in Rust and many Python libraries where performance is expected to matter (NumPy, Pandas, PyTorch, re, etc.) are implemented in C. Even if you call into fast code from Python you still have to contend with the GIL which I find very limiting for anything resembling performance.
Python's strong native story has always been one of its biggest draws: people find it ironic that so much of the Python ecosystem is native code, but it plays to Python's strength (native code where performance matters, Python for developer joy/ergonomics/velocity).
> Even if you call into fast code from Python you still have to contend with the GIL which I find very limiting for anything resembling performance.
It depends. A lot of native extension code can run without the GIL; the normal trick is to "detach" from the GIL for critical sections and only reconnect to it once Python needs to see your work. PyO3 has a nice collection of APIs for holding/releasing the GIL and for detaching from it entirely[1].
[1]: https://docs.rs/pyo3/0.27.1/pyo3/marker/struct.Python.html#m...
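A stdlib example of the same pattern, for anyone who wants to see the effect without writing an extension (buffer sizes are my choice): CPython's hashlib releases the GIL while hashing large buffers, so these threads genuinely run in parallel.

    import hashlib
    from concurrent.futures import ThreadPoolExecutor

    blobs = [bytes(32 * 1024 * 1024) for _ in range(4)]

    def digest(b):
        # The C hashing code drops the GIL for large inputs, so the
        # four calls below can overlap on four cores despite threading.
        return hashlib.sha256(b).hexdigest()

    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(digest, blobs)))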
I didn't know about detaching from the GIL... I'll look into that.
> native code where performance matters, Python for developer joy/ergonomics/velocity
Makes sense, but I guess I just feel like you can eat your cake and have it too by using another language. Maybe in the past there was a serious argument to be made about the productivity benefits of Python, but I feel like that is becoming less and less the case. People may slow down (a lot) writing Rust for the first time, but I think that writing JavaScript or Groovy or something should be just as simple, but more performant, do multi-threading out of the box, and generally not require you to use other languages to implement performance critical sections as much. The primary advantage that Python has in my eyes is: there are a lot of libraries. The reason why there are a lot of libraries written in Python? I think it's because Python is the number 1 language taught to people that aren't specifically pursuing computer science / engineering or something in a closely related field.
Yes, I think Python is excellent evidence that developer ecosystems (libraries, etc.) are paramount. Developer ergonomics are important, but I think one of the most interesting lessons from the last decade is that popular languages/ecosystems will converge onto desirable ergonomics.
Python is the ultimate (for now) glue language. I'd much rather write a Python script to glue together a CLI utility & a C library with a remote database than try to do that all in C or Rust or BASH.
Yeah, it's great for stuff like that, but I find myself using Node more in that area.
In my analysis, the lion's share of uv's performance improvement over pip is not due to being written in Rust. Pip just has horrible internal architecture that can't be readily fixed because of all the legacy cruft.
And for numerical stuff it's absolutely possible to completely trash performance by naively assuming that C/Rust/Fortran etc. will magically improve everything. I saw an example in a talk once where it superficially seemed obvious that the Rust code would implement a much more efficient (IIRC) binary search (at any rate, some sub-linear algorithm on an array), but making the data available to Rust, as a native Rust data structure, required O(N) serialization work.
> Pip just has horrible internal architecture that can't be readily fixed because of all the legacy cruft.
Interesting... I didn't know that. So they should be able to get similar results in Python then?
> absolutely possible to completely trash performance by naively assuming
Yeah, of course we'd need a specific benchmark to compare results. It totally depends on the problem that you're trying to solve.
> So they should be able to get similar results in Python then?
I'm making PAPER (https://github.com/zahlman/paper) which is intended to prove as much, while also filling some under-served niches (and ignoring or at least postponing some legacy features to stay small and simple). Although I procrastinated on it for a while and have recently been distracted with factoring out a dependency... I don't want to give too much detail until I have a reasonable Show HN ready.
But yeah, a big deal with uv is the caching it does. It can look up wheels by name and find already-unpacked data, which it hard-links into the target environment. Pip unpacks from the wheel each time (which also entails copying the data rather than doing fast filesystem operations), and its cache is an HTTP cache, which just intercepts the attempt to contact PyPI (or whatever other index is specified).
Python offers access to hard links (on systems that support them) in the standard library. All the filesystem-related stuff is already implemented in C under the hood, and a lot of the remaining slowness of I/O is due to unavoidable system calls.
Another big deal is that when uv is asked to precompile .pyc files for the installation, it uses multiple cores. The standard library also has support for this (and, of course, all of the creation of .pyc files in CPython is done at the C level); it's somewhat naive, but can still get most of the benefit. Plus, for the most part the precompiled files are also eligible for caching, and last time I checked even uv didn't do that. (I would not be at all surprised to hear that it does now!)
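Both pieces really are in the stdlib; a sketch (the paths are hypothetical):

    import compileall
    import os

    # Hard-link a cached, already-unpacked file into a target environment.
    os.link("cache/pkg/module.py", "venv/lib/site-packages/module.py")

    # Byte-compile a tree using all cores (workers=0 means os.cpu_count()).
    compileall.compile_dir("venv/lib/site-packages", workers=0, quiet=1)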
> It totally depends on the problem that you're trying to solve.
My point was more that even when you have a reasonable problem, you have to be careful about how you interface to the compiled code. It's better to avoid "crossing the boundary" any more than absolutely necessary, which often means designing an API explicitly around batch requests. And even then your users will mess it up. See: explicit iteration over Numpy/Pandas data in a Python loop, iterative `putpixel` with PIL, any number of bad ways to use OpenGL bindings....
> explicit iteration over Numpy/Pandas data in a Python loop
Yeah, I get it. I see the same thing pretty often... The loop itself is slow in Python so you have APIs that do batch processing all in C. Eventually I think to myself, "All this glue code is really slowing down my C." haha
maybe you can skip C and just use assembly
Or fab your own chip.
Or wire wrap your own transistors.
What about using PyPy? You'll probably see a significant improvement in these benchmarks. You should also give it a shot in Node which I expect to be about on par with PyPy, but without the GIL.
If anyone wants to be surprised by optimization, a great way to do it is to look at all the cases where, even though Python is slower than C, the Python interpreter written in Python is faster than the Python interpreter written in C.
Also, if we're going to suggest 'write it in another language' approaches, rewrite it in Golang. I detest writing in Golang but once you get the hang of things you can get to the point where your code only takes twice the time to write and 2% of the time (and memory) to run.
FWIW: the `x in foo` test calls the `foo.__contains__` magic method.
For a list, the only way to implement it is by iterating through it (see the `list_contains` function in the CPython code).
But for the special `range` object, it can implement the `__contains__` efficiently by looking at the start/stop/step (see the `range_contains` source code).
Although Hack 1 is for demonstration purposes, in most cases you can just do `999999 in range(1000000)`.
In my test, the same `999999 in foo` is 59.1ns for the range object, 27.7ns for the set, 6.7ms for the list. The set is the fastest, except converting the range object to the set takes 21ms.
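The measurement is easy to reproduce with timeit (numbers will vary by machine):

    import timeit

    setup = "r = range(1000000); s = set(r); l = list(r)"
    for expr in ("999999 in r", "999999 in s", "999999 in l"):
        t = timeit.timeit(expr, setup=setup, number=100)
        print(f"{expr}: {t / 100:.2e} s per check")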
Hacks 1, 2, and 5 are just standard computer-science knowledge that I would expect any CS2 student to have, and are pretty universal across most languages.
Some of these are pretty nice python tricks though.
I remember learning not to use dot access in performance critical loops
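The usual trick is binding the attribute to a local before the loop; a minimal sketch:

    import math

    def norms_fast(points):
        out = []
        sqrt = math.sqrt     # hoist the module-attribute lookup
        append = out.append  # hoist the method lookup
        for x, y in points:
            append(sqrt(x * x + y * y))
        return out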
a smart hack for performance is don't use python
While obvious, a huge performance bump can be had by using cachetools on functions. Cachetools is much more feature-rich than lru_cache, with support for TTL.
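For example (cache size and TTL are arbitrary):

    from cachetools import TTLCache, cached

    @cached(cache=TTLCache(maxsize=1024, ttl=300))  # entries expire after 300s
    def lookup_user(user_id):
        ...  # pretend this hits a slow database or HTTP API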