Gaussian Processes for Machine Learning (2006) [pdf]

This is the definitive reference on the topic! I have some notes on the topic as well, if you want something concise, but that doesn't ignore the math [1].

[1] https://blog.quipu-strands.com/bayesopt_1_key_ideas_GPs#gaus...

These are very cool, thanks. Do you know what kind of jobs are more likely to require Gaussian process expertise? I have experience in using GPs for surrogate modeling and will be on the job market soon.

Also a resource I enjoyed is the book by Bobby Gramacy [0] which, among other things, spends a good bit on local GP approximation [1] (and has fun exercises).

[0] https://bobby.gramacy.com/surrogates/surrogates.pdf

[1] https://arxiv.org/abs/1303.0383

Aside from Secondmind [1] I don't know of any companies (only because I haven't looked)... But if I had to look for places with a strong research culture on GPs (I don't know if you're) I would find relevant papers on arXiv and Google Scholar, and see if any of them come from industry labs. If I had to take a guess on Bayesian tools at work, maybe the industries to look at would be advertising and healthcare. I would also look out for places that hire econometricians.

Also thank you for the book recommendation!

[1] https://www.secondmind.ai/

Your tutorials show a real talent for visualization. I never grokked SVMs before I came across your Medium page at https://medium.com/cube-dev/support-vector-machines-tutorial... . Thanks!

Thank you for your kind comment!

My take is that the Rasmussen book isn't especially approachable, and that this book has actually held back the wider adoption of GPs in the world.

The book has been seen as the authoritative source on the topic, so people were hesitant to write anything else. At the same time, the book borders on impenetrable.

Good to see GPs still being discussed in 2025!

Here was my attempt at a 'second' introduction a few years ago: https://maximerobeyns.com/second_intro_gps

Why would you learn Gaussian Processes today? Is there any application where they are still leading and have not been superseded by Deep NNets?

I would argue there are more applications overall where Gaussian processes are superior, as most scientific applications have smaller data sets. Not everything has enough data to take advantage of feature learning in NNs. GPs are generally reliable, interpretable, and provide excellent uncertainty estimates for free. They can be made multiscale, achieving higher precision as a function approximator than most other methods. Plus, they can exhibit reversion to the prior when you need that.
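
As a rough illustration of the uncertainty estimates and the reversion to the prior mentioned above, here is a minimal NumPy sketch of exact GP regression. The squared-exponential kernel, the hyperparameter values, the toy data, and the function names `rbf_kernel` and `gp_posterior` are all assumptions made for the example, not anything taken from the book.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at x_test (illustrative)."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = rbf_kernel(x_test, x_test).diagonal() - np.sum(v**2, axis=0)
    return mean, var

# Toy data (assumed): three noisy-ish observations of sin(x).
x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
x_test = np.array([0.0, 10.0])  # one point near the data, one far away
mean, var = gp_posterior(x_train, y_train, x_test)
```

Near the training points the posterior variance is small. At x = 10, far from any data, the mean falls back to the prior mean (0) and the variance to the prior variance (1).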

Another example where it is used is for emulating outputs of an agent-based model for sensitivity analyses.

Basically they're incredibly useful for any situation where you have "medium" data: not enough to properly train a NN (which are very data hungry in practice), but enough that a more traditional approach wouldn't really exploit all the information.

GPs essentially allow you to get a lot of the power of a NN while also being able to encode a bunch of domain knowledge you have (which is necessary when you don't have enough data for the model to effectively learn that domain knowledge). On top of that, you get variance estimates which are very important for things like forecasting.
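
One way domain knowledge gets encoded is by composing kernels. The sketch below assumes a hypothetical monthly series with a linear trend plus a 12-month seasonal pattern, and builds the covariance as the sum of a linear kernel and a periodic kernel. The function names, the period, and the hyperparameter values are illustrative assumptions only.

```python
import numpy as np

def periodic_kernel(a, b, period=1.0, lengthscale=1.0):
    """Periodic (exp-sine-squared) kernel: exactly repeats every `period`."""
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)

def linear_kernel(a, b, slope_variance=0.1):
    """Linear kernel: prior over straight lines through the origin."""
    return slope_variance * a[:, None] * b[None, :]

def trend_plus_seasonal(a, b):
    """Sum of two valid kernels is a valid kernel (trend + 12-month cycle)."""
    return linear_kernel(a, b) + periodic_kernel(a, b, period=12.0)

months = np.arange(36, dtype=float)  # three years of monthly inputs (assumed)
K = trend_plus_seasonal(months, months)

# Sanity check: a covariance matrix should be positive semi-definite,
# so its smallest eigenvalue should be non-negative up to round-off.
min_eig = np.linalg.eigvalsh(K).min()
```

The resulting matrix `K` can be dropped into the usual GP regression equations in place of a generic kernel.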

The only real drawback to GPs is that they absolutely do not fit into the "fit/predict" paradigm. Properly building a scalable GP takes a deeper understanding of the model than most models require. The mathematical foundations required to really understand what's happening when you train a sparse GP greatly exceed what is required to understand a NN, and on top of that there is a fair amount of practical insight into kernel development that is required as well. But the payoff is fantastic.

It's worth recognizing that, once you realize that "attention" is really just kernel smoothing, transformers are essentially learning sophisticated stacked kernels, so ultimately share a lot in common with GPs.
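
To make the "attention is kernel smoothing" point concrete, here is a small sketch with made-up shapes and random inputs. It checks that single-head softmax attention gives the same output as a Nadaraya-Watson kernel smoother using the kernel exp(q·k/sqrt(d)). It does not cover multiple heads, masking, or learned projections, and the function names are hypothetical.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention (single head, no mask)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def kernel_smoother(Q, K, V):
    """Nadaraya-Watson estimate with kernel k(q, key) = exp(q . key / sqrt(d))."""
    d = Q.shape[-1]
    kern = np.exp(Q @ K.T / np.sqrt(d))
    return (kern @ V) / kern.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 queries
K = rng.normal(size=(6, 8))  # 6 keys
V = rng.normal(size=(6, 3))  # 6 values

same = np.allclose(softmax_attention(Q, K, V), kernel_smoother(Q, K, V))
```

The two functions differ only in the max-subtraction trick, which cancels in the normalization.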

AFAIK the state of the art is still a mix of new DNNs and old-school techniques. Things like parameter efficiency, data efficiency, runtime performance, and understandability would factor into the decision-making process.

Bayesian optimization of, say, hyperparameters is the canonical modern usage in my view, and there are other similar optimization problems where it's the preferred approach.
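
For a flavour of how Bayesian optimization uses the GP, here is a sketch of the expected improvement acquisition function for minimization, one common choice among several. It assumes you already have a posterior mean and standard deviation at each candidate point (for example from a GP). The numbers are invented for illustration, and `expected_improvement` is a hypothetical helper, not a library function.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # element-wise erf without needing SciPy

def expected_improvement(mean, std, best):
    """Expected improvement over `best` when minimizing (assumes std > 0)."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))           # standard normal CDF
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return (best - mean) * cdf + std * pdf

# Three hypothetical candidates: same mean with low and high uncertainty,
# and one with a worse predicted mean.
ei = expected_improvement(mean=[0.0, 0.0, 1.0], std=[1.0, 2.0, 1.0], best=0.0)
next_index = int(np.argmax(ei))  # candidate to evaluate next
```

With equal predicted means, the more uncertain candidate gets the higher score, which is how the variance estimate drives exploration.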

To reduce the risk of being a lemming. It is in everyone's interests for some people not to follow the herd / join the plague of locusts.

You can combine deep NNets with GPs, e.g. here: https://arxiv.org/abs/1511.02222

So it isn't a matter of which is better. If you ever need to imbue your deep nets with good confidence estimates, it is definitely worth checking out.

Stationary GPs are just stochastic linear dynamical systems. (Not just those with the Matérn covariance kernel.)
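
The simplest case of this correspondence can be checked numerically. The sketch below covers only the exponential (Matérn-1/2, Ornstein-Uhlenbeck) kernel on an evenly spaced grid, where the GP is exactly a first-order linear recursion driven by white noise. It says nothing about how other stationary kernels would be handled. All parameter values are assumptions.

```python
import numpy as np

lengthscale, variance, dt, n = 2.0, 1.5, 0.5, 6

a = np.exp(-dt / lengthscale)  # state transition coefficient over one step
q = variance * (1.0 - a**2)    # process noise variance that keeps it stationary

# State-space form: x_0 = sqrt(variance) * e_0,
#                   x_k = a * x_{k-1} + sqrt(q) * e_k,  e_k ~ N(0, 1).
# Writing x = M e with e ~ N(0, I) gives Cov(x) = M M^T.
M = np.zeros((n, n))
M[0, 0] = np.sqrt(variance)
for k in range(1, n):
    M[k] = a * M[k - 1]       # carry over the dependence on earlier noise terms
    M[k, k] = np.sqrt(q)      # add the fresh noise term for step k
cov_state_space = M @ M.T

# GP form: exponential kernel evaluated on the same time grid.
t = dt * np.arange(n)
cov_kernel = variance * np.exp(-np.abs(t[:, None] - t[None, :]) / lengthscale)

match = np.allclose(cov_state_space, cov_kernel)
```

Both constructions give the same covariance matrix, so on this grid the recursion and the GP describe the same distribution.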

For the visually inclined: https://distill.pub/2019/visual-exploration-gaussian-process...

On the HN front page for 16 hours (though with strangely little discussion) just two days ago:

A Visual Exploration of Gaussian Processes (2019) - https://news.ycombinator.com/item?id=44919831 - Aug 2025 (1 comment)