Python Data Science Handbook

These types of books are always interesting to me because they tackle so many different things. They cover a range of topics at a high level (data manipulation, visualization, machine learning) and each could have its own book. They balance teaching programming while introducing concepts (and sometimes theory).

In short, I think it's hard to strike an appropriate balance between these, but this seems to be a good intro-level book.

3 hours ago ellisv

This book was absolute fire for getting started with data science in 2017-2018. Jake is a great teacher.

2 hours ago trio8453

This is one of the few books that I read cover-to-cover when I was starting out learning Data Science in 2020/21. Would recommend.

2 hours ago __rito__

Interesting choice of Pandas in this day and age. Maybe he’s after imparting general concepts that you could apply to any tabular data manipulator rather than selecting for the latest shiny tool.

3 hours ago sschnei8

Why? It's the industry standard as far as my reach goes.

What other framework would you replace it with?

No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.

3 hours ago dahcryn

You can assert whatever you want, but Polars is a great answer. The performance improvements are secondary to me compared to the dramatic improvement in interface.

Today all serious DS work will ultimately become data engineering work anyway. The time when DS can just fiddle around in notebooks all day has passed.

2 hours ago crystal_revenge

> No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.

Can you expand on why Polars isn't optimised for a holistic approach to data science?

2 hours ago porker

I have not worked with Polars, but I would imagine any incompatibility with existing libraries (e.g. plotting libraries like plotnine, bokeh) would quickly put me off.

It is a curse, I know. I would also choose a better interface. Performance is meh to me; I use SQL if I want to do something at scale that involves row/column data.

an hour ago fifilura

This is a non-issue thanks to the Polars DataFrame to_pandas() method. You get all the performance of Polars for cleaning large datasets, and to_pandas() gives you backwards compatibility with other libraries. That said, plotnine is completely compatible with Polars dataframe objects.

an hour ago rbartelme

You can always convert from Polars to Pandas. Plotnine will do it automatically for you, even.

an hour ago maleldil

What can you do more easily in pandas than in Polars?

an hour ago minimaxir

The book is actually quite old; I'm not sure "this day and age" applies to it.

2 hours ago maxnoe

It was originally published in 2016, and I think this is still the first edition.

3 hours ago msto

What's wrong with Pandas?

3 hours ago xenophonf

I probably wouldn’t rewrite an entire data science stack that used pandas, but most people would use polars if starting a new project today.

3 hours ago clickety_clack

R and Matlab workflows have been fairly stable for the past decade. Why is the Python ecosystem so... unstable? It puts me off investing any time in it.

2 hours ago biofox

The R ecosystem has had a similar evolution with the tidyverse; it just happened a little longer ago. As for Matlab, I initially learned statistical programming with it a long time ago, but I’m not sure I’ve ever seen it in the wild. I don’t know what’s going on there.

I’m actually quite partial to R myself, and I used to use it extensively back when quick analysis was more valuable to my career. Things have probably progressed, but I dropped it in favor of python because python can integrate into production systems whereas R was (and maybe still is) geared towards writing reports. One of the best things to happen recently in data science is the plotnine library, bringing the grammar of graphics to python imho.

The fact is that today, if you want career opportunities as a data scientist, you need to be fluent in python.

an hour ago clickety_clack

Outside Bioconductor or the tidyverse, R can be just as unstable due to CRAN's package requirements.

an hour ago rbartelme

Pandas turns 10x developers with a lust for life into 0.1x developers with grey hairs.

an hour ago amelius

I wouldn't say it's a handbook because it's more like an introduction. But it's pretty well written.

3 hours ago wiz21c

It was written 8 years ago, though; there is a 2nd edition of the book by the same author.

3 hours ago synergy20

The linked GitHub repo seems to have the 2nd edition in the form of notebooks: https://github.com/jakevdp/PythonDataScienceHandbook/blob/ma... Under the Using Code Examples section, it says "attribution usually includes the title, author, publisher, and ISBN. For example: 'Python Data Science Handbook, 2nd edition, by Jake VanderPlas (O’Reilly). Copyright 2023...'", compared to the OP's link, which has "The Python Data Science Handbook by Jake VanderPlas (O’Reilly). Copyright 2016..."

an hour ago phone_book

He's a great writer and I miss his blog. He had an awesome post on pivot tables that I think is now a part of this book.

4 hours ago BenGosub

He is also the creator of the Altair visualization library (Vega-Lite in Python https://altair-viz.github.io/). I really like using it.

2 hours ago ayhanfuat

Thanks for the fact. I use Altair sometimes and really admire its simplicity; I didn't know it was written by Jake.

28 minutes ago linhns

very cool!