These types of books are always interesting to me because they tackle so many different things. They cover a range of topics at a high level (data manipulation, visualization, machine learning), and each could have its own book. They balance teaching programming with introducing concepts (and sometimes theory).
In short, I think it's hard to strike an appropriate balance between these, but this seems to be a good intro-level book.
This book was absolute fire for getting started with data science in 2017-2018, Jake is a great teacher.
This is one of the few books that I read cover-to-cover when I was starting out learning Data Science in 2020/21. Will recommend.
Interesting choice of Pandas in this day and age. Maybe he’s after imparting general concepts that you could apply to any tabular data manipulator rather than selecting for the latest shiny tool.
Why? It's the industry standard as far as my reach goes.
What other framework would you replace it with?
No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.
You can assert whatever you want, but Polars is a great answer. The performance improvements are secondary to me compared to the dramatic improvement in interface.
Today all serious DS work will ultimately become data engineering work anyway. The time when DS can just fiddle around in notebooks all day has passed.
> No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.
Can you expand on why Polars isn't optimised for a holistic approach to data science?
I have not worked with Polars, but I would imagine any incompatibility with existing libraries (e.g. plotting libraries like plotnine, bokeh) would quickly put me off.
It is a curse, I know. I would also choose a better interface. Performance is meh to me; I use SQL if I want to do something at scale that involves row/column data.
This is a non-issue thanks to the to_pandas() method on Polars dataframes. You get all the performance of Polars for cleaning large datasets, and to_pandas() gives you backwards compatibility with other libraries. In any case, plotnine is completely compatible with Polars dataframe objects.
You can always convert from Polars to Pandas. Plotnine will do it automatically for you, even.
What can you do more easily in pandas than in polars?
The book is quite old actually, not sure if "this day and age" still applies to it.
It was originally published in 2016, and I think this is still the first edition.
What's wrong with Pandas?
I probably wouldn’t rewrite an entire data science stack that used pandas, but most people would use polars if starting a new project today.
R and Matlab workflows have been fairly stable for the past decade. Why is the Python ecosystem so... unstable? It puts me off investing any time in it.
The R ecosystem has had a similar evolution with the tidyverse; it was just a little longer ago. As for Matlab, I initially learned statistical programming with it a long time ago, but I’m not sure I’ve ever seen it in the wild. I don’t know what’s going on there.
I’m actually quite partial to R myself, and I used to use it extensively back when quick analysis was more valuable to my career. Things have probably progressed, but I dropped it in favor of python because python can integrate into production systems whereas R was (and maybe still is) geared towards writing reports. One of the best things to happen recently in data science is the plotnine library, bringing the grammar of graphics to python imho.
The fact is that today, if you want career opportunities as a data scientist, you need to be fluent in python.
Outside Bioconductor or the tidyverse, R can be just as unstable due to CRAN's package requirements.
Pandas turns 10x developers with a lust for life into 0.1x developers with grey hairs.
I wouldn't say it's a handbook because it's more like an introduction. But it's pretty well written.
It was written 8 years ago, though; there is a 2nd edition of the book by the same author.
The linked GitHub repo seems to have the 2nd edition in the form of notebooks, https://github.com/jakevdp/PythonDataScienceHandbook/blob/ma..., where the "Using Code Examples" section says: "attribution usually includes the title, author, publisher, and ISBN. For example: 'Python Data Science Handbook, 2nd edition, by Jake VanderPlas (O’Reilly). Copyright 2023...'" The OP's link, by comparison, has "The Python Data Science Handbook by Jake VanderPlas (O’Reilly). Copyright 2016..."
He's a great writer and I miss his blog. He had an awesome post on pivot tables that I think is now a part of this book.
He is also the creator of the Altair visualization library (Vega-Lite in Python https://altair-viz.github.io/). I really like using it.
Thanks for the fact. I've used Altair sometimes and really admire its simplicity; I didn't know it was written by Jake.
I loved his Statistics for Hackers talk: https://speakerdeck.com/pycon2016/jake-vanderplas-statistics...
very cool!