This work is supported by Anaconda Inc
How does Dask dataframe performance compare to Pandas? Also, what aboutSpark dataframes and what about Arrow? How do they compare?
I get this question every few weeks. This post is to avoid repetition.
If you’re coming from Python and have smallish datasets then Pandas is theright choice. It’s usable, widely understood, efficient, and well maintained.
The performance benefit (or drawback) of using a parallel dataframe like Daskdataframes or Spark dataframes over Pandas will differ based on the kinds ofcomputations you do:
However, configuration may be required. Often people find that parallelsolutions don’t meet expectations when they first try them out. Unfortunatelymost distributed systems require some configuration to perform optimally.
Many people looking to speed up Pandas don’t need parallelism. There are oftenseveral other tricks like encoding text data, using efficient file formats,avoiding groupby.apply, and so on that are more effective at speeding up Pandasthan switching to parallelism.
Assuming that yes, I do want parallelism, should I choose Apache Spark, or Dask dataframes?
This is often decided more by cultural preferences (JVM vs Python,all-in-one-tool vs integration with other tools) than performance differences,but I’ll try to outline a few things here:
Generally speaking for most operations you’ll be fine using either one. Peopleoften choose between Pandas/Dask and Spark based on cultural preference.Either they have people that really like the Python ecosystem, or they havepeople that really like the Spark ecosystem.
Dataframes are also only a small part of each project. Spark and Dask both domany other things that aren’t dataframes. For example Spark has a graphanalysis library, Dask doesn’t. Dask supports multi-dimensional arrays, Sparkdoesn’t. Spark is generally higher level and all-in-one while Dask islower-level and focuses on integrating into other tools.
For more information, see Dask’s “Comparison to Spark documentation”.
What about Arrow? Is Arrow faster than Pandas?
This question doesn’t quite make sense… yet.
Arrow is not a replacement for Pandas. Today Arrow is useful to peoplebuilding systems and not to analysts directly like Pandas. Arrow is used tomove data between different computational systems and file formats. Arrow doesnot do computation today, but is commonly used as a component in otherlibraries that do do computation. For example, if you use Pandas or Spark orDask today you may be using Arrow without knowing it. Today Arrow is moreuseful for other libraries than it is to end-users.
However, this is likely to change in the future. Arrow developers planto write computational code around Arrow that we would expect to be faster thanthe code in either Pandas or Spark. This is probably a year or two awaythough. There will probably be some effort to make this semi-compatible withPandas, but it’s much too early to tell.