Submit New Event

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Submit News Feature

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Contribute a Blog

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Sign up for Newsletter

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Aug 28, 2018

High level performance of Pandas, Dask, Spark, and Arrow

By

This work is supported by Anaconda Inc

Question

How does Dask dataframe performance compare to Pandas? Also, what aboutSpark dataframes and what about Arrow? How do they compare?

I get this question every few weeks. This post is to avoid repetition.

Caveats

  1. This answer is likely to change over time. I’m writing this in August 2018
  2. This question and answer are very high level.More technical answers are possible, but not contained here.

Answers

Pandas

If you’re coming from Python and have smallish datasets then Pandas is theright choice. It’s usable, widely understood, efficient, and well maintained.

Benefits of Parallelism

The performance benefit (or drawback) of using a parallel dataframe like Daskdataframes or Spark dataframes over Pandas will differ based on the kinds ofcomputations you do:

  1. If you’re doing small computations then Pandas is always the right choice.The administrative costs of parallelizing will outweigh any benefit.You should not parallelize if your computations are taking less than, say,100ms.
  2. For simple operations like filtering, cleaning, and aggregating large datayou should expect linear speedup by using a parallel dataframes.
  3. If you’re on a 20-core computer you might expect a 20x speedup. If you’reon a 1000-core cluster you might expect a 1000x speedup, assuming that youhave a problem big enough to spread across 1000 cores. As you scale upadministrative overhead will increase, so you should expect the speedup todecrease a bit.
  4. For complex operations like distributed joins it’s more complicated. Youmight get linear speedups like above, or you might even get slowdowns.Someone experienced in database-like computations and parallel computingcan probably predict pretty well which computations will do well.

However, configuration may be required. Often people find that parallelsolutions don’t meet expectations when they first try them out. Unfortunatelymost distributed systems require some configuration to perform optimally.

There are other options to speed up Pandas

Many people looking to speed up Pandas don’t need parallelism. There are oftenseveral other tricks like encoding text data, using efficient file formats,avoiding groupby.apply, and so on that are more effective at speeding up Pandasthan switching to parallelism.

Comparing Apache Spark and Dask

Assuming that yes, I do want parallelism, should I choose Apache Spark, or Dask dataframes?

This is often decided more by cultural preferences (JVM vs Python,all-in-one-tool vs integration with other tools) than performance differences,but I’ll try to outline a few things here:

  • Spark dataframes will be much better when you have large SQL-style queries(think 100+ line queries) where their query optimizer can kick in.
  • Dask dataframes will be much better when queries go beyond typical databasequeries. This happens most often in time series, random access, and othercomplex computations.
  • Spark will integrate better with JVM and data engineering technology.Spark will also come with everything pre-packaged. Spark is its ownecosystem.
  • Dask will integrate better with Python code. Dask is designed to integratewith other libraries and pre-existing systems. If you’re coming from anexisting Pandas-based workflow then it’s usually much easier to evolve toDask.

Generally speaking for most operations you’ll be fine using either one. Peopleoften choose between Pandas/Dask and Spark based on cultural preference.Either they have people that really like the Python ecosystem, or they havepeople that really like the Spark ecosystem.

Dataframes are also only a small part of each project. Spark and Dask both domany other things that aren’t dataframes. For example Spark has a graphanalysis library, Dask doesn’t. Dask supports multi-dimensional arrays, Sparkdoesn’t. Spark is generally higher level and all-in-one while Dask islower-level and focuses on integrating into other tools.

For more information, see Dask’s “Comparison to Spark documentation”.

Apache Arrow

What about Arrow? Is Arrow faster than Pandas?

This question doesn’t quite make sense… yet.

Arrow is not a replacement for Pandas. Today Arrow is useful to peoplebuilding systems and not to analysts directly like Pandas. Arrow is used tomove data between different computational systems and file formats. Arrow doesnot do computation today, but is commonly used as a component in otherlibraries that do do computation. For example, if you use Pandas or Spark orDask today you may be using Arrow without knowing it. Today Arrow is moreuseful for other libraries than it is to end-users.

However, this is likely to change in the future. Arrow developers planto write computational code around Arrow that we would expect to be faster thanthe code in either Pandas or Spark. This is probably a year or two awaythough. There will probably be some effort to make this semi-compatible withPandas, but it’s much too early to tell.