Submit New Event

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Submit News Feature

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Sign up for Newsletter

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Parallel Python

Fast and Easy

Easy Parallel Python that does what you need

What you can do with Dask

Big Pandas

Dask DataFrames use pandas under the hood, so your current code likely just works. It’s faster than Spark and easier too.

Parquet logoApache Arrow logoPandas logo
import dask.dataframe as dd

df = dd.read_parquet("s3://data/uber/")

# How much did NYC pay Uber?
df.base_passenger_fare.sum().compute()

# And how much did drivers make?
df.driver_pay.sum().compute()

Parallel For Loops

Parallelize your Python code, no matter how complex. Dask is flexible and supports arbitrary dependencies and fine-grained task scheduling.

from dask.distributed import Client

client = Client()

# Define your own code
def f(x):
    return x + 1

# Run your code in parallel
futures = client.map(f, range(100))
results = client.gather(futures)

Big Arrays

Use Dask and NumPy/Xarray to churn through terabytes of multi-dimensional array data in formats like HDF, NetCDF, TIFF, or Zarr.  

NumPy logoXarray logoZarr logo
import xarray as xr

# Open image/array files natively
ds = xr.open_mfdataset("data/*.nc")

# Process across dimensions
ds.mean(dims=["lat", "lon"]).compute()  

Production ETL

Run regular jobs on a schedule.  Know that they’ll finish.  Dask backs modern workflow managers like Prefect and Dagster.

dagster logoAirflow logo
from prefect import flow, task
from prefect_dask import DaskTaskRunner

@task()
def process(file):
    ...

@flow(task_runner=DaskTaskRunner(
    address="http://my-dask-cluster"
))
def workflow():
    files = ...
    results = process.map(files)

Machine Learning

Use Dask with common machine learning libraries to train or predict on large datasets, increasing model accuracy by using all of your data.

xGBoost logoPyTorch_logooptuna logoscikit learn logo
import xgboost as xgb
import dask.dataframe as dd

df = dd.read_parquet("s3://my-data/")
dtrain = xgb.dask.DaskDMatrix(df)

model = xgb.dask.train(
    dtrain,
    {"tree_method": "hist", ...},
    ...
)

Performance at Scale

Fast on Machines

Dask is lightweight, and runs your raw code on your machines without getting in the way.  No virtualization or compilers.

As the Python stack matures your code matures. Today Dask is 50% faster than Spark on standard benchmarks.


        
import pandas as pd df = pd.read_parquet("s3://mybucket/myfile.parquet/") df = df[df.value >= 0] df.groupby("account")["value"].sum() import dask.dataframe as dd df = dd.read_parquet("s3://mybucket/myfile.*.parquet/") df = df[df.value >= 0] df.groupby("account")["value"].sum().compute()

Made for Humans

Computers are cheap.  Humans are expensive.

Fortunately, humans already know how to use Dask.

It’s just Python. It’s just pandas. It’s just NumPy.

Dask’s dashboard guides you towards efficiency, quickly teaching you to become a distributed computing expert.

Cheap and Efficient

Fast humans + Fast machines = Cheap Computing

Rows of Data Computed
1000000000000
Cost
$
0.00

Dask users often process cloud data at $0.10 per TiB

Where you can run Dask

Open Source Deployment

Run Dask on your laptop (it’s trivial) or deploy it on any resource manager like Kubernetes, an HPC job schedulers, cloud SaaS services, or even legacy Hadoop/Spark clusters.

Kubernetes logoYarn logoSlurm logoLSF logoOpenOBS logo
# Start on your laptop
from dask.distributed import LocalCluster
cluster = LocalCluster(
    processes=False,
)       
client = cluster.get_client()

# Use dask to solve your problem
import dask.dataframe as dd
df = dd.read_parquet("/path/to/data.parquet")
df.value.mean().compute()

Where you can run Dask

Managed Cloud

Run Dask in the cloud with open source Kubernetes, or with an easy SaaS solution.  Coiled is free for individuals with modest use and easy for anyone with a cloud account.

Amazon Web Services logoGoogle Cloud logoMicrosoft Azure Logo
# Migrate to distributed systems
import coiled
cluster = coiled.Cluster(
    n_workers=100, region="us-east-2",
)       
client = cluster.get_client()

# Use dask to solve your problem
import dask.dataframe as dd
df = dd.read_parquet("s3://mybucket/data.*.parquet")
df.value.mean().compute()

What users say about Dask

People use Dask and like it! You won’t be alone!
It’s easy
It’s massive
It solved my problem
“Dask has been a trailblazer in making distributed and out-of-memory computing in Python easy and accessible for everyone.”
Wes McKinney
Wes McKinney, Pandas
“At Capital One, early implementations of Dask have reduced model training times by 91% within a few months of development effort.”
Ryan McEntee
Ryan McEntee, Capital One
“My climate science research has been made possible by Dask. Dask integrates seamlessly with Xarray, making it easy to run large-scale computations on multi-dimensional datasets. I can focus on my research instead of thinking about parallelism.”
Paige Martin
Paige Martin, Pangeo
“Dask shines when dealing with generic data structures which don’t conform to table-like structures. PySpark has RDDs, but who wants to code in Python and debug verbose Java logs?”
Ajith Aravind, Simeio
“Dask has transformed how the world interacts with weather, climate, and geospatial data by making it super easy to scale up data processing pipelines on HPC or cloud. Things that seemed impossible five years ago are now routine thanks to Dask.”
Ryan Abernathy
Ryan Abernathy, Earthmover
“With Dask, I can easily adapt code that runs on a single machine and scale it across an entire cluster. Very few other tools let you get going that quickly—across any language.”
Jacqueline Nolis
Jacqueline Nolis, Fanatics Inc.
“Dask also makes it easy to deploy distributed work locally using multiple Python processes in a way that is nearly identical to how full production load is distributed.”
Hugues Demers
Hugues Demers, Grubhub
“To further accelerate our users’ ability to scale easily on the cloud, we expanded this by setting up pre-configured Horovod and Dask clusters.”
Meenakshi Sharma
Meenakshi Sharma, Wayfair
“I used to work in Spark all the time. The sooner you move to Dask, the sooner you’ll be grateful you did.”
John Renken
John Renken, Rebuy

Organizations That Use Dask

Capital One
DE Shaw & Co
Grubhub
Microsoft
NASA
NVIDIA
Shell
Two Sigma
US Air Force
US government
Walmart
arm

Use Cases

Anything you can do with Python, you can scale with Dask

Super easy to get started

You can run Dask right now on your computer. It’s simple and lightweight.

        
$ conda install dask $ pip install "dask[complete]"

Most people use Dask just on their laptops to scale out to 100 GiB datasets. Dask works great on a single machine.