Jan 14, 2015

Towards Out-of-core ND-Arrays -- Benchmark MatMul


This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We benchmark dask on an out-of-core dot product. We also compare and motivate the use of an optimized BLAS.

Here are the results

Performance (GFLOPS)     In-Memory   On-Disk
Reference BLAS                   6        18
OpenBLAS one thread             11        23
OpenBLAS four threads           22        11

Disclaimer: This post is on experimental buggy code. This is not ready for public use.

Introduction

This is the fourth in a sequence of posts constructing an out-of-core nd-array using NumPy, Blaze, and dask. You can view these posts here:

  1. Simple task scheduling
  2. Frontend usability
  3. A multi-threaded scheduler

We now give performance numbers on out-of-core matrix-matrix multiplication.

Matrix-Matrix Multiplication

Dense matrix-matrix multiplication is compute-bound, not I/O bound. We spend most of our time doing arithmetic and relatively little time shuffling data around. As a result we may be able to read large data from disk without performance loss.

When multiplying two $n\times n$ matrices we read $n^2$ bytes but perform $n^3$ computations. There are $n$ computations to do per byte so, relatively speaking, I/O is cheap.
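To make that ratio concrete for the on-disk experiment below (a 200000 by 4000 matrix times a 4000 by 4000 matrix of 8-byte floats), here is a back-of-the-envelope calculation. The numbers are illustrative bookkeeping (ignoring the result we write back), not a measurement.

>>> flops = 200000 * 4000 * 4000                      # one multiply per (i, j, l) triple
>>> bytes_read = (200000 * 4000 + 4000 * 4000) * 8.   # both inputs, 8 bytes per element
>>> flops / bytes_read                                # flops per byte read
490.19607843137254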

We normally measure speed for single CPUs in Giga Floating Point Operations Per Second (GFLOPS). Let’s look at how my laptop does on single-threaded in-memory matrix-matrix multiplication using NumPy.

>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x) # Matrix-matrix multiplication
10 loops, best of 3: 176 ms per loop

>>> 1000 ** 3 / 0.176 / 1e9 # n cubed computations / seconds / billion
5.681818181818182

OK, so NumPy’s matrix-matrix multiplication clocks in at 6 GFLOPS, more or less. The np.dot function ties into the GEMM operation in the BLAS library on my machine. Currently my numpy just uses reference BLAS (you can check this with np.show_config()).
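If you want to check this on your own machine, np.show_config() prints the build configuration. The output below is only a sketch of what a reference-BLAS build might report; yours will differ.

>>> import numpy as np
>>> np.show_config()
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
...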

Matrix-Matrix Multiply From Disk

For matrices too large to fit in memory we compute the solution one part at a time, loading blocks from disk when necessary. We parallelize this with multiple threads. Our last post demonstrates how NumPy+Blaze+Dask automates this for us.
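To make the blocked approach concrete, here is a minimal single-threaded sketch of the same idea in plain NumPy + h5py (a hypothetical blocked_dot helper, not the code that dask actually generates): accumulate each output block from products of input blocks, reading and writing blocks from disk as we go.

import h5py
import numpy as np

def blocked_dot(f, out_name='result', bs=1000):
    """Naive out-of-core C = A.dot(B) over an HDF5 file, one (bs, bs)
    output block at a time.  Single-threaded illustration only --
    dask builds and schedules these block-level tasks in parallel."""
    A, B = f['A'], f['B']
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = f.create_dataset(out_name, shape=(n, m), dtype='f8')
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            acc = np.zeros((min(bs, n - i), min(bs, m - j)))
            for l in range(0, k, bs):
                # read one block of A and one of B, multiply, accumulate
                acc += A[i:i+bs, l:l+bs].dot(B[l:l+bs, j:j+bs])
            C[i:i+bs, j:j+bs] = acc  # write the finished block back to disk
    return C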

We perform a simple numerical experiment, using HDF5 as our on-disk store.

We install stuff

conda install -c blaze blaze
pip install git+https://github.com/ContinuumIO/dask

We set up a fake dataset on disk

>>> import h5py
>>> f = h5py.File('myfile.hdf5')
>>> A = f.create_dataset(name='A', shape=(200000, 4000), dtype='f8',
... chunks=(250, 250), fillvalue=1.0)
>>> B = f.create_dataset(name='B', shape=(4000, 4000), dtype='f8',
... chunks=(250, 250), fillvalue=1.0)

We tell Dask+Blaze how to interact with that dataset

>>> from dask.obj import Array
>>> from blaze import Data, into

>>> a = into(Array, 'myfile.hdf5::/A', blockshape=(1000, 1000)) # dask things
>>> b = into(Array, 'myfile.hdf5::/B', blockshape=(1000, 1000))
>>> A = Data(a) # Blaze things
>>> B = Data(b)

We compute our desired result, storing back onto disk

>>> %time into('myfile.hdf5::/result', A.dot(B))
2min 49s

>>> 200000 * 4000 * 4000 / 169 / 1e9
18.934911242

18.9 GFLOPS, roughly 3 times faster than the in-memory solution. At first glance this is confusing: shouldn’t we be slower coming from disk? Our speedup is due to our use of four cores in parallel. This is good; we don’t experience much slowdown coming from disk.

It’s as if all of our hard drive just became memory.

OpenBLAS

Reference BLAS is slow; it was written long ago. OpenBLAS is a modern implementation. I installed OpenBLAS with my system installer (apt-get) and then reconfigured and rebuilt numpy. OpenBLAS supports many cores. We’ll show timings with one and with four threads.

export OPENBLAS_NUM_THREADS=1
ipython

>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x)
10 loops, best of 3: 89.8 ms per loop

>>> 1000 ** 3 / 0.0898 / 1e9 # compute GFLOPS
11.135857461024498

export OPENBLAS_NUM_THREADS=4
ipython

>>> %timeit x.dot(x)
10 loops, best of 3: 46.3 ms per loop

>>> 1000 ** 3 / 0.0463 / 1e9 # compute GFLOPS
21.598272138228943

This is about four times faster than reference. If you’re not already parallelizing in some other way (like with dask) then you should use a modern BLAS like OpenBLAS or MKL.

OpenBLAS + dask

Finally we run our on-disk experiment again, now with OpenBLAS. We do this both with OpenBLAS running with one thread and with many threads.

We’ll skip the code (it’s identical to what’s above) and give a comprehensive table of results below.

Sadly the out-of-core solution doesn’t improve much by using OpenBLAS. Actually, when both OpenBLAS and dask try to parallelize we lose performance.

Results

Performance (GFLOPS)     In-Memory   On-Disk
Reference BLAS                   6        18
OpenBLAS one thread             11        23
OpenBLAS four threads           22        11

tl;dr When doing compute-intensive work, don’t worry about using disk; just don’t use two mechanisms of parallelism at the same time.
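In practice that means choosing one level of parallelism. If you let dask’s threaded scheduler do the parallelizing, one way (among others, equivalent to the export shown earlier) is to pin OpenBLAS to a single thread before NumPy is first imported:

import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # must be set before numpy/OpenBLAS loads

import numpy as np  # BLAS now runs single-threaded; dask supplies the parallelism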

Main Take-Aways

  1. We don’t lose much by operating from disk in compute-intensive tasks.
  2. Actually we can improve performance when an optimized BLAS isn’t available.
  3. Dask doesn’t benefit much from an optimized BLAS. This is sad and surprising. I expected performance to scale with single-core in-memory performance. Perhaps this is indicative of some other limiting factor.
  4. One shouldn’t extrapolate too far with these numbers. They’re only relevant for highly compute-bound operations like matrix-matrix multiply.

Also, thanks to Wesley Emeneker for finding where we were leaking memory, making results like these possible.