tl;dr We implement a dictionary that spills to disk when we run out ofmemory. We connect this to our scheduler.
This is the fifth in a sequence of posts constructing an out-of-core nd-arrayusing NumPy, Blaze, and dask. You can view these posts here:
We now present chest a dict type that spills to disk when we run out ofmemory. We show how it prevents large computations from flooding memory.
If you read thepost on schedulingyou may recall our goal to minimize intermediate storage during a multi-workercomputation. The image on the right shows a trace of our scheduler as ittraverses a task dependency graph. We want to compute the entire graph quicklywhile keeping only a small amount of data in memory at once.
Sometimes we fail and our scheduler stores many large intermediate results. Inthese cases we want to spill excess intermediate data to disk rather thanflooding local memory.
Chest is a dict-like object that writes data to disk once it runs out ofmemory.
>>> from chest import Chest
>>> d = Chest(available_memory=1e9) # Only use a GigaByte
It satisfies the MutableMapping interface so it looks and feels exactly likea dict. Below we show an example using a chest with only enough data tostore one Python integer in memory.
>>> d = Chest(available_memory=24) # enough space for one Python integer
>>> d['one'] = 1
>>> d['two'] = 2
>>> d['three'] = 3
['one', 'two', 'three']
We keep some data in memory
While the rest lives on disk
By default we store data with pickle but chest supports any protocolwith the dump/load interface (pickle, json, cbor, joblib, ….)
A quick point about pickle. Frequent readers of my blog may know of mysadness at how this libraryserializes functionsand the crippling effect on multiprocessing.That sadness does not extend to normal data. Pickle is fine for data if youuse the protocol= keyword to pickle.dump correctly . Pickle isn’t a goodcross-language solution, but that doesn’t matter in our application oftemporarily storing numpy arrays on disk.
In using chest alongside dask with any reasonable success I had to make thefollowing improvements to the original implementation:
Now we can execute more dask workflows without risk of flooding memory
A = ...
B = ...
expr = A.T.dot(B) - B.mean(axis=0)
cache = Chest(available_memory=1e9)
into('myfile.hdf5::/result', expr, cache=cache)
Now we incur only moderate slowdown when we schedule poorly and run into largequantities of intermediate data.
Chest is only useful when we fail to schedule well. We can still improvescheduling algorithms to avoid keeping data in memory but it’s nice to havechest as a backup for when these algorithms fail. Resilience is reassuring.