Aug 3, 2015

Caching

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project.

tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don't fit data science well. We propose a new caching policy.

Humans Repeat Stuff

Consider the dataset that you've worked on most recently. How many times have you loaded it from disk into memory? How many times have you repeated almost the same computations on that data?

Exploratory data science workloads involve repetition of very similar computations. These computations share structure. By caching frequently used results (or parts of results) we may be able to considerably speed up exploratory data analysis.

Caching in other contexts

The web community loves caching. Database-backed web applications almost always guard their database lookups with a system like memcached, which devotes some amount of memory to caching frequent and recent queries. Because humans visit mostly the same pages over and over again this can reduce database load by an order of magnitude. Even if humans visit different pages with different inputs these pages often share many elements.

Limited caching resources

Given infinite memory we would cache every result that we've ever computed. This would give us instant recall on anything that wasn't novel. Sadly, our memory resources are finite, and so we evict cached results that don't seem to be worth keeping around.

Traditionally we use a policy like Least Recently Used (LRU). This policy evicts results that have not been requested for a long time. This is cheap and works well for web and systems applications.
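
For reference, here is a textbook LRU sketch (not taken from any particular library; the class name and maxsize parameter are illustrative only):

from collections import OrderedDict

class LRUCache:
    # Evicts whichever key was used least recently, regardless of how
    # expensive the value was to compute or how large it is to store.
    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key):
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)   # drop the least recently used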

LRU doesn’t fit analytic workloads

Unfortunately LRU doesn't fit analytic workloads well. Analytic workloads have a large spread of computation times and of storage costs. While most web application database queries take roughly the same amount of time (100ms-1000ms) and take up roughly the same amount of space to store (1-10kb), the computation and storage costs of analytic computations can easily vary by many orders of magnitude (spreads in the millions or billions are common).

Consider the following two common computations of a large NumPy array:

  1. x.std() # costly to recompute, cheap to store
  2. x.T # cheap to recompute, costly to store

In the first case, x.std() might take a second on a large array (somewhat expensive) but takes only a few bytes to store. This result is so cheap to store that we're happy to keep it in our cache for a long time, even if it's infrequently requested.

In the second case, x.T is cheap to compute (just a metadata change in the array) and executes in microseconds. However, the result might take gigabytes of memory to store. We don't want to keep this in our cache, even if it's very frequently requested; it takes up all of the space for other potentially more useful (and smaller) results and we can recompute it trivially anyway.
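
A rough sketch of that spread, timing both operations on a synthetic array (the array size is arbitrary and exact numbers depend on your machine):

import numpy as np
from time import perf_counter

x = np.random.random((10000, 10000))     # ~800 MB of float64

start = perf_counter()
std = x.std()                            # crunches 10^8 numbers down to one
std_seconds = perf_counter() - start

start = perf_counter()
xt = x.T                                 # a view: only shape/strides change
transpose_seconds = perf_counter() - start

print(std_seconds, std.nbytes)           # a fraction of a second or more, 8 bytes
print(transpose_seconds, xt.nbytes)      # microseconds, but ~800,000,000 bytes to hold on to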

So we want to keep cached results that have the following properties:

  1. Costly to recompute (in seconds)
  2. Cheap to store (in bytes)
  3. Frequently used
  4. Recently used

Proposed Caching Policy

Here is an alternative to LRU that respects the objectives stated above.

Every time someone accesses an entry in our cache, we increment the score associated with that entry by the following value:

\[\textrm{score} = \textrm{score} + \frac{\textrm{compute time}}{\textrm{nbytes}} (1 + \epsilon)^{t}\]

Where compute time is the time it took to compute the result in the first place, nbytes is the number of bytes that it takes to store the result, epsilon is a small number that determines the halflife of what "recently" means, and t is an auto-incrementing tick time increased at every access.

This score has units of inverse bandwidth (s/byte), gives more importance to new results through a slowly growing exponential weight, and amplifies the score of frequently requested results in a roughly linear fashion.
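
A minimal sketch of this update rule (not cachey's actual internals; the function name, the eps default, and the dictionary-based bookkeeping are all illustrative):

def on_access(scores, key, compute_time, nbytes, tick, eps=0.001):
    # score += (compute time / nbytes) * (1 + eps) ** t
    scores[key] = scores.get(key, 0.0) + (compute_time / nbytes) * (1 + eps) ** tick
    return tick + 1                      # the tick auto-increments on every access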

We maintain these scores in a heap, keep track of the total number of bytes, and cull the cache as necessary to keep storage costs beneath a fixed budget. Updates cost O(log(k)) for k the number of elements in the cache.
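
A sketch of what culling might look like with Python's heapq (again illustrative, not cachey's code; a real implementation also has to cope with scores that change after insertion, for example by lazily skipping stale heap entries):

import heapq

def shrink(heap, data, nbytes, total_bytes, available_bytes):
    # heap holds (score, key) pairs; pop the lowest-scoring entries until
    # the cached data fits the byte budget.  Each pop costs O(log k).
    while total_bytes > available_bytes and heap:
        score, key = heapq.heappop(heap)
        if key in data:                  # ignore stale entries for already-evicted keys
            total_bytes -= nbytes[key]
            del data[key]
    return total_bytes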

Cachey

I wrote this up into a tiny library called cachey. This is experimental code and subject to wild API changes (including renaming).

The central object is a Cache that asks for the following:

  1. Number of available bytes to devote to the cache
  2. Halflife on importance (the number of accesses it takes to reduce the importance of a cached result by half) (default 1000)
  3. A lower limit on costs to consider entry to the cache (default 0)

Example

Here is a tiny example:

>>> from cachey import Cache
>>> c = Cache(available_bytes=1e9) # 1 GB

>>> c.put('Hello', 'world!', cost=3)

>>> c.get('Hello')
'world!'

More interesting example

The cache includes a memoize decorator. Let's memoize pd.read_csv.

In [1]: import pandas as pd
In [2]: from cachey import Cache

In [3]: c = Cache(1e9)
In [4]: read_csv = c.memoize(pd.read_csv)

In [5]: %time df = read_csv('accounts.csv')
CPU times: user 262 ms, sys: 27.7 ms, total: 290 ms
Wall time: 303 ms

In [6]: %time df = read_csv('accounts.csv') # second read is free
CPU times: user 77 µs, sys: 16 µs, total: 93 µs
Wall time: 93 µs

In [7]: c.total_bytes / c.available_bytes
Out[7]: 0.096

So we create a new function read_csv that operates exactly like pandas.read_csv except that it holds on to recent results in c, a cache. This particular CSV file created a dataframe that filled a tenth of our cache space. The more often we request this CSV file the more its score will grow and the more likely it is to remain in the cache into the future. If other memoized functions using this same cache produce more valuable results (costly to compute, cheap to store) and we run out of space then this result will be evicted and we'll have to recompute our result if we ask for read_csv('accounts.csv') again.

Memoize everything

Just memoizing read_csv isn't very interesting. The pd.read_csv function operates at a constant data bandwidth of around 100 MB/s. The caching policies around cachey really shine when they get to see all of our computations. For example, it could be that we don't want to hold on to the results of read_csv because they take up a lot of space. If we find ourselves doing the same groupby computations then we might prefer to use our gigabyte of caching space to store those instead (as in the sketch below), both because

  1. groupby computations take a long time to compute
  2. groupby results are often very compact in memory
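
For instance, a hypothetical sketch of sharing one cache across both steps; the account_totals helper and the column names are made up for illustration, and only Cache and c.memoize come from cachey:

import pandas as pd
from cachey import Cache

c = Cache(available_bytes=1e9)

read_csv = c.memoize(pd.read_csv)             # bulky result, relatively cheap to recompute

def account_totals(filename):
    df = read_csv(filename)
    return df.groupby('name').amount.sum()    # slow to compute, small to store

account_totals = c.memoize(account_totals)    # both results compete for the same 1 GB budget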

Cachey and Dask

I’m slowly working on integrating cachey into dask’s shared memory scheduler.

Dask is in a good position to apply cachey to many computations. It can look at every task in a task graph and consider the result for inclusion into the cache. We don't need to explicitly memoize every function we want to use; dask can do this for us on the fly.

Additionally, dask has a nice view of our computation as a collection of sub-tasks. Similar computations (like mean and variance) often share sub-tasks.

Future Work

Cachey is new and untested but potentially useful now, particularly through the memoize method shown above. It's a small and simple codebase. I would love to hear if people find this kind of caching policy valuable.

I plan to think about the following in the near future:

  1. How do we build a hierarchy of caches to share between memory and disk?
  2. How do we cleanly integrate cachey into dask (see PR #502) and how do we make dask amenable to caching (see PR #510)?