Oct 10, 2017

Notes on Kafka in Python

Summary

I recently investigated the state of Python libraries for Kafka. This blogpost contains my findings.

Both PyKafka and confluent-kafka have mature implementations and are maintained by invested companies. Confluent-kafka is generally faster while PyKafka is arguably better designed and documented for Python usability.

Conda packages are now available for both. I hope to extend one or both to support asynchronous workloads with Tornado.

Disclaimer: I am not an expert in this space. I have no strong affiliation with any of these projects. This is a report based on my experience of the past few weeks. I don’t encourage anyone to draw conclusions from this work. I encourage people to investigate on their own.

Introduction

Apache Kafka is a common data system for streaming architectures. It manages rolling buffers of byte messages and provides a scalable mechanism to publish or subscribe to those buffers in real time. While Kafka was originally designed within the JVM space, the fact that it only manages bytes makes it easy to access from native code systems like C/C++ and Python.
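To make "rolling buffers of byte messages" concrete, here is a toy in-memory sketch. It is only an illustration of the idea and says nothing about how Kafka is actually implemented; the RollingLog class and its method names are made up for this example.

```python
from collections import deque


class RollingLog:
    """Toy stand-in for a single log: a bounded, append-only buffer of bytes.

    Hypothetical class for illustration only; not part of any Kafka client.
    """

    def __init__(self, max_messages):
        # A bounded deque drops the oldest message once it is full.
        self._buffer = deque(maxlen=max_messages)
        # Offset that the next published message will receive.
        self._next_offset = 0

    def publish(self, message):
        """Append one message and return its offset."""
        if not isinstance(message, bytes):
            raise TypeError("the log stores bytes only; serialize before publishing")
        self._buffer.append(message)
        offset = self._next_offset
        self._next_offset += 1
        return offset

    def read_from(self, offset):
        """Return (offset, message) pairs still retained at or after `offset`."""
        first_retained = self._next_offset - len(self._buffer)
        start = max(offset, first_retained)
        return [
            (i, self._buffer[i - first_retained])
            for i in range(start, self._next_offset)
        ]


log = RollingLog(max_messages=3)
for text in ["a", "b", "c", "d"]:
    log.publish(text.encode("utf-8"))  # callers turn their data into bytes

# The first message has rolled off; a reader asking for offset 0 starts at 1.
retained = log.read_from(0)
```

Because the log only ever sees bytes, any language that can produce and parse bytes can take part, which is the point made above about C/C++ and Python.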

Python Options

Today there are three independent Kafka implementations in Python, two of which are optionally backed by a C implementation, librdkafka, for speed:

  • kafka-python: The first on the scene, a Pure Python Kafka client with robust documentation and an API that is fairly faithful to the original Java API. This implementation has the most stars on GitHub and the most active development team (by number of committers), but it also lacks a connection to the fast C library. I’ll admit that I didn’t spend enough time on this project to judge it well because of this.
  • PyKafka: The second implementation chronologically. This library is maintained by Parse.ly, a web analytics company that heavily uses both streaming systems and Python. PyKafka’s API is more creative and designed to follow common Python idioms rather than the Java API. PyKafka has both a pure Python implementation and connections to the low-level librdkafka C library for increased performance.
  • Confluent-kafka: The final implementation chronologically. It is maintained by Confluent, the primary for-profit company that supports and maintains Kafka. This library is the fastest, but also the least accessible from a Python perspective. This implementation is written in CPython extensions, and the documentation is minimal. However, if you are coming from the Java API then this is entirely consistent with that experience, so that documentation probably suffices.

Performance

Confluent-kafka message-consumption bandwidths are around 50% higher and message-production bandwidths are around 3x higher than PyKafka, both of which are significantly higher than kafka-python. I’m taking these numbers from this blogpost, which gives benchmarks comparing the three libraries. The primary numeric results follow below:

Note: This blogpost was moving smallish 100 byte messages around. I would hope that Kafka would perform better (closer to network bandwidths) when messages are of a decent size.

Producer Throughput

client                      time_in_seconds  MBs/s  Msgs/s
confluent_kafka_producer    5.4              17     183000
pykafka_producer_rdkafka    16               6.1    64000
pykafka_producer            57               1.7    17000
python_kafka_producer       68               1.4    15000

Consumer Throughput

client                      time_in_seconds  MBs/s  Msgs/s
confluent_kafka_consumer    3.8              25     261000
pykafka_consumer_rdkafka    6.1              17     164000
pykafka_consumer            29               3.2    34000
python_kafka_consumer       26               3.6    38000

Note: I discovered this article on parsely/pykafka #559, which has good conversation about the three libraries.
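As a quick sanity check, the "around 50%" and "around 3x" figures can be recomputed from the tables. The sketch below copies the MBs/s and Msgs/s columns by hand and assumes that the comparison is against the librdkafka-backed PyKafka rows and that one MB means 10^6 bytes; both assumptions are mine, not the benchmark author's.

```python
# (MBs/s, Msgs/s) copied from the producer and consumer tables.
producer = {
    "confluent_kafka": (17, 183000),
    "pykafka_rdkafka": (6.1, 64000),
    "pykafka": (1.7, 17000),
    "kafka_python": (1.4, 15000),
}
consumer = {
    "confluent_kafka": (25, 261000),
    "pykafka_rdkafka": (17, 164000),
    "pykafka": (3.2, 34000),
    "kafka_python": (3.6, 38000),
}


def bandwidth_ratio(table, fast, slow):
    """Ratio of the MBs/s figures for two clients in the same table."""
    return table[fast][0] / table[slow][0]


def bytes_per_message(mb_per_s, msgs_per_s):
    """Implied average message size, assuming 1 MB = 10**6 bytes."""
    return mb_per_s * 1e6 / msgs_per_s


producer_ratio = bandwidth_ratio(producer, "confluent_kafka", "pykafka_rdkafka")
consumer_ratio = bandwidth_ratio(consumer, "confluent_kafka", "pykafka_rdkafka")

# Implied message size for every row of both tables.
sizes = [
    bytes_per_message(mb, msgs)
    for table in (producer, consumer)
    for mb, msgs in table.values()
]

print(f"producer: {producer_ratio:.1f}x, consumer: {consumer_ratio:.2f}x")
print(f"implied message size: {min(sizes):.0f} to {max(sizes):.0f} bytes")
```

Under those assumptions the producer ratio comes out a little under 3 and the consumer ratio a little under 1.5. Every row implies messages of roughly 100 bytes, which matches the note about message size.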

I profiled PyKafka in these cases and it doesn’t appear that these code paths have yet been optimized. I expect that modest effort could close that gap considerably. This difference seems to be more from lack of interest than any hard design constraint.

It’s not clear how critical these speeds are. According to the PyKafka maintainers at Parse.ly, they haven’t actually turned on the librdkafka optimizations in their internal pipelines, and are instead using the slow Pure Python implementation, which is apparently more than fast enough for common use. Getting messages out of Kafka just isn’t their bottleneck. It may be that these 250,000 messages/sec limits are not significant in most applications. I suspect that this matters more in bulk analysis workloads than in online applications.

Pythonic vs Java APIs

It took me a few tries to get confluent-kafka to work. It wasn’t clear what information I needed to pass to the constructor to connect to Kafka, and when I gave the wrong information I received no message that I had done anything incorrectly. Docstrings and documentation were both minimal. In contrast, PyKafka’s API and error messages quickly led me to correct behavior and I was up and running within a minute.

However, I persisted with confluent-kafka, found the right Java documentation, and eventually did get things up and running. Once this happened everything fell into place and I was able to easily build applications with Confluent-kafka that were both simple and fast.
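For a sense of the difference in style, here is a rough sketch of sending one message with each library. The broker address and topic name are placeholders, and confluent_config is a helper invented for this example. The library imports sit inside the functions so that the file loads without either package installed; actually calling the functions needs the package and a reachable broker.

```python
BOOTSTRAP = "localhost:9092"  # placeholder broker address (assumption)
TOPIC = "test"                # placeholder topic name (assumption)


def confluent_config(bootstrap=BOOTSTRAP, group_id=None):
    """Build the dict passed to confluent-kafka constructors.

    Hypothetical helper. The keys are librdkafka configuration property
    names, which mirror the Java client's settings.
    """
    config = {"bootstrap.servers": bootstrap}
    if group_id is not None:
        config["group.id"] = group_id  # consumers need a group id
    return config


def produce_with_confluent_kafka(payload):
    """Send one bytes payload using confluent-kafka."""
    from confluent_kafka import Producer  # requires confluent-kafka

    producer = Producer(confluent_config())
    producer.produce(TOPIC, value=payload)
    producer.flush()  # block until queued messages are delivered


def produce_with_pykafka(payload):
    """Send one bytes payload using PyKafka."""
    from pykafka import KafkaClient  # requires pykafka

    client = KafkaClient(hosts=BOOTSTRAP)
    # Topic names are looked up as bytes here; whether str also works
    # depends on the PyKafka version, so treat this as an assumption.
    topic = client.topics[TOPIC.encode("utf-8")]
    with topic.get_sync_producer() as producer:
        producer.produce(payload)


# Usage, with a broker running at BOOTSTRAP:
#   produce_with_confluent_kafka(b"hello")
#   produce_with_pykafka(b"hello")
```

The confluent-kafka constructor takes one dictionary of dotted property names, which is why the Java documentation ends up being the useful reference. PyKafka exposes topics and producers as Python objects instead.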

Development experience

I would like to add asynchronous support to one or both of these libraries so that they can read or write data in a non-blocking fashion and play nicely with other asynchronous systems like Tornado or Asyncio. I started investigating this with both libraries on GitHub.
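One generic way to get non-blocking behavior out of a blocking client is to push the blocking call onto a thread and await the result. The sketch below shows that pattern with asyncio. It is not how either library works today, nor necessarily the design I would end up with. FakeBlockingConsumer is a made-up stand-in, so the example runs without a broker.

```python
import asyncio
import time


class FakeBlockingConsumer:
    """Hypothetical stand-in for a client whose poll() blocks the caller."""

    def __init__(self, messages):
        self._messages = list(messages)

    def poll(self, timeout):
        time.sleep(0.01)  # pretend to wait on the network
        return self._messages.pop(0) if self._messages else None


async def consume(consumer, handle, timeout=0.1):
    """Feed messages to `handle` without blocking the event loop."""
    loop = asyncio.get_running_loop()
    while True:
        # Run the blocking poll in the default thread pool executor.
        message = await loop.run_in_executor(None, consumer.poll, timeout)
        if message is None:
            # Demo simplification: stop at the first empty poll. A real
            # consumer would usually keep polling after a timeout.
            break
        handle(message)


received = []
asyncio.run(consume(FakeBlockingConsumer([b"a", b"b"]), received.append))
```

While a poll is waiting in its thread, the event loop is free to run other coroutines. The cost is a thread hop per poll, which is part of why native asynchronous support inside the libraries would be preferable.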

Developers

Both libraries have a maintainer who is somewhat responsive and whose time is funded by the parent company. Both maintainers seem active on a day-to-day basis and handle contributions from external developers.

Both libraries are fully active with a common pattern of a single main dev merging work from a number of less active developers. Distributions of commits over the last six months look similar:

confluent-kafka-python$ git shortlog -ns --since "six months ago"
38 Magnus Edenhill
5 Christos Trochalakis
4 Ewen Cheslack-Postava
1 Simon Wahlgren

pykafka$ git shortlog -ns --since "six months ago"
52 Emmett Butler
23 Emmett J. Butler
20 Marc-Antoine Parent
18 Tanay Soni
5 messense
1 Erik Stephens
1 Jeff Widman
1 Prateek Shrivastava
1 aleatha
1 zpcui

Codebase

With regard to the codebases, I found that PyKafka was easier to hack on for a few reasons:

  1. Most of PyKafka is written in Python rather than C extensions, and so it is more accessible to a broader development base. I find that Python C extensions are not pleasant to work with, even if you are comfortable with C.
  2. PyKafka appears to be much more extensively tested. PyKafka actually spins up a local Kafka instance to do comprehensive integration tests while Confluent-kafka seems to only test the API without actually running against a real Kafka instance.
  3. For what it’s worth, PyKafka maintainers responded quickly to an issue on Tornado. Confluent-kafka maintainers still have not responded to a comment on an existing Tornado issue, even though that comment had significantly more content (a working prototype).

To be clear, no maintainer has any responsibility to answer my questions on GitHub. They are likely busy with other things that are of more relevance to their particular mandate.

Conda packages

I’ve pushed/updated recipes for both packages on conda-forge. You can install them as follows:

conda install -c conda-forge pykafka # Linux, Mac, Windows
conda install -c conda-forge python-confluent-kafka # Linux, Mac

In both cases these are built against the fast librdkafka C library (except on Windows) and install that library as well.

Future plans

I’ve recently started work on streaming systems and pipelines for Dask, so I’ll probably continue to investigate this space. I’m still torn between the two implementations. There are strong reasons to use either of them.

Culturally I am drawn to Parse.ly’s PyKafka library. They’re clearly Python developers writing for Python users. However, the costs of using a non-Pythonic system here just aren’t that large (Kafka’s API is small), and Confluent’s interests are more aligned with investing in Kafka long term than are Parse.ly’s.