I recently investigated the state of Python libraries for Kafka. This blogpost contains my findings.
Both PyKafka and confluent-kafka have mature implementations and are maintained by invested companies. Confluent-kafka is generally faster while PyKafka is arguably better designed and documented for Python usability.
Conda packages are now available for both. I hope to extend one or both to support asynchronous workloads with Tornado.
Disclaimer: I am not an expert in this space. I have no strong affiliation with any of these projects. This is a report based on my experience of the past few weeks. I don’t encourage anyone to draw conclusions from this work. I encourage people to investigate on their own.
Apache Kafka is a common data system for streaming architectures. It manages rolling buffers of byte messages and provides a scalable mechanism to publish or subscribe to those buffers in real time. While Kafka was originally designed within the JVM space, the fact that it only manages bytes makes it easy to access from native code systems like C/C++ and Python.
Today there are three independent Kafka implementations in Python, two of which are optionally backed by a C implementation, librdkafka, for speed:

- kafka-python
- PyKafka
- confluent-kafka-python
Confluent-kafka message-consumption bandwidths are around 50% higher and message-production bandwidths are around 3x higher than PyKafka’s, both of which are significantly higher than kafka-python’s. I’m taking these numbers from this blogpost, which gives benchmarks comparing the three libraries. The primary numeric results follow below:
Note: this blogpost was moving smallish 100-byte messages around. I would hope that Kafka would perform better (closer to network bandwidths) when messages are of a decent size.
| producer                 | time_in_seconds | MBs/s | Msgs/s |
|--------------------------|-----------------|-------|--------|
| confluent_kafka_producer | 5.4             | 17    | 183000 |
| pykafka_producer_rdkafka | 16              | 6.1   | 64000  |
| pykafka_producer         | 57              | 1.7   | 17000  |
| python_kafka_producer    | 68              | 1.4   | 15000  |
| consumer                 | time_in_seconds | MBs/s | Msgs/s |
|--------------------------|-----------------|-------|--------|
| confluent_kafka_consumer | 3.8             | 25    | 261000 |
| pykafka_consumer_rdkafka | 6.1             | 17    | 164000 |
| pykafka_consumer         | 29              | 3.2   | 34000  |
| python_kafka_consumer    | 26              | 3.6   | 38000  |
Note: I discovered this article on parsely/pykafka #559, which has a good conversation about the three libraries.
I profiled PyKafka in these cases and it doesn’t appear that these code paths have yet been optimized. I expect that modest effort could close that gap considerably. This difference seems to be due more to a lack of interest than to any hard design constraint.
It’s not clear how critical these speeds are. According to the PyKafka maintainers at Parse.ly they haven’t actually turned on the librdkafka optimizations in their internal pipelines, and are instead using the slower pure-Python implementation, which is apparently more than fast enough for common use. Getting messages out of Kafka just isn’t their bottleneck. It may be that these 250,000 messages/sec limits are not significant in most applications. I suspect that this matters more in bulk analysis workloads than in online applications.
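For reference, here is a rough sketch of what opting in to those librdkafka-backed code paths looks like in PyKafka. The broker address and topic name are placeholders, and the `use_rdkafka=True` flags only work when librdkafka is available:

```python
# A minimal sketch, assuming a broker at localhost:9092 and a topic named
# "my-topic" (both placeholders); use_rdkafka=True requires librdkafka.
from pykafka import KafkaClient

client = KafkaClient(hosts="localhost:9092")
topic = client.topics[b"my-topic"]

# Producer: pure Python by default, C-backed when use_rdkafka=True
producer = topic.get_producer(use_rdkafka=True)
producer.produce(b"hello world")
producer.stop()  # flush pending messages and shut down

# Consumer: same opt-in flag
consumer = topic.get_simple_consumer(use_rdkafka=True)
for message in consumer:
    if message is not None:
        print(message.offset, message.value)
```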
It took me a few tries to get confluent-kafka to work. It wasn’t clear what information I needed to pass to the constructor to connect to Kafka, and when I gave the wrong information I received no message that I had done anything incorrectly. Docstrings and documentation were both minimal. In contrast, PyKafka’s API and error messages quickly led me to correct behavior, and I was up and running within a minute.
However, I persisted with confluent-kafka, found the right Java documentation, and eventually did get things up and running. Once this happened everything fell into place and I was able to easily build applications with Confluent-kafka that were both simple and fast.
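For comparison, here is roughly what the equivalent looks like with confluent-kafka. The key piece of information the constructor needs is the `bootstrap.servers` address; the broker, topic, and group id below are placeholders:

```python
# A minimal sketch of producing and consuming with confluent-kafka, assuming a
# broker at localhost:9092 and a topic named "my-topic" (both placeholders).
from confluent_kafka import Producer, Consumer

producer = Producer({'bootstrap.servers': 'localhost:9092'})
producer.produce('my-topic', b'hello world')
producer.flush()  # block until outstanding messages are delivered

consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                     'group.id': 'example-group'})
consumer.subscribe(['my-topic'])
msg = consumer.poll(1.0)  # returns None if nothing arrives within the timeout
if msg is not None and not msg.error():
    print(msg.value())
consumer.close()
```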
I would like to add asynchronous support to one or both of these libraries so that they can read or write data in a non-blocking fashion and play nicely with other asynchronous systems like Tornado or Asyncio. I started investigating this with both libraries on GitHub.
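As a naive sketch of what that might look like (this is not something either library provides today), one could push the blocking poll call onto a thread pool so that an asyncio event loop stays responsive. The broker address, group id, and topic name here are placeholders:

```python
# A rough sketch of consuming from Kafka without blocking an asyncio event
# loop: run confluent-kafka's blocking poll() in a thread-pool executor.
import asyncio
from confluent_kafka import Consumer

async def consume(topic):
    consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                         'group.id': 'async-sketch'})
    consumer.subscribe([topic])
    loop = asyncio.get_event_loop()
    try:
        while True:
            # poll() blocks for up to one second, so hand it to a worker thread
            msg = await loop.run_in_executor(None, consumer.poll, 1.0)
            if msg is None or msg.error():
                continue
            print(msg.value())
    finally:
        consumer.close()

# asyncio.get_event_loop().run_until_complete(consume('my-topic'))
```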
Both libraries have a maintainer who is somewhat responsive and whose time is funded by the parent company. Both maintainers seem active on a day-to-day basis and handle contributions from external developers.
Both libraries are fully active, with a common pattern of a single main dev merging work from a number of less active developers. Distributions of commits over the last six months look similar:
confluent-kafka-python$ git shortlog -ns --since "six months ago"
38 Magnus Edenhill
5 Christos Trochalakis
4 Ewen Cheslack-Postava
1 Simon Wahlgren
pykafka$ git shortlog -ns --since "six months ago"
52 Emmett Butler
23 Emmett J. Butler
20 Marc-Antoine Parent
18 Tanay Soni
5 messense
1 Erik Stephens
1 Jeff Widman
1 Prateek Shrivastava
1 aleatha
1 zpcui
Regarding the codebases, I found that PyKafka was easier to hack on, for a few reasons:
To be clear, no maintainer has any responsibility to answer my questions on GitHub. They are likely busy with other things that are of more relevance to their particular mandate.
I’ve pushed/updated recipes for both packages on conda-forge. You can install them as follows:
conda install -c conda-forge pykafka # Linux, Mac, Windows
conda install -c conda-forge python-confluent-kafka # Linux, Mac
In both cases these are built against the fast librdkafka C library (except on Windows) and install that library as well.
I’ve recently started work on streaming systems and pipelines for Dask, so I’ll probably continue to investigate this space. I’m still torn between the two implementations. There are strong reasons to use either of them.
Culturally I am drawn to Parse.ly’s PyKafka library. They’re clearly Python developers writing for Python users. However, the costs of using a non-Pythonic system here just aren’t that large (Kafka’s API is small), and Confluent’s interests are more aligned with investing in Kafka long term than are Parse.ly’s.