
Kafka development

Patrick Godwin requested to merge kafka_development into master

This merge request adds a few classes for Kafka-based data transfer, for use in the streaming infrastructure, along with some extras that came along for the ride.

In particular,

  • Added KafkaReporter, which creates Kafka-based consumers and producers to move data in and out of Kafka topics (queues). This requires a Kafka broker running in the background, either by launching your own Kafka server or by connecting to an existing one (which I expect will be the case at LHO/LLO once the new low-latency data-transfer infrastructure is further along). A usage sketch is included after this list.
  • To help with launching your own Kafka server, I have provided a KafkaLauncher, currently housed in utils.py, though I'm not married to that location if you have a better suggestion. It is a class instantiated with two keyword arguments, kafka_config_path and zookeeper_config_path, which give the paths to the Kafka and Zookeeper configuration files, respectively. It has two methods, KafkaLauncher.start() and KafkaLauncher.stop(), which do what you'd expect: they use subprocess to call the bash scripts shipped with the Kafka binary installation. This is done purely for simplicity, and lets us launch Kafka at the beginning of the streaming pipeline if necessary.
  • I have also provided a Makefile within /etc to install all the necessary dependencies so far: sklearn, cython, and Kafka (binary install, C and python bindings). Instructions at the top of the Makefile explain how to create an environment script which you can source so that all the packages are installed relative to the location of the Makefile. Installation is two steps:
    1. make env.sh -f Makefile
    2. make -f Makefile
    The second step will prompt for a username/password for the iDQ repo, and then off we go.
  • I have modified the gitlab-ci.yml script to install all the dependencies using this Makefile, and changed the base image from stretch to el7 to match the OS used at LHO/LLO. There are some extra steps in here that I don't normally need to get the installation working with the Makefile, since the CI runner can't git clone without user credentials, but it works fine.
  • Added configuration files for Zookeeper and Kafka within /etc with parameters that I've found to work to launch Kafka as a background process.
  • Added a small snippet on the use of KafkaReporter and its two modes of operation (either launching Kafka yourself or connecting to another server) in the streaming pipeline. It's not comprehensive by any means, but will be fleshed out as more of the streaming pipeline comes together.
  • Added a NoDataError that is raised when no data is returned from a call to KafkaReporter.retrieve(). Instead of returning None, or something else that could be unclear, this makes the absence of data explicit. Anyone using the reporter to grab data is then expected to handle that error gracefully, e.g. with a try/except block where the two cases (data/no data) are handled differently.
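
To make the intended usage concrete, here is a minimal sketch of polling a KafkaReporter. Only KafkaReporter.retrieve() and NoDataError are described in this merge request; the import path, constructor arguments, and poll interval below are illustrative assumptions and may not match the actual signatures.

```python
# Minimal sketch: poll a KafkaReporter and handle the no-data case.
# Import path and constructor arguments are assumptions for illustration;
# retrieve() and NoDataError are the pieces described in this merge request.
import time

from idq.utils import KafkaReporter, NoDataError  # assumed import location

reporter = KafkaReporter(hostname='localhost', port=9092, topic='idq')  # assumed signature

while True:
    try:
        data = reporter.retrieve()  # raises NoDataError if nothing is queued
    except NoDataError:
        time.sleep(1)               # no-data case: back off and poll again
        continue
    print(data)                     # data case: hand off to the rest of the pipeline
```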

There are aspects of KafkaLauncher that could be improved, and it is not robust by any means. I've found that it doesn't handle the case where stop() is not called before exiting: the Java process is left running in the wild. This should be reasonably well-contained within the streaming pipeline, but anything else calling it would have to be careful not to leave rogue processes behind (see the sketch below). Error handling could also be better, e.g. for cases where start() fails. If we connect to a Kafka instance that's already available on the cluster, then we don't need to use this at all, so in some sense it's a workaround for the time being.
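
To guard against leaving an orphaned Java process behind, the launcher can be wrapped in a try/finally so that stop() runs even if the pipeline falls over. The keyword arguments and methods are the ones described above; the import location and config file paths are placeholders.

```python
# Sketch: launch Kafka/Zookeeper for the duration of a run and always tear it down.
# kafka_config_path, zookeeper_config_path, start(), and stop() are the interface
# described above; the import location and file paths are placeholders.
from idq.utils import KafkaLauncher

launcher = KafkaLauncher(
    kafka_config_path='etc/kafka_server.properties',    # placeholder path
    zookeeper_config_path='etc/zookeeper.properties',   # placeholder path
)

launcher.start()
try:
    pass  # run the streaming pipeline against the freshly launched broker
finally:
    launcher.stop()  # ensure the background Java processes are shut down on exit
```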

With all this in place, we should have everything we need to start the streaming pipeline and use KafkaReporters for data transfer between processes. I've tested both the KafkaLauncher and KafkaReporter, and they should be ready for use.

