How to Improve Operational Work with Operators & GitOps

Posted: October 18th 2021

In a traditional DevOps world, teams relied on configuration management tools such as Ansible, Puppet, and Chef to automate operational work for datastores such as Kafka and Elasticsearch. As we move toward Agile methodology and rapid feature releases, configuration management tools can't keep up with the SDLC (Software Development Life Cycle). Problems that traditional datastore management runs into include:

  1. Configuration drift is not reliably detected.
  2. A cloud-agnostic solution is difficult to achieve with IaC (Terraform).
  3. Poor resource utilization (Virtual machines vs Containers).
  4. High Mean-Time-To-Recovery (MTTR).

Using Operators and adopting the GitOps methodology helps Paytm Labs' SREs overcome the above challenges and improve the reliability of our datastore clusters. In this article, we will be discussing GitOps and Kubernetes Operators using the example of Kafka on Kubernetes.

GitOps

GitOps[1] is a paradigm that makes Git the single source of truth and helps you deploy software or infrastructure changes in a declarative manner. Your deployment framework should be capable of monitoring the drift between Git and the live state. Weaveworks pioneered this strategy. They defined 4 core principles for GitOps:

  1. Describe the system declaratively
  2. The desired system state versioned in Git
  3. Approved changes that can be automatically applied to the system
  4. Software agents that can detect and alert on divergence
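Principle 4 can be illustrated with a small, tool-independent sketch. It compares a desired state (as it might be read from Git) with a live state and reports every divergence. The function name `diff_state` and the sample dictionaries are hypothetical; real agents such as ArgoCD compare full Kubernetes objects, not toy dictionaries.

```python
from typing import Any, Dict, List


def diff_state(desired: Dict[str, Any], live: Dict[str, Any], path: str = "") -> List[str]:
    """Return human-readable differences between desired (Git) and live state."""
    drift: List[str] = []
    for key in sorted(set(desired) | set(live)):
        here = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(f"missing in live: {here}")
        elif key not in desired:
            drift.append(f"not in Git: {here}")
        elif isinstance(desired[key], dict) and isinstance(live[key], dict):
            # Recurse into nested settings so the report names the exact field.
            drift.extend(diff_state(desired[key], live[key], here))
        elif desired[key] != live[key]:
            drift.append(f"changed: {here} (Git={desired[key]!r}, live={live[key]!r})")
    return drift


# Hypothetical sample data: someone scaled the cluster by hand and set a flag.
desired = {"replicas": 3, "config": {"min.insync.replicas": 2}}
live = {"replicas": 4, "config": {"min.insync.replicas": 2}, "debug": True}

for line in diff_state(desired, live):
    print(line)
```

An empty result means the live system matches Git; anything else is drift that an agent could alert on or revert.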

By implementing these core GitOps principles, you change your CI/CD pipeline from a push-based to a pull-based model. Some benefits of GitOps are better productivity, enhanced ops work, easier credential management, better consistency and standardization, and security guarantees. ArgoCD[2] and FluxCD[3] are popular deployment tools that help you implement the Weaveworks GitOps strategy. At Paytm Labs, we chose ArgoCD over FluxCD for the following reasons:

Overall, both projects are great at helping you implement your GitOps strategy. Both support the essentials of GitOps like automated synchronization, resource deletions, and declarative setups. However, ArgoCD is more feature-rich: its multi-repo and multi-cluster support, along with cluster bootstrapping, is really powerful and is something that Flux misses out on. We also use Flux image automation[4] to bridge the gap between our CI and CD pipelines.

Kubernetes Operators

Kubernetes Operators[5] package, deploy, and manage Kubernetes applications. They help you manage complex applications by extending the Kubernetes API. In Kubernetes, controllers in the control plane are responsible for bringing the actual state in line with the desired state; Operators are custom controllers. Kubernetes can manage stateless applications without any domain-specific knowledge. However, stateful applications like databases and monitoring tools require additional domain-specific knowledge.

This is where Operators come in: DevOps engineers or SREs can write code to automate tasks in Kubernetes. Operator SDK[6] is a popular open-source framework that helps you write an Operator with ease. It lets you focus on operational logic, generates code for bootstrapping new projects, and automates common or recurring tasks.
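The idea of a controller matching actual state to desired state can be sketched as a single reconcile pass. This is a toy in-memory model, not Operator SDK code (which is typically written in Go); `ClusterState` and `reconcile` are hypothetical names, and the "drain partitions before stopping a broker" step stands in for the kind of domain knowledge an operator encodes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterState:
    brokers: int  # number of Kafka brokers


def reconcile(desired: ClusterState, actual: ClusterState) -> List[str]:
    """One reconcile pass: return the actions that move actual toward desired.

    The toy model mutates `actual` in place to simulate the actions taking effect.
    """
    actions: List[str] = []
    while actual.brokers < desired.brokers:
        actions.append(f"start broker {actual.brokers}")
        actual.brokers += 1
    while actual.brokers > desired.brokers:
        actual.brokers -= 1
        # Domain knowledge: move data away before removing a stateful node.
        actions.append(f"move partitions off broker {actual.brokers}")
        actions.append(f"stop broker {actual.brokers}")
    return actions


actual = ClusterState(brokers=1)
print(reconcile(ClusterState(brokers=3), actual))  # scale out
print(reconcile(ClusterState(brokers=3), actual))  # already converged: no actions
```

Calling the function again once the states match yields no actions, which is the property that makes a reconcile loop safe to run continuously.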

Kafka on Kubernetes

Before jumping into Kafka on Kubernetes, let's look at the challenges of managing a Kafka cluster on traditional VMs.

  • Scale in/out
  • Rack awareness (Time-consuming setup)
  • Telegraf with JMX exporter for monitoring
  • Managing topics and users
  • Upgrading brokers
  • Deploy and manage:
    • Zookeeper clusters
    • Secor to stream Kafka topics to S3 storage
    • Burrow for consumer lag metrics

At Paytm Labs, we recently introduced the Strimzi Kafka Operator[7] to manage Kafka clusters on Kubernetes, which helps us overcome the above ops challenges. The Strimzi operator offers the following:

  • Secure by default - built-in security, TLS, SCRAM-SHA and OAuth authentication, and automated certificate management.
  • Simple yet configurable - NodePort, LoadBalancer, and Ingress options, rack awareness for HA, and dedicated nodes for Kafka.
  • Kubernetes-native experience - use kubectl to manage Kafka, and manage Kafka using GitOps.

The following diagram[7] illustrates the CRDs that Strimzi supports for operating a Kafka cluster on Kubernetes.

The Cluster Operator is responsible for creating resources such as Kafka, Zookeeper, Connect, MirrorMaker, Bridge, Kafka Exporter, and Cruise Control, whereas the Topic and User operators are responsible for managing Kafka topics and the ACLs on those topics. The following manifest creates a Kafka cluster with Zookeeper, metrics exporter, and Cruise Control.
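As a minimal sketch of what such a manifest might contain (not the authors' actual manifest), the snippet below builds a Strimzi `Kafka` custom resource as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The cluster name, namespace, replica counts, storage sizes, and broker settings are all assumptions chosen for illustration; the field names follow the Strimzi `kafka.strimzi.io/v1beta2` API.

```python
import json

# Hypothetical values throughout; only the field names come from the Strimzi API.
kafka_cluster = {
    "apiVersion": "kafka.strimzi.io/v1beta2",
    "kind": "Kafka",
    "metadata": {"name": "example-cluster", "namespace": "kafka"},
    "spec": {
        "kafka": {
            "replicas": 3,
            "listeners": [
                {"name": "plain", "port": 9092, "type": "internal", "tls": False},
                {"name": "tls", "port": 9093, "type": "internal", "tls": True},
            ],
            # Rack awareness: spread brokers across zones using a node label.
            "rack": {"topologyKey": "topology.kubernetes.io/zone"},
            # Broker-level configuration, declared instead of templated.
            "config": {
                "default.replication.factor": 3,
                "min.insync.replicas": 2,
            },
            "storage": {"type": "persistent-claim", "size": "100Gi", "deleteClaim": False},
        },
        "zookeeper": {
            "replicas": 3,
            "storage": {"type": "persistent-claim", "size": "10Gi", "deleteClaim": False},
        },
        # Enables the Topic and User operators for this cluster.
        "entityOperator": {"topicOperator": {}, "userOperator": {}},
        # Metrics exporter for topic and consumer-group metrics.
        "kafkaExporter": {"topicRegex": ".*", "groupRegex": ".*"},
        # An empty object deploys Cruise Control with default settings.
        "cruiseControl": {},
    },
}

print(json.dumps(kafka_cluster, indent=2))
```

Committing the printed document to Git is what makes the cluster definition reviewable and versioned like any other code change.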

As you can see from the above manifest, all broker-level configuration, Zookeeper, and the metrics exporter are configured declaratively. Deploying the manifest with ArgoCD lets us manage Kafka using GitOps. The following screenshot shows the ArgoCD deployment for the Kafka cluster.

This picture shows that the application status is Healthy, which means all the objects we deployed are healthy. The current sync status is OutOfSync, which means there is a diff between Git and the live state of the deployment. ArgoCD lets us locate this drift using either the UI or the CLI, which increases confidence in the CD process because we know exactly what state our infrastructure is in relative to our code.
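For illustration, an ArgoCD `Application` that tracks a Git path and corrects drift automatically might look like the sketch below. The repository URL, path, and names are hypothetical, and the article does not say whether automated sync is enabled in the authors' setup; the field names follow the ArgoCD `argoproj.io/v1alpha1` Application spec.

```python
import json

# Hypothetical repository and names; field names follow the ArgoCD Application spec.
argocd_app = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "kafka", "namespace": "argocd"},
    "spec": {
        "project": "default",
        # Where the desired state lives.
        "source": {
            "repoURL": "https://git.example.com/platform/kafka-manifests.git",
            "targetRevision": "main",
            "path": "clusters/example",
        },
        # Where the desired state is applied (the cluster ArgoCD runs in).
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": "kafka",
        },
        # Optional: sync automatically, delete removed resources, revert manual edits.
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

print(json.dumps(argocd_app, indent=2))
```

Without the `syncPolicy` block, the application stays OutOfSync until someone reviews the diff (for example with `argocd app diff kafka`) and triggers a sync manually.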

The operator also supports MirrorMaker (MM), which helps us migrate from an EC2-managed Kafka cluster to an operator-managed cluster. The following manifest details the MM configuration:
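A minimal sketch of such a configuration, assuming the `KafkaMirrorMaker2` resource is used (the article does not say which MirrorMaker version the authors run), is shown below as a Python dictionary printed as JSON. The cluster aliases, bootstrap addresses, and replication settings are hypothetical.

```python
import json

# Hypothetical addresses and names; field names follow Strimzi's KafkaMirrorMaker2 API.
mirror_maker = {
    "apiVersion": "kafka.strimzi.io/v1beta2",
    "kind": "KafkaMirrorMaker2",
    "metadata": {"name": "ec2-to-k8s", "namespace": "kafka"},
    "spec": {
        "replicas": 1,
        # MirrorMaker 2 runs on Kafka Connect, hosted on the target cluster.
        "connectCluster": "target",
        "clusters": [
            {"alias": "source", "bootstrapServers": "legacy-kafka.example.internal:9092"},
            {"alias": "target", "bootstrapServers": "example-cluster-kafka-bootstrap:9092"},
        ],
        "mirrors": [
            {
                "sourceCluster": "source",
                "targetCluster": "target",
                "sourceConnector": {"config": {"replication.factor": 3}},
                # Mirror every topic and consumer group; narrow these in practice.
                "topicsPattern": ".*",
                "groupsPattern": ".*",
            }
        ],
    },
}

print(json.dumps(mirror_maker, indent=2))
```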

Once MM is up and running, you will have to switch the consumers to the new Kafka cluster, followed by the producers.

You will get the benefits of a Kubernetes StatefulSet for your Kafka cluster, such as high availability for your deployments, better network policy, EFK for cluster logs, rolling upgrades, cloud-agnostic declarative manifests, and better resource management. By moving to Kubernetes Operators and a GitOps strategy, we also improved our ops work as follows:

  • Cluster creation from hours to minutes
    • It saves the time spent creating infrastructure (Terraform) and configuring brokers (Ansible). All you need is a single custom resource deployed on Kubernetes.
  • Manage multiple frameworks declaratively
    • Zookeeper, Kafka, MirrorMaker, Kafka Exporter, Kafka Connect and Topics/Users management
  • Scale in/out from days to minutes
    • Horizontal scaling is a tedious process, and people spend days properly rebalancing partitions across the cluster. With the Strimzi operator and Cruise Control, rebalancing the cluster takes only minutes, which lets us scale horizontally with almost no ops effort.
  • Better fault tolerance
    • Rack awareness
    • StatefulSet benefits
  • Better troubleshooting
    • Configuring monitoring for a Kafka cluster is no longer painful
    • From no central cluster logs to 100% logs available in the ELK stack
  • From zero visibility to 100% visibility on
    • Broker level configuration
    • Topic and users configuration
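The rebalancing step mentioned under scaling can be sketched as a `KafkaRebalance` custom resource, which asks Cruise Control (via the Strimzi operator) for an optimization proposal. The resource and cluster names below are hypothetical; the label key and field names follow the Strimzi API.

```python
import json

# Hypothetical names; the strimzi.io/cluster label ties the request to a Kafka cluster.
rebalance = {
    "apiVersion": "kafka.strimzi.io/v1beta2",
    "kind": "KafkaRebalance",
    "metadata": {
        "name": "scale-out-rebalance",
        "namespace": "kafka",
        "labels": {"strimzi.io/cluster": "example-cluster"},
    },
    # An empty spec requests a proposal using Cruise Control's default goals.
    "spec": {},
}

print(json.dumps(rebalance, indent=2))
```

Once the proposal is ready, it is applied only after approval, which Strimzi expects as an annotation on the resource, for example `kubectl annotate kafkarebalance scale-out-rebalance strimzi.io/rebalance=approve`.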

Conclusion

GitOps brings software life-cycle practices to infrastructure management, and building operators shifts infrastructure management from traditional systems administration to software engineering, marking the rise of Infrastructure-as-Software.

References

  1. Guide to GitOps - https://www.weave.works/technologies/gitops/
  2. ArgoCD - https://argoproj.github.io/cd/
  3. FluxCD - https://fluxcd.io/
  4. Flux image automation - https://fluxcd.io/docs/guides/image-update/
  5. Operator Pattern - https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
  6. Operator SDK - https://sdk.operatorframework.io/docs/overview/
  7. Strimzi Kafka - https://strimzi.io/


Authors: Jerome Gagnon & Kiran Sundaravarathan