Bernard Pietraga

OPA Gatekeeper In Production

How to run OPA Gatekeeper in Production - Kubernetes

I’ve been implementing Open Policy Agent Gatekeeper policies for Kubernetes for a while, and this post shares some tips for getting the policies and the underlying components to run in a redundant way.

Open Policy Agent Gatekeeper is an awesome admission controller, but implementing it can be a hurdle for unprepared organisations.

I have noticed that people who lacked experience with OPA solutions thought that Gatekeeper introduces a lot of problems. If you are looking for simple Kubernetes policies, Kyverno might be a better pick: it introduces fewer components and doesn’t add another language, Rego, to the stack. Many of these trade-offs are the result of specific design choices made by the OPA team, resulting in a setup that is more complex to use but definitely more capable and powerful. Another advantage is the mutation webhook capabilities. This post focuses more on the enforcement and audit side.

If you don’t know where to start

Jump to another section if you already know Gatekeeper.

Gatekeeper provides a way to enforce policies across all Kubernetes resources. It has proven useful for infrastructure that operates at scale, with many users and consumers (developers and customers) at the other end.

A good place to start is the official documentation and the example library of constraints.

Gatekeeper enables running policies in warn mode, which helps you gain insight into which resources are non-compliant with your requirements.
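
For example, setting a Constraint’s enforcementAction to warn reports violations instead of denying admission. A minimal sketch, assuming the K8sRequiredLabels ConstraintTemplate from the Gatekeeper getting-started demo is installed; the constraint name and the required label are placeholders, and the exact parameter shape depends on the template version you use:

    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sRequiredLabels
    metadata:
      name: ns-must-have-owner        # placeholder constraint name
    spec:
      enforcementAction: warn         # surface violations as warnings instead of denying
      match:
        kinds:
        - apiGroups: [""]
          kinds: ["Namespace"]
      parameters:
        labels: ["owner"]             # placeholder required label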

Without policies, it would be difficult to manage and maintain a Kubernetes environment at scale while maintaining a good security posture.

Make sure that you have a distributed webhook controller

OPA Gatekeeper is effectively an admission webhook with its own DSL, Rego. The reliability of the system depends on the deployment containing the Gatekeeper pods that perform the work. If your Gatekeeper pods are killed or end up in a non-functional state, you cannot admit new resources. Especially in setups where nodes are rotated all the time, this can be a hassle. Speaking with some teams, I learned that people even disabled Gatekeeper during maintenance, which is in my opinion a big no-no.

How do you solve this problem? My recommendation is to increase the replica count and to schedule at least one pod on the master node, or across multiple control-plane nodes if your setup has them.

Another point is the distribution of the components, which can be addressed with node tolerations and nodeAffinity. One way to improve Gatekeeper’s reliability is to spread its pods across more nodes; specifying podAntiAffinity is useful to achieve this. Depending on your setup, you can distribute the remaining pods to non-master nodes, or spread the deployment across the master nodes if you have several (see the tolerations sketch after the example below).

Example:

    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: gatekeeper.sh/operation
            operator: In
            values:
            - webhook
        topologyKey: kubernetes.io/hostname
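
To land replicas on the control plane as well, tolerations and nodeAffinity can be added to the Gatekeeper Deployment. A minimal sketch, assuming a kubeadm-style cluster where control-plane nodes carry the node-role.kubernetes.io/control-plane label and taint:

    tolerations:
    - key: node-role.kubernetes.io/control-plane   # tolerate the control-plane taint
      operator: Exists
      effect: NoSchedule
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:                              # prefer, but do not require, control-plane nodes
            matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists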

Make sure that OPA Gatekeeper is the first thing running in the cluster: have your CD system deploy it first, or configure DaemonSets or ReplicaSets (depending on your deployment configuration) to achieve the same thing.

Make sure you harden your setup to your liking

You want to do your hardening first. Implementing restrictive policies doesn’t benefit you much if your setup is non-compliant.

The other part of the equation is that Gatekeeper will prevent you from running anything non-compliant. To mitigate this, you might temporarily use the warn enforcement action mentioned before, but this is not recommended.

There is more information available to policies than input.review

You can get more information about incoming requests, and use it to make policies even better.

Want to know who triggers the review? No problem

Look at the #webhook-request-and-response section of the Kubernetes docs.
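
In Gatekeeper policies, the full admission request is available under input.review, including userInfo. A minimal sketch of a ConstraintTemplate reading it; the template name and the rule itself are hypothetical and only illustrate where the data lives:

    apiVersion: templates.gatekeeper.sh/v1
    kind: ConstraintTemplate
    metadata:
      name: k8sdenykubesystemchanges            # hypothetical template name
    spec:
      crd:
        spec:
          names:
            kind: K8sDenyKubeSystemChanges
      targets:
      - target: admission.k8s.gatekeeper.sh
        rego: |
          package k8sdenykubesystemchanges

          violation[{"msg": msg}] {
            # input.review carries the AdmissionRequest,
            # including userInfo about who triggered the review
            username := input.review.userInfo.username
            input.review.object.metadata.namespace == "kube-system"
            not startswith(username, "system:")
            msg := sprintf("only system users may change kube-system, got %v", [username])
          }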

Use external data when the decision needs information from outside the cluster.
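
A hedged sketch of registering an external data provider; the name and endpoint are placeholders, and recent Gatekeeper versions require TLS between Gatekeeper and the provider:

    apiVersion: externaldata.gatekeeper.sh/v1beta1
    kind: Provider
    metadata:
      name: my-provider                                        # placeholder provider name
    spec:
      url: https://my-provider.provider-system:8443/validate   # placeholder endpoint
      timeout: 3                                               # seconds Gatekeeper waits for an answer
      caBundle: <base64-encoded CA certificate of the provider>

Inside Rego, such a provider is then queried with the external_data built-in.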

Deploy things in order

Gatekeeper needs to be the first thing you run in the cluster. This is important, as you want all of the other admission requests to go through it. The deployment process follows this order:

  1. Deployment of the OPA Gatekeeper core components, including the webhook controller and CRDs
  2. Creation of ConstraintTemplates
  3. Creation of Constraints based on the CRDs created by the ConstraintTemplates

Constraints, which are the actual policy enforcements, are instances of custom Gatekeeper CRDs created from ConstraintTemplates. This means that you need to have a ConstraintTemplate present before creating its Constraints.

Because the Constraints and ConstraintTemplates depend on the Gatekeeper CRDs, the order of the deployments needs to be preserved. When using tools like Flux CD or Argo CD this can pose an issue, as the default configuration reconciles every part of the manifests at the same time.

Flux CD v2 will fail on the first reconciliation of the custom resources and retry them on the next run.
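
If you prefer not to rely on retries, the ordering can be made explicit with Flux Kustomization dependencies. A sketch; the Kustomization names and repository paths are assumptions about your layout:

    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: gatekeeper-constraints      # hypothetical name
      namespace: flux-system
    spec:
      interval: 10m
      path: ./policies/constraints      # placeholder path in the Git repository
      prune: true
      sourceRef:
        kind: GitRepository
        name: flux-system
      dependsOn:
      - name: gatekeeper-templates      # reconcile only after the templates Kustomization is ready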

Argo CD requires configuration to skip the dry run for resources whose CRDs are not yet present. Docs from release 1.8:

    metadata:
      annotations:
        argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
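
If the Gatekeeper core, ConstraintTemplates, and Constraints live in a single Argo CD application, sync waves are one way to apply them in phases. A sketch; the wave numbers are arbitrary, with Constraints getting the last wave:

    metadata:
      annotations:
        argocd.argoproj.io/sync-wave: "2"   # e.g. 0 = Gatekeeper core, 1 = ConstraintTemplates, 2 = Constraints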

You still use Jenkins? You can automate everything with Groovy 😊

Cluster rotation check

Clusters and nodes in Kubernetes should be treated as ephemeral. During node creation, recycling, and sunsetting, components or images may run that you don’t see once a node is up. Make sure those are covered and hardened in your pipeline. A good example: if you are using something like Cluster API, custom operators and images are used during configuration changes, and you might miss them when you only test your policies against already-running nodes.

In short, test Gatekeeper during the whole node lifecycle to make sure you haven’t missed something.

Have the correct mindset

When running Gatekeeper, do not allow exceptions from the rules. Exceptions creep and turn management into a nightmare. Sometimes this will mean rearchitecting applications to be compliant. If you are going the route of real hardening, it is better to fail than to run an insecure setup.

Utilize mutations to inject sane defaults

Sometimes, rather than changing deployment manifests to be compliant, you might want to use the mutation webhook. This can be useful for setting AppArmor, SELinux, or seccomp profiles, or for adding labels and annotations.
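
A minimal sketch of a Gatekeeper Assign mutation that injects a default seccomp profile into Pods that don’t set one; it assumes a recent Gatekeeper version with the mutation feature enabled, and the policy name is a placeholder:

    apiVersion: mutations.gatekeeper.sh/v1
    kind: Assign
    metadata:
      name: default-seccomp-profile       # placeholder name
    spec:
      applyTo:
      - groups: [""]
        kinds: ["Pod"]
        versions: ["v1"]
      location: "spec.securityContext.seccompProfile.type"
      parameters:
        assign:
          value: RuntimeDefault
        pathTests:
        - subPath: "spec.securityContext.seccompProfile.type"
          condition: MustNotExist         # only set the profile when it is missing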

Test your policies

Use conftest against your policy code to verify that the policies indeed work as intended. This will also speed up your feedback loop.

Go further with Gatekeeper providers

One example: you can mitigate supply-chain dangers by signing your images with cosign and verifying them on the cluster with cosign-gatekeeper-provider. I will try to write about it in another post. The sky is the limit. Thank you for reading this, have a great day!