Bernard Pietraga

Category: code

OPA Gatekeeper In Production

I have been implementing Open Policy Agent Gatekeeper policies for Kubernetes for a while, and this post collects some tips for running the policies and underlying components in a redundant manner.

Open Policy Agent Gatekeeper is an awesome admission controller, but implementing it can be a hurdle for unprepared organizations.

I have noticed that people who lacked experience with OPA solutions thought that Gatekeeper introduces a lot of problems. If you are looking for simple Kubernetes policies, Kyverno might be a better pick: it has fewer components and doesn't introduce another language, Rego, to the stack. Much of Gatekeeper's complexity is the result of specific design decisions taken by the OPA team, which make the setup harder to use but definitely more powerful. Another benefit is its mutation webhook capabilities. This post focuses on the admission part.

If you don’t know where to start

Jump to another section if you already know Gatekeeper.

Gatekeeper provides a way to enforce policies for all Kubernetes resources. It has proven useful for infrastructure that operates at scale, with many users and consumers, from developers on one end to clients on the other.

A good place to start is the official documentation and the example library of constraints.

Gatekeeper enables running policies in warn mode, which helps you gain insight into which resources are non-compliant with your requirements.
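
For example, a constraint can be switched to warn mode via the `enforcementAction` field. A minimal sketch, assuming the `K8sRequiredLabels` kind from the example library is installed (the name and parameters below are illustrative):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: all-namespaces-must-have-owner
spec:
  # "warn" reports violations instead of denying the request
  enforcementAction: warn
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels: ["owner"]
```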

Without policies, it would be difficult to manage and maintain a Kubernetes environment at scale, while maintaining good security posture.

Make sure that you have a distributed webhook controller

OPA Gatekeeper is effectively an admission webhook with its own DSL, Rego. The system's reliability depends on the deployment of Gatekeeper pods performing the work. If your Gatekeeper pods are killed or end up in a non-functional state, you cannot schedule new resources. Especially in setups where node rotation happens all the time, this can be a hassle. Speaking with some teams, I learned that people even disabled it during maintenance, which is in my opinion a big no-no.

How to solve this problem? My recommendation is to increase the replica count and to schedule at least one pod on the master node, or across multiple control plane nodes if your setup has them.

Another point is the distribution of the components, which can be addressed with tolerations and node affinity. One way to improve the reliability of Gatekeeper is to spread its pods across more nodes; specifying podAntiAffinity is useful here. Depending on your setup, you can distribute the rest of the pods to other non-master nodes, or spread deployments across the master nodes if you have multiple ones.

Example:

    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: gatekeeper.sh/operation
            operator: In
            values:
            - webhook
        topologyKey: kubernetes.io/hostname
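
To also get pods onto the control plane, tolerations and node affinity can be combined. A sketch, assuming your control plane nodes use the common `node-role.kubernetes.io/control-plane` taint and label (adjust to your distribution):

```yaml
    tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
```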

Make sure that OPA Gatekeeper is the first thing running in the cluster. Make sure your CD system deploys it first, or configure DaemonSets or ReplicaSets (depending on your deployment configuration) to achieve the same thing.

Make sure you harden your setup to your liking

You want to do your hardening first. Implementing restrictive policies doesn’t benefit you much if your setup is non-compliant.

The other side of the equation is that Gatekeeper will prevent you from running anything non-compliant. To mitigate this you might temporarily use the warn configuration mentioned before, but this is not recommended.

There is more information for policies than .input.review

You can get more information about incoming requests, and use it to make policies even better.

Want to know who triggers the review? No problem

Look at the #webhook-request-and-response section of the Kubernetes docs.

Use external-data.
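
As an illustration, the admission request carries requester identity under `input.review.userInfo`. A hypothetical Rego fragment (a sketch; the package name and rule are mine, the field paths follow the Kubernetes AdmissionReview schema):

```rego
package k8sdenyhumanedits

violation[{"msg": msg}] {
  # userInfo describes who sent the request, not the object being reviewed
  username := input.review.userInfo.username
  not startswith(username, "system:")
  msg := sprintf("direct change by user %v is not allowed", [username])
}
```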

Deploy things in order

Gatekeeper needs to be the first thing you run in the cluster. This is important because you want all of the other admission requests to go through it. The deployment process follows this order:

  1. Deployment of OPA Gatekeeper core components including webhook controller and CRDs
  2. Creation of ConstraintTemplates
  3. Creation of Constraints based on the CRDs created by ConstraintTemplates

Constraints, which are the actual policy enforcements, require custom Gatekeeper CRDs created from a ConstraintTemplate. This means that the ConstraintTemplate must be present before you create a Constraint.
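
As a minimal illustration of steps 2 and 3 (closely following the `K8sRequiredLabels` example from the library; the template must be applied, and its CRD established, before the constraint):

```yaml
# 2. The template registers the K8sRequiredLabels CRD and carries the Rego
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8srequiredlabels
      violation[{"msg": msg}] {
        provided := {label | input.review.object.metadata.labels[label]}
        required := {label | label := input.parameters.labels[_]}
        missing := required - provided
        count(missing) > 0
        msg := sprintf("you must provide labels: %v", [missing])
      }
---
# 3. The constraint is an instance of the CRD created above
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels: ["owner"]
```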

Because the Constraints and ConstraintTemplates depend on the Gatekeeper CRDs, the order of the deployments needs to be preserved. When using tools like Flux CD or Argo CD this can pose an issue, as the default configuration reconciles every part of the manifests at the same time.

Flux CD v2 will handle the first run of custom CRDs with a failure and retry in the second run.

Argo CD requires an annotation to skip the dry run for resources whose CRDs do not exist yet (documented since release 1.8):

metadata:
  annotations:
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true

If you still use Jenkins, you can automate everything with Groovy :D

Cluster rotation check

Clusters and nodes in Kubernetes should be treated as ephemeral. During node creation, recycling, and sunsetting, components or images might be triggered that you don't use after a node is up. Make sure those are covered and hardened in your pipeline. A good example: if you are using something like Cluster API, it will use custom operators and images during config changes, and you might miss them if you test your policies only against already running nodes.

In short, make sure to test Gatekeeper during the whole node lifecycle to make sure you haven’t missed something.

Have the correct mindset

When running Gatekeeper, do not allow exceptions from the rules. Exceptions creep and make management a nightmare. Sometimes this will mean rearchitecting applications to be compliant. If you are going down the route of real hardening, it is better to fail than to run an insecure setup.

Utilize mutations to inject sane defaults

Sometimes, rather than changing deployment manifests to be compliant, you might want to use the mutation webhook. This can be useful for setting AppArmor, SELinux, or Seccomp profiles, or some additional labels and annotations.
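
For instance, Gatekeeper's Assign mutator can inject a default seccomp profile into pods that don't set one. A sketch (the mutator name and the decision to target all pods are illustrative):

```yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: default-seccomp-profile
spec:
  applyTo:
  - groups: [""]
    kinds: ["Pod"]
    versions: ["v1"]
  location: "spec.securityContext.seccompProfile.type"
  parameters:
    assign:
      value: RuntimeDefault
```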

Test your policies

Use conftest against your policy code to verify that the policies indeed work as intended. This will also speed up your feedback loop.
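
A typical invocation looks like this, assuming your Rego and its unit tests live in a `./policy` directory (paths are illustrative):

```shell
# Run the Rego unit tests shipped alongside the policies
conftest verify --policy ./policy

# Evaluate a rendered manifest against the policies
conftest test deployment.yaml --policy ./policy
```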

Go further with Gatekeeper providers

One example: you can mitigate supply chain dangers by signing your images with cosign and verifying them on the cluster with cosign-gatekeeper-provider. I will try to write about it in another post. The sky is the limit. Thank you for reading, have a great day!

Thoughts on security benchmarks: cloud vs external providers

A bit on my background: I'm a Site Reliability Engineer and co-founder of a consulting agency. I focus on Site Reliability, Security Engineering, and Infrastructure Automation. I have worked on Linux hardening and on improving the cloud security posture of companies around the world.

I have seen cases where the tools utilized by security teams weren't used to their full potential, just to solve simple problems. The tools in question cost hundreds of thousands of dollars in licensing fees yearly, for things that could be addressed by simple checks available as a cheaper product in the cloud provider's offering, or by open-source tools. Currently, the market for security add-ons to cloud and container infrastructure is rapidly expanding, and many claims made by advertisers need to be taken with a grain of salt.

DISCLAIMER: I do not advocate canceling the use of external security providers. My point is that the tools need to be chosen wisely, with a clear goal in mind, and tested. Proper security tools will definitely help you keep a reasonable security posture. If possible, look for open-source solutions first. I do encourage you to do proper research or hire someone competent to do so. The companies mentioned here hire smart folks and solve complex problems; the question is whether they solve the problems you are focusing on. This just touches a small subset of security tools' functionality. This article doesn't cover Intrusion Detection or other important topics. It also doesn't debate the AWS CIS and Foundational Best Practices, but rather uses them as an example.

Why is it important to have compliance checks for the cloud?

Compliance checks are important for the cloud because they ensure that data is being stored and accessed in a way that meets the reasonable requirements of the organization. The proper policies can guide engineers to configure cloud resources correctly, and thus protect the privacy of the data and ensure that it is being used in a way that is authorized by the company. They are not a silver bullet, just one preventative measure. By default, the cloud provider-issued ones are not great, but definitely better than nothing. The NSA also issues reasonable guidance for the cloud. Consider the Swiss cheese model.

Security/Compliance checks example

Often, security engineers expect tools with an accessible UI which provides a dashboard and alerting on cloud misconfigurations. I will focus on this case to keep the scope of this article confined.

A great example of this is security checks and tools which provide audit reports. I will focus on AWS as it is the most common cloud provider, but things presented here could be replicated for Google Cloud and Azure.

Now, something which will require a bit of research on your side. You can just skim through the policy names, or read exactly what they do. Keep an eye on the cloud provider's best practices and the checks they provide, and compare them with the solutions of external vendors.

Let’s start specifying the problem:

  • We want to have a compliance check platform for generic AWS configuration checks. To start, this will be the CIS Benchmark and AWS Foundational Best Practices, which will inform us of non-compliant or misconfigured resources. An example could be a non-encrypted cloud storage bucket like S3. This should cover the public best practices published by the cloud provider.
  • We should have alerting with information about our findings.

Take a look at the CIS benchmark. It provides a set of security standards for cloud security.
Next is a link to the official CIS AWS Foundations Benchmark controls. For a start, let's also take a look at The AWS Foundational Security Best Practices.

These benchmark controls, alongside AWS Config and AWS Audit Manager, can provide you with compliance checks for the above benchmarks. Both tools are part of the AWS Security Hub platform.

Here you can find the list of AWS Config Managed Rules.

AWS issues documentation and a set of rules named AWS Foundational Security Best Practices. They can be enabled in AWS Security Hub.

Let’s compare it with checks provided by Security Companies

Security companies love to provide 1-to-1 comparisons or matrices presenting their offering against other solutions. I recommend taking a different approach: just take a look at the documentation.

Let’s take a look at Bridgecrew. They do have a broad security offering, but for the scope of this article, let’s just focus on the AWS security policies they check. Here is the link to the Bridgecrew AWS Policy Index. Take your time and compare it to the AWS CIS Benchmark + AWS Best Practices mentioned before. Did you find some similarities?

Now let’s take a look at Lacework and their Advanced Suppression – Tag Matrix. They are clear about the CIS benchmark, as those policies include AWS_CIS_ in the prefix, which is a plus, but let’s take a look at the rules starting with LW_AWS_ or LW_S3_. Compare them again with the AWS Foundational Security Best Practices. Did you notice something?

Deployment of a probably cheaper solution

Now let’s take a look at how we could deploy it:

  • You could write the Terraform yourself, using the examples from securityhub_standards_subscription.
  • We could use the open-source Terraform modules from Cloudposse. They are nice folks, and their repos are actually monitored by Bridgecrew.
    • Cloudposse made a free Terraform module for AWS Security Hub. The mentioned default rules can be found here in GitHub.
    • AWS Config is a necessary component of AWS Security Hub; it can be managed by the Terraform resource here, or by another Cloudposse module, terraform-aws-config.
    • Note that I’m not the biggest fan of Cloudposse’s module structuring convention, which differs from the Hashicorp standard (which is reasonable and embedded in most of the available open-source modules). But using this one can help you bring up the whole setup for one account quickly, and it is hard to complain about hard work someone did and gave away for free.
  • You could use AWS CloudFormation or the CLI if you want to.
  • You can just use the AWS Console UI to enable AWS Security Hub. AWS Config is part of the bundle. AWS Audit Manager will be used for reports.
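
The first bullet can be sketched in a few lines of Terraform (a sketch; check the provider docs for the exact standards ARN and version available in your region):

```hcl
resource "aws_securityhub_account" "this" {}

resource "aws_securityhub_standards_subscription" "cis" {
  # Subscribes the account to the CIS AWS Foundations Benchmark checks
  standards_arn = "arn:aws:securityhub:::ruleset/cis-aws-foundations-benchmark/v/1.2.0"
  depends_on    = [aws_securityhub_account.this]
}
```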

What did we get?

We have replicated a subset of the security compliance check functionality provided by cloud security vendors. Of course, alerting is missing, but that can be solved with an AWS-provided Lambda function.

What about Kubernetes?

Falco already has open-source Kubernetes and AWS CloudTrail rules for you.

I’m looking for SIEM

Check out the open-source Wazuh.

But what if I don’t trust Cloud Provider to monitor Cloud Provider?

Well, from my experience, a lot of security companies already utilize the cloud to perform their analytics and alerting. I do not have links to back this up, so take it with a grain of salt, as with everything you read on the internet. Otherwise, you can just stop using the cloud altogether.

Is it enough?

Of course not. Security is a complex field, and this is just one detached example. We are not touching attack vectors (just take a look at MITRE ATT&CK) or other important topics. This is just an example comparison. Having even the best security checks doesn’t replace infrastructure and application hardening.

What about SOC 2, PCI compliance, NIST, etc.?

The majority of the mentioned compliance standards are also already covered by cloud provider documentation. Do your research, and think about what your goals and requirements are.

Watch out for host/agent configurations with too many privileges

Some SIEM and Threat Detection solutions give agents additional permissions that they should not have. This creates a backdoor on the machines they run on. Verify a program’s capabilities before you run it. A good example is remote threat mitigation which spawns a direct shell from the UI to the machine. It means that whoever controls the tool controls your machines.

Do you have questions?

Reach out to me on About Me page.

IAC tagging for your git blame needs – Yor

Have you ever noticed some weird cloud resource or suspicious Kubernetes pod? Have you ever had a hard time finding the respective owner of that resource?

Ideally this situation should not happen, with good practices in place and proper tagging enforcement. Unfortunately we are not living in a perfect world; Facebook’s "move fast, break things" ethos, which spread to a lot of places, leaves a lot of things untidy. If you have worked long enough with complex infrastructure in this field, you know what I’m referring to.

Having proper tags and Kubernetes labels in place helps you keep a mental picture of the infrastructure, and to maintain and navigate it more quickly.

Yor – a new free automation tool for IAC tagging

Yor by Bridgecrew is a free and open-source project aiming to help infrastructure engineers organize their cloud and Kubernetes resources. There is support for the 3 big cloud providers and Kubernetes.

Traceability is one of the biggest problems in greenfield projects and big corporate projects.

Tracing code in the cloud can be a challenge, especially when you want to track down a specific change to cloud infrastructure. One solution is to use custom tags to help track down your code. You can then use native cloud or Kubernetes tools to understand the problem better.

Yor adds important information to each newly created resource, including, for example, the git commit SHA. This might save you years later, when you are trying to work out why something is there in the first place. yor_trace or git_file enables you to find the resource in the code repository.
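
As an illustration, tags similar to the following end up on a Terraform resource after Yor runs (the tag names follow Yor's documented tag set; the resource and all values here are made up):

```hcl
resource "aws_s3_bucket" "logs" {
  bucket = "example-logs"

  tags = {
    # Added automatically by Yor; illustrative values
    git_commit = "d68d2897add9bc2203a5ed0632a5cdd8ff8cefb0"
    git_file   = "modules/storage/s3.tf"
    git_repo   = "infrastructure"
    yor_trace  = "912066a1-31a3-4a08-911b-0b06d9eac64e"
  }
}
```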

It integrates with the major CI platforms and can be easily integrated with new ones; here are documentation examples. It also works with pre-commit hooks.

Why is it important to tag your cloud infrastructure?

Cloud computing has quickly become a mainstay in the modern IT landscape. It offers many benefits, such as lower costs and better agility, but one challenge it faces is how to manage and monitor resources. One solution is to tag resources so that they can be easily found and managed. In this article, we’ll look at tagging resources in Kubernetes and Amazon Web Services (AWS).

Tagging resources can be a great way to improve your management of your cloud infrastructure. For example, you could use tags to group related resources together, such as instances or services. This can make it much easier to find and manage these resources. You could also use tags to track specific changes, such as when an instance is upgraded or deleted. When you tag your resources in Kubernetes or AWS resources managed by Terraform, you can easily find and access them later. This becomes especially important when you need to troubleshoot or manage a complex cloud-based application. By tagging your resources, you can quickly identify which components are affected by a problem.

Kubernetes lets you label both individual objects (e.g., pods) and collections of objects (e.g., CRDs). But this all comes at the cost of maintenance. Yor tries to fix this by providing a reasonable set of functionality out of the box.

Scan Docker container images without logging into any solution

Sometimes you want a quick check for any CVE in a Docker image. You are on some Linux machine and you don’t want to use docker scan, which is based on Snyk and requires a login. Well, you can use Trivy. It is free and Apache 2.0 licensed. Additionally, the tool works with Terraform code and the Linux OS. It is not a check providing 100% of the information about all issues, but it is a good starting point.

trivy image automattable/memes:0.2-spongebob

Another option is to use anchore/grype. It provides similar functionality covering the scanning of containers. If the image has multiple layers, this tool provides a way to check all of them.

grype automattable/memes:0.2-spongebob --scope all-layers

The Terraformer might be handy for your IAC toolbelt

Terraform is now pretty much one of the most widely used tools, if not the industry standard, for cloud configuration, no matter if it is AWS, GCP, or Azure. Unfortunately, its resource import feature is tedious to use. If you want to move an existing cloud setup to Terraform, there is a better way to do it.

I have worked with a few companies which migrated to the cloud while adopting the Infrastructure as Code approach. What I noticed is that a lot of engineers working with Terraform don’t know about one great open-source tool which saves a tremendous amount of time. The tool is called Terraformer. It provides a way to import an existing cloud config into Terraform, including the corresponding tfstate.

To use it, the existing setup just needs to be present in the cloud. It doesn’t matter if it was previously created manually in the UI, or via API calls, AWS CloudFormation, the gcloud CLI, or other means.

Of course, there is a minor caveat with this solution. It is still beta and not everything is supported in Terraformer, but this mostly applies to the more exotic parts of the cloud offering, like security tools (AWS Inspector, for example, or Google Cloud IDS), rather than the basic components used almost everywhere, like IAM (or AD in Azure), EC2 instances, AKS/EKS/GKE, and so on.

Edit: 03/02/2022

A friend asked me for example usage, so I’m including it here.

# Import all Cloudwatch and IAM resource from selected aws profile
terraformer import aws -r cloudwatch,iam --profile=your_aws_profile_name

# List supported resources in google cloud, make sure you have your gcloud context set
terraformer import google list

# Make a plan for importing all of Grafana resources.
terraformer plan grafana -r=*

Scaling Deployments With Go Kubernetes Client

So you want to auto scale your deployments with Golang?

I’ve written this post as I haven’t found a comprehensive guide on doing simple scaling from within a Golang app. It utilizes the Kubernetes client-go and apimachinery libraries. Note that the official documentation is great and provides information that enables everyone to get up to speed easily.
Image notes: The Gopher mascot used in the post image was created by Renee French and is shared under the Creative Commons Attribution 3.0 License. I’ve combined it with a burger from Katerina Limpitsouni’s undraw.co


It is time to start. You have an application running in lovely K8s which performs some tasks, and you want to autoscale it based on some really custom internal metric. It is not CPU, requests, or anything else that can be handled using a custom metric or another standard approach. This article describes a weird solution to a rare problem, done in Golang. Most of the time, deployments and other bits of Kubernetes infrastructure can be handled using out-of-the-box components. In the case of deployment pods it can be the Horizontal Pod Autoscaler, which, with the help of a vast set of metrics, can achieve the scaling result we need for our system. So think twice before rolling your own.

This solution might also be helpful when conditional scaling happens during a selected time window, for example during Super Bowl adverts, when you know you will get a lot of traffic.

So let’s assume we have a great startup idea: an application which handles bread grill toasters, to be used in restaurant chains all over the world. There is a service which holds state about the number of orders waiting to be grilled and communicates with the shiny IoT toasters about heating metrics. There is one pod per toaster. Every day when the restaurant opens there are only 3 toasters running, which means only 3 pods defined in the deployment. When the orders ramp up, the restaurant manager calls you and asks if there is a possibility of heating up the other cold toasters lying in the room.

This is an example, and an insecure one, using plain HTTP requests; consider secure communication before using it.

Here we go with our sandwich journey

So let’s assume that the service responsible for managing toasters, for the sake of convenience, returns a JSON response with just one field, "orders", representing the current number of orders. It will be the basis for our logic of scaling up and down.

We will create logic so that if the number of orders to be fulfilled per toaster (pod) is greater than 5, the pods should be scaled up. If it is lower than 5, we should scale down, to a minimum of 3 pods. Everything should happen in 10-second intervals.
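
The rule above boils down to a tiny pure function, which is worth extracting so it can be unit tested on its own. A sketch; the function name and the ceiling-division reading of "5 orders per pod" are my own assumptions:

```go
package main

import "fmt"

// desiredReplicas maps a pending order count to a pod count: one toaster
// pod per 5 orders, rounded up, but never fewer than minReplicas.
func desiredReplicas(orders, minReplicas int32) int32 {
	// Ceiling division: 26 orders at 5 per pod means 6 pods.
	n := (orders + 4) / 5
	if n < minReplicas {
		return minReplicas
	}
	return n
}

func main() {
	fmt.Println(desiredReplicas(26, 3)) // 26 orders need 6 toasters
	fmt.Println(desiredReplicas(2, 3))  // below the floor, stay at 3
}
```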

RBAC rules needed to do the job

Skip this section if you are only interested in the Golang code. RBAC rules are necessary to perform changes on our deployment. The example below is permissive, so adjust it accordingly for security. The goal is to enable our simple toaster app to read deployment data, namely the replica count representing the number of pods, and change that count on the fly.

Let’s create the rbac.yaml file.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: simple-deployments-manager
rules:
- apiGroups: ["extensions", "apps"]
  resources: ["deployments", "pods"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
# This role binding grants the simple-deployments-manager role to the
# default service account in the "default" namespace.
kind: RoleBinding
metadata:
  name: simple-deployments-manager
  # The namespace of the RoleBinding determines where the permissions are
  # granted. This only grants permissions within the "default" namespace.
  namespace: default
subjects:
- kind: User
  name: system:serviceaccount:default:default # Name is case sensitive
  apiGroup: rbac.authorization.k8s.io
- kind: Group
  name: manager # Name is case sensitive
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: simple-deployments-manager
  apiGroup: rbac.authorization.k8s.io

Deployment – we also need to put our app somewhere

Also skip this section if you are only interested in the Golang code. We need to place our app somewhere, and here is an example of the file content. The important note is that we pass an environment variable with the address of the service that provides us with order data.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-scaler
  labels:
    app: simple-scaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-scaler
  template:
    metadata:
      labels:
        app: simple-scaler
    spec:
      containers:
      - name: simple-scaler
        image: simple-scaler
        imagePullPolicy: Never
        env:
        - name: MANAGED_SERVICE_HOST
          value: "managed-service:8080"

Docker image for our main.go

To be able to deploy the app to the cluster, we need to put it into a Docker image. Here is a simple Dockerfile which runs the app as non-root.

# Build the simplescaler binary
FROM golang:1.16 as builder

WORKDIR /workspace
# Copy the Go Modules manifests
COPY go.mod go.mod
COPY go.sum go.sum
# cache deps before building and copying source so that we don't need to re-download as much
# and so that source changes don't invalidate our downloaded layer
RUN go mod download

# Copy the go source
COPY main.go main.go

# Build
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 GO111MODULE=on go build -a -o simplescaler main.go

# Use distroless as minimal base image to package the simplescaler binary
# Refer to https://github.com/GoogleContainerTools/distroless for more details
FROM gcr.io/distroless/static:nonroot
WORKDIR /
COPY --from=builder /workspace/simplescaler .
USER 65532:65532

ENTRYPOINT ["/simplescaler"]

Logic – the brain of restaurant scaling – Golang code

Finally, the Golang code. So let’s look at the functions we need to get stuff done.

Scroll down if you are only interested in the complete file!

First of all, let’s create the file main.go, which we will later run in a separate pod in our greatest cluster.

Now let’s start with the components we need to map for our app to work. First, the struct representing the response from the service that provides the number of orders.

// OrdersResponse is a struct representing the JSON response from our managed
// service at http://managed-service:8080/orders
type OrdersResponse struct {
    Orders int32 `json:"orders,omitempty"`
}

Now let’s jump further and map the other parts needed. On the Kubernetes side, we will need the types and interfaces which will represent the Kubernetes objects.

// ManagedDeployment struct represents the state used to perform replica
// scaling.
type ManagedDeployment struct {
    // k8s client for the Kubernetes deployment
    Client appsv1.DeploymentInterface

    // Current replica count of the managed deployment
    ReplicaCount int32

    // Fetched whole deployment state
    State *v1.Deployment
}

// ManagedDeploymentInterface is the interface exposing methods
type ManagedDeploymentInterface interface {
    Get() error
    UpdateReplicas(count int32) error
}

A small factory for the ManagedDeployment will be nice:

// NewManagedDeploymentManager is a factory used to create a ManagedDeployment
func NewManagedDeploymentManager(deploymentsClient appsv1.DeploymentInterface) *ManagedDeployment {
    return &ManagedDeployment{Client: deploymentsClient}
}

Now the interesting bit. For this we will use some Golang standard libraries and the Kubernetes client-go and Kubernetes apimachinery.

We will need some imports.

import (
    "context"
    "encoding/json"
    "flag"
    "fmt"
    "log"
    "net/http"
    "os"
    "path/filepath"
    "strings"
    "time"

    v1 "k8s.io/api/apps/v1"
    apiv1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    appsv1 "k8s.io/client-go/kubernetes/typed/apps/v1"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/util/retry"
)

Let’s implement the 2 functions from our ManagedDeploymentInterface:

  • Get()
  • UpdateReplicas()

First we will create a simple pointer helper, as the client library requires pointers to 32-bit integers.

// int32Ptr is used to create the pointer necessary for the replica count
func int32Ptr(i int32) *int32 { return &i }

Get() will update the attributes of our ManagedDeployment with fresh data from the Deployment, including what is important for us: State and ReplicaCount, the number of pods.

// Get is used to fetch new state and update the ManagedDeployment.
func (c *ManagedDeployment) Get() error {
    result, getErr := c.Client.Get(context.TODO(),
        "managed-deployment",
        metav1.GetOptions{})
    if getErr != nil {
        return getErr
    }
    replicaCount := *result.Spec.Replicas
    c.State = result
    c.ReplicaCount = replicaCount
    return nil
}

UpdateReplicas(count int32) will update the ManagedDeployment with count, representing the number of pods. The 32-bit integer is used as it is an official client requirement.

// UpdateReplicas is used to change the number of deployment replicas based
// on the provided count number
func (c *ManagedDeployment) UpdateReplicas(count int32) error {
    c.State.Spec.Replicas = int32Ptr(count)
    _, updateErr := c.Client.Update(context.TODO(),
        c.State,
        metav1.UpdateOptions{})
    if updateErr != nil {
        return updateErr
    }
    c.ReplicaCount = count
    return nil
}

Now it is time to wrap it up into a function which will query the managed host and return the current order count.

// ManagedHost is the URI of the managed-service running on the cluster. It
// is read at package level so GetManagedOrdersCount can use it.
var ManagedHost = os.Getenv("MANAGED_SERVICE_HOST")

// GetManagedOrdersCount asks the managed-service API for the current order
// count. It decodes the JSON response body and returns the number.
func GetManagedOrdersCount() (int32, error) {
    url := "http://" + ManagedHost + "/orders"
    r := strings.NewReader("")
    req, clientErr := http.NewRequest("GET", url, r)
    if clientErr != nil {
        return int32(0),
            fmt.Errorf("Prepare request : %v", clientErr)
    }
    req.Header.Set("Content-Type", "application/json")

    client := &http.Client{}

    resp, reqErr := client.Do(req)
    if reqErr != nil {
        return int32(0),
            fmt.Errorf("Failed to query the process count : %v", reqErr)
    }
    defer resp.Body.Close()

    // Decode the data
    var cResp OrdersResponse
    if jsonErr := json.NewDecoder(resp.Body).Decode(&cResp); jsonErr != nil {
        return int32(0), fmt.Errorf("Error parsing process count json : %v", jsonErr)
    }
    return cResp.Orders, nil
}

To connect to the cluster, let’s set up 2 client helpers, which will pick the right configuration whether we run locally (outside Docker) or in the cluster.


// clientSetup returns appropriate clientset with config either from ./kube
// or cluster api configuration
func clientSetup() (*kubernetes.Clientset, error) {
    if runningInDocker() {
        return setupDockerClusterClient()
    } else {
        return setupLocalClient()
    }
}

// setupLocalClient builds a clientset from the local ~/.kube/config (e.g. minikube)
func setupLocalClient() (*kubernetes.Clientset, error) {
    var ns string
    flag.StringVar(&ns, "namespace", "", "namespace")

    // Bootstrap k8s configuration from the local Kubernetes config file
    kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
    log.Println("Using kubeconfig file: ", kubeconfig)
    config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
        return &kubernetes.Clientset{}, err
    }

    // Create an rest client not targeting specific API version
    clientset, err := kubernetes.NewForConfig(config)
    return clientset, err
}

// setupDockerClusterClient builds a clientset from the in-cluster REST configuration.
func setupDockerClusterClient() (*kubernetes.Clientset, error) {
    log.Println("Running inside container")
    // creates the in-cluster config
    config, err := rest.InClusterConfig()
    if err != nil {
        return &kubernetes.Clientset{}, err
    }
    // creates the clientset
    clientset, err := kubernetes.NewForConfig(config)
    return clientset, err
}

Now the core logic: the main function, which runs in a loop. The important part is the use of a ticker which, unlike sleep, keeps real 10-second intervals.

// ManagedHost is the URI of the managed-service running on the cluster.
// It is declared at package level so GetManagedOrdersCount can read it.
var ManagedHost = os.Getenv("MANAGED_SERVICE_HOST")

func main() {

    // To prevent tick shifting I use time.Tick rather than time.Sleep.
    // This has its own downsides: the underlying ticker can never be
    // stopped, and channels would add safety against races in extreme cases.
    tick := time.Tick(10 * time.Second)
    tickCount := 0

    clientset, err := clientSetup()
    if err != nil {
        log.Fatal(err)
    }

    deploymentsClient := clientset.AppsV1().Deployments(apiv1.NamespaceDefault)
    ManagedDeployment := NewManagedDeploymentManager(deploymentsClient)

    // Retrieve the initial version of Deployment
    getErr := ManagedDeployment.Get()
    if getErr != nil {
        log.Fatal(fmt.Errorf("Failed to get latest version of \"orders-deployment\" Deployment: %v", getErr))
    }

    if ManagedDeployment.ReplicaCount <= 0 {
        log.Fatal(fmt.Errorf("Replica count is equal %d, this means the service is still deploying or there is an issue", ManagedDeployment.ReplicaCount))
    }

    // Get process count from managed-service
    oldOrdersCount, err := GetManagedOrdersCount()
    if err != nil {
        log.Fatal(err)
    }

    // in 10 second intervals update orders-deployment replicas if needed.
    for range tick {
        // Update replica count
        retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
            tickCount++ // just for logging purposes

            // Retrieve the latest version of Deployment before attempting to
            // run update. RetryOnConflict uses exponential backoff to avoid
            // exhausting the apiserver
            getErr := ManagedDeployment.Get()
            if getErr != nil {
                return fmt.Errorf("Failed to get latest version of \"orders-deployment\" Deployment: %v", getErr)
            }

            // Get current process count from managed-service
            currentOrdersCount, err := GetManagedOrdersCount()
            if err != nil {
                return err
            }

            // Find how many orders were created in the last 10 seconds
            ordersIn10Seconds := currentOrdersCount - oldOrdersCount

            // Orders per instance (note the integer math: the result is floored)
            ordersPerInstance := ordersIn10Seconds / ManagedDeployment.ReplicaCount

            // Overwrite the old order count for comparisons in the next tick
            oldOrdersCount = currentOrdersCount

            if ordersPerInstance >= 5 && ManagedDeployment.ReplicaCount <= 3 {
                // Increment the replica count. Note: not using ++ as it would
                // affect the second conditional comparison.
                oneMore := ManagedDeployment.ReplicaCount + 1
                updateErr := ManagedDeployment.UpdateReplicas(oneMore)
                if updateErr == nil {
                    fmt.Println("   -> Replica count changed to",
                        oneMore)
                }
                return updateErr
            } else if ordersPerInstance < 5 && ManagedDeployment.ReplicaCount > 3 {
                oneLess := ManagedDeployment.ReplicaCount - 1
                updateErr := ManagedDeployment.UpdateReplicas(oneLess)
                if updateErr == nil {
                    fmt.Println("   -> Replica count changed to",
                        oneLess)
                }
                return updateErr
            }

            return nil
        })

        if retryErr != nil {
            panic(fmt.Errorf("Update failed: %v", retryErr))
        }
    }
    // end of loop
}

That’s our core logic, responsible for scaling the deployment up and down.
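The "int math" caveat in the comments is worth spelling out: Go's integer division truncates toward zero, so a burst that does not divide evenly across replicas is rounded down before the `>= 5` comparison. A standalone sketch (the `perInstance` helper is my own, mirroring the division done in the loop):

```go
package main

import "fmt"

// perInstance mirrors the per-replica division done in the scaling loop.
func perInstance(orders, replicas int32) int32 {
	return orders / replicas
}

func main() {
	// 9 new orders across 2 replicas floors to 4,
	// so the >= 5 scale-up branch would not fire.
	fmt.Println(perInstance(9, 2))  // 4
	fmt.Println(perInstance(10, 2)) // 5
}
```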

Final main.go file

Putting it all together, here is the complete main.go file.

package main

import (
    "context"
    "encoding/json"
    "flag"
    "fmt"
    "log"
    "net/http"
    "os"
    "path/filepath"
    "strings"
    "time"

    v1 "k8s.io/api/apps/v1"
    apiv1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    appsv1 "k8s.io/client-go/kubernetes/typed/apps/v1"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/util/retry"
)

// OrdersResponse is a struct representing the JSON response from:
// http://managed-service:8080/orders
type OrdersResponse struct {
    Orders int32 `json:"count,omitempty"`
}

// ManagedDeployment represents the state used to perform deployment replica
// scaling.
type ManagedDeployment struct {
    // k8s client for the Kubernetes deployment
    Client appsv1.DeploymentInterface

    // Replica count fetched from the deployment spec
    ReplicaCount int32

    // Fetched whole deployment state
    State *v1.Deployment
}

// ManagedDeploymentInterface is the interface exposing methods
type ManagedDeploymentInterface interface {
    Get() error
    UpdateReplicas(count int32) error
}

// NewManagedDeploymentManager is a factory used to create ManagedDeployment
func NewManagedDeploymentManager(deploymentsClient appsv1.DeploymentInterface) *ManagedDeployment {
    return &ManagedDeployment{Client: deploymentsClient}
}

// ManagedHost is the URI of the managed-service running on the cluster.
// It is declared at package level so GetManagedOrdersCount can read it.
var ManagedHost = os.Getenv("MANAGED_SERVICE_HOST")

func main() {

    // To prevent tick shifting I use time.Tick rather than time.Sleep.
    // This has its own downsides: the underlying ticker can never be
    // stopped, and channels would add safety against races in extreme cases.
    tick := time.Tick(10 * time.Second)
    tickCount := 0

    // Set up the client; it might be better to use interfaces here.
    clientset, err := clientSetup()
    if err != nil {
        log.Fatal(err)
    }

    deploymentsClient := clientset.AppsV1().Deployments(apiv1.NamespaceDefault)
    ManagedDeployment := NewManagedDeploymentManager(deploymentsClient)

    // Retrieve the latest version of Deployment
    getErr := ManagedDeployment.Get()
    if getErr != nil {
        log.Fatal(fmt.Errorf("Failed to get latest version of \"orders-deployment\" Deployment: %v", getErr))
    }

    if ManagedDeployment.ReplicaCount <= 0 {
        log.Fatal(fmt.Errorf("Replica count is equal %d, this means the service is still deploying or there is an issue", ManagedDeployment.ReplicaCount))
    }

    // Get process count from managed-service
    oldOrdersCount, err := GetManagedOrdersCount()
    if err != nil {
        log.Fatal(err)
    }

    // in 10 second intervals update orders-deployment replicas if needed.
    for range tick {
        // Update replica count
        retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
            tickCount++ // just for logging purposes

            // Retrieve the latest version of Deployment before attempting to
            // run update. RetryOnConflict uses exponential backoff to avoid
            // exhausting the apiserver
            getErr := ManagedDeployment.Get()
            if getErr != nil {
                return fmt.Errorf("Failed to get latest version of \"orders-deployment\" Deployment: %v", getErr)
            }

            // Get current process count from managed-service
            currentOrdersCount, err := GetManagedOrdersCount()
            if err != nil {
                return err
            }

            // Find how many orders were created in the last 10 seconds
            ordersIn10Seconds := currentOrdersCount - oldOrdersCount

            // Orders per instance (note the integer math: the result is floored)
            ordersPerInstance := ordersIn10Seconds / ManagedDeployment.ReplicaCount

            // Overwrite the old order count for comparisons in the next tick
            oldOrdersCount = currentOrdersCount

            if ordersPerInstance >= 5 && ManagedDeployment.ReplicaCount <= 3 {
                // Increment the replica count. Note: not using ++ as it would
                // affect the second conditional comparison.
                oneMore := ManagedDeployment.ReplicaCount + 1
                updateErr := ManagedDeployment.UpdateReplicas(oneMore)
                if updateErr == nil {
                    fmt.Println("   -> Replica count changed to",
                        oneMore)
                }
                return updateErr
            } else if ordersPerInstance < 5 && ManagedDeployment.ReplicaCount > 3 {
                oneLess := ManagedDeployment.ReplicaCount - 1
                updateErr := ManagedDeployment.UpdateReplicas(oneLess)
                if updateErr == nil {
                    fmt.Println("   -> Replica count changed to",
                        oneLess)
                }
                return updateErr
            }

            return nil
        })

        if retryErr != nil {
            panic(fmt.Errorf("Update failed: %v", retryErr))
        }
    }
    // end of loop
}

// Get is used to fetch the new state and update the ManagedDeployment.
func (c *ManagedDeployment) Get() error {
    result, getErr := c.Client.Get(context.TODO(),
        "managed-deployment",
        metav1.GetOptions{})
    if getErr != nil {
        return getErr
    }
    replicaCount := *result.Spec.Replicas
    c.State = result
    c.ReplicaCount = replicaCount
    return nil
}

// UpdateReplicas is used to change the number of deployment replicas based on
// the provided count
func (c *ManagedDeployment) UpdateReplicas(count int32) error {
    c.State.Spec.Replicas = int32Ptr(count)
    _, updateErr := c.Client.Update(context.TODO(),
        c.State,
        metav1.UpdateOptions{})
    if updateErr != nil {
        return updateErr
    }
    c.ReplicaCount = count
    return nil
}

// GetManagedOrdersCount asks the managed-service API for the running order
// count. It deserializes the JSON response body and returns the number.
func GetManagedOrdersCount() (int32, error) {
    url := "http://" + ManagedHost + "/orders"
    r := strings.NewReader("")
    req, clientErr := http.NewRequest("GET", url, r)
    if clientErr != nil {
        return int32(0),
            fmt.Errorf("Prepare request : %v", clientErr)
    }
    req.Header.Set("Content-Type", "application/json")

    client := &http.Client{}

    resp, reqErr := client.Do(req)
    if reqErr != nil {
        return int32(0),
            fmt.Errorf("Failed to query the process count : %v", reqErr)
    }
    defer resp.Body.Close()

    // Decode the data
    var cResp OrdersResponse
    if jsonErr := json.NewDecoder(resp.Body).Decode(&cResp); jsonErr != nil {
        return int32(0), fmt.Errorf("Error parsing process count json : %v", jsonErr)
    }
    return cResp.Orders, nil
}

// int32Ptr is used to create the pointer necessary for the replica count
func int32Ptr(i int32) *int32 { return &i }

// runningInDocker checks if /.dockerenv is present, meaning the process is
// running in a container. I'm aware this is not a deterministic check for all
// setups, but it works fine for this example minikube app.
func runningInDocker() bool {
    if _, err := os.Stat("/.dockerenv"); err == nil {
        return true
    } else if os.IsNotExist(err) {
        return false
    } else {
        // Schrodinger cat, this might be container or not :)
        return false
    }
}

// clientSetup returns the appropriate clientset, with config either from
// ~/.kube/config or the in-cluster API configuration
func clientSetup() (*kubernetes.Clientset, error) {
    if runningInDocker() {
        return setupDockerClusterClient()
    } else {
        return setupLocalClient()
    }
}

// setupLocalClient builds a clientset from the local ~/.kube/config
func setupLocalClient() (*kubernetes.Clientset, error) {
    var ns string
    flag.StringVar(&ns, "namespace", "", "namespace")

    // Bootstrap k8s configuration from the local Kubernetes config file
    kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
    log.Println("Using kubeconfig file: ", kubeconfig)
    config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
        return &kubernetes.Clientset{}, err
    }

    // Create a clientset not targeting a specific API version
    clientset, err := kubernetes.NewForConfig(config)
    return clientset, err
}

// setupDockerClusterClient builds a clientset from the in-cluster REST configuration.
func setupDockerClusterClient() (*kubernetes.Clientset, error) {
    log.Println("Running inside container")
    // creates the in-cluster config
    config, err := rest.InClusterConfig()
    if err != nil {
        return &kubernetes.Clientset{}, err
    }
    // creates the clientset
    clientset, err := kubernetes.NewForConfig(config)
    return clientset, err
}

Bash array as a Terraform list of strings input variable

Terraform supports assigning a list of strings to an input variable, but not in the Bash array format, at least at the time of writing.

Let's say we have a Bash array like the one below.

BASH_ARRAY=( "a" "b" "c" )

And we want to pass the Bash array to a Terraform variable declared as a list of strings.

variable "example" {
  type = list(string)
}

To do it, we need to transform the Bash array into a string which looks like this (the inner quotes must survive shell expansion):

["a","b","c"]

To do it, use the Bash below.

# define array
BASH_ARRAY=( "a" "b" "c" )

# convert array to a serialized list
ARRAY_WITH_QUOTES=()
for ENTRY in "${BASH_ARRAY[@]}";
do
  ARRAY_WITH_QUOTES+=( "\"${ENTRY}\"" )
done
# quote the expansion so [ ] is not treated as a glob pattern
TERRAFORM_LIST=$(IFS=,; echo "[${ARRAY_WITH_QUOTES[*]}]")

# pass it to terraform
terraform plan -var "example=${TERRAFORM_LIST}"

Done. This way you can use a Bash array as a Terraform list-of-strings input variable.
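The steps above can also be packaged into a small reusable function (the `to_tf_list` name is my own):

```shell
#!/usr/bin/env bash
# to_tf_list serializes its arguments into a Terraform list(string) literal.
to_tf_list() {
  local quoted=()
  local entry
  for entry in "$@"; do
    quoted+=( "\"${entry}\"" )
  done
  # Join with commas inside a subshell so IFS is not changed globally.
  ( IFS=,; echo "[${quoted[*]}]" )
}

BASH_ARRAY=( "a" "b" "c" )
to_tf_list "${BASH_ARRAY[@]}"   # prints ["a","b","c"]
```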