HashiCorp Vault + Vault Secrets Operator + GCP for imagePullSecrets

Summary

The need for this mix of buzzwords comes from a very specific use case. For all of my production hosting I use Google Cloud. For my local environment it's podman+kind provisioned by Terraform.

Usually, to load container images, I will build them locally and push them into kind. I do this to remove the need for an internet connection to do my work. But it got me thinking: if I wanted to, couldn't I just pull from my us.gcr.io private repository?

Sure, I could load a static key, but I'd likely forget about it and that long-lived credential could become an attack vector for compromise. I decided to play with Vault to see if I could accomplish this. Spoiler: you can, but there aren't great instructions for it!

Why Vault?

There are a great many articles on why Vault or any secrets manager is a great idea. What it comes down to is minimizing the time a credential is valid by using short-lived credentials, so that if one is compromised, the window of that compromise is minimized.

Vault Setup

I will not go into full detail on the setup, but Vault was deployed via its Helm chart into the K8s cluster, and I used this guide from HashiCorp to enable the GCP secrets engine (a CLI sketch of those steps follows below).

Your gcpbindings.hcl will need to look something like this at a minimum. You likely don’t need the roles/viewer.

 resource "//cloudresourcemanager.googleapis.com/projects/woohoo-blog-2414" {
        roles = ["roles/viewer", "roles/artifactregistry.reader"]
      }

For the roleset, I called mine "app-token", which you will see later.
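For reference, enabling the GCP secrets engine and creating that roleset from the Vault CLI might look something like the sketch below. This assumes Vault already has GCP credentials configured and that the bindings file is the gcpbindings.hcl shown above.

vault secrets enable gcp

vault write gcp/roleset/app-token \
    project="woohoo-blog-2414" \
    secret_type="access_token" \
    token_scopes="https://www.googleapis.com/auth/cloud-platform" \
    bindings=@gcpbindings.hcl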

The values I used for Vault's Helm chart were simply as follows, because I don't need the injector and I don't think it would even work for what we're trying to do.

#vault values.yaml
injector:
  enabled: false
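Installing Vault with those values might look something like this (the release name and namespace are whatever you prefer; I assume the hashicorp Helm repo has not been added yet):

helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
helm install vault hashicorp/vault -f values.yaml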

For the Vault Secrets Operator, it was simply the values below, since Vault was installed in the default namespace. I did this for simplicity, just to get it up and running. A lot of the steps I will share ARE NOT BEST PRACTICES, but they will help you get things up quickly so you can then learn the best practices. This includes disabling client caching and leaving its storage unencrypted (which is the default, BUT NOT BEST PRACTICE). Ideally, client caching is enabled to allow near zero downtime upgrades, and the cache is therefore encrypted in transit and at rest.

defaultVaultConnection:
  enabled: true
  address: "http://vault.default.svc.cluster.local:8200"
  skipTLSVerify: false
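Installing the operator with those values might look something like this (the release name and values file name here are just placeholders):

helm install vault-secrets-operator hashicorp/vault-secrets-operator -f vso-values.yaml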

Vault Operator CRDs

First we will start with a VaultConnection and a VaultAuth. This is how the Operator will connect to and authenticate with Vault.

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultConnection
metadata:
  name: vault-connection
  namespace: default
spec:
  # required configuration
  # address to the Vault server.
  address: http://vault.default.svc.cluster.local:8200
---
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultAuth
metadata:
  name: static-auth
  namespace: default
spec:
  vaultConnectionRef: vault-connection
  method: kubernetes
  mount: kubernetes
  kubernetes:
    role: test
    serviceAccount: default

The test role attaches to a policy (which I called test) that looks like this:

path "gcp/roleset/*" {
    capabilities = ["read"]
}

This allows us to read the "gcp/roleset/app-token/token" path. The policy above should likely be more specific, such as "gcp/roleset/app-token/+", to lock it down to only the tokens we intend to read. A sketch of the Vault-side setup for this auth role and policy follows.
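Setting up the Kubernetes auth method, the test policy, and the test role on the Vault side might look roughly like this sketch. The policy file name, bound service account, namespace, and TTL are assumptions chosen to match the VaultAuth above.

vault auth enable kubernetes

vault write auth/kubernetes/config \
    kubernetes_host="https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT"

vault policy write test test-policy.hcl

vault write auth/kubernetes/role/test \
    bound_service_account_names=default \
    bound_service_account_namespaces=default \
    policies=test \
    ttl=1h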

All of this gets us to the VaultStaticSecret CRD.

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  annotations:
    imageRepository: us.gcr.io
  name: vso-gcr-imagepullref
spec:
  # This is important, otherwise it will try to pull from gcp/data/roleset
  type: kv-v1

  # mount path
  mount: gcp

  # path of the secret
  path: roleset/app-token/token

  # dest k8s secret
  destination:
    name: gcr-imagepullref
    create: true
    type: kubernetes.io/dockerconfigjson
    #type: Opaque
    transformation:
      excludeRaw: true
      excludes:
        - .*
      templates:
        ".dockerconfigjson":
          text: |
            {{- $hostname := .Annotations.imageRepository -}}
            {{- $token := .Secrets.token -}}
            {{- $login := printf "oauth2accesstoken:%s" $token | b64enc -}}
            {{- $auth := dict "auth" $login -}}
            {{- dict "auths" (dict $hostname $auth) | mustToJson -}}

  # static secret refresh interval
  refreshAfter: 30s

  # Name of the CRD to authenticate to Vault
  vaultAuthRef: static-auth

The bulk of this is in the transformation.templates section. This is the magic. We can easily pull the token, but it's not in a format that Kubernetes can understand and use. Most of the template exists to format the output so it mirrors the dockerconfigjson format.

To make it more clear, we use an annotation to store the repository hostname.

In case the template text is a little confusing, a more readable version of the template section would be as follows.

{{- $hostname := "us.gcr.io" -}}
{{- $token := .Secrets.token -}}
{{- $login := printf "oauth2accesstoken:%s" $token | b64enc -}}
{
  "auths": {
    "{{ $hostname}}": {
      "auth": "{{ $login }}"
    }
  }
}

Apply the manifest, and if all went well you should have a secret named "gcr-imagepullref", which you can reference in the "imagePullSecrets" section of your workload manifests, as shown below.
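A minimal example of consuming it in a Pod spec (the pod name and image path are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: private-image-test
spec:
  containers:
    - name: app
      image: us.gcr.io/woohoo-blog-2414/app:latest
  imagePullSecrets:
    - name: gcr-imagepullref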

In Closing

In closing, we leveraged the GCP secrets engine and Kubernetes auth to obtain time-limited OAuth tokens and inject them into a secret used for pulling images from a private repository. There are a number of situations where you may want to do something like this, such as when you're multi-cloud but want to use one repository, or when you have on-premises clusters but want to use your cloud repository. Compared to pulling a long-lived key, this is more secure and minimizes the attack surface.

Following the best practices will help as well, such as limiting the scope of roles and ACLs and enabling encryption of the data at rest and in transit.

For more on the transformation templating, you can go here.

Terraform For Local Environments (podman+kind)

Summary

I do most of my containerization work locally using podman & kind. It’s an easy way to spin up a local environment. From time to time I want to upgrade the K8s version or just completely blow it away.

With kind it is pretty simple…

kind delete cluster --name=<cluster_name>

kind create cluster --name=<cluster_name>

I then load in my Mozilla SOPS key and run my bootstrap script for FluxCD.

But Then I Got Lazy

Over the weekend, there was an interesting Podman Desktop bug which caused my kube-apiserver to peg the CPU. It took a bit of fiddling and recreating the cluster a few times to sort out.

So I got lazy and wrote some terraform to do it for me.

Providers

For this I used a few Terraform providers, namely tehcyx/kind, alekc/kubectl, integrations/github and hashicorp/kubernetes; a sketch of the provider requirements is below.
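A minimal required_providers block for these might look like the following (version constraints omitted; pin them to whatever you have tested):

terraform {
  required_providers {
    kind = {
      source = "tehcyx/kind"
    }
    kubectl = {
      source = "alekc/kubectl"
    }
    github = {
      source = "integrations/github"
    }
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
  }
}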

TF Resources

For everything, we need a kind cluster. This is pretty simple. The key is that we want wait_for_ready because we'll be doing further actions against the cluster. The node_image is optional; if omitted, kind will just pick the latest.

resource "kind_cluster" "this" {
  name = var.kind_cluster_name
  node_image = var.kind_node_image
  wait_for_ready = true
}

We then want to apply two manifests, since Flux has already been bootstrapped and set up.

These two data sources will pull the appropriate manifests from the repository. The components manifest contains the Flux components and their base dependencies. The sync manifest is the actual sync configuration (what to sync, where to sync from, etc.).

data "github_repository_file" "gotk-components" {
  repository          = "${var.github_org}/${var.github_repository}"
  branch              = "main"
  file                = var.gotk-components_path
}

data "github_repository_file" "gotk-sync" {
  repository          = "${var.github_org}/${var.github_repository}"
  branch              = "main"
  file                = var.gotk-sync_path
}

Because these manifests have multiple documents, we need to use another data source since kubectl_manifest can only apply a single document at a time.

data "kubectl_file_documents" "gotk-components" {
    content = data.github_repository_file.gotk-components.content
}

data "kubectl_file_documents" "gotk-sync" {
    content = data.github_repository_file.gotk-sync.content
}

We then loop through the components with a for_each:

resource "kubectl_manifest" "gotk-components" {
  depends_on = [ kind_cluster.this ]
  for_each  = data.kubectl_file_documents.gotk-components.manifests
  yaml_body = each.value
}

Before we can apply the sync section, we need to ensure the Mozilla SOPS age.key is applied. We have sensitive data in this environment, and this key allows Flux to decrypt it. In other environments this may be a key vault or KMS.

resource "kubernetes_secret" "sops" {
  depends_on = [kubectl_manifest.gotk-components]
  metadata {
    name = "sops-age"
    namespace = "flux-system"
  }
  data = {
    "age.agekey" = file(var.sops_age_key_path)
  }
}

Finally, we apply the sync configuration and we're done!

resource "kubectl_manifest" "gotk-sync" {
  depends_on = [ kind_cluster.this, kubernetes_secret.sops ]
  for_each  = data.kubectl_file_documents.gotk-sync.manifests
  yaml_body = each.value
}

Finale!

From here it's terraform apply and we're off!
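The whole kind-delete/kind-create/bootstrap routine now collapses to roughly this (variables come from defaults or a tfvars file):

terraform init
terraform apply

# and when I want to blow it all away and start fresh
terraform destroy
terraform apply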

OpenTelemetry In Golang

Summary

I was recently working on a project that involved VueJS, Golang (Go) and Mongo. For the API layer in Go, it was time to instrument it with metrics, logs and traces. I was using Gin due to its ease of setup and ability to handle JSON data.

Parts of the instrumentation were easy. For example, traces worked out of the box with the otelgin middleware. Metrics had some examples going around but needed some work, and logs were a pain.

The beauty of OpenTelemetry (OTEL) is that you can instrument your application with it, and it does not matter where you send the telemetry on the back end; most of the big-name vendors support OTLP directly.

Go + Gin + Middleware

Go web frameworks have the concept of middleware, which makes it really easy to monitor or adjust a request in flight, and Gin is no exception. Gin by default (via gin.Default()) applies two middlewares: gin.Logger() and gin.Recovery(). Logger implements a simple logger to the console. Recovery recovers from any panics and returns a 5xx error. A small sketch of the pattern is below.
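To illustrate the pattern, a bare-bones Gin setup (this generic sketch is not from the project itself) might look like this:

package main

import "github.com/gin-gonic/gin"

func main() {
	// gin.New() starts with no middleware; gin.Default() would wire up
	// gin.Logger() and gin.Recovery() automatically.
	router := gin.New()
	router.Use(gin.Logger(), gin.Recovery())

	router.GET("/", func(c *gin.Context) {
		c.JSON(200, gin.H{"status": "ok"})
	})

	router.Run(":8080")
}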

The otelgin middleware mentioned above simply takes the context of the HTTP request, and with a properly set up OpenTelemetry tracer and internal propagation of context, it will export spans to whatever tracing tool you use that supports OpenTelemetry.

Initializing and Using OTEL Tracing

Initializing the tracer is pretty simple but rather lengthy.

I have a "func InitTracer() func(context.Context) error" function that handles this. For those not terribly familiar with Go, this is a function that returns another function which takes a context and returns an error; the returned function is used later for cleanup.

// Imports used by this snippet (package paths from the OpenTelemetry Go SDK).
import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
)

func InitTracer() func(context.Context) error {
	//TODO: Only do cleanup if we're using OTLP
	if os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") == "" {
		return func(ctx context.Context) error {
			//log.Print("nil cleanup function - success if this is without OTEL!")
			return nil
		}
	}
	exporter, err := otlptrace.New(
		context.Background(),
		otlptracegrpc.NewClient(),
	)

	if err != nil {
		panic(err)
	}

	resources, err := resource.New(
		context.Background(),
		resource.WithAttributes(
			attribute.String("library.language", "go"),
		),
	)
	if err != nil {
		//log.Print("Could not set resources: ", err)
	}

	otel.SetTracerProvider(
		tracesdk.NewTracerProvider(
			tracesdk.WithSampler(tracesdk.AlwaysSample()),
			tracesdk.WithBatcher(exporter),
			tracesdk.WithResource(resources),
		),
	)
	// Baggage may submit too much sensitive data for production
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

	return exporter.Shutdown
}

The actual usage of this in func main() might look something like this

tracerCleanup := InitTracer()
//TODO: I don't think this defer ever runs
defer tracerCleanup(context.Background())

If you use multiple packages, this initialization will persist across them, since it is configured as the global tracer provider for the process.
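For example, any other package in the service can pick up that global provider without extra wiring; a minimal sketch (the tracer name and span name here are made up):

import (
	"context"

	"go.opentelemetry.io/otel"
)

func doWork(ctx context.Context) {
	// otel.Tracer() pulls from the globally registered provider.
	tracer := otel.Tracer("github.com/example/myapp/worker")
	ctx, span := tracer.Start(ctx, "doWork")
	defer span.End()

	// ... do the actual work with ctx so downstream calls join this trace ...
	_ = ctx
}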

From there it's just a matter of using the middleware from the otelgin package:

router.Use(otelgin.Middleware(os.Getenv("OTEL_SERVICE_NAME")))

That is really it. It mostly works out of the box.

Initializing and Using OTEL Metrics

Metrics were a little more difficult. I couldn't find a suitable example online, so I ended up writing my own. It initializes the same way, calling:

meterCleanup := otelmetricsgin.InitMeter()
defer meterCleanup(context.Background())

router.Use(otelmetricsgin.Middleware())

You want this registered early in the middleware chain, because we're starting a timer to capture latency.

Key Notes About My otelginmetrics

The first thing to note is that it is the quickest and dirtiest middleware I could possibly put together. There are much better and more eloquent ways of doing it, but I needed something that worked.

It exports two metrics. One is http_server_requests_total, the total number of requests. The other is http_server_request_duration_seconds, the duration in seconds of each request. The http_server_request_duration_seconds metric is a histogram with quite a few tags, so it can be split by HTTP method, HTTP status code, URI, and the hostname of the node serving the request.

Prometheus-style histograms are out of scope for this article, but perhaps another. In short, they are time series metrics that are slotted into buckets; in our case, we're slotting them into buckets of response time. Because the default OTEL buckets are poor for latency in seconds (which should almost always be less than 1), I opted to adjust the buckets on this metric to 0.005, 0.01, 0.05, 0.5, 1, 5. A sketch of how those instruments might be declared follows.
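For reference, declaring the two instruments with those buckets might look roughly like this (the package name, meter name, and variable names are assumptions rather than the exact code in my middleware):

package otelmetricsgin

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

var (
	meter = otel.Meter("otelmetricsgin")

	// Total number of HTTP requests served.
	requestsTotal, _ = meter.Int64Counter(
		"http_server_requests_total",
		metric.WithDescription("Total number of HTTP requests"),
	)

	// Request latency histogram with buckets suited to sub-second responses.
	requestDuration, _ = meter.Float64Histogram(
		"http_server_request_duration_seconds",
		metric.WithDescription("HTTP request duration in seconds"),
		metric.WithExplicitBucketBoundaries(0.005, 0.01, 0.05, 0.5, 1, 5),
	)
)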

Initializing and Using OTEL Logs

Both the metrics and traces APIs for OTEL in Go are considered stable. Logs, however, are beta, and it shows. It was a bit more complicated to get through, but it is possible!

The first issue is that the default log package in Go does not have any middleware that supports this. As of Go 1.21, slog (structured logging) became available, and it uses JSON format to output rich logging. OTEL doesn't let you call the logging API directly; it provides what it calls bridges so other logging libraries can call into it. For this I used the otelslog API bridge. It initializes similarly.

// Imports used by this snippet (package paths from the OpenTelemetry Go SDK).
import (
	"context"
	"os"

	"go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc"
	"go.opentelemetry.io/otel/log/global"
	"go.opentelemetry.io/otel/sdk/log"
)

func InitLog() func(context.Context) error {

	//TODO: Only do cleanup if we're using OTLP
	if os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") == "" {
		return func(ctx context.Context) error {
			//log.Print("nil cleanup function - success if this is without OTEL!")
			return nil
		}
	}
	ctx := context.Background()

	exporter, err := otlploggrpc.New(ctx)
	if err != nil {
		panic("failed to initialize exporter")
	}

	// Create the logger provider
	lp := log.NewLoggerProvider(
		log.WithProcessor(
			log.NewBatchProcessor(exporter),
		),
	)

	global.SetLoggerProvider(lp)

	return lp.Shutdown
}

And then the usage:

logger := otelslog.NewLogger(os.Getenv("OTEL_SERVICE_NAME"))

// Health Checks will spam logs, we don't need this
filter := sloggin.IgnorePath("/")

config := sloggin.Config{
	WithRequestID: true,
	Filters:       []sloggin.Filter{filter},
}

router.Use(sloggin.NewWithConfig(logger, config))

From here, the sloggin middleware for Gin instruments logging on every request with request and response information. An example might look something like the screenshot below.

Datalog Log & Trace Correlation

In the above screenshot you see an otel.trace_id and otel.span_id. Unfortunately, Datadog cannot use these directly; they need to be converted into dd.trace_id and dd.span_id. We needed to override the logger to somehow inject these. That expertise was way beyond my skill set, but I did find someone who could do it and had documented it on their blog. The code did not compile as-is and required some adjusting, along with Datadog's conversion.

To save people some trouble I published my updated version.

To use it, we import it under a different name to avoid a conflict:

import (
   newslogin "github.com/dchapman992000/otelslog"
)

func main() {
    ....
    // This was our first slog logger
    logger := otelslog.NewLogger(os.Getenv("OTEL_SERVICE_NAME"))
    //This is the new one where we inject our new one into it using the embedded structs and promotions in Go
    logger = newslogin.InitialiseLogging(logger.Handler())
}

You can then see in the screenshot that, when pulling up the logs, we have the ability to see the related traces, and it all works!