
Daily Log — 2026-06-29: is the fix actually live, AKS internal LB probes, and a field-dropping webhook

Kobbi Gal
I like to pick things apart and see how they work inside

Another sanitized log — the transferable parts of a day that was mostly "where exactly does this run, and is the thing I think is fixed actually fixed?". Four lessons worth keeping.

1. "Is the fix actually in production?" — trace the whole chain

A change was merged, the branch was deleted, and the question was simple: is it live in this environment or not? The honest answer needs a chain of evidence, not a vibe:

  1. Commit → tag. Find the exact commits that carry the fix and the release tag they landed in (e.g. a service-scoped tag like v1.0.7-<service>). A merged-and-deleted branch still has its commits reachable from the tag.
  2. Tag → build → promote → deploy. Follow the CI: the build-and-push run for those commits, then the promote and deploy pipelines. "Merged" and "deployed" are different events, often hours or days apart.
  3. Deployed → running. Hit the service's own /status (or equivalent) in each environment and read back the version string. Compare it to the commit SHA / tag from step 1. Do this per environment — staging and prod drift.
  4. Scope the blast radius. Confirm which service even contains the change. A fix can live in one binary (say, a gateway) and not another (a control-plane service), so "deployed" for one doesn't mean the path the customer hits is patched.

The trap is stopping at step 1 ("it's merged") or step 2 ("the pipeline went green"). Only the /status version compared against the SHA actually answers the question.
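Step 3 is easy to script. The following is a minimal Go sketch, not a definitive tool: the /status payload shape (`version` and `commit` fields), the function names, and the command-line arguments are all assumptions for illustration, so adjust them to whatever your service actually returns.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// statusResponse is an assumed shape for /status; adjust to the real payload.
type statusResponse struct {
	Version string `json:"version"` // e.g. a release tag
	Commit  string `json:"commit"`  // full or abbreviated SHA
}

// runsBuild reports whether a /status body advertises the expected tag, or a
// commit that starts with the expected (possibly abbreviated) SHA.
func runsBuild(body []byte, wantTag, wantSHA string) (bool, error) {
	var s statusResponse
	if err := json.Unmarshal(body, &s); err != nil {
		return false, fmt.Errorf("decode /status: %w", err)
	}
	if wantTag != "" && s.Version == wantTag {
		return true, nil
	}
	return wantSHA != "" && strings.HasPrefix(s.Commit, wantSHA), nil
}

// checkEnv fetches one environment's /status and compares it to the fix.
func checkEnv(client *http.Client, baseURL, wantTag, wantSHA string) (bool, error) {
	resp, err := client.Get(strings.TrimRight(baseURL, "/") + "/status")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return runsBuild(body, wantTag, wantSHA)
}

func main() {
	// Usage: go run . <tag> <sha> <base-url> [<base-url>...]
	if len(os.Args) < 4 {
		fmt.Println("usage: <tag> <sha> <base-url> [<base-url>...]")
		return
	}
	client := &http.Client{Timeout: 10 * time.Second}
	for _, env := range os.Args[3:] {
		ok, err := checkEnv(client, env, os.Args[1], os.Args[2])
		if err != nil {
			fmt.Printf("%s: could not verify: %v\n", env, err)
			continue
		}
		fmt.Printf("%s: fix live = %t\n", env, ok)
	}
}
```

An exact tag or SHA match is the simplest check. If an environment already runs a later release, a "no match" here is not proof the fix is missing; in that case confirm ancestry in the repository (for example with `git merge-base --is-ancestor <fix-sha> <deployed-sha>`).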

2. AKS internal load balancer health probes

When you put service.beta.kubernetes.io/azure-load-balancer-internal: "true" on a Service, Azure provisions a load balancer with its own health probe — separate from your pod liveness/readiness probes. Two things bite people:

  • Default probe path is /. Unless you set service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path, an HTTP/HTTPS probe hits GET /. If your service exposes a port where / isn't a valid GET, you get a steady drip of 4xx in your app logs from the probe itself.
  • It probes per exposed port. A multi-port Service can get probed on ports you never meant to health-check over HTTP.

The knobs (all Service annotations):

service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "http"   # or tcp/https
service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"
service.beta.kubernetes.io/azure-load-balancer-health-probe-port: "8080"

Reference: AKS — configure the standard load balancer and internal LB. If you only remember one thing: a managed LB's health probe is infrastructure you configure via Service annotations, not a pod probe — and its default of / is a classic source of phantom log noise.

3. A mutating webhook silently drops fields its typed structs don't know about

A nasty one: a mutating admission webhook was deleting spec.resourceClaims fields from pods on Kubernetes 1.32+, breaking any workload using DRA (Dynamic Resource Allocation).

The webhook itself never touches resourceClaims. The damage comes from how the patch is built:

  1. Decode the raw AdmissionReview object into a typed corev1.Pod.
  2. Run the mutator (add the sidecar/env/volume it actually cares about).
  3. json.Marshal() the entire typed object back to JSON.
  4. Diff the marshaled typed object against the original raw JSON to produce the JSON patch.

The trap is the version skew between your typed struct and the cluster. By Kubernetes 1.32 the DRA pod API is flattened: the nested Source ClaimSource field (which held resourceClaimName / resourceClaimTemplateName) is gone, and those two fields sit directly on PodResourceClaim:

// k8s.io/api v0.29.0
type PodResourceClaim struct {
	Name   string
	Source ClaimSource // { ResourceClaimName, ResourceClaimTemplateName }
}

// k8s.io/api v0.32.x (flattened)
type PodResourceClaim struct {
	Name                      string
	ResourceClaimName         *string
	ResourceClaimTemplateName *string
}

A webhook compiled against k8s.io/api v0.29.0 has no fields for the 1.32 shape. So when it re-marshals the whole pod in step 3, those fields don't exist on the struct and vanish — and the diff in step 4 emits a patch that removes them from the real object.

Lessons:

  • If you re-serialize a whole typed object to build an admission patch, you are coupled to your k8s.io/api version. Any field a newer API server added that your struct lacks gets dropped.
  • Prefer patching only what you touch. Mutate via unstructured / raw JSON, or emit a JSON patch (the only patch type an AdmissionReview response accepts) scoped to the paths you actually change, instead of "decode typed → marshal everything → diff".
  • There's no label or config workaround for this class of bug — the dependency is compiled into the binary. The fix is to bump the k8s client libraries to match the clusters you admit and ship a new build. (Turning the webhook off "fixes" it only by removing the feature.)
  • Keep your admission webhooks' client-go/api versions within shouting distance of the newest Kubernetes you run.

4. Debugging a Terraform provider 404

A provider resource (creating a Postgres dynamic secret) returned a 404 for a customer — the same shape as an existing open issue for a different rotated-secret resource. The useful moves:

  • Map resource → API path. Read the provider source to find which backend endpoint the resource calls; a 404 usually means the path/verb the provider builds doesn't match what the server exposes (often a versioned or gateway-routed endpoint).
  • Turn on the logs. TF_LOG=DEBUG terraform apply usually surfaces the actual request URL and body (as long as the provider logs its HTTP traffic), and that alone often reveals the wrong path or a missing field.
  • Search issues before filing. Matching the error shape to an existing issue ties yours to prior context (and sometimes a known fix/version).

Lesson that keeps repeating: a 404 from a client library is rarely "not found" in the human sense — it's "the client asked for a URL the server doesn't serve". Print the URL.