Bug 1771773 - [cluster-autoscaler-operator] webhooks are potentially broken by service CA rotation
Summary: [cluster-autoscaler-operator] webhooks are potentially broken by service CA r...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: 4.4.0
Assignee: Brad Ison
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-12 23:00 UTC by Maru Newby
Modified: 2020-05-15 14:45 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-15 14:45:33 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-autoscaler-operator pull 132 None closed Bug 1771773: webhooks: Rely on service-ca-operator for CA injection 2020-05-15 14:35:08 UTC

Description Maru Newby 2019-11-12 23:00:23 UTC
It appears that the operator sets the CA bundle for the webhooks it deploys at most once per invocation:

https://github.com/openshift/cluster-autoscaler-operator/blob/master/pkg/operator/webhookconfig.go#L44

In the event of the serving CA bundle changing after service CA rotation, the operator should be capable of automatically detecting the change and updating the CA bundle of its webhooks. The 'Refresh Strategies' section of the linked compatibility doc catalogs potential strategies for responding to changes in key material supplied by the service ca operator.

Note that CA rotation can be manually triggered in any 4.x release by removing the signing secret. Automated rotation is likely to be introduced in a future z-stream release. 

Reference: 

Enhancement for automated service CA rotation: 

https://github.com/openshift/enhancements/blob/master/enhancements/automated-service-ca-rotation.md

Operator compatibility with service ca rotation:

https://docs.google.com/document/d/1NB2wUf9e8XScfVM6jFBl8VuLYG6-3uV63eUpqmYE8Ts/edit

Comment 1 Brad Ison 2020-01-28 11:58:15 UTC
The linked PR uses the recent support in the service-ca-operator for having the CA injected automatically, which I think should solve this going forward.

Comment 3 Jianwei Hou 2020-02-06 07:54:28 UTC
Verified in 4.4.0-0.nightly-2020-02-05-181112.

The ValidatingWebhookConfiguration autoscaling.openshift.io has been annotated:
```
# oc get ValidatingWebhookConfiguration autoscaling.openshift.io -n openshift-machine-api -o=jsonpath="{.metadata.annotations}"
map[service.beta.openshift.io/inject-cabundle:true]
```

The cluster-autoscaler-operator log showed webhook configuration was created:
```
oc logs cluster-autoscaler-operator-6f8bfc96bc-cdfpz -c cluster-autoscaler-operator
I0206 03:24:02.136008       1 webhookconfig.go:72] Webhook configuration status: created
```

The service-ca showed validatingwebhookconfiguration is updated.

```
oc logs service-ca-7f884bd857-wjnj6 -n openshift-service-ca
 I0206 03:24:02.133106       1 validatingwebhook.go:55] updating validatingwebhookconfiguration autoscaling.openshift.io with the service signing CA bundle
```

By deleting the signing-key secret from openshift-service-ca, ValidatingWebhookConfiguration got a new value for caBundle.

Comment 4 Maru Newby 2020-02-17 20:51:49 UTC
I'm in the process of backporting service CA rotation to 4.2 and 4.3. Would it make sense to ensure that the autoscaler operator was similarly compatible with CA rotation in those releases? Note that it may not be possible to backport service CA bundle injection for webhooks due to the considerable code changes involved, and it may instead be necessary to rely on a sidecar running NewFileWatcherWatchdog (as mentioned in the operator compatibility doc).

Comment 5 Brad Ison 2020-02-18 15:38:27 UTC
That's a good question. It's pretty easy on the cluster-autoscaler-operator side. Though it would be nice if we could interrogate the service-ca-operator to be sure it supported CA injection, otherwise we could end up with silently broken webhooks.

The validating webhooks are a convenience for the user really, and they are set to ignore failures -- the validation errors will just be reported in events / logs / status instead. Also, the operator will pick up the new CA on restart in 4.2 and 4.3, so it will eventually converge on a working state. Given that, I'm not really sure it's worth it to backport.

Comment 6 Maru Newby 2020-02-18 15:51:17 UTC
CA injection won't be supported by the service ca operator in releases prior to 4.4, so supporting maintenance of the ca bundle in 4.2 and 4.3 would need to implement a different mechanism (as per my mention of NewFileWatcherWatchdog).

The risk of not responding to ca bundle changes in previous releases is that the webhook will break in the event that a user upgrades to a z-stream release that supports automated rotation but does not restart pods prior to expiry of the pre-rotation CA.  There is a non-zero chance that a cluster administrator will not read the release notes and neglect to perform the one-time manual pod restart after the initial CA rotation. While it may be acceptable for the webhook to ignore validation failures, what would be the impact of a validation webhook that the apiserver couldn't call due to TLS verification failure?

Comment 7 Brad Ison 2020-02-18 16:26:35 UTC
> CA injection won't be supported by the service ca operator in releases prior
> to 4.4, so supporting maintenance of the ca bundle in 4.2 and 4.3
> would need to implement a different mechanism (as per my mention of
> NewFileWatcherWatchdog).

Ah, okay.  Just rotation of the CA is being backported, not the additional
injection options.

> While it may be acceptable for the webhook to ignore validation failures,
> what would be the impact of a validation webhook that the apiserver
> couldn't call due to TLS verification failure?

That's what I mean by "they are set to ignore failures."  See here:

  https://github.com/openshift/cluster-autoscaler-operator/blob/master/pkg/operator/webhookconfig.go#L82
  https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#failure-policy

The request will still succeed and the API server will log the failure.

Anyway, we could certainly add a sidecar.  If this is a problem shared by
other operators in the 4.2 and 4.3 z-streams, it would be nice to have a
standard solution we could use.

Comment 8 Maru Newby 2020-02-18 17:18:27 UTC
(In reply to Brad Ison from comment #7)

> That's what I mean by "they are set to ignore failures."  See here:

> The request will still succeed and the API server will log the failure.


Ah, ok - TIL.  Given that a TLS failure is non-critical, then, up to you whether the cost of fixing in previous releases is worth it.

> 
> Anyway, we could certainly add a sidecar.  If this is a problem shared by
> other operators in the 4.2 and 4.3 z-streams, it would be nice to have a
> standard solution we could use.

I agree that a standard solution would be nice, the implementation variance between operators makes that challenging. Most operators are already compatible with rotation one way or the other, and common strategies for doing so (of which sidecard is one) are are listed in the compatibility doc (linked at the end of the bz description).


Note You need to log in before you can comment on or make changes to this bug.