Bug 1771809 - The metrics/healthz endpoint of kube-controller-manager potentially broken by service CA rotation
Summary: The metrics/healthz endpoint of kube-controller-manager potentially broken by...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.4
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: 4.4.0
Assignee: Sally
QA Contact: zhou ying
URL:
Whiteboard:
: 1776409 (view as bug list)
Depends On:
Blocks: 1776409
TreeView+ depends on / blocked
 
Reported: 2019-11-13 03:04 UTC by Maru Newby
Modified: 2020-05-13 21:52 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1776409 (view as bug list)
Environment:
Last Closed: 2020-05-13 21:52:39 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-controller-manager-operator pull 316 0 None closed Bug 1771809: Add e2e for PodDisruptionBudgetAtLimit alert 2021-02-04 06:42:24 UTC
Github openshift origin pull 24169 0 'None' closed Bug 1771809: add PodDisruptionBudgetAtLimit alert e2e 2021-02-04 06:42:24 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-13 21:52:41 UTC

Description Maru Newby 2019-11-13 03:04:06 UTC
A serving cert supplied by the service ca operator appears to be used to secure the healthz/metrics endpoint of the kube controller manager. If the serving cert is regenerated (i.e. when the service CA is rotated), will the healthz/metrics endpoint be refreshed or will the controller manager be restarted?

If not, the 'Refresh Strategies' section of the linked compatibility doc catalogs potential strategies for responding to changes in key material supplied by the service CA operator.

Note that CA rotation can be manually triggered in any 4.x release by removing the signing secret. Automated rotation is likely to be introduced in a future z-stream release. 

References: 

Enhancement for automated service CA rotation: 

https://github.com/openshift/enhancements/blob/master/enhancements/automated-service-ca-rotation.md

Operator compatibility with service ca rotation:

https://docs.google.com/document/d/1NB2wUf9e8XScfVM6jFBl8VuLYG6-3uV63eUpqmYE8Ts/edit

Comment 1 Maciej Szulik 2019-11-14 11:22:27 UTC
KCM is correctly picking up the new certs. The flow to follow is:

When starting the operator we pass a set of resources that are maintained using revision controller from library-go:
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/operator/starter.go#L101
those secrets are defined in 
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/operator/starter.go#L168-L188
one of them is serving-cert which is managed by the service-serving-cert-signer controller. 

The code responsible for updating pod with the new cert lives in:
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L330-L335

I've manually performed the test but an automatic one which would be verifying metrics endpoint is needed.

Sally can you add an end-to-end test for kcm-o which will be checking PDB alert that was added in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/301. 
This will:
1. Verify the alert is there, which is good and we want that.
2. Verify the metrics are properly served by the kcm.
3. Verify the metrics are served even when the cert is rotated. Marun will be working on a separate test suite that forces rotation and we need a test proving it's working as expected.

The test should be as follows:
1. create a PDB per instructions from https://bugzilla.redhat.com/show_bug.cgi?id=1762888#c0
2. wait for a bit and verify that alert from the aforementioned PR is firing.

If in doubt check Mike's latest e2e here: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/311

Comment 2 Maru Newby 2019-11-14 20:38:33 UTC
(In reply to Maciej Szulik from comment #1)
> 
> The code responsible for updating pod with the new cert lives in:
> https://github.com/openshift/cluster-kube-controller-manager-operator/blob/
> master/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L330-
> L335
> 
> I've manually performed the test but an automatic one which would be
> verifying metrics endpoint is needed.

The code you've pointed to ensures the pod knows what files to read, but where is the code that ensures the key material is reloaded (or the pod exits) when those paths change? I can't see anything in the operator code that would ensure refresh/restart. If you've been able to test that refresh/restart is occurring that would indicate that code in the kube controller manager is performing this function, and I would be grateful if you could point me to where that is happening. The codepath is gnarly enough that I decided to give up and rely on someone with expertise with the codebase.

> Sally can you add an end-to-end test for kcm-o which will be checking PDB
> alert that was added in
> https://github.com/openshift/cluster-kube-controller-manager-operator/pull/
> 301. 
> This will:
> 1. Verify the alert is there, which is good and we want that.
> 2. Verify the metrics are properly served by the kcm.
> 3. Verify the metrics are served even when the cert is rotated. Marun will
> be working on a separate test suite that forces rotation and we need a test
> proving it's working as expected.
> 
> The test should be as follows:
> 1. create a PDB per instructions from
> https://bugzilla.redhat.com/show_bug.cgi?id=1762888#c0
> 2. wait for a bit and verify that alert from the aforementioned PR is firing.
> 
> If in doubt check Mike's latest e2e here:
> https://github.com/openshift/cluster-kube-controller-manager-operator/pull/
> 311

Just to clarify, the requirement for an e2e that can be used to ensure rotation compatibility is a test that retrieves the current service CA bundle (rough example [1]) and uses that bundle to configure a client that connects to the healthz/metrics endpoint (rough example [2]). There doesn't need to be any data exchanged or validated, just that a trusted connection to the endpoint can be established which indicates that the serving cert was refreshed post-rotation.

Would it make sense for me to refactor the examples into reusable code in library-go that could be used by any operator to simplify verification that an endpoint they controlled was using post-rotation key material?

1: https://github.com/openshift/service-ca-operator/blob/master/test/e2e/e2e_test.go#L690 
2: https://github.com/openshift/service-ca-operator/blob/master/test/util/rotate.go#L123

Comment 4 Maru Newby 2019-11-14 21:32:12 UTC
Whoops, those schedule lines were included by mistake.

Comment 5 Maciej Szulik 2019-11-15 12:41:03 UTC
(In reply to Maru Newby from comment #2)
> Would it make sense for me to refactor the examples into reusable code in
> library-go that could be used by any operator to simplify verification that
> an endpoint they controlled was using post-rotation key material?
> 
> 1:
> https://github.com/openshift/service-ca-operator/blob/master/test/e2e/
> e2e_test.go#L690 
> 2:
> https://github.com/openshift/service-ca-operator/blob/master/test/util/
> rotate.go#L123

I don't see any objections why not. The more tests the better :)

Comment 6 Maciej Szulik 2019-12-02 15:59:21 UTC
We'll deliver the test in 4.4, at this point in time we are certain the logic is there. Moving this to 4.4.

Comment 7 Maciej Szulik 2019-12-02 15:59:41 UTC
*** Bug 1776409 has been marked as a duplicate of this bug. ***

Comment 8 Maciej Szulik 2020-02-04 13:39:19 UTC
PR merged, moving to modified.

Comment 10 zhou ying 2020-02-06 05:41:39 UTC
The related PR is only add e2e test , no function update , so will verify bug directly.

Comment 11 zhou ying 2020-02-10 06:16:19 UTC
Hi: 
   Follow Maciej Szulik's command , I checked the metrics/healthz endpoints, I do met error when  cert rotation. 
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-02-09-220310   True        False         4h27m   Cluster version is 4.4.0-0.nightly-2020-02-09-220310


Steps:
1) Short the cert rotate to 15 mins;
2) Use shell loop to check the metrics/healthz endpoints, I do met Connection refused during the cert rotation. 

So will re-assign this issue , correct me if I was wrong.

Comment 12 Maru Newby 2020-02-10 06:28:08 UTC
(In reply to zhou ying from comment #11)
> Hi: 
>    Follow Maciej Szulik's command , I checked the metrics/healthz endpoints,
> I do met error when  cert rotation. 
> [root@dhcp-140-138 ~]# oc get clusterversion
> NAME      VERSION                             AVAILABLE   PROGRESSING  
> SINCE   STATUS
> version   4.4.0-0.nightly-2020-02-09-220310   True        False        
> 4h27m   Cluster version is 4.4.0-0.nightly-2020-02-09-220310
> 
> 
> Steps:
> 1) Short the cert rotate to 15 mins;
> 2) Use shell loop to check the metrics/healthz endpoints, I do met
> Connection refused during the cert rotation. 
> 
> So will re-assign this issue , correct me if I was wrong.

I believe that connection refused after certificate rotation is expected in the case of the metrics endpoint since the operator framework that configures the endpoint detects a change in the certificate and restarts the pod. Only if the outage was prolonged would there be a problem.

A more reasonable test for rotation compatibility might be:

- retrieve the serving certificate from the secret
- delete the secret
- wait for the secret to be recreated with a new certificate
- wait for the metrics/heathz endpoint to start using the new certificate

The point isn't zero downtime of the endpoint, it's that it starts using the new certificate shortly after it replaces the old one in the secret.

Comment 13 Maciej Szulik 2020-02-10 15:16:05 UTC
Basically what Marun has said, moving back to qa.

Comment 15 zhou ying 2020-02-11 08:44:05 UTC
(In reply to Maru Newby from comment #12)
> (In reply to zhou ying from comment #11)
> > Hi: 
> >    Follow Maciej Szulik's command , I checked the metrics/healthz endpoints,
> > I do met error when  cert rotation. 
> > [root@dhcp-140-138 ~]# oc get clusterversion
> > NAME      VERSION                             AVAILABLE   PROGRESSING  
> > SINCE   STATUS
> > version   4.4.0-0.nightly-2020-02-09-220310   True        False        
> > 4h27m   Cluster version is 4.4.0-0.nightly-2020-02-09-220310
> > 
> > 
> > Steps:
> > 1) Short the cert rotate to 15 mins;
> > 2) Use shell loop to check the metrics/healthz endpoints, I do met
> > Connection refused during the cert rotation. 
> > 
> > So will re-assign this issue , correct me if I was wrong.
> 
> I believe that connection refused after certificate rotation is expected in
> the case of the metrics endpoint since the operator framework that
> configures the endpoint detects a change in the certificate and restarts the
> pod. Only if the outage was prolonged would there be a problem.
> 
> A more reasonable test for rotation compatibility might be:
> 
> - retrieve the serving certificate from the secret
> - delete the secret
> - wait for the secret to be recreated with a new certificate
> - wait for the metrics/heathz endpoint to start using the new certificate
> 
> The point isn't zero downtime of the endpoint, it's that it starts using the
> new certificate shortly after it replaces the old one in the secret.


Checked with payload: 4.4.0-0.nightly-2020-02-10-215022, no issue found.

Comment 17 errata-xmlrpc 2020-05-13 21:52:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581


Note You need to log in before you can comment on or make changes to this bug.