A serving cert supplied by the service ca operator appears to be used to secure the healthz/metrics endpoint of the kube controller manager. If the serving cert is regenerated (i.e. when the service CA is rotated), will the healthz/metrics endpoint be refreshed or will the controller manager be restarted? If not, the 'Refresh Strategies' section of the linked compatibility doc catalogs potential strategies for responding to changes in key material supplied by the service CA operator. Note that CA rotation can be manually triggered in any 4.x release by removing the signing secret. Automated rotation is likely to be introduced in a future z-stream release. References: Enhancement for automated service CA rotation: https://github.com/openshift/enhancements/blob/master/enhancements/automated-service-ca-rotation.md Operator compatibility with service ca rotation: https://docs.google.com/document/d/1NB2wUf9e8XScfVM6jFBl8VuLYG6-3uV63eUpqmYE8Ts/edit
KCM is correctly picking up the new certs. The flow to follow is: When starting the operator we pass a set of resources that are maintained using revision controller from library-go: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/operator/starter.go#L101 those secrets are defined in https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/operator/starter.go#L168-L188 one of them is serving-cert which is managed by the service-serving-cert-signer controller. The code responsible for updating pod with the new cert lives in: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L330-L335 I've manually performed the test but an automatic one which would be verifying metrics endpoint is needed. Sally can you add an end-to-end test for kcm-o which will be checking PDB alert that was added in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/301. This will: 1. Verify the alert is there, which is good and we want that. 2. Verify the metrics are properly served by the kcm. 3. Verify the metrics are served even when the cert is rotated. Marun will be working on a separate test suite that forces rotation and we need a test proving it's working as expected. The test should be as follows: 1. create a PDB per instructions from https://bugzilla.redhat.com/show_bug.cgi?id=1762888#c0 2. wait for a bit and verify that alert from the aforementioned PR is firing. If in doubt check Mike's latest e2e here: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/311
(In reply to Maciej Szulik from comment #1) > > The code responsible for updating pod with the new cert lives in: > https://github.com/openshift/cluster-kube-controller-manager-operator/blob/ > master/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L330- > L335 > > I've manually performed the test but an automatic one which would be > verifying metrics endpoint is needed. The code you've pointed to ensures the pod knows what files to read, but where is the code that ensures the key material is reloaded (or the pod exits) when those paths change? I can't see anything in the operator code that would ensure refresh/restart. If you've been able to test that refresh/restart is occurring that would indicate that code in the kube controller manager is performing this function, and I would be grateful if you could point me to where that is happening. The codepath is gnarly enough that I decided to give up and rely on someone with expertise with the codebase. > Sally can you add an end-to-end test for kcm-o which will be checking PDB > alert that was added in > https://github.com/openshift/cluster-kube-controller-manager-operator/pull/ > 301. > This will: > 1. Verify the alert is there, which is good and we want that. > 2. Verify the metrics are properly served by the kcm. > 3. Verify the metrics are served even when the cert is rotated. Marun will > be working on a separate test suite that forces rotation and we need a test > proving it's working as expected. > > The test should be as follows: > 1. create a PDB per instructions from > https://bugzilla.redhat.com/show_bug.cgi?id=1762888#c0 > 2. wait for a bit and verify that alert from the aforementioned PR is firing. > > If in doubt check Mike's latest e2e here: > https://github.com/openshift/cluster-kube-controller-manager-operator/pull/ > 311 Just to clarify, the requirement for an e2e that can be used to ensure rotation compatibility is a test that retrieves the current service CA bundle (rough example [1]) and uses that bundle to configure a client that connects to the healthz/metrics endpoint (rough example [2]). There doesn't need to be any data exchanged or validated, just that a trusted connection to the endpoint can be established which indicates that the serving cert was refreshed post-rotation. Would it make sense for me to refactor the examples into reusable code in library-go that could be used by any operator to simplify verification that an endpoint they controlled was using post-rotation key material? 1: https://github.com/openshift/service-ca-operator/blob/master/test/e2e/e2e_test.go#L690 2: https://github.com/openshift/service-ca-operator/blob/master/test/util/rotate.go#L123
With a bit more digging I've traced the path for cert load, and it is indeed dynamic. My apologies, I was previously was tracing the wrong path. https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/serving.go#L229 https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/serving_with_loopback.go#L44 https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-scheduler/app/options/options.go#L188 https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-scheduler/app/server.go#L138
Whoops, those schedule lines were included by mistake.
(In reply to Maru Newby from comment #2) > Would it make sense for me to refactor the examples into reusable code in > library-go that could be used by any operator to simplify verification that > an endpoint they controlled was using post-rotation key material? > > 1: > https://github.com/openshift/service-ca-operator/blob/master/test/e2e/ > e2e_test.go#L690 > 2: > https://github.com/openshift/service-ca-operator/blob/master/test/util/ > rotate.go#L123 I don't see any objections why not. The more tests the better :)
We'll deliver the test in 4.4, at this point in time we are certain the logic is there. Moving this to 4.4.
*** Bug 1776409 has been marked as a duplicate of this bug. ***
PR merged, moving to modified.
The related PR is only add e2e test , no function update , so will verify bug directly.
Hi: Follow Maciej Szulik's command , I checked the metrics/healthz endpoints, I do met error when cert rotation. [root@dhcp-140-138 ~]# oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.4.0-0.nightly-2020-02-09-220310 True False 4h27m Cluster version is 4.4.0-0.nightly-2020-02-09-220310 Steps: 1) Short the cert rotate to 15 mins; 2) Use shell loop to check the metrics/healthz endpoints, I do met Connection refused during the cert rotation. So will re-assign this issue , correct me if I was wrong.
(In reply to zhou ying from comment #11) > Hi: > Follow Maciej Szulik's command , I checked the metrics/healthz endpoints, > I do met error when cert rotation. > [root@dhcp-140-138 ~]# oc get clusterversion > NAME VERSION AVAILABLE PROGRESSING > SINCE STATUS > version 4.4.0-0.nightly-2020-02-09-220310 True False > 4h27m Cluster version is 4.4.0-0.nightly-2020-02-09-220310 > > > Steps: > 1) Short the cert rotate to 15 mins; > 2) Use shell loop to check the metrics/healthz endpoints, I do met > Connection refused during the cert rotation. > > So will re-assign this issue , correct me if I was wrong. I believe that connection refused after certificate rotation is expected in the case of the metrics endpoint since the operator framework that configures the endpoint detects a change in the certificate and restarts the pod. Only if the outage was prolonged would there be a problem. A more reasonable test for rotation compatibility might be: - retrieve the serving certificate from the secret - delete the secret - wait for the secret to be recreated with a new certificate - wait for the metrics/heathz endpoint to start using the new certificate The point isn't zero downtime of the endpoint, it's that it starts using the new certificate shortly after it replaces the old one in the secret.
Basically what Marun has said, moving back to qa.
(In reply to Maru Newby from comment #12) > (In reply to zhou ying from comment #11) > > Hi: > > Follow Maciej Szulik's command , I checked the metrics/healthz endpoints, > > I do met error when cert rotation. > > [root@dhcp-140-138 ~]# oc get clusterversion > > NAME VERSION AVAILABLE PROGRESSING > > SINCE STATUS > > version 4.4.0-0.nightly-2020-02-09-220310 True False > > 4h27m Cluster version is 4.4.0-0.nightly-2020-02-09-220310 > > > > > > Steps: > > 1) Short the cert rotate to 15 mins; > > 2) Use shell loop to check the metrics/healthz endpoints, I do met > > Connection refused during the cert rotation. > > > > So will re-assign this issue , correct me if I was wrong. > > I believe that connection refused after certificate rotation is expected in > the case of the metrics endpoint since the operator framework that > configures the endpoint detects a change in the certificate and restarts the > pod. Only if the outage was prolonged would there be a problem. > > A more reasonable test for rotation compatibility might be: > > - retrieve the serving certificate from the secret > - delete the secret > - wait for the secret to be recreated with a new certificate > - wait for the metrics/heathz endpoint to start using the new certificate > > The point isn't zero downtime of the endpoint, it's that it starts using the > new certificate shortly after it replaces the old one in the secret. Checked with payload: 4.4.0-0.nightly-2020-02-10-215022, no issue found.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581