Bug 1776409

Summary: The metrics/healthz endpoint of kube-controller-manager potentially broken by service CA rotation
Product: OpenShift Container Platform Reporter: Sally <somalley>
Component: kube-controller-managerAssignee: Sally <somalley>
Status: CLOSED DUPLICATE QA Contact: zhou ying <yinzhou>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.3.zCC: aos-bugs, maszulik, mfojtik, mnewby, yinzhou
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1771809 Environment:
Last Closed: 2019-12-02 15:59:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1771809    
Bug Blocks:    

Description Sally 2019-11-25 16:06:57 UTC
+++ This bug was initially created as a clone of Bug #1771809 +++

A serving cert supplied by the service ca operator appears to be used to secure the healthz/metrics endpoint of the kube controller manager. If the serving cert is regenerated (i.e. when the service CA is rotated), will the healthz/metrics endpoint be refreshed or will the controller manager be restarted?

If not, the 'Refresh Strategies' section of the linked compatibility doc catalogs potential strategies for responding to changes in key material supplied by the service CA operator.

Note that CA rotation can be manually triggered in any 4.x release by removing the signing secret. Automated rotation is likely to be introduced in a future z-stream release. 

References: 

Enhancement for automated service CA rotation: 

https://github.com/openshift/enhancements/blob/master/enhancements/automated-service-ca-rotation.md

Operator compatibility with service ca rotation:

https://docs.google.com/document/d/1NB2wUf9e8XScfVM6jFBl8VuLYG6-3uV63eUpqmYE8Ts/edit

--- Additional comment from Maciej Szulik on 2019-11-14 11:22:27 UTC ---

KCM is correctly picking up the new certs. The flow to follow is:

When starting the operator we pass a set of resources that are maintained using revision controller from library-go:
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/operator/starter.go#L101
those secrets are defined in 
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/operator/starter.go#L168-L188
one of them is serving-cert which is managed by the service-serving-cert-signer controller. 

The code responsible for updating pod with the new cert lives in:
https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L330-L335

I've manually performed the test but an automatic one which would be verifying metrics endpoint is needed.

Sally can you add an end-to-end test for kcm-o which will be checking PDB alert that was added in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/301. 
This will:
1. Verify the alert is there, which is good and we want that.
2. Verify the metrics are properly served by the kcm.
3. Verify the metrics are served even when the cert is rotated. Marun will be working on a separate test suite that forces rotation and we need a test proving it's working as expected.

The test should be as follows:
1. create a PDB per instructions from https://bugzilla.redhat.com/show_bug.cgi?id=1762888#c0
2. wait for a bit and verify that alert from the aforementioned PR is firing.

If in doubt check Mike's latest e2e here: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/311

--- Additional comment from Maru Newby on 2019-11-14 20:38:33 UTC ---

(In reply to Maciej Szulik from comment #1)
> 
> The code responsible for updating pod with the new cert lives in:
> https://github.com/openshift/cluster-kube-controller-manager-operator/blob/
> master/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L330-
> L335
> 
> I've manually performed the test but an automatic one which would be
> verifying metrics endpoint is needed.

The code you've pointed to ensures the pod knows what files to read, but where is the code that ensures the key material is reloaded (or the pod exits) when those paths change? I can't see anything in the operator code that would ensure refresh/restart. If you've been able to test that refresh/restart is occurring that would indicate that code in the kube controller manager is performing this function, and I would be grateful if you could point me to where that is happening. The codepath is gnarly enough that I decided to give up and rely on someone with expertise with the codebase.

> Sally can you add an end-to-end test for kcm-o which will be checking PDB
> alert that was added in
> https://github.com/openshift/cluster-kube-controller-manager-operator/pull/
> 301. 
> This will:
> 1. Verify the alert is there, which is good and we want that.
> 2. Verify the metrics are properly served by the kcm.
> 3. Verify the metrics are served even when the cert is rotated. Marun will
> be working on a separate test suite that forces rotation and we need a test
> proving it's working as expected.
> 
> The test should be as follows:
> 1. create a PDB per instructions from
> https://bugzilla.redhat.com/show_bug.cgi?id=1762888#c0
> 2. wait for a bit and verify that alert from the aforementioned PR is firing.
> 
> If in doubt check Mike's latest e2e here:
> https://github.com/openshift/cluster-kube-controller-manager-operator/pull/
> 311

Just to clarify, the requirement for an e2e that can be used to ensure rotation compatibility is a test that retrieves the current service CA bundle (rough example [1]) and uses that bundle to configure a client that connects to the healthz/metrics endpoint (rough example [2]). There doesn't need to be any data exchanged or validated, just that a trusted connection to the endpoint can be established which indicates that the serving cert was refreshed post-rotation.

Would it make sense for me to refactor the examples into reusable code in library-go that could be used by any operator to simplify verification that an endpoint they controlled was using post-rotation key material?

1: https://github.com/openshift/service-ca-operator/blob/master/test/e2e/e2e_test.go#L690 
2: https://github.com/openshift/service-ca-operator/blob/master/test/util/rotate.go#L123

--- Additional comment from Maru Newby on 2019-11-14 21:29:33 UTC ---

With a bit more digging I've traced the path for cert load, and it is indeed dynamic. My apologies, I was previously was tracing the wrong path.

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/serving.go#L229
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/serving_with_loopback.go#L44
https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-scheduler/app/options/options.go#L188
https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-scheduler/app/server.go#L138

--- Additional comment from Maru Newby on 2019-11-14 21:32:12 UTC ---

Whoops, those schedule lines were included by mistake.

--- Additional comment from Maciej Szulik on 2019-11-15 12:41:03 UTC ---

(In reply to Maru Newby from comment #2)
> Would it make sense for me to refactor the examples into reusable code in
> library-go that could be used by any operator to simplify verification that
> an endpoint they controlled was using post-rotation key material?
> 
> 1:
> https://github.com/openshift/service-ca-operator/blob/master/test/e2e/
> e2e_test.go#L690 
> 2:
> https://github.com/openshift/service-ca-operator/blob/master/test/util/
> rotate.go#L123

I don't see any objections why not. The more tests the better :)

Comment 1 Maciej Szulik 2019-12-02 15:59:41 UTC

*** This bug has been marked as a duplicate of bug 1771809 ***