1771810 – The metrics/healthz endpoint of kube-scheduler may be broken by service CA rotation

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-scheduler
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Sally
QA Contact:	RamaKasturi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1777069

Reported:	2019-11-13 03:08 UTC by Maru Newby
Modified:	2020-05-13 21:52 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1777069
Environment:
Last Closed:	2020-05-13 21:52:40 UTC
Target Upstream Version:
Embargoed:

Attachments
Before pod Scheduling (60.18 KB, application/pdf) 2020-02-03 16:22 UTC, RamaKasturi
After pod scheduling (60.17 KB, application/pdf) 2020-02-03 16:23 UTC, RamaKasturi

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-kube-scheduler-operator pull 191	0	None	closed	Bug 1771810: Add e2e to verify kso metrics accessible	2021-02-04 06:42:22 UTC
Red Hat Product Errata	RHBA-2020:0581	0	None	None	None	2020-05-13 21:52:42 UTC

Description Maru Newby 2019-11-13 03:08:45 UTC

A serving cert supplied by the service ca operator appears to be used to secure the healthz/metrics endpoint of the kube scheduler. If the serving cert is regenerated (i.e. when the service CA is rotated), it does not appear that the healthz/metrics endpoint will be refreshed or the scheduler restarted. This could result in a broken healthz/metrics endpoint.

The 'Refresh Strategies' section of the linked compatibility doc catalogs potential strategies for responding to changes in key material supplied by the service CA operator.

Note that CA rotation can be manually triggered in any 4.x release by removing the signing secret. Automated rotation is likely to be introduced in a future z-stream release. 

References: 

Enhancement for automated service CA rotation: 

https://github.com/openshift/enhancements/blob/master/enhancements/automated-service-ca-rotation.md

Operator compatibility with service ca rotation:

https://docs.google.com/document/d/1NB2wUf9e8XScfVM6jFBl8VuLYG6-3uV63eUpqmYE8Ts/edit

Comment 1 Maciej Szulik 2019-11-14 11:33:53 UTC

KS is correctly picking up the new certs. The flow is as follows:

When the operator starts, we pass it a set of resources that are maintained by the revision controller from library-go:
https://github.com/openshift/cluster-kube-scheduler-operator/blob/master/pkg/operator/starter.go#L106
Those secrets are defined in:
https://github.com/openshift/cluster-kube-scheduler-operator/blob/master/pkg/operator/starter.go#L157-L170
One of them is serving-cert, which is managed by the service-serving-cert-signer controller.

The code responsible for updating the pod with the new cert lives in:
https://github.com/openshift/cluster-kube-scheduler-operator/blob/master/pkg/operator/target_config_reconciler_v410_00.go#L126-L131

I've performed the test manually, but an automated test that verifies the metrics endpoint is still needed.

Sally, can you add an end-to-end test for ks-o that checks one of the scheduler metrics, for example scheduler_scheduling_duration_seconds_sum?
This will:
1. Verify that the metrics are properly served by the ks.
2. Verify that the metrics are still served when the cert is rotated. Marun will be working on a separate test suite that forces rotation, and we need a test proving it works as expected.

The test should be as follows:
1. Check the current value of scheduler_scheduling_duration_seconds_sum (or another metric of your choosing).
2. Schedule some test application: a pod or a simple deployment.
3. Check the value of scheduler_scheduling_duration_seconds_sum again (the same metric as in step 1) and compare. The two values should differ.

You may want to sync with Mike about which metric to pick other than scheduler_scheduling_duration_seconds_sum.

If in doubt check Mike's latest e2e here: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/311

Comment 2 Maru Newby 2019-11-14 21:31:42 UTC

My apologies, I was previously tracing the wrong path. SecureServingWithLoopback ensures the use of NewDynamicServingContentFromFiles.

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/serving.go#L229
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/serving_with_loopback.go#L44
https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-scheduler/app/options/options.go#L188
https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-scheduler/app/server.go#L138

Comment 3 Maciej Szulik 2019-12-02 17:54:19 UTC

This will be fixed in 4.4, moving accordingly.

Comment 6 RamaKasturi 2020-02-03 16:22:40 UTC

Created attachment 1657375
Before pod Scheduling

Comment 7 RamaKasturi 2020-02-03 16:23:47 UTC

Created attachment 1657376
After pod scheduling

Comment 9 errata-xmlrpc 2020-05-13 21:52:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
