Bug 1771810
Summary: | The metrics/healthz endpoint of kube-scheduler may be broken by service CA rotation
---|---
Product: | OpenShift Container Platform
Reporter: | Maru Newby <mnewby>
Component: | kube-scheduler
Assignee: | Sally <somalley>
Status: | CLOSED ERRATA
QA Contact: | RamaKasturi <knarra>
Severity: | urgent
Docs Contact: |
Priority: | unspecified
Version: | 4.4
CC: | aos-bugs, maszulik, mfojtik, yinzhou
Target Milestone: | ---
Target Release: | 4.4.0
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: |
Doc Type: | If docs needed, set a value
Doc Text: |
Story Points: | ---
Clone Of: |
: | 1777069
Environment: |
Last Closed: | 2020-05-13 21:52:40 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1777069
Attachments: |
Description
Maru Newby
2019-11-13 03:08:45 UTC
KS is correctly picking up the new certs. The flow is as follows. When starting the operator, we pass a set of resources that are maintained by the revision controller from library-go:
https://github.com/openshift/cluster-kube-scheduler-operator/blob/master/pkg/operator/starter.go#L106

Those secrets are defined in:
https://github.com/openshift/cluster-kube-scheduler-operator/blob/master/pkg/operator/starter.go#L157-L170

One of them is serving-cert, which is managed by the service-serving-cert-signer controller. The code responsible for updating the pod with the new cert lives in:
https://github.com/openshift/cluster-kube-scheduler-operator/blob/master/pkg/operator/target_config_reconciler_v410_00.go#L126-L131

I've performed the test manually, but an automated test that verifies the metrics endpoint is needed.

Sally, can you add an end-to-end test for ks-o that checks one of the scheduler metrics, scheduler_scheduling_duration_seconds_sum for example? This will:
1. Verify the metrics are properly served by the ks.
2. Verify the metrics are served even when the cert is rotated.

Marun will be working on a separate test suite that forces rotation, and we need a test proving it works as expected. The test should be as follows:
1. Check the current value of scheduler_scheduling_duration_seconds_sum (or another metric of your choosing).
2. Schedule a test application, such as a pod or a simple deployment.
3. Check the value of scheduler_scheduling_duration_seconds_sum (the same metric as in 1) and compare; the two values should differ.

You may want to sync with Mike about which metric to pick other than scheduler_scheduling_duration_seconds_sum. If in doubt, check Mike's latest e2e here:
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/311

My apologies, I was previously tracing the wrong path. SecureServingWithLoopback ensures the use of NewDynamicServingContentFromFiles.
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/serving.go#L229
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/serving_with_loopback.go#L44
https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-scheduler/app/options/options.go#L188
https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-scheduler/app/server.go#L138

This will be fixed in 4.4, moving accordingly.

Created attachment 1657375
Before pod Scheduling

Created attachment 1657376
After pod scheduling

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:0581