Bug 1767523

Summary: kubelets with expired certificates should cause an alert to be fired
Product: OpenShift Container Platform
Component: Node
Version: 3.11.0
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Steve Kuznetsov <skuznets>
Assignee: Ryan Phillips <rphillips>
QA Contact: Sunil Choudhary <schoudha>
CC: alegrand, anpicker, aos-bugs, erooth, jokerman, kakkoyun, lcosic, mloibl, pkrupa, rphillips, surbania
Doc Type: Release Note
Doc Text: Adds a server_expiration_renew_errors metric for expired certificates.
Type: Bug
Last Closed: 2020-05-04 11:14:43 UTC

Description Steve Kuznetsov 2019-10-31 16:06:09 UTC
Description of problem:

When the kubelet fails to get its certificate renewed, it cannot do any work or make any forward progress. This should cause an alert to be fired, since this state is anomalous.

Oct 23 19:18:57 origin-ci-ig-m-428p origin-node[4967]: I1023 19:18:57.337406    4967 certificate_manager.go:287] Rotating certificates
Oct 23 19:21:50 origin-ci-ig-m-428p origin-node[4967]: E1023 19:21:50.485640    4967 certificate_manager.go:326] Certificate request was not signed: timed out waiting for the condition
Oct 23 19:23:05 origin-ci-ig-m-428p origin-node[4967]: E1023 19:23:05.337508    4967 reflector.go:253] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to watch *v1.Pod: the server has asked for the client to provide credentials (get pods)
Oct 23 19:23:08 origin-ci-ig-m-428p origin-node[4967]: F1023 19:23:08.425371    4967 transport.go:106] The currently active client certificate has expired and the server is responsive, exiting.
Oct 23 19:23:08 origin-ci-ig-m-428p systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Oct 23 19:23:08 origin-ci-ig-m-428p systemd[1]: Unit origin-node.service entered failed state.
Oct 23 19:23:08 origin-ci-ig-m-428p systemd[1]: origin-node.service failed.
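
For illustration only, and not the kubelet's actual implementation: a minimal Go sketch of how a renewal-failure counter along the lines of the server_expiration_renew_errors metric mentioned in the Doc Text could be exposed with Prometheus client_golang. The names certRenewErrors and rotateCertificate, and the alert expression in the comment, are assumptions made for the sketch.

// Minimal sketch, assuming a standalone process: register a counter in the
// spirit of server_expiration_renew_errors and increment it whenever a
// certificate rotation attempt fails. Not the kubelet's real rotation code.
package main

import (
	"errors"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var certRenewErrors = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "server_expiration_renew_errors",
	Help: "Counter of certificate renewal errors.",
})

// rotateCertificate stands in for the real rotation logic; here it always
// fails so the counter visibly increments.
func rotateCertificate() error {
	return errors.New("certificate request was not signed: timed out waiting for the condition")
}

func main() {
	prometheus.MustRegister(certRenewErrors)

	if err := rotateCertificate(); err != nil {
		certRenewErrors.Inc()
		log.Printf("certificate rotation failed: %v", err)
	}

	// An alerting rule could then fire on growth of this counter, e.g.:
	//   increase(server_expiration_renew_errors[1h]) > 0
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}

An alerting rule watching for increases of such a counter would surface exactly the failure shown in the log above, before the kubelet exits.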

Comment 1 Frederic Branczyk 2019-10-31 16:09:25 UTC
Reassigning to the node team, as each component owns its own monitoring. I agree that there should be metrics and alerts around this.

Comment 2 Ryan Phillips 2019-10-31 19:17:47 UTC
Potential upstream PR: https://github.com/kubernetes/kubernetes/pull/84614

Comment 7 errata-xmlrpc 2020-05-04 11:14:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581