Bug 1767523 - kubelets with expired certificates should cause an alert to be fired
Summary: kubelets with expired certificates should cause an alert to be fired
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-10-31 16:06 UTC by Steve Kuznetsov
Modified: 2020-05-04 11:15 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Release Note
Doc Text:
Adds a server_expiration_renew_errors metric for expired certificates.
Clone Of:
Environment:
Last Closed: 2020-05-04 11:14:43 UTC
Target Upstream Version:



Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift origin pull 24513 | None | closed | Bug 1767523: UPSTREAM: 84614: kubelet: add certificate rotation error metric | 2020-11-23 06:49:14 UTC
Red Hat Product Errata | RHBA-2020:0581 | None | None | None | 2020-05-04 11:15:11 UTC

Description Steve Kuznetsov 2019-10-31 16:06:09 UTC
Description of problem:

When the kubelet fails to get its certificate renewed, it cannot do any work or make any forward progress. This is anomalous and should cause an alert to be fired. Node log from an affected host:

Oct 23 19:18:57 origin-ci-ig-m-428p origin-node[4967]: I1023 19:18:57.337406    4967 certificate_manager.go:287] Rotating certificates
Oct 23 19:21:50 origin-ci-ig-m-428p origin-node[4967]: E1023 19:21:50.485640    4967 certificate_manager.go:326] Certificate request was not signed: timed out waiting for the condition
Oct 23 19:23:05 origin-ci-ig-m-428p origin-node[4967]: E1023 19:23:05.337508    4967 reflector.go:253] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to watch *v1.Pod: the server has asked for the client to provide credentials (get pods)
Oct 23 19:23:08 origin-ci-ig-m-428p origin-node[4967]: F1023 19:23:08.425371    4967 transport.go:106] The currently active client certificate has expired and the server is responsive, exiting.
Oct 23 19:23:08 origin-ci-ig-m-428p systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Oct 23 19:23:08 origin-ci-ig-m-428p systemd[1]: Unit origin-node.service entered failed state.
Oct 23 19:23:08 origin-ci-ig-m-428p systemd[1]: origin-node.service failed.
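
Until a metric exists, the failure sequence in the log above can be picked out by matching on the message text. The following is a minimal sketch of that idea, not part of the kubelet or of any fix for this bug. The function name, pattern names, and the choice to match on message substrings are all assumptions made for illustration, and the message wording is not guaranteed to stay stable across releases.

```python
import re

# Patterns copied from the log excerpt in this report. Matching on message
# text is an assumption of this sketch; the wording may change between releases.
FAILURE_PATTERNS = {
    "csr_not_signed": re.compile(r"Certificate request was not signed"),
    "client_cert_expired": re.compile(
        r"currently active client certificate has expired"
    ),
}


def find_rotation_failures(lines):
    """Return (kind, line) pairs for lines that indicate a rotation failure."""
    hits = []
    for line in lines:
        for kind, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                hits.append((kind, line.rstrip("\n")))
                break  # one classification per line is enough
    return hits


# Three lines taken verbatim from the excerpt above.
SAMPLE_LOG = [
    "Oct 23 19:18:57 origin-ci-ig-m-428p origin-node[4967]: I1023 19:18:57.337406    4967 certificate_manager.go:287] Rotating certificates",
    "Oct 23 19:21:50 origin-ci-ig-m-428p origin-node[4967]: E1023 19:21:50.485640    4967 certificate_manager.go:326] Certificate request was not signed: timed out waiting for the condition",
    "Oct 23 19:23:08 origin-ci-ig-m-428p origin-node[4967]: F1023 19:23:08.425371    4967 transport.go:106] The currently active client certificate has expired and the server is responsive, exiting.",
]

if __name__ == "__main__":
    for kind, line in find_rotation_failures(SAMPLE_LOG):
        print(kind, "->", line)
```

Log scraping of this kind is fragile, which is part of why a dedicated metric is the better basis for an alert.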

Comment 1 Frederic Branczyk 2019-10-31 16:09:25 UTC
Reassigning to node team as each component owns its own monitoring. I agree that there should be metrics and alerts around this.

Comment 2 Ryan Phillips 2019-10-31 19:17:47 UTC
Potential upstream PR: https://github.com/kubernetes/kubernetes/pull/84614
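
For illustration, an alert on such a metric could compare the counter between two scrapes of the kubelet's metrics endpoint. The sketch below assumes the counter is exposed in the Prometheus text format under the name given in this bug's Doc Text, server_expiration_renew_errors. The exported name on a real kubelet may differ (for example, it may carry a component prefix), and the function names and sample scrape text here are hypothetical.

```python
import re

# Name taken from this bug's Doc Text; the exported name may differ in practice.
METRIC_NAME = "server_expiration_renew_errors"

# Matches "name value" or "name{labels} value", optionally followed by a timestamp.
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def read_counter(exposition_text, metric_name=METRIC_NAME):
    """Sum every sample of metric_name found in Prometheus text-format output."""
    total = 0.0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == metric_name:
            total += float(match.group(3))
    return total


def renew_errors_increase(previous_scrape, current_scrape):
    """Return how much the counter grew between two scrapes."""
    before = read_counter(previous_scrape)
    after = read_counter(current_scrape)
    if after < before:
        # The counter went backwards, so the process restarted; count from zero.
        return after
    return after - before


def should_alert(previous_scrape, current_scrape):
    """Fire when at least one renewal error was recorded between scrapes."""
    return renew_errors_increase(previous_scrape, current_scrape) > 0


# Hypothetical scrape output, written by hand for this example.
SCRAPE_BEFORE = "# TYPE server_expiration_renew_errors counter\nserver_expiration_renew_errors 0\n"
SCRAPE_AFTER = "# TYPE server_expiration_renew_errors counter\nserver_expiration_renew_errors 2\n"

if __name__ == "__main__":
    print(should_alert(SCRAPE_BEFORE, SCRAPE_AFTER))
```

In a real cluster this comparison would normally be written as an alerting rule in the monitoring stack rather than as a script; the code only shows the condition being checked.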

Comment 7 errata-xmlrpc 2020-05-04 11:14:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

