1767523 – kubelets with expired certificates should cause an alert to be fired

Summary: kubelets with expired certificates should cause an alert to be fired

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Ryan Phillips
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-10-31 16:06 UTC by Steve Kuznetsov
Modified:	2020-05-04 11:15 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Release Note
Doc Text:	Adds a server_expiration_renew_errors metric for expired certificates.
Clone Of:
Environment:
Last Closed:	2020-05-04 11:14:43 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 24513	0	None	closed	Bug 1767523: UPSTREAM: 84614: kubelet: add certificate rotation error metric	2021-02-10 02:53:18 UTC
Red Hat Product Errata	RHBA-2020:0581	0	None	None	None	2020-05-04 11:15:11 UTC

Description Steve Kuznetsov 2019-10-31 16:06:09 UTC

Description of problem:

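When the kubelet fails to get its certificate renewed, it cannot do any work or make forward progress. Because this is anomalous, it should cause an alert to be fired. The node log below shows a rotation attempt that is never signed, followed by the kubelet exiting once the active client certificate has expired: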

Oct 23 19:18:57 origin-ci-ig-m-428p origin-node[4967]: I1023 19:18:57.337406    4967 certificate_manager.go:287] Rotating certificates
Oct 23 19:21:50 origin-ci-ig-m-428p origin-node[4967]: E1023 19:21:50.485640    4967 certificate_manager.go:326] Certificate request was not signed: timed out waiting for the condition
Oct 23 19:23:05 origin-ci-ig-m-428p origin-node[4967]: E1023 19:23:05.337508    4967 reflector.go:253] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to watch *v1.Pod: the server has asked for the client to provide credentials (get pods)
Oct 23 19:23:08 origin-ci-ig-m-428p origin-node[4967]: F1023 19:23:08.425371    4967 transport.go:106] The currently active client certificate has expired and the server is responsive, exiting.
Oct 23 19:23:08 origin-ci-ig-m-428p systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Oct 23 19:23:08 origin-ci-ig-m-428p systemd[1]: Unit origin-node.service entered failed state.
Oct 23 19:23:08 origin-ci-ig-m-428p systemd[1]: origin-node.service failed.
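
Until a metric and alert exist, one stopgap is to watch the node journal for the two messages above. The following is only a rough sketch: the function name classify_line and the labels renewal_failed and certificate_expired are made up for illustration, and the matched substrings are copied from this excerpt, so they may differ in other kubelet versions.

```python
import re

# klog-style header as seen in the excerpt above:
# severity letter, MMDD, timestamp, PID, "file.go:line]", then the message.
KLOG_RE = re.compile(
    r"(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)

# Substrings copied from the log excerpt; the labels are hypothetical.
SIGNALS = {
    "Certificate request was not signed": "renewal_failed",
    "client certificate has expired": "certificate_expired",
}


def classify_line(line):
    """Return (severity, source, label) for a certificate-related line, else None."""
    match = KLOG_RE.search(line)
    if match is None:
        # Not a klog line (for example, the systemd unit state messages).
        return None
    for needle, label in SIGNALS.items():
        if needle in match.group("msg"):
            return (match.group("sev"), match.group("src"), label)
    return None
```

Feeding the journal output of the node service through classify_line line by line and alerting on any non-None result would have flagged both the E (unsigned request) and F (expired certificate) entries shown here. The "Rotating certificates" line and the systemd lines are deliberately ignored.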

Comment 1 Frederic Branczyk 2019-10-31 16:09:25 UTC

Reassigning to node team as each component owns its own monitoring. I agree that there should be metrics and alerts around this.

Comment 2 Ryan Phillips 2019-10-31 19:17:47 UTC

Potential upstream PR: https://github.com/kubernetes/kubernetes/pull/84614

Comment 7 errata-xmlrpc 2020-05-04 11:14:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
