1958405 – etcd: current health checks and reporting are not adequate to ensure availability

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-05-07 20:24 UTC by Sam Batschelet
Modified:	2021-10-26 04:26 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 23:07:25 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 596	None	closed	Bug 1958405: Improve etcd service observability and health checks	2021-06-11 12:18:56 UTC
Github	openshift etcd pull 78	None	closed	Bug 1958405: UPSTREAM: <carry>: server: add support for log rotation (#12774)	2021-06-10 13:58:27 UTC
Github	openshift must-gather pull 235	None	closed	Bug 1958405: add etcd health logs to audit	2021-06-10 15:56:12 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 23:07:42 UTC

Description Sam Batschelet 2021-05-07 20:24:54 UTC

Description of problem: OpenShift currently has a few different methods of health checking etcd.

quorum-guard: To maintain the PDB, the quorum-guard readinessProbe performs a curl GET against the etcd /health endpoint. If the health check fails, the container does not report Ready status and the PDB status reflects one unavailable pod. The probe runs every 5s; failures can be seen in the kubelet logs and are also reflected in the etcd logs.

```
2021-05-07 18:18:48.401026 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-05-07 18:18:53.498054 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-05-07 18:18:58.414041 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-05-07 18:19:03.403105 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-05-07 18:19:08.407647 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-05-07 18:19:13.398697 I | etcdserver/api/etcdhttp: /health OK (status code 200)
```
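
To make the probe cadence in these logs measurable, here is a minimal Python sketch (an illustration only, not part of the fix) that parses `/health` entries and reports the time between consecutive checks. The function names `parse_health_line` and `check_gaps` are hypothetical, and the parsing assumes the exact line layout shown above.

```python
from datetime import datetime

# Timestamp layout assumed from the sample log lines above.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_health_line(line):
    """Return (timestamp, status_code) for an etcdhttp /health log line, else None."""
    if "etcdserver/api/etcdhttp: /health" not in line or "status code" not in line:
        return None
    # The first two whitespace-separated fields are the date and the time.
    stamp = " ".join(line.split()[:2])
    ts = datetime.strptime(stamp, LOG_FORMAT)
    # The status code is the number after "status code", before the closing ")".
    code = int(line.rsplit("status code", 1)[1].strip(" )\n"))
    return ts, code


def check_gaps(lines):
    """Seconds elapsed between consecutive /health log entries."""
    stamps = [parsed[0] for parsed in map(parse_health_line, lines) if parsed]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Applied to the six lines above, every gap is close to 5s, so an outage shorter than the probe period can go unobserved by this check.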

cluster-etcd-operator health checks: These checks are performed by the cluster member controller and report the health status of etcd members. The controller is set to sync every minute. Health check failures are emitted as events and will result in degraded status if a member is unhealthy.

livenessProbe: etcd uses a liveness probe to perform health checks every 5s. Failures will affect Ready status of the etcd container.

With all of the above, there is still no simple, reportable way to establish availability at any given point, for example a statement such as "etcd-1 has been down for 25s."
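
As a rough illustration of the kind of report that is missing, the Python sketch below collapses per-member health samples into outage intervals with a duration. The `Outage` type, the `outages` function, and the `(timestamp_seconds, healthy)` sample shape are all assumptions made for this example; none of them come from an existing component.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    member: str
    start: float  # timestamp of the first failed sample
    end: float    # timestamp of the next healthy sample (or the last sample seen)

    @property
    def duration(self):
        return self.end - self.start


def outages(member, samples):
    """Collapse ordered (timestamp_seconds, healthy) samples into outage intervals."""
    result = []
    down_since = None
    last_ts = None
    for ts, healthy in samples:
        if not healthy and down_since is None:
            down_since = ts  # outage begins
        elif healthy and down_since is not None:
            result.append(Outage(member, down_since, ts))  # outage ends
            down_since = None
        last_ts = ts
    if down_since is not None:
        # Still failing at the end of the data: close the interval at the last sample.
        result.append(Outage(member, down_since, last_ts))
    return result
```

With one sample per second and failures from t=10 through t=34, this yields a single interval for the member with a duration of 25 seconds. The precision of that duration is bounded by the sampling period.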

Version-Release number of selected component (if applicable):

How reproducible: 100%

Steps to Reproduce:
1.
2.
3.

Actual results: Multiple health checks exist, but none of them has 1s granularity, and their reporting is not reasonably consumable.

Expected results: etcd should report service availability to the cluster at 1s granularity. The results of these checks should be easily consumable by automation and included in the e2e-intervals report. This data is critical so that we can explain complex outages between the apiserver and etcd with certainty.
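
A minimal Python sketch of what 1s sampling with machine-readable output could look like is shown below. It is not the implementation planned for this bug: `sample_health`, its parameters, and the JSON field names are hypothetical, and the actual health check (for example, an authenticated GET against the etcd /health endpoint) is passed in as a callable rather than implemented here.

```python
import json
import time


def sample_health(check, member, count, interval=1.0,
                  clock=time.time, sleep=time.sleep):
    """Call check() `count` times, `interval` seconds apart.

    Returns one JSON line per sample so that automation can consume the result.
    A check that raises an exception is recorded as unhealthy.
    """
    records = []
    for _ in range(count):
        started = clock()
        try:
            healthy = bool(check())
        except Exception:
            healthy = False
        records.append(json.dumps(
            {"member": member, "ts": started, "healthy": healthy}))
        # Sleep only for the remainder of the interval, so that probe latency
        # does not stretch the sampling period.
        remaining = interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return records
```

The clock and sleep functions are injectable so the loop can be exercised without waiting in real time. In real use, the check would also need its own timeout shorter than the interval.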

Additional info:

Comment 2 Sam Batschelet 2021-05-11 16:39:08 UTC

This fix will involve multiple PRs, so this BZ will be the canonical location for updates. A general plan for how this will be addressed will be outlined shortly.

Comment 11 ge liu 2021-06-16 03:42:59 UTC

Verified as per comment 8.

Comment 13 errata-xmlrpc 2021-07-27 23:07:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
