1829923 – test: etcd etcd-quorum-guard leaderelection lost

Bug 1829923 - test: etcd etcd-quorum-guard leaderelection lost

Summary: test: etcd etcd-quorum-guard leaderelection lost

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Dan Mace
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:	1847672
Blocks:	1845636
TreeView+	depends on / blocked

Reported:	2020-04-30 15:04 UTC by Brett Tofel
Modified:	2020-10-27 15:58 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:	test: Overall
Last Closed:	2020-10-27 15:58:32 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 383	None	closed	Bug 1847672: Expand supported set of probe field mutations	2021-01-06 10:04:54 UTC
Github	openshift machine-config-operator pull 1797	None	closed	Bug 1829923: Fix quorum-guard timeouts	2021-01-06 10:04:17 UTC
Github	openshift machine-config-operator pull 1818	None	closed	Bug 1829923: etcd-quorum-guard: update containers section in deployment.yaml	2021-01-06 10:04:17 UTC
Red Hat Product Errata	RHBA-2020:4196	None	None	None	2020-10-27 15:58:56 UTC

Description Brett Tofel 2020-04-30 15:04:34 UTC

"ns/openshift-machine-config-operator pod/etcd-quorum-guard-565f8d849c-klgqh Readiness probe failed: command timed out (19 times)"

"dial tcp [::1]:6443: connect: connection refused\nI0430 12:50:38.672317       1 leaderelection.go:288] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition\nF0430 12:50:38.672352       1 server.go:257] leaderelection lost\n"


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915#1:build-log.txt%3A12804

Reliably failing in "job.initialize" in redhat-openshift-ocp-release-4.4-informing suite.

Comment 1 Ben Parees 2020-04-30 21:33:42 UTC

search results:
https://search.apps.build01.ci.devcluster.openshift.com/?search=leaderelection+lost&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

lots of leader election lost being seen in test failures again.

happening in e2e-aws-upgrade which is a generic job used to test various version to version upgrades, so this could be due to issues in basically any ocp version.

Comment 2 Alay Patel 2020-05-01 20:52:37 UTC

There is only a single event of failed "Readiness probe" reported here: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/events.json

-----
{
            "apiVersion": "v1",
            "count": 20,
            "eventTime": null,
            "firstTimestamp": "2020-04-30T12:54:03Z",
            "involvedObject": {
                "apiVersion": "v1",
                "fieldPath": "spec.containers{guard}",
                "kind": "Pod",
                "name": "etcd-quorum-guard-565f8d849c-klgqh",
                "namespace": "openshift-machine-config-operator",
                "resourceVersion": "8104",
                "uid": "5167f230-7e34-4fb5-a1d5-b2e930b22f16"
            },
            "kind": "Event",
            "lastTimestamp": "2020-04-30T13:07:33Z",
            "message": "Readiness probe failed: command timed out",
            "metadata": {
                "creationTimestamp": "2020-04-30T12:54:03Z",
                "name": "etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
                "namespace": "openshift-machine-config-operator",
                "resourceVersion": "146976",
                "selfLink": "/api/v1/namespaces/openshift-machine-config-operator/events/etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
                "uid": "05785c65-fc68-4589-93f1-113481e12d68"
            },
            "reason": "Unhealthy",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "kubelet",
                "host": "ip-10-0-145-13.us-east-2.compute.internal"
            },
            "type": "Warning"
        },

----

The corresponding etcd server entry for the failure does not seem to be fatal: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/pods/openshift-etcd_etcd-ip-10-0-145-13.us-east-2.compute.internal_etcd.log

----
2020-04-30 13:07:14.709133 W | etcdserver: read-only range request "key:\"/kubernetes.io/persistentvolumeclaims/e2e-volumemode-9222/pvc-5n2hj\" " with result "range_response_count:1 size:468" took too long (239.823142ms) to execute
2020-04-30 13:07:14.709260 W | etcdserver: read-only range request "key:\"/kubernetes.io/clusterroles/self-access-reviewer\" " with result "range_response_count:1 size:389" took too long (157.091852ms) to execute
2020-04-30 13:07:18.867912 I | embed: rejected connection from "10.130.0.8:40554" (error "EOF", ServerName "")
2020-04-30 13:07:24.407740 I | embed: rejected connection from "10.130.0.8:40732" (error "EOF", ServerName "")
2020-04-30 13:07:33.167135 I | embed: rejected connection from "127.0.0.1:57514" (error "EOF", ServerName "") <======= failed readiness probe spotted
2020-04-30 13:07:43.880351 I | embed: rejected connection from "10.130.0.8:41356" (error "EOF", ServerName "")
2020-04-30 13:07:51.364402 I | embed: rejected connection from "10.130.0.8:41572" (error "EOF", ServerName "")
2020-04-30 13:08:04.145769 W | etcdserver: request "header:<ID:7238535618199108688 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" mod_revision:149734 > success:<request_delete_range:<key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" > > failure:<>>" with result "size:20" took too long (104.925693ms) to execute
2020-04-30 13:08:04.148476 W | etcdserver: read-only range request "key:\"/kubernetes.io/events/e2e-volume-3403/aws-client.160a9ad3e51fa5fd\" " with result "range_response_count:1 size:524" took too long (114.702437ms) to execute

----

Looks like etcd is fairly stable, and the cause for failure of the test does not seem to be evident at etcd level.

Comment 3 Sam Batschelet 2020-05-18 15:06:40 UTC

> 2020-04-30 13:07:33.167135 I | embed: rejected connection from "127.0.0.1:57514" (error "EOF", ServerName "") <======= failed readiness probe spotted

this is concerning that we can't explain the failure if that indeed from quorum-guard as all members are up.

Comment 5 Dan Mace 2020-05-28 13:43:08 UTC

Because this isn't a 4.5 regression and we don't yet have any clear understanding of the impact, I'm moving the bug to 4.6 so we can continue investigation without blocking the 4.5 release. If somebody has a specific justification for blocking 4.5, please provide details.

Comment 7 Dan Mace 2020-06-05 19:42:28 UTC

Just picked a random recent PR run to start level setting again here:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/367/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1635

$ jq '.items[] | select((.involvedObject.name // "" | startswith("etcd-quorum-guard")) and .source.component=="kubelet" and .reason=="Unhealthy") | { "name:": .involvedObject.name, "count": .count, "message": .message }' events.json
{
  "name:": "etcd-quorum-guard-5988dfbc54-fnhp7",
  "count": 1,
  "message": "Readiness probe failed: "
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 1,
  "message": "Readiness probe failed: command timed out"
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 2,
  "message": "Readiness probe failed: "
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-vwvrp",
  "count": 3,
  "message": "Readiness probe failed: "
}

Haven't tried to do a wider analysis or dig in yet, just wanted to show that it's happening often enough that I was able to easily find it on my first try.

Comment 8 Dan Mace 2020-06-08 14:41:24 UTC

Good example from a recent periodic:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/2347

Simple answer is that quorum-guard pods are created a minute prior to the etcd pods, and fail readiness probes until etcd actually comes up.

Given the bootstrap etcd is functional (without quorum) at this time, and quorum is ultimately achieved after master etcd startup, I'm not sure the logs mean much.

Is quorum-guard intended to monitor only the status of the post-bootstrap etcd cluster? If so, maybe quorum-guard is starting too soon. Otherwise we should disregard its failures during bootstrapping, but then how do we easily distinguish those failures from real ones post bootstrapping?

Comment 11 Dan Mace 2020-06-11 18:23:46 UTC

The fix was bad. https://github.com/openshift/machine-config-operator/pull/1818 repairs it.

Comment 12 Kirsten Garrison 2020-06-12 02:17:32 UTC

This is how I verified my PR in CI: https://github.com/openshift/machine-config-operator/pull/1818#issuecomment-643024469

Comment 14 Dan Mace 2020-06-15 14:34:08 UTC

Re-opening again — I think we solved it for new installations, but I don't think the changes are being applied during Y or Z upgrades. See the thread starting at:

https://github.com/openshift/machine-config-operator/pull/1818#issuecomment-644121442

Comment 15 Dan Mace 2020-06-16 19:01:16 UTC

Created blocking CVO bug https://bugzilla.redhat.com/show_bug.cgi?id=1847672 to track the upgrade issue.

Comment 18 errata-xmlrpc 2020-10-27 15:58:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Note You need to log in before you can comment on or make changes to this bug.