Bug 1829923 - test: etcd etcd-quorum-guard leaderelection lost
Summary: test: etcd etcd-quorum-guard leaderelection lost
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.6.0
Assignee: Dan Mace
QA Contact: ge liu
Depends On: 1847672
Blocks: 1845636
TreeView+ depends on / blocked
Reported: 2020-04-30 15:04 UTC by Brett Tofel
Modified: 2020-10-27 15:58 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
test: Overall
Last Closed: 2020-10-27 15:58:32 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 383 0 None closed Bug 1847672: Expand supported set of probe field mutations 2021-01-06 10:04:54 UTC
Github openshift machine-config-operator pull 1797 0 None closed Bug 1829923: Fix quorum-guard timeouts 2021-01-06 10:04:17 UTC
Github openshift machine-config-operator pull 1818 0 None closed Bug 1829923: etcd-quorum-guard: update containers section in deployment.yaml 2021-01-06 10:04:17 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:58:56 UTC

Description Brett Tofel 2020-04-30 15:04:34 UTC
"ns/openshift-machine-config-operator pod/etcd-quorum-guard-565f8d849c-klgqh Readiness probe failed: command timed out (19 times)"

"dial tcp [::1]:6443: connect: connection refused\nI0430 12:50:38.672317       1 leaderelection.go:288] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition\nF0430 12:50:38.672352       1 server.go:257] leaderelection lost\n"


Reliably failing in "job.initialize" in redhat-openshift-ocp-release-4.4-informing suite.

Comment 1 Ben Parees 2020-04-30 21:33:42 UTC
search results:

lots of leader election lost being seen in test failures again.

happening in e2e-aws-upgrade which is a generic job used to test various version to version upgrades, so this could be due to issues in basically any ocp version.

Comment 2 Alay Patel 2020-05-01 20:52:37 UTC
There is only a single event of failed "Readiness probe" reported here: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/events.json

            "apiVersion": "v1",
            "count": 20,
            "eventTime": null,
            "firstTimestamp": "2020-04-30T12:54:03Z",
            "involvedObject": {
                "apiVersion": "v1",
                "fieldPath": "spec.containers{guard}",
                "kind": "Pod",
                "name": "etcd-quorum-guard-565f8d849c-klgqh",
                "namespace": "openshift-machine-config-operator",
                "resourceVersion": "8104",
                "uid": "5167f230-7e34-4fb5-a1d5-b2e930b22f16"
            "kind": "Event",
            "lastTimestamp": "2020-04-30T13:07:33Z",
            "message": "Readiness probe failed: command timed out",
            "metadata": {
                "creationTimestamp": "2020-04-30T12:54:03Z",
                "name": "etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
                "namespace": "openshift-machine-config-operator",
                "resourceVersion": "146976",
                "selfLink": "/api/v1/namespaces/openshift-machine-config-operator/events/etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
                "uid": "05785c65-fc68-4589-93f1-113481e12d68"
            "reason": "Unhealthy",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "kubelet",
                "host": "ip-10-0-145-13.us-east-2.compute.internal"
            "type": "Warning"


The corresponding etcd server entry for the failure does not seem to be fatal: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/pods/openshift-etcd_etcd-ip-10-0-145-13.us-east-2.compute.internal_etcd.log

2020-04-30 13:07:14.709133 W | etcdserver: read-only range request "key:\"/kubernetes.io/persistentvolumeclaims/e2e-volumemode-9222/pvc-5n2hj\" " with result "range_response_count:1 size:468" took too long (239.823142ms) to execute
2020-04-30 13:07:14.709260 W | etcdserver: read-only range request "key:\"/kubernetes.io/clusterroles/self-access-reviewer\" " with result "range_response_count:1 size:389" took too long (157.091852ms) to execute
2020-04-30 13:07:18.867912 I | embed: rejected connection from "" (error "EOF", ServerName "")
2020-04-30 13:07:24.407740 I | embed: rejected connection from "" (error "EOF", ServerName "")
2020-04-30 13:07:33.167135 I | embed: rejected connection from "" (error "EOF", ServerName "") <======= failed readiness probe spotted
2020-04-30 13:07:43.880351 I | embed: rejected connection from "" (error "EOF", ServerName "")
2020-04-30 13:07:51.364402 I | embed: rejected connection from "" (error "EOF", ServerName "")
2020-04-30 13:08:04.145769 W | etcdserver: request "header:<ID:7238535618199108688 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" mod_revision:149734 > success:<request_delete_range:<key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" > > failure:<>>" with result "size:20" took too long (104.925693ms) to execute
2020-04-30 13:08:04.148476 W | etcdserver: read-only range request "key:\"/kubernetes.io/events/e2e-volume-3403/aws-client.160a9ad3e51fa5fd\" " with result "range_response_count:1 size:524" took too long (114.702437ms) to execute


Looks like etcd is fairly stable, and the cause for failure of the test does not seem to be evident at etcd level.

Comment 3 Sam Batschelet 2020-05-18 15:06:40 UTC
> 2020-04-30 13:07:33.167135 I | embed: rejected connection from "" (error "EOF", ServerName "") <======= failed readiness probe spotted

this is concerning that we can't explain the failure if that indeed from quorum-guard as all members are up.

Comment 5 Dan Mace 2020-05-28 13:43:08 UTC
Because this isn't a 4.5 regression and we don't yet have any clear understanding of the impact, I'm moving the bug to 4.6 so we can continue investigation without blocking the 4.5 release. If somebody has a specific justification for blocking 4.5, please provide details.

Comment 7 Dan Mace 2020-06-05 19:42:28 UTC
Just picked a random recent PR run to start level setting again here:


$ jq '.items[] | select((.involvedObject.name // "" | startswith("etcd-quorum-guard")) and .source.component=="kubelet" and .reason=="Unhealthy") | { "name:": .involvedObject.name, "count": .count, "message": .message }' events.json
  "name:": "etcd-quorum-guard-5988dfbc54-fnhp7",
  "count": 1,
  "message": "Readiness probe failed: "
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 1,
  "message": "Readiness probe failed: command timed out"
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 2,
  "message": "Readiness probe failed: "
  "name:": "etcd-quorum-guard-5988dfbc54-vwvrp",
  "count": 3,
  "message": "Readiness probe failed: "

Haven't tried to do a wider analysis or dig in yet, just wanted to show that it's happening often enough that I was able to easily find it on my first try.

Comment 8 Dan Mace 2020-06-08 14:41:24 UTC
Good example from a recent periodic:


Simple answer is that quorum-guard pods are created a minute prior to the etcd pods, and fail readiness probes until etcd actually comes up.

Given the bootstrap etcd is functional (without quorum) at this time, and quorum is ultimately achieved after master etcd startup, I'm not sure the logs mean much.

Is quorum-guard intended to monitor only the status of the post-bootstrap etcd cluster? If so, maybe quorum-guard is starting too soon. Otherwise we should disregard its failures during bootstrapping, but then how do we easily distinguish those failures from real ones post bootstrapping?

Comment 11 Dan Mace 2020-06-11 18:23:46 UTC
The fix was bad. https://github.com/openshift/machine-config-operator/pull/1818 repairs it.

Comment 12 Kirsten Garrison 2020-06-12 02:17:32 UTC
This is how I verified my PR in CI: https://github.com/openshift/machine-config-operator/pull/1818#issuecomment-643024469

Comment 14 Dan Mace 2020-06-15 14:34:08 UTC
Re-opening again — I think we solved it for new installations, but I don't think the changes are being applied during Y or Z upgrades. See the thread starting at:


Comment 15 Dan Mace 2020-06-16 19:01:16 UTC
Created blocking CVO bug https://bugzilla.redhat.com/show_bug.cgi?id=1847672 to track the upgrade issue.

Comment 18 errata-xmlrpc 2020-10-27 15:58:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.