Bug 1829923

Summary: test: etcd etcd-quorum-guard leaderelection lost
Product: OpenShift Container Platform
Component: Etcd
Version: 4.4
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: high
Reporter: Brett Tofel <btofel>
Assignee: Dan Mace <dmace>
QA Contact: ge liu <geliu>
CC: alpatel, dmace, kgarriso, sbatsche, skolicha, wking
Doc Type: No Doc Update
Environment: test: Overall
Type: Bug
Last Closed: 2020-10-27 15:58:32 UTC
Bug Depends On: 1847672
Bug Blocks: 1845636

Description Brett Tofel 2020-04-30 15:04:34 UTC
"ns/openshift-machine-config-operator pod/etcd-quorum-guard-565f8d849c-klgqh Readiness probe failed: command timed out (19 times)"

"dial tcp [::1]:6443: connect: connection refused\nI0430 12:50:38.672317       1 leaderelection.go:288] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition\nF0430 12:50:38.672352       1 server.go:257] leaderelection lost\n"


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915#1:build-log.txt%3A12804

Reliably failing in "job.initialize" in the redhat-openshift-ocp-release-4.4-informing suite.

Comment 1 Ben Parees 2020-04-30 21:33:42 UTC
search results:
https://search.apps.build01.ci.devcluster.openshift.com/?search=leaderelection+lost&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Lots of "leaderelection lost" errors are being seen in test failures again.

This is happening in e2e-aws-upgrade, which is a generic job used to test various version-to-version upgrades, so it could be due to issues in basically any OCP version.

Comment 2 Alay Patel 2020-05-01 20:52:37 UTC
There is only a single failed "Readiness probe" event reported here: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/events.json

-----
{
            "apiVersion": "v1",
            "count": 20,
            "eventTime": null,
            "firstTimestamp": "2020-04-30T12:54:03Z",
            "involvedObject": {
                "apiVersion": "v1",
                "fieldPath": "spec.containers{guard}",
                "kind": "Pod",
                "name": "etcd-quorum-guard-565f8d849c-klgqh",
                "namespace": "openshift-machine-config-operator",
                "resourceVersion": "8104",
                "uid": "5167f230-7e34-4fb5-a1d5-b2e930b22f16"
            },
            "kind": "Event",
            "lastTimestamp": "2020-04-30T13:07:33Z",
            "message": "Readiness probe failed: command timed out",
            "metadata": {
                "creationTimestamp": "2020-04-30T12:54:03Z",
                "name": "etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
                "namespace": "openshift-machine-config-operator",
                "resourceVersion": "146976",
                "selfLink": "/api/v1/namespaces/openshift-machine-config-operator/events/etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
                "uid": "05785c65-fc68-4589-93f1-113481e12d68"
            },
            "reason": "Unhealthy",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "kubelet",
                "host": "ip-10-0-145-13.us-east-2.compute.internal"
            },
            "type": "Warning"
        },

----
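
For anyone re-checking this locally, a filter along these lines over the downloaded events.json should surface the same event (a rough jq sketch; the local filename is an assumption):

# assumes events.json has been downloaded from the artifact URL above
$ jq '.items[]
      | select(.reason == "Unhealthy"
               and ((.involvedObject.name // "") | startswith("etcd-quorum-guard")))
      | {name: .involvedObject.name, count: .count, first: .firstTimestamp, last: .lastTimestamp}' \
    events.json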

The corresponding etcd server entry for the failure does not seem to be fatal: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/pods/openshift-etcd_etcd-ip-10-0-145-13.us-east-2.compute.internal_etcd.log

----
2020-04-30 13:07:14.709133 W | etcdserver: read-only range request "key:\"/kubernetes.io/persistentvolumeclaims/e2e-volumemode-9222/pvc-5n2hj\" " with result "range_response_count:1 size:468" took too long (239.823142ms) to execute
2020-04-30 13:07:14.709260 W | etcdserver: read-only range request "key:\"/kubernetes.io/clusterroles/self-access-reviewer\" " with result "range_response_count:1 size:389" took too long (157.091852ms) to execute
2020-04-30 13:07:18.867912 I | embed: rejected connection from "10.130.0.8:40554" (error "EOF", ServerName "")
2020-04-30 13:07:24.407740 I | embed: rejected connection from "10.130.0.8:40732" (error "EOF", ServerName "")
2020-04-30 13:07:33.167135 I | embed: rejected connection from "127.0.0.1:57514" (error "EOF", ServerName "") <======= failed readiness probe spotted
2020-04-30 13:07:43.880351 I | embed: rejected connection from "10.130.0.8:41356" (error "EOF", ServerName "")
2020-04-30 13:07:51.364402 I | embed: rejected connection from "10.130.0.8:41572" (error "EOF", ServerName "")
2020-04-30 13:08:04.145769 W | etcdserver: request "header:<ID:7238535618199108688 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" mod_revision:149734 > success:<request_delete_range:<key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" > > failure:<>>" with result "size:20" took too long (104.925693ms) to execute
2020-04-30 13:08:04.148476 W | etcdserver: read-only range request "key:\"/kubernetes.io/events/e2e-volume-3403/aws-client.160a9ad3e51fa5fd\" " with result "range_response_count:1 size:524" took too long (114.702437ms) to execute

----
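
As a side note, the flagged line can be matched against the event's lastTimestamp (13:07:33Z); a minimal sketch, assuming the etcd pod log linked above is saved locally as etcd.log:

# assumes the etcd pod log above was downloaded as etcd.log
$ grep -n '2020-04-30 13:07:3' etcd.log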

Looks like etcd is fairly stable, and the cause of the test failure is not evident at the etcd level.

Comment 3 Sam Batschelet 2020-05-18 15:06:40 UTC
> 2020-04-30 13:07:33.167135 I | embed: rejected connection from "127.0.0.1:57514" (error "EOF", ServerName "") <======= failed readiness probe spotted

It is concerning that we can't explain this failure if that connection is indeed from quorum-guard, since all members are up.

Comment 5 Dan Mace 2020-05-28 13:43:08 UTC
Because this isn't a 4.5 regression and we don't yet have any clear understanding of the impact, I'm moving the bug to 4.6 so we can continue investigation without blocking the 4.5 release. If somebody has a specific justification for blocking 4.5, please provide details.

Comment 7 Dan Mace 2020-06-05 19:42:28 UTC
Just picked a random recent PR run to start level-setting again here:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/367/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1635

$ jq '.items[] | select((.involvedObject.name // "" | startswith("etcd-quorum-guard")) and .source.component=="kubelet" and .reason=="Unhealthy") | { "name:": .involvedObject.name, "count": .count, "message": .message }' events.json
{
  "name:": "etcd-quorum-guard-5988dfbc54-fnhp7",
  "count": 1,
  "message": "Readiness probe failed: "
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 1,
  "message": "Readiness probe failed: command timed out"
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 2,
  "message": "Readiness probe failed: "
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-vwvrp",
  "count": 3,
  "message": "Readiness probe failed: "
}

I haven't tried to do a wider analysis or dig in yet; I just wanted to show that it's happening often enough that I could easily find it on my first try.

Comment 8 Dan Mace 2020-06-08 14:41:24 UTC
Good example from a recent periodic:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/2347

The simple answer is that the quorum-guard pods are created a minute before the etcd pods and fail readiness probes until etcd actually comes up.

Given that the bootstrap etcd is functional (without quorum) at this point, and quorum is ultimately achieved after the master etcd members start up, I'm not sure the logs mean much.

Is quorum-guard intended to monitor only the status of the post-bootstrap etcd cluster? If so, maybe quorum-guard is starting too soon. Otherwise we should disregard its failures during bootstrapping, but then how do we easily distinguish those failures from real ones post bootstrapping?
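
If it helps, one way to sanity-check the timing is to sort the relevant pods by creation time; a rough sketch, assuming oc access to a live cluster and that the quorum-guard pods still live in openshift-machine-config-operator:

# sketch: list etcd-related pods with creation timestamps (assumes live-cluster oc access)
$ oc get pods -A -o json \
    | jq -r '.items[]
             | select((.metadata.namespace == "openshift-etcd"
                       or .metadata.namespace == "openshift-machine-config-operator")
                      and (.metadata.name | startswith("etcd")))
             | "\(.metadata.creationTimestamp)  \(.metadata.namespace)/\(.metadata.name)"' \
    | sort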

Comment 11 Dan Mace 2020-06-11 18:23:46 UTC
The fix was bad. https://github.com/openshift/machine-config-operator/pull/1818 repairs it.

Comment 12 Kirsten Garrison 2020-06-12 02:17:32 UTC
This is how I verified my PR in CI: https://github.com/openshift/machine-config-operator/pull/1818#issuecomment-643024469

Comment 14 Dan Mace 2020-06-15 14:34:08 UTC
Re-opening again — I think we solved it for new installations, but I don't think the changes are being applied during Y or Z upgrades. See the thread starting at:

https://github.com/openshift/machine-config-operator/pull/1818#issuecomment-644121442

Comment 15 Dan Mace 2020-06-16 19:01:16 UTC
Created blocking CVO bug https://bugzilla.redhat.com/show_bug.cgi?id=1847672 to track the upgrade issue.

Comment 18 errata-xmlrpc 2020-10-27 15:58:32 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196