"ns/openshift-machine-config-operator pod/etcd-quorum-guard-565f8d849c-klgqh Readiness probe failed: command timed out (19 times)" "dial tcp [::1]:6443: connect: connection refused\nI0430 12:50:38.672317 1 leaderelection.go:288] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition\nF0430 12:50:38.672352 1 server.go:257] leaderelection lost\n" https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915#1:build-log.txt%3A12804 Reliably failing in "job.initialize" in redhat-openshift-ocp-release-4.4-informing suite.
Search results: https://search.apps.build01.ci.devcluster.openshift.com/?search=leaderelection+lost&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Lots of "leaderelection lost" failures are showing up in test results again. They are happening in e2e-aws-upgrade, which is a generic job used to test various version-to-version upgrades, so this could be due to issues in basically any OCP version.
There is only a single failed "Readiness probe" event reported here: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/events.json

----
{
  "apiVersion": "v1",
  "count": 20,
  "eventTime": null,
  "firstTimestamp": "2020-04-30T12:54:03Z",
  "involvedObject": {
    "apiVersion": "v1",
    "fieldPath": "spec.containers{guard}",
    "kind": "Pod",
    "name": "etcd-quorum-guard-565f8d849c-klgqh",
    "namespace": "openshift-machine-config-operator",
    "resourceVersion": "8104",
    "uid": "5167f230-7e34-4fb5-a1d5-b2e930b22f16"
  },
  "kind": "Event",
  "lastTimestamp": "2020-04-30T13:07:33Z",
  "message": "Readiness probe failed: command timed out",
  "metadata": {
    "creationTimestamp": "2020-04-30T12:54:03Z",
    "name": "etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
    "namespace": "openshift-machine-config-operator",
    "resourceVersion": "146976",
    "selfLink": "/api/v1/namespaces/openshift-machine-config-operator/events/etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
    "uid": "05785c65-fc68-4589-93f1-113481e12d68"
  },
  "reason": "Unhealthy",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "kubelet",
    "host": "ip-10-0-145-13.us-east-2.compute.internal"
  },
  "type": "Warning"
}
----

The corresponding etcd server entries around the failure do not seem to be fatal: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/pods/openshift-etcd_etcd-ip-10-0-145-13.us-east-2.compute.internal_etcd.log

----
2020-04-30 13:07:14.709133 W | etcdserver: read-only range request "key:\"/kubernetes.io/persistentvolumeclaims/e2e-volumemode-9222/pvc-5n2hj\" " with result "range_response_count:1 size:468" took too long (239.823142ms) to execute
2020-04-30 13:07:14.709260 W | etcdserver: read-only range request "key:\"/kubernetes.io/clusterroles/self-access-reviewer\" " with result "range_response_count:1 size:389" took too long (157.091852ms) to execute
2020-04-30 13:07:18.867912 I | embed: rejected connection from "10.130.0.8:40554" (error "EOF", ServerName "")
2020-04-30 13:07:24.407740 I | embed: rejected connection from "10.130.0.8:40732" (error "EOF", ServerName "")
2020-04-30 13:07:33.167135 I | embed: rejected connection from "127.0.0.1:57514" (error "EOF", ServerName "")   <======= failed readiness probe spotted
2020-04-30 13:07:43.880351 I | embed: rejected connection from "10.130.0.8:41356" (error "EOF", ServerName "")
2020-04-30 13:07:51.364402 I | embed: rejected connection from "10.130.0.8:41572" (error "EOF", ServerName "")
2020-04-30 13:08:04.145769 W | etcdserver: request "header:<ID:7238535618199108688 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" mod_revision:149734 > success:<request_delete_range:<key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" > > failure:<>>" with result "size:20" took too long (104.925693ms) to execute
2020-04-30 13:08:04.148476 W | etcdserver: read-only range request "key:\"/kubernetes.io/events/e2e-volume-3403/aws-client.160a9ad3e51fa5fd\" " with result "range_response_count:1 size:524" took too long (114.702437ms) to execute
----

Looks like etcd is fairly stable, and the cause of the test failure does not seem to be evident at the etcd level.
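For anyone else triaging these artifacts, a jq filter along the following lines pulls just the quorum-guard readiness-probe events out of events.json together with their timestamps. This is only a sketch: it assumes the artifact is the usual v1 EventList dump with an .items array, and the field names are taken from the event shown above.

$ jq '.items[]
      # keep only kubelet "Unhealthy" events for the quorum-guard pods
      | select((.involvedObject.name // "" | startswith("etcd-quorum-guard"))
               and .source.component == "kubelet"
               and .reason == "Unhealthy")
      | { "name": .involvedObject.name, "count": .count,
          "first": .firstTimestamp, "last": .lastTimestamp,
          "message": .message }' events.json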
> 2020-04-30 13:07:33.167135 I | embed: rejected connection from "127.0.0.1:57514" (error "EOF", ServerName "")   <======= failed readiness probe spotted

It is concerning that we can't explain this failure, if that connection is indeed from the quorum guard, since all members are up.
Because this isn't a 4.5 regression and we don't yet have any clear understanding of the impact, I'm moving the bug to 4.6 so we can continue investigation without blocking the 4.5 release. If somebody has a specific justification for blocking 4.5, please provide details.
Just picked a random recent PR run to start level setting again here: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/367/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1635

$ jq '.items[] | select((.involvedObject.name // "" | startswith("etcd-quorum-guard")) and .source.component=="kubelet" and .reason=="Unhealthy") | { "name:": .involvedObject.name, "count": .count, "message": .message }' events.json
{
  "name:": "etcd-quorum-guard-5988dfbc54-fnhp7",
  "count": 1,
  "message": "Readiness probe failed: "
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 1,
  "message": "Readiness probe failed: command timed out"
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 2,
  "message": "Readiness probe failed: "
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-vwvrp",
  "count": 3,
  "message": "Readiness probe failed: "
}

Haven't tried to do a wider analysis or dig in yet, just wanted to show that it's happening often enough that I was able to easily find it on my first try.
Good example from a recent periodic: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/2347

The simple answer is that the quorum-guard pods are created a minute before the etcd pods and fail readiness probes until etcd actually comes up. Given that the bootstrap etcd is functional (without quorum) at this point, and quorum is ultimately achieved after the master etcd members start, I'm not sure these logs mean much. Is quorum-guard intended to monitor only the status of the post-bootstrap etcd cluster? If so, maybe quorum-guard is starting too soon. Otherwise we should disregard its failures during bootstrapping, but then how do we easily distinguish those failures from real ones after bootstrapping?
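To sanity-check the timing described above, something along these lines against the same events.json artifact should line up the kubelet "Started" events for the quorum-guard pods next to those for the etcd pods in openshift-etcd, sorted by timestamp. This is a rough sketch: it assumes the standard kubelet event reasons and the EventList layout shown earlier in this bug, and the "Started" events may not always survive in the artifact.

$ jq -r '.items[]
        # keep "Started" events for quorum-guard pods and for etcd pods in openshift-etcd
        | select(.reason == "Started"
                 and ((.involvedObject.name // "" | startswith("etcd-quorum-guard"))
                      or ((.involvedObject.name // "" | startswith("etcd-"))
                          and .involvedObject.namespace == "openshift-etcd")))
        # emit timestamp, namespace, pod name as tab-separated columns
        | [.firstTimestamp, .involvedObject.namespace, .involvedObject.name]
        | @tsv' events.json | sort

A roughly one-minute gap between the first quorum-guard line and the first openshift-etcd line would confirm the window in which the readiness-probe failures are expected noise.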
The fix was bad. https://github.com/openshift/machine-config-operator/pull/1818 repairs it.
This is how I verified my PR in CI: https://github.com/openshift/machine-config-operator/pull/1818#issuecomment-643024469
Re-opening again — I think we solved it for new installations, but I don't think the changes are being applied during Y or Z upgrades. See the thread starting at: https://github.com/openshift/machine-config-operator/pull/1818#issuecomment-644121442
Created blocking CVO bug https://bugzilla.redhat.com/show_bug.cgi?id=1847672 to track the upgrade issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196