Bug 1829923 - test: etcd etcd-quorum-guard leaderelection lost
Summary: test: etcd etcd-quorum-guard leaderelection lost
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: ---
: 4.6.0
Assignee: Dan Mace
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 1847672
Blocks: 1845636
TreeView+ depends on / blocked
 
Reported: 2020-04-30 15:04 UTC by Brett Tofel
Modified: 2020-10-08 13:42 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
test: Overall
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 383 None closed Bug 1847672: Expand supported set of probe field mutations 2020-10-19 19:58:47 UTC
Github openshift machine-config-operator pull 1797 None closed Bug 1829923: Fix quorum-guard timeouts 2020-10-19 19:58:59 UTC
Github openshift machine-config-operator pull 1818 None closed Bug 1829923: etcd-quorum-guard: update containers section in deployment.yaml 2020-10-19 19:58:47 UTC

Description Brett Tofel 2020-04-30 15:04:34 UTC
"ns/openshift-machine-config-operator pod/etcd-quorum-guard-565f8d849c-klgqh Readiness probe failed: command timed out (19 times)"

"dial tcp [::1]:6443: connect: connection refused\nI0430 12:50:38.672317       1 leaderelection.go:288] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition\nF0430 12:50:38.672352       1 server.go:257] leaderelection lost\n"


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915#1:build-log.txt%3A12804

Reliably failing in "job.initialize" in redhat-openshift-ocp-release-4.4-informing suite.

Comment 1 Ben Parees 2020-04-30 21:33:42 UTC
search results:
https://search.apps.build01.ci.devcluster.openshift.com/?search=leaderelection+lost&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

lots of leader election lost being seen in test failures again.

happening in e2e-aws-upgrade which is a generic job used to test various version to version upgrades, so this could be due to issues in basically any ocp version.

Comment 2 Alay Patel 2020-05-01 20:52:37 UTC
There is only a single event of failed "Readiness probe" reported here: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/events.json

-----
{
            "apiVersion": "v1",
            "count": 20,
            "eventTime": null,
            "firstTimestamp": "2020-04-30T12:54:03Z",
            "involvedObject": {
                "apiVersion": "v1",
                "fieldPath": "spec.containers{guard}",
                "kind": "Pod",
                "name": "etcd-quorum-guard-565f8d849c-klgqh",
                "namespace": "openshift-machine-config-operator",
                "resourceVersion": "8104",
                "uid": "5167f230-7e34-4fb5-a1d5-b2e930b22f16"
            },
            "kind": "Event",
            "lastTimestamp": "2020-04-30T13:07:33Z",
            "message": "Readiness probe failed: command timed out",
            "metadata": {
                "creationTimestamp": "2020-04-30T12:54:03Z",
                "name": "etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
                "namespace": "openshift-machine-config-operator",
                "resourceVersion": "146976",
                "selfLink": "/api/v1/namespaces/openshift-machine-config-operator/events/etcd-quorum-guard-565f8d849c-klgqh.160a9a173b11309f",
                "uid": "05785c65-fc68-4589-93f1-113481e12d68"
            },
            "reason": "Unhealthy",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "kubelet",
                "host": "ip-10-0-145-13.us-east-2.compute.internal"
            },
            "type": "Warning"
        },

----

The corresponding etcd server entry for the failure does not seem to be fatal: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/9915/artifacts/e2e-aws/pods/openshift-etcd_etcd-ip-10-0-145-13.us-east-2.compute.internal_etcd.log

----
2020-04-30 13:07:14.709133 W | etcdserver: read-only range request "key:\"/kubernetes.io/persistentvolumeclaims/e2e-volumemode-9222/pvc-5n2hj\" " with result "range_response_count:1 size:468" took too long (239.823142ms) to execute
2020-04-30 13:07:14.709260 W | etcdserver: read-only range request "key:\"/kubernetes.io/clusterroles/self-access-reviewer\" " with result "range_response_count:1 size:389" took too long (157.091852ms) to execute
2020-04-30 13:07:18.867912 I | embed: rejected connection from "10.130.0.8:40554" (error "EOF", ServerName "")
2020-04-30 13:07:24.407740 I | embed: rejected connection from "10.130.0.8:40732" (error "EOF", ServerName "")
2020-04-30 13:07:33.167135 I | embed: rejected connection from "127.0.0.1:57514" (error "EOF", ServerName "") <======= failed readiness probe spotted
2020-04-30 13:07:43.880351 I | embed: rejected connection from "10.130.0.8:41356" (error "EOF", ServerName "")
2020-04-30 13:07:51.364402 I | embed: rejected connection from "10.130.0.8:41572" (error "EOF", ServerName "")
2020-04-30 13:08:04.145769 W | etcdserver: request "header:<ID:7238535618199108688 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" mod_revision:149734 > success:<request_delete_range:<key:\"/kubernetes.io/events/e2e-provisioning-9081/pod-subpath-test-local-preprovisionedpv-5flf.160a9ad91ea34a1a\" > > failure:<>>" with result "size:20" took too long (104.925693ms) to execute
2020-04-30 13:08:04.148476 W | etcdserver: read-only range request "key:\"/kubernetes.io/events/e2e-volume-3403/aws-client.160a9ad3e51fa5fd\" " with result "range_response_count:1 size:524" took too long (114.702437ms) to execute

----

Looks like etcd is fairly stable, and the cause for failure of the test does not seem to be evident at etcd level.

Comment 3 Sam Batschelet 2020-05-18 15:06:40 UTC
> 2020-04-30 13:07:33.167135 I | embed: rejected connection from "127.0.0.1:57514" (error "EOF", ServerName "") <======= failed readiness probe spotted

this is concerning that we can't explain the failure if that indeed from quorum-guard as all members are up.

Comment 5 Dan Mace 2020-05-28 13:43:08 UTC
Because this isn't a 4.5 regression and we don't yet have any clear understanding of the impact, I'm moving the bug to 4.6 so we can continue investigation without blocking the 4.5 release. If somebody has a specific justification for blocking 4.5, please provide details.

Comment 7 Dan Mace 2020-06-05 19:42:28 UTC
Just picked a random recent PR run to start level setting again here:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/367/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1635

$ jq '.items[] | select((.involvedObject.name // "" | startswith("etcd-quorum-guard")) and .source.component=="kubelet" and .reason=="Unhealthy") | { "name:": .involvedObject.name, "count": .count, "message": .message }' events.json
{
  "name:": "etcd-quorum-guard-5988dfbc54-fnhp7",
  "count": 1,
  "message": "Readiness probe failed: "
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 1,
  "message": "Readiness probe failed: command timed out"
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-qwvwp",
  "count": 2,
  "message": "Readiness probe failed: "
}
{
  "name:": "etcd-quorum-guard-5988dfbc54-vwvrp",
  "count": 3,
  "message": "Readiness probe failed: "
}

Haven't tried to do a wider analysis or dig in yet, just wanted to show that it's happening often enough that I was able to easily find it on my first try.

Comment 8 Dan Mace 2020-06-08 14:41:24 UTC
Good example from a recent periodic:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/2347

Simple answer is that quorum-guard pods are created a minute prior to the etcd pods, and fail readiness probes until etcd actually comes up.

Given the bootstrap etcd is functional (without quorum) at this time, and quorum is ultimately achieved after master etcd startup, I'm not sure the logs mean much.

Is quorum-guard intended to monitor only the status of the post-bootstrap etcd cluster? If so, maybe quorum-guard is starting too soon. Otherwise we should disregard its failures during bootstrapping, but then how do we easily distinguish those failures from real ones post bootstrapping?

Comment 11 Dan Mace 2020-06-11 18:23:46 UTC
The fix was bad. https://github.com/openshift/machine-config-operator/pull/1818 repairs it.

Comment 12 Kirsten Garrison 2020-06-12 02:17:32 UTC
This is how I verified my PR in CI: https://github.com/openshift/machine-config-operator/pull/1818#issuecomment-643024469

Comment 14 Dan Mace 2020-06-15 14:34:08 UTC
Re-opening again — I think we solved it for new installations, but I don't think the changes are being applied during Y or Z upgrades. See the thread starting at:

https://github.com/openshift/machine-config-operator/pull/1818#issuecomment-644121442

Comment 15 Dan Mace 2020-06-16 19:01:16 UTC
Created blocking CVO bug https://bugzilla.redhat.com/show_bug.cgi?id=1847672 to track the upgrade issue.


Note You need to log in before you can comment on or make changes to this bug.