Bug 1805569
Summary: 4.2->4.3 updates occasionally suffer from: failed to connect to {0.0.0.0:2379 0 <nil>}... tls: bad certificate

Product: OpenShift Container Platform
Component: Etcd
Version: 4.3.0
Status: CLOSED DUPLICATE
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Keywords: Upgrades

Reporter: W. Trevor King <wking>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
CC: alpatel, lmohanty, mfojtik, obockows, sbatsche, scuppett, sdodson

Doc Type: If docs needed, set a value
Story Points: ---
Type: Bug
Regression: ---
Last Closed: 2020-04-08 13:39:38 UTC
Description (W. Trevor King, 2020-02-21 03:06:10 UTC)
Created attachment 1664596 [details]
search.sh results from the past 24h, showing the issue only happens on 4.2->4.3
To understand the impact of this issue we need more information on the following questions. It is fine if we do not have answers to a few of these, but we should try to get them. Thanks in advance.

- What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
- What kinds of clusters are impacted by the bug?
- What cluster functionality is degraded while hitting the bug?
- Can this bug cause data loss? Data loss = API server data loss, CRD state information loss, etc.
- Is it possible to recover the cluster from the bug?
  - Is recovery automatic without intervention? I.e. is the buggy condition transient?
  - Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
  - Is recovery possible only after more extensive cluster-admin intervention?
  - Is recovery impossible (bricked cluster)?
- What is the observed rate of failure we see in CI?
- Is there a manual workaround to recover from the bug? What are the manual steps?
- How long before the bug is fixed?
- Is this a regression from a previous set of upgrades, or a known issue?
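As a hedged illustration (not part of the original report): per-log occurrence counts like those listed below could be tallied by scanning each downloaded etcd member log for the "tls: bad certificate" error line. The sample log text here is a stand-in, not the real CI artifacts.

```python
# Sketch: count log lines reporting a TLS bad-certificate rejection.
# The sample below imitates etcd member log output; it is illustrative
# only and does not come from the CI artifacts referenced in this bug.

def count_bad_cert(log_text: str) -> int:
    """Return the number of lines containing a TLS bad-certificate error."""
    return sum(
        1 for line in log_text.splitlines()
        if "tls: bad certificate" in line
    )

sample_log = """\
2020-03-01 05:20:01.1 I | embed: rejected connection from "10.0.141.14:33606" (error "remote error: tls: bad certificate", ServerName "")
2020-03-01 05:20:02.2 I | rafthttp: peer 1be93dba027a5dab became active
2020-03-01 05:20:03.3 I | embed: rejected connection from "10.0.141.14:33610" (error "remote error: tls: bad certificate", ServerName "")
"""

print(count_bad_cert(sample_log))  # 2
```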
6 occurrences over the past 48 hours:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/138
4.2.0-0.nightly-2020-03-01-045748 created at 2020-03-01 05:08:35 +0000 UTC with 98 images to tag release:initial
4.3.0-0.nightly-2020-03-02-014115 created at 2020-03-02 01:44:02 +0000 UTC with 100 images to tag release:latest
11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/138/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-136-94.ec2.internal_etcd-member.log
16 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/138/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-141-14.ec2.internal_etcd-member.log
13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/138/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-151-209.ec2.internal_etcd-member.log

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19611
4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
4.3.0-0.nightly-2020-03-01-020430 created at 2020-03-01 02:06:21 +0000 UTC with 100 images to tag release:latest
10 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19611/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-137-126.us-west-2.compute.internal_etcd-member.log
15 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19611/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-142-92.us-west-2.compute.internal_etcd-member.log
13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19611/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-152-14.us-west-2.compute.internal_etcd-member.log

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19644
4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
4.3.0-0.nightly-2020-03-01-115610 created at 2020-03-01 12:01:11 +0000 UTC with 100 images to tag release:latest
13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19644/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-130-220.us-west-2.compute.internal_etcd-member.log
11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19644/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-136-3.us-west-2.compute.internal_etcd-member.log
16 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19644/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-155-169.us-west-2.compute.internal_etcd-member.log

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19662
4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
4.3.0-0.nightly-2020-03-01-194304 created at 2020-03-01 19:44:41 +0000 UTC with 100 images to tag release:latest
11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19662/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-128-148.us-west-2.compute.internal_etcd-member.log
16 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19662/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-137-39.us-west-2.compute.internal_etcd-member.log
13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19662/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-145-115.us-west-2.compute.internal_etcd-member.log

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19713
4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
4.3.0-0.nightly-2020-03-02-115939 created at 2020-03-02 12:01:16 +0000 UTC with 100 images to tag release:latest
11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19713/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-131-232.us-west-2.compute.internal_etcd-member.log
16 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19713/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-141-155.us-west-2.compute.internal_etcd-member.log
13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19713/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-154-4.us-west-2.compute.internal_etcd-member.log

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19751
4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
4.3.0-0.ci-2020-03-02-170302 created at 2020-03-02 17:07:40 +0000 UTC with 101 images to tag release:latest
17 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19751/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-131-21.us-west-2.compute.internal_etcd-member.log
13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19751/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-140-126.us-west-2.compute.internal_etcd-member.log
11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19751/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-159-147.us-west-2.compute.internal_etcd-member.log

@wking, thanks for the report. Digging deeper into the logs, as you noted, the etcd upgrade rollout started from

    2020-02-19T20:57:06Z True ip-10-0-131-197.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-master-2

to

    2020-02-19T21:06:04Z True ip-10-0-154-122.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-master-1

This is reflected in etcd; the last etcd node to start, etcd-1, became active at

    2020-02-19 21:06:01.326620 I | rafthttp: peer 1be93dba027a5dab became active

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-131-197.us-east-2.compute.internal_etcd-member.log

The kube-apiserver logs reflect the same timeline; the error connecting to the etcd-1 node was not observed after 21:05:58.744010:

    W0219 21:05:57.884268 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ci-op-zxiyfz1z-77109.origin-ci-int-aws.dev.rhcloud.com:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.154.122:2379: connect: connection refused". Reconnecting...
    W0219 21:05:58.744010 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ci-op-zxiyfz1z-77109.origin-ci-int-aws.dev.rhcloud.com:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.154.122:2379: connect: connection refused". Reconnecting...
    I0219 21:06:00.179737 1 controller.go:107] OpenAPI AggregationController: Processing item v1.quota.openshift.io
    I0219 21:06:34.316401 1 controller.go:107] OpenAPI AggregationController: Processing item v1.apps.openshift.io

However, the errors in the test failure run from 20:36:29.713 to 21:02:25.825, which is before the etcd upgrade actually starts:

    Feb 19 20:36:29.713 W ns/openshift-kube-scheduler pod/installer-7-ip-10-0-154-122.us-east-2.compute.internal MountVolume.SetUp failed for volume "installer-sa-token-stwf9" : couldn't propagate object cache: timed out waiting for the condition
    Feb 19 20:36:30.107 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator multiple versions found, probably in transition: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7786c12b230c2ced5f19a18ba22325a2ae4b0a577db5d5d9271021ee3123d304,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:969a12b60852a6b295cd496807fff5395cc6dfc72ff8aee015274dad8f7de3e6 (23 times)
    ...
    Feb 19 21:02:25.631 I ns/openshift-operator-lifecycle-manager clusterserviceversion/packageserver APIServices not installed (6 times)
    Feb 19 21:02:25.825 W ns/openshift-marketplace pod/community-operators-857659c86d-bldc7 Liveness probe failed: command timed out (2 times)
    Feb 19 21:02:26.221 I ns/openshift-operator-lifecycle-manager clusterserviceversion/packageserver APIServices not installed (7 times)

Hence, etcd seems to be running and kube-apiserver is able to communicate with etcd; it is not clear what suggests that etcd is the reason the tests are failing. The client bad-certificate errors are unfortunate, but they do not indicate that etcd is down. We attempted to fix this in 4.3 here: https://github.com/openshift/kubecsr/pull/17, but have not backported it to 4.2 yet.

So we can close this as a dup of bug 1769992?

*** This bug has been marked as a duplicate of bug 1769992 ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
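The timeline argument in the comment above (that the apiserver's reconnect errors to etcd-1 stop before the upgraded member reports its peer as active, so the errors track the rollout rather than an etcd outage) can be sketched as a small check. This is an illustration only; the timestamps are copied from the log excerpts quoted above, and the message strings are simplified stand-ins.

```python
# Sketch: verify that the last apiserver "failed to connect" error to
# etcd-1 precedes the moment the etcd-1 peer became active. Timestamps
# come from the quoted log excerpts; messages are simplified stand-ins.
from datetime import datetime

FMT = "%H:%M:%S.%f"

# kube-apiserver events as (timestamp, simplified message) pairs.
apiserver_events = [
    ("21:05:57.884268", "failed to connect to etcd-1: connection refused"),
    ("21:05:58.744010", "failed to connect to etcd-1: connection refused"),
    ("21:06:00.179737", "OpenAPI AggregationController: Processing item"),
]

# etcd member log: peer 1be93dba027a5dab became active.
etcd1_active = datetime.strptime("21:06:01.326620", FMT)

last_error = max(
    datetime.strptime(ts, FMT)
    for ts, msg in apiserver_events
    if "failed to connect" in msg
)

# True: the last connection error precedes etcd-1 becoming active.
print(last_error < etcd1_active)
```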