Bug 1805569

Summary: 4.2->4.3 updates occasionally suffer from: failed to connect to {0.0.0.0:2379 0 <nil>}... tls: bad certificate
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: urgent
Priority: urgent
Version: 4.3.0
CC: alpatel, lmohanty, mfojtik, obockows, sbatsche, scuppett, sdodson
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.5.0
Last Closed: 2020-04-08 13:39:38 UTC
Type: Bug
Attachments:
  search.sh, for finding any release (non-PR) instances of this in the past 48h of CI
  search.sh results from the past 24h, showing the issue only happens on 4.2->4.3

Description W. Trevor King 2020-02-21 03:06:10 UTC
Created attachment 1664595 [details]
search.sh, for finding any release (non-PR) instances of this in the past 48h of CI

For example, from 4.2.20 -> 4.3.3 on AWS [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-131-197.us-east-2.compute.internal_etcd-member.log | grep 'bad certificate'
WARNING: 2020/02/19 20:57:03 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 20:57:04 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 20:57:06 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 20:57:08 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 20:57:13 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 20:57:20 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 20:57:32 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 20:57:51 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 20:58:18 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 20:59:08 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 21:00:05 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 21:01:54 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 21:03:32 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 21:05:47 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2020/02/19 21:07:39 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-142-173.us-east-2.compute.internal_etcd-member.log | grep 'bad certificate'
WARNING: 2020/02/19 21:01:25 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
...
WARNING: 2020/02/19 21:08:05 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-154-122.us-east-2.compute.internal_etcd-member.log | grep 'bad certificate'
WARNING: 2020/02/19 21:06:02 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
...
WARNING: 2020/02/19 21:07:50 grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...

That's all happening in the middle of the node rollout:

2020-02-19T20:56:05Z True ip-10-0-138-140.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-worker-us-east-2a-lzhwb
2020-02-19T20:57:06Z True ip-10-0-131-197.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-master-2
2020-02-19T20:59:52Z True ip-10-0-139-53.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-worker-us-east-2a-9n76v
2020-02-19T21:01:29Z True ip-10-0-142-173.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-master-0
2020-02-19T21:03:25Z True ip-10-0-157-236.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-worker-us-east-2b-pwps7
2020-02-19T21:06:04Z True ip-10-0-154-122.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-master-1

With the attached script, you can find all instances from the past 24h; those results are also attached.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082
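
For reference, the grep above can be wrapped in a short loop; this is only a minimal sketch (not the attached search.sh), using the three etcd-member pod logs from run 18082:

base=https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082/artifacts/e2e-aws-upgrade/pods
for member in \
    openshift-etcd_etcd-member-ip-10-0-131-197.us-east-2.compute.internal_etcd-member.log \
    openshift-etcd_etcd-member-ip-10-0-142-173.us-east-2.compute.internal_etcd-member.log \
    openshift-etcd_etcd-member-ip-10-0-154-122.us-east-2.compute.internal_etcd-member.log; do
  # count the bad-certificate warnings in each member's log
  echo "$(curl -s "$base/$member" | grep -c 'bad certificate') $member"
done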

Comment 1 W. Trevor King 2020-02-21 03:07:05 UTC
Created attachment 1664596 [details]
search.sh results from the past 24h, showing the issue only happens on 4.2->4.3

Comment 3 Lalatendu Mohanty 2020-02-27 18:05:53 UTC
To understand the impact of this issue we need more information on the following questions. It is fine if we do not have answers to a few of these, but we should try to get them. Thanks in advance.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
What kind of clusters are impacted because of the bug?
What cluster functionality is degraded while hitting the bug?
Can this bug cause data loss?
  Data loss = API server data loss, CRD state information loss, etc.
Is it possible to recover the cluster from the bug?
  Is recovery automatic without intervention? I.e. is the buggy condition transient?
  Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
  Is recovery possible only after more extensive cluster-admin intervention?
  Is recovery impossible (bricked cluster)?
What is the observed rate of failure we see in CI?
Is there a manual workaround that exists to recover from the bug?
  What are the manual steps?
How long before the bug is fixed?

Comment 5 Clayton Coleman 2020-03-02 15:51:23 UTC
Is this a regression from a previous set of upgrades, or a known issue?

Comment 6 Scott Dodson 2020-03-02 21:20:44 UTC
6 occurrences over the past 48hrs

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/138
  4.2.0-0.nightly-2020-03-01-045748 created at 2020-03-01 05:08:35 +0000 UTC with 98 images to tag release:initial
  4.3.0-0.nightly-2020-03-02-014115 created at 2020-03-02 01:44:02 +0000 UTC with 100 images to tag release:latest
  11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/138/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-136-94.ec2.internal_etcd-member.log
  16 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/138/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-141-14.ec2.internal_etcd-member.log
  13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/138/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-151-209.ec2.internal_etcd-member.log
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19611
  4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
  4.3.0-0.nightly-2020-03-01-020430 created at 2020-03-01 02:06:21 +0000 UTC with 100 images to tag release:latest
  10 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19611/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-137-126.us-west-2.compute.internal_etcd-member.log
  15 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19611/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-142-92.us-west-2.compute.internal_etcd-member.log
  13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19611/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-152-14.us-west-2.compute.internal_etcd-member.log
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19644
  4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
  4.3.0-0.nightly-2020-03-01-115610 created at 2020-03-01 12:01:11 +0000 UTC with 100 images to tag release:latest
  13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19644/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-130-220.us-west-2.compute.internal_etcd-member.log
  11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19644/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-136-3.us-west-2.compute.internal_etcd-member.log
  16 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19644/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-155-169.us-west-2.compute.internal_etcd-member.log
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19662
  4.3.0-0.nightly-2020-03-01-194304 created at 2020-03-01 19:44:41 +0000 UTC with 100 images to tag release:latest
  4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
  11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19662/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-128-148.us-west-2.compute.internal_etcd-member.log
  16 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19662/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-137-39.us-west-2.compute.internal_etcd-member.log
  13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19662/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-145-115.us-west-2.compute.internal_etcd-member.log
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19713
  4.3.0-0.nightly-2020-03-02-115939 created at 2020-03-02 12:01:16 +0000 UTC with 100 images to tag release:latest
  4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
  11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19713/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-131-232.us-west-2.compute.internal_etcd-member.log
  16 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19713/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-141-155.us-west-2.compute.internal_etcd-member.log
  13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19713/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-154-4.us-west-2.compute.internal_etcd-member.log
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19751
  4.3.0-0.ci-2020-03-02-170302 created at 2020-03-02 17:07:40 +0000 UTC with 101 images to tag release:latest
  4.2.21 created at 2020-02-26 12:25:24 +0000 UTC with 98 images to tag release:initial
  17 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19751/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-131-21.us-west-2.compute.internal_etcd-member.log
  13 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19751/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-140-126.us-west-2.compute.internal_etcd-member.log
  11 https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/19751/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-159-147.us-west-2.compute.internal_etcd-member.log
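
Assuming the leading number on each log URL above is the count of bad-certificate warnings in that log, a single count can be reproduced with (URL taken from the first entry above):

curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/138/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-136-94.ec2.internal_etcd-member.log | grep -c 'bad certificate'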

Comment 7 Alay Patel 2020-03-02 21:48:23 UTC
@wking, thanks for the report. Digging deeper into the logs: as you noted, the etcd upgrade rollout ran from
 
--> 2020-02-19T20:57:06Z True ip-10-0-131-197.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-master-2
 
to
 
--> 2020-02-19T21:06:04Z True ip-10-0-154-122.us-east-2.compute.internal openshift-machine-api/ci-op-zxiyfz1z-77109-w846g-master-1
 
This is reflected in etcd; the last etcd member to come back up, etcd-1, became active at
 
-- > 2020-02-19 21:06:01.326620 I | rafthttp: peer 1be93dba027a5dab became active, https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18082/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-131-197.us-east-2.compute.internal_etcd-member.log
 
The kube-apiserver logs reflect the same timeline; the error connecting to the etcd-1 node was no longer observed after 21:05:58.744010:
 
--> W0219 21:05:57.884268       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ci-op-zxiyfz1z-77109.origin-ci-int-aws.dev.rhcloud.com:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.154.122:2379: connect: connection refused". Reconnecting...
W0219 21:05:58.744010       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ci-op-zxiyfz1z-77109.origin-ci-int-aws.dev.rhcloud.com:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.154.122:2379: connect: connection refused". Reconnecting...
I0219 21:06:00.179737       1 controller.go:107] OpenAPI AggregationController: Processing item v1.quota.openshift.io
I0219 21:06:34.316401       1 controller.go:107] OpenAPI AggregationController: Processing item v1.apps.openshift.io
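
For anyone retracing this, a minimal sketch of the check described above; $APISERVER_LOG is a placeholder for a kube-apiserver pod log downloaded from the run's artifacts:

# print the last connection error to etcd-1 logged by the kube-apiserver
grep 'failed to connect to {https://etcd-1' "$APISERVER_LOG" | tail -n 1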
 
 
However, the errors in the test failure run from `20:36:29.713` to `21:02:25.825`, which begins before the etcd upgrade actually starts.
 
 
Feb 19 20:36:29.713 W ns/openshift-kube-scheduler pod/installer-7-ip-10-0-154-122.us-east-2.compute.internal MountVolume.SetUp failed for volume "installer-sa-token-stwf9" : couldn't propagate object cache: timed out waiting for the condition
Feb 19 20:36:30.107 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator multiple versions found, probably in transition: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7786c12b230c2ced5f19a18ba22325a2ae4b0a577db5d5d9271021ee3123d304,quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:969a12b60852a6b295cd496807fff5395cc6dfc72ff8aee015274dad8f7de3e6 (23 times)
...
...
...
Feb 19 21:02:25.631 I ns/openshift-operator-lifecycle-manager clusterserviceversion/packageserver APIServices not installed (6 times)
Feb 19 21:02:25.825 W ns/openshift-marketplace pod/community-operators-857659c86d-bldc7 Liveness probe failed: command timed out (2 times)
Feb 19 21:02:26.221 I ns/openshift-operator-lifecycle-manager clusterserviceversion/packageserver APIServices not installed (7 times)
 
Hence, etcd appears to be running and the kube-apiserver is able to communicate with etcd; it is not clear what suggests that etcd is the reason the tests are failing.
 
The client bad-certificate errors are unfortunate, but they do not indicate that etcd is down. We attempted to fix this in 4.3 with https://github.com/openshift/kubecsr/pull/17, but it has not been backported to 4.2 yet.

Comment 9 W. Trevor King 2020-04-06 19:07:37 UTC
So we can close this as a dup of bug 1769992?

Comment 10 Sam Batschelet 2020-04-08 13:39:38 UTC

*** This bug has been marked as a duplicate of bug 1769992 ***

Comment 13 Red Hat Bugzilla 2023-09-14 05:53:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days