1797796 – Cluster etcd operator cannot talk to bootstrap pod because of auth failures

Summary: Cluster etcd operator cannot talk to bootstrap pod because of auth failures

Keywords:
Status:	CLOSED DUPLICATE of bug 1808060
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1771572

Reported:	2020-02-03 21:29 UTC by Alay Patel
Modified:	2020-04-15 00:16 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Release Note
Doc Text:	When there are multiple networks, it is important to remember the following guidelines: 1) bootkube has to be populated with a BOOTSTRAP_IP in the same subnet as the masters; 2) storage URLs in kube-apiserver also have to use an IP from the same subnet as the masters, or the cert signer has to produce certs with all IPs included in the SAN.
Clone Of:
Environment:
Last Closed:	2020-03-10 16:22:23 UTC
Target Upstream Version:
Embargoed:
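
The two guidelines in the Doc Text field can be checked mechanically. The following is a minimal sketch, not part of the installer or bootkube: the function names and the master CIDR `10.0.0.0/24` are assumptions made for illustration, and the SAN list is passed in by hand rather than read from a real certificate.

```python
import ipaddress


def in_master_subnet(ip, master_cidr):
    """Return True if ip falls inside the masters' subnet (guideline 1)."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(master_cidr)


def ips_missing_from_san(endpoint_ips, san_ips):
    """Return the endpoint IPs that the cert's SAN list does not cover (guideline 2)."""
    covered = {ipaddress.ip_address(ip) for ip in san_ips}
    return [ip for ip in endpoint_ips if ipaddress.ip_address(ip) not in covered]


if __name__ == "__main__":
    # Hypothetical master subnet; the bug report does not state the real one.
    print(in_master_subnet("10.0.0.5", "10.0.0.0/24"))
    print(ips_missing_from_san(["10.0.0.5", "10.0.0.6"], ["10.0.0.6"]))
```

If `ips_missing_from_san` returns a non-empty list for the addresses a client dials, an x509 verification failure on those addresses is the expected outcome.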

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift installer pull 3175	0	None	closed	bug 1807169: use localhost for bootstrap IP until bootkube is fixed	2020-05-27 03:25:33 UTC

Description Alay Patel 2020-02-03 21:29:03 UTC

Description of problem:

In 4.4, the cluster-etcd-operator (CEO) scales the etcd cluster from the bootstrap node to a 4-member control plane (the bootstrap member plus 3 etcd pods, one per master). Sometimes the scaling times out because the CEO pod is not able to talk to the bootstrap etcd in order to add the other etcd nodes as members. The error from the operator logs is:

------
I0201 18:29:56.506190       1 util.go:37] checking against etcd-2.ci-op-1yrd4g86-e4498.origin-ci-int-gce.dev.openshift.com.
W0201 18:29:57.079291       1 clientconn.go:1156] grpc: addrConn.createTransport failed to connect to {https://10.0.0.5:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority". Reconnecting...



How reproducible:
This is probably a major component of bootstrapping failures in CI. Grep for "Err :connection" in [1][2][3].
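
For scanning a downloaded operator log, the grep can be narrowed to x509 handshake failures. This is a minimal sketch that assumes the log lines have the same shape as the one quoted in the description; the regular expression and the function name `find_x509_failures` are illustrative and not part of any operator tooling.

```python
import re

# Matches the gRPC reconnect warning quoted in the description and captures
# the endpoint being dialed plus the quoted error description.
_CONN_ERR = re.compile(
    r'failed to connect to \{(?P<endpoint>\S+) .*?\}\. '
    r'Err :connection error: desc = "(?P<desc>.*?)"'
)


def find_x509_failures(log_text):
    """Return (endpoint, description) pairs for x509 connection errors."""
    hits = []
    for line in log_text.splitlines():
        match = _CONN_ERR.search(line)
        if match and "x509" in match.group("desc"):
            hits.append((match.group("endpoint"), match.group("desc")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_log.py < operator.log
    for endpoint, desc in find_x509_failures(sys.stdin.read()):
        print(endpoint, desc)
```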



Expected results:

The operator pod is expected to have the correct certs to talk to the bootstrap etcd.


Additional info:

Another quick way to spot this bug in CI is to look at the etcd resource in must-gather. If one member is in the Ready state and the other two are in the Unknown state, the etcd-operator is likely erroring out on auth failures while adding the members to the cluster. For example:


---------
  observedConfig:
    cluster:
      members:
      - name: etcd-bootstrap
        peerURLs: https://10.0.0.6:2380
        status: Unknown
      pending:
      - name: etcd-member-ci-op-kd2mp-m-1.c.openshift-gce-devel-ci.internal
        peerURLs: https://etcd-1.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380
        status: Unknown
      - name: etcd-member-ci-op-kd2mp-m-0.c.openshift-gce-devel-ci.internal
        peerURLs: https://etcd-0.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380
        status: Ready
      - name: etcd-member-ci-op-kd2mp-m-2.c.openshift-gce-devel-ci.internal
        peerURLs: https://etcd-2.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380
        status: Unknown
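
The must-gather check described above can be sketched as a small script. This is only a heuristic under stated assumptions: it reads the `observedConfig` excerpt as plain text with the layout shown here (no YAML library), and the function names `pending_statuses` and `looks_like_auth_stall` are hypothetical, not an operator or must-gather API.

```python
def pending_statuses(observed_config):
    """Map each pending member name to its status from an observedConfig excerpt."""
    statuses = {}
    in_pending = False
    current = None
    for raw in observed_config.splitlines():
        line = raw.strip()
        if line in ("members:", "pending:"):
            # Track which list the following entries belong to.
            in_pending = line == "pending:"
            current = None
        elif line.startswith("- name:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("status:") and in_pending and current:
            statuses[current] = line.split(":", 1)[1].strip()
    return statuses


def looks_like_auth_stall(statuses):
    """True when exactly one pending member is Ready and all others are Unknown."""
    values = list(statuses.values())
    return (
        len(values) > 1
        and values.count("Ready") == 1
        and values.count("Unknown") == len(values) - 1
    )
```

A True result only flags the pattern from this report; the operator log still has to be checked for the x509 error to confirm the cause.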


1. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2745/pull-ci-openshift-installer-master-e2e-gcp/222/artifacts/e2e-gcp/pods/openshift-etcd-operator_etcd-operator-f78f5b65c-jzqlz_operator.log
2. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/68/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade/195/artifacts/e2e-gcp-upgrade/pods/openshift-etcd-operator_etcd-operator-55f94bfd85-hhvck_operator.log
3. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/6986/rehearse-6986-pull-ci-openshift-origin-master-e2e-conformance-k8s/5/artifacts/e2e-conformance-k8s/pods/openshift-etcd-operator_etcd-operator-bbd958bb7-k476j_operator.log

Comment 1 Abhinav Dahiya 2020-02-05 18:09:38 UTC

These seem like similar failures:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/986
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/987
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3059/pull-ci-openshift-installer-master-e2e-gcp/234

Comment 6 Roy Golan 2020-02-12 09:29:21 UTC

I suspect we see the same for oVirt, and it's blocking us:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/3047/pull-ci-openshift-installer-master-e2e-ovirt/601/artifacts/e2e-ovirt/pods/openshift-etcd-operator_etcd-operator-6fbdf775c5-blkcw_operator.log

Comment 8 Sam Batschelet 2020-04-15 00:11:29 UTC


*** This bug has been marked as a duplicate of bug 1807169 ***

Comment 9 Sam Batschelet 2020-04-15 00:16:25 UTC


*** This bug has been marked as a duplicate of bug 1808060 ***
