Bug 1797796

Summary: Cluster etcd operator cannot talk to bootstrap pod because of auth failures
Product: OpenShift Container Platform Reporter: Alay Patel <alpatel>
Component: Etcd Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: high    
Version: 4.4 CC: adahiya, augol, eslutsky, mfojtik, mvirgil, rgolan, skolicha, wking, zshi
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Release Note
Doc Text:
When there are multiple networks, keep the following guidelines in mind: 1) bootkube must be populated with a BOOTSTRAP_IP in the same subnet as the masters; 2) the storage URLs in the kube apiserver must also use an IP from the same subnet as the masters, or the cert signer must produce certs with all IPs included in the SAN.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-10 16:22:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1771572    

Description Alay Patel 2020-02-03 21:29:03 UTC
Description of problem:

In 4.4, the cluster-etcd-operator (CEO) scales the etcd cluster from the bootstrap node to a 4-member control plane (the bootstrap etcd plus 3 etcd pods, one for each master). Sometimes the scaling times out because the CEO pod cannot talk to the bootstrap etcd in order to add the other etcd nodes as members. The error from the operator logs is:

------
I0201 18:29:56.506190       1 util.go:37] checking against etcd-2.ci-op-1yrd4g86-e4498.origin-ci-int-gce.dev.openshift.com.
W0201 18:29:57.079291       1 clientconn.go:1156] grpc: addrConn.createTransport failed to connect to {https://10.0.0.5:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority". Reconnecting...



How reproducible:
This is probably a major component of bootstrapping failures in CI. Grep for "Err :connection" in [1][2][3].
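
For triaging more than a handful of logs, the grep can be scripted. The following is only a rough sketch: the regular expression is derived from the single log line quoted in this report, so other grpc versions may format the message differently, and the function name find_handshake_failures is made up for illustration.

```python
import re
from collections import Counter

# Pattern assumed from the one operator log line quoted in this bug;
# it is not taken from grpc or operator source code.
HANDSHAKE_RE = re.compile(
    r'failed to connect to \{(?P<endpoint>https://\S+)'
    r'.*Err :connection error'
    r'.*x509: (?P<reason>[^"]+)'
)


def find_handshake_failures(log_text):
    """Count x509 handshake failures per (endpoint, reason) pair."""
    hits = Counter()
    for line in log_text.splitlines():
        match = HANDSHAKE_RE.search(line)
        if match:
            key = (match.group("endpoint"), match.group("reason").strip())
            hits[key] += 1
    return hits


# The two log lines quoted in the description, used as sample input.
SAMPLE_LOG = "\n".join([
    'I0201 18:29:56.506190       1 util.go:37] checking against '
    'etcd-2.ci-op-1yrd4g86-e4498.origin-ci-int-gce.dev.openshift.com.',
    'W0201 18:29:57.079291       1 clientconn.go:1156] grpc: '
    'addrConn.createTransport failed to connect to '
    '{https://10.0.0.5:2379 0  <nil>}. Err :connection error: desc = '
    '"transport: authentication handshake failed: x509: certificate '
    'signed by unknown authority". Reconnecting...',
])

if __name__ == "__main__":
    for (endpoint, reason), count in find_handshake_failures(SAMPLE_LOG).items():
        print(endpoint, reason, count)
```

Grouping by endpoint makes it easy to see whether the failures are all against the bootstrap etcd address or spread across members.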



Expected results:

The operator pod is expected to have the correct certs to talk to the bootstrap etcd.


Additional info:

Another quick way to spot this bug in CI is to look at the etcd resource in must-gather. If one member is in the Ready state and the other two are in the Unknown state, the etcd-operator is likely erroring out on auth failures when adding the members to the cluster. Example:


---------
  observedConfig:
    cluster:
      members:
      - name: etcd-bootstrap
        peerURLs: https://10.0.0.6:2380
        status: Unknown
      pending:
      - name: etcd-member-ci-op-kd2mp-m-1.c.openshift-gce-devel-ci.internal
        peerURLs: https://etcd-1.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380
        status: Unknown
      - name: etcd-member-ci-op-kd2mp-m-0.c.openshift-gce-devel-ci.internal
        peerURLs: https://etcd-0.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380
        status: Ready
      - name: etcd-member-ci-op-kd2mp-m-2.c.openshift-gce-devel-ci.internal
        peerURLs: https://etcd-2.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380
        status: Unknown
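
The must-gather check described above could be automated roughly as follows. This is a sketch under assumptions: it takes the observedConfig already loaded into a Python dict (by whatever YAML loader is available), the function name looks_like_auth_stall is hypothetical, and the "exactly one Ready, the rest Unknown" rule is just the heuristic from this report, not something the operator defines.

```python
def looks_like_auth_stall(observed_config):
    """Return True if exactly one pending member is Ready and all others are Unknown.

    This encodes the heuristic from the bug report only; a match is a hint
    to go and check the operator log for x509 errors, not proof of this bug.
    """
    pending = observed_config.get("cluster", {}).get("pending", [])
    statuses = [member.get("status") for member in pending]
    ready = statuses.count("Ready")
    unknown = statuses.count("Unknown")
    return len(statuses) >= 2 and ready == 1 and unknown == len(statuses) - 1


# The observedConfig fragment quoted above, transcribed as a dict.
SAMPLE_OBSERVED_CONFIG = {
    "cluster": {
        "members": [
            {"name": "etcd-bootstrap",
             "peerURLs": "https://10.0.0.6:2380",
             "status": "Unknown"},
        ],
        "pending": [
            {"name": "etcd-member-ci-op-kd2mp-m-1.c.openshift-gce-devel-ci.internal",
             "peerURLs": "https://etcd-1.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380",
             "status": "Unknown"},
            {"name": "etcd-member-ci-op-kd2mp-m-0.c.openshift-gce-devel-ci.internal",
             "peerURLs": "https://etcd-0.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380",
             "status": "Ready"},
            {"name": "etcd-member-ci-op-kd2mp-m-2.c.openshift-gce-devel-ci.internal",
             "peerURLs": "https://etcd-2.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380",
             "status": "Unknown"},
        ],
    }
}

if __name__ == "__main__":
    print(looks_like_auth_stall(SAMPLE_OBSERVED_CONFIG))
```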


1. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2745/pull-ci-openshift-installer-master-e2e-gcp/222/artifacts/e2e-gcp/pods/openshift-etcd-operator_etcd-operator-f78f5b65c-jzqlz_operator.log
2. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/68/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade/195/artifacts/e2e-gcp-upgrade/pods/openshift-etcd-operator_etcd-operator-55f94bfd85-hhvck_operator.log
3. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/6986/rehearse-6986-pull-ci-openshift-origin-master-e2e-conformance-k8s/5/artifacts/e2e-conformance-k8s/pods/openshift-etcd-operator_etcd-operator-bbd958bb7-k476j_operator.log

Comment 8 Sam Batschelet 2020-04-15 00:11:29 UTC

*** This bug has been marked as a duplicate of bug 1807169 ***

Comment 9 Sam Batschelet 2020-04-15 00:16:25 UTC

*** This bug has been marked as a duplicate of bug 1808060 ***