Bug 1797796 - Cluster etcd operator cannot talk to bootstrap pod because of auth failures
Summary: Cluster etcd operator cannot talk to bootstrap pod because of auth failures
Keywords:
Status: CLOSED DUPLICATE of bug 1808060
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1771572
Reported: 2020-02-03 21:29 UTC by Alay Patel
Modified: 2020-04-15 00:16 UTC (History)

Fixed In Version:
Doc Type: Release Note
Doc Text:
When there are multiple networks, keep the following guidelines in mind: 1) bootkube must be populated with a BOOTSTRAP_IP in the same subnet as the masters; 2) the storage URLs in the kube-apiserver must also use an IP from the same subnet as the masters, or the cert signer must produce certs with all IPs included in the SAN.
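The two guidelines can be checked mechanically before an install. The following is only a rough sketch, not installer code: the function names, the example CIDR, and the idea of passing the SAN IPs as a plain list are all assumptions made for illustration.

```python
import ipaddress


def in_subnet(ip, cidr):
    """True if ip falls inside the given subnet (e.g. the masters' subnet)."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


def missing_from_san(endpoint_ips, san_ips):
    """Return the endpoint IPs that are not listed among the cert's SAN IPs."""
    san = {ipaddress.ip_address(ip) for ip in san_ips}
    return [ip for ip in endpoint_ips if ipaddress.ip_address(ip) not in san]


# Hypothetical values: the subnet and the SAN list are placeholders.
masters_cidr = "10.0.0.0/24"
bootstrap_ip = "10.0.0.5"
print(in_subnet(bootstrap_ip, masters_cidr))
print(missing_from_san(["10.0.0.5", "10.0.0.6"], ["10.0.0.6"]))
```

If the bootstrap IP is outside the masters' subnet, or an endpoint IP is missing from the SAN list, clients dialing that IP should be expected to fail TLS verification.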
Clone Of:
Environment:
Last Closed: 2020-03-10 16:22:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 3175 0 None closed bug 1807169: use localhost for bootstrap IP until bootkube is fixed 2020-05-27 03:25:33 UTC

Description Alay Patel 2020-02-03 21:29:03 UTC
Description of problem:

In 4.4, the cluster-etcd-operator (CEO) scales the etcd cluster from the single bootstrap member to a 4-member cluster (the bootstrap etcd plus one etcd pod on each of the 3 masters). Sometimes the scaling times out because the CEO pod cannot talk to the bootstrap etcd to add the other etcd nodes as members. The error from the operator logs is:

------
I0201 18:29:56.506190       1 util.go:37] checking against etcd-2.ci-op-1yrd4g86-e4498.origin-ci-int-gce.dev.openshift.com.
W0201 18:29:57.079291       1 clientconn.go:1156] grpc: addrConn.createTransport failed to connect to {https://10.0.0.5:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority". Reconnecting...



How reproducible:
This is probably a major component of bootstrapping failures in CI. grep for "Err :connection" in the operator logs at [1][2][3].
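For triaging many CI runs at once, the grep can be scripted. This is a minimal sketch that assumes the operator log has already been downloaded as text; the function name and the regular expression are illustrative and are only modeled on the single log line quoted above.

```python
import re

# Modeled on the gRPC transport failure quoted in the description.
# "addr" captures the endpoint being dialed, "desc" the quoted error text.
HANDSHAKE_FAILURE = re.compile(
    r'failed to connect to \{(?P<addr>\S+).*?'
    r'Err :connection error: desc = "(?P<desc>[^"]*)"'
)


def find_auth_failures(log_text):
    """Return (endpoint, description) pairs for x509 handshake failures."""
    hits = []
    for line in log_text.splitlines():
        match = HANDSHAKE_FAILURE.search(line)
        if match and "x509" in match.group("desc"):
            hits.append((match.group("addr"), match.group("desc")))
    return hits
```

Calling `find_auth_failures(open(path).read())` on a saved operator log returns one entry per matching line, so a non-empty result suggests the run hit this bug.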



Expected results:

The operator pod should have the correct certs to talk to the bootstrap etcd.


Additional info:

Another quick way to spot this bug in CI is to look at the etcd resource in must-gather. If one member is in the Ready state and the other two are in the Unknown state, the etcd-operator is likely erroring out on auth failures when adding the members to the cluster. Example:


---------
  observedConfig:
    cluster:
      members:
      - name: etcd-bootstrap
        peerURLs: https://10.0.0.6:2380
        status: Unknown
      pending:
      - name: etcd-member-ci-op-kd2mp-m-1.c.openshift-gce-devel-ci.internal
        peerURLs: https://etcd-1.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380
        status: Unknown
      - name: etcd-member-ci-op-kd2mp-m-0.c.openshift-gce-devel-ci.internal
        peerURLs: https://etcd-0.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380
        status: Ready
      - name: etcd-member-ci-op-kd2mp-m-2.c.openshift-gce-devel-ci.internal
        peerURLs: https://etcd-2.ci-op-9d6rs79x-15937.origin-ci-int-gce.dev.openshift.com:2380
        status: Unknown
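The same check can be written as a small heuristic once the resource has been parsed into a dictionary (for example with a YAML loader of your choice). This is a sketch only: the function name is made up, the member names below are shortened placeholders, and the one-Ready-rest-Unknown rule is just the pattern described in this report, not an operator API.

```python
def looks_like_auth_stall(observed_config):
    """Heuristic from this report: one pending member Ready, the rest Unknown."""
    pending = observed_config.get("cluster", {}).get("pending", [])
    statuses = [member.get("status") for member in pending]
    return (
        len(statuses) > 1
        and statuses.count("Ready") == 1
        and statuses.count("Unknown") == len(statuses) - 1
    )


# Placeholder data shaped like the observedConfig excerpt above.
example = {
    "cluster": {
        "members": [{"name": "etcd-bootstrap", "status": "Unknown"}],
        "pending": [
            {"name": "etcd-member-m-1", "status": "Unknown"},
            {"name": "etcd-member-m-0", "status": "Ready"},
            {"name": "etcd-member-m-2", "status": "Unknown"},
        ],
    }
}
print(looks_like_auth_stall(example))
```

A match is only a hint; the operator log should still be checked for the x509 handshake error before attributing a failed run to this bug.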


1. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2745/pull-ci-openshift-installer-master-e2e-gcp/222/artifacts/e2e-gcp/pods/openshift-etcd-operator_etcd-operator-f78f5b65c-jzqlz_operator.log
2. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/68/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade/195/artifacts/e2e-gcp-upgrade/pods/openshift-etcd-operator_etcd-operator-55f94bfd85-hhvck_operator.log
3. https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/6986/rehearse-6986-pull-ci-openshift-origin-master-e2e-conformance-k8s/5/artifacts/e2e-conformance-k8s/pods/openshift-etcd-operator_etcd-operator-bbd958bb7-k476j_operator.log

Comment 8 Sam Batschelet 2020-04-15 00:11:29 UTC

*** This bug has been marked as a duplicate of bug 1807169 ***

Comment 9 Sam Batschelet 2020-04-15 00:16:25 UTC

*** This bug has been marked as a duplicate of bug 1808060 ***

