Description of problem:

In upgrade logs I observed

> 2020-05-01T18:36:37.0791842Z 2020-05-01 18:36:37.079092 E | rafthttp: failed to dial d8027fcd63ed8f3f on stream MsgApp v2 (x509: certificate is valid for localhost, mffaz1.qe.azure.devcluster.openshift.com, 10.0.0.6, not etcd-0.mffaz1.qe.azure.devcluster.openshift.com)

This is a regression: in 4.3 the peer and server certs both had a wildcard SAN entry

https://github.com/openshift/machine-config-operator/blob/a8b6ec1b0c6cb544e6160ef2f65a7c2b59e6d199/pkg/controller/template/render.go#L382

while in 4.4 we only include the domain without the wildcard:

X509v3 Subject Alternative Name:
    DNS:localhost, DNS:mffaz1.qe.azure.devcluster.openshift.com, DNS:10.0.0.4, IP Address:10.0.0.4

This regression could affect upgrades.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:
Peer certs are missing the *.etcdDiscoveryDomain wildcard in the SAN.

Expected results:
etcd peer certs contain the proper SAN.

Additional info:
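For illustration only (this is not the actual render.go or cert signer code), a minimal Go sketch of how a peer-cert template's SAN list could be built so that members dialed as etcd-N.<etcdDiscoveryDomain> still verify; the helper name and field layout here are assumptions:

```go
package main

import (
	"crypto/x509"
	"fmt"
	"net"
)

// peerCertDNSNames is a hypothetical helper: given the cluster's etcd
// discovery domain and the node IP, it returns the SAN entries a peer
// certificate would need, including the wildcard entry that 4.3 had
// and the 4.4 certs in this bug are missing.
func peerCertDNSNames(etcdDiscoveryDomain, nodeIP string) (dnsNames []string, ips []net.IP) {
	dnsNames = []string{
		"localhost",
		etcdDiscoveryDomain,
		"*." + etcdDiscoveryDomain, // covers etcd-0.<domain>, etcd-1.<domain>, ...
	}
	ips = []net.IP{net.ParseIP(nodeIP)}
	return dnsNames, ips
}

func main() {
	dns, ips := peerCertDNSNames("mffaz1.qe.azure.devcluster.openshift.com", "10.0.0.6")
	// These values would go into the x509 template handed to the signer.
	tmpl := x509.Certificate{DNSNames: dns, IPAddresses: ips}
	fmt.Println(tmpl.DNSNames, tmpl.IPAddresses)
}
```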
The exposure to the SAN issue outlined in this bug is isolated to the 4.4 upgrade, where we are migrating etcd peers away from FQDN-based peerURLs. In 4.3, peer <--> peer communications were dialed by FQDN, so the SAN had to include that DNS record. In 4.4+ clusters we dropped all DNS dependencies and peers use IPs.
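As a hedged illustration of why the missing SAN entry breaks the FQDN-style dial (this is not the actual rafthttp code), Go's standard hostname verification only succeeds when the dialed name matches an entry in the certificate's SAN:

```go
package main

import (
	"crypto/x509"
	"fmt"
)

func main() {
	// SAN as observed on the affected 4.4 peer cert: no *.<discovery domain> wildcard.
	peerCert := &x509.Certificate{
		DNSNames: []string{"localhost", "mffaz1.qe.azure.devcluster.openshift.com"},
	}

	// 4.3-era peers dial each other by FQDN, e.g. etcd-0.<discovery domain>.
	// This fails with an error like "x509: certificate is valid for ..., not etcd-0...".
	fmt.Println(peerCert.VerifyHostname("etcd-0.mffaz1.qe.azure.devcluster.openshift.com"))

	// With the wildcard entry restored, the same dial verifies (prints <nil>).
	peerCert.DNSNames = append(peerCert.DNSNames, "*.mffaz1.qe.azure.devcluster.openshift.com")
	fmt.Println(peerCert.VerifyHostname("etcd-0.mffaz1.qe.azure.devcluster.openshift.com"))
}
```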
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug (as the master-most bug in its chain). It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: 2% of customers upgrading from 4.3.z to 4.4 before #340 lands. This bug has no impact on 4.2 -> 4.3, 4.3 -> 4.3, 4.4 -> 4.4, 4.4 -> 4.5, 4.3 -> 4.4 -> 4.5, etc. updates.

What is the impact? Is it serious enough to warrant blocking edges?
  example: etcd members get mad and stop talking to each other, quorum lost, cluster lurches to a halt.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Twiddling a DNS record manually like... $SOMETHING unsticks the update. Clearing the stuck update to fall back to the previous version also resolves the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2133
## Public Impact Statement:

# Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

This is a race condition during the 4.3.z -> 4.4 upgrade. 4.3.z etcd clusters define PeerURLs as FQDNs, which means that any time a peer needs to communicate with another peer (raft) it will dial using the FQDN address[1]. TLS authentication therefore hinges on the etcd peer certificate SAN including the FQDN of the peer. In 4.4 we move to IP-based peers, and a controller manages this migration process during the upgrade. The bug stems from the cert signer controller, which is tasked with creating the TLS certificates, not properly defining the DNS portion of the SAN to include this FQDN. During the upgrade, a window can exist where peer communication fails to one or more of the etcd instances.

[1]
```
+------------------+---------+--------------------------------------------------------+----------------------------------------------------+---------------------------+
|        ID        | STATUS  |                          NAME                          |                     PEER ADDRS                     |       CLIENT ADDRS        |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------+---------------------------+
| 23b4148511d9af22 | started | etcd-member-ip-10-0-139-143.us-west-1.compute.internal | https://etcd-0.test.devcluster.openshift.com:2380  | https://10.0.139.143:2379 |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------+---------------------------+
```

# This bug has no impact on:

- 4.2 -> 4.3
- 4.3 -> 4.3
- 4.4 -> 4.5

# What is the impact? Is it serious enough to warrant blocking edges?

This is fairly serious: an upgrade affected by this bug could take 2 or more hours to complete. It is very important to be patient and let the process complete. If you feel the cluster cannot make progress, please reach out to support.

# How do cluster admins detect they are impacted?

The failure will manifest itself in a few different ways. The first is observing the etcd cluster operator in a Progressing status for a long period of time.

    $ oc get co etcd
    NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    etcd   4.4.4     False       True          True       3h54m

Check the naming of the etcd pods. Pods that are having trouble upgrading will still have the old pod name of `etcd-member` vs `etcd`.

    $ oc get pods -n openshift-etcd
    NAME                                                    READY   STATUS    RESTARTS   AGE
    etcd-ip-10-0-136-48.us-west-1.compute.internal          3/3     Running   0          3h59m
    etcd-ip-10-0-182-151.us-west-1.compute.internal         3/3     Running   0          3h58m
    etcd-member-ip-10-0-207-45.us-west-1.compute.internal   2/2     Running   0          3h57m

Lastly, you will see evidence of the TLS auth failure in the etcd server logs:

> 2020-05-01T18:36:37.0791842Z 2020-05-01 18:36:37.079092 E | rafthttp: failed to dial d8027fcd63ed8f3f on stream MsgApp v2 (x509: certificate is valid for localhost, test.devcluster.openshift.com, 10.0.0.6, not etcd-0.test.devcluster.openshift.com)

# How will cluster admins know they are in the clear?

If you exec into the etcdctl (4.4) / etcd-member (4.3) container of the running etcd pods, you can run

    $ etcdctl member list -w table

If all of the members listed contain IPs for "PEER ADDRS", then this issue no longer affects your cluster.
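As a supplementary, hedged sketch (not an official tool): given the "PEER ADDRS" values reported by `etcdctl member list`, the following Go snippet checks whether every peer URL host is already an IP address, which is the "in the clear" condition described above. The function name and example values are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// peersMigrated reports whether every peer URL already uses an IP address
// for its host, i.e. the FQDN -> IP migration described above has finished.
func peersMigrated(peerURLs []string) (bool, error) {
	for _, raw := range peerURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return false, err
		}
		if net.ParseIP(u.Hostname()) == nil {
			// Still an FQDN-style peer such as etcd-0.<discovery domain>.
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Example values taken from the "PEER ADDRS" column of `etcdctl member list -w table`.
	before := []string{"https://etcd-0.test.devcluster.openshift.com:2380"}
	after := []string{"https://10.0.139.143:2380"}

	fmt.Println(peersMigrated(before)) // false <nil>
	fmt.Println(peersMigrated(after))  // true <nil>
}
```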
*** Bug 1827132 has been marked as a duplicate of this bug. ***
*** Bug 1830409 has been marked as a duplicate of this bug. ***
*** Bug 1830789 has been marked as a duplicate of this bug. ***
*** Bug 1833250 has been marked as a duplicate of this bug. ***
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475