Description of problem:

In upgrade logs I observed

> 2020-05-01T18:36:37.0791842Z 2020-05-01 18:36:37.079092 E | rafthttp: failed to dial d8027fcd63ed8f3f on stream MsgApp v2 (x509: certificate is valid for localhost, mffaz1.qe.azure.devcluster.openshift.com, 10.0.0.6, not etcd-0.mffaz1.qe.azure.devcluster.openshift.com)

This is a regression: in 4.3 the peer and server certs both had a wildcard SAN entry

https://github.com/openshift/machine-config-operator/blob/a8b6ec1b0c6cb544e6160ef2f65a7c2b59e6d199/pkg/controller/template/render.go#L382

while in 4.4 we only include the domain without the wildcard:

X509v3 Subject Alternative Name:
    DNS:localhost, DNS:mffaz1.qe.azure.devcluster.openshift.com, DNS:10.0.0.4, IP Address:10.0.0.4

This regression could affect upgrades.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:
Peer certs are missing the *.etcdDiscoveryDomain wildcard in the SAN.

Expected results:
etcd peer certs contain the proper SAN.

Additional info:
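For illustration only (this is not the actual render.go or cert signer code), a minimal Go sketch of how a peer-cert template's SAN list could be built so that members dialed as etcd-N.<etcdDiscoveryDomain> still verify; the helper name and field layout here are assumptions:

```go
package main

import (
	"crypto/x509"
	"fmt"
	"net"
)

// peerCertDNSNames is a hypothetical helper: given the cluster's etcd
// discovery domain and the node IP, it returns the SAN entries a peer
// certificate would need, including the wildcard entry that 4.3 had
// and the 4.4 certs in this bug are missing.
func peerCertDNSNames(etcdDiscoveryDomain, nodeIP string) (dnsNames []string, ips []net.IP) {
	dnsNames = []string{
		"localhost",
		etcdDiscoveryDomain,
		"*." + etcdDiscoveryDomain, // covers etcd-0.<domain>, etcd-1.<domain>, ...
	}
	ips = []net.IP{net.ParseIP(nodeIP)}
	return dnsNames, ips
}

func main() {
	dns, ips := peerCertDNSNames("mffaz1.qe.azure.devcluster.openshift.com", "10.0.0.6")
	// These values would go into the x509 template handed to the signer.
	tmpl := x509.Certificate{DNSNames: dns, IPAddresses: ips}
	fmt.Println(tmpl.DNSNames, tmpl.IPAddresses)
}
```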
The exposure to the SAN issue outlined in this bug is isolated to the 4.4 upgrade, where we are migrating etcd peers away from FQDN-based peerURLs. In 4.3, peer <--> peer communications were dialed by FQDN, so the SAN had to include that DNS record. In 4.4+ clusters we dropped all DNS dependencies and peers use IPs.
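As a hedged illustration of why the missing SAN entry breaks the FQDN-style dial (this is not the actual rafthttp code), Go's standard hostname verification only succeeds when the dialed name matches an entry in the certificate's SAN:

```go
package main

import (
	"crypto/x509"
	"fmt"
)

func main() {
	// SAN as observed on the affected 4.4 peer cert: no *.<discovery domain> wildcard.
	peerCert := &x509.Certificate{
		DNSNames: []string{"localhost", "mffaz1.qe.azure.devcluster.openshift.com"},
	}

	// 4.3-era peers dial each other by FQDN, e.g. etcd-0.<discovery domain>.
	// This fails with an error like "x509: certificate is valid for ..., not etcd-0...".
	fmt.Println(peerCert.VerifyHostname("etcd-0.mffaz1.qe.azure.devcluster.openshift.com"))

	// With the wildcard entry restored, the same dial verifies (prints <nil>).
	peerCert.DNSNames = append(peerCert.DNSNames, "*.mffaz1.qe.azure.devcluster.openshift.com")
	fmt.Println(peerCert.VerifyHostname("etcd-0.mffaz1.qe.azure.devcluster.openshift.com"))
}
```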
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug (as the master-most bug in its chain). It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: 2% of customers upgrading from 4.3.z to 4.4 before #340 lands. This bug has no impact on 4.2 -> 4.3, 4.3 -> 4.3, 4.4 -> 4.4, 4.4 -> 4.5, 4.3 -> 4.4 -> 4.5, etc. updates.

What is the impact? Is it serious enough to warrant blocking edges?
  example: etcd members get mad and stop talking to each other, quorum lost, cluster lurches to a halt.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Twiddling a DNS record manually like... $SOMETHING unsticks the update. Clearing the stuck update to fall back to the previous version also resolves the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2133
## Public Impact Statement:

# Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

This is a race condition during the 4.3.z -> 4.4 upgrade. 4.3.z etcd clusters define PeerURLs as FQDNs, which means that any time a peer needs to communicate with another peer (raft) it will dial using the FQDN address[1]. TLS authentication therefore hinges on the etcd peer certificate SAN including the FQDN of the peer. In 4.4 we move to IP-based peers, and a controller manages this migration process during the upgrade. The bug stems from the cert signer controller, which is tasked with creating the TLS certificates, not properly defining the DNS portion of the SAN to include this FQDN. During the upgrade, a window can exist where peer communication fails to one or more of the etcd instances.

[1]
```
+------------------+---------+--------------------------------------------------------+----------------------------------------------------+---------------------------+
|        ID        | STATUS  |                          NAME                          |                     PEER ADDRS                     |       CLIENT ADDRS        |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------+---------------------------+
| 23b4148511d9af22 | started | etcd-member-ip-10-0-139-143.us-west-1.compute.internal | https://etcd-0.test.devcluster.openshift.com:2380  | https://10.0.139.143:2379 |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------+---------------------------+
```

# This bug has no impact on:

- 4.2 -> 4.3
- 4.3 -> 4.3
- 4.4 -> 4.5

# What is the impact? Is it serious enough to warrant blocking edges?

This is fairly serious: an upgrade affected by this bug could take 2 or more hours to complete. It is very important to be patient and let the process complete. If you feel the cluster cannot make progress, please reach out to support.

# How do cluster admins detect they are impacted?

The failure will manifest itself in a few different ways. The first is observing the etcd cluster operator in a Progressing status for a long period of time.

    $ oc get co etcd
    NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    etcd   4.4.4     False       True          True       3h54m

Check the naming of the etcd pods. Pods that are having trouble upgrading will still have the old pod name of `etcd-member` vs `etcd`.

    $ oc get pods -n openshift-etcd
    NAME                                                    READY   STATUS    RESTARTS   AGE
    etcd-ip-10-0-136-48.us-west-1.compute.internal          3/3     Running   0          3h59m
    etcd-ip-10-0-182-151.us-west-1.compute.internal         3/3     Running   0          3h58m
    etcd-member-ip-10-0-207-45.us-west-1.compute.internal   2/2     Running   0          3h57m

Lastly, you will see evidence of the TLS auth failure in the etcd server logs:

> 2020-05-01T18:36:37.0791842Z 2020-05-01 18:36:37.079092 E | rafthttp: failed to dial d8027fcd63ed8f3f on stream MsgApp v2 (x509: certificate is valid for localhost, test.devcluster.openshift.com, 10.0.0.6, not etcd-0.test.devcluster.openshift.com)

# How will cluster admins know they are in the clear?

If you exec into the etcdctl (4.4) / etcd-member (4.3) container of the running etcd pods, you can run

    $ etcdctl member list -w table

If all of the members listed contain IPs for "PEER ADDRS", then this issue no longer affects your cluster.
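As a supplementary, hedged sketch (not an official tool): given the "PEER ADDRS" values reported by `etcdctl member list`, the following Go snippet checks whether every peer URL host is already an IP address, which is the "in the clear" condition described above. The function name and example values are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// peersMigrated reports whether every peer URL already uses an IP address
// for its host, i.e. the FQDN -> IP migration described above has finished.
func peersMigrated(peerURLs []string) (bool, error) {
	for _, raw := range peerURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return false, err
		}
		if net.ParseIP(u.Hostname()) == nil {
			// Still an FQDN-style peer such as etcd-0.<discovery domain>.
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Example values taken from the "PEER ADDRS" column of `etcdctl member list -w table`.
	before := []string{"https://etcd-0.test.devcluster.openshift.com:2380"}
	after := []string{"https://10.0.139.143:2380"}

	fmt.Println(peersMigrated(before)) // false <nil>
	fmt.Println(peersMigrated(after))  // true <nil>
}
```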
*** Bug 1827132 has been marked as a duplicate of this bug. ***
*** Bug 1830409 has been marked as a duplicate of this bug. ***
*** Bug 1830789 has been marked as a duplicate of this bug. ***
*** Bug 1833250 has been marked as a duplicate of this bug. ***
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475