Bug 1957640

Summary:	EtcdCertSignerControllerDegraded error when upgrading from OCP 4.6 to 4.7
Product:	OpenShift Container Platform	Reporter:	Lucas López Montero <llopezmo>
Component:	Etcd	Assignee:	Maru Newby <mnewby>
Status:	CLOSED NOTABUG	QA Contact:	ge liu <geliu>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.7	CC:	mnewby, rsandu, sbatsche
Target Milestone:	---
Target Release:	4.7.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-05-06 19:30:48 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1954129
Bug Blocks:

Comment 4 Maru Newby 2021-05-06 15:28:13 UTC

If the ip changed, the ip changed. There's nothing we can do about that happening, and the fix is to

Comment 5 Maru Newby 2021-05-06 16:05:27 UTC

Comment #4 is incomplete, apologies. Please disregard.

For all released versions of OS4 today, the etcd operator assumes an ip address change indicates a change in etcd membership that requires manual intervention. For an ip address change, though, it shouldn't require replacing members. Instead, trigger certificate replacement by deleting one cert secret for each node. I suggest removing `openshift-etcd/etcd-serving-metrics-*`. The operator will be prompted by the absence of these secrets to recreate all etcd certificates for all nodes.
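
The secret deletion suggested above could be scripted roughly as follows. This is a minimal sketch, not a supported procedure: the helper function names and the RUN_DELETE guard variable are made up for illustration, and it assumes `oc` is logged in with permission to delete secrets in `openshift-etcd`.

```shell
#!/bin/sh
# Sketch only: function names and the RUN_DELETE guard are hypothetical.
NAMESPACE="openshift-etcd"
PREFIX="etcd-serving-metrics-"

# Read secret names on stdin (one per line, with or without the
# "secret/" prefix printed by `oc get -o name`) and keep only those
# whose name starts with $PREFIX.
select_metrics_secrets() {
  sed 's|^secret/||' | grep -- "^${PREFIX}" || true
}

# List the secrets in the namespace and delete the matching ones.
delete_metrics_secrets() {
  oc get secrets -n "$NAMESPACE" -o name \
    | select_metrics_secrets \
    | while read -r name; do
        oc delete secret -n "$NAMESPACE" "$name"
      done
}

# Touch the cluster only when explicitly asked to.
if [ "${RUN_DELETE:-0}" = "1" ]; then
  delete_metrics_secrets
fi
```

Running the script with RUN_DELETE unset does nothing, which makes it easy to check the name filter first, for example with `oc get secrets -n openshift-etcd -o name | select_metrics_secrets` after sourcing the file.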

A fix merged in 4.8 (https://github.com/openshift/cluster-etcd-operator/pull/540) to ensure automatic cert regeneration in the event of an ip address change. A backport is already underway for 4.7 (https://github.com/openshift/cluster-etcd-operator/pull/577), and once that merges we can attempt to backport to 4.6. The catch is that an unpatched release may still exhibit the reported issue on upgrade to a patched release. The fix depends on checking node identity against a uid saved on each cert secret, and the absence of that saved uid on the secrets created by an unpatched release will prevent automatic cert regeneration.

Comment 6 Maru Newby 2021-05-06 19:30:48 UTC

I'm afraid I was confusing this issue with another recent issue in which ip addresses were added rather than simply being changed.

Two steps are required to fix:

 - Trigger cert regeneration by deleting `openshift-etcd/etcd-serving-metrics-*`
 - Update advertised peer urls to reflect the new ip address(es): https://etcd.io/docs/v3.3/op-guide/runtime-configuration/#update-advertise-peer-urls
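
The second step could look roughly like the sketch below, following the `etcdctl member update <id> --peer-urls=...` form from the linked etcd guide. The helper function names, the RUN_UPDATE guard, the peer port 2380 (the etcd default), and the IPv4-only URL format are assumptions for illustration. It also assumes `etcdctl` is run somewhere its endpoints and certificates are already configured, such as the etcdctl container of an etcd pod.

```shell
#!/bin/sh
# Sketch only: helper names and the RUN_UPDATE guard are hypothetical.
PEER_PORT="${PEER_PORT:-2380}"   # assumption: default etcd peer port

# Read `etcdctl member list` output on stdin (comma-separated fields:
# ID, status, name, peer URLs, client URLs, ...) and print the ID of
# the member whose name matches $1.
member_id_for_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Build the peer URL for an IPv4 address.
peer_url_for_ip() {
  printf 'https://%s:%s\n' "$1" "$PEER_PORT"
}

# Look up the member by name and point its peer URL at the new address.
update_peer_url() {
  member_name="$1"
  new_ip="$2"
  member_id=$(etcdctl member list | member_id_for_name "$member_name")
  if [ -z "$member_id" ]; then
    echo "member not found: $member_name" >&2
    return 1
  fi
  etcdctl member update "$member_id" --peer-urls="$(peer_url_for_ip "$new_ip")"
}

# Touch the cluster only when explicitly asked to:
#   RUN_UPDATE=1 sh update-peer.sh <member-name> <new-ip>
if [ "${RUN_UPDATE:-0}" = "1" ]; then
  update_peer_url "$1" "$2"
fi
```

Checking `etcdctl member list` before and after the update confirms whether the advertised peer URL actually changed.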

A larger issue for the customer is ensuring that the ip addresses of master nodes do not vary. Ensuring static ip assignment is platform-specific and outside the scope of something the etcd team can assist with. If this is not fixed, the next master node reboot (whether due to upgrade or another trigger) is likely to see the recurrence of the reported issue.

Comment 7 Lucas López Montero 2021-05-07 07:38:52 UTC

I wrote the KCS article https://access.redhat.com/node/6021331 and I am working to correct it with the new information.

Regarding the first step, are all the files listed below the ones that have to be removed?

$ oc rsh etcd-ip-10-0-130-174.eu-central-1.compute.internal
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ip-10-0-130-174.eu-central-1.compute.internal -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# find / -iname "etcd-serving-metrics*"
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-130-174.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-130-174.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-171-199.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-171-199.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-195-125.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-195-125.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-130-174.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-171-199.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-171-199.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-195-125.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-195-125.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-130-174.eu-central-1.compute.internal.crt

Comment 8 Maru Newby 2021-05-10 20:44:34 UTC

Apologies for not being clear. The first step is deleting secrets with a name prefix of 'etcd-serving-metrics-' in the 'openshift-etcd' namespace. This will prompt recreation of all secrets for all etcd members.

Comment 9 Lucas López Montero 2021-05-12 08:26:57 UTC

No problem, Maru. Thank you very much for your clarification. The KCS article has been edited with the optimal solution.