Bug 1957640
| Summary: | EtcdCertSignerControllerDegraded error when upgrading from OCP 4.6 to 4.7 | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lucas López Montero <llopezmo> |
| Component: | Etcd | Assignee: | Maru Newby <mnewby> |
| Status: | CLOSED NOTABUG | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.7 | CC: | mnewby, rsandu, sbatsche |
| Target Milestone: | --- | ||
| Target Release: | 4.7.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-06 19:30:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1954129 | ||
| Bug Blocks: | |||
|
Comment 4
Maru Newby
2021-05-06 15:28:13 UTC
Comment #4 is incomplete, apologies. Please disregard.

For all released versions of OS4 today, the etcd operator assumes an IP address change indicates a change in etcd membership that requires manual intervention. An IP address change, though, shouldn't require replacing members. Instead, trigger certificate replacement by deleting one cert secret for each node. I suggest removing `openshift-etcd/etcd-serving-metrics-*`. The operator will be prompted by the absence of these secrets to recreate all etcd certificates for all nodes.

A fix merged in 4.8 (https://github.com/openshift/cluster-etcd-operator/pull/540) to ensure automatic cert regeneration in the event of an IP address change. A backport is already underway for 4.7 (https://github.com/openshift/cluster-etcd-operator/pull/577), and once that merges we can attempt to backport to 4.6.

The catch is that an unpatched release may still exhibit the reported issue on upgrade to a patched release. The fix depends on checking node identity against a UID saved on each cert secret, and the absence of that saved UID on the secrets created by an unpatched release will prevent automatic cert regeneration.

I'm afraid I was confusing this issue with another recent issue in which IP addresses were added rather than simply being changed. Two steps are required to fix:
- Trigger cert regeneration by deleting `openshift-etcd/etcd-serving-metrics-*`
- Update advertised peer URLs to reflect the new IP address(es): https://etcd.io/docs/v3.3/op-guide/runtime-configuration/#update-advertise-peer-urls

A larger issue for the customer is ensuring that the IP addresses of master nodes do not vary. Ensuring static IP assignment is platform-specific and outside the scope of something the etcd team can assist with. If this is not fixed, the next master node reboot (whether due to upgrade or another trigger) is likely to see the recurrence of the reported issue.
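For the second step above (updating advertised peer URLs), the following is a minimal Python sketch that only builds the `etcdctl member update` command described in the linked etcd documentation. It is not taken from this bug. The member ID shown is hypothetical (real IDs come from `etcdctl member list`), and the `https` scheme and peer port 2380 are assumptions that may not match a given cluster.

```python
import ipaddress
from typing import List

ETCD_PEER_PORT = 2380  # assumption: etcd's default peer port


def member_update_command(member_id: str, new_ip: str,
                          port: int = ETCD_PEER_PORT) -> List[str]:
    """Build the etcdctl argv that points one member's peer URL at new_ip.

    Raises ValueError if new_ip is not a valid IP address.
    """
    addr = ipaddress.ip_address(new_ip)
    # IPv6 literals must be bracketed inside a URL.
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    return ["etcdctl", "member", "update", member_id,
            f"--peer-urls=https://{host}:{port}"]


if __name__ == "__main__":
    # Hypothetical member ID; the IP is one of the node addresses in this report.
    cmd = member_update_command("a8266ecf031671f3", "10.0.130.174")
    print(" ".join(cmd))  # printed for review, not executed
```

The sketch prints the command instead of running it, so it can be reviewed and then executed from a shell inside an etcd pod once per member whose address changed.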
I wrote the KCS article https://access.redhat.com/node/6021331 and I am working to correct it with the new information. Regarding the first step, are all the files listed below the ones that have to be removed?

$ oc rsh etcd-ip-10-0-130-174.eu-central-1.compute.internal
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ip-10-0-130-174.eu-central-1.compute.internal -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# find / -iname "etcd-serving-metrics*"
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-130-174.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-130-174.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-171-199.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-171-199.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-195-125.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-resources/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-195-125.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-130-174.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-171-199.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-171-199.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-195-125.eu-central-1.compute.internal.crt
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-195-125.eu-central-1.compute.internal.key
/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving-metrics/etcd-serving-metrics-ip-10-0-130-174.eu-central-1.compute.internal.crt

Apologies for not being clear. The first step is deleting secrets with a name prefix of 'etcd-serving-metrics-' in the 'openshift-etcd' namespace. This will prompt recreation of all secrets for all etcd members.

No problem, Maru. Thank you very much for your clarification. The KCS article has been edited with the optimal solution. |
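The clarified first step (deleting Secret objects in the `openshift-etcd` namespace whose names start with `etcd-serving-metrics-`, not files inside the pod) can be sketched roughly as below. This is an illustrative Python helper, not the procedure from the KCS article. The function names are hypothetical, and it assumes `oc` is on the PATH and logged in with permission to delete secrets. By default it only prints the `oc delete secret` commands it would run.

```python
import subprocess
from typing import Iterable, List

NAMESPACE = "openshift-etcd"
PREFIX = "etcd-serving-metrics-"


def select_metrics_secrets(names: Iterable[str], prefix: str = PREFIX) -> List[str]:
    """Return the secret names that start with prefix.

    Accepts bare names or the 'secret/<name>' form printed by `oc get -o name`.
    """
    selected = []
    for raw in names:
        name = raw.strip().split("/", 1)[-1]
        if name.startswith(prefix):
            selected.append(name)
    return selected


def delete_commands(names: Iterable[str], namespace: str = NAMESPACE) -> List[List[str]]:
    """Build one `oc delete secret` argv per matching secret."""
    return [["oc", "delete", "secret", name, "-n", namespace]
            for name in select_metrics_secrets(names)]


def list_secret_names(namespace: str = NAMESPACE) -> List[str]:
    """Ask the cluster for secret names (requires a logged-in `oc`)."""
    result = subprocess.run(
        ["oc", "get", "secrets", "-n", namespace, "-o", "name"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Dry run on sample names built from the node names in this report;
    # on a real cluster, pass list_secret_names() instead.
    sample = [
        "secret/etcd-serving-metrics-ip-10-0-130-174.eu-central-1.compute.internal",
        "secret/etcd-serving-metrics-ip-10-0-171-199.eu-central-1.compute.internal",
        "secret/etcd-serving-metrics-ip-10-0-195-125.eu-central-1.compute.internal",
    ]
    for cmd in delete_commands(sample):
        print(" ".join(cmd))
```

Matching on the name prefix mirrors the wording of the clarification and leaves other secrets in the namespace untouched. Printing the commands first gives a chance to confirm the list before anything is deleted.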