Bug 2046335 - ETCD Operator goes degraded when a second internal node ip is added
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Thomas Jungblut
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 2068382 2085335
Depends On:
Blocks: 2117582 2118212
 
Reported: 2022-01-26 16:01 UTC by Gabriel Meghnagi
Modified: 2023-01-17 19:47 UTC (History)
18 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2117582 2118212
Environment:
Last Closed: 2023-01-17 19:47:08 UTC
Target Upstream Version:
Embargoed:
tjungblu: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 921 0 None open Bug 2046335: fix cert rotation on IP changes 2022-08-31 14:29:55 UTC
Red Hat Knowledge Base (Solution) 6978395 0 None None None 2022-09-30 15:30:00 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:47:32 UTC

Description Gabriel Meghnagi 2022-01-26 16:01:14 UTC
Description of problem:

The issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occurs (tested on OCP 4.8.11; the customer also verified that it reproduces on OpenShift 4.7.30, 4.8.17, and 4.9.11).


How reproducible:

Attaching a second NIC to a master node triggers the issue.

Steps to Reproduce:
1. Deploy an OCP cluster (I tested with IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")
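For step 2, one way to attach the second NIC on AWS is via the CLI. This is a sketch only, not a tested procedure; the subnet, security-group, and instance IDs below are placeholders for your cluster's values:

```shell
# Placeholders: substitute a subnet in the master's availability zone,
# the master security group, and the master node's instance ID.
ENI_ID=$(aws ec2 create-network-interface \
  --subnet-id subnet-0123456789abcdef0 \
  --groups sg-0123456789abcdef0 \
  --query 'NetworkInterface.NetworkInterfaceId' --output text)

# Attach it as a secondary interface (device index 1) to the master node.
aws ec2 attach-network-interface \
  --network-interface-id "$ENI_ID" \
  --instance-id i-0123456789abcdef0 \
  --device-index 1
```

Once the instance picks up the interface, the node reports a second InternalIP address, as shown in the output below.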

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal  -o json | jq ".status.addresses"
[
  {
    "address": "10.0.178.163",
    "type": "InternalIP"
  },
  {
    "address": "10.0.187.247",
    "type": "InternalIP"
  },
  {
    "address": "ip-10-0-178-163.eu-central-1.compute.internal",
    "type": "Hostname"
  },
  {
    "address": "ip-10-0-178-163.eu-central-1.compute.internal",
    "type": "InternalDNS"
  }
]

$ oc get co etcd                                                                           
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.8.11    True        False         True       31h

$ oc get co etcd -o json | jq ".status.conditions[0]"
{
  "lastTransitionTime": "2022-01-26T15:47:42Z",
  "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]",
  "reason": "EtcdCertSignerController_Error",
  "status": "True",
  "type": "Degraded"
}
~~~
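The degraded condition comes from serving/peer certificates whose SAN list predates the new address. The same SAN mismatch can be demonstrated offline with openssl; the file paths and CN below are illustrative, and this only mimics the check, it is not the operator's code:

```shell
# Create a throwaway cert whose SANs cover only the original internal IP
# (mimicking the pre-existing etcd serving cert; requires OpenSSL 1.1.1+).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/etcd-demo.key -out /tmp/etcd-demo.crt \
  -subj "/CN=etcd-serving-demo" \
  -addext "subjectAltName=IP:10.0.178.163,IP:127.0.0.1"

# Show the SANs baked into the cert.
openssl x509 -in /tmp/etcd-demo.crt -noout -ext subjectAltName

# The original IP matches; the newly added one does not -- the same
# mismatch that EtcdCertSignerController reports in the condition above.
openssl x509 -in /tmp/etcd-demo.crt -noout -checkip 10.0.178.163
openssl x509 -in /tmp/etcd-demo.crt -noout -checkip 10.0.187.247
```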

Expected results:

The etcd certificates should also be valid for the second internal IP (the newly added 10.0.187.247).

Additional info:

Deleting the following secrets works around the issue; the operator recreates them, as the fresh (~60s) ages below show:
~~~
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd-                                                                       
etcd-client                                                          kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal               kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal              kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal              kubernetes.io/tls                     2      60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal            kubernetes.io/tls                     2      60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal           kubernetes.io/tls                     2      59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal           kubernetes.io/tls                     2      59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal    kubernetes.io/tls                     2      58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal   kubernetes.io/tls                     2      59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal   kubernetes.io/tls                     2      58s

$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd- | awk '{print $1}' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"                    
{
  "lastTransitionTime": "2022-01-26T15:52:21Z",
  "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found",
  "reason": "AsExpected",
  "status": "False",
  "type": "Degraded"
}
~~~
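The workaround's filter can be sanity-checked offline before pointing it at a live cluster. Below, the same grep/awk selection runs against a captured-style listing; the last two rows are hypothetical entries added to show what the filter excludes:

```shell
# Sample in the shape of 'oc get secret -n openshift-etcd' output.
# 'serving-cert' and 'etcd-ca-bundle' are made-up rows the filter must skip.
cat > /tmp/etcd-secrets.txt <<'EOF'
etcd-client                                              kubernetes.io/tls  2  61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal   kubernetes.io/tls  2  61s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2  60s
serving-cert                                             kubernetes.io/tls  2  90d
etcd-ca-bundle                                           Opaque             1  90d
EOF

# Same selection as the workaround: TLS-type secrets whose name starts with etcd-.
grep 'kubernetes.io/tls' /tmp/etcd-secrets.txt | grep '^etcd-' | awk '{print $1}'
```

In the transcript above, these names are piped into `oc delete secret`; anchoring on `^etcd-` plus the `kubernetes.io/tls` type keeps unrelated secrets out of the deletion.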

Comment 4 Ashish Vyawahare 2022-06-28 06:36:50 UTC
Hi All,
I am also facing this issue during an OCP upgrade.
https://bugzilla.redhat.com/show_bug.cgi?id=2085335

Comment 15 Thomas Jungblut 2022-09-07 11:04:54 UTC
*** Bug 2068382 has been marked as a duplicate of this bug. ***

Comment 16 Thomas Jungblut 2022-09-08 14:14:33 UTC
*** Bug 2085335 has been marked as a duplicate of this bug. ***

Comment 22 errata-xmlrpc 2023-01-17 19:47:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

