Bug 2046335 - ETCD Operator goes degraded when a second internal node ip is added
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Thomas Jungblut
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 2068382 2085335
Depends On:
Blocks: 2117582 2118212
 
Reported: 2022-01-26 16:01 UTC by Gabriel Meghnagi
Modified: 2023-01-17 19:47 UTC (History)
18 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2117582 2118212
Environment:
Last Closed: 2023-01-17 19:47:08 UTC
Target Upstream Version:
Embargoed:
tjungblu: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 921 0 None open Bug 2046335: fix cert rotation on IP changes 2022-08-31 14:29:55 UTC
Red Hat Knowledge Base (Solution) 6978395 0 None None None 2022-09-30 15:30:00 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:47:32 UTC

Description Gabriel Meghnagi 2022-01-26 16:01:14 UTC
Description of problem:

The issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occurs (tested on OCP 4.8.11; the customer also verified that it reproduces on OpenShift 4.7.30, 4.8.17, and 4.9.11).


How reproducible:

Attaching a second NIC to a master node triggers the issue.

Steps to Reproduce:
1. Deploy an OCP cluster (I tested with IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")
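For step 2, one way to attach the second NIC on AWS is via the CLI. This is a sketch only, not a tested procedure; the subnet, security-group, and instance IDs below are placeholders for your cluster's values:

```shell
# Placeholders: substitute a subnet in the master's availability zone,
# the master security group, and the master node's instance ID.
ENI_ID=$(aws ec2 create-network-interface \
  --subnet-id subnet-0123456789abcdef0 \
  --groups sg-0123456789abcdef0 \
  --query 'NetworkInterface.NetworkInterfaceId' --output text)

# Attach it as a secondary interface (device index 1) to the master node.
aws ec2 attach-network-interface \
  --network-interface-id "$ENI_ID" \
  --instance-id i-0123456789abcdef0 \
  --device-index 1
```

Once the instance picks up the interface, the node reports a second InternalIP address, as shown in the output below.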

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal  -o json | jq ".status.addresses"
[
  {
    "address": "10.0.178.163",
    "type": "InternalIP"
  },
  {
    "address": "10.0.187.247",
    "type": "InternalIP"
  },
  {
    "address": "ip-10-0-178-163.eu-central-1.compute.internal",
    "type": "Hostname"
  },
  {
    "address": "ip-10-0-178-163.eu-central-1.compute.internal",
    "type": "InternalDNS"
  }
]

$ oc get co etcd                                                                           
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.8.11    True        False         True       31h

$ oc get co etcd -o json | jq ".status.conditions[0]"
{
  "lastTransitionTime": "2022-01-26T15:47:42Z",
  "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]",
  "reason": "EtcdCertSignerController_Error",
  "status": "True",
  "type": "Degraded"
}
~~~
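The degraded condition comes from serving/peer certificates whose SAN list predates the new address. The same SAN mismatch can be demonstrated offline with openssl; the file paths and CN below are illustrative, and this only mimics the check, it is not the operator's code:

```shell
# Create a throwaway cert whose SANs cover only the original internal IP
# (mimicking the pre-existing etcd serving cert; requires OpenSSL 1.1.1+).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/etcd-demo.key -out /tmp/etcd-demo.crt \
  -subj "/CN=etcd-serving-demo" \
  -addext "subjectAltName=IP:10.0.178.163,IP:127.0.0.1"

# Show the SANs baked into the cert.
openssl x509 -in /tmp/etcd-demo.crt -noout -ext subjectAltName

# The original IP matches; the newly added one does not -- the same
# mismatch that EtcdCertSignerController reports in the condition above.
openssl x509 -in /tmp/etcd-demo.crt -noout -checkip 10.0.178.163
openssl x509 -in /tmp/etcd-demo.crt -noout -checkip 10.0.187.247
```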

Expected results:

The etcd certificates should also be valid for the second internal IP (the newly added 10.0.187.247).

Additional info:

Deleting the following secrets works around the issue; the operator recreates them, as the fresh (~60s) ages below show:
~~~
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd-                                                                       
etcd-client                                                          kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal               kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal              kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal              kubernetes.io/tls                     2      60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal            kubernetes.io/tls                     2      60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal           kubernetes.io/tls                     2      59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal           kubernetes.io/tls                     2      59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal    kubernetes.io/tls                     2      58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal   kubernetes.io/tls                     2      59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal   kubernetes.io/tls                     2      58s

$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd- | awk '{print $1}' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"                    
{
  "lastTransitionTime": "2022-01-26T15:52:21Z",
  "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found",
  "reason": "AsExpected",
  "status": "False",
  "type": "Degraded"
}
~~~
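The workaround's filter can be sanity-checked offline before pointing it at a live cluster. Below, the same grep/awk selection runs against a captured-style listing; the last two rows are hypothetical entries added to show what the filter excludes:

```shell
# Sample in the shape of 'oc get secret -n openshift-etcd' output.
# 'serving-cert' and 'etcd-ca-bundle' are made-up rows the filter must skip.
cat > /tmp/etcd-secrets.txt <<'EOF'
etcd-client                                              kubernetes.io/tls  2  61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal   kubernetes.io/tls  2  61s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2  60s
serving-cert                                             kubernetes.io/tls  2  90d
etcd-ca-bundle                                           Opaque             1  90d
EOF

# Same selection as the workaround: TLS-type secrets whose name starts with etcd-.
grep 'kubernetes.io/tls' /tmp/etcd-secrets.txt | grep '^etcd-' | awk '{print $1}'
```

In the transcript above, these names are piped into `oc delete secret`; anchoring on `^etcd-` plus the `kubernetes.io/tls` type keeps unrelated secrets out of the deletion.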

Comment 4 Ashish Vyawahare 2022-06-28 06:36:50 UTC
Hi All,
I am also facing this issue during an OCP upgrade.
https://bugzilla.redhat.com/show_bug.cgi?id=2085335

Comment 15 Thomas Jungblut 2022-09-07 11:04:54 UTC
*** Bug 2068382 has been marked as a duplicate of this bug. ***

Comment 16 Thomas Jungblut 2022-09-08 14:14:33 UTC
*** Bug 2085335 has been marked as a duplicate of this bug. ***

Comment 22 errata-xmlrpc 2023-01-17 19:47:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

