Bug 2046335

Summary: ETCD Operator goes degraded when a second internal node IP is added
Product: OpenShift Container Platform
Reporter: Gabriel Meghnagi <gmeghnag>
Component: Etcd
Assignee: Thomas Jungblut <tjungblu>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.8
CC: acandelp, alray, atn, avyawahare87, aygarg, dwest, geliu, juqiao, kahara, mapandey, milang, openshift-bugs-escalate, pkhaire, rh-container, smerrow, tjungblu, vsolanki, yuokada
Target Milestone: ---
Target Release: 4.12.0
Flags: tjungblu: needinfo-
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clone Of: --- (cloned to: 2117582, 2118212)
Last Closed: 2023-01-17 19:47:08 UTC
Type: Bug
Bug Blocks: 2117582, 2118212

Description Gabriel Meghnagi 2022-01-26 16:01:14 UTC
Description of problem:

The issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occurs (tested on OCP 4.8.11; the customer also verified that the issue can happen even with OpenShift 4.7.30, 4.8.17, and 4.9.11).


How reproducible:

Attaching a second NIC to a master node triggers the issue.

Steps to Reproduce:
1. Deploy an OCP cluster (tested with an IPI installation on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal  -o json | jq ".status.addresses"
[
  {
    "address": "10.0.178.163",
    "type": "InternalIP"
  },
  {
    "address": "10.0.187.247",
    "type": "InternalIP"
  },
  {
    "address": "ip-10-0-178-163.eu-central-1.compute.internal",
    "type": "Hostname"
  },
  {
    "address": "ip-10-0-178-163.eu-central-1.compute.internal",
    "type": "InternalDNS"
  }
]

$ oc get co etcd                                                                           
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.8.11    True        False         True       31h

$ oc get co etcd -o json | jq ".status.conditions[0]"
{
  "lastTransitionTime": "2022-01-26T15:47:42Z",
  "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]",
  "reason": "EtcdCertSignerController_Error",
  "status": "True",
  "type": "Degraded"
}
~~~

Expected results:

The etcd certificates should also be valid for the second IP (the newly added "10.0.187.247").
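One way to confirm which IPs a certificate actually covers is to inspect its subjectAltName extension. Below is a minimal sketch using sample SAN text mirroring the error message above rather than a real certificate; `san_has_ip` is a hypothetical helper name, and the `oc`/`openssl` pipeline in the comments is what you would run against a live cluster:

```shell
# Hypothetical helper: check whether an IP appears in a certificate's SAN list.
# Against a live cluster you would obtain the SAN text with something like:
#   oc get secret etcd-serving-<node> -n openshift-etcd -o jsonpath='{.data.tls\.crt}' \
#     | base64 -d | openssl x509 -noout -ext subjectAltName
san_has_ip() {
  printf '%s\n' "$1" | grep -qw "IP Address:$2"
}

# Sample SAN line mirroring the degraded condition above (not from a real cert):
SAN="X509v3 Subject Alternative Name: IP Address:10.0.178.163, IP Address:127.0.0.1"

san_has_ip "$SAN" "10.0.178.163" && echo "10.0.178.163 present"
san_has_ip "$SAN" "10.0.187.247" || echo "10.0.187.247 missing"
```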

Additional info:

Deleting the following secrets (the operator regenerates them, as the ages in the listing below show) seems to resolve the issue:
~~~
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd-                                                                       
etcd-client                                                          kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal               kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal              kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal              kubernetes.io/tls                     2      60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal            kubernetes.io/tls                     2      60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal           kubernetes.io/tls                     2      59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal           kubernetes.io/tls                     2      59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal    kubernetes.io/tls                     2      58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal   kubernetes.io/tls                     2      59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal   kubernetes.io/tls                     2      58s

$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd- | awk '{print $1}' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"                    
{
  "lastTransitionTime": "2022-01-26T15:52:21Z",
  "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found",
  "reason": "AsExpected",
  "status": "False",
  "type": "Degraded"
}
~~~
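After the secrets are deleted and regenerated, the Degraded condition clears as shown above. A minimal sketch of checking that condition without `jq`, using the condition JSON as sample data copied from the output above (against a live cluster you would use the `oc` query in the comment instead):

```shell
# Sample Degraded condition, mirroring the recovered state shown above:
COND='{"lastTransitionTime":"2022-01-26T15:52:21Z","reason":"AsExpected","status":"False","type":"Degraded"}'
# Against a live cluster you could fetch just the status field with:
#   oc get co etcd -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'
status=$(printf '%s' "$COND" | grep -o '"status": *"[^"]*"' | cut -d'"' -f4)
[ "$status" = "False" ] && echo "etcd is no longer degraded"
```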

Comment 4 Ashish Vyawahare 2022-06-28 06:36:50 UTC
Hi All,
I am also facing this issue during an OCP upgrade.
https://bugzilla.redhat.com/show_bug.cgi?id=2085335

Comment 15 Thomas Jungblut 2022-09-07 11:04:54 UTC
*** Bug 2068382 has been marked as a duplicate of this bug. ***

Comment 16 Thomas Jungblut 2022-09-08 14:14:33 UTC
*** Bug 2085335 has been marked as a duplicate of this bug. ***

Comment 22 errata-xmlrpc 2023-01-17 19:47:08 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399