Bug 2046335

Summary: ETCD Operator goes degraded when a second internal node IP is added
Product: OpenShift Container Platform
Reporter: Gabriel Meghnagi <gmeghnag>
Component: Etcd
Assignee: Thomas Jungblut <tjungblu>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.8
CC: acandelp, alray, atn, avyawahare87, aygarg, dwest, geliu, juqiao, kahara, mapandey, milang, openshift-bugs-escalate, pkhaire, rh-container, smerrow, tjungblu, vsolanki, yuokada
Target Milestone: ---
Target Release: 4.12.0
Flags: tjungblu: needinfo-
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clone Of: --- (cloned to: 2117582, 2118212)
Last Closed: 2023-01-17 19:47:08 UTC
Type: Bug
Bug Blocks: 2117582, 2118212

Description Gabriel Meghnagi 2022-01-26 16:01:14 UTC
Description of problem:

The issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occurs (tested on OCP 4.8.11; the customer also verified that the issue can happen even with OpenShift 4.7.30, 4.8.17, and 4.9.11).


How reproducible:

Attaching a second NIC to a master node triggers the issue.

Steps to Reproduce:
1. Deploy an OCP cluster (tested with an IPI installation on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal  -o json | jq ".status.addresses"
[
  {
    "address": "10.0.178.163",
    "type": "InternalIP"
  },
  {
    "address": "10.0.187.247",
    "type": "InternalIP"
  },
  {
    "address": "ip-10-0-178-163.eu-central-1.compute.internal",
    "type": "Hostname"
  },
  {
    "address": "ip-10-0-178-163.eu-central-1.compute.internal",
    "type": "InternalDNS"
  }
]

$ oc get co etcd                                                                           
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.8.11    True        False         True       31h

$ oc get co etcd -o json | jq ".status.conditions[0]"
{
  "lastTransitionTime": "2022-01-26T15:47:42Z",
  "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]",
  "reason": "EtcdCertSignerController_Error",
  "status": "True",
  "type": "Degraded"
}
~~~

Expected results:

The etcd certificates should also be valid for the second IP (the newly added "10.0.187.247").
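One way to confirm which IPs a certificate actually covers is to inspect its subjectAltName extension. Below is a minimal sketch using sample SAN text mirroring the error message above rather than a real certificate; `san_has_ip` is a hypothetical helper name, and the `oc`/`openssl` pipeline in the comments is what you would run against a live cluster:

```shell
# Hypothetical helper: check whether an IP appears in a certificate's SAN list.
# Against a live cluster you would obtain the SAN text with something like:
#   oc get secret etcd-serving-<node> -n openshift-etcd -o jsonpath='{.data.tls\.crt}' \
#     | base64 -d | openssl x509 -noout -ext subjectAltName
san_has_ip() {
  printf '%s\n' "$1" | grep -qw "IP Address:$2"
}

# Sample SAN line mirroring the degraded condition above (not from a real cert):
SAN="X509v3 Subject Alternative Name: IP Address:10.0.178.163, IP Address:127.0.0.1"

san_has_ip "$SAN" "10.0.178.163" && echo "10.0.178.163 present"
san_has_ip "$SAN" "10.0.187.247" || echo "10.0.187.247 missing"
```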

Additional info:

Deleting the following secrets (the operator regenerates them, as the ages in the listing below show) seems to resolve the issue:
~~~
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd-                                                                       
etcd-client                                                          kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal               kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal              kubernetes.io/tls                     2      61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal              kubernetes.io/tls                     2      60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal            kubernetes.io/tls                     2      60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal           kubernetes.io/tls                     2      59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal           kubernetes.io/tls                     2      59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal    kubernetes.io/tls                     2      58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal   kubernetes.io/tls                     2      59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal   kubernetes.io/tls                     2      58s

$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd- | awk '{print $1}' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"                    
{
  "lastTransitionTime": "2022-01-26T15:52:21Z",
  "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found",
  "reason": "AsExpected",
  "status": "False",
  "type": "Degraded"
}
~~~
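After the secrets are deleted and regenerated, the Degraded condition clears as shown above. A minimal sketch of checking that condition without `jq`, using the condition JSON as sample data copied from the output above (against a live cluster you would use the `oc` query in the comment instead):

```shell
# Sample Degraded condition, mirroring the recovered state shown above:
COND='{"lastTransitionTime":"2022-01-26T15:52:21Z","reason":"AsExpected","status":"False","type":"Degraded"}'
# Against a live cluster you could fetch just the status field with:
#   oc get co etcd -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'
status=$(printf '%s' "$COND" | grep -o '"status": *"[^"]*"' | cut -d'"' -f4)
[ "$status" = "False" ] && echo "etcd is no longer degraded"
```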

Comment 4 Ashish Vyawahare 2022-06-28 06:36:50 UTC
Hi All,
I am also facing this issue during an OCP upgrade.
https://bugzilla.redhat.com/show_bug.cgi?id=2085335

Comment 15 Thomas Jungblut 2022-09-07 11:04:54 UTC
*** Bug 2068382 has been marked as a duplicate of this bug. ***

Comment 16 Thomas Jungblut 2022-09-08 14:14:33 UTC
*** Bug 2085335 has been marked as a duplicate of this bug. ***

Comment 22 errata-xmlrpc 2023-01-17 19:47:08 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399