Bug 1838625

Summary: After upgrade from OCP v4.3.0 to v4.3.18, one worker node is NotReady and an additional localhost node with the same IP is present
Product: OpenShift Container Platform Reporter: Radomir Ludva <rludva>
Component: RHCOS    Assignee: Ben Howard <behoward>
Status: CLOSED DUPLICATE QA Contact: Michael Nguyen <mnguyen>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.3.z    CC: aos-bugs, bbreard, dornelas, eparis, imcleod, jligon, jokerman, miabbott, nstielau
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-06-10 16:58:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1186913    

Description Radomir Ludva 2020-05-21 13:11:24 UTC
Description of problem:
-----------------------
After upgrading a cluster whose nodes use static IP addresses from OCP v4.3.0 to OCP v4.3.18, one worker node is NotReady and an additional localhost node with the same IP address is present.

The monitoring and network cluster operators are Progressing and Degraded; all other cluster operators are at 4.3.18 and healthy:
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring                                 4.3.18    False       True          True       2d17h
network                                    4.3.18    True        True          True       62d
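
For reference, an illustrative way to confirm the duplicated node entry (node names are taken from this report; output columns depend on the oc version):
$ oc get nodes -o wide                                        # both worker-04 and localhost show up with the same INTERNAL-IP
$ oc get node localhost -o jsonpath='{.status.addresses}'     # compare the reported addresses with worker-04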


Version-Release number:
-----------------------
Server Version: 4.3.18
Kubernetes Version: v1.16.2


Additional info:
----------------
There is probably a workaround to correct this issue:
$ oc delete node localhost
$ oc delete node worker-04
-> then restart worker-04
But it is important for us to know the root cause.

- There are a lot of pending CSRs for the localhost node.
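
For context, pending kubelet CSRs and the identity they request can be inspected like this (illustrative commands; <name> is a placeholder for an actual CSR name):
$ oc get csr | grep Pending
$ oc get csr <name> -o jsonpath='{.spec.request}' | base64 -d | openssl req -noout -subject
  # a CSR caused by this problem is expected to show CN = system:node:localhost
$ oc adm certificate approve <name>        # only if the requested identity matches the expected node name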

Comment 2 Ryan Phillips 2020-05-22 14:17:40 UTC
Reassigning to RHCOS. There is a known bug where the hostname gets set to localhost in certain situations.

Comment 3 Radomir Ludva 2020-05-22 16:07:34 UTC
Additional info:
================

Deleting the localhost and worker nodes with oc delete and restarting the worker node did not solve the issue. The worker node rejoined the cluster as localhost; this time, however, the localhost node is a regular part of the cluster with the worker node's IP address. DNS and PTR records are set correctly.

After restarting another worker node, that node is also NotReady and identifies itself as localhost:
$ openssl x509 -text -in /var/lib/kubelet/pki/kubelet-client-current.pem | grep CN
        Issuer: CN = kube-csr-signer_@1588779788
        Subject: O = system:nodes, CN = system:node:localhost
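
To see whether the kubelet simply picked up a transient localhost hostname, the hostname state on the node itself can be checked (illustrative; on RHCOS the hostname normally comes from DHCP/reverse DNS rather than /etc/hostname):
[debug node] $ hostnamectl status          # compare static vs. transient hostname
[debug node] $ cat /etc/hostname           # usually empty on RHCOS
[debug node] $ nmcli general hostname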


From the network-online.target logs: the hostname looks correct before the reboot, but after the reboot the node comes back up as localhost:
--------------------------------
[debug node] $ journalctl -u network-online.target
 -- Logs begin at Thu 2020-05-21 10:30:01 UTC, end at Fri 2020-05-22 13:43:38 UTC. --
 May 22 10:40:21 worker-04.example.com systemd[1]: Stopped target Network is Online.
 -- Reboot --
 May 22 10:43:11 localhost systemd[1]: Reached target Network is Online.
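
To trace where the hostname comes from during boot, something along these lines can help (illustrative; assumes NetworkManager, which is the default on RHCOS):
[debug node] $ journalctl -b -u NetworkManager | grep -i hostname
[debug node] $ journalctl -b | grep -i 'set hostname'
[debug node] $ journalctl -b -u kubelet | grep -i 'localhost'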

Comment 4 Micah Abbott 2020-05-26 13:56:40 UTC
Setting priority to medium and targeting this for 4.6. There are a handful of other BZs related to how the hostname is handled that may be related to this one. We will investigate and do more thorough triage of this issue when capacity allows.

Comment 6 Ben Howard 2020-06-10 16:58:02 UTC
This is a duplicate of bug 1809345.
The backport was released via https://github.com/openshift/machine-config-operator/commit/0b2741b3c0d735446cedb3d2494d85a4cbd74b90
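
For anyone hitting the same symptom before picking up that fix, the general shape of the mitigation is to make the kubelet wait until the node has a real (non-localhost) hostname. A minimal, hypothetical sketch of such a pre-start check follows; it is not the actual machine-config-operator change (see the commit above for that):

#!/bin/bash
# Hypothetical pre-start check: block until the transient hostname is no longer localhost.
while true; do
    h=$(hostname)
    if [ -n "$h" ] && [ "$h" != "localhost" ] && [ "$h" != "localhost.localdomain" ]; then
        echo "hostname is ${h}, continuing"
        break
    fi
    echo "hostname is still '${h}', waiting for DHCP/DNS to set it..."
    sleep 5
done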

*** This bug has been marked as a duplicate of bug 1809345 ***