Bug 1838625 - After upgrade from OCP v4.3.0 to v4.3.18 one worker node is NotReady and additional localhost with the same IP is present as a node
Keywords:
Status: CLOSED DUPLICATE of bug 1809345
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ben Howard
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1186913
 
Reported: 2020-05-21 13:11 UTC by Radomir Ludva
Modified: 2023-10-06 20:11 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-10 16:58:02 UTC
Target Upstream Version:
Embargoed:



Description Radomir Ludva 2020-05-21 13:11:24 UTC
Description of problem:
-----------------------
After upgrading a cluster whose nodes use static IP addresses from OCP 4.3.0 to OCP v4.3.18, one worker node is NotReady and an additional node named localhost with the same IP address is present.

The monitoring and network cluster operators are Progressing and Degraded; all other cluster operators are at 4.3.18 and healthy:
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring                                 4.3.18    False       True          True       2d17h
network                                    4.3.18    True        True          True       62d
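
For reference, the operator status above is presumably the output of the standard client command (shown here as an illustration, not part of the original report):
$ oc get clusteroperators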


Version-Release number:
-----------------------
Server Version: 4.3.18
Kubernetes Version: v1.16.2


Additional info:
----------------
The issue can probably be worked around with:
$ oc delete node localhost
$ oc delete node worker-04
followed by a restart of worker-04.
But it is important for us to know the root cause.

- There are a lot of pending CSRs for the localhost node.
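
As an illustrative sketch (not commands from the original report), the pending CSRs can be listed, and approved if appropriate, with the standard client commands:
$ oc get csr
$ oc get csr -o name | xargs oc adm certificate approve
Approving CSRs requested under the bogus localhost identity is probably undesirable until the hostname problem itself is resolved.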

Comment 2 Ryan Phillips 2020-05-22 14:17:40 UTC
Reassigning to RHCOS. There is a bug regarding localhost being set in certain situations.

Comment 3 Radomir Ludva 2020-05-22 16:07:34 UTC
Additional info:
================

Deleting the localhost and worker nodes with oc delete and restarting the worker node did not solve the issue. The worker node rejoined the cluster as localhost, but this time the localhost node is a regular part of the cluster with the worker node's IP address. DNS and PTR records are set correctly.

After restarting another worker node, that node is also NotReady and identifies itself as localhost:
$ openssl x509 -text -in /var/lib/kubelet/pki/kubelet-client-current.pem | grep CN
        Issuer: CN = kube-csr-signer_@1588779788
        Subject: O = system:nodes, CN = system:node:localhost
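
As an illustrative follow-up (an assumption, not part of the original debugging session), the hostname the OS itself reports on the affected node can be checked with:
$ hostname
$ hostnamectl status
The exact output will depend on how the static IP and hostname were configured on the node.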


From the network-online.target logs: it looks like the hostname is set correctly, but after 4 seconds it is back to localhost:
--------------------------------
[debug node] $ journalctl -u network-online.target
 -- Logs begin at Thu 2020-05-21 10:30:01 UTC, end at Fri 2020-05-22 13:43:38 UTC. --
 May 22 10:40:21 worker-04.example.com systemd[1]: Stopped target Network is Online.
 -- Reboot --
 May 22 10:43:11 localhost systemd[1]: Reached target Network is Online.
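
Illustrative checks (an assumption on my part, not commands from the original report) to see whether kubelet started before the hostname was assigned:
$ journalctl -b -u NetworkManager | grep -i hostname
$ journalctl -b -u kubelet | head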

Comment 4 Micah Abbott 2020-05-26 13:56:40 UTC
Setting the priority to medium and targeting 4.6. There are a handful of other BZs related to how the hostname is handled that may be related to this one. We will investigate and triage this issue more thoroughly when capacity allows.

Comment 6 Ben Howard 2020-06-10 16:58:02 UTC
This is a duplicate of bug 1809345.
The backport was released via https://github.com/openshift/machine-config-operator/commit/0b2741b3c0d735446cedb3d2494d85a4cbd74b90

*** This bug has been marked as a duplicate of bug 1809345 ***

