1929702 – Bare metal UPI nodes have been disconnected from the cluster after OCP upgrade to 4.6.16 and can't register since then

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.7
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Gal Zaidman
QA Contact:	Guilherme Santos
Docs Contact:
URL:
Whiteboard:
Depends On:	1937694
Blocks:

Reported:	2021-02-17 13:25 UTC by Oren Cohen
Modified:	2021-06-25 17:00 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-16 14:28:28 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-ovirt pull 95	0	None	open	Bug 1929702: providerIDController ignore nodes that have no machine	2021-03-07 14:46:28 UTC

Description Oren Cohen 2021-02-17 13:25:24 UTC

Description of problem:
An OCP cluster consisting of virtualized IPI master and worker nodes on the RHV platform, plus 3 bare metal UPI RHCOS 46 workers, has been upgraded from 4.6.8 to 4.6.16.
Since then, all three UPI nodes have been removed from the cluster and do not rejoin, even after their kubelet is restarted and the machines themselves are rebooted.
Re-installing one of them from scratch using PXE [*] with up-to-date images for kernel, initramfs and rootfs resulted in the same issue.

[*] https://docs.openshift.com/container-platform/4.6/installing/installing_bare_metal/installing-bare-metal.html#installation-user-infra-machines-pxe_installing-bare-metal

Notes:
1. When the kubelet is restarted on a UPI node, the node appears in the cluster as "NotReady" for a brief moment and then disappears.
2. There are no CSRs to approve.
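
To make note 2 reproducible, a check along the following lines could be used. This is only a sketch: it assumes the JSON produced by `oc get csr -o json` and treats a CSR as pending when it carries neither an "Approved" nor a "Denied" condition. The function name `pending_csr_names` is hypothetical and not part of any tool.

```python
import json
from typing import List


def pending_csr_names(csr_list_json: str) -> List[str]:
    """Return the names of CSRs that have not been approved or denied.

    Expects the JSON text of a CSR list, e.g. from `oc get csr -o json`.
    """
    items = json.loads(csr_list_json).get("items", [])
    pending = []
    for csr in items:
        # A pending CSR has no status conditions, or none of type Approved/Denied.
        conditions = (csr.get("status") or {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            pending.append(csr["metadata"]["name"])
    return pending
```

An empty result corresponds to the situation described here: nothing is waiting for approval, so a missing CSR approval does not explain why the nodes fail to stay registered.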

Version-Release number of selected component (if applicable):
OCP 4.6.16
Red Hat Enterprise Linux CoreOS 46.82.202012051820-0 (UPI bare metal)
Red Hat Enterprise Linux CoreOS 46.82.202101301821-0 (masters)


How reproducible:
N/A - there are no similar clusters.

Steps to Reproduce:
1. Upgrade an OCP-over-RHV cluster that contains BM UPI workers from 4.6.8 to 4.6.16.

Actual results:
The upgrade completed successfully, but the UPI workers disappeared from the cluster and can't rejoin.

Expected results:
The UPI workers remain cluster nodes in the Ready state.

Additional info:
* UPI node's kubelet log (starting at kubelet start up):
https://drive.google.com/file/d/1eqjpHNqrg8742RxdXkDLzKOtD0n9OXtJ/view?usp=sharing

* CRI-O containers are not running on the UPI node.

* must-gather of the cluster is available here:
https://drive.google.com/file/d/1QUIjKF6mTv_Oi61MzYLGY-ZhjC7BMakA/view?usp=sharing

* This is the "oc describe node" output for the node that becomes visible for a brief moment when the kubelet restarts:

$ oc describe nodes zeus08.lab.eng.tlv2.redhat.com
Name:               zeus08.lab.eng.tlv2.redhat.com
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=zeus08.lab.eng.tlv2.redhat.com
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 17 Feb 2021 14:51:23 +0200
Taints:             node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  zeus08.lab.eng.tlv2.redhat.com
  AcquireTime:     <unset>
  RenewTime:       Wed, 17 Feb 2021 14:51:23 +0200
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 17 Feb 2021 14:51:23 +0200   Wed, 17 Feb 2021 14:51:23 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 17 Feb 2021 14:51:23 +0200   Wed, 17 Feb 2021 14:51:23 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 17 Feb 2021 14:51:23 +0200   Wed, 17 Feb 2021 14:51:23 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Wed, 17 Feb 2021 14:51:23 +0200   Wed, 17 Feb 2021 14:51:23 +0200   KubeletNotReady              [CSINode is not yet initialized, missing node capacity for resources: ephemeral-storage]
Addresses:
  InternalIP:  10.46.41.3
  Hostname:    zeus08.lab.eng.tlv2.redhat.com
Capacity:
  cpu:            16
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         131992196Ki
  pods:           250
Allocatable:
  cpu:            15500m
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         130841220Ki
  pods:           250
System Info:
  Machine ID:                 d471f29e5442448d913e3a363ce8c090
  System UUID:                4c4c4544-0039-5010-8056-c2c04f325332
  Boot ID:                    a436386d-d644-4c53-97cf-6734719d274a
  Kernel Version:             4.18.0-193.29.1.el8_2.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 46.82.202012051820-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.19.0-26.rhaos4.6.git8a05a29.el8
  Kubelet Version:            v1.19.0+7070803
  Kube-Proxy Version:         v1.19.0+7070803
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                             ------------  ----------  ---------------  -------------  ---
  openshift-image-registry    node-ca-qgzgs                                    10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         1s
  openshift-monitoring        node-exporter-sf6f8                              9m (0%)       0 (0%)      210Mi (0%)       0 (0%)         1s
  openshift-multus            multus-2cn8k                                     10m (0%)      0 (0%)      150Mi (0%)       0 (0%)         1s
  openshift-ovirt-infra       coredns-zeus08.lab.eng.tlv2.redhat.com           100m (0%)     0 (0%)      200Mi (0%)       0 (0%)         1s
  openshift-ovirt-infra       keepalived-zeus08.lab.eng.tlv2.redhat.com        100m (0%)     0 (0%)      200Mi (0%)       0 (0%)         1s
  openshift-ovirt-infra       mdns-publisher-zeus08.lab.eng.tlv2.redhat.com    100m (0%)     0 (0%)      200Mi (0%)       0 (0%)         1s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                329m (2%)   0 (0%)
  memory             970Mi (0%)  0 (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type    Reason                   Age                    From     Message
  ----    ------                   ----                   ----     -------
  Normal  NodeHasSufficientMemory  16m (x26239 over 27h)  kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientMemory
  Normal  Starting                 12m                    kubelet  Starting kubelet.
  Normal  NodeAllocatableEnforced  12m                    kubelet  Updated Node Allocatable limit across pods
  Normal  NodeHasNoDiskPressure    11m (x8 over 12m)      kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     11m (x7 over 12m)      kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientPID
  Normal  NodeHasSufficientMemory  119s (x217 over 12m)   kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientMemory
  Normal  Starting                 1s                     kubelet  Starting kubelet.
  Normal  NodeHasSufficientMemory  1s                     kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    1s                     kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     1s                     kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  1s                     kubelet  Updated Node Allocatable limit across pods

Comment 9 Gal Zaidman 2021-03-11 11:14:52 UTC

This will be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1937694
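
For readers following the linked pull request ("providerIDController ignore nodes that have no machine"), the idea in its title can be illustrated roughly as below. This is not the provider's actual code, and this report does not show how the controller treats such nodes. The sketch only shows filtering nodes down to those backed by a Machine object. The function name `nodes_with_machine` is hypothetical, and matching through the Machine's `status.nodeRef.name` field is an assumption made for illustration.

```python
from typing import Dict, List


def nodes_with_machine(node_names: List[str], machines: List[Dict]) -> List[str]:
    """Return only the nodes that some Machine references via status.nodeRef.name.

    Nodes without a backing Machine (for example, bare metal UPI workers)
    are left out, so a controller using this filter would not act on them.
    """
    backed = set()
    for machine in machines:
        node_ref = (machine.get("status") or {}).get("nodeRef") or {}
        name = node_ref.get("name")
        if name:
            backed.add(name)
    # Preserve the input order of node_names.
    return [name for name in node_names if name in backed]
```

With a node list containing one IPI worker VM and zeus08.lab.eng.tlv2.redhat.com, and a Machine list that references only the VM, the function returns just the VM's node name, so the bare metal node is skipped.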