Description of problem:
An OCP cluster consisting of virtualized IPI master and worker nodes on the RHV platform, plus 3 bare-metal UPI RHCOS 4.6 workers, was upgraded from 4.6.8 to 4.6.16. Since then, all three UPI nodes have been removed from the cluster and do not rejoin, even after restarting their kubelet or rebooting the machines. Re-installing one of them from scratch using PXE [*] with up-to-date kernel, initramfs, and rootfs images resulted in the same issue.

[*] https://docs.openshift.com/container-platform/4.6/installing/installing_bare_metal/installing-bare-metal.html#installation-user-infra-machines-pxe_installing-bare-metal

Notes:
1. When the kubelet on a UPI node is restarted, the node appears as "NotReady" in the cluster for a brief moment, and then disappears.
2. There are no CSRs to approve.

Version-Release number of selected component (if applicable):
OCP 4.6.16
Red Hat Enterprise Linux CoreOS 46.82.202012051820-0 (UPI bare metal)
Red Hat Enterprise Linux CoreOS 46.82.202101301821-0 (masters)

How reproducible:
N/A - there are no similar clusters.

Steps to Reproduce:
1. Upgrade an OCP-over-RHV cluster containing bare-metal UPI workers from 4.6.8 to 4.6.16.

Actual results:
The upgrade completed successfully, but the UPI workers disappeared from the cluster and cannot rejoin.

Expected results:
The UPI workers remain cluster nodes, in Ready state.

Additional info:
* UPI node's kubelet log (starting at kubelet start-up): https://drive.google.com/file/d/1eqjpHNqrg8742RxdXkDLzKOtD0n9OXtJ/view?usp=sharing
* CRI-O containers are not running on the UPI node.
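For reference, the CSR check mentioned in note 2 above was done with the standard oc commands (a sketch of the procedure, not a log from this cluster; after a node reinstall one would normally expect Pending client CSRs to show up here):

```shell
# List all certificate signing requests in the cluster
oc get csr

# Show only Pending CSRs - on this cluster the list was empty
oc get csr | grep -w Pending

# If any were pending, they could be approved individually:
# oc adm certificate approve <csr-name>
```

The absence of Pending CSRs even after a from-scratch PXE reinstall suggests the node's join attempt fails before (or without) a new client certificate request being submitted.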
* must-gather of the cluster is available here: https://drive.google.com/file/d/1QUIjKF6mTv_Oi61MzYLGY-ZhjC7BMakA/view?usp=sharing
* This is the "oc describe node" output for the node that becomes visible for a brief moment when the kubelet restarts:

$ oc describe nodes zeus08.lab.eng.tlv2.redhat.com
Name:               zeus08.lab.eng.tlv2.redhat.com
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=zeus08.lab.eng.tlv2.redhat.com
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 17 Feb 2021 14:51:23 +0200
Taints:             node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  zeus08.lab.eng.tlv2.redhat.com
  AcquireTime:     <unset>
  RenewTime:       Wed, 17 Feb 2021 14:51:23 +0200
Conditions:
  Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
  ----            ------  -----------------                ------------------               ------                      -------
  MemoryPressure  False   Wed, 17 Feb 2021 14:51:23 +0200  Wed, 17 Feb 2021 14:51:23 +0200  KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Wed, 17 Feb 2021 14:51:23 +0200  Wed, 17 Feb 2021 14:51:23 +0200  KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure     False   Wed, 17 Feb 2021 14:51:23 +0200  Wed, 17 Feb 2021 14:51:23 +0200  KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           False   Wed, 17 Feb 2021 14:51:23 +0200  Wed, 17 Feb 2021 14:51:23 +0200  KubeletNotReady             [CSINode is not yet initialized, missing node capacity for resources: ephemeral-storage]
Addresses:
  InternalIP:  10.46.41.3
  Hostname:    zeus08.lab.eng.tlv2.redhat.com
Capacity:
  cpu:            16
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         131992196Ki
  pods:           250
Allocatable:
  cpu:            15500m
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         130841220Ki
  pods:           250
System Info:
  Machine ID:                 d471f29e5442448d913e3a363ce8c090
  System UUID:                4c4c4544-0039-5010-8056-c2c04f325332
  Boot ID:                    a436386d-d644-4c53-97cf-6734719d274a
  Kernel Version:             4.18.0-193.29.1.el8_2.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 46.82.202012051820-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.19.0-26.rhaos4.6.git8a05a29.el8
  Kubelet Version:            v1.19.0+7070803
  Kube-Proxy Version:         v1.19.0+7070803
Non-terminated Pods:          (6 in total)
  Namespace                 Name                                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                 ----                                           ------------  ----------  ---------------  -------------  ---
  openshift-image-registry  node-ca-qgzgs                                  10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         1s
  openshift-monitoring      node-exporter-sf6f8                            9m (0%)       0 (0%)      210Mi (0%)       0 (0%)         1s
  openshift-multus          multus-2cn8k                                   10m (0%)      0 (0%)      150Mi (0%)       0 (0%)         1s
  openshift-ovirt-infra     coredns-zeus08.lab.eng.tlv2.redhat.com         100m (0%)     0 (0%)      200Mi (0%)       0 (0%)         1s
  openshift-ovirt-infra     keepalived-zeus08.lab.eng.tlv2.redhat.com      100m (0%)     0 (0%)      200Mi (0%)       0 (0%)         1s
  openshift-ovirt-infra     mdns-publisher-zeus08.lab.eng.tlv2.redhat.com  100m (0%)     0 (0%)      200Mi (0%)       0 (0%)         1s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                329m (2%)   0 (0%)
  memory             970Mi (0%)  0 (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type    Reason                   Age                    From     Message
  ----    ------                   ----                   ----     -------
  Normal  NodeHasSufficientMemory  16m (x26239 over 27h)  kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientMemory
  Normal  Starting                 12m                    kubelet  Starting kubelet.
  Normal  NodeAllocatableEnforced  12m                    kubelet  Updated Node Allocatable limit across pods
  Normal  NodeHasNoDiskPressure    11m (x8 over 12m)      kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     11m (x7 over 12m)      kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientPID
  Normal  NodeHasSufficientMemory  119s (x217 over 12m)   kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientMemory
  Normal  Starting                 1s                     kubelet  Starting kubelet.
  Normal  NodeHasSufficientMemory  1s                     kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    1s                     kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     1s                     kubelet  Node zeus08.lab.eng.tlv2.redhat.com status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  1s                     kubelet  Updated Node Allocatable limit across pods
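The Ready condition above blames an uninitialized CSINode object and missing ephemeral-storage capacity. A sketch of how that can be inspected (node name taken from this report; commands assume a working kubeconfig, and the node-logs call may fail while the Node object is absent):

```shell
# Check whether a CSINode object exists for the affected node
oc get csinode zeus08.lab.eng.tlv2.redhat.com -o yaml

# Scan the node's kubelet journal for CSI / ephemeral-storage errors
oc adm node-logs zeus08.lab.eng.tlv2.redhat.com -u kubelet \
  | grep -iE 'csinode|ephemeral-storage'
```

If the CSINode object never appears, that would be consistent with the kubelet deleting and re-registering the Node in a loop before CSI initialization completes.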
This will be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1937694