Description of problem:
Following a 4.6.17 -> 4.6.25 -> 4.7.11 OCP upgrade and a CNV 2.5.7 -> 2.6.5 upgrade, one of the virt-launcher pods remains in Init:0/1 status, with the describe output showing "error adding container to network "bigip1ens4f0vf2": SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required".

[kni@ocp-edge18 ~]$ oc -n f5-lb get pods
NAME                         READY   STATUS     RESTARTS   AGE
virt-launcher-bigip0-5g7c7   1/1     Running    0          38m
virt-launcher-bigip1-gfx9q   0/1     Init:0/1   0          68m

[kni@ocp-edge18 ~]$ oc -n f5-lb describe pods virt-launcher-bigip1-gfx9q | grep -i fail
Warning  FailedScheduling        69m  default-scheduler  0/7 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity.
Warning  FailedScheduling        69m  default-scheduler  0/7 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity.
Warning  FailedScheduling        68m  default-scheduler  0/7 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Warning  FailedScheduling        62m  default-scheduler  0/7 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity.
Warning  FailedCreatePodSandBox  61m  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_virt-launcher-bigip1-gfx9q_f5-lb_28cb4897-3d43-4811-8787-c8609e0d5261_0(9b028b2436d825d3217322a0f93a94fc8e7d1b6af52af434b32edcaf2bd2b5be): [f5-lb/virt-launcher-bigip1-gfx9q:bigip1ens4f0vf2]: error adding container to network "bigip1ens4f0vf2": SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 4.6.17 with the CNV and SR-IOV operators
2. Create the attached sriovnetworknodepolicy, sriovnetwork, NodeNetworkConfigurationPolicy and VirtualMachine
3. Upgrade OCP to 4.6.25 and then to 4.7.11
4. Upgrade CNV to 2.6.5
5. Upgrade the SR-IOV network operator from 4.6.0-202106010807.p0.git.78e7139 to 4.7.0-202105211528.p0

Actual results:
The virt-launcher pod assigned to one of the two VirtualMachines is stuck in Init:0/1 status.

Expected results:
All virt-launcher pods are in Running status, as they were before the upgrade procedure.

Additional info:
After deleting the virt-launcher pod stuck in Init:0/1 status, it gets recreated in Running state. Attaching must-gather and the manifests used to create the VMs and networks.
It seems the issue was that the sriov device was not yet ready when kubelet tried to start a new virt-launcher-bigip1, which resulted in a failure to get the sriov device ID in sriov-cni.

From kubelet log:

# Jun 09 10:06:51 openshift-worker-3 systemd[1]: Starting Kubernetes Kubelet...

kubelet starts at 10:06:51

# Jun 09 10:08:22 openshift-worker-3 hyperkube[4248]: I0609 10:08:22.922738 4248 kuberuntime_manager.go:443] No sandbox for pod "virt-launcher-bigip1-gfx9q_f5-lb(28cb4897-3d43-4811-8787-c8609e0d5261)" can be found. Need to start a new one

kubelet tried to start a new virt-launcher-bigip1. <== this is key for the failure. Usually pods won't be scheduled on a node whose sriov device is not ready, but in the above case the pod got re-created by kubelet, and I think that skipped the kube-scheduler part which would have blocked the pod from being scheduled on this particular node until the sriov device became ready (node status reporting a non-zero allocatable resource).

# Jun 09 10:08:25 openshift-worker-3 hyperkube[4248]: E0609 10:08:25.449851 4248 remote_runtime.go:116] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_virt-launcher-bigip1-gfx9q_f5-lb_28cb4897-3d43-4811-8787-c8609e0d5261_0(9b028b2436d825d3217322a0f93a94fc8e7d1b6af52af434b32edcaf2bd2b5be): [f5-lb/virt-launcher-bigip1-gfx9q:bigip1ens4f0vf2]: error adding container to network "bigip1ens4f0vf2": SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required

sriov-cni failed to load netconf since the VF pci addr (device ID) was not provided (the sriov device was not ready)

# Jun 09 10:31:04 openshift-worker-3 hyperkube[4248]: I0609 10:31:04.021215 4248 setters.go:389] Update allocatable for openshift.io/bigip0ens4f0vf1 to 0

kubelet reports the sriov device status, indicating the sriov device was still not ready. In the rest of the kubelet log I didn't see `openshift.io/bigip0ens4f0vf1` set to a non-zero value by kubelet; are we missing any log from kubelet?
I think the reason deleting and re-creating the virt-launcher-bigip1 pod worked is that the pod got re-created after the sriov device became ready. (This could be a workaround.)
Marius,

Would it be possible to run the same test, but in addition to the virt-launcher pods, also have a deployment that places pods consuming VFs on the same nodes, and verify whether those pods hit the same issue? I just want to determine whether we get the same behavior as with the virt-launcher pod.

Thanks,
Nabeel
(In reply to zenghui.shi from comment #8)
> It seems the issue was that sriov device is not yet ready when kubelet tried
> to start a new virt-launcher-bigip1. which results in failure getting sriov
> device ID in sriov-cni.
>
> From kubelet log:
>
> # Jun 09 10:06:51 openshift-worker-3 systemd[1]: Starting Kubernetes
> Kubelet...
>
> kubelet starts at 10:06:51
>
> # Jun 09 10:08:22 openshift-worker-3 hyperkube[4248]: I0609 10:08:22.922738
> 4248 kuberuntime_manager.go:443] No sandbox for pod
> "virt-launcher-bigip1-gfx9q_f5-lb(28cb4897-3d43-4811-8787-c8609e0d5261)" can
> be found. Need to start a new one
>
> kubelet tried to start a new virt-launcher-bigip1. <== this is key for the
> failure, usually pods won't be scheduled on the node whose sriov device is
> not ready. but in the above case, pods get re-created by kubelet and I think
> it skipped the kube-scheduler part which would block the pods from being
> scheduled on this particular node until sriov device becomes ready (node
> status reporting non-zero allocatable resource).

Could someone from the kubelet team help us understand: when kubelet tries to restart the pod, does it re-evaluate the pod device assignment (e.g. update the checkpoint file and podresource endpoint)? Since CNI relies on the podresource endpoint to get the allocated device for a container, it would fail if the device doesn't exist in the podresource endpoint or checkpoint file.

> From the rest of kubelet log, I didn't see `openshift.io/bigip0ens4f0vf1` be
> set to 1 (non-zero) value by kubelet, do we miss any log from kubelet?

Can we also get the sriov-network-device-plugin log while the issue happens? It would tell us whether devices have been reported to kubelet or not.
I reproduced the issue and am adding some more data points:
- I created non-CNV deployments that consume VFs, per Nabeel's suggestion, and they are affected as well
- the issue doesn't show up after the 4.6.17 -> 4.6.25 upgrade
- the issue starts showing up after the 4.6.25 -> 4.7.11 OCP upgrade, before the CNV/SR-IOV operators upgrade
- the issue seems to affect the pods running on the first node of each machine config pool; pods start fine on the second node in the mcp

To give more context on the setup's node assignment to machine config pools: the setup has 4 worker nodes; the first 2 nodes (worker-0, worker-1) are part of the worker-cnf mcp and the last 2 nodes (worker-2, worker-3) are part of the load-balancer mcp. After the upgrade to 4.7.11 completes, the issue affects only the pods scheduled to run on the first worker of each mcp, e.g. worker-0 and worker-2, while pods scheduled on worker-1 and worker-3 run fine.

[kni@ocp-edge18 ~]$ oc get nodes
NAME                 STATUS   ROLES                  AGE   VERSION
openshift-master-0   Ready    master                 16h   v1.20.0+75370d3
openshift-master-1   Ready    master                 16h   v1.20.0+75370d3
openshift-master-2   Ready    master                 16h   v1.20.0+75370d3
openshift-worker-0   Ready    worker,worker-cnf      15h   v1.20.0+75370d3
openshift-worker-1   Ready    worker,worker-cnf      15h   v1.20.0+75370d3
openshift-worker-2   Ready    load-balancer,worker   15h   v1.20.0+75370d3
openshift-worker-3   Ready    load-balancer,worker   15h   v1.20.0+75370d3

[kni@ocp-edge18 ~]$ oc get mcp
NAME            CONFIG                                                    UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
load-balancer   rendered-load-balancer-3cb175e59327a8b5519217413b157fbb   True      False      False      2              2                   2                     0                      14h
master          rendered-master-49bc369f4185154baf6e63499638a235          True      False      False      3              3                   3                     0                      16h
worker          rendered-worker-317a9f85779c73c6f6b222f8a02132a3          True      False      False      0              0                   0                     0                      16h
worker-cnf      rendered-worker-cnf-317a9f85779c73c6f6b222f8a02132a3      True      False      False      2              2                   2                     0                      5h11m

[kni@ocp-edge18 ~]$ oc -n f5-lb get pods -o wide
NAME                         READY   STATUS              RESTARTS   AGE   IP            NODE                 NOMINATED NODE   READINESS GATES
app0-5c88dc9c58-mvf2c        0/1     ContainerCreating   0          39m   <none>        openshift-worker-0   <none>           <none>
app1-854b554b69-p4ffz        1/1     Running             0          28m   10.130.2.6    openshift-worker-1   <none>           <none>
app2-646c7dfb88-dmmd7        0/1     ContainerCreating   0          53m   <none>        openshift-worker-2   <none>           <none>
app3-9cd657b59-pq2kf         1/1     Running             0          25m   10.128.2.10   openshift-worker-3   <none>           <none>
virt-launcher-bigip0-fp2z7   0/1     Init:0/1            0          53m   <none>        openshift-worker-2   <none>           <none>
virt-launcher-bigip1-54b4s   1/1     Running             0          25m   10.128.2.8    openshift-worker-3   <none>           <none>
Debugging on the same environment as in comment #13, some more findings:

`app2-646c7dfb88-dmmd7` is a sriov pod on the `openshift-worker-2` worker node which is failing to start.

$ oc describe pod app2-646c7dfb88-dmmd7
Name:         app2-646c7dfb88-dmmd7
Namespace:    f5-lb
Priority:     0
Node:         openshift-worker-2/10.46.58.25
Start Time:   Thu, 10 Jun 2021 06:45:56 -0400
Status:       Pending
IP:
IPs:          <none>
Controlled By:  ReplicaSet/app2-646c7dfb88
Containers:
  app2:
    Container ID:
    Image:          xxxxxx
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      openshift.io/app2ens4f0vf0:  1   <===== sriov resource is requested
    Requests:
      openshift.io/app2ens4f0vf0:  1

$ oc get node openshift-worker-2 -o json | jq -r '.status.allocatable'
{
  ...
  "openshift.io/app2ens4f0vf0": "1",   <===== node has the sriov resource
  "openshift.io/app3ens4f0vf0": "1",
  "openshift.io/bigip0ens4f0vf0": "1",
  "openshift.io/bigip0ens4f0vf1": "1",
  "openshift.io/bigip1ens4f0vf2": "1",
  "openshift.io/bigip1ens4f0vf3": "1",
  "pods": "250"
}

On the openshift-worker-2 node, the kubelet internal checkpoint file shows that the `app2ens4f0vf0` resource is registered:

"RegisteredDevices": {
  ...
  "openshift.io/app2ens4f0vf0": [
    "0000:af:00.7"
  ...
  ],
}

but it is not assigned to any containers:

"PodDeviceEntries": [
  {
    "PodUID": "cbb6dd83-99c4-4d0e-8683-b4cc8d7ccdc7",
    "ContainerName": "compute",
    "ResourceName": "devices.kubevirt.io/vhost-net",
    "DeviceIDs": { "0": [ "vhost-net83" ] },
    "AllocResp": "GiQKDi9kZXYvdmhvc3QtbmV0Eg4vZGV2L3Zob3N0LW5ldBoCcnc="
  },
  {
    "PodUID": "cbb6dd83-99c4-4d0e-8683-b4cc8d7ccdc7",
    "ContainerName": "compute",
    "ResourceName": "devices.kubevirt.io/kvm",
    "DeviceIDs": { "0": [ "kvm31" ] },
    "AllocResp": "GhgKCC9kZXYva3ZtEggvZGV2L2t2bRoCcnc="
  },
  {
    "PodUID": "cbb6dd83-99c4-4d0e-8683-b4cc8d7ccdc7",
    "ContainerName": "compute",
    "ResourceName": "devices.kubevirt.io/tun",
    "DeviceIDs": { "0": [ "tun102" ] },
    "AllocResp": "GiAKDC9kZXYvbmV0L3R1bhIML2Rldi9uZXQvdHVuGgJydw=="
  }
],

Enabling the multus cni debug log (multus cni queries pod resource information from the kubelet podresource endpoint) confirms that the device is not assigned to the pod:

2021-06-10T11:44:37Z [debug] getKubernetesDelegate: found resourceName annotation : openshift.io/app2ens4f0vf0
2021-06-10T11:44:37Z [debug] GetResourceClient: using Kubelet resource API endpoint
2021-06-10T11:44:37Z [debug] getKubernetesDelegate: resourceMap instance: map[]   <==== multus gets no resource from the kubelet podresource endpoint, so no device ID is passed to sriov-cni

sriov-cni failed to get the VF PCI address (device ID):

2021-06-10T11:44:51Z [error] [f5-lb/virt-launcher-bigip0-fp2z7:bigip0ens4f0vf0]: error adding container to network "bigip0ens4f0vf0": SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required

It looks like kubelet starts the pod without requesting a sriov resource for it. We need input from the kubelet team on why this happens.
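As a rough illustration of that final error (a sketch with simplified field names, not the actual sriov-cni sources): multus is expected to inject the allocated VF PCI address into the CNI netconf, and the SR-IOV plugin rejects any netconf that arrives without it, which is exactly what we see when the podresource endpoint returns an empty resource map.

```go
// Sketch of the sriov-cni LoadConf validation path. The NetConf struct
// here is an illustrative assumption, trimmed to the one field that
// matters for this bug.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// NetConf mirrors the part of an SR-IOV CNI config relevant here.
type NetConf struct {
	Name     string `json:"name"`
	Type     string `json:"type"`
	DeviceID string `json:"deviceID"` // VF PCI address injected by multus
}

// loadConf parses a netconf and rejects it when no device ID was injected.
func loadConf(data []byte) (*NetConf, error) {
	conf := &NetConf{}
	if err := json.Unmarshal(data, conf); err != nil {
		return nil, fmt.Errorf("failed to parse netconf: %v", err)
	}
	// When the kubelet podresource endpoint returned map[], multus had
	// no device ID to inject, and this is the check that fires.
	if conf.DeviceID == "" {
		return nil, errors.New("LoadConf(): VF pci addr is required")
	}
	return conf, nil
}

func main() {
	// Netconf as delivered when no device was allocated to the pod.
	_, err := loadConf([]byte(`{"name":"bigip1ens4f0vf2","type":"sriov"}`))
	fmt.Println(err) // LoadConf(): VF pci addr is required

	// Netconf with the VF PCI address injected loads fine.
	conf, _ := loadConf([]byte(`{"name":"bigip1ens4f0vf2","type":"sriov","deviceID":"0000:af:00.7"}`))
	fmt.Println(conf.DeviceID) // 0000:af:00.7
}
```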
Changing component to Node - Kubelet to get input on comment #15.
I don't think there is anything obviously wrong from the podresource API point of view. The podresource API client is accessed without any issue, but multus relies on the podresource API endpoint for obtaining the resource map and device ID. Because the device is registered but not assigned to the pod, we won't get the device ID when the podresource endpoint is queried. It appears to me there is something wrong in the device manager.
The other issue I see here: multus appears to be failing to obtain device ID information from the podresource API, but the pod is not running on the node in the first place - so how would we get device ID information from the podresource API for a pod that is not running?
(In reply to Swati Sehgal from comment #18)
> I don't think there is anything obvious wrong from podresource API point of
> view. The podresource API client is accessed without any issue but in multus
> the reliance is on the podresource API endpoint for obtaining resource map
> and device ID. Because the device is registered and not assigned to the pod,
> so we won't get the device id when the podresource endpoint is queried.
> Appears to me there is something wrong in the device manager.

Kubelet assigns a device to the pod; on subsequent retries (once the sriov resource is ready), will it re-evaluate the device assignment and update it through the podresource API? If so, should we expect the pod to be created successfully even though the first few retries failed because the sriov resource was not ready?

From the sriov device plugin log it seems not, since the device plugin didn't receive any Allocate call from kubelet.

Trimmed log from the sriov device plugin pod running on the openshift-worker-2 node:

I0610 10:46:57.643418      26 manager.go:193] validating resource name "openshift.io/app2ens4f0vf0"
I0610 10:46:57.715955      26 manager.go:116] Creating new ResourcePool: app2ens4f0vf0
I0610 10:46:57.715964      26 manager.go:117] DeviceType: netDevice
I0610 10:46:57.721723      26 factory.go:106] device added: [pciAddr: 0000:af:00.7, vendor: 15b3, device: 1018, driver: mlx5_core]
I0610 10:46:57.721749      26 manager.go:145] New resource server is created for app2ens4f0vf0 ResourcePool
I0610 10:46:57.738235      26 main.go:72] Starting all servers...
I0610 10:46:57.738252      26 server.go:191] starting app2ens4f0vf0 device plugin endpoint at: openshift.io_app2ens4f0vf0.sock
I0610 10:46:57.739793      26 server.go:217] app2ens4f0vf0 device plugin endpoint started serving
I0610 10:46:57.744269      26 main.go:77] All servers started.
I0610 10:46:57.744284      26 main.go:78] Listening for term signals
I0610 10:46:57.769271      26 server.go:106] Plugin: openshift.io_app2ens4f0vf0.sock gets registered successfully at Kubelet
I0610 10:46:57.769983      26 server.go:131] ListAndWatch(app2ens4f0vf0) invoked
I0610 10:46:57.770002      26 server.go:139] ListAndWatch(app2ens4f0vf0): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:af:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
I just stumbled upon a similar BZ that was reported on 4.8. Any chance these two are related at all? https://bugzilla.redhat.com/show_bug.cgi?id=1968625
(In reply to Petr Horáček from comment #22)
> I just stumbled upon a similar BZ that was reported on 4.8. Any chance these
> two are related at all? https://bugzilla.redhat.com/show_bug.cgi?id=1968625

Or this one, which looks even closer (although not the same):
https://bugzilla.redhat.com/show_bug.cgi?id=1962634
(In reply to Petr Horáček from comment #23)
> (In reply to Petr Horáček from comment #22)
> > I just stumbled upon a similar BZ that was reported on 4.8. Any chance these
> > two are related at all? https://bugzilla.redhat.com/show_bug.cgi?id=1968625
>
> Or this which looks even closer (although not the same)
> https://bugzilla.redhat.com/show_bug.cgi?id=1962634

Yep, we were missing details from the virt-launcher pod, and the resource was available on the node. Even there, the only log from the dp is about sending the devices to the kubelet, not receiving the allocation request. I think we are looking at the same bug.
(In reply to Petr Horáček from comment #22)
> I just stumbled upon a similar BZ that was reported on 4.8. Any chance these
> two are related at all? https://bugzilla.redhat.com/show_bug.cgi?id=1968625

No, that one is a regression introduced in 4.8: it fails the sriov pod creation the first time the pod gets created.
Just to clarify: do we see this issue only after a kubelet restart, or is it independent of the kubelet state? IOW, does it happen shortly after a kubelet restart, or can it also happen when the kubelet has been up and running for hours/days? If it happens regardless of a restart (so also in steady state), does it happen with the same frequency?
Reassigning to Francesco, adding NEEDINFO for his questions.
I think it goes like this. The bug triggers only after an upgrade, so kubelet is stopped, updated (4.6 -> 4.7) and started again.

The updated kubelet fails to restore the checkpoint data:

```
Jun 15 07:31:49.743990 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.743846 4852 manager.go:236] Starting Device Plugin manager
Jun 15 07:31:49.744225 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.744202 4852 container_manager_linux.go:899] attempting to apply oom_score_adj of -999 to pid 4852
Jun 15 07:31:49.744225 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.744227 4852 oom_linux.go:65] attempting to set "/proc/4852/oom_score_adj" to "-999"
Jun 15 07:31:49.744332 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.744264 4852 event.go:291] "Event occurred" object="openshift-worker-1" kind="Node" apiVersion="" type="Normal" reason="NodeAllocatableEnforced" message="Updated Node Allocatable limit across pods"
Jun 15 07:31:49.744489 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.744471 4852 qos_container_manager_linux.go:333] [ContainerManager]: Updated QoS cgroup configuration
Jun 15 07:31:49.744795 openshift-worker-1 hyperkube[4852]: W0615 07:31:49.744770 4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA
```

So this means the internal devicemanager maps will be **completely** empty. The expected state after recovery is that the *key* of each relevant device will be present. This is a key difference.

After the failed recovery, the kubelet will happily run, but the node update callback (https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/cm/devicemanager/manager.go#L516) will **not** update the node objects.
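The decode failure comes from a type change between releases: the 1.19 checkpoint stored DeviceIDs as a flat JSON array, while 1.20 redefined it as the NUMA-aware map type checkpoint.DevicesPerNUMA. A minimal sketch (trimmed stand-in types, not the real kubelet structs) reproduces the exact mismatch:

```go
// Reproduce the "cannot unmarshal array ... of type checkpoint.DevicesPerNUMA"
// failure seen in the kubelet log, using trimmed stand-ins for the
// checkpoint types.
package main

import (
	"encoding/json"
	"fmt"
)

// DevicesPerNUMA mirrors the Kubernetes 1.20 checkpoint type:
// a map from NUMA node ID to device IDs.
type DevicesPerNUMA map[int64][]string

// PodDevicesEntry is a cut-down stand-in for the checkpoint entry;
// only the field whose type changed is shown.
type PodDevicesEntry struct {
	PodUID       string
	ResourceName string
	DeviceIDs    DevicesPerNUMA
}

// decodeEntry decodes one checkpoint entry using the 1.20 types.
func decodeEntry(data []byte) (*PodDevicesEntry, error) {
	e := &PodDevicesEntry{}
	if err := json.Unmarshal(data, e); err != nil {
		return nil, err
	}
	return e, nil
}

func main() {
	// Entry as the pre-upgrade (1.19) kubelet wrote it: a flat array.
	oldFormat := []byte(`{"PodUID":"28cb4897","ResourceName":"openshift.io/bigip1ens4f0vf2","DeviceIDs":["0000:af:00.7"]}`)
	_, err := decodeEntry(oldFormat)
	fmt.Println(err) // json: cannot unmarshal array into Go struct field ...

	// Entry in the new (1.20) NUMA-aware format decodes fine.
	newFormat := []byte(`{"PodUID":"28cb4897","ResourceName":"openshift.io/bigip1ens4f0vf2","DeviceIDs":{"1":["0000:af:00.7"]}}`)
	e, _ := decodeEntry(newFormat)
	fmt.Println(e.DeviceIDs[1][0]) // 0000:af:00.7
}
```

This matches the PodDeviceEntries shown in comment #15, where DeviceIDs already appears in the new `{"0": [...]}` map form once rewritten by the upgraded kubelet.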
The expected behaviour is to reset the allocatable device counters to 0 (zero) until the device plugins register themselves again. As a result of the failed recovery, the scheduler will send pods to this node while it should not (devices not ready).

In the pod admission flow we call `DeviceManager.Allocate()`, which calls `allocateContainerResources` (this happens in each and every flow, including the one we saw in this BZ):
https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/cm/devicemanager/manager.go#L830

In each loop there is a call to `isDevicePluginResources` (https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/cm/devicemanager/manager.go#L843). The problem is: exactly because of the **empty healthyDevices** map, it always returns false.

It is worth stressing that the bug is that the maps are completely empty. The expected flow is to have the **keys** in the map, but with zero values (until the device plugin registers itself again). Had that been the case, a device allocation request would have been performed. This explains the weird behaviour documented in https://bugzilla.redhat.com/show_bug.cgi?id=1969870#c15

I believe that fixing the checkpoint recovery should fix all the issues reported here. Further investigation should explore:
1. whether the bug presents itself in scenarios other than the upgrade flow (seems unlikely)
2. whether the bug was present in earlier versions (no data; if so, there could be a different root cause)
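A toy model of the device manager state (hypothetical types; the real healthyDevices is a map from resource name to a set of device IDs) shows why the completely empty map short-circuits allocation, while a keyed-but-empty map would not:

```go
// Sketch of why the corrupted state skips device allocation entirely:
// the device manager treats a resource as device-plugin-managed only
// if its key is present in healthyDevices. After the failed checkpoint
// restore the map is completely empty, so the check returns false and
// Allocate() is never attempted for the SR-IOV resource; the pod is
// admitted without a device. The expected post-restore state keeps the
// key with a zero-sized set, so allocation is attempted and fails,
// keeping the pod pending until the plugin re-registers.
package main

import "fmt"

type deviceManager struct {
	healthyDevices map[string]map[string]struct{} // resource -> device IDs
}

// isDevicePluginResource reports whether the resource is known to the
// device manager (key presence, regardless of how many devices exist).
func (m *deviceManager) isDevicePluginResource(resource string) bool {
	_, ok := m.healthyDevices[resource]
	return ok
}

func main() {
	const res = "openshift.io/app2ens4f0vf0"

	// Buggy state after the failed checkpoint restore: empty map.
	corrupted := &deviceManager{healthyDevices: map[string]map[string]struct{}{}}
	fmt.Println(corrupted.isDevicePluginResource(res)) // false

	// Expected state: key present with zero devices until re-registration.
	recovered := &deviceManager{healthyDevices: map[string]map[string]struct{}{res: {}}}
	fmt.Println(recovered.isDevicePluginResource(res)) // true
}
```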
*** Bug 1969869 has been marked as a duplicate of this bug. ***
(In reply to Francesco Romani from comment #29)
> think it goes like this.
>
> The bug triggers only after a upgrade, so kubelet is stopped, updated (4.6
> -> 4.7) and started again.
>
> The updated kubelet fails to restore the checkpoint data:
> ```
> Jun 15 07:31:49.743990 openshift-worker-1 hyperkube[4852]: I0615
> 07:31:49.743846 4852 manager.go:236] Starting Device Plugin manager
> Jun 15 07:31:49.744225 openshift-worker-1 hyperkube[4852]: I0615
> 07:31:49.744202 4852 container_manager_linux.go:899] attempting to apply
> oom_score_adj of -999 to pid 4852
> Jun 15 07:31:49.744225 openshift-worker-1 hyperkube[4852]: I0615
> 07:31:49.744227 4852 oom_linux.go:65] attempting to set
> "/proc/4852/oom_score_adj" to "-999"
> Jun 15 07:31:49.744332 openshift-worker-1 hyperkube[4852]: I0615
> 07:31:49.744264 4852 event.go:291] "Event occurred"
> object="openshift-worker-1" kind="Node" apiVersion="" type="Normal"
> reason="NodeAllocatableEnforced" message="Updated Node Allocatable limit
> across pods"
> Jun 15 07:31:49.744489 openshift-worker-1 hyperkube[4852]: I0615
> 07:31:49.744471 4852 qos_container_manager_linux.go:333]
> [ContainerManager]: Updated QoS cgroup configuration
> Jun 15 07:31:49.744795 openshift-worker-1 hyperkube[4852]: W0615
> 07:31:49.744770 4852 manager.go:244] Continue after failing to read
> checkpoint file. Device allocation info may NOT be up-to-date. Err: json:
> cannot unmarshal array into Go struct field
> PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type
> checkpoint.DevicesPerNUMA
> ```
>
> So this means the internal devicemanager maps will be **completely** empty.
> The expected state after recovery is that the *key* of the relevant devices
> will be present. This is a key difference.
> After the failed recovery, the kubelet will happily run, but the node update
> callback
> (https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/cm/devicemanager/manager.go#L516)
> will **not** update the node objects.
>
> This is another key difference. The expected behaviour is to reset the
> allocatable device counters to 0 (zero) up until the device plugins register
> themselves again.
> So the scheduler will send pods on this node, while it should not (devices
> not ready).
>
> In the pod admission flow, and we call `DeviceManager.Allocate()`, which
> calls `allocateContainerResources` (this happens in each and every flow,
> including the one we seen in this BZ)
> https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/cm/devicemanager/manager.go#L830
> Each loop, we have a call to `isDevicePluginResources`
> (https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/cm/devicemanager/manager.go#L843).
> Problem is: exactly because the **empty healthyDevices** maps, it always
> return false.
> Worth stressing that the bug is that the maps are completely empty. The
> expected flows is to have the **keys** in the map, but with zero values
> (until the device plugin registers itself again). Should that have been the
> case, an device allocation request should have been performed.
> This explains the weird behaviour documented on
> https://bugzilla.redhat.com/show_bug.cgi?id=1969870#c15
>
> I believe that fixing the checkpoint recovery should fix all the issues
> reported here.
> Further investigation should explore
> 1. if the bug presents itself in other scenarios than the upgrade flow
> (seems unlikely)
> 2. if the bug was present in earlier versions (no data, if so there could be
> a different root cause).

Further investigation reveals that the device plugin is not particularly slow to re-register. It's "just" that the node is recovering, so all the pods - including the device plugin one - need to be recreated, and this takes some time. Ultimately, it really seems a plain old race, which is possible because the device manager state is corrupted.
Should the device manager state be correct, the workload pod would have been correctly kept pending by the basic kubernetes/ocp infrastructure until the devices were registered in the kubelet again. It seems the only needed fix is to properly recover the device manager state.
We're experiencing this same issue.
1) Is there a workaround?
2) Do you have an ETA or release target?

Thank you, in advance!
(In reply to Abdul from comment #35)
> We're experiencing this same issue.
> 1) Is there a workaround ?
> 2) Do you have an ETA or release target ?
>
> Thank you, in advance !

Hi!

1. A workaround is to wait for the scheduler plugin to register itself again before scheduling any pods to the node - note that this bug should only trigger after a kubelet upgrade. This may not be trivial to implement; maybe cordon the node after the upgrade? Unfortunately it's also hard to tell when the node has healed by looking only at the node objects; it's probably safer to inspect the kubelet logs.
2. The ETA depends mostly on upstream acceptance; I've re-pinged some key people lately. I will update with more details as soon as I get them.
(In reply to Francesco Romani from comment #36)
> 1. A workaround is to wait for the scheduler plugin to register itself again

Of course I meant "device plugins" here.
Thank you for the prompt response, Francesco!

I've redeployed the device-plugin-mgr over and over again... and unfortunately that workaround is not working for me - unless there's some other step involved that I missed.

Thank you, once more!
(In reply to Abdul from comment #38)
> Thank you, for the prompt response, Francesco !
>
> I've redeployed the device-plugin-mgr over and over again...and
> unfortunately that workaround is not working for me - unless there's some
> other step involved that I missed.
>
> Thank you, once more !

Sorry, I was not clear. Redeploying the device plugin will most likely make things worse :)

The issue here is that *after a kubelet upgrade* there is a time window in which the cluster state wrt device availability is inconsistent and misleading. This time window spans from the kubelet restart up to the device-plugin(s) re-registration. The kubelet does not expose when the re-registration happens, so it is challenging to tell when the node has healed without looking at the kubelet logs.

The workaround/mitigation is to give the node time to heal - "how much time" is challenging to determine for the reasons above - by not scheduling pods on it. So something like:
1. cordon the node (= set it Unschedulable)
2. update the kubelet
3. wait some time, say 5 minutes
4. uncordon the node (= set it Schedulable again)

If a workload fails to start due to device unavailability, try to reschedule the workload again.

I'll work to accelerate the u/s consumption of the fix once kubernetes 1.22 is out.
Thank you, for the clarification !
All the relevant PRs and backports were merged!
https://github.com/kubernetes/kubernetes/issues/102880#issuecomment-982744510
Once OCP rebases, we will get them.
Considering kubernetes, the fix is available in:
- 1.20.14 and onwards
- 1.21.8 and onwards
- 1.22.5 and onwards
- 1.23.0 and onwards
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.59 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6322