Bug 1969870 - virt-launcher pod stuck in Init:0/1 status: error adding container to network "bigip1ens4f0vf2": SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required
Summary: virt-launcher pod stuck in Init:0/1 status: error adding container to network...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Francesco Romani
QA Contact: Weinan Liu
URL:
Whiteboard:
Duplicates: 1969869 (view as bug list)
Depends On:
Blocks:
 
Reported: 2021-06-09 11:26 UTC by Marius Cornea
Modified: 2022-09-13 13:54 UTC
CC: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-13 13:53:37 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github kubernetes kubernetes issues 102880 0 None open Device manager silently fails to restore checkpoint after upgrading to 1.20+ 2021-06-15 15:59:15 UTC
Github kubernetes kubernetes pull 102882 0 None open devicemanager: checkpoint: support pre-1.20 data 2021-06-17 16:16:17 UTC
Github kubernetes kubernetes pull 106295 0 None Merged Automated cherry pick of #102882: devicemanager: checkpoint: support pre-1.20 data 2021-12-02 16:11:56 UTC
Red Hat Product Errata RHSA-2022:6322 0 None None None 2022-09-13 13:54:03 UTC

Description Marius Cornea 2021-06-09 11:26:44 UTC
Description of problem:

Following a 4.6.17 -> 4.6.25 -> 4.7.11 OCP upgrade and a CNV 2.5.7 -> 2.6.5 upgrade, one of the virt-launcher pods remains in Init:0/1 status, with the describe output showing "error adding container to network "bigip1ens4f0vf2": SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required".


[kni@ocp-edge18 ~]$ oc -n f5-lb get pods
NAME                         READY   STATUS     RESTARTS   AGE
virt-launcher-bigip0-5g7c7   1/1     Running    0          38m
virt-launcher-bigip1-gfx9q   0/1     Init:0/1   0          68m
[kni@ocp-edge18 ~]$ oc -n f5-lb describe pods virt-launcher-bigip1-gfx9q | grep -i fail
  Warning  FailedScheduling        69m                   default-scheduler  0/7 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity.
  Warning  FailedScheduling        69m                   default-scheduler  0/7 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity.
  Warning  FailedScheduling        68m                   default-scheduler  0/7 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling        62m                   default-scheduler  0/7 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity.
  Warning  FailedCreatePodSandBox  61m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_virt-launcher-bigip1-gfx9q_f5-lb_28cb4897-3d43-4811-8787-c8609e0d5261_0(9b028b2436d825d3217322a0f93a94fc8e7d1b6af52af434b32edcaf2bd2b5be): [f5-lb/virt-launcher-bigip1-gfx9q:bigip1ens4f0vf2]: error adding container to network "bigip1ens4f0vf2": SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 4.6.17 with CNV and SR-IOV operators

2. Create attached sriovnetworknodepolicy, sriovnetwork, NodeNetworkConfigurationPolicy and VirtualMachine

3. Upgrade OCP to 4.6.25 and then to 4.7.11

4. Upgrade CNV to 2.6.5

5. Upgrade SR-IOV network operator from 4.6.0-202106010807.p0.git.78e7139 to 4.7.0-202105211528.p0


Actual results:
The virt-launcher pod assigned to 1 of the 2 VirtualMachines is in Init:0/1 status.

Expected results:
all virt-launcher pods are in Running status as they were before the upgrade procedure

Additional info:

After deleting the virt-launcher pod stuck in Init:0/1 status, it gets recreated and reaches Running state.

Attaching must-gather and manifests used to create the VMs and networks.

Comment 8 zenghui.shi 2021-06-10 01:47:24 UTC
It seems the issue is that the sriov device was not yet ready when kubelet tried to start a new virt-launcher-bigip1 pod, which results in sriov-cni failing to get the sriov device ID.

From kubelet log:

# Jun 09 10:06:51 openshift-worker-3 systemd[1]: Starting Kubernetes Kubelet...

kubelet starts at 10:06:51

# Jun 09 10:08:22 openshift-worker-3 hyperkube[4248]: I0609 10:08:22.922738    4248 kuberuntime_manager.go:443] No sandbox for pod "virt-launcher-bigip1-gfx9q_f5-lb(28cb4897-3d43-4811-8787-c8609e0d5261)" can be found. Need to start a new one

kubelet tried to start a new virt-launcher-bigip1 pod. <== This is key to the failure. Normally pods won't be scheduled on a node whose sriov device is not ready, but in this case the pod gets re-created by kubelet directly, which I think skips the kube-scheduler step that would otherwise block the pod from landing on this particular node until the sriov device becomes ready (i.e. the node status reports a non-zero allocatable resource).

# 1680244 Jun 09 10:08:25 openshift-worker-3 hyperkube[4248]: E0609 10:08:25.449851    4248 remote_runtime.go:116] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_virt-launcher-bigip1-gfx9q_f5-lb_28cb4897-3d43-4811-8787-c8609e0d5261_0(9b028b2436d825d3217322a0f93a94fc8e7d1b6af52af434b32edcaf2bd2b5be): [f5-lb/virt-launcher-bigip1-gfx9q:bigip1ens4f0vf2]: error adding container to network "bigip1ens4f0vf2": SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required

sriov-cni failed to load netconf since the VF pci addr (device ID) is not provided (sriov device is not ready).

# Jun 09 10:31:04 openshift-worker-3 hyperkube[4248]: I0609 10:31:04.021215    4248 setters.go:389] Update allocatable for openshift.io/bigip0ens4f0vf1 to 0

kubelet reports the sriov device status, indicating the sriov device was still not ready.

In the rest of the kubelet log, I didn't see `openshift.io/bigip0ens4f0vf1` set to a non-zero value by kubelet; are we missing any logs from kubelet?

I think the reason deleting and re-creating the virt-launcher-bigip1 pod worked is that the pod got re-created after the sriov device became ready (this could be a workaround).

Comment 9 Nabeel Cocker 2021-06-10 04:11:52 UTC
Marius,

Would it be possible to run the same test, but in addition to the virt-launcher pods, also create a deployment that places pods consuming VFs on the same nodes and verify whether those pods hit the same issue?

I just want to determine whether we see the same behavior as with the virt-launcher pod.

thanks
Nabeel

Comment 10 zenghui.shi 2021-06-10 04:32:57 UTC
(In reply to zenghui.shi from comment #8)
> [...]
Could someone from the kubelet team help us understand: when kubelet tries to restart the pod, does it re-evaluate the pod's device assignment (e.g. update the checkpoint file and the podresource endpoint)? Since the CNI relies on the podresource endpoint to get the allocated device for a container, it will fail if the device doesn't exist in the podresource endpoint or checkpoint file.

> 
> From the rest of kubelet log, I didn't see `openshift.io/bigip0ens4f0vf1` be
> set to 1 (non-zero) value by kubelet, do we miss any log from kubelet?

Can we also get the sriov-network-device-plugin log while the issue is happening? It would tell us whether devices have been reported to kubelet or not.

Comment 13 Marius Cornea 2021-06-10 11:34:32 UTC
I reproduced the issue and am adding some more data points:

- I created non-CNV related deployments that consume VFs, per Nabeel's suggestion, and they are affected as well
- the issue doesn't show up after the 4.6.17 -> 4.6.25 upgrade
- the issue starts showing up after the 4.6.25 -> 4.7.11 OCP upgrade and before the CNV/SR-IOV operator upgrades
- the issue seems to affect the pods running on the first node of each machine config pool; pods start fine on the second node in the mcp

To give more context on the assignment of the setup's nodes to machine config pools: the setup has 4 worker nodes; the first 2 nodes (worker-0, worker-1) are part of the worker-cnf mcp and the last 2 nodes (worker-2, worker-3) are part of the load-balancer mcp. After the upgrade to 4.7.11 completes, the issue affects only the pods scheduled to run on the first worker of each mcp, i.e. worker-0 and worker-2, while pods scheduled on worker-1 and worker-3 run fine.

[kni@ocp-edge18 ~]$ oc get nodes
NAME                 STATUS   ROLES                  AGE   VERSION
openshift-master-0   Ready    master                 16h   v1.20.0+75370d3
openshift-master-1   Ready    master                 16h   v1.20.0+75370d3
openshift-master-2   Ready    master                 16h   v1.20.0+75370d3
openshift-worker-0   Ready    worker,worker-cnf      15h   v1.20.0+75370d3
openshift-worker-1   Ready    worker,worker-cnf      15h   v1.20.0+75370d3
openshift-worker-2   Ready    load-balancer,worker   15h   v1.20.0+75370d3
openshift-worker-3   Ready    load-balancer,worker   15h   v1.20.0+75370d3


[kni@ocp-edge18 ~]$ oc get mcp
NAME            CONFIG                                                    UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
load-balancer   rendered-load-balancer-3cb175e59327a8b5519217413b157fbb   True      False      False      2              2                   2                     0                      14h
master          rendered-master-49bc369f4185154baf6e63499638a235          True      False      False      3              3                   3                     0                      16h
worker          rendered-worker-317a9f85779c73c6f6b222f8a02132a3          True      False      False      0              0                   0                     0                      16h
worker-cnf      rendered-worker-cnf-317a9f85779c73c6f6b222f8a02132a3      True      False      False      2              2                   2                     0                      5h11m


[kni@ocp-edge18 ~]$ oc -n f5-lb get pods -o wide
NAME                         READY   STATUS              RESTARTS   AGE   IP            NODE                 NOMINATED NODE   READINESS GATES
app0-5c88dc9c58-mvf2c        0/1     ContainerCreating   0          39m   <none>        openshift-worker-0   <none>           <none>
app1-854b554b69-p4ffz        1/1     Running             0          28m   10.130.2.6    openshift-worker-1   <none>           <none>
app2-646c7dfb88-dmmd7        0/1     ContainerCreating   0          53m   <none>        openshift-worker-2   <none>           <none>
app3-9cd657b59-pq2kf         1/1     Running             0          25m   10.128.2.10   openshift-worker-3   <none>           <none>
virt-launcher-bigip0-fp2z7   0/1     Init:0/1            0          53m   <none>        openshift-worker-2   <none>           <none>
virt-launcher-bigip1-54b4s   1/1     Running             0          25m   10.128.2.8    openshift-worker-3   <none>           <none>

Comment 15 zenghui.shi 2021-06-10 12:22:48 UTC
Debugging in the same environment as in comment #13, some more findings:

`app2-646c7dfb88-dmmd7` is an sriov pod on the `openshift-worker-2` worker node which is failing to start.

$ oc describe pod app2-646c7dfb88-dmmd7 
Name:           app2-646c7dfb88-dmmd7
Namespace:      f5-lb
Priority:       0
Node:           openshift-worker-2/10.46.58.25
Start Time:     Thu, 10 Jun 2021 06:45:56 -0400
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/app2-646c7dfb88
Containers:
  app2:
    Container ID:   
    Image:          xxxxxx
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      openshift.io/app2ens4f0vf0:  1   <===== sriov resource is requested
    Requests:
      openshift.io/app2ens4f0vf0:  1


$ oc get node openshift-worker-2 -o json | jq -r '.status.allocatable'
{
...
  "openshift.io/app2ens4f0vf0": "1",   <===== node has sriov resource
  "openshift.io/app3ens4f0vf0": "1",
  "openshift.io/bigip0ens4f0vf0": "1",
  "openshift.io/bigip0ens4f0vf1": "1",
  "openshift.io/bigip1ens4f0vf2": "1",
  "openshift.io/bigip1ens4f0vf3": "1",
  "pods": "250"
}

On the openshift-worker-2 node, the kubelet internal checkpoint file (/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint) shows that the `app2ens4f0vf0` resource is registered:

    "RegisteredDevices": {
...
      "openshift.io/app2ens4f0vf0": [
        "0000:af:00.7"
...
      ],
    }

but it is not assigned to any containers:

    "PodDeviceEntries": [
      {
        "PodUID": "cbb6dd83-99c4-4d0e-8683-b4cc8d7ccdc7",
        "ContainerName": "compute",
        "ResourceName": "devices.kubevirt.io/vhost-net",
        "DeviceIDs": {
          "0": [
            "vhost-net83"
          ]
        },
        "AllocResp": "GiQKDi9kZXYvdmhvc3QtbmV0Eg4vZGV2L3Zob3N0LW5ldBoCcnc="
      },
      {
        "PodUID": "cbb6dd83-99c4-4d0e-8683-b4cc8d7ccdc7",
        "ContainerName": "compute",
        "ResourceName": "devices.kubevirt.io/kvm",
        "DeviceIDs": {
          "0": [
            "kvm31"
          ]
        },
        "AllocResp": "GhgKCC9kZXYva3ZtEggvZGV2L2t2bRoCcnc="
      },
      {
        "PodUID": "cbb6dd83-99c4-4d0e-8683-b4cc8d7ccdc7",
        "ContainerName": "compute",
        "ResourceName": "devices.kubevirt.io/tun",
        "DeviceIDs": {
          "0": [
            "tun102"
          ]
        },
        "AllocResp": "GiAKDC9kZXYvbmV0L3R1bhIML2Rldi9uZXQvdHVuGgJydw=="
      }
    ],


Enabling the multus CNI debug log (multus CNI queries pod resource information from the kubelet podresource endpoint) confirms that the device is not assigned to the pod:

2021-06-10T11:44:37Z [debug] getKubernetesDelegate: found resourceName annotation : openshift.io/app2ens4f0vf0
2021-06-10T11:44:37Z [debug] GetResourceClient: using Kubelet resource API endpoint
2021-06-10T11:44:37Z [debug] getKubernetesDelegate: resourceMap instance: map[]     <==== multus gets no resource from kubelet podresource endpoint and device ID is not passed to sriov-cni

sriov-cni failed to get the VF PCI address (device ID):

2021-06-10T11:44:51Z [error] [f5-lb/virt-launcher-bigip0-fp2z7:bigip0ens4f0vf0]: error adding container to network "bigip0ens4f0vf0": SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required

It looks like kubelet starts the pod without requesting an sriov resource for it.
We need input from the kubelet team on why this happens.
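
For reference, the device ID lookup that fails here goes through the kubelet PodResources API: multus queries the kubelet's pod-resources socket and passes the returned PCI address to sriov-cni as the device ID. Below is a minimal sketch of such a query, not multus' actual code; the v1 API and the default socket path /var/lib/kubelet/pod-resources/kubelet.sock are assumptions that may differ per kubelet version and configuration.

```
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Default kubelet pod-resources socket; the path is an assumption and
	// may differ depending on the kubelet configuration.
	const socket = "/var/lib/kubelet/pod-resources/kubelet.sock"

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the kubelet over the local unix socket.
	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}

	// Print the devices the kubelet believes it has allocated per container.
	// In the failure above, the SR-IOV resource never shows up in this
	// listing, so multus has no VF PCI address to hand to sriov-cni.
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			for _, d := range c.GetDevices() {
				fmt.Printf("%s/%s %s: %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), c.GetName(),
					d.GetResourceName(), d.GetDeviceIds())
			}
		}
	}
}
```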

Comment 17 zenghui.shi 2021-06-10 15:11:15 UTC
Changing component to Node - Kubelet to get input on comment #15.

Comment 18 Swati Sehgal 2021-06-10 18:26:08 UTC
I don't think there is anything obviously wrong from the podresource API point of view. The podresource API client is accessed without any issue, but multus relies on the podresource API endpoint for obtaining the resource map and device ID. Because the device is registered but not assigned to the pod, we won't get the device ID when the podresource endpoint is queried. It appears to me there is something wrong in the device manager.

Comment 19 Swati Sehgal 2021-06-10 18:59:54 UTC
The other issue I see here is that multus appears to be failing to obtain device ID information from the podresource API, but the pod is not running on the node in the first place, so how would we get device ID information for a pod that is not running from the podresource API?

Comment 21 zenghui.shi 2021-06-11 01:56:36 UTC
(In reply to Swati Sehgal from comment #18)
> I don't think there is anything obvious wrong from podresource API point of
> view. The podresource API client is accessed without any issue but in multus
> the reliance is on the podresource API endpoint for obtaining resource map
> and device ID. Because the device is registered and not assigned to the pod,
> so we won't get the device id when the podresource endpoint is queried.
> Appears to me there is something wrong in the device manager.

When kubelet assigns a device to the pod, will it re-evaluate the device assignment on subsequent retries (once the sriov resource is ready) and update it through the podresource API?
If so, should we expect the pod to be created successfully even if the first few retries failed because the sriov resource was not ready?

From the sriov device plugin log, it seems not, since the device plugin didn't receive any Allocate call from kubelet:

trimmed log from sriov device plugin pod running on openshift-worker-2 node:

I0610 10:46:57.643418      26 manager.go:193] validating resource name "openshift.io/app2ens4f0vf0"
I0610 10:46:57.715955      26 manager.go:116] Creating new ResourcePool: app2ens4f0vf0
I0610 10:46:57.715964      26 manager.go:117] DeviceType: netDevice
I0610 10:46:57.721723      26 factory.go:106] device added: [pciAddr: 0000:af:00.7, vendor: 15b3, device: 1018, driver: mlx5_core]
I0610 10:46:57.721749      26 manager.go:145] New resource server is created for app2ens4f0vf0 ResourcePool
I0610 10:46:57.738235      26 main.go:72] Starting all servers...
I0610 10:46:57.738252      26 server.go:191] starting app2ens4f0vf0 device plugin endpoint at: openshift.io_app2ens4f0vf0.sock
I0610 10:46:57.739793      26 server.go:217] app2ens4f0vf0 device plugin endpoint started serving
I0610 10:46:57.744269      26 main.go:77] All servers started.
I0610 10:46:57.744284      26 main.go:78] Listening for term signals
I0610 10:46:57.769271      26 server.go:106] Plugin: openshift.io_app2ens4f0vf0.sock gets registered successfully at Kubelet
I0610 10:46:57.769983      26 server.go:131] ListAndWatch(app2ens4f0vf0) invoked
I0610 10:46:57.770002      26 server.go:139] ListAndWatch(app2ens4f0vf0): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:af:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}

Comment 22 Petr Horáček 2021-06-11 07:56:22 UTC
I just stumbled upon a similar BZ that was reported on 4.8. Any chance these two are related at all? https://bugzilla.redhat.com/show_bug.cgi?id=1968625

Comment 23 Petr Horáček 2021-06-11 08:00:05 UTC
(In reply to Petr Horáček from comment #22)
> I just stumbled upon a similar BZ that was reported on 4.8. Any chance these
> two are related at all? https://bugzilla.redhat.com/show_bug.cgi?id=1968625

Or this which looks even closer (although not the same) https://bugzilla.redhat.com/show_bug.cgi?id=1962634

Comment 24 Federico Paolinelli 2021-06-11 08:04:03 UTC
(In reply to Petr Horáček from comment #23)
> (In reply to Petr Horáček from comment #22)
> > I just stumbled upon a similar BZ that was reported on 4.8. Any chance these
> > two are related at all? https://bugzilla.redhat.com/show_bug.cgi?id=1968625
> 
> Or this which looks even closer (although not the same)
> https://bugzilla.redhat.com/show_bug.cgi?id=1962634

Yep, we were missing details from the virt-launcher pod there, and the resource was available on the node.
Even there, the only log from the device plugin is about sending the devices to the kubelet, not receiving the allocation request.
I think we are looking at the same bug.

Comment 25 zenghui.shi 2021-06-11 08:58:14 UTC
(In reply to Petr Horáček from comment #22)
> I just stumbled upon a similar BZ that was reported on 4.8. Any chance these
> two are related at all? https://bugzilla.redhat.com/show_bug.cgi?id=1968625

No, that one is a regression introduced in 4.8; it fails sriov pod creation the very first time the pod is created.

Comment 26 Francesco Romani 2021-06-11 15:37:28 UTC
Just to clarify: do we see this issue only after a kubelet restart, or is it independent of the kubelet state? In other words, does it happen shortly after a kubelet restart, or can it also happen when the kubelet has been up and running for hours/days? If it happens regardless of a restart (so also in steady state), does it happen with the same frequency?

Comment 28 Elana Hashman 2021-06-11 20:11:38 UTC
Reassigning to Francesco, adding NEEDINFO for his questions.

Comment 29 Francesco Romani 2021-06-15 10:05:12 UTC
I think it goes like this.

The bug triggers only after an upgrade, so the kubelet is stopped, updated (4.6 -> 4.7) and started again.

The updated kubelet fails to restore the checkpoint data:
```
Jun 15 07:31:49.743990 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.743846    4852 manager.go:236] Starting Device Plugin manager
Jun 15 07:31:49.744225 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.744202    4852 container_manager_linux.go:899] attempting to apply oom_score_adj of -999 to pid 4852
Jun 15 07:31:49.744225 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.744227    4852 oom_linux.go:65] attempting to set "/proc/4852/oom_score_adj" to "-999"
Jun 15 07:31:49.744332 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.744264    4852 event.go:291] "Event occurred" object="openshift-worker-1" kind="Node" apiVersion="" type="Normal" reason="NodeAllocatableEnforced" message="Updated Node Allocatable limit across pods"
Jun 15 07:31:49.744489 openshift-worker-1 hyperkube[4852]: I0615 07:31:49.744471    4852 qos_container_manager_linux.go:333] [ContainerManager]: Updated QoS cgroup configuration
Jun 15 07:31:49.744795 openshift-worker-1 hyperkube[4852]: W0615 07:31:49.744770    4852 manager.go:244] Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: json: cannot unmarshal array into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type checkpoint.DevicesPerNUMA
```

So this means the internal devicemanager maps will be **completely** empty. The expected state after recovery is that the *keys* of the relevant devices are present. This is a key difference.
After the failed recovery, the kubelet will happily run, but the node update callback (https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/cm/devicemanager/manager.go#L516) will **not** update the node objects.

This is another key difference. The expected behaviour is to reset the allocatable device counters to 0 (zero) until the device plugins register themselves again.
So the scheduler will send pods to this node, while it should not (devices not ready).

In the pod admission flow we call `DeviceManager.Allocate()`, which calls `allocateContainerResources` (this happens in each and every flow, including the one we see in this BZ): https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/cm/devicemanager/manager.go#L830
In each loop iteration there is a call to `isDevicePluginResources` (https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/cm/devicemanager/manager.go#L843). The problem is: exactly because of the **empty healthyDevices** map, it always returns false.
Worth stressing that the bug is that the maps are completely empty. The expected state is to have the **keys** in the map, but with zero values (until the device plugin registers itself again). Had that been the case, a device allocation request would have been performed.
This explains the weird behaviour documented in https://bugzilla.redhat.com/show_bug.cgi?id=1969870#c15

I believe that fixing the checkpoint recovery should fix all the issues reported here.
Further investigation should explore:
1. whether the bug presents itself in scenarios other than the upgrade flow (seems unlikely)
2. whether the bug was present in earlier versions (no data; if so, there could be a different root cause).
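
For illustration, the unmarshal failure above boils down to a checkpoint layout change: pre-1.20 kubelets wrote DeviceIDs as a flat JSON array, while 1.20+ expects a per-NUMA map (checkpoint.DevicesPerNUMA). The sketch below uses simplified stand-in types to show the mismatch and the kind of tolerant fallback decoding the upstream fix (kubernetes/kubernetes PR #102882) introduces; it is not the kubelet's actual code, and the real checkpoint additionally wraps these entries in a Data object with a checksum, omitted here.

```
package main

import (
	"encoding/json"
	"fmt"
)

// 1.20+ groups device IDs by NUMA node.
type DevicesPerNUMA map[int64][]string

// Simplified stand-in for the 1.20+ checkpoint entry.
type PodDevicesEntry struct {
	PodUID        string
	ContainerName string
	ResourceName  string
	DeviceIDs     DevicesPerNUMA
}

// Simplified stand-in for the pre-1.20 entry: DeviceIDs is a flat list.
type podDevicesEntryV1 struct {
	PodUID        string
	ContainerName string
	ResourceName  string
	DeviceIDs     []string
}

// decodeEntries tries the 1.20+ layout first and falls back to the pre-1.20
// layout, converting flat device lists into the per-NUMA map (NUMA affinity
// is unknown for old data, so everything lands under node 0).
func decodeEntries(data []byte) ([]PodDevicesEntry, error) {
	var current []PodDevicesEntry
	if err := json.Unmarshal(data, &current); err == nil {
		return current, nil
	}
	var old []podDevicesEntryV1
	if err := json.Unmarshal(data, &old); err != nil {
		return nil, fmt.Errorf("checkpoint matches neither layout: %w", err)
	}
	converted := make([]PodDevicesEntry, 0, len(old))
	for _, e := range old {
		converted = append(converted, PodDevicesEntry{
			PodUID:        e.PodUID,
			ContainerName: e.ContainerName,
			ResourceName:  e.ResourceName,
			DeviceIDs:     DevicesPerNUMA{0: e.DeviceIDs},
		})
	}
	return converted, nil
}

func main() {
	// A pre-1.20 style entry: DeviceIDs is a plain JSON array, which the
	// 1.20+ type rejects with the "cannot unmarshal array" error seen above.
	oldData := []byte(`[{"PodUID":"abc","ContainerName":"compute",
	  "ResourceName":"openshift.io/app2ens4f0vf0","DeviceIDs":["0000:af:00.7"]}]`)

	entries, err := decodeEntries(oldData)
	fmt.Println(entries, err)
}
```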

Comment 31 Petr Horáček 2021-06-16 12:24:13 UTC
*** Bug 1969869 has been marked as a duplicate of this bug. ***

Comment 32 Francesco Romani 2021-06-17 08:43:40 UTC
(In reply to Francesco Romani from comment #29)
> [...]

Further investigation reveals that the device plugin is not particularly slow to re-register.
It's "just" that the node is recovering, so all the pods - including the device plugin one - need to be recreated, and this takes some time.
Ultimately, it really seems like a plain old race, which is possible because the device manager state is corrupted.
Had the device manager state been correct, the workload pod would have been correctly kept pending by the basic kubernetes/ocp infrastructure until the devices were registered with the kubelet again.

It seems the only needed fix is to properly recover the device manager state.

Comment 35 Abdul 2021-07-28 14:57:59 UTC
We're experiencing this same issue. 
 1) Is there a workaround?
 2) Do you have an ETA or release target?

Thank you, in advance !

Comment 36 Francesco Romani 2021-07-28 15:06:45 UTC
(In reply to Abdul from comment #35)
> We're experiencing this same issue. 
>  1)Is there a workaround ? 
>  2) Do you have an ETA or release target ?
> 
> Thank you, in advance !

Hi!

1. A workaround is to wait for the scheduler plugin to register itself again before scheduling any pods on the node - note this bug should trigger only after a kubelet upgrade. This may not be trivial to implement; maybe cordon the node after the upgrade? Unfortunately it's also hard to tell when the node has healed by looking only at the node objects; it's probably safer to inspect the kubelet logs.
2. Regarding the ETA: it depends mostly on upstream acceptance; I've re-pinged some key people lately. I will update with more details as soon as I get them.

Comment 37 Francesco Romani 2021-07-28 15:07:29 UTC
(In reply to Francesco Romani from comment #36)
> 1. A workaround is to wait for the scheduler plugin to register itself again

Of course I meant "device plugins" here.

Comment 38 Abdul 2021-07-28 15:28:58 UTC
Thank you, for the prompt response, Francesco !

I've redeployed the device-plugin-mgr over and over again...and unfortunately that workaround is not working for me - unless there's some other step involved that I missed.

Thank you, once more !

Comment 39 Francesco Romani 2021-07-29 06:12:40 UTC
(In reply to Abdul from comment #38)
> Thank you, for the prompt response, Francesco !
> 
> I've redeployed the device-plugin-mgr over and over again...and
> unfortunately that workaround is not working for me - unless there's some
> other step involved that I missed.
> 
> Thank you, once more !

Sorry I was not clear. Redeploying the device plugin will most likely make things worse :)

The issue here is that *after a kubelet upgrade* there is a time window in which the cluster state with respect to device availability is inconsistent and misleading. This time window spans from the kubelet restart up to device-plugin re-registration. The kubelet does not expose when the re-registration happens, so it is challenging to tell when the node has healed without looking at kubelet logs.

The workaround/mitigation is to give the node time to heal - "how much time" is hard to determine, for the reasons above - by not scheduling pods on it. So something like:
1. cordon the node (= set it Unschedulable/NotReady)
2. update the kubelet
3. wait some time, say 5 minutes
4. uncordon the node (= set it Schedulable again)

If a workload fails to start due to device unavailability, try to reschedule the workload.

I'll work to accelerate the upstream consumption of the fix once kubernetes 1.22 is out.
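
For clusters where the cordon/uncordon mitigation above needs to be automated, the same steps can be driven through the Kubernetes API. A minimal client-go sketch, under the assumption that a kubeconfig is available and that a fixed wait is acceptable; the node name, kubeconfig path and the 5-minute sleep below are placeholders, not values from this bug.

```
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// setUnschedulable cordons (true) or uncordons (false) a node via a JSON
// merge patch on spec.unschedulable, the same field `oc adm cordon` flips.
func setUnschedulable(ctx context.Context, cs kubernetes.Interface, node string, v bool) error {
	patch := []byte(fmt.Sprintf(`{"spec":{"unschedulable":%t}}`, v))
	_, err := cs.CoreV1().Nodes().Patch(ctx, node, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	// Kubeconfig path is a placeholder; adjust for your environment.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	node := "openshift-worker-2" // example node name from this bug

	// 1. Cordon before the kubelet upgrade.
	if err := setUnschedulable(ctx, cs, node, true); err != nil {
		panic(err)
	}
	// 2.-3. Upgrade the kubelet out of band, then give the device plugins
	// time to re-register; the 5 minutes suggested above is only a guess.
	time.Sleep(5 * time.Minute)
	// 4. Uncordon so pods can be scheduled on the node again.
	if err := setUnschedulable(ctx, cs, node, false); err != nil {
		panic(err)
	}
}
```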

Comment 40 Abdul 2021-07-29 22:21:23 UTC
Thank you, for the clarification !

Comment 43 Francesco Romani 2021-12-02 16:12:49 UTC
All the relevant PRs and backports were merged! https://github.com/kubernetes/kubernetes/issues/102880#issuecomment-982744510
Once OCP rebases, we will get them.

Comment 47 Francesco Romani 2022-06-17 10:55:17 UTC
Considering upstream Kubernetes, the fix is available in:
1.20.14 and onwards
1.21.8 and onwards
1.22.5 and onwards
1.23.0 and onwards

Comment 54 errata-xmlrpc 2022-09-13 13:53:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.59 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6322

