Description of problem:
On a cluster with the sriov-device-plugin running, delete the socket file /var/lib/kubelet/plugins_registry/sriovNet.sock. The device plugin should detect the problem and restart itself, but it does not.

Version-Release number of selected component (if applicable):
payload: 4.1.0-0.nightly-2019-05-05-070156
node: Red Hat Enterprise Linux CoreOS 410.8.20190505.0 (Ootpa)

How reproducible:
Always

Steps to Reproduce:
1. Set up an OCP cluster on bare metal with SR-IOV enabled.
2. Create a pod which requests a VF:

apiVersion: v1
kind: Pod
metadata:
  generateName: sriovpod-
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-conf
spec:
  containers:
  - name: test-pod
    image: bmeng/centos-network
    resources:
      requests:
        openshift.io/sriov: 1
      limits:
        openshift.io/sriov: 1

3. Delete the sriov socket file on the node:
# rm -f /var/lib/kubelet/plugins_registry/sriovNet.sock
4. Create another pod from the same template as above.

Actual results:
2. The pod is created successfully and runs well.
4. The pod stays in Pending status. The sriov-device-plugin does not detect that the socket file is missing and does not restart itself.

Expected results:
The device plugin should detect the missing socket file and restart itself.
Additional info:
The last few lines of the device plugin log:

I0508 09:14:21.588941 14 sriov-device-plugin.go:321] Starting SRIOV Network Device Plugin server at: /var/lib/kubelet/plugins_registry/sriovNet.sock
I0508 09:14:21.590038 14 sriov-device-plugin.go:348] SRIOV Network Device Plugin server started serving
I0508 09:14:21.593116 14 sriov-device-plugin.go:461] Plugin: sriovNet.sock gets registered successfully at Kubelet
I0508 09:14:21.593252 14 sriov-device-plugin.go:476] ListAndWatch: send initial devices &ListAndWatchResponse{Devices:[&Device{ID:0000:3b:02.3,Health:Healthy,} &Device{ID:0000:3b:02.4,Health:Healthy,} &Device{ID:0000:3b:02.5,Health:Healthy,} &Device{ID:0000:3b:02.6,Health:Healthy,} &Device{ID:0000:3b:02.7,Health:Healthy,} &Device{ID:0000:3b:02.0,Health:Healthy,} &Device{ID:0000:3b:02.1,Health:Healthy,} &Device{ID:0000:3b:02.2,Health:Healthy,}],}
I0508 09:15:49.287774 14 sriov-device-plugin.go:527] DeviceID in Allocate: 0000:3b:02.4
I0508 09:15:49.287849 14 sriov-device-plugin.go:541] PCI Addrs allocated: 0000:3b:02.4,
I0508 09:18:11.840716 14 sriov-device-plugin.go:461] Plugin: sriovNet.sock gets registered successfully at Kubelet
I0508 09:18:11.840790 14 sriov-device-plugin.go:476] ListAndWatch: send initial devices &ListAndWatchResponse{Devices:[&Device{ID:0000:3b:02.7,Health:Healthy,} &Device{ID:0000:3b:02.0,Health:Healthy,} &Device{ID:0000:3b:02.1,Health:Healthy,} &Device{ID:0000:3b:02.2,Health:Healthy,} &Device{ID:0000:3b:02.3,Health:Healthy,} &Device{ID:0000:3b:02.4,Health:Healthy,} &Device{ID:0000:3b:02.5,Health:Healthy,} &Device{ID:0000:3b:02.6,Health:Healthy,}],}
I0508 09:18:40.330417 14 sriov-device-plugin.go:527] DeviceID in Allocate: 0000:3b:02.7
I0508 09:18:40.330472 14 sriov-device-plugin.go:541] PCI Addrs allocated: 0000:3b:02.7,

The pod status:

# oc get po
NAME             READY   STATUS    RESTARTS   AGE
sriovpod-2ps96   1/1     Running   0          19m
sriovpod-9rf6w   1/1     Running   0          16m
sriovpod-kt5rs   0/1     Pending   0          9m4s

# oc describe po sriovpod-kt5rs
Events:
Type     Reason            Age                  From               Message
----     ------            ---                  ----               -------
Warning  FailedScheduling  52s (x9 over 7m16s)  default-scheduler  0/2 nodes are available: 2 Insufficient openshift.io/sriov.
Probably the right way to fix this is to specify a liveness check that pings the socket. That would also catch a crashed device plugin. I don't think this is urgent; assigning to 4.2. I may transition this to a Jira card as an RFE.
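A minimal sketch of such a liveness check, assuming the plugin container ships a shell and has the kubelet plugin registry directory mounted; the probe command and timing values below are illustrative, not the actual fix:

```yaml
# Hypothetical livenessProbe for the device-plugin container spec.
# `test -S` succeeds only if the path exists and is a socket, so
# deleting the socket file makes the probe fail and kubelet restarts
# the container, which re-registers the plugin on startup.
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - test -S /var/lib/kubelet/plugins_registry/sriovNet.sock
  initialDelaySeconds: 10
  periodSeconds: 10
```

As noted above, this would also recover from a crashed device plugin, since a dead gRPC server eventually leaves a stale or missing socket behind.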
Agreed with Casey that this is not an urgent issue, since it can only be caused by a mis-operation. Another way to fix this might be to watch for the deletion event of the socket file and re-create it; will make the change in the 4.2 cycle.
Removing the betablocker keyword, per comment 1 and comment 2.
Moving to 4.3.
Moving to 4.4.