Bug 1707761 - [sriov-dp] The sriov-device-plugin cannot detect the socket missing and restart itself
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: zenghui.shi
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-08 09:41 UTC by Meng Bo
Modified: 2020-02-06 00:36 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-06 00:36:15 UTC
Target Upstream Version:
Embargoed:



Description Meng Bo 2019-05-08 09:41:28 UTC
Description of problem:
On a cluster with sriov-device-plugin running, delete the socket file /var/lib/kubelet/plugins_registry/sriovNet.sock.

The device plugin should detect the problem and restart itself, but it does not.

Version-Release number of selected component (if applicable):
payload: 4.1.0-0.nightly-2019-05-05-070156
node: Red Hat Enterprise Linux CoreOS 410.8.20190505.0 (Ootpa)

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster on bare metal with SR-IOV enabled
2. Create a pod which requests a VF
apiVersion: v1
kind: Pod
metadata:
  generateName: sriovpod-
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-conf
spec:
  containers:
  - name: test-pod
    image: bmeng/centos-network
    resources:
      requests:
        openshift.io/sriov: 1
      limits:
        openshift.io/sriov: 1

3. Delete the sriov socket file on the node
# rm -f /var/lib/kubelet/plugins_registry/sriovNet.sock

4. Create another pod from the same template as above

Actual results:
2. The pod is created successfully and runs well.
4. The pod stays in Pending status.

The sriov-device-plugin does not detect that the socket file is missing and does not restart itself.

Expected results:
The device plugin should detect the missing socket file and restart itself.

Additional info:
The last few lines of the device plugin log:
I0508 09:14:21.588941      14 sriov-device-plugin.go:321] Starting SRIOV Network Device Plugin server at: /var/lib/kubelet/plugins_registry/sriovNet.sock
I0508 09:14:21.590038      14 sriov-device-plugin.go:348] SRIOV Network Device Plugin server started serving
I0508 09:14:21.593116      14 sriov-device-plugin.go:461] Plugin: sriovNet.sock gets registered successfully at Kubelet
I0508 09:14:21.593252      14 sriov-device-plugin.go:476] ListAndWatch: send initial devices &ListAndWatchResponse{Devices:[&Device{ID:0000:3b:02.3,Health:Healthy,} &Device{ID:0000:3b:02.4,Health:Healthy,} &Device{ID:0000:3b:02.5,Health:Healthy,} &Device{ID:0000:3b:02.6,Health:Healthy,} &Device{ID:0000:3b:02.7,Health:Healthy,} &Device{ID:0000:3b:02.0,Health:Healthy,} &Device{ID:0000:3b:02.1,Health:Healthy,} &Device{ID:0000:3b:02.2,Health:Healthy,}],}
I0508 09:15:49.287774      14 sriov-device-plugin.go:527] DeviceID in Allocate: 0000:3b:02.4
I0508 09:15:49.287849      14 sriov-device-plugin.go:541] PCI Addrs allocated: 0000:3b:02.4,
I0508 09:18:11.840716      14 sriov-device-plugin.go:461] Plugin: sriovNet.sock gets registered successfully at Kubelet
I0508 09:18:11.840790      14 sriov-device-plugin.go:476] ListAndWatch: send initial devices &ListAndWatchResponse{Devices:[&Device{ID:0000:3b:02.7,Health:Healthy,} &Device{ID:0000:3b:02.0,Health:Healthy,} &Device{ID:0000:3b:02.1,Health:Healthy,} &Device{ID:0000:3b:02.2,Health:Healthy,} &Device{ID:0000:3b:02.3,Health:Healthy,} &Device{ID:0000:3b:02.4,Health:Healthy,} &Device{ID:0000:3b:02.5,Health:Healthy,} &Device{ID:0000:3b:02.6,Health:Healthy,}],}
I0508 09:18:40.330417      14 sriov-device-plugin.go:527] DeviceID in Allocate: 0000:3b:02.7
I0508 09:18:40.330472      14 sriov-device-plugin.go:541] PCI Addrs allocated: 0000:3b:02.7,


The pod status:
# oc get po
NAME             READY   STATUS    RESTARTS   AGE
sriovpod-2ps96   1/1     Running   0          19m
sriovpod-9rf6w   1/1     Running   0          16m
sriovpod-kt5rs   0/1     Pending   0          9m4s

# oc describe po sriovpod-kt5rs
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  52s (x9 over 7m16s)  default-scheduler  0/2 nodes are available: 2 Insufficient openshift.io/sriov.

Comment 1 Casey Callendrello 2019-05-08 09:47:41 UTC
Probably the right way to do this is to specify a liveness check that pings the socket. This would also catch a crashed device plugin.

I don't think this is urgent; assigning to 4.2. I may transition this to a Jira card as an RFE.

Comment 2 zenghui.shi 2019-05-08 11:45:27 UTC
Agreed with Casey that this is not an urgent issue, since it can only be caused by a misoperation.
Another way to fix this might be to watch for the deletion event of the socket file and re-create it. I will make the change in the 4.2 cycle.

Comment 3 Wei Sun 2019-05-08 11:54:33 UTC
Removing the betablocker keyword according to comments 1 and 2.

Comment 4 zenghui.shi 2019-08-21 13:06:54 UTC
moving to 4.3

Comment 5 zenghui.shi 2019-12-06 03:02:40 UTC
moving to 4.4

