
Bug 1707761

Summary:	[sriov-dp] The sriov-device-plugin cannot detect the socket missing and restart itself
Product:	OpenShift Container Platform	Reporter:	Meng Bo <bmeng>
Component:	Networking	Assignee:	zenghui.shi <zshi>
Networking sub component:	SR-IOV	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	medium
Priority:	medium	CC:	aos-bugs, nagrawal, wsun, zshi
Version:	4.1.0
Target Milestone:	---
Target Release:	4.4.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-02-06 00:36:15 UTC	Type:	Bug

Description Meng Bo 2019-05-08 09:41:28 UTC

Description of problem:
On a cluster with sriov-device-plugin running, delete the socket file /var/lib/kubelet/plugins_registry/sriovNet.sock on the node.

The device plugin should detect that the socket file is missing and restart itself, but it does not.

Version-Release number of selected component (if applicable):
payload: 4.1.0-0.nightly-2019-05-05-070156
node: Red Hat Enterprise Linux CoreOS 410.8.20190505.0 (Ootpa)

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster on bare metal with SR-IOV enabled
2. Create a pod that requests a VF:
apiVersion: v1
kind: Pod
metadata:
  generateName: sriovpod-
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-conf
spec:
  containers:
  - name: test-pod
    image: bmeng/centos-network
    resources:
      requests:
        openshift.io/sriov: 1
      limits:
        openshift.io/sriov: 1

3. Delete the sriov socket file on the node
# rm -f /var/lib/kubelet/plugins_registry/sriovNet.sock

4. Create another pod from the same template as in step 2

Actual results:
2. The pod is created successfully and runs normally.
4. The pod stays in Pending status.

The sriov-device-plugin does not detect that the socket file is missing and does not restart itself.

Expected results:
The sriov-device-plugin should detect the missing socket file and restart itself.

Additional info:
The last few lines of the device plugin log:
I0508 09:14:21.588941      14 sriov-device-plugin.go:321] Starting SRIOV Network Device Plugin server at: /var/lib/kubelet/plugins_registry/sriovNet.sock
I0508 09:14:21.590038      14 sriov-device-plugin.go:348] SRIOV Network Device Plugin server started serving
I0508 09:14:21.593116      14 sriov-device-plugin.go:461] Plugin: sriovNet.sock gets registered successfully at Kubelet
I0508 09:14:21.593252      14 sriov-device-plugin.go:476] ListAndWatch: send initial devices &ListAndWatchResponse{Devices:[&Device{ID:0000:3b:02.3,Health:Healthy,} &Device{ID:0000:3b:02.4,Health:Healthy,} &Device{ID:0000:3b:02.5,Health:Healthy,} &Device{ID:0000:3b:02.6,Health:Healthy,} &Device{ID:0000:3b:02.7,Health:Healthy,} &Device{ID:0000:3b:02.0,Health:Healthy,} &Device{ID:0000:3b:02.1,Health:Healthy,} &Device{ID:0000:3b:02.2,Health:Healthy,}],}
I0508 09:15:49.287774      14 sriov-device-plugin.go:527] DeviceID in Allocate: 0000:3b:02.4
I0508 09:15:49.287849      14 sriov-device-plugin.go:541] PCI Addrs allocated: 0000:3b:02.4,
I0508 09:18:11.840716      14 sriov-device-plugin.go:461] Plugin: sriovNet.sock gets registered successfully at Kubelet
I0508 09:18:11.840790      14 sriov-device-plugin.go:476] ListAndWatch: send initial devices &ListAndWatchResponse{Devices:[&Device{ID:0000:3b:02.7,Health:Healthy,} &Device{ID:0000:3b:02.0,Health:Healthy,} &Device{ID:0000:3b:02.1,Health:Healthy,} &Device{ID:0000:3b:02.2,Health:Healthy,} &Device{ID:0000:3b:02.3,Health:Healthy,} &Device{ID:0000:3b:02.4,Health:Healthy,} &Device{ID:0000:3b:02.5,Health:Healthy,} &Device{ID:0000:3b:02.6,Health:Healthy,}],}
I0508 09:18:40.330417      14 sriov-device-plugin.go:527] DeviceID in Allocate: 0000:3b:02.7
I0508 09:18:40.330472      14 sriov-device-plugin.go:541] PCI Addrs allocated: 0000:3b:02.7,


The pod status:
# oc get po
NAME             READY   STATUS    RESTARTS   AGE
sriovpod-2ps96   1/1     Running   0          19m
sriovpod-9rf6w   1/1     Running   0          16m
sriovpod-kt5rs   0/1     Pending   0          9m4s

# oc describe po sriovpod-kt5rs
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  52s (x9 over 7m16s)  default-scheduler  0/2 nodes are available: 2 Insufficient openshift.io/sriov.

Comment 1 Casey Callendrello 2019-05-08 09:47:41 UTC

Probably the right way to do this is to specify a liveness check that pings the socket. This would also catch a crashed device plugin.

I don't think this is urgent; assigning to 4.2. I may transition this to a Jira card as an RFE.

Comment 2 zenghui.shi 2019-05-08 11:45:27 UTC

Agreed with Casey that this is not urgent, since it can only be caused by an operator mistake.
Another possible fix is to watch for the deletion event of the socket file and re-create it. I will make the change in the 4.2 cycle.

Comment 3 Wei Sun 2019-05-08 11:54:33 UTC

Removing the betablocker keyword per comments 1 and 2.

Comment 4 zenghui.shi 2019-08-21 13:06:54 UTC

moving to 4.3

Comment 5 zenghui.shi 2019-12-06 03:02:40 UTC

moving to 4.4