Bug 2015481 - [4.10] sriov-network-operator daemon pods are failing to start
Summary: [4.10] sriov-network-operator daemon pods are failing to start
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Peng Liu
QA Contact: Ziv Greenberg
URL:
Whiteboard:
Duplicates: 2028246
Depends On:
Blocks: 2015834 2015835 2028256
 
Reported: 2021-10-19 10:16 UTC by Ziv Greenberg
Modified: 2022-03-10 16:20 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2015834 2015835
Environment:
Last Closed: 2022-03-10 16:20:14 UTC
Target Upstream Version:
Embargoed:
Flags: pliu: needinfo-




Links:
- Github openshift sriov-network-operator pull 570 (open): "Bug 2015481: Sync upstream: 2021-10-20", last updated 2021-10-20 08:39:22 UTC
- Red Hat Product Errata RHSA-2022:0056, last updated 2022-03-10 16:20:43 UTC

Description Ziv Greenberg 2021-10-19 10:16:17 UTC
Description of problem:

After a successful deployment of OCP on OSP (Shift-on-Stack), the sriov-network-operator must be installed and configured as part of our Telco testing.
However, immediately after the initial operator installation, I found the daemon pods in "CrashLoopBackOff" status, and restarting them did not fix the underlying problem:

[cloud-user@installer-host ~]$ oc get -n openshift-sriov-network-operator all
NAME                                        READY   STATUS             RESTARTS   AGE
pod/network-resources-injector-7zvb6        1/1     Running            0          3m22s
pod/network-resources-injector-llx8q        1/1     Running            0          3m22s
pod/network-resources-injector-swzxk        1/1     Running            0          3m22s
pod/sriov-network-config-daemon-5jd4m       0/1     CrashLoopBackOff   4          3m22s
pod/sriov-network-config-daemon-mwzmz       0/1     CrashLoopBackOff   4          3m22s
pod/sriov-network-operator-6947d96c-lmcxn   1/1     Running            0          3m37s
 
NAME                                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/network-resources-injector-service   ClusterIP   172.30.179.150   <none>        443/TCP   3m22s
 
NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                 AGE
daemonset.apps/network-resources-injector    3         3         3       3            3           beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=   3m22s
daemonset.apps/sriov-network-config-daemon   2         2         0       2            0           beta.kubernetes.io/os=linux,node-role.kubernetes.io/worker=   3m22s
 
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   1/1     1            1           3m37s




Looking at the logs of those pods, I could see the following errors:

[cloud-user@installer-host ~]$ oc logs pod/sriov-network-config-daemon-5jd4m -n openshift-sriov-network-operator
I1018 18:15:11.524256  190307 start.go:107] overriding kubernetes api to https://api-int.ostest.shiftstack.com:6443
I1018 18:15:11.525581  190307 start.go:138] starting node writer
I1018 18:15:11.534127  190307 start.go:158] Running on platform: Virtual/Openstack
I1018 18:15:11.534142  190307 writer.go:44] Run(): start writer
I1018 18:15:11.534146  190307 writer.go:47] Run(): once
I1018 18:15:11.560971  190307 utils.go:598] getLinkType(): Device 0000:00:03.0
I1018 18:15:11.561041  190307 utils.go:598] getLinkType(): Device 0000:00:04.0
I1018 18:15:11.561098  190307 utils.go:598] getLinkType(): Device 0000:00:05.0
I1018 18:15:11.566328  190307 writer.go:132] setNodeStateStatus(): syncStatus: , lastSyncError:
I1018 18:15:11.571454  190307 writer.go:170] writeCheckpointFile(): try to decode the checkpoint file
I1018 18:15:11.571553  190307 start.go:164] Starting SriovNetworkConfigDaemon
I1018 18:15:11.571572  190307 writer.go:44] Run(): start writer
I1018 18:15:11.571579  190307 daemon.go:257] Run(): start daemon
E1018 18:15:11.581359  190307 daemon.go:951] tryEnableRdma(): fail to enable rdma exit status 1:
I1018 18:15:11.587662  190307 daemon.go:442] Set log verbose level to: 2
I1018 18:15:16.686993  190307 daemon.go:319] Starting workers
I1018 18:15:16.687012  190307 daemon.go:322] Started workers
I1018 18:15:16.687027  190307 daemon.go:362] worker queue size: 1
I1018 18:15:16.687032  190307 daemon.go:364] get item: 1
I1018 18:15:16.687037  190307 daemon.go:454] nodeStateSyncHandler(): new generation is 1
I1018 18:15:16.689510  190307 daemon.go:689] loadVendorPlugins(): try to load plugin virtual_plugin
I1018 18:15:16.689523  190307 plugin.go:39] loadPlugin(): load plugin from /plugins/virtual_plugin.so
I1018 18:15:16.689576  190307 writer.go:61] Run(): refresh trigger
I1018 18:15:16.689584  190307 writer.go:80] pollNicStatus()
I1018 18:15:16.689588  190307 utils_virtual.go:158] DiscoverSriovDevicesVirtual
I1018 18:15:16.708806  190307 virtual_plugin.go:52] virtual-plugin OnNodeStateAdd()
I1018 18:15:16.708855  190307 daemon.go:509] nodeStateSyncHandler(): plugin virtual_plugin: reqDrain false, reqReboot false
I1018 18:15:16.708868  190307 daemon.go:513] nodeStateSyncHandler(): reqDrain false, reqReboot false disableDrain false
I1018 18:15:16.708875  190307 virtual_plugin.go:84] virtual-plugin Apply(): desiredState={186996 []}
I1018 18:15:16.718493  190307 utils.go:409] getNetdevMTU(): get MTU for device 0000:00:03.0
I1018 18:15:16.718606  190307 utils.go:598] getLinkType(): Device 0000:00:03.0
I1018 18:15:16.718720  190307 utils.go:409] getNetdevMTU(): get MTU for device 0000:00:04.0
I1018 18:15:16.718797  190307 utils.go:598] getLinkType(): Device 0000:00:04.0
I1018 18:15:16.718916  190307 utils.go:409] getNetdevMTU(): get MTU for device 0000:00:05.0
I1018 18:15:16.718991  190307 utils.go:598] getLinkType(): Device 0000:00:05.0
I1018 18:15:16.724615  190307 writer.go:132] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
E1018 18:15:16.730394  190307 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 102 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1eb0e00, 0x2f17450)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1eb0e00, 0x2f17450)
        /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).nodeStateSyncHandler(0xc001440270, 0x1, 0xc0005ea0d0, 0xc001484630)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:548 +0x101b
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem.func1(0xc001440270, 0x1e3c2a0, 0x2f5e708, 0x0, 0x0)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:385 +0xdf
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem(0xc001440270, 0x203000)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:401 +0x169
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).runWorker(...)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:346
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00073e080)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00073e080, 0x22bc000, 0xc0001a69f0, 0x1, 0xc00010e360)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00073e080, 0x3b9aca00, 0x0, 0xc0004c8d01, 0xc00010e360)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc00073e080, 0x3b9aca00, 0xc00010e360)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).Run
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:321 +0xac5
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1d375bb]
 
goroutine 102 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1eb0e00, 0x2f17450)
        /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).nodeStateSyncHandler(0xc001440270, 0x1, 0xc0005ea0d0, 0xc001484630)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:548 +0x101b
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem.func1(0xc001440270, 0x1e3c2a0, 0x2f5e708, 0x0, 0x0)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:385 +0xdf
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem(0xc001440270, 0x203000)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:401 +0x169
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).runWorker(...)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:346
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00073e080)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00073e080, 0x22bc000, 0xc0001a69f0, 0x1, 0xc00010e360)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00073e080, 0x3b9aca00, 0x0, 0xc0004c8d01, 0xc00010e360)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc00073e080, 0x3b9aca00, 0xc00010e360)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).Run
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:321 +0xac5


[cloud-user@installer-host ~]$ oc logs pod/sriov-network-config-daemon-mwzmz -n openshift-sriov-network-operator
I1018 18:16:50.082468  424839 start.go:107] overriding kubernetes api to https://api-int.ostest.shiftstack.com:6443
I1018 18:16:50.084404  424839 start.go:138] starting node writer
I1018 18:16:50.092922  424839 start.go:158] Running on platform: Virtual/Openstack
I1018 18:16:50.093055  424839 writer.go:44] Run(): start writer
I1018 18:16:50.093125  424839 writer.go:47] Run(): once
I1018 18:16:50.122076  424839 utils.go:598] getLinkType(): Device 0000:00:03.0
I1018 18:16:50.122282  424839 utils.go:598] getLinkType(): Device 0000:00:05.0
I1018 18:16:50.122887  424839 utils.go:598] getLinkType(): Device 0000:00:06.0
I1018 18:16:50.125775  424839 writer.go:132] setNodeStateStatus(): syncStatus: , lastSyncError:
I1018 18:16:50.131089  424839 writer.go:170] writeCheckpointFile(): try to decode the checkpoint file
I1018 18:16:50.131332  424839 start.go:164] Starting SriovNetworkConfigDaemon
I1018 18:16:50.131349  424839 writer.go:44] Run(): start writer
I1018 18:16:50.131496  424839 daemon.go:257] Run(): start daemon
E1018 18:16:50.142113  424839 daemon.go:951] tryEnableRdma(): fail to enable rdma exit status 1:
I1018 18:16:50.147463  424839 daemon.go:442] Set log verbose level to: 2
I1018 18:16:55.247070  424839 daemon.go:319] Starting workers
I1018 18:16:55.247230  424839 daemon.go:322] Started workers
I1018 18:16:55.247254  424839 daemon.go:362] worker queue size: 1
I1018 18:16:55.247382  424839 daemon.go:364] get item: 1
I1018 18:16:55.247449  424839 daemon.go:454] nodeStateSyncHandler(): new generation is 1
I1018 18:16:55.250544  424839 daemon.go:689] loadVendorPlugins(): try to load plugin virtual_plugin
I1018 18:16:55.250556  424839 plugin.go:39] loadPlugin(): load plugin from /plugins/virtual_plugin.so
I1018 18:16:55.250558  424839 writer.go:61] Run(): refresh trigger
I1018 18:16:55.250566  424839 writer.go:80] pollNicStatus()
I1018 18:16:55.250579  424839 utils_virtual.go:158] DiscoverSriovDevicesVirtual
I1018 18:16:55.270524  424839 virtual_plugin.go:52] virtual-plugin OnNodeStateAdd()
I1018 18:16:55.270568  424839 daemon.go:509] nodeStateSyncHandler(): plugin virtual_plugin: reqDrain false, reqReboot false
I1018 18:16:55.270577  424839 daemon.go:513] nodeStateSyncHandler(): reqDrain false, reqReboot false disableDrain false
I1018 18:16:55.270583  424839 virtual_plugin.go:84] virtual-plugin Apply(): desiredState={186996 []}
I1018 18:16:55.279729  424839 utils.go:409] getNetdevMTU(): get MTU for device 0000:00:03.0
I1018 18:16:55.279757  424839 utils.go:598] getLinkType(): Device 0000:00:03.0
I1018 18:16:55.279811  424839 utils.go:409] getNetdevMTU(): get MTU for device 0000:00:05.0
I1018 18:16:55.279853  424839 utils.go:404] tryGetInterfaceName(): name is ens5
I1018 18:16:55.279899  424839 utils.go:404] tryGetInterfaceName(): name is ens5
I1018 18:16:55.279902  424839 utils.go:430] getNetDevMac(): get Mac for device ens5
I1018 18:16:55.279923  424839 utils.go:442] getNetDevLinkSpeed(): get LinkSpeed for device ens5
I1018 18:16:55.279939  424839 utils.go:598] getLinkType(): Device 0000:00:05.0
I1018 18:16:55.280070  424839 utils.go:409] getNetdevMTU(): get MTU for device 0000:00:06.0
I1018 18:16:55.280112  424839 utils.go:404] tryGetInterfaceName(): name is ens6
I1018 18:16:55.280154  424839 utils.go:404] tryGetInterfaceName(): name is ens6
I1018 18:16:55.280157  424839 utils.go:430] getNetDevMac(): get Mac for device ens6
I1018 18:16:55.280179  424839 utils.go:442] getNetDevLinkSpeed(): get LinkSpeed for device ens6
I1018 18:16:55.280198  424839 utils.go:598] getLinkType(): Device 0000:00:06.0
I1018 18:16:55.282415  424839 writer.go:132] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
E1018 18:16:55.291837  424839 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 123 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1eb0e00, 0x2f17450)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1eb0e00, 0x2f17450)
        /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).nodeStateSyncHandler(0xc0014984e0, 0x1, 0xc000b84000, 0xc000314000)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:548 +0x101b
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem.func1(0xc0014984e0, 0x1e3c2a0, 0x2f5e708, 0x0, 0x0)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:385 +0xdf
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem(0xc0014984e0, 0x203000)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:401 +0x169
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).runWorker(...)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:346
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001000590)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001000590, 0x22bc000, 0xc000c17cb0, 0x1, 0xc00010e540)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001000590, 0x3b9aca00, 0x0, 0x217b801, 0xc00010e540)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc001000590, 0x3b9aca00, 0xc00010e540)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).Run
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:321 +0xac5
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1d375bb]
 
goroutine 123 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1eb0e00, 0x2f17450)
        /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).nodeStateSyncHandler(0xc0014984e0, 0x1, 0xc000b84000, 0xc000314000)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:548 +0x101b
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem.func1(0xc0014984e0, 0x1e3c2a0, 0x2f5e708, 0x0, 0x0)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:385 +0xdf
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem(0xc0014984e0, 0x203000)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:401 +0x169
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).runWorker(...)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:346
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001000590)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001000590, 0x22bc000, 0xc000c17cb0, 0x1, 0xc00010e540)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001000590, 0x3b9aca00, 0x0, 0x217b801, 0xc00010e540)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc001000590, 0x3b9aca00, 0xc00010e540)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon.(*Daemon).Run
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/daemon/daemon.go:321 +0xac5




Version-Release number of selected component (if applicable):

Cluster version is 4.8.0-0.nightly-2021-10-16-024756




Additional info:

The actual bug and the proposed solution can be tracked here:
https://github.com/openshift/sriov-network-operator/commit/1d954a5304283f62808abbe13c55c6dd7b2b4083#diff-a53b7b593d3d778e62eaeeafa40088656f9212bfa2c2b7991df15fa78e60b0f0

Comment 1 Aaron Smith 2021-10-19 19:18:59 UTC
The issue affects both the 4.8 and 4.9 releases.  

I have verified an upstream patch by @pliu (https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/191/files)
that fixes the issue on the 4.9 release branch.

Comment 3 zhaozhanqi 2021-11-04 10:39:41 UTC
Hi Ziv, could you help check whether the fix works on version 4.10?

Comment 4 Ziv Greenberg 2021-11-04 10:46:34 UTC
Yes, of course.
This is exactly what I have been trying to achieve for the past couple of days.

The main problem is that 4.10 is currently not stable from a deployment point of view.
I'm trying to find a stable puddle to work with.
I'll update as soon as I have any progress.

Comment 5 Ziv Greenberg 2021-11-04 14:00:37 UTC
Hi,

I was able to verify it; please see the details below:

[cloud-user@installer-host ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-04-001635   True        False         71m     Cluster version is 4.10.0-0.nightly-2021-11-04-001635
[cloud-user@installer-host ~]$
[cloud-user@installer-host ~]$
[cloud-user@installer-host ~]$ oc get csv -n openshift-sriov-network-operator
NAME                                        DISPLAY                      VERSION              REPLACES   PHASE
performance-addon-operator.v4.9.0           Performance Addon Operator   4.9.0                           Succeeded
sriov-network-operator.4.9.0-202110182323   SR-IOV Network Operator      4.9.0-202110182323              Succeeded
[cloud-user@installer-host ~]$
[cloud-user@installer-host ~]$
[cloud-user@installer-host ~]$ oc get all -n openshift-sriov-network-operator
NAME                                         READY   STATUS    RESTARTS      AGE
pod/network-resources-injector-jp2l8         1/1     Running   0             44m
pod/network-resources-injector-p7tbw         1/1     Running   0             44m
pod/network-resources-injector-v8x6r         1/1     Running   0             44m
pod/sriov-device-plugin-knl7c                1/1     Running   0             31m
pod/sriov-network-config-daemon-67nhv        3/3     Running   7 (37m ago)   44m
pod/sriov-network-config-daemon-p5k2s        3/3     Running   7 (37m ago)   44m
pod/sriov-network-operator-976c7d6fc-4gjp8   1/1     Running   2 (32m ago)   44m

NAME                                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/network-resources-injector-service   ClusterIP   172.30.223.210   <none>        443/TCP   44m

NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                 AGE
daemonset.apps/network-resources-injector    3         3         3       3            3           beta.kubernetes.io/os=linux                                   44m
daemonset.apps/sriov-device-plugin           1         1         1       1            1           beta.kubernetes.io/os=linux,node-role.kubernetes.io/worker=   43m
daemonset.apps/sriov-network-config-daemon   2         2         2       2            2           beta.kubernetes.io/os=linux,node-role.kubernetes.io/worker=   44m

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   1/1     1            1           44m

NAME                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/sriov-network-operator-976c7d6fc   1         1         1       44m

One question, please: shouldn't we have the 4.10 version of the SR-IOV Network Operator instead of 4.9?

Comment 6 Peng Liu 2021-11-05 02:27:01 UTC
You should use a 4.10 image to verify; the fix has not yet been merged into the 4.9 branch. Please try image sriov-network-operator.4.10.0-202111031923 or a newer one.

Comment 7 Ziv Greenberg 2021-11-07 10:39:08 UTC
Hello Peng,

Sorry, I have no experience with this, as I've always used the latest marketplace version. Could you please elaborate on how I should install and use this specific image?
Additionally, if the fix has not yet been merged into the 4.9 branch, how come it works in my current environment?

Thanks.

Comment 8 Peng Liu 2021-11-08 00:56:23 UTC
@zzhao Could you help Ziv set up the QE operator repo in his environment?

Comment 11 Emilien Macchi 2021-12-01 19:48:03 UTC
*** Bug 2028246 has been marked as a duplicate of this bug. ***

Comment 14 Ziv Greenberg 2021-12-08 10:54:12 UTC
Hello,

I was able to verify it and also created a dedicated DUT pod with attached SR-IOV VFs:

(shiftstack) [cloud-user@installer-host ~]$ oc get clusterversions.config.openshift.io
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-06-162419   True        False         76m     Cluster version is 4.10.0-0.nightly-2021-12-06-162419
(shiftstack) [cloud-user@installer-host ~]$
(shiftstack) [cloud-user@installer-host ~]$
(shiftstack) [cloud-user@installer-host ~]$ oc get csv -n openshift-sriov-network-operator
NAME                                         DISPLAY                      VERSION               REPLACES   PHASE
performance-addon-operator.v4.9.2            Performance Addon Operator   4.9.2                            Succeeded
sriov-network-operator.4.10.0-202112070531   SR-IOV Network Operator      4.10.0-202112070531              Succeeded
(shiftstack) [cloud-user@installer-host ~]$
(shiftstack) [cloud-user@installer-host ~]$
(shiftstack) [cloud-user@installer-host ~]$ oc get all -n openshift-sriov-network-operator
NAME                                         READY   STATUS    RESTARTS   AGE
pod/network-resources-injector-bbct4         1/1     Running   0          32m
pod/network-resources-injector-m9n8b         1/1     Running   0          32m
pod/network-resources-injector-z2nzp         1/1     Running   0          32m
pod/sriov-device-plugin-tz7sr                1/1     Running   0          2m54s
pod/sriov-network-config-daemon-lllf4        3/3     Running   3          32m
pod/sriov-network-config-daemon-ngdrq        3/3     Running   3          32m
pod/sriov-network-operator-dfdf7b466-dgw6t   1/1     Running   0          32m

NAME                                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/network-resources-injector-service   ClusterIP   172.30.171.60   <none>        443/TCP   32m

NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                 AGE
daemonset.apps/network-resources-injector    3         3         3       3            3           beta.kubernetes.io/os=linux                                   32m
daemonset.apps/sriov-device-plugin           1         1         1       1            1           beta.kubernetes.io/os=linux,node-role.kubernetes.io/worker=   3m29s
daemonset.apps/sriov-network-config-daemon   2         2         2       2            2           beta.kubernetes.io/os=linux,node-role.kubernetes.io/worker=   32m

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   1/1     1            1           32m

NAME                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/sriov-network-operator-dfdf7b466   1         1         1       32m
(shiftstack) [cloud-user@installer-host ~]$
(shiftstack) [cloud-user@installer-host ~]$
(shiftstack) [cloud-user@installer-host ~]$ oc get pods
NAME           READY   STATUS    RESTARTS   AGE
dpdk-testpmd   1/1     Running   0          2m19s
(shiftstack) [cloud-user@installer-host ~]$

Thanks,
Ziv

Comment 17 errata-xmlrpc 2022-03-10 16:20:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

