Created attachment 1712000 [details]
config daemon pod crash logs

Description of problem:
After sriov upgrade from :

oc get csv
NAME                                           DISPLAY                   VERSION                 REPLACES                                       PHASE
sriov-network-operator.4.6.0-202008200814.p0   SR-IOV Network Operator   4.6.0-202008200814.p0   sriov-network-operator.4.6.0-202008121454.p0   Succeeded

oc get pod -l app=sriov-network-config-daemon
NAME                                READY   STATUS             RESTARTS   AGE
sriov-network-config-daemon-2ckzk   1/1     Running            0          26m
sriov-network-config-daemon-72zvw   0/1     CrashLoopBackOff   9          27m
sriov-network-config-daemon-dskb7   1/1     Running            0          28m

found one configdaemon pod crashed

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. oc logs $crashpod
2.
3.

Actual results:

Expected results:

Additional info:
there is policy:

oc get sriovnetworknodepolicy mlx278-netdevice -o yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  creationTimestamp: "2020-08-19T09:21:13Z"
  generation: 1
  managedFields:
  - apiVersion: sriovnetwork.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:mtu: {}
        f:nicSelector:
          .: {}
          f:pfNames: {}
          f:rootDevices: {}
          f:vendor: {}
        f:nodeSelector:
          .: {}
          f:feature.node.kubernetes.io/sriov-capable: {}
        f:numVfs: {}
        f:resourceName: {}
    manager: kubectl-create
    operation: Update
    time: "2020-08-19T09:21:13Z"
  name: mlx278-netdevice
  namespace: openshift-sriov-network-operator
  resourceVersion: "2371593"
  selfLink: /apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodepolicies/mlx278-netdevice
  uid: 58fd90b8-8e78-44a9-ae96-d49888143da3
spec:
  deviceType: netdevice
  isRdma: false
  linkType: eth
  mtu: 1500
  nicSelector:
    pfNames:
    - ens3f0
    rootDevices:
    - 0000:5e:00.0
    vendor: 15b3
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: "true"
  numVfs: 2
  priority: 99
  resourceName: mlx278netdevice
The direct cause of this issue was:
1) Existing VF resources were allocated to a pod (in a non-host namespace).
2) Policy reconciliation was triggered (it is not clear why the node was triggered to apply a policy without draining the node).
3) The config daemon writer failed to get the VF MTU in the host namespace.
4) needUpdate evaluated to true in the config daemon.
5) The config daemon tried to set the VF MTU via setNetdevMTU and failed.

The fix would be to not trigger needUpdate when the MTU is 0. Since the minimum MTU value a user can set is 1, a zero MTU value means the MTU could not be retrieved.

Panic:
I0820 10:19:43.429444 3789528 generic_plugin.go:105] generic-plugin Apply(): desiredState={3312958 [{0000:5e:00.0 2 1500 ens3f0 eth [{mlx278netdevice netdevice 0-1 mlx278-netdevice}]}]}
I0820 10:19:43.429611 3789528 utils.go:148] needUpdate(): VF MTU needs update, desired=1500
I0820 10:19:43.429649 3789528 utils.go:186] configSriovDevice(): config interface 0000:5e:00.0 with &{0000:5e:00.0 2 1500 ens3f0 eth [{mlx278netdevice netdevice 0-1 mlx278-netdevice}]}
I0820 10:19:43.430586 3789528 driver.go:79] BindDefaultDriver(): bind device 0000:5e:00.2 to default driver
I0820 10:19:43.430655 3789528 driver.go:112] hasDriver(): device 0000:5e:00.2 driver is mlx5_core
I0820 10:19:43.430685 3789528 driver.go:83] BindDefaultDriver(): device 0000:5e:00.2 already bound to default driver mlx5_core
I0820 10:19:43.430714 3789528 utils.go:280] setNetdevMTU(): set MTU for device 0000:5e:00.2 to 1500
E0820 10:19:43.431060 3789528 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 28 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1d03b20, 0xc000af2500)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1d03b20, 0xc000af2500)
    /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x16a
github.com/openshift/sriov-network-operator/pkg/utils.setNetdevMTU.func1(0xc00088c2c0, 0xc00088c300)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:292 +0x322
github.com/cenkalti/backoff.RetryNotify(0xc0005f9300, 0x7f4fba626320, 0xc00088c2c0, 0x0, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/github.com/cenkalti/backoff/retry.go:37 +0xba
github.com/cenkalti/backoff.Retry(...)
    /go/src/github.com/openshift/sriov-network-operator/vendor/github.com/cenkalti/backoff/retry.go:24
github.com/openshift/sriov-network-operator/pkg/utils.setNetdevMTU(0xc000c0a225, 0xc, 0x5dc, 0x0, 0x3)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:286 +0x280
github.com/openshift/sriov-network-operator/pkg/utils.configSriovDevice(0xc001480720, 0xc0005720b0, 0x1, 0xc000272601)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:247 +0x2e7
github.com/openshift/sriov-network-operator/pkg/utils.SyncNodeState(0xc000272780, 0x5, 0xc001040500)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:121 +0x429
github.com/openshift/sriov-network-operator/pkg/plugins/generic.(*GenericPlugin).Apply(0x7f4fbafce320, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/pkg/plugins/generic/generic_plugin.go:126 +0x15e
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).nodeStateSyncHandler(0xc0001aad10, 0xe, 0x2d3dad8, 0xc000a940d0)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:419 +0xfd9
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem.func1(0xc0001aad10, 0x1b87ec0, 0xc000e191b8, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:287 +0xcf
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem(0xc0001aad10, 0x203000)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:303 +0x164
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).runWorker(0xc0001aad10)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:248 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000aa00f0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000aa00f0, 0x200a2e0, 0xc000aa2270, 0xc000010001, 0xc0000d00c0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000aa00f0, 0x3b9aca00, 0x0, 0x100000000000001, 0xc0000d00c0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe2
k8s.io/apimachinery/pkg/util/wait.Until(0xc000aa00f0, 0x3b9aca00, 0xc0000d00c0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).Run
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:228 +0xa29
panic: runtime error: index out of range [0] with length 0 [recovered]
    panic: runtime error: index out of range [0] with length 0
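
The fix proposed in the analysis above (do not report that an update is needed when the observed VF MTU is 0) could look roughly like the sketch below. This is not the operator's actual needUpdate code, which is not shown in this report. The vfState type, the vfMTUNeedsUpdate function, and the second PCI address are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// vfState is a hypothetical, trimmed-down view of a VF as observed by the
// config daemon. It is not a type from the operator.
type vfState struct {
	PciAddress string
	Mtu        int // 0 is taken to mean "MTU could not be read in the host namespace"
}

// vfMTUNeedsUpdate reports whether the daemon should try to set the VF MTU.
// Assumption from the analysis: the smallest MTU a user can set is 1, so an
// observed MTU of 0 can only mean the value was not retrievable (for example,
// the VF currently lives in a pod namespace).
func vfMTUNeedsUpdate(observed vfState, desiredMTU int) bool {
	if desiredMTU <= 0 {
		return false // the policy does not request an MTU
	}
	if observed.Mtu == 0 {
		return false // MTU unknown: skip rather than touch a VF we cannot see
	}
	return observed.Mtu != desiredMTU
}

func main() {
	inPod := vfState{PciAddress: "0000:5e:00.2", Mtu: 0}     // VF allocated to a pod
	onHost := vfState{PciAddress: "0000:5e:00.3", Mtu: 9000} // illustrative VF still on the host
	fmt.Println(vfMTUNeedsUpdate(inPod, 1500))
	fmt.Println(vfMTUNeedsUpdate(onHost, 1500))
}
```

With this guard, the VF that is in use by a pod is skipped (prints false) and the VF still visible on the host is reconfigured (prints true), so setNetdevMTU would not be called for a device whose interface cannot be found.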
Verified this bug on 4.6.0-202008260226.p0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
@zhaozhanqi, Hello,

As per the bug, the fix is available in 4.6.1, but the customer is still facing the issue on version 4.6.17. Please let me know whether this issue requires a new BZ or can continue here.

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.17    True        False         2d      Cluster version is 4.6.17

oc -n openshift-sriov-network-operator get pods -owide
NAME                                READY   STATUS             RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
sriov-network-config-daemon-rrkjl   0/1     CrashLoopBackOff   361        31h   10.146.136.5   worker-0.ocp1.pd.f5net.com   <none>           <none>

. . .
I0423 18:56:13.237429 3432740 utils.go:310] setNetdevMTU(): set MTU for device 0000:06:02.4 to 9000
I0423 18:56:13.238464 3432740 driver.go:79] BindDefaultDriver(): bind device 0000:06:02.5 to default driver
I0423 18:56:13.238477 3432740 driver.go:112] hasDriver(): device 0000:06:02.5 driver is mlx5_core
I0423 18:56:13.238482 3432740 driver.go:83] BindDefaultDriver(): device 0000:06:02.5 already bound to default driver mlx5_core
I0423 18:56:13.238514 3432740 utils.go:310] setNetdevMTU(): set MTU for device 0000:06:02.5 to 9000
I0423 18:56:13.238795 3432740 driver.go:79] BindDefaultDriver(): bind device 0000:06:00.4 to default driver
I0423 18:56:13.238808 3432740 driver.go:112] hasDriver(): device 0000:06:00.4 driver is mlx5_core
I0423 18:56:13.238812 3432740 driver.go:83] BindDefaultDriver(): device 0000:06:00.4 already bound to default driver mlx5_core
I0423 18:56:13.238850 3432740 utils.go:310] setNetdevMTU(): set MTU for device 0000:06:00.4 to 9000
I0423 18:56:13.239939 3432740 driver.go:79] BindDefaultDriver(): bind device 0000:06:02.6 to default driver
I0423 18:56:13.239952 3432740 driver.go:112] hasDriver(): device 0000:06:02.6 driver is mlx5_core
I0423 18:56:13.239957 3432740 driver.go:83] BindDefaultDriver(): device 0000:06:02.6 already bound to default driver mlx5_core
I0423 18:56:13.239991 3432740 utils.go:310] setNetdevMTU(): set MTU for device 0000:06:02.6 to 9000
I0423 18:56:13.241126 3432740 driver.go:79] BindDefaultDriver(): bind device 0000:06:02.7 to default driver
I0423 18:56:13.241140 3432740 driver.go:112] hasDriver(): device 0000:06:02.7 driver is mlx5_core
I0423 18:56:13.241145 3432740 driver.go:83] BindDefaultDriver(): device 0000:06:02.7 already bound to default driver mlx5_core
I0423 18:56:13.241178 3432740 utils.go:310] setNetdevMTU(): set MTU for device 0000:06:02.7 to 9000
I0423 18:56:13.242284 3432740 driver.go:79] BindDefaultDriver(): bind device 0000:06:03.0 to default driver
I0423 18:56:13.242296 3432740 driver.go:112] hasDriver(): device 0000:06:03.0 driver is mlx5_core
I0423 18:56:13.242301 3432740 driver.go:83] BindDefaultDriver(): device 0000:06:03.0 already bound to default driver mlx5_core
I0423 18:56:13.242335 3432740 utils.go:310] setNetdevMTU(): set MTU for device 0000:06:03.0 to 9000
I0423 18:56:13.243436 3432740 driver.go:79] BindDefaultDriver(): bind device 0000:06:03.1 to default driver
I0423 18:56:13.243448 3432740 driver.go:112] hasDriver(): device 0000:06:03.1 driver is mlx5_core
I0423 18:56:13.243453 3432740 driver.go:83] BindDefaultDriver(): device 0000:06:03.1 already bound to default driver mlx5_core
I0423 18:56:13.243483 3432740 utils.go:310] setNetdevMTU(): set MTU for device 0000:06:03.1 to 9000
I0423 18:56:13.244627 3432740 driver.go:79] BindDefaultDriver(): bind device 0000:06:03.2 to default driver
I0423 18:56:13.244639 3432740 driver.go:112] hasDriver(): device 0000:06:03.2 driver is mlx5_core
I0423 18:56:13.244645 3432740 driver.go:83] BindDefaultDriver(): device 0000:06:03.2 already bound to default driver mlx5_core
I0423 18:56:13.244669 3432740 utils.go:310] setNetdevMTU(): set MTU for device 0000:06:03.2 to 9000
E0423 18:56:13.244812 3432740 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 75 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1c68480, 0xc000c80300)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x1c68480, 0xc000c80300)
    /usr/lib/golang/src/runtime/panic.go:969 +0x1c5
github.com/openshift/sriov-network-operator/pkg/utils.setNetdevMTU.func1(0xc000203f40, 0xc000203f80)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:322 +0x32c
github.com/cenkalti/backoff.RetryNotify(0xc000dcd258, 0x7fb0c37e4d00, 0xc000203f40, 0x0, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/github.com/cenkalti/backoff/retry.go:37 +0x209
github.com/cenkalti/backoff.Retry(...)
    /go/src/github.com/openshift/sriov-network-operator/vendor/github.com/cenkalti/backoff/retry.go:24
github.com/openshift/sriov-network-operator/pkg/utils.setNetdevMTU(0xc000d12ea5, 0xc, 0x2328, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:316 +0x285
github.com/openshift/sriov-network-operator/pkg/utils.configSriovDevice(0xc0007b2540, 0xc000c2ab00, 0x1, 0x1)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:277 +0x405
github.com/openshift/sriov-network-operator/pkg/utils.SyncNodeState(0xc00022a900, 0x5, 0xc000126470)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:121 +0x432
github.com/openshift/sriov-network-operator/pkg/plugins/generic.(*GenericPlugin).Apply(0x7fb0c3fc9a40, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/pkg/plugins/generic/generic_plugin.go:126 +0x15f
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).nodeStateSyncHandler(0xc0001b8b00, 0x5, 0x2a0a1d8, 0xc000c8c000)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:427 +0x1005
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem.func1(0xc0001b8b00, 0x1b15800, 0x29c6588, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:287 +0xcf
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem(0xc0001b8b00, 0x203000)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:303 +0x15f
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).runWorker(0xc0001b8b00)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:248 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000314270)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000314270, 0x1edb200, 0xc0002433e0, 0xc0001b2e01, 0xc0000e80c0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000314270, 0x3b9aca00, 0x0, 0xc0008b5001, 0xc0000e80c0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe5
k8s.io/apimachinery/pkg/util/wait.Until(0xc000314270, 0x3b9aca00, 0xc0000e80c0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).Run
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:228 +0xa45
panic: runtime error: index out of range [0] with length 0 [recovered]
    panic: runtime error: index out of range [0] with length 0

goroutine 75 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x1c68480, 0xc000c80300)
    /usr/lib/golang/src/runtime/panic.go:969 +0x1c5
github.com/openshift/sriov-network-operator/pkg/utils.setNetdevMTU.func1(0xc000203f40, 0xc000203f80)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:322 +0x32c
github.com/cenkalti/backoff.RetryNotify(0xc000dcd258, 0x7fb0c37e4d00, 0xc000203f40, 0x0, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/github.com/cenkalti/backoff/retry.go:37 +0x209
github.com/cenkalti/backoff.Retry(...)
    /go/src/github.com/openshift/sriov-network-operator/vendor/github.com/cenkalti/backoff/retry.go:24
github.com/openshift/sriov-network-operator/pkg/utils.setNetdevMTU(0xc000d12ea5, 0xc, 0x2328, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:316 +0x285
github.com/openshift/sriov-network-operator/pkg/utils.configSriovDevice(0xc0007b2540, 0xc000c2ab00, 0x1, 0x1)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:277 +0x405
github.com/openshift/sriov-network-operator/pkg/utils.SyncNodeState(0xc00022a900, 0x5, 0xc000126470)
    /go/src/github.com/openshift/sriov-network-operator/pkg/utils/utils.go:121 +0x432
github.com/openshift/sriov-network-operator/pkg/plugins/generic.(*GenericPlugin).Apply(0x7fb0c3fc9a40, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/pkg/plugins/generic/generic_plugin.go:126 +0x15f
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).nodeStateSyncHandler(0xc0001b8b00, 0x5, 0x2a0a1d8, 0xc000c8c000)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:427 +0x1005
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem.func1(0xc0001b8b00, 0x1b15800, 0x29c6588, 0x0, 0x0)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:287 +0xcf
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).processNextWorkItem(0xc0001b8b00, 0x203000)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:303 +0x15f
github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).runWorker(0xc0001b8b00)
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:248 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000314270)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000314270, 0x1edb200, 0xc0002433e0, 0xc0001b2e01, 0xc0000e80c0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000314270, 0x3b9aca00, 0x0, 0xc0008b5001, 0xc0000e80c0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe5
k8s.io/apimachinery/pkg/util/wait.Until(0xc000314270, 0x3b9aca00, 0xc0000e80c0)
    /go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/openshift/sriov-network-operator/pkg/daemon.(*Daemon).Run
    /go/src/github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:228 +0xa45
Siva, please open a new bug for 4.8, we will fix the panic in master then backport to 4.6.z
(In reply to zenghui.shi from comment #9)
> Siva, please open a new bug for 4.8, we will fix the panic in master then
> backport to 4.6.z

There is already a bug that has the same panic as described in comment #8: https://bugzilla.redhat.com/show_bug.cgi?id=1955874. I will use it to fix 4.8.0, then backport to 4.7.z/4.6.z.