Bug 1840639

Summary: [sriov][4.4.z] sriov config daemon pod restarted due to panic
Product: OpenShift Container Platform Reporter: zhaozhanqi <zzhao>
Component: Networking Assignee: Peng Liu <pliu>
Networking sub component: SR-IOV QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: dosmith, pliu
Version: 4.4   
Target Milestone: ---   
Target Release: 4.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1840637
: 1840642 Environment:
Last Closed: 2020-06-17 22:26:36 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1840637    
Bug Blocks: 1840642    

Description zhaozhanqi 2020-05-27 11:28:20 UTC
+++ This bug was initially created as a clone of Bug #1840637 +++

Description of problem:
After the sriov pods had been running for some days, the sriov config daemon pod was found to have restarted. The logs retrieved with `--previous` show:

I0526 07:19:32.218230 1168941 utils.go:282] tryGetInterfaceName(): name is ens1f0
I0526 07:19:32.218299 1168941 utils.go:287] getNetdevMTU(): get MTU for device 0000:3b:00.1
I0526 07:19:32.218329 1168941 utils.go:282] tryGetInterfaceName(): name is ens1f1
I0526 07:19:32.218372 1168941 utils.go:282] tryGetInterfaceName(): name is ens1f1
I0526 07:19:32.218449 1168941 utils.go:287] getNetdevMTU(): get MTU for device 0000:5e:00.0
I0526 07:19:32.218478 1168941 utils.go:282] tryGetInterfaceName(): name is ens3f0
I0526 07:19:32.218525 1168941 utils.go:282] tryGetInterfaceName(): name is ens3f0
I0526 07:19:32.218852 1168941 utils.go:287] getNetdevMTU(): get MTU for device 0000:5e:00.2
I0526 07:19:32.218885 1168941 utils.go:282] tryGetInterfaceName(): name is ens3f0v0
I0526 07:19:32.219055 1168941 utils.go:287] getNetdevMTU(): get MTU for device 0000:5e:00.3
I0526 07:19:32.219091 1168941 utils.go:282] tryGetInterfaceName(): name is ens3f0v1
I0526 07:19:32.219138 1168941 utils.go:287] getNetdevMTU(): get MTU for device 0000:5e:00.1
I0526 07:19:32.219172 1168941 utils.go:282] tryGetInterfaceName(): name is ens3f1
I0526 07:19:32.219217 1168941 utils.go:282] tryGetInterfaceName(): name is ens3f1
I0526 07:19:32.219306 1168941 utils.go:287] getNetdevMTU(): get MTU for device 0000:60:00.0
I0526 07:19:32.219337 1168941 utils.go:282] tryGetInterfaceName(): name is ens2f0
I0526 07:19:32.219387 1168941 utils.go:282] tryGetInterfaceName(): name is ens2f0
I0526 07:19:32.219679 1168941 utils.go:287] getNetdevMTU(): get MTU for device 0000:60:00.2
I0526 07:19:32.219710 1168941 utils.go:282] tryGetInterfaceName(): name is ens2f0v0
I0526 07:19:32.219880 1168941 utils.go:287] getNetdevMTU(): get MTU for device 0000:60:00.3
I0526 07:19:32.219909 1168941 utils.go:282] tryGetInterfaceName(): name is ens2f0v1
I0526 07:19:32.219947 1168941 utils.go:287] getNetdevMTU(): get MTU for device 0000:60:00.1
I0526 07:19:32.219975 1168941 utils.go:282] tryGetInterfaceName(): name is ens2f1
I0526 07:19:32.220019 1168941 utils.go:282] tryGetInterfaceName(): name is ens2f1
I0526 07:19:32.532207 1168941 daemon.go:245] nodeStateChangeHandler(): new generation is 5
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x160 pc=0x1643514]

goroutine 70 [running]:
github.com/openshift/sriov-network-operator/pkg/daemon.setNodeStateStatus(0x1abdd60, 0xc000358cf0, 0xc00004800a, 0x27, 0xc000e14c00, 0xa, 0x10, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/sriov-network-operator/pkg/daemon/writer.go:111 +0x154
github.com/openshift/sriov-network-operator/pkg/daemon.(*NodeStateStatusWriter).Run(0xc000116b40, 0xc0000ea3c0, 0xc0000ea600, 0xc0000ea5a0, 0x0)
	/go/src/github.com/openshift/sriov-network-operator/pkg/daemon/writer.go:61 +0x42f
created by main.runStartCmd
	/go/src/github.com/openshift/sriov-network-operator/cmd/sriov-network-config-daemon/start.go:98 +0x4a9
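
The trace above shows setNodeStateStatus faulting at writer.go:111 with addr=0x160, which is consistent with reading a field through a nil struct pointer. This report does not include the code at that line, so the Go sketch below only illustrates the general defensive pattern of checking a returned pointer before use. NodeState, fetchNodeState, and setSyncStatus are hypothetical names chosen for illustration and are not the operator's actual types or functions.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeState is a hypothetical stand-in for the object whose status the
// daemon updates; it is not the operator's real type.
type NodeState struct {
	Name       string
	SyncStatus string
}

// fetchNodeState is a hypothetical lookup that, like many client calls,
// returns a nil pointer together with an error when the lookup fails.
func fetchNodeState(name string, known map[string]*NodeState) (*NodeState, error) {
	s, ok := known[name]
	if !ok {
		return nil, errors.New("node state not found: " + name)
	}
	return s, nil
}

// setSyncStatus checks both the error and the pointer before touching any
// field, so a failed lookup surfaces as an error instead of a SIGSEGV.
func setSyncStatus(name, status string, known map[string]*NodeState) error {
	s, err := fetchNodeState(name, known)
	if err != nil {
		return fmt.Errorf("set status for %q: %w", name, err)
	}
	if s == nil {
		return fmt.Errorf("set status for %q: nil node state", name)
	}
	s.SyncStatus = status
	return nil
}

func main() {
	known := map[string]*NodeState{"worker-0": {Name: "worker-0"}}
	// The first call updates an existing entry; the second reports an
	// error for an unknown node rather than dereferencing nil.
	fmt.Println(setSyncStatus("worker-0", "Succeeded", known))
	fmt.Println(setSyncStatus("worker-1", "Succeeded", known))
}
```

Returning an error from the status writer lets the caller log the failure and retry on the next update, rather than crashing and restarting the pod.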

Version-Release number of selected component (if applicable):
4.4.0-202005221118

How reproducible:
not sure

Steps to Reproduce:
1. oc logs sriov-network-config-daemon-7mlhz --previous

Actual results:

oc get pod sriov-network-config-daemon-7mlhz
NAME                                READY   STATUS    RESTARTS   AGE
sriov-network-config-daemon-7mlhz   1/1     Running   6          2d1h



Expected results:


Additional info:

Comment 1 Peng Liu 2020-06-01 03:51:08 UTC
Also fixed by https://github.com/openshift/sriov-network-operator/pull/225

Comment 4 zhaozhanqi 2020-06-08 05:55:04 UTC
Verified this bug on 4.4.0-202006061254

Comment 6 errata-xmlrpc 2020-06-17 22:26:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2445