Bug 1754738

Summary: [sriov] sriov operator pod crashes when a manually created NAD exists
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Reporter: zhaozhanqi <zzhao>
Assignee: zenghui.shi <zshi>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: bbennett, xtian
Version: 4.2.0
Target Release: 4.3.0
Hardware: All
OS: All
Type: Bug
Clones: 1755188
Bug Blocks: 1755188
Last Closed: 2020-01-23 11:06:54 UTC

Description zhaozhanqi 2019-09-24 02:54:40 UTC
Description of problem:
When a NetworkAttachmentDefinition is created manually from YAML, rather than synced from a SriovNetwork, the SR-IOV operator pod goes into CrashLoopBackOff with the following logs:

 oc logs sriov-network-operator-67cbfc644f-trskn
{"level":"info","ts":1569292801.1042292,"logger":"cmd","msg":"Go Version: go1.11.13"}
{"level":"info","ts":1569292801.1042714,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1569292801.1042793,"logger":"cmd","msg":"Version of operator-sdk: v0.7.0+git"}
{"level":"info","ts":1569292801.1046636,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1569292801.3431642,"logger":"leader","msg":"Found existing lock with my name. I was likely restarted."}
{"level":"info","ts":1569292801.3432288,"logger":"leader","msg":"Continuing as the leader."}
{"level":"info","ts":1569292801.6252902,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1569292801.6259587,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"caconfig-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.626255,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetworknodepolicy-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.6264331,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetworknodepolicy-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.6266096,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetworknodepolicy-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.6267514,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetworknodepolicy-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.6267838,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetworknodepolicy-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.6269283,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetworknodepolicy-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.627098,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetworknodepolicy-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.6272318,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetworknodepolicy-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.627429,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetwork-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292801.627605,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"sriovnetwork-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1569292802.0591447,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1569292802.1598394,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"caconfig-controller"}
{"level":"info","ts":1569292802.1598346,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"sriovnetworknodepolicy-controller"}
{"level":"info","ts":1569292802.1601403,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"sriovnetwork-controller"}
{"level":"info","ts":1569292802.260524,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"caconfig-controller","worker count":1}
{"level":"info","ts":1569292802.2606032,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"sriovnetwork-controller","worker count":1}
{"level":"info","ts":1569292802.2605076,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"sriovnetworknodepolicy-controller","worker count":1}
{"level":"info","ts":1569292802.2609465,"logger":"controller_caconfig","msg":"Reconciling CA config map","Request.Namespace":"sriov-network-operator","Request.Name":"sriov-network-operator-lock"}
{"level":"info","ts":1569292802.2609744,"logger":"controller_sriovnetwork","msg":"Reconciling SriovNetwork","Request.Namespace":"sriov-network-operator","Request.Name":"test"}
{"level":"info","ts":1569292802.261055,"logger":"controller_sriovnetwork.renderNetAttDef","msg":"Start to render SRIOV CNI NetworkAttachementDefinition"}
{"level":"info","ts":1569292802.2610908,"logger":"controller_sriovnetworknodepolicy","msg":"Reconciling SriovNetworkNodePolicy","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1569292802.2610555,"logger":"controller_caconfig","msg":"Reconciling CA config map","Request.Namespace":"sriov-network-operator","Request.Name":"device-plugin-config"}
{"level":"info","ts":1569292802.2612221,"logger":"controller_caconfig","msg":"Reconciling CA config map","Request.Namespace":"sriov-network-operator","Request.Name":"openshift-service-ca"}
manifest {"apiVersion":"k8s.cni.cncf.io/v1","kind":"NetworkAttachmentDefinition","metadata":{"annotations":{"k8s.v1.cni.cncf.io/resourceName":"openshift.io/intelnics"},"name":"test","namespace":"z1"},"spec":{"config":"{\"cniVersion\":\"0.3.1\",\"name\":\"sriov-net\",\"type\":\"sriov\",\"vlan\":0,\"ipam\":{\"type\":\"host-local\",\"subnet\":\"10.56.217.0/24\",\"rangeStart\":\"10.56.217.171\",\"rangeEnd\":\"10.56.217.181\",\"routes\":[{\"dst\":\"0.0.0.0/0\"}],\"gateway\":\"10.56.217.1\"}}\n"}}
{"level":"info","ts":1569292802.2635543,"logger":"controller_sriovnetwork","msg":"NetworkAttachmentDefinition already exist, updating","Request.Namespace":"sriov-network-operator","Request.Name":"test"}
E0924 02:40:02.273801       1 runtime.go:69] Observed a panic: "index out of range" (runtime error: index out of range)
/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:522
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:44
/go/src/github.com/openshift/sriov-network-operator/pkg/controller/sriovnetwork/sriovnetwork_controller.go:163
/go/src/github.com/openshift/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215
/go/src/github.com/openshift/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:1333
panic: runtime error: index out of range [recovered]
	panic: runtime error: index out of range

goroutine 344 [running]:
github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x14027e0, 0x24e4560)
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513 +0x1b9
github.com/openshift/sriov-network-operator/pkg/controller/sriovnetwork.(*ReconcileSriovNetwork).Reconcile(0xc0006525a0, 0xc000042070, 0x16, 0xc0008fc060, 0x4, 0x2500d20, 0xc0002a1d98, 0x771348, 0xc0000dbdd0)
	/go/src/github.com/openshift/sriov-network-operator/pkg/controller/sriovnetwork/sriovnetwork_controller.go:163 +0x18a7
github.com/openshift/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000798d20, 0x0)
	/go/src/github.com/openshift/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215 +0x18f
github.com/openshift/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
	/go/src/github.com/openshift/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158 +0x36
github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00126f2e0)
	/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00126f2e0, 0x3b9aca00, 0x0, 0x1, 0xc000ae2d20)
	/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbe
github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc00126f2e0, 0x3b9aca00, 0xc000ae2d20)
	/go/src/github.com/openshift/sriov-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/src/github.com/openshift/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157 +0x32a
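
Note on the trace above: the panic is reported twice, first as "Observed a panic" and then as "[recovered]" followed by a second panic raised from HandleCrash. That is what a deferred handler produces when it logs a recovered panic and then raises it again, so the process still exits and the pod restarts. A minimal standard-library sketch of that pattern follows; `logAndRepanic`, `reconcile`, and `runOnce` are hypothetical names, and this is not the vendored HandleCrash code itself.

```go
package main

import "fmt"

// logAndRepanic is a hypothetical deferred handler: it reports a panic
// and then raises it again so the failure is not silently swallowed.
func logAndRepanic() {
	if r := recover(); r != nil {
		fmt.Printf("Observed a panic: %v\n", r)
		panic(r)
	}
}

// reconcile indexes the first element of a slice; with an empty slice
// this triggers a runtime "index out of range" panic.
func reconcile(items []string) string {
	defer logAndRepanic()
	return items[0]
}

// runOnce calls reconcile and reports whether a panic escaped it.
func runOnce(items []string) (escaped bool) {
	defer func() {
		if recover() != nil {
			escaped = true
		}
	}()
	reconcile(items)
	return false
}

func main() {
	fmt.Println("panic escaped reconcile:", runOnce(nil))
}
```

In the sketch the outer `runOnce` catches the re-raised panic only so that the example terminates cleanly; in the operator nothing catches it, which matches the crash seen here.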

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.2.0-201909221318-ose-sriov-network-operator

How reproducible:
always

Steps to Reproduce:
1. Install a bare-metal cluster and install the SR-IOV operator.
2. Manually create a macvlan bridge NAD with the following YAML:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-bridge
spec:
  config: '{
      "cniVersion": "0.3.0",
      "type": "macvlan",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "10.1.1.0/24",
        "rangeStart": "10.1.1.100",
        "rangeEnd": "10.1.1.200",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ],
        "gateway": "10.1.1.1"
      }
    }'

3. After a while, the SR-IOV operator pod crashes; see the logs in the description.

Actual results:

sriov operator pod crashed.

Expected results:

The SR-IOV operator pod should keep running when manually created NADs exist.

Additional info:
The NetworkAttachmentDefinition created in step 2 cannot be synced by the SR-IOV operator, presumably because no SriovNetwork maps to it.

Comment 1 zhaozhanqi 2019-09-24 09:03:24 UTC
More info to make this easier to reproduce: in step 3, editing the SriovNetwork CR or adding a new SriovNetwork also triggers a sync in the SR-IOV operator controller, after which the operator pod crashes.

Comment 3 zhaozhanqi 2019-09-25 00:59:39 UTC
Hi, the SR-IOV operator crashes whenever other Multus NADs (for example, macvlan-bridge) exist at the same time. I think we should fix this issue in 4.2, so I will clone this bug for 4.2.

Comment 8 errata-xmlrpc 2020-01-23 11:06:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062