Created attachment 1097232 [details]
The full trace

Description of problem:
Reported by Diego Spinola Castro <spinolacastro>:

I have an all-in-one origin-1.0.8-0 install running about 30 pods on a node. After a machine reboot the origin-node service didn't start (the trace is attached). The only way to bring it back was changing the networkPluginName to redhat/openshift-ovs-subnet in the node and master configurations.

Version-Release number of selected component (if applicable):
origin v1.0.8-1-g8f1868d, kubernetes v1.1.0-origin-1107-g4c8e6f4

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Relevant bit from the logs:

Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: panic: runtime error: index out of range
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: goroutine 82 [running]:
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: runtime.gopanic(0x2622a20, 0xc20802a000)
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: /usr/lib/golang/src/runtime/panic.go:425 +0x2a3 fp=0xc20901f300 sp=0xc20901f298
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: runtime.panicindex()
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: /usr/lib/golang/src/runtime/panic.go:12 +0x4e fp=0xc20901f328 sp=0xc20901f300
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: github.com/openshift/openshift-sdn/plugins/osdn.newSDNPod(0xc20901f538, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: /builddir/build/BUILD/origin-git-7227.8f1868d/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/plugins/osdn/osdn.go:122 +0x151 fp=0xc20901f3c0 sp=0xc20901f328
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: github.com/openshift/openshift-sdn/plugins/osdn.(*OsdnRegistryInterface).GetPods(0xc20866e370, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: /builddir/build/BUILD/origin-git-7227.8f1868d/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/plugins/osdn/osdn.go:142 +0x428 fp=0xc20901f920 sp=0xc20901f3c0
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: github.com/openshift/openshift-sdn/pkg/ovssubnet.(*OvsController).watchAndGetResource(0xc20870d050, 0x2a5d850, 0x3, 0x0, 0x0, 0x0, 0x0)
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: /builddir/build/BUILD/origin-git-7227.8f1868d/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/pkg/ovssubnet/common.go:798 +0x98c fp=0xc20901fab0 sp=0xc20901f920
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: github.com/openshift/openshift-sdn/pkg/ovssubnet.(*OvsController).StartNode(0xc20870d050, 0x22f7, 0x0, 0x0)
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: /builddir/build/BUILD/origin-git-7227.8f1868d/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/pkg/ovssubnet/common.go:487 +0xf69 fp=0xc20901fed0 sp=0xc20901fab0
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: github.com/openshift/openshift-sdn/plugins/osdn/multitenant.Node(0xc20866e370, 0xc2084cf640, 0x1c, 0xc2084c34a0, 0xa, 0xc20865e180, 0x7f27b9657890, 0xc208660560, 0x22f7)
Nov 20 12:38:23 origin.v3.ops.getupcloud.com origin-node[9040]: /builddir/build/BUILD/origin-git-7227.8f1868d/_thirdpartyhacks/src/github.com/openshift/openshift-sdn/plugins/osdn/multitenant/multitenant.go:47 +0x254 fp=0xc20901ff98 sp=0xc20901fed0
func newSDNPod(kPod *kapi.Pod) osdnapi.Pod {
    containerID := ""
    if len(kPod.Status.ContainerStatuses) > 0 {
        // Extract only container ID, pod.Status.ContainerStatuses[0].ContainerID is of the format: docker://<containerID>
        containerID = strings.Split(kPod.Status.ContainerStatuses[0].ContainerID, "://")[1]
    }

The error is on the strings.Split line. My hunch is that because there are so many pods on the node, it hadn't managed to start them all, so a ContainerStatuses entry exists, but the container hasn't started completely and its ContainerID is still empty.
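For illustration, here is a minimal sketch of the kind of guard that avoids this panic (extractContainerID is a hypothetical helper for demonstration, not necessarily what PR 214 does). The key point is that strings.Split("", "://") returns a one-element slice, so blindly indexing [1] panics whenever ContainerID is still empty:

package main

import (
    "fmt"
    "strings"
)

// extractContainerID mirrors the parsing in newSDNPod but checks the
// split result before indexing. The field normally looks like
// "docker://<containerID>", but while a pod is still starting, a
// ContainerStatuses entry can exist whose ContainerID is empty.
func extractContainerID(raw string) string {
    parts := strings.SplitN(raw, "://", 2)
    if len(parts) != 2 {
        return "" // empty or malformed ID; the unguarded code panics here
    }
    return parts[1]
}

func main() {
    fmt.Printf("%q\n", extractContainerID("docker://abc123")) // "abc123"
    fmt.Printf("%q\n", extractContainerID(""))                // "" instead of a panic
}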
Fixed in https://github.com/openshift/openshift-sdn/pull/214
@Ben I had hit the panic error several times before, similar to the error in the attachment. But all the panic errors I saw, including the one in this bug, happened after the OpenFlow rules were added by multitenant.go, like:

Oct 26 13:18:28 node2 openshift-node: I1026 13:18:28.117049 16948 multitenant.go:82] Output of adding table=4,priority=200,tcp,nw_dst=172.30.0.1,tp_dst=443,actions=output:2: (<nil>)
Oct 26 13:18:28 node2 openshift-node: panic: runtime error: index out of range
Oct 26 13:18:28 node2 openshift-node: goroutine 44 [running]:

But in the current build (v1.1-224-gb994599), such rules are added by controller.go, like:

Nov 23 14:46:32 node1 openshift-node: I1123 14:46:32.414260 6787 controller.go:82] Output of adding table=4,tcp,nw_dst=172.30.0.1,tp_dst=443,priority=200,actions=output:2: (<nil>)
Nov 23 14:46:32 node1 openshift-node: I1123 14:46:32.416477 6787 controller.go:82] Output of adding table=4,udp,nw_dst=172.30.0.1,tp_dst=53,priority=200,actions=output:2: (<nil>)
Nov 23 14:46:32 node1 openshift-node: I1123 14:46:32.418416 6787 controller.go:82] Output of adding table=4,tcp,nw_dst=172.30.0.1,tp_dst=53,priority=200,actions=output:2: (<nil>)

I cannot reproduce the panic error now even without the fix in comment#2, so I suspect the issue may have been fixed by some other refactoring, such as this commit: https://github.com/openshift/openshift-sdn/commit/c298d4776b6fc29e522e90d1bb01bc57d2307f14

Do you think it is OK to mark this bug as VERIFIED since the issue cannot be reproduced anymore?
I think it is fine to mark it verified.
Moving the bug to VERIFIED since the 'index out of range' issue cannot be reproduced.

Build number: v1.1-266-gba7f510-dirty
A fix for this was released downstream some time ago (bug 1288014), and the last few comments here suggest it was verified upstream too, so I believe this can be closed. Please reopen if needed.