Bug 1532060

Summary: Router Panic: panic: runtime error: index out of range in cockroachdb
Product: OpenShift Container Platform Reporter: Eric Paris <eparis>
Component: Networking Assignee: Rajat Chopra <rchopra>
Networking sub component: router QA Contact: zhaozhanqi <zzhao>
Status: CLOSED UPSTREAM Docs Contact:
Severity: urgent    
Priority: urgent CC: aos-bugs, bbennett, hgomes, jkaur, jrosenta, knakayam, pkanthal, rchopra, stwalter, tkimura
Version: 3.7.0   
Target Milestone: ---   
Target Release: 3.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1590826 Environment:
Last Closed: 2018-02-22 23:25:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1590826    
Attachments:
Description Flags
oc logs -p router-1142-d5s4f none

Description Eric Paris 2018-01-07 20:11:39 UTC
Created attachment 1378189
oc logs -p router-1142-d5s4f

registry.reg-aws.openshift.com:443/openshift3/ose-haproxy-router:v3.7.9-1

Found this in us-east-1:

$ oc get pod -n default -l router=router
NAME                READY     STATUS    RESTARTS   AGE
router-1142-5zmgr   1/1       Running   34         23d
router-1142-8zkgv   1/1       Running   20         23d
router-1142-d5s4f   1/1       Running   9          23d
router-1142-vkqq7   1/1       Running   8          23d
router-1142-xpjhw   1/1       Running   37         23d

Looking at the `oc logs -p` output for all 5 routers, I see this at the end of each one:

panic: runtime error: index out of range

goroutine 174195 [running]:
github.com/openshift/origin/vendor/github.com/cockroachdb/cmux.(*ptNode).match(0xc420c75dd0, 0xc420c84658, 0x0, 0x8, 0x1, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.7c71a2d/_output/local/go/src/github.com/openshift/origin/vendor/github.com/cockroachdb/cmux/patricia.go:148 +0x197
github.com/openshift/origin/vendor/github.com/cockroachdb/cmux.(*patriciaTree).matchPrefix(0xc420c7fa00, 0xf287600, 0xc44ea815b8, 0xf2a0700)
	/builddir/build/BUILD/atomic-openshift-git-0.7c71a2d/_output/local/go/src/github.com/openshift/origin/vendor/github.com/cockroachdb/cmux/patricia.go:38 +0x90
github.com/openshift/origin/vendor/github.com/cockroachdb/cmux.(*patriciaTree).(github.com/openshift/origin/vendor/github.com/cockroachdb/cmux.matchPrefix)-fm(0xf287600, 0xc44ea815b8, 0xc431896ac8)
	/builddir/build/BUILD/atomic-openshift-git-0.7c71a2d/_output/local/go/src/github.com/openshift/origin/vendor/github.com/cockroachdb/cmux/matchers.go:23 +0x3e
github.com/openshift/origin/vendor/github.com/cockroachdb/cmux.(*cMux).serve(0xc420c654c0, 0xf2fa960, 0xc431896ac8, 0xc420c82360, 0xc420ff9190)
	/builddir/build/BUILD/atomic-openshift-git-0.7c71a2d/_output/local/go/src/github.com/openshift/origin/vendor/github.com/cockroachdb/cmux/cmux.go:129 +0x265
created by github.com/openshift/origin/vendor/github.com/cockroachdb/cmux.(*cMux).Serve
	/builddir/build/BUILD/atomic-openshift-git-0.7c71a2d/_output/local/go/src/github.com/openshift/origin/vendor/github.com/cockroachdb/cmux/cmux.go:119 +0x16c

Comment 1 Ben Bennett 2018-01-17 21:53:28 UTC
This was introduced by: https://github.com/openshift/origin/pull/16975

Comment 2 Rajat Chopra 2018-02-01 02:07:13 UTC
The bug fix is in the vendor tree. We need to update cockroachdb/cmux to at least commit b64f5908f4945f4b11ed4a0a9d3cc1e23350866d.

The fix addresses the patricia tree indexing out of range on a boundary condition in the HTTP/1.1 fast request match. Either we switch to a slower but more accurate match, i.e. stop using FastMatch, or we update the vendored repo to include this fix.

Glide update PR coming up.
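
For illustration only, here is a minimal Go sketch of the class of boundary condition described above. It is not the cmux code and not the actual upstream fix. The function names (hasPrefixUnsafe, hasPrefixSafe, didPanic) are hypothetical. The idea that the trigger is a buffer shorter than the prefix being matched is an assumption made for the example.

```go
package main

import "fmt"

// hasPrefixUnsafe is a hypothetical matcher that indexes into buf for every
// byte of prefix without first checking that buf is long enough.
func hasPrefixUnsafe(buf, prefix []byte) bool {
	for i := range prefix {
		// Panics with "index out of range" when buf is shorter than prefix
		// and every byte compared so far has matched.
		if buf[i] != prefix[i] {
			return false
		}
	}
	return true
}

// hasPrefixSafe adds the missing bounds check before indexing.
func hasPrefixSafe(buf, prefix []byte) bool {
	if len(buf) < len(prefix) {
		return false
	}
	for i := range prefix {
		if buf[i] != prefix[i] {
			return false
		}
	}
	return true
}

// didPanic runs f and reports whether it panicked.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	prefix := []byte("GET ")
	short := []byte("GE") // assumed case: fewer bytes available than the prefix needs

	fmt.Println(hasPrefixSafe(short, prefix))
	fmt.Println(hasPrefixSafe([]byte("GET / HTTP/1.1"), prefix))
	fmt.Println(didPanic(func() { hasPrefixUnsafe(short, prefix) }))
}
```

Note that the unsafe variant only panics when the short input agrees with the prefix up to its last byte. An input that mismatches earlier returns false before the bad index is reached, which may be one reason a bug of this kind is hard to hit in testing.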

Comment 4 zhaozhanqi 2018-02-22 07:16:02 UTC
Hi @Eric @Ben @Rajat,

I still don't understand how to reproduce this issue. We did several rounds of testing in 3.9 but did not hit it. Could you give some clues or steps to reproduce it, so that we can avoid the same issue in the future? Thanks.

Comment 5 Eric Paris 2018-02-22 15:31:53 UTC
I honestly have no idea how to reproduce it other than running it in Online. I'm OK with QA just verifying that the code has changed, and we'll see if the panics continue in Online. Rajat, what version has the fix?

Comment 6 Rajat Chopra 2018-02-22 23:25:32 UTC
The master branch has the fix. In any case, the fix was in an upstream package. It's a timing-sensitive bug that is difficult to reproduce, so I support comment 5.
Closing this bug as 'fixed' upstream.

Comment 7 Ben Bennett 2018-03-07 20:07:48 UTC
*** Bug 1552742 has been marked as a duplicate of this bug. ***

Comment 14 Ben Bennett 2018-05-29 14:14:45 UTC
*** Bug 1582818 has been marked as a duplicate of this bug. ***