Description of problem:
Observed a panic: "Invalid state transition: DELETED -> ADDED" in the F5 router logs when deleting and re-adding a wildcard route.

Version-Release number of selected component (if applicable):
openshift v3.5.0.16+a26133a
kubernetes v1.5.2+43a9be4
etcd 3.1.0
F5 router images: v3.5.0.16 85ad4c56a2a7

How reproducible:
Observed only once; tried the same steps again but could not reproduce.

Steps to Reproduce:
1. oadm router --type=f5-router ...
2. Create project u1p1 and the wildcard route:
   oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/wildcard_route/route_edge.json -n u1p1
3. Delete the wildcard route:
   oc delete route wildcard-edge-route -n u1p1
4. Create the wildcard route again.

Actual results:
oc logs f5router-8-n7t2k shows:

E0206 07:17:19.754185 1 runtime.go:64] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED)
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
/usr/lib/golang/src/runtime/asm_amd64.s:479
/usr/lib/golang/src/runtime/panic.go:458
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/pkg/client/cache/eventqueue.go:132
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/pkg/client/cache/eventqueue.go:196
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:370
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:317
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:187
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:96
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:97
/builddir/build/BUILD/atomic-openshift-git-0.a26133a/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:52
/usr/lib/golang/src/runtime/asm_amd64.s:2086

Expected results:
No panic.

Additional info:
The main issue here is that the eventqueue.go code does not allow a transition from a DELETED event to a new ADDED event. We saw a similar failure in the router integration tests for a MODIFIED -> ADDED transition, which was changed to go ADDED -> DELETED instead in PR https://github.com/openshift/origin/pull/12783. I suspect this occurs because the previous event has not yet been consumed from the queue. It is not an F5-router-specific problem, but it does show up here. I've asked Paul Morie and Clayton for guidance on what we can do in the event queue code.
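To make the failure mode concrete, here is a minimal, self-contained Go sketch of an event-compressing queue. It is not the actual origin eventqueue.go, and the specific compression rules and names are assumptions for illustration only; the point is that the queue remembers the last unconsumed event per key, and an ADDED arriving while a DELETED for the same key is still unconsumed hits a transition the table has no rule for.

// Minimal sketch, NOT the origin eventqueue.go; compression rules are assumed.
package main

import "fmt"

type eventType string

const (
	watchAdded    eventType = "ADDED"
	watchModified eventType = "MODIFIED"
	watchDeleted  eventType = "DELETED"
)

type eventQueue struct {
	// last unconsumed event per object key
	pending map[string]eventType
}

// compress merges a new event with the unconsumed one already queued for the
// same key, and panics on combinations the table does not cover.
func (q *eventQueue) compress(key string, newEvent eventType) {
	old, ok := q.pending[key]
	if !ok {
		q.pending[key] = newEvent
		return
	}
	switch {
	case old == watchAdded && newEvent == watchDeleted:
		delete(q.pending, key) // add then delete before anyone saw it: drop both
	case old == watchAdded && newEvent == watchModified:
		q.pending[key] = watchAdded // still looks like a fresh add to consumers
	case old == watchModified && newEvent == watchModified,
		old == watchModified && newEvent == watchDeleted:
		q.pending[key] = newEvent
	default:
		panic(fmt.Sprintf("Invalid state transition: %v -> %v", old, newEvent))
	}
}

func main() {
	q := &eventQueue{pending: map[string]eventType{}}
	q.compress("u1p1/wildcard-edge-route", watchDeleted)
	// Nothing has consumed the DELETED yet; re-creating the route now panics,
	// matching the message seen in the router logs.
	q.compress("u1p1/wildcard-edge-route", watchAdded)
}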
According to a response from Clayton, the event queue needs to be "burned with fire and replaced with an informer and a work queue". It's a wee bit too late in this release to do that, so the fix in PR https://github.com/openshift/origin/pull/12903 is purely defensive: kill the router pod so that it gets restarted and all's ok. Otherwise we are left with a router that will never get updates from etcd. Hopefully the likelihood of this happening is low, but with the defensive fix it at least leaves the router process in a consistent state. @hongli / @zhaozhanqi if it is useful for your testing, there's a script to simulate this error: https://gist.githubusercontent.com/ramr/58dbdc3c5982db7b3c3154eb4bca60c8/raw/94c2312b34ae1ccbde19155bb3f0506c1cd54dd2/reproduce-eq-panic.sh%2520 Usage: <script-name> [<route-yaml> <nworkers>] You may have to run it a few times or bump the number of workers to trigger it; it consistently takes me 1-3 runs to reproduce the issue. Thx
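For reference, a minimal Go sketch of the defensive idea described above: if the event-queue goroutine panics, exit the process so the platform restarts the pod instead of leaving a router that is running but blind to further etcd updates. This is only an illustration, not the code from PR 12903; runEventLoop and the process callback are hypothetical names.

// Sketch of the defensive restart approach; names are hypothetical.
package main

import (
	"log"
	"os"
)

// runEventLoop runs the event-processing function and converts any panic into
// a process exit so the pod gets restarted in a clean state.
func runEventLoop(process func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("event queue panicked: %v; exiting so the pod is restarted", r)
			os.Exit(1)
		}
	}()
	process()
}

func main() {
	runEventLoop(func() {
		panic("Invalid state transition: DELETED -> ADDED")
	})
}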
Note: Ram closed the above PR with the comment "FYI - this is not needed because we recover from that failure. Not sure what's restarting that thread - probably something in k8s cache/reflector code." So it looks like this does not break the routers at the moment. We still need to change the router: "we need to burn the event queue with fire and replace it with an informer and a work queue" (as Clayton eloquently puts it).
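For context, the "informer and a work queue" pattern Clayton refers to looks roughly like the sketch below. This is an assumption-laden illustration, not the eventual router implementation: the import paths are from present-day client-go rather than the vendored tree this bug was filed against, routeInformer is assumed to be constructed elsewhere with whatever client the router uses, and the reconcile step is a placeholder. The key property is that the work queue holds object keys rather than event-state transitions, so deleting and re-creating a route never produces a DELETED -> ADDED transition to reject.

// Sketch of the informer + work queue pattern; not the router's actual code.
package router

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func runRouteController(routeInformer cache.SharedIndexInformer, stopCh <-chan struct{}) {
	// The queue stores only object keys; repeated adds of the same key are
	// collapsed, and there is no per-key state machine to get wedged.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer queue.ShutDown()

	routeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// DeletionHandlingMetaNamespaceKeyFunc also copes with tombstones.
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})

	go routeInformer.Run(stopCh)
	if !cache.WaitForCacheSync(stopCh, routeInformer.HasSynced) {
		return
	}

	// Single worker: pop keys, look the object up in the informer's store,
	// and reconcile. A missing object means the route was deleted.
	for {
		key, quit := queue.Get()
		if quit {
			return
		}
		obj, exists, err := routeInformer.GetStore().GetByKey(key.(string))
		if err != nil {
			queue.AddRateLimited(key) // retry later with backoff
		} else {
			fmt.Printf("reconcile %v (exists=%v, obj=%v)\n", key, exists, obj != nil)
			queue.Forget(key)
		}
		queue.Done(key)
	}
}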
Added the card https://trello.com/c/y6SFvOA7 to track the event queue replacement work. Closing this bug since, while the panic is ugly in the logs, the router recovers.
https://bugzilla.redhat.com/show_bug.cgi?id=1429823 may be a duplicate of this bug.
Today's env is OCP 3.4.1.11 and the issue can still be reproduced there; will verify ASAP once a v3.4.1.12 env is ready.
*** Bug 1430863 has been marked as a duplicate of this bug. ***
Verified in 3.4.1.12 (atomic-openshift-3.4.1.12-1.git.0.57d7e1d.el7.x86_64); the issue has been fixed. Ran the script below more than 20 times and did not see the panic in the logs, whereas previously it took only 2-3 runs to reproduce. https://gist.githubusercontent.com/ramr/58dbdc3c5982db7b3c3154eb4bca60c8/raw/94c2312b34ae1ccbde19155bb3f0506c1cd54dd2/reproduce-eq-panic.sh%2520
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0865
*** Bug 1434707 has been marked as a duplicate of this bug. ***