Bug 1419771
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | [3.4] Observed a panic: "Invalid state transition: DELETED -> ADDED" in router logs | | |
| Product | OpenShift Container Platform | Reporter | Hongan Li <hongli> |
| Component | Networking | Assignee | Jacob Tanenbaum <jtanenba> |
| Networking sub component | router | QA Contact | zhaozhanqi <zzhao> |
| Status | CLOSED ERRATA | Docs Contact | |
| Severity | medium | Priority | medium |
| CC | aos-bugs, bbennett, bvincell, clichybi, dlbewley, erich, jkaur, jtanenba, knakayam, misalunk, mrobson, pdwyer, spurtell, sten, tatanaka, tkimura | Version | 3.4.0 |
| Keywords | Reopened | Target Milestone | --- |
| Target Release | 3.4.z | Hardware | Unspecified |
| OS | Unspecified | Whiteboard | |
| Fixed In Version | | Doc Type | Bug Fix |
Doc Text: |
Cause: Quickly and repeatedly adding and deleting a route with the same name in a namespace.
Consequence: The router pod panics with "Invalid state transition: DELETED -> ADDED".
Fix: The object's UID was added to the event-queue key generation function.
Result: No panic when quickly adding and deleting routes.
|
| Field | Value | Field | Value |
|---|---|---|---|
| Story Points | --- | Clone Of | |
| | 1435721 (view as bug list) | Environment | |
| Last Closed | 2017-04-04 14:28:21 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | Bug Depends On | |
| Bug Blocks | 1267746, 1435721 | | |
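The Fix line in the Doc Text (adding the object's UID to the event-queue key generation function) can be sketched as below. This is a minimal illustration of the idea, assuming hypothetical names (`routeMeta`, `nameKey`, `uidKey`); it is not the actual openshift/origin code.

```go
package main

import "fmt"

// routeMeta stands in for the object metadata the router watches. The
// UID is unique per object instance and changes when a route is deleted
// and re-created under the same name.
type routeMeta struct {
	Namespace string
	Name      string
	UID       string
}

// nameKey models the pre-fix behavior: a key built only from
// namespace/name, so a deleted object and its re-created replacement
// collide on the same queue entry, producing the DELETED -> ADDED
// transition.
func nameKey(m routeMeta) string {
	return fmt.Sprintf("%s/%s", m.Namespace, m.Name)
}

// uidKey includes the UID, so the two distinct objects get distinct
// queue entries and no invalid transition can occur.
func uidKey(m routeMeta) string {
	return fmt.Sprintf("%s/%s/%s", m.Namespace, m.Name, m.UID)
}

func main() {
	deleted := routeMeta{"ns1", "route1", "aaaa-1111"}
	recreated := routeMeta{"ns1", "route1", "bbbb-2222"}
	fmt.Println(nameKey(deleted) == nameKey(recreated)) // true: collision
	fmt.Println(uidKey(deleted) == uidKey(recreated))   // false: distinct
}
```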
Description (Hongan Li, 2017-02-07 01:49:52 UTC)
The main issue here is that the eventqueue.go code does not allow transitions from a deleted event to a new added event. We did see a similar failure in the router integration tests, from MODIFIED -> ADDED, which was changed to go from ADDED -> DELETED in PR: https://github.com/openshift/origin/pull/12783. I suspect this is occurring because the previous event has not been consumed. It is not an F5-router-specific problem, but it does show up there. I asked Paul Morie and Clayton for guidance on what we can do in the event queue code.

According to a response from Clayton, the event queue needs to be "burned with fire and replaced with an informer and a work queue". It is a wee bit too late in this release to do that, so the fix in this PR: https://github.com/openshift/origin/pull/12903 is purely defensive: kill the router pod so that it gets restarted and all's ok. Otherwise we have a router that will never get updates from etcd. Hopefully the likelihood of this happening is low, but with the defensive fix the router process is at least left in a consistent state.

@hongli / @zhaozhanqi, if it is useful for your testing, there is a script to simulate this error: https://gist.githubusercontent.com/ramr/58dbdc3c5982db7b3c3154eb4bca60c8/raw/94c2312b34ae1ccbde19155bb3f0506c1cd54dd2/reproduce-eq-panic.sh%2520

Usage: <script-name> [<route-yaml> <nworkers>]

You may have to run it a few times or bump the number of workers to trigger the panic; it takes me 1-3 runs to reproduce the issue consistently. Thx.

Note: Ram closed the above PR with the comment "FYI - this is not needed because we recover from that failure. Not sure what's restarting that thread - probably something in k8s cache/reflector code." So it looks like this does not break the routers at the moment. We still need to change the router: "We need to burn event queue with fire and replace with an informer and a work queue" (as Clayton eloquently puts it). Added the card https://trello.com/c/y6SFvOA7 to track the event queue replacement work.
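The transition check that triggers the panic can be sketched roughly as follows. This is a hedged illustration of how an event queue that compresses events per key might reject a DELETED -> ADDED sequence; the names (`eventType`, `compress`) are assumptions, not the actual eventqueue.go API.

```go
package main

import "fmt"

// eventType models the watch event kinds the queue compresses.
type eventType string

const (
	watchAdded    eventType = "ADDED"
	watchModified eventType = "MODIFIED"
	watchDeleted  eventType = "DELETED"
)

// compress merges an incoming event for a key with the event already
// queued under that key. A single object can never be ADDED after it was
// DELETED; only a new object (with a new UID) can reuse the name. Under
// name-only keys the two objects collide on one entry, so this branch
// fires and the router logs the panic quoted in the bug summary.
func compress(queued, incoming eventType) eventType {
	switch {
	case queued == watchDeleted && incoming == watchAdded:
		panic(fmt.Sprintf("Invalid state transition: %s -> %s", queued, incoming))
	case queued == watchAdded && incoming == watchModified:
		// An add followed by a modify compresses to a single add
		// carrying the latest state.
		return watchAdded
	default:
		return incoming
	}
}

func main() {
	fmt.Println(compress(watchAdded, watchModified)) // ADDED
	// compress(watchDeleted, watchAdded) would panic with
	// "Invalid state transition: DELETED -> ADDED"
}
```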
Closing this card since, while it is ugly in the logs, the router recovers.

https://bugzilla.redhat.com/show_bug.cgi?id=1429823 may be a duplicate of this bug.

Since today's env is OCP 3.4.1.11 and the issue can still be reproduced, will verify ASAP when a v3.4.1.12 env is ready.

*** Bug 1430863 has been marked as a duplicate of this bug. ***

Verified in 3.4.1.12 (atomic-openshift-3.4.1.12-1.git.0.57d7e1d.el7.x86_64); the issue has been fixed. Ran the script below more than 20 times and did not see a panic in the logs. It took only 2-3 runs to reproduce the issue before. https://gist.githubusercontent.com/ramr/58dbdc3c5982db7b3c3154eb4bca60c8/raw/94c2312b34ae1ccbde19155bb3f0506c1cd54dd2/reproduce-eq-panic.sh%2520

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0865

*** Bug 1434707 has been marked as a duplicate of this bug. ***