Description of problem:
Observed a panic, "Invalid state transition: DELETED -> ADDED", in the F5 router logs when deleting and then re-adding a wildcard route.
Version-Release number of selected component (if applicable):
F5 router images: v184.108.40.206 85ad4c56a2a7
Observed only once; tried the same steps again but could not reproduce.
Steps to Reproduce:
1. oadm router --type=f5-router ...
2. Create project u1p1 and a wildcard route:
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/wildcard_route/route_edge.json -n u1p1
3. Delete the wildcard route:
oc delete route wildcard-edge-route -n u1p1
4. Create the wildcard route again.
Running oc logs f5router-8-n7t2k shows:
E0206 07:17:19.754185 1 runtime.go:64] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED)
The main issue here is that the eventqueue.go code doesn't like transitions from a deleted event to a new added event. I did see a similar failure in the router integration tests
from modified -> added, which was changed to go from (added -> deleted) in PR:
I suspect this is occurring because the previous event has not been consumed. It is not an F5-router-specific problem, but it does show up here. Asked Paul Morie and Clayton for guidance on what we can do in the event queue code.
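For illustration, here is a minimal Go sketch of the kind of per-key event compression that produces this panic. This is not the actual openshift/origin eventqueue.go; all names are hypothetical. The point is that if the consumer has not yet drained a DELETED event for a key, the queue has no rule for coalescing it with a subsequent ADDED and gives up.

package main

import "fmt"

type eventType string

const (
	watchAdded    eventType = "ADDED"
	watchModified eventType = "MODIFIED"
	watchDeleted  eventType = "DELETED"
)

// eventQueue coalesces events per object key until a consumer pops them.
type eventQueue struct {
	lastEvent map[string]eventType
}

// handleEvent records a new event for key, panicking on transitions the
// compression logic has no rule for.
func (q *eventQueue) handleEvent(key string, newEvent eventType) {
	old, ok := q.lastEvent[key]
	if ok && old == watchDeleted && newEvent == watchAdded {
		// The unconsumed DELETED cannot be coalesced with an ADDED.
		panic(fmt.Sprintf("Invalid state transition: %s -> %s", old, newEvent))
	}
	q.lastEvent[key] = newEvent
}

func main() {
	q := &eventQueue{lastEvent: map[string]eventType{}}
	q.handleEvent("u1p1/wildcard-edge-route", watchDeleted)
	// The route is re-created before the consumer drains the DELETED event:
	q.handleEvent("u1p1/wildcard-edge-route", watchAdded) // panics
}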
According to a response from Clayton, the event queue needs to be "burned with fire and replaced with an informer and a work queue".
It's a wee bit too late in this release to do that, so the fix in this PR: https://github.com/openshift/origin/pull/12903
is purely defensive - kill the router pod so that it gets restarted and all's ok.
Otherwise we would have a router that never gets updates from etcd.
Hopefully the likelihood of this happening is low, but with the defensive fix the router process is at least left in a consistent state.
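A hedged sketch of that defensive behavior (illustrative names only, not the code from the PR above): rather than recovering in place and running on with a stale cache, treat the panic as fatal so the kubelet restarts the pod into a consistent state.

package main

import (
	"log"
	"os"
)

// processEvents is a stand-in for the router's event loop.
func processEvents(handle func() error) {
	defer func() {
		if r := recover(); r != nil {
			// Recovering in place would leave the router with a stale cache
			// and no further updates from etcd; exiting non-zero lets the
			// kubelet restart the pod.
			log.Printf("fatal router event-queue panic: %v", r)
			os.Exit(1)
		}
	}()
	if err := handle(); err != nil {
		log.Printf("event handling error: %v", err)
	}
}

func main() {
	processEvents(func() error {
		panic("Invalid state transition: DELETED -> ADDED")
	})
}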
@hongli / @zhaozhanqi, if it is useful for your testing, there's a script to simulate this error:
Usage: <script-name> [<route-yaml> <nworkers>]
You may have to run this a few times or bump the number of workers to get it going. It consistently takes me 1-3 runs to reproduce the issue.
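The original script is not reproduced here; below is a hedged Go equivalent of the same idea, matching the usage line above: nworkers workers race oc delete / oc create on the same route so a DELETED event can still be unconsumed when the ADDED one arrives.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"sync"
)

func main() {
	// Defaults are illustrative; override via <route-yaml> <nworkers> args.
	routeYAML, nworkers := "route_edge.json", 4
	if len(os.Args) > 2 {
		routeYAML = os.Args[1]
		nworkers, _ = strconv.Atoi(os.Args[2])
	}

	var wg sync.WaitGroup
	for i := 0; i < nworkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10; j++ {
				// Errors are ignored on purpose: the delete/create race
				// between workers is exactly what we want to provoke.
				exec.Command("oc", "delete", "-f", routeYAML, "-n", "u1p1").Run()
				exec.Command("oc", "create", "-f", routeYAML, "-n", "u1p1").Run()
			}
		}()
	}
	wg.Wait()
	fmt.Println("done; check the F5 router pod logs for the panic")
}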
Note: Ram closed the above PR with the comment "FYI - this is not needed because we recover from that failure. Not sure what's restarting that thread - probably something in k8s cache/reflector code."
So it looks like this doesn't break the routers at the moment. We still need to change the router: "We need to burn the event queue with fire and replace it with an informer and a work queue" (as Clayton eloquently puts it); see the sketch below.
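For reference, a minimal sketch of that informer-plus-work-queue pattern using client-go (watching Services as a stand-in for Routes to keep the sketch self-contained; this is illustrative, not the eventual router code). The informer only enqueues object keys, and the worker re-reads the latest state from the informer's store, so DELETED followed by ADDED is just the same key queued again - there is no transition table to violate.

package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Work queue of object keys; it deduplicates, so delete-then-add for the
	// same key just means the key is queued (at most) twice.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	lw := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "services", metav1.NamespaceAll, fields.Everything())
	informer := cache.NewSharedIndexInformer(lw, &v1.Service{}, 30*time.Second, cache.Indexers{})

	enqueue := func(obj interface{}) {
		// Also tolerates cache.DeletedFinalStateUnknown tombstones on delete.
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	}
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(_, newObj interface{}) { enqueue(newObj) },
		DeleteFunc: enqueue,
	})

	stop := make(chan struct{})
	defer close(stop)
	go informer.Run(stop)
	cache.WaitForCacheSync(stop, informer.HasSynced)

	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		// Re-read the latest state; absence simply means the object is gone.
		_, exists, _ := informer.GetStore().GetByKey(key.(string))
		fmt.Println("sync", key, "exists:", exists)
		queue.Done(key)
	}
}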
Added the card https://trello.com/c/y6SFvOA7 to track the event queue replacement work.
Closing this card since, while it is ugly in the logs, the router recovers.
https://bugzilla.redhat.com/show_bug.cgi?id=1429823 may be a duplicate of this bug.
Since today's env is OCP 220.127.116.11 and the issue can still be reproduced there, this will be verified ASAP once a v18.104.22.168 env is ready.
*** Bug 1430863 has been marked as a duplicate of this bug. ***
Verified in 22.214.171.124 (atomic-openshift-126.96.36.199-1.git.0.57d7e1d.el7.x86_64); the issue has been fixed.
Ran the reproduction script more than 20 times and did not see the panic in the logs. Previously it took only 2-3 runs to reproduce the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 1434707 has been marked as a duplicate of this bug. ***