Bug 1419771
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | [3.4] Observed a panic: "Invalid state transition: DELETED -> ADDED" in router logs | | |
| Product | OpenShift Container Platform | Reporter | Hongan Li <hongli> |
| Component | Networking | Assignee | Jacob Tanenbaum <jtanenba> |
| Networking sub component | router | QA Contact | zhaozhanqi <zzhao> |
| Status | CLOSED ERRATA | Docs Contact | |
| Severity | medium | Priority | medium |
| CC | aos-bugs, bbennett, bvincell, clichybi, dlbewley, erich, jkaur, jtanenba, knakayam, misalunk, mrobson, pdwyer, spurtell, sten, tatanaka, tkimura | Version | 3.4.0 |
| Keywords | Reopened | Target Milestone | --- |
| Target Release | 3.4.z | Hardware | Unspecified |
| OS | Unspecified | Whiteboard | |
| Fixed In Version | | Doc Type | Bug Fix |
Doc Text: |
Cause: Quickly and repeatedly adding and deleting a route with the same name in a namespace.
Consequence: The router pod panics with "Invalid state transition: DELETED -> ADDED".
Fix: The object's UID was added to the event-queue key generation function.
Result: No panic when quickly adding and deleting routes.
|
| Field | Value | Field | Value |
|---|---|---|---|
| Story Points | --- | Clone Of | |
| | 1435721 (view as bug list) | Environment | |
| Last Closed | 2017-04-04 14:28:21 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | Bug Depends On | |
| Bug Blocks | 1267746, 1435721 | | |
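The Fix line in the Doc Text (adding the object's UID to the event-queue key generation function) can be sketched as below. This is a minimal illustration of the idea, assuming hypothetical names (`routeMeta`, `nameKey`, `uidKey`); it is not the actual openshift/origin code.

```go
package main

import "fmt"

// routeMeta stands in for the object metadata the router watches. The
// UID is unique per object instance and changes when a route is deleted
// and re-created under the same name.
type routeMeta struct {
	Namespace string
	Name      string
	UID       string
}

// nameKey models the pre-fix behavior: a key built only from
// namespace/name, so a deleted object and its re-created replacement
// collide on the same queue entry, producing the DELETED -> ADDED
// transition.
func nameKey(m routeMeta) string {
	return fmt.Sprintf("%s/%s", m.Namespace, m.Name)
}

// uidKey includes the UID, so the two distinct objects get distinct
// queue entries and no invalid transition can occur.
func uidKey(m routeMeta) string {
	return fmt.Sprintf("%s/%s/%s", m.Namespace, m.Name, m.UID)
}

func main() {
	deleted := routeMeta{"ns1", "route1", "aaaa-1111"}
	recreated := routeMeta{"ns1", "route1", "bbbb-2222"}
	fmt.Println(nameKey(deleted) == nameKey(recreated)) // true: collision
	fmt.Println(uidKey(deleted) == uidKey(recreated))   // false: distinct
}
```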
Description (Hongan Li, 2017-02-07 01:49:52 UTC)
The main issue here is that the eventqueue.go code does not allow transitions from a deleted event to a new added event. We did see a similar failure in the router integration tests, from MODIFIED -> ADDED, which was changed to go from ADDED -> DELETED in PR: https://github.com/openshift/origin/pull/12783. I suspect this is occurring because the previous event has not been consumed. It is not an F5-router-specific problem, but it does show up there. I asked Paul Morie and Clayton for guidance on what we can do in the event queue code.

According to a response from Clayton, the event queue needs to be "burned with fire and replaced with an informer and a work queue". It is a wee bit too late in this release to do that, so the fix in this PR: https://github.com/openshift/origin/pull/12903 is purely defensive: kill the router pod so that it gets restarted and all's ok. Otherwise we have a router that will never get updates from etcd. Hopefully the likelihood of this happening is low, but with the defensive fix the router process is at least left in a consistent state.

@hongli / @zhaozhanqi, if it is useful for your testing, there is a script to simulate this error: https://gist.githubusercontent.com/ramr/58dbdc3c5982db7b3c3154eb4bca60c8/raw/94c2312b34ae1ccbde19155bb3f0506c1cd54dd2/reproduce-eq-panic.sh%2520

Usage: <script-name> [<route-yaml> <nworkers>]

You may have to run it a few times or bump the number of workers to trigger the panic; it takes me 1-3 runs to reproduce the issue consistently. Thx.

Note: Ram closed the above PR with the comment "FYI - this is not needed because we recover from that failure. Not sure what's restarting that thread - probably something in k8s cache/reflector code." So it looks like this does not break the routers at the moment. We still need to change the router: "We need to burn event queue with fire and replace with an informer and a work queue" (as Clayton eloquently puts it). Added the card https://trello.com/c/y6SFvOA7 to track the event queue replacement work.
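The transition check that triggers the panic can be sketched roughly as follows. This is a hedged illustration of how an event queue that compresses events per key might reject a DELETED -> ADDED sequence; the names (`eventType`, `compress`) are assumptions, not the actual eventqueue.go API.

```go
package main

import "fmt"

// eventType models the watch event kinds the queue compresses.
type eventType string

const (
	watchAdded    eventType = "ADDED"
	watchModified eventType = "MODIFIED"
	watchDeleted  eventType = "DELETED"
)

// compress merges an incoming event for a key with the event already
// queued under that key. A single object can never be ADDED after it was
// DELETED; only a new object (with a new UID) can reuse the name. Under
// name-only keys the two objects collide on one entry, so this branch
// fires and the router logs the panic quoted in the bug summary.
func compress(queued, incoming eventType) eventType {
	switch {
	case queued == watchDeleted && incoming == watchAdded:
		panic(fmt.Sprintf("Invalid state transition: %s -> %s", queued, incoming))
	case queued == watchAdded && incoming == watchModified:
		// An add followed by a modify compresses to a single add
		// carrying the latest state.
		return watchAdded
	default:
		return incoming
	}
}

func main() {
	fmt.Println(compress(watchAdded, watchModified)) // ADDED
	// compress(watchDeleted, watchAdded) would panic with
	// "Invalid state transition: DELETED -> ADDED"
}
```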
Closing this card since, while it is ugly in the logs, the router recovers.

https://bugzilla.redhat.com/show_bug.cgi?id=1429823 may be a duplicate of this bug.

Since today's env is OCP 3.4.1.11 and the issue can still be reproduced, will verify ASAP when a v3.4.1.12 env is ready.

*** Bug 1430863 has been marked as a duplicate of this bug. ***

Verified in 3.4.1.12 (atomic-openshift-3.4.1.12-1.git.0.57d7e1d.el7.x86_64); the issue has been fixed. Ran the script below more than 20 times and did not see a panic in the logs. It took only 2-3 runs to reproduce the issue before. https://gist.githubusercontent.com/ramr/58dbdc3c5982db7b3c3154eb4bca60c8/raw/94c2312b34ae1ccbde19155bb3f0506c1cd54dd2/reproduce-eq-panic.sh%2520

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0865

*** Bug 1434707 has been marked as a duplicate of this bug. ***