Bug 1429823 - [3.5.x] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED) - default router
Summary: [3.5.x] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Inval...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jacob Tanenbaum
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1430541 1434164 1462675
Depends On:
Blocks: OSOPS_V3
Reported: 2017-03-07 08:45 UTC by Vladislav Walek
Modified: 2022-08-04 22:20 UTC (History)
CC List: 22 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Quickly and repeatedly adding and deleting a route with the same name in a namespace.
Consequence: The router pod panics with "Invalid state transition: DELETED -> ADDED".
Fix: The object's UID is now included in the event queue key generation function.
Result: Quickly adding and deleting routes no longer causes a panic.
Clone Of:
Environment:
Last Closed: 2017-04-26 05:36:30 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Origin (Github) 13494 0 None None None 2017-03-24 13:40:03 UTC
Red Hat Product Errata RHBA-2017:1129 0 normal SHIPPED_LIVE OpenShift Container Platform 3.5, 3.4, 3.3, and 3.2 bug fix update 2017-04-26 09:35:35 UTC

Description Vladislav Walek 2017-03-07 08:45:15 UTC
Description of problem:

The two default router pods failed on the nodes with the following error:

E0223 12:48:34.531884       1 runtime.go:64] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED)
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
/usr/lib/golang/src/runtime/asm_amd64.s:479
/usr/lib/golang/src/runtime/panic.go:458

Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1419771.
The replication controller did not reschedule the pods. The pods did not recover automatically, so manual intervention was required: the pods had to be deleted so that they were rescheduled and started running again.

Version-Release number of selected component (if applicable):
router v3.4.1.2
OSCP v3.4.1.2

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Ben Bennett 2017-03-07 14:11:53 UTC
Can you get the router configuration, please?  Was this an F5-backed router or haproxy?

Comment 2 Vladislav Walek 2017-03-07 14:14:22 UTC
Will get the configuration from the customer. This is haproxy.

Comment 3 Ben Bennett 2017-03-14 17:51:15 UTC
Talked to eparis. Since this is not a regression (it has been in all router releases, it is just really hard to tickle), we won't block 3.5.0 for it, but we will get a fix ASAP.

Comment 4 Eric Sauer 2017-03-15 15:20:24 UTC
This is currently biting customers on OpenShift Dedicated as well. They are running v3.3.1.13. This will need to be backported far enough that we can fix this for those customers.

Comment 7 Alexander Koksharov 2017-03-17 08:36:42 UTC
ha-router config is in the attachment.

Comment 11 Will Gordon 2017-03-21 12:34:14 UTC
*** Bug 1434164 has been marked as a duplicate of this bug. ***

Comment 14 Jacob Tanenbaum 2017-03-21 17:22:41 UTC
Can the customer provide any details on the setup (masters/nodes) and the workload that causes this panic? We have not been able to reproduce the exact situation locally.

Comment 17 Alex Meade 2017-03-22 19:52:31 UTC
The customer Eric is referring to on OpenShift Dedicated seems to hit this when following the reproduction steps defined in Bug #1419771.

Comment 18 openshift-github-bot 2017-03-24 18:56:57 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/fd723fde7fc30aeabe8f511b509c1307f6b146fe
change the router eventqueue key function

Change the router eventqueue key function so that there is a higher chance that
each item will have a unique key, so that the router does not panic.

Originally the thought was to add the creation timestamp because it is not user
editable, but the accessor function meta.CreationTimestamp() only gives the timestamp
to the second, and since these actions need to occur quickly, a second is too long. With
only the creation timestamp added, I was still able to observe the panic with the test
script. I decided to use the UID because it is much more likely that the UID is unique.

Bug: 1429823

changelog:

added a note explaining why routerKeyFn was added
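
The reasoning in the commit message can be illustrated with a small, self-contained Go sketch. It is not the code from the commit: the `route` struct, the three key functions, and the toy `queue` are hypothetical names invented for this illustration, and the queue models only the one transition that matters here (a pending DELETED followed by ADDED under the same key). It assumes, as the commit message states, that the creation timestamp is only available at one-second resolution.

```go
package main

import (
	"fmt"
	"time"
)

// route is a hypothetical stand-in for the metadata of a Route object.
type route struct {
	Namespace string
	Name      string
	UID       string
	Created   time.Time
}

// nameKey identifies an object by namespace and name only.
func nameKey(r route) string { return r.Namespace + "/" + r.Name }

// timestampKey adds the creation time at one-second resolution (assumption
// taken from the commit message).
func timestampKey(r route) string {
	return fmt.Sprintf("%s/%s/%d", r.Namespace, r.Name, r.Created.Unix())
}

// uidKey adds the object's UID, which differs for every newly created object.
func uidKey(r route) string { return r.Namespace + "/" + r.Name + "/" + r.UID }

type eventType string

const (
	added   eventType = "ADDED"
	deleted eventType = "DELETED"
)

// queue is a toy event queue that remembers the last pending event per key.
type queue struct {
	keyFn   func(route) string
	pending map[string]eventType
}

func newQueue(keyFn func(route) string) *queue {
	return &queue{keyFn: keyFn, pending: map[string]eventType{}}
}

// push returns an error in the situation where the router panicked:
// an ADDED event arriving while a DELETED event is pending for the same key.
func (q *queue) push(ev eventType, r route) error {
	k := q.keyFn(r)
	if prev, ok := q.pending[k]; ok && prev == deleted && ev == added {
		return fmt.Errorf("invalid state transition: %s -> %s", prev, ev)
	}
	q.pending[k] = ev
	return nil
}

// replay deletes a route and recreates one with the same name 200ms later.
func replay(keyFn func(route) string) error {
	t0 := time.Date(2017, 3, 7, 8, 45, 15, 0, time.UTC)
	old := route{Namespace: "demo", Name: "web", UID: "uid-1", Created: t0}
	recreated := route{Namespace: "demo", Name: "web", UID: "uid-2",
		Created: t0.Add(200 * time.Millisecond)}

	q := newQueue(keyFn)
	if err := q.push(deleted, old); err != nil {
		return err
	}
	return q.push(added, recreated)
}

func main() {
	fmt.Println("name key:     ", replay(nameKey))
	fmt.Println("timestamp key:", replay(timestampKey))
	fmt.Println("uid key:      ", replay(uidKey))
}
```

In this sketch the name-only key and the timestamp key both report the invalid transition, because the recreated route maps to the same key as the deleted one, while the UID key keeps the two objects apart.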

Comment 19 Thomas Wiest 2017-03-27 19:55:33 UTC
Can we get this backported to 3.4?

Comment 20 Eric Rich 2017-03-28 15:43:42 UTC
(In reply to Thomas Wiest from comment #19)
> Can we get this backported to 3.4?

A backport should already be in flight as part of https://bugzilla.redhat.com/show_bug.cgi?id=1419771

Comment 21 Ben Bennett 2017-03-28 17:30:13 UTC
It will be in the next 3.4 release.  The backport made the cut-off for the next fix release.

Comment 22 Ben Bennett 2017-03-28 17:31:31 UTC
The 3.5.X PR will land as soon as 3.5.0 cuts:
  https://github.com/openshift/origin/issues/13494

Comment 23 bmorriso 2017-03-30 18:33:50 UTC
*** Bug 1430541 has been marked as a duplicate of this bug. ***

Comment 32 errata-xmlrpc 2017-04-26 05:36:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129

Comment 33 Ben Bennett 2017-06-21 14:04:55 UTC
*** Bug 1462675 has been marked as a duplicate of this bug. ***

