1429823 – [3.5.x] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED) - default router

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Jacob Tanenbaum
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	1430541 1434164 1462675
Depends On:
Blocks:	OSOPS_V3

Reported:	2017-03-07 08:45 UTC by Vladislav Walek
Modified:	2022-08-04 22:20 UTC
CC List:	22 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Quickly and repeatedly adding and deleting a route with the same name in a namespace. Consequence: The router pod panics with "Invalid state transition: DELETED -> ADDED". Fix: The object's UID was added to the event queue key generation function. Result: Quickly adding and deleting routes no longer causes a panic.
Clone Of:
Environment:
Last Closed:	2017-04-26 05:36:30 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Origin (Github)	13494	0	None	None	None	2017-03-24 13:40:03 UTC
Red Hat Product Errata	RHBA-2017:1129	0	normal	SHIPPED_LIVE	OpenShift Container Platform 3.5, 3.4, 3.3, and 3.2 bug fix update	2017-04-26 09:35:35 UTC

Description Vladislav Walek 2017-03-07 08:45:15 UTC

Description of problem:

The 2 default router pods failed on the nodes with the error:

E0223 12:48:34.531884       1 runtime.go:64] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED)
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
/usr/lib/golang/src/runtime/asm_amd64.s:479
/usr/lib/golang/src/runtime/panic.go:458

Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1419771.
The replication controller did not reschedule the pods, and the pods did not recover automatically. Manual intervention was required: the pods had to be deleted so that they were rescheduled and started running again.

Version-Release number of selected component (if applicable):
router v3.4.1.2
OSCP v3.4.1.2


Comment 1 Ben Bennett 2017-03-07 14:11:53 UTC

Can you get the router configuration, please? Was this an F5-backed router or haproxy?

Comment 2 Vladislav Walek 2017-03-07 14:14:22 UTC

I will get the configuration from the customer. This is haproxy.

Comment 3 Ben Bennett 2017-03-14 17:51:15 UTC

Talked to eparis. Since this is not a regression (it has been in all router releases; it is just really hard to trigger), we won't block 3.5.0 for it, but we will get a fix out ASAP.

Comment 4 Eric Sauer 2017-03-15 15:20:24 UTC

This is currently biting customers on OpenShift Dedicated as well. They are running v3.3.1.13. This will need to be backported far enough that we can fix this for those customers.

Comment 7 Alexander Koksharov 2017-03-17 08:36:42 UTC

ha-router config is in the attachment.

Comment 11 Will Gordon 2017-03-21 12:34:14 UTC

*** Bug 1434164 has been marked as a duplicate of this bug. ***

Comment 14 Jacob Tanenbaum 2017-03-21 17:22:41 UTC

Can the customer provide any details on the setup (masters/nodes) and the workload that causes this panic? We have not been able to reproduce the exact situation locally.

Comment 17 Alex Meade 2017-03-22 19:52:31 UTC

A customer that Eric is referring to on OpenShift Dedicated seems to hit this when following the reproduction steps defined in Bug #1419771.

Comment 18 openshift-github-bot 2017-03-24 18:56:57 UTC

Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/fd723fde7fc30aeabe8f511b509c1307f6b146fe
change the router eventqueue key function

Change the router eventqueue key function so that there is a higher chance that
each item will have a unique key, so the router does not panic.

Originally the thought was to add the creation timestamp because it is not user
editable, but the accessor function meta.CreationTimestamp() only gives the timestamp
to the second, and since these actions need to occur quickly, a second is too long. With
only the creation timestamp added, I was still able to observe the panic with the test
script. I decided to use the UID because it is much more likely that the UID is unique.

Bug: 1429823

changelog:

added a note explaining why routerKeyFn was added
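
The idea behind the key change can be illustrated with a minimal sketch. This is not the code from the commit: the Route struct, nameKey, uidKey, and the toy queue type are all hypothetical stand-ins, and the assumption is only that the queue tracks one pending event per key and rejects an ADDED that arrives while a DELETED is still pending for the same key.

```go
package main

import "fmt"

// Route is a hypothetical stand-in for the real route object; only the
// fields relevant to key generation are modelled.
type Route struct {
	Namespace string
	Name      string
	UID       string
}

// nameKey builds a namespace/name key. A route that is deleted and then
// recreated under the same name produces the same key.
func nameKey(r Route) string {
	return r.Namespace + "/" + r.Name
}

// uidKey also includes the UID, so a recreated route (which gets a new UID)
// produces a different key than its deleted predecessor.
func uidKey(r Route) string {
	return r.Namespace + "/" + r.Name + "/" + r.UID
}

// queue is a toy model: it remembers the last pending event type per key.
type queue struct {
	keyFn   func(Route) string
	pending map[string]string
}

func newQueue(keyFn func(Route) string) *queue {
	return &queue{keyFn: keyFn, pending: map[string]string{}}
}

// push records an event. In this sketch an error is returned where the
// router in this report panicked: ADDED arriving on top of a pending DELETED.
func (q *queue) push(event string, r Route) error {
	key := q.keyFn(r)
	if prev, ok := q.pending[key]; ok && prev == "DELETED" && event == "ADDED" {
		return fmt.Errorf("Invalid state transition: %s -> %s", prev, event)
	}
	q.pending[key] = event
	return nil
}

func main() {
	old := Route{Namespace: "demo", Name: "web", UID: "uid-1"}
	recreated := Route{Namespace: "demo", Name: "web", UID: "uid-2"}

	cases := []struct {
		label string
		keyFn func(Route) string
	}{
		{"namespace/name key", nameKey},
		{"namespace/name/UID key", uidKey},
	}
	for _, c := range cases {
		q := newQueue(c.keyFn)
		_ = q.push("DELETED", old)
		err := q.push("ADDED", recreated)
		fmt.Printf("%s: err=%v\n", c.label, err)
	}
}
```

With the name-only key the recreated route collides with the pending delete of its predecessor; with the UID in the key the two objects are tracked separately, which matches the reasoning in the commit message for preferring the UID over a second-granularity creation timestamp.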

Comment 19 Thomas Wiest 2017-03-27 19:55:33 UTC

Can we get this backported to 3.4?

Comment 20 Eric Rich 2017-03-28 15:43:42 UTC

(In reply to Thomas Wiest from comment #19)
> Can we get this backported to 3.4?

A backport should already be in flight as part of https://bugzilla.redhat.com/show_bug.cgi?id=1419771

Comment 21 Ben Bennett 2017-03-28 17:30:13 UTC

It will be in the next 3.4 release.  The backport made the cut-off for the next fix release.

Comment 22 Ben Bennett 2017-03-28 17:31:31 UTC

The 3.5.X PR will land as soon as 3.5.0 cuts:
  https://github.com/openshift/origin/issues/13494

Comment 23 bmorriso 2017-03-30 18:33:50 UTC

*** Bug 1430541 has been marked as a duplicate of this bug. ***

Comment 32 errata-xmlrpc 2017-04-26 05:36:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129

Comment 33 Ben Bennett 2017-06-21 14:04:55 UTC

*** Bug 1462675 has been marked as a duplicate of this bug. ***
