Bug 1429823

Summary:	[3.5.x] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED) - default router
Product:	OpenShift Container Platform	Reporter:	Vladislav Walek <vwalek>
Component:	Networking	Assignee:	Jacob Tanenbaum <jtanenba>
Networking sub component:	router	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	akokshar, aloughla, ameade, aos-bugs, asherkho, bbennett, bmeng, bmorriso, dakini, eparis, erich, esauer, jgoulding, jtanenba, maschmid, mmasters, mwhittin, pportant, rromerom, twiest, wgordon, xtian
Version:	3.4.1	Keywords:	OpsBlocker
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Quickly and repeatedly adding and deleting a route with the same name in a namespace. Consequence: The router pod panics with "Invalid state transition: DELETED -> ADDED". Fix: Add the object's UID to the event queue key generation function. Result: No panic when quickly adding and deleting routes.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-04-26 05:36:30 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1303130

Description Vladislav Walek 2017-03-07 08:45:15 UTC

Description of problem:

The 2 router pods (default) failed on the nodes with the error:

E0223 12:48:34.531884       1 runtime.go:64] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED)
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
/usr/lib/golang/src/runtime/asm_amd64.s:479
/usr/lib/golang/src/runtime/panic.go:458

Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1419771.
The replication controller did not reschedule the pods, and they did not recover automatically; manual intervention was required. The pods had to be deleted so that they would be rescheduled and start running again.

Version-Release number of selected component (if applicable):
router v3.4.1.2
OSCP v3.4.1.2

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Ben Bennett 2017-03-07 14:11:53 UTC

Can you get the router configuration please?  Was this an F5-backed router or haproxy?

Comment 2 Vladislav Walek 2017-03-07 14:14:22 UTC

Will get the configuration from the customer. This is haproxy.

Comment 3 Ben Bennett 2017-03-14 17:51:15 UTC

Talked to eparis. Since this is not a regression (it's been in all router releases... it is just really hard to tickle), we won't block 3.5.0 for it, but will get a fix ASAP.

Comment 4 Eric Sauer 2017-03-15 15:20:24 UTC

This is currently biting customers on OpenShift Dedicated as well. They are running v3.3.1.13. This will need to be backported far enough that we can fix this for those customers.

Comment 7 Alexander Koksharov 2017-03-17 08:36:42 UTC

ha-router config is in the attachment.

Comment 11 Will Gordon 2017-03-21 12:34:14 UTC

*** Bug 1434164 has been marked as a duplicate of this bug. ***

Comment 14 Jacob Tanenbaum 2017-03-21 17:22:41 UTC

Can the customer provide any details on the setup (masters/nodes) and workload that causes this panic? We have not been able to reproduce the exact situation locally.

Comment 17 Alex Meade 2017-03-22 19:52:31 UTC

A customer Eric is referring to on OpenShift Dedicated seems to hit this when doing the reproduction steps defined in Bug #1419771.

Comment 18 openshift-github-bot 2017-03-24 18:56:57 UTC

Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/fd723fde7fc30aeabe8f511b509c1307f6b146fe
change the router eventqueue key function

Change the router eventqueue key function so that there is a higher chance that
each item has a unique key and the router does not panic.

The original idea was to add the creation timestamp, because it is not user
editable. However, the accessor function meta.CreationTimestamp() only gives the timestamp
to the second, and since these actions need to occur quickly, a second is too long. With only
the creation timestamp added, I was still able to observe the panic with the test script. I
decided to use the UID because it is much more likely to be unique.

Bug: 1429823

changelog:

added a note explaining why routerKeyFn was added
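
For readers following the thread, here is a minimal Go sketch of why adding the UID to the queue key avoids the collision described in the commit message. It is a toy model, not the code from the commit: the Route struct, keyByName, keyWithUID, and the queue type are all hypothetical names made up for illustration, and the real routerKeyFn and the eventqueue's full transition rules are not reproduced here. The sketch only assumes that an ADDED event arriving for a key whose pending event is DELETED is rejected, as the panic message suggests.

```go
package main

import "fmt"

// Route is a hypothetical stand-in for the API object; only the fields
// that matter for key generation are modeled.
type Route struct {
	Namespace string
	Name      string
	UID       string
}

// EventType mirrors the event names that appear in the panic message.
type EventType string

const (
	Added   EventType = "ADDED"
	Deleted EventType = "DELETED"
)

// keyByName builds a namespace/name key. A deleted route and a route
// re-created with the same name share this key.
func keyByName(r Route) string {
	return r.Namespace + "/" + r.Name
}

// keyWithUID appends the UID, so a re-created route gets a different key.
func keyWithUID(r Route) string {
	return r.Namespace + "/" + r.Name + "/" + r.UID
}

// queue is a toy event queue that remembers the last pending event per key.
type queue struct {
	keyFn   func(Route) string
	pending map[string]EventType
}

func newQueue(keyFn func(Route) string) *queue {
	return &queue{keyFn: keyFn, pending: make(map[string]EventType)}
}

// record stores the event for the route's key. It returns an error in the
// one case modeled here (DELETED followed by ADDED on the same key); the
// router panicked at the equivalent point.
func (q *queue) record(ev EventType, r Route) error {
	key := q.keyFn(r)
	if prev, ok := q.pending[key]; ok && prev == Deleted && ev == Added {
		return fmt.Errorf("invalid state transition: %s -> %s", prev, ev)
	}
	q.pending[key] = ev
	return nil
}

func main() {
	// Same namespace and name, different UIDs: a quick delete and re-create.
	old := Route{Namespace: "demo", Name: "web", UID: "uid-1"}
	recreated := Route{Namespace: "demo", Name: "web", UID: "uid-2"}

	byName := newQueue(keyByName)
	_ = byName.record(Deleted, old)
	fmt.Println("name key:", byName.record(Added, recreated))

	byUID := newQueue(keyWithUID)
	_ = byUID.record(Deleted, old)
	fmt.Println("uid key: ", byUID.record(Added, recreated))
}
```

With the name-only key, the ADDED event for the re-created route lands on the key that still holds the pending DELETED event. With the UID in the key, the two routes are tracked separately. A creation timestamp with one-second resolution would not help here, because both objects can be created within the same second.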

Comment 19 Thomas Wiest 2017-03-27 19:55:33 UTC

Can we get this backported to 3.4?

Comment 20 Eric Rich 2017-03-28 15:43:42 UTC

(In reply to Thomas Wiest from comment #19)
> Can we get this backported to 3.4?

A backport should already be in flight as part of https://bugzilla.redhat.com/show_bug.cgi?id=1419771

Comment 21 Ben Bennett 2017-03-28 17:30:13 UTC

It will be in the next 3.4 release.  The backport made the cut-off for the next fix release.

Comment 22 Ben Bennett 2017-03-28 17:31:31 UTC

The 3.5.X PR will land as soon as 3.5.0 cuts:
  https://github.com/openshift/origin/issues/13494

Comment 23 bmorriso 2017-03-30 18:33:50 UTC

*** Bug 1430541 has been marked as a duplicate of this bug. ***

Comment 32 errata-xmlrpc 2017-04-26 05:36:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1129

Comment 33 Ben Bennett 2017-06-21 14:04:55 UTC

*** Bug 1462675 has been marked as a duplicate of this bug. ***