Description of problem:
The two default router pods failed on the nodes with the error:

E0223 12:48:34.531884 1 runtime.go:64] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED)
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
/builddir/build/BUILD/atomic-openshift-git-0.d760092/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
/usr/lib/golang/src/runtime/asm_amd64.s:479
/usr/lib/golang/src/runtime/panic.go:458

Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1419771.

The replication controller did not reschedule the pods. The pods did not recover automatically; manual intervention was required. The pods had to be deleted so that they would be rescheduled and start running again.

Version-Release number of selected component (if applicable):
router v3.4.1.2
OSCP v3.4.1.2

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Can you get the router configuration, please? Was this an F5-backed router or haproxy?
Will get the configuration from the customer. This is haproxy.
Talked to eparis: since this is not a regression (it has been in all router releases; it is just really hard to tickle), we won't block 3.5.0 for it, but will get a fix ASAP.
This is currently biting customers on OpenShift Dedicated as well. They are running v3.3.1.13. This will need to be backported far enough that we can fix this for those customers.
ha-router config is in the attachment.
*** Bug 1434164 has been marked as a duplicate of this bug. ***
Can the customer provide any details on the setup (masters/nodes) and the workload that causes this panic? We have not been able to reproduce the exact situation locally.
A customer Eric is referring to on OpenShift Dedicated seems to hit this when following the reproduction steps defined in Bug #1419771.
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/fd723fde7fc30aeabe8f511b509c1307f6b146fe
change the router eventqueue key function

Change the router eventqueue key function so that there is a higher chance that each item has a unique key and the router does not panic. The original idea was to add the creation timestamp, because it is not user editable, but the accessor function meta.CreationTimestamp() only gives the timestamp to the second, and since these actions need to occur quickly, a second is too coarse. With only the creation timestamp added, I was still able to observe the panic with the test script. I decided to use the UID because it is much more likely to be unique.

Bug: 1429823

changelog: added a note explaining why routerKeyFn was added
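For readers who want to see why the key function matters, here is a minimal, self-contained sketch. It is not the origin code: the `object` type, the `queue` type, and both key functions are hypothetical stand-ins, and the queue is a heavily simplified model that only remembers the last unconsumed event per key. It assumes the panic arises when a still-pending DELETED event and a subsequent ADDED event for a re-created object map to the same key.

```go
package main

import "fmt"

// object is a hypothetical stand-in for the metadata of a watched resource.
type object struct {
	Namespace string
	Name      string
	UID       string
}

// nameKey builds a key from namespace and name only, so a deleted object
// and its re-created successor share the same key.
func nameKey(o object) string { return o.Namespace + "/" + o.Name }

// uidKey additionally appends the UID, which differs between the deleted
// object and the re-created one (the idea described in the commit message).
func uidKey(o object) string { return o.Namespace + "/" + o.Name + "/" + o.UID }

type eventType string

const (
	added   eventType = "ADDED"
	deleted eventType = "DELETED"
)

// queue is a simplified model: it keeps the last unconsumed event per key
// and rejects an ADDED event that arrives while a DELETED is still pending.
type queue struct {
	keyFn   func(object) string
	pending map[string]eventType
}

func newQueue(keyFn func(object) string) *queue {
	return &queue{keyFn: keyFn, pending: map[string]eventType{}}
}

// handle records an event and returns an error where this model treats the
// transition as invalid (the real router panicked at the equivalent point).
func (q *queue) handle(ev eventType, o object) error {
	key := q.keyFn(o)
	if prev, ok := q.pending[key]; ok && prev == deleted && ev == added {
		return fmt.Errorf("invalid state transition: %s -> %s (key %q)", prev, ev, key)
	}
	q.pending[key] = ev
	return nil
}

// replay sends a delete followed by an add of a re-created object with the
// same namespace and name but a new UID, without consuming in between.
func replay(keyFn func(object) string) error {
	q := newQueue(keyFn)
	old := object{"default", "example-route", "uid-1"}
	recreated := object{"default", "example-route", "uid-2"}
	if err := q.handle(deleted, old); err != nil {
		return err
	}
	return q.handle(added, recreated)
}

func main() {
	fmt.Println("namespace/name key:", replay(nameKey))
	fmt.Println("namespace/name/UID key:", replay(uidKey))
}
```

In this model the name-only key makes the two events collide, while the UID-based key keeps them apart. As the commit message notes, this only makes a collision much less likely; it is not a change to the queue's transition rules.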
Can we get this backported to 3.4?
(In reply to Thomas Wiest from comment #19)
> Can we get this backported to 3.4?

A backport should already be in flight as part of https://bugzilla.redhat.com/show_bug.cgi?id=1419771
It will be in the next 3.4 release. The backport made the cut-off for the next fix release.
The 3.5.X PR will land as soon as 3.5.0 cuts: https://github.com/openshift/origin/issues/13494
*** Bug 1430541 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1129
*** Bug 1462675 has been marked as a duplicate of this bug. ***