Summary: garbage collection deleting children of TemplateInstance/ServiceInstance on restart after crash
Product: OpenShift Container Platform
Reporter: Seth Jennings <sjenning>
Component: Master
Assignee: Dan Mace <dmace>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Version: 3.7.0
CC: aos-bugs, bparees, decarr, dmace, geliu, haowang, jliggitt, jminter, jokerman, jupierce, mfojtik, mmccomas, mzali, pmorie, pweil, sjenning, smunilla, tkimura
Fixed In Version:
Doc Type: No Doc Update
Last Closed: 2018-04-05 09:34:33 UTC
Type: Bug
Description Seth Jennings 2017-12-13 21:30:14 UTC
Created attachment 1367591 [details]
controller.log

Description of problem:
atomic-openshift-master-controllers dies:

F1213 15:15:11.627699   20074 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions

systemd restarts atomic-openshift-master-controllers, and garbage collection subsequently goes nuts and deletes the children of TemplateInstance/ServiceInstance (i.e. my application).

Version-Release number of selected component (if applicable):
openshift v3.7.14
kubernetes v1.7.6+a08f5eeb62

How reproducible:
Not always, but often. Depends on the scheduler cache getting corrupted.

Steps to Reproduce:
1. Create the cakephp-mysql example app from the web console
2. Wait a few minutes for cache corruption and a master-controllers restart
3. Watch the app get deleted by the garbage collector
4. Feels of sadness

Actual results:
GC deletes my app

Expected results:
GC should not delete my app

Additional info:
See the attached controller log. I can recreate this on brand new v3.7.14 and v3.8.18 clusters. I've also seen it happen in a live demo at KubeCon (oops).
Comment 1 Jordan Liggitt 2017-12-13 22:42:23 UTC
Dan, do we have test coverage for the following scenarios on cold start:

- existing owner relationship (should wait for informer sync and not reap the child)
- missing owner instance (should wait for informer sync, see the parent instance is missing, and reap the child)
- missing owner type from discovery (should wait for informer sync of the child type, see the parent type is unknown, and not reap)
- error starting the owner informer (should wait for informer sync forever, I think)
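The cold-start scenarios above can be sketched as a small decision table. This is a simplified illustrative model, not the actual kube-controller-manager garbage collector code; all names here are made up for the sketch.

```python
# Simplified model of the garbage collector's cold-start decisions described
# in this comment. Illustrative only -- not the real kube GC implementation.

def gc_decision(owner_type_known, informer_synced, owner_exists):
    """Return what a well-behaved GC should do with a child object.

    owner_type_known: is the owner's type present in API discovery?
    informer_synced:  has the informer for the owner's type synced?
    owner_exists:     is the owner instance present in the synced cache?
    """
    if not owner_type_known:
        return "keep"   # parent type unknown to discovery: never reap
    if not informer_synced:
        return "wait"   # must not decide from an empty, unsynced cache
    if owner_exists:
        return "keep"   # live owner: child stays
    return "reap"       # synced cache confirms the owner is really gone


# The bug amounts to deciding before the sync, which makes a live owner
# look missing and the child get reaped.
assert gc_decision(owner_type_known=True, informer_synced=False,
                   owner_exists=True) == "wait"
```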
Comment 2 Seth Jennings 2017-12-14 14:28:37 UTC
I confirmed that a simple restart of the atomic-openshift-master-controllers.service will also cause the issue. There is no need to wait for the scheduler cache corruption which may or may not happen.
Comment 3 Dan Mace 2017-12-14 14:55:18 UTC
My current theory is this represents another (more insidious) manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=1509022. Still working on testing that theory.
Comment 4 Dan Mace 2017-12-14 15:22:41 UTC
I confirmed this is a manifestation of https://github.com/kubernetes/kubernetes/issues/54940. A simpler way to reproduce:

1. Create a configmap in a namespace which declares an owner reference to a node (which is cluster-scoped)
2. Restart the controller process

Unfortunately, the hack which largely resolves the problem during app creation does not (and cannot) fix the problem as manifested during controller restart.

https://github.com/openshift/origin/pull/17207
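The reproducer manifest from this comment can be sketched as follows. The node name and UID are placeholders, and the helper name is made up for illustration; real values would come from `oc get node <name> -o json`.

```python
# Sketch of the reproducer: a namespaced ConfigMap whose ownerReference
# points at a cluster-scoped Node. Placeholder name/UID values; the output
# could be piped to `oc create -f -` on a test cluster.
import json

def configmap_owned_by_node(name, namespace, node_name, node_uid):
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Cross-scope reference: Node is cluster-scoped while the
            # ConfigMap is namespaced -- the combination that confused GC.
            "ownerReferences": [{
                "apiVersion": "v1",
                "kind": "Node",
                "name": node_name,
                "uid": node_uid,
            }],
        },
    }

manifest = configmap_owned_by_node("gc-repro", "default",
                                   "node-1.example.com",
                                   "00000000-0000-0000-0000-000000000000")
print(json.dumps(manifest, indent=2))
```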
Comment 5 Jordan Liggitt 2017-12-15 06:31:33 UTC
Comment 6 ge liu 2017-12-18 09:49:08 UTC
Failed to recreate it with OCP 3.7.14 following the bug description; it only appears occasionally. Is there a reliable way to make it appear?
Comment 7 Seth Jennings 2017-12-18 13:41:17 UTC
Did you try the recreate procedure in comment 4? It is simpler than the recreate procedure in the initial bug description.
Comment 8 ge liu 2017-12-20 07:01:32 UTC
We tried the steps in comment 4, but could not recreate this issue. Have you ever tried it yourself? Thanks.
Comment 9 Seth Jennings 2017-12-20 18:09:09 UTC
Dan can you help QE with recreate?
Comment 11 Dan Mace 2018-01-03 15:27:54 UTC
(In reply to ge liu from comment #8)
> We tried the steps in comments 4, It could not recreate this issue, Have you
> ever tried it? thx

Can you describe in detail the steps executed to reproduce as outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1525699#c4? I'm surprised you were unable to reproduce using those steps.
Comment 16 Wang Haoran 2018-01-29 07:01:34 UTC
Verified with:
openshift v3.7.27
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Steps:
1. Create a configmap with ownerReferences to an existing node
2. Create a configmap with ownerReferences to a nonexistent node
3. Create a configmap with ownerReferences to an unknown resource type
4. Restart the controller process
5. Check the configmaps:
   1) the first configmap is still there
   2) the second configmap was reaped
   3) the third one, with the unknown type, is still there
Comment 20 errata-xmlrpc 2018-04-05 09:34:33 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0636