Bug 1525699
Summary: | garbagecollection deleting children of TemplateInstance/ServiceInstance on restart after crash | ||||||
---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Seth Jennings <sjenning> | ||||
Component: | Master | Assignee: | Dan Mace <dmace> | ||||
Status: | CLOSED ERRATA | QA Contact: | ge liu <geliu> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 3.7.0 | CC: | aos-bugs, bparees, decarr, dmace, geliu, haowang, jliggitt, jminter, jokerman, jupierce, mfojtik, mmccomas, mzali, pmorie, pweil, sjenning, smunilla, tkimura | ||||
Target Milestone: | --- | ||||||
Target Release: | 3.7.z | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | No Doc Update | |||||
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2018-04-05 09:34:33 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Dan, do we have test coverage for the following scenarios on cold start:

- existing owner relationship (should wait for informer sync and not reap child)
- missing owner instance (should wait for informer sync, see parent instance is missing, and reap child)
- missing owner type from discovery (should wait for informer sync of child type, see the parent type is unknown, and not reap)
- error starting owner informer (should wait for informer sync forever, I think)

I confirmed that a simple restart of the atomic-openshift-master-controllers.service will also cause the issue. There is no need to wait for the scheduler cache corruption, which may or may not happen.

My current theory is that this represents another (more insidious) manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=1509022. Still working on testing that theory.

I confirmed this is a manifestation of https://github.com/kubernetes/kubernetes/issues/54940. A simpler way to reproduce:

1. Create a configmap in a namespace which declares an owner reference to a node (which is cluster scoped)
2. Restart the controller process

Unfortunately, the hack[1] which largely resolves the problem during app creation does not (and cannot) fix the problem as manifested during controller restart.

[1] https://github.com/openshift/origin/pull/17207

Upstream issue: https://github.com/kubernetes/kubernetes/issues/54940
Upstream fix: https://github.com/kubernetes/kubernetes/pull/57211
3.7: https://github.com/openshift/origin/pull/17818
3.8: https://github.com/openshift/origin/pull/17819
3.9: https://github.com/openshift/origin/pull/17820

Failed to recreate it with OCP 3.7.14 following the bug description; it appears only occasionally. Is there a reliable way to make it appear?

Did you try the recreate procedure in comment 4? It is simpler than the recreate procedure in the initial bug description.

We tried the steps in comment 4 and could not recreate this issue. Have you ever tried it? Thanks.

Dan, can you help QE with recreate?
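As an aid for recreating the issue, the configmap from step 1 of the simpler reproduction (a namespaced configmap that declares an owner reference to a cluster-scoped node) can be sketched as below. This is only an illustration, not something taken from the bug: the namespace, configmap name, node name, and UID are hypothetical placeholders, and the struct types are a hand-written subset of the object metadata rather than the real Kubernetes client types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// OwnerReference is a hand-written subset of the metadata.ownerReferences entry.
type OwnerReference struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Name       string `json:"name"`
	UID        string `json:"uid"`
}

// ObjectMeta holds only the metadata fields needed for this reproduction.
type ObjectMeta struct {
	Name            string           `json:"name"`
	Namespace       string           `json:"namespace"`
	OwnerReferences []OwnerReference `json:"ownerReferences"`
}

// ConfigMap is a minimal v1 ConfigMap manifest.
type ConfigMap struct {
	APIVersion string            `json:"apiVersion"`
	Kind       string            `json:"kind"`
	Metadata   ObjectMeta        `json:"metadata"`
	Data       map[string]string `json:"data"`
}

// buildConfigMap returns a namespaced configmap whose single owner is a
// cluster-scoped Node, which is the combination described in the reproduction.
func buildConfigMap(namespace, name, nodeName, nodeUID string) ConfigMap {
	return ConfigMap{
		APIVersion: "v1",
		Kind:       "ConfigMap",
		Metadata: ObjectMeta{
			Name:      name,
			Namespace: namespace,
			OwnerReferences: []OwnerReference{{
				APIVersion: "v1",
				Kind:       "Node",
				Name:       nodeName,
				UID:        nodeUID,
			}},
		},
		Data: map[string]string{"purpose": "gc-restart-repro"},
	}
}

// renderManifest serializes the configmap as indented JSON.
func renderManifest(cm ConfigMap) (string, error) {
	b, err := json.MarshalIndent(cm, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// All four arguments are placeholders (assumptions); substitute the
	// name and UID of a node that actually exists in the cluster.
	cm := buildConfigMap("gc-repro", "owned-by-node",
		"node-1.example.com", "00000000-0000-0000-0000-000000000000")
	out, err := renderManifest(cm)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}
```

The printed JSON could then be saved to a file and created with `oc create -f`. As far as I understand, the garbage collector matches owners by UID, so the placeholder UID needs to be replaced with the node's real `metadata.uid` for the "owner exists" case to be meaningful.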
(In reply to ge liu from comment #8)
> We tried the steps in comments 4, It could not recreate this issue, Have you
> ever tried it? thx

Can you describe in detail the steps executed to reproduce as outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1525699#c4? I'm surprised you were unable to reproduce using those steps.

Verified with:
openshift v3.7.27
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Steps:
1. Create a configmap with ownerReferences to an existing node
2. Create a configmap with ownerReferences to a nonexistent node
3. Create a configmap with ownerReferences to an UnknownType resource
4. Restart the controller process
5. Check the configmaps:
   1) the first configmap is still there
   2) the second configmap was reaped
   3) the third one, with UnknownType, is still there

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0636
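The verified outcomes (first configmap kept, second reaped, third kept) line up with the cold-start expectations raised earlier in this bug. As a rough sketch only, those expectations can be written down as a small decision table. This is not the garbage collector's actual code; the type and function names (`OwnerState`, `Action`, `expectedAction`) are hypothetical and exist purely to make the expected behavior explicit.

```go
package main

import "fmt"

// OwnerState describes what the collector knows about a child's owner once
// informers have had a chance to sync. Hypothetical names, for illustration.
type OwnerState int

const (
	OwnerExists        OwnerState = iota // owner type is known and the instance is present
	OwnerMissing                         // owner type is known but the instance is absent
	OwnerTypeUnknown                     // owner type is missing from discovery
	OwnerInformerError                   // the owner informer failed to start
)

// Action is the expected treatment of the child object.
type Action string

const (
	KeepChild   Action = "keep"
	ReapChild   Action = "reap"
	WaitForSync Action = "wait"
)

// expectedAction encodes the expectations listed in this bug: reap only when
// the owner type is known and the instance is confirmed missing.
func expectedAction(s OwnerState) Action {
	switch s {
	case OwnerExists:
		return KeepChild
	case OwnerMissing:
		return ReapChild
	case OwnerTypeUnknown:
		return KeepChild
	default:
		// Informer never syncs, so no deletion decision should be made.
		return WaitForSync
	}
}

func main() {
	cases := []struct {
		name  string
		state OwnerState
	}{
		{"configmap owned by existing node", OwnerExists},
		{"configmap owned by nonexistent node", OwnerMissing},
		{"configmap owned by UnknownType resource", OwnerTypeUnknown},
		{"owner informer failed to start", OwnerInformerError},
	}
	for _, c := range cases {
		fmt.Printf("%s: %s\n", c.name, expectedAction(c.state))
	}
}
```

The first three rows correspond to the three configmaps checked in the verification steps; the fourth row reflects the "wait forever, I think" expectation and was not part of the verification.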
Created attachment 1367591 [details]
controller.log

Description of problem:

atomic-openshift-master-controllers dies:

F1213 15:15:11.627699 20074 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions

systemd restarts atomic-openshift-master-controllers, and garbage collection subsequently goes nuts and deletes children of TemplateInstance/ServiceInstance (i.e. my application).

Version-Release number of selected component (if applicable):
openshift v3.7.14
kubernetes v1.7.6+a08f5eeb62

How reproducible:
Not always, but often. Depends on the scheduler cache getting corrupted.

Steps to Reproduce:
1. create cakephp-mysql example app from the web console
2. wait a few minutes for cache corruption and master-controller restart
3. watch app get deleted by garbage collector
4. feels of sadness

Actual results:
GC deletes my app

Expected results:
GC should not delete my app

Additional info:
See attached controller log.

I can recreate on brand new v3.7.14 and v3.8.18 clusters. I've also seen it happen in a live demo at kubecon (oops).