Bug 1525699

Summary: garbage collection deleting children of TemplateInstance/ServiceInstance on restart after crash
Product: OpenShift Container Platform
Component: Master
Version: 3.7.0
Target Release: 3.7.z
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Status: CLOSED ERRATA
Reporter: Seth Jennings <sjenning>
Assignee: Dan Mace <dmace>
QA Contact: ge liu <geliu>
CC: aos-bugs, bparees, decarr, dmace, geliu, haowang, jliggitt, jminter, jokerman, jupierce, mfojtik, mmccomas, mzali, pmorie, pweil, sjenning, smunilla, tkimura
Doc Type: No Doc Update
Type: Bug
Last Closed: 2018-04-05 09:34:33 UTC

Attachments:
controller.log

Description Seth Jennings 2017-12-13 21:30:14 UTC
Created attachment 1367591 [details]
controller.log

Description of problem:

atomic-openshift-master-controllers dies:
F1213 15:15:11.627699   20074 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions

systemd restarts atomic-openshift-master-controllers, and garbage collection subsequently goes nuts and deletes the children of the TemplateInstance/ServiceInstance (i.e., my application)

Version-Release number of selected component (if applicable):
openshift v3.7.14
kubernetes v1.7.6+a08f5eeb62

How reproducible:
Not always, but often. It depends on the scheduler cache getting corrupted.

Steps to Reproduce:
1. create cakephp-mysql example app from the web console
2. wait a few minutes for cache corruption and master-controller restart
3. watch app get deleted by garbage collector
4. feels of sadness

Actual results:
GC deletes my app

Expected results:
GC should not delete my app

Additional info:
See attached controller log

I can recreate this on brand-new v3.7.14 and v3.8.18 clusters.  I've also seen it happen in a live demo at KubeCon (oops).

Comment 1 Jordan Liggitt 2017-12-13 22:42:23 UTC
Dan, do we have test coverage for the following scenarios on cold start:

existing owner relationship (should wait for informer sync and not reap child)

missing owner instance (should wait for informer sync, see parent instance is missing, and reap child)

missing owner type from discovery (should wait for informer sync of child type, see the parent type is unknown, and not reap)

error starting owner informer (should wait for informer sync forever, I think)

Comment 2 Seth Jennings 2017-12-14 14:28:37 UTC
I confirmed that a simple restart of the atomic-openshift-master-controllers.service will also cause the issue.  There is no need to wait for the scheduler cache corruption, which may or may not happen.
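
(For the record, a sketch of the trigger, assuming a default RPM install; the project name is a placeholder:)

systemctl restart atomic-openshift-master-controllers.service
# then watch the app's objects disappear once GC kicks in, e.g.:
oc get pods,dc,svc -n <project> -w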

Comment 3 Dan Mace 2017-12-14 14:55:18 UTC
My current theory is this represents another (more insidious) manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=1509022. Still working on testing that theory.

Comment 4 Dan Mace 2017-12-14 15:22:41 UTC
I confirmed this is a manifestation of https://github.com/kubernetes/kubernetes/issues/54940. A simpler way to reproduce:

1. Create a configmap in a namespace that declares an owner reference to a node (which is cluster-scoped)
2. Restart the controller process
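
A sketch of those two steps (the node name "node1" and the configmap are illustrative placeholders; an ownerReference needs the owner's real UID):

# look up the UID of an existing, cluster-scoped node
NODE_UID=$(oc get node node1 -o jsonpath='{.metadata.uid}')

# create a namespaced configmap owned by that node
oc create -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: gc-repro
  namespace: default
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: node1
    uid: ${NODE_UID}
data:
  foo: bar
EOF

# restart the controller process and watch the configmap
systemctl restart atomic-openshift-master-controllers.service
oc get configmap gc-repro -n default -w
# with the bug present, gc-repro is wrongly reaped even though node1 exists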

Unfortunately, the hack[1] which largely resolves the problem during app creation does not (and cannot) fix the problem as manifested during controller restart.

[1] https://github.com/openshift/origin/pull/17207

Comment 6 ge liu 2017-12-18 09:49:08 UTC
Failed to recreate it with OCP 3.7.14 per the bug description; it only appears occasionally. Is there a reliable way to make it appear?

Comment 7 Seth Jennings 2017-12-18 13:41:17 UTC
Did you try the recreate procedure in comment 4?  It is simpler than the one in the initial bug description.

Comment 8 ge liu 2017-12-20 07:01:32 UTC
We tried the steps in comment 4 but could not recreate this issue. Have you tried them yourself? Thanks.

Comment 9 Seth Jennings 2017-12-20 18:09:09 UTC
Dan can you help QE with recreate?

Comment 11 Dan Mace 2018-01-03 15:27:54 UTC
(In reply to ge liu from comment #8)
> We tried the steps in comments 4, It could not recreate this issue, Have you
> ever tried it? thx

Can you describe in detail the steps executed to reproduce as outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1525699#c4? I'm surprised you were unable to reproduce using those steps.

Comment 16 Wang Haoran 2018-01-29 07:01:34 UTC
Verified with:
openshift v3.7.27
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Steps (a shell sketch follows below):
1. Create a configmap with an ownerReference to an existing node
2. Create a configmap with an ownerReference to a nonexistent node
3. Create a configmap with an ownerReference to an unknown resource type
4. Restart the controller process
5. Check the configmaps:
   1) the first configmap is still there
   2) the second configmap was reaped
   3) the third one, with the unknown type, is still there
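
A shell sketch of steps 1-4 (the "UnknownType"/example.com owner is an arbitrary unregistered kind, and the UIDs for the missing and unknown owners are fabricated placeholders):

# pick a real node for case 1
NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
NODE_UID=$(oc get node "$NODE" -o jsonpath='{.metadata.uid}')

# helper: $1=name $2=owner apiVersion $3=owner kind $4=owner name $5=owner uid
mkcm() {
oc create -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: $1
  namespace: default
  ownerReferences:
  - apiVersion: $2
    kind: $3
    name: $4
    uid: $5
data:
  k: v
EOF
}

mkcm cm-existing-owner v1 Node "$NODE" "$NODE_UID"                          # case 1
mkcm cm-missing-owner v1 Node no-such-node "$(uuidgen)"                     # case 2
mkcm cm-unknown-type example.com/v1 UnknownType no-such-thing "$(uuidgen)"  # case 3

systemctl restart atomic-openshift-master-controllers.service
sleep 120   # give the garbage collector time to resync and act
oc get configmaps -n default
# expected: cm-existing-owner and cm-unknown-type remain; cm-missing-owner is reaped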

Comment 20 errata-xmlrpc 2018-04-05 09:34:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0636