Created attachment 1367591 [details]
Description of problem:
F1213 15:15:11.627699 20074 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions
systemd restarts atomic-openshift-master-controllers, and garbage collection subsequently goes nuts and deletes the children of the TemplateInstance/ServiceInstance (i.e. my application)
Version-Release number of selected component (if applicable):
v3.7.14, v3.8.18

How reproducible:
Not always, but often. Depends on the scheduler cache getting corrupted.
Steps to Reproduce:
1. Create the cakephp-mysql example app from the web console
2. Wait a few minutes for cache corruption and the master-controllers restart
3. Watch the app get deleted by the garbage collector
4. Feels of sadness
Actual results:
GC deletes my app

Expected results:
GC should not delete my app

Additional info:
See attached controller log
I can recreate this on brand new v3.7.14 and v3.8.18 clusters. I've also seen it happen in a live demo at KubeCon (oops).
Dan, do we have test coverage for the following scenarios on cold start:
- existing owner relationship (should wait for informer sync and not reap the child)
- missing owner instance (should wait for informer sync, see the parent instance is missing, and reap the child)
- missing owner type from discovery (should wait for informer sync of the child type, see the parent type is unknown, and not reap)
- error starting the owner informer (should wait for informer sync forever, I think)
I confirmed that a simple restart of the atomic-openshift-master-controllers.service will also cause the issue. There is no need to wait for the scheduler cache corruption which may or may not happen.
My current theory is this represents another (more insidious) manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=1509022. Still working on testing that theory.
I confirmed this is a manifestation of https://github.com/kubernetes/kubernetes/issues/54940. A simpler way to reproduce:
1. Create a configmap in a namespace which declares an owner reference to a node (which is cluster scoped)
2. Restart the controller process
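The owner reference in step 1 can be sketched with a manifest along these lines (the name, namespace, and UID are placeholders; the key point is that a namespaced ConfigMap claims a cluster-scoped Node as its owner):

```yaml
# Hypothetical reproduction manifest for comment 4, step 1.
# Substitute a real node name and its UID, e.g. from:
#   kubectl get node <node> -o jsonpath='{.metadata.uid}'
apiVersion: v1
kind: ConfigMap
metadata:
  name: gc-repro
  namespace: default
  ownerReferences:
  - apiVersion: v1
    kind: Node                                   # cluster-scoped owner
    name: node-1.example.com                     # placeholder node name
    uid: 00000000-0000-0000-0000-000000000000    # placeholder; use the real node UID
data:
  note: "namespaced child of a cluster-scoped owner"
```

With the bug present, restarting the controller process causes the garbage collector to reap this ConfigMap even though the owning Node still exists; with the upstream fix applied it is left alone.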
Unfortunately, the hack that largely resolves the problem during app creation does not (and cannot) fix the problem as it manifests during controller restart.
Upstream issue: https://github.com/kubernetes/kubernetes/issues/54940
Upstream fix: https://github.com/kubernetes/kubernetes/pull/57211
Failed to recreate it with OCP 3.7.14 per the bug description; it only appears occasionally. Is there a reliable way to make it appear?
Did you try the recreate procedure in comment 4? It is simpler than the recreate procedure in the initial bug description.
We tried the steps in comment 4, but it could not recreate this issue. Have you ever tried it? Thanks.
Dan, can you help QE with the recreate?
(In reply to ge liu from comment #8)
> We tried the steps in comment 4, but it could not recreate this issue. Have
> you ever tried it? Thanks.
Can you describe in detail the steps executed to reproduce as outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1525699#c4? I'm surprised you were unable to reproduce using those steps.
1. Create a configmap with an ownerReference to an existing node
2. Create a configmap with an ownerReference to a nonexistent node
3. Create a configmap with an ownerReference to an unknown resource type
4. Restart the controller process
5. Check the configmaps:
1) the first configmap is still there
2) the second configmap was reaped
3) the third one, with the unknown type, is still there.
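For reference, the third verification case can be expressed with a manifest like the following (the owner's group, kind, name, and UID are made-up placeholders for a type the API server does not serve, so the GC cannot resolve it via discovery and must not reap the child):

```yaml
# Hypothetical manifest for verification step 3: the owner kind is unknown
# to the cluster, so the garbage collector should leave the child alone.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gc-unknown-owner
  namespace: default
  ownerReferences:
  - apiVersion: example.com/v1                   # placeholder group/version not in discovery
    kind: UnknownType                            # placeholder kind
    name: does-not-exist
    uid: 11111111-1111-1111-1111-111111111111    # placeholder UID
data: {}
```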
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.