Created attachment 1367591 [details]
Description of problem:
F1213 15:15:11.627699 20074 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions
systemd restarts atomic-openshift-master-controllers, and garbage collection subsequently goes nuts and deletes the children of the TemplateInstance/ServiceInstance (i.e. my application)
Version-Release number of selected component (if applicable):
v3.7.14, v3.8.18

How reproducible:
Not always, but often. Depends on the scheduler cache getting corrupted.
Steps to Reproduce:
1. Create the cakephp-mysql example app from the web console
2. Wait a few minutes for cache corruption and the master-controllers restart
3. Watch the app get deleted by the garbage collector
4. Feels of sadness
Actual results:
GC deletes my app

Expected results:
GC should not delete my app

Additional info:
See attached controller log
I can recreate this on brand new v3.7.14 and v3.8.18 clusters. I've also seen it happen in a live demo at KubeCon (oops).
Dan, do we have test coverage for the following scenarios on cold start:
- existing owner relationship (should wait for informer sync and not reap the child)
- missing owner instance (should wait for informer sync, see the parent instance is missing, and reap the child)
- missing owner type from discovery (should wait for informer sync of the child type, see the parent type is unknown, and not reap)
- error starting the owner informer (should wait for informer sync forever, I think)
I confirmed that a simple restart of the atomic-openshift-master-controllers.service will also cause the issue. There is no need to wait for the scheduler cache corruption which may or may not happen.
My current theory is this represents another (more insidious) manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=1509022. Still working on testing that theory.
I confirmed this is a manifestation of https://github.com/kubernetes/kubernetes/issues/54940. A simpler way to reproduce:
1. Create a configmap in a namespace which declares an owner reference to a node (which is cluster scoped)
2. Restart the controller process
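The owner reference in step 1 can be sketched with a manifest along these lines (the name, namespace, and UID are placeholders; the key point is that a namespaced ConfigMap claims a cluster-scoped Node as its owner):

```yaml
# Hypothetical reproduction manifest for comment 4, step 1.
# Substitute a real node name and its UID, e.g. from:
#   kubectl get node <node> -o jsonpath='{.metadata.uid}'
apiVersion: v1
kind: ConfigMap
metadata:
  name: gc-repro
  namespace: default
  ownerReferences:
  - apiVersion: v1
    kind: Node                                   # cluster-scoped owner
    name: node-1.example.com                     # placeholder node name
    uid: 00000000-0000-0000-0000-000000000000    # placeholder; use the real node UID
data:
  note: "namespaced child of a cluster-scoped owner"
```

With the bug present, restarting the controller process causes the garbage collector to reap this ConfigMap even though the owning Node still exists; with the upstream fix applied it is left alone.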
Unfortunately, the hack that largely resolves the problem during app creation does not (and cannot) fix the problem as it manifests during controller restart.
Upstream issue: https://github.com/kubernetes/kubernetes/issues/54940
Upstream fix: https://github.com/kubernetes/kubernetes/pull/57211
Failed to recreate it with OCP 3.7.14 per the bug description; it only appears occasionally. Is there a reliable way to make it appear?
Did you try the recreate procedure in comment 4? It is simpler than the recreate procedure in the initial bug description.
We tried the steps in comment 4, but it could not recreate this issue. Have you ever tried it? Thanks.
Dan, can you help QE with the recreate?
(In reply to ge liu from comment #8)
> We tried the steps in comment 4, but it could not recreate this issue. Have
> you ever tried it? Thanks.
Can you describe in detail the steps executed to reproduce as outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1525699#c4? I'm surprised you were unable to reproduce using those steps.
1. Create a configmap with an ownerReference to an existing node
2. Create a configmap with an ownerReference to a nonexistent node
3. Create a configmap with an ownerReference to an unknown resource type
4. Restart the controller process
5. Check the configmaps:
1) the first configmap is still there
2) the second configmap was reaped
3) the third one, with the unknown type, is still there.
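For reference, the third verification case can be expressed with a manifest like the following (the owner's group, kind, name, and UID are made-up placeholders for a type the API server does not serve, so the GC cannot resolve it via discovery and must not reap the child):

```yaml
# Hypothetical manifest for verification step 3: the owner kind is unknown
# to the cluster, so the garbage collector should leave the child alone.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gc-unknown-owner
  namespace: default
  ownerReferences:
  - apiVersion: example.com/v1                   # placeholder group/version not in discovery
    kind: UnknownType                            # placeholder kind
    name: does-not-exist
    uid: 11111111-1111-1111-1111-111111111111    # placeholder UID
data: {}
```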
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.