Description of problem:

I am running reliability tests on a 3.7 cluster and see occasional restarts of atomic-openshift-master-controllers.service in the following pattern:

Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[93589]: F1102 04:15:33.881293 93600 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions
Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a
Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16013]: container "atomic-openshift-master-controllers" does not exist
Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service: control process exited, code=exited status=1
Nov 02 04:15:34 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Nov 02 04:15:34 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service failed.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: Starting atomic-openshift-master-controllers.service...
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: Started atomic-openshift-master-controllers.service.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16049]: I1102 04:15:39.427994 16060 plugins.go:77] Registered admission plugin "NamespaceLifecycle"
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16049]: W1102 04:15:39.429207 16060 start_master.go:290] Warning: assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console, master start will continue.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16049]: W1102 04:15:39.429234 16060 start_master.go:290] Warning: assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console, master start will continue.

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

How reproducible:
Always

Steps to Reproduce:
1. Keep creating/updating/building/scaling quickstart apps on the cluster
2. Watch the master logs

Actual results:
The master controller restarts occasionally.

Expected results:
The master controller should not restart.

Additional info:
See master logs attached.
In the logs referenced in comment 1, master-controllers restarted 15 times in 5 days because of the fatal scheduler cache corruption error.
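For anyone who wants to tally these restarts from their own logs, here is a minimal sketch in Python. It assumes the journal lines look like the ones quoted in the description (glog-style `F<MMDD> ...` prefix followed by the `cache.go` message); the function name `count_cache_fatals` and the per-day grouping are illustrative choices, not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Matches the glog-style fatal line quoted in the description, e.g.
#   F1102 04:15:33.881293 93600 cache.go:264] Schedulercache is corrupted ...
# The source line number after "cache.go:" is left open because it may differ
# between builds (assumption).
FATAL_RE = re.compile(
    r"\bF(?P<month>\d{2})(?P<day>\d{2}) \d{2}:\d{2}:\d{2}\.\d+\s+\d+ "
    r"cache\.go:\d+\] Schedulercache is corrupted"
)


def count_cache_fatals(lines):
    """Return a Counter mapping 'MM-DD' to the number of fatal lines seen."""
    per_day = Counter()
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            per_day[f"{match['month']}-{match['day']}"] += 1
    return per_day


if __name__ == "__main__":
    import sys

    # Read journal text from stdin and print one count per day.
    for day, count in sorted(count_cache_fatals(sys.stdin).items()):
        print(day, count)
```

One way to use it, assuming the script is saved under a hypothetical name such as count_fatals.py, is to pipe the unit's journal into it: `journalctl -u atomic-openshift-master-controllers | python count_fatals.py`. Each matching fatal line corresponds to one controller exit in the pattern shown in the description.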
https://github.com/kubernetes/kubernetes/issues/50916
Fix in https://github.com/kubernetes/kubernetes/pull/55262
Pick: https://github.com/openshift/origin/pull/17656
Will be fixed by 1.9.1 rebase in https://github.com/openshift/origin/pull/18003
Assigning QA to @vlaad. This will be verified in the 3.9 reliability runs.
Verified in the following version:

openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8

I do not see restarts of the master-controllers process anymore.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489