Bug 1510174 - occasional restart of atomic-openshift-master-controllers.service due to scheduler cache corruption
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assigned To: Jordan Liggitt
QA Contact: Vikas Laad
Whiteboard: aos-scalability-37
Depends On:
Blocks:
Reported: 2017-11-06 15:44 EST by Vikas Laad
Modified: 2018-03-28 10:11 EDT

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-28 10:11:22 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker                  ID              Priority  Status  Summary  Last Updated
Red Hat Product Errata   RHBA-2018:0489  None      None    None     2018-03-28 10:11 EDT

Description Vikas Laad 2017-11-06 15:44:39 EST
Description of problem:
I am running reliability tests on a 3.7 cluster and see occasional restarts of atomic-openshift-master-controllers.service in the following pattern:

Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[93589]: F1102 04:15:33.881293   93600 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions
Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a
Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16013]: container "atomic-openshift-master-controllers" does not exist
Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service: control process exited, code=exited status=1
Nov 02 04:15:34 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Nov 02 04:15:34 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service failed.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: Starting atomic-openshift-master-controllers.service...
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: Started atomic-openshift-master-controllers.service.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16049]: I1102 04:15:39.427994   16060 plugins.go:77] Registered admission plugin "NamespaceLifecycle"
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16049]: W1102 04:15:39.429207   16060 start_master.go:290] Warning: assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console, master start will continue.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16049]: W1102 04:15:39.429234   16060 start_master.go:290] Warning: assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console, master start will continue.
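The controller messages in the excerpt above carry a glog-style header: a severity letter (I, W, E, or F), the month and day, a timestamp, a thread ID, and the source file and line. A minimal sketch for picking those fields out of a journal line might look like the following. The names `KLOG_RE`, `parse_klog`, and `is_fatal` are hypothetical helpers introduced for illustration, not part of OpenShift or Kubernetes.

```python
import re

# Assumed layout of the glog-style header seen in the excerpt:
#   <severity><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
KLOG_RE = re.compile(
    r"(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<tid>\d+) "
    r"(?P<src>[^\s:]+:\d+)\] (?P<msg>.*)"
)


def parse_klog(line):
    """Return the header fields found in a journal line, or None if absent."""
    match = KLOG_RE.search(line)
    return match.groupdict() if match else None


def is_fatal(line):
    """True when the line carries a header with severity F (fatal)."""
    fields = parse_klog(line)
    return fields is not None and fields["sev"] == "F"
```

Applied to the first line of the excerpt, `parse_klog` should report severity `F` and source `cache.go:264`, while the systemd lines have no such header and yield `None`.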

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8


How reproducible:
Always

Steps to Reproduce:
1. Keep creating/updating/building/scaling quickstart apps on the cluster
2. Watch the master logs

Actual results:
The master controllers service restarts occasionally.

Expected results:
The master controllers service should not restart.

Additional info:
See master logs attached.
Comment 2 Mike Fiedler 2017-11-06 22:11:40 EST
In the logs referenced in comment 1, master-controllers restarted 15 times in 5 days due to the scheduler cache corruption fatal error.
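A restart count like this can be reproduced from saved journal output by tallying the fatal message per day. The sketch below assumes each line begins with a "Mon DD" timestamp, as in the excerpt in the description. `FATAL_MARKER` and `count_corruption_fatals` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Text of the fatal message as it appears in the description's log excerpt.
FATAL_MARKER = "Schedulercache is corrupted"


def count_corruption_fatals(lines):
    """Count scheduler cache corruption fatals, keyed by the 'Mon DD' prefix."""
    per_day = Counter()
    for line in lines:
        if FATAL_MARKER not in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            # The first two tokens of a journal line are month and day.
            per_day[" ".join(parts[:2])] += 1
    return per_day
```

Passing the lines of a saved log file to `count_corruption_fatals` and summing the resulting values gives the total number of fatal exits over the period covered.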
Comment 3 Jordan Liggitt 2017-11-07 09:10:08 EST
https://github.com/kubernetes/kubernetes/issues/50916
Comment 4 Jordan Liggitt 2017-11-07 15:15:49 EST
Fix in https://github.com/kubernetes/kubernetes/pull/55262
Comment 5 Michal Fojtik 2017-12-07 04:16:45 EST
Pick: https://github.com/openshift/origin/pull/17656
Comment 6 Jordan Liggitt 2018-01-09 15:04:48 EST
Will be fixed by 1.9.1 rebase in https://github.com/openshift/origin/pull/18003
Comment 8 Mike Fiedler 2018-01-15 15:43:40 EST
Assigning QA to @vlaad. This will be verified in the 3.9 reliability runs.
Comment 9 Vikas Laad 2018-01-22 11:08:25 EST
Verified in the following version:

openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8


I no longer see restarts of the master-controllers process.
Comment 12 errata-xmlrpc 2018-03-28 10:11:22 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
