Description of problem:
Upgrade against a non-HA containerized OCP cluster failed at the task [restart master controllers].

RUNNING HANDLER [restart master controllers] *******************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-master-controllers: Job for atomic-openshift-master-controllers.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-master-controllers.service\" and \"journalctl -xe\" for details.\n"}

=================== more info ===================
Checked on the master host and found that atomic-openshift-master-controllers was actually running.

Journal log:
<--snip-->
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: Failed to start Atomic OpenShift Master Controllers.
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service failed.
Nov 06 00:42:12 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.
Nov 06 00:42:12 qe-jliu-con-master-1 systemd[1]: Starting Atomic OpenShift Master Controllers...
Nov 06 00:42:13 qe-jliu-con-master-1 atomic-openshift-master-controllers[3392]: Error response from daemon: No such container: atomic-openshift-master-controllers
<--snip-->

# systemctl status atomic-openshift-master*
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-06 00:41:56 EST; 7min ago
     Docs: https://github.com/openshift/origin
  Process: 3086 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 3081 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-api (code=exited, status=1/FAILURE)
 Main PID: 3085 (docker-current)
...

● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-06 00:42:23 EST; 6min ago
     Docs: https://github.com/openshift/origin
  Process: 3382 ExecStop=/usr/bin/docker stop atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
  Process: 3397 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 3392 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
 Main PID: 3396 (docker-current)
...
● atomic-openshift-master.service
   Loaded: not-found (Reason: No such file or directory)
   Active: active (running) since Mon 2017-11-06 00:41:34 EST; 7min ago
 Main PID: 1843 (docker-current)
   CGroup: /system.slice/atomic-openshift-master.service
           └─1843 /usr/bin/docker-current run --rm --privileged --net=host --name atomic-openshift-master --env-file=/etc/sysconfig/atomic-openshift-master -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin -v /etc/origin/cloudprovider:/etc/origin/cloudprovider -v /etc/pki:/etc/pki:ro openshift3/ose:v3.6.173.0.63 start master --config=/etc/origin/master/master-config.yaml --loglevel=5

# docker images
REPOSITORY                                           TAG             IMAGE ID       CREATED      SIZE
registry.ops.openshift.com/openshift3/ose            v3.7            e63c03f3ae7b   4 days ago   1.059 GB
registry.ops.openshift.com/openshift3/ose            v3.7.0          e63c03f3ae7b   4 days ago   1.059 GB
registry.ops.openshift.com/openshift3/openvswitch    v3.6.173.0.63   e7d2769a89cf   6 days ago   1.159 GB
registry.ops.openshift.com/openshift3/node           v3.6.173.0.63   f1fe7e034bec   6 days ago   1.157 GB
registry.ops.openshift.com/openshift3/ose            v3.6            ef9ef5dca033   6 days ago   970.6 MB
registry.ops.openshift.com/openshift3/ose            v3.6.173.0.63   ef9ef5dca033   6 days ago   970.6 MB

# docker ps
CONTAINER ID   IMAGE                                  COMMAND                  CREATED         STATUS         PORTS   NAMES
a155f91c04df   openshift3/ose:v3.7.0                  "/usr/bin/openshift s"   8 minutes ago   Up 8 minutes           atomic-openshift-master-controllers
f9edfcd6b082   openshift3/ose:v3.7.0                  "/usr/bin/openshift s"   8 minutes ago   Up 8 minutes           atomic-openshift-master-api
403b30456dab   openshift3/node:v3.6.173.0.63          "/usr/local/bin/origi"   9 minutes ago   Up 9 minutes           atomic-openshift-node
fba2480d2cf5   openshift3/ose:v3.6.173.0.63           "/usr/bin/openshift s"   9 minutes ago   Up 9 minutes           atomic-openshift-master
dee2ab535184   openshift3/openvswitch:v3.6.173.0.63   "/usr/local/bin/ovs-r"   4 hours ago     Up 4 hours

Version-Release number of the following components:
openshift-ansible-3.7.0-0.194.0.git.0.e8af207.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Perform a containerized install of OCP v3.6 with a non-HA deployment.
2. Upgrade the cluster to v3.7.

Actual results:
The upgrade failed.

Expected results:
The upgrade succeeds.

Additional info:
The upgrade log and the master-controllers journal log are attached.
Reading through the journal logs, I think this is happening simply because the API server is still bootstrapping when we first attempt to start the controllers. The controllers exit with the following error, then restart, at which point they succeed:

Nov 06 00:42:00 qe-jliu-con-master-1 atomic-openshift-master-controllers[3222]: F1106 05:42:00.078464       1 client_builder.go:273] unable to get token for service account: error watching
Nov 06 00:42:00 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a

So let's either add retries to these service starts, or ignore the initial failure and then poll the service's status for a period of time.
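The "ignore the failure, then poll" approach could be sketched in shell roughly as below. This is only an illustration, not the openshift-ansible implementation; the `retry` helper name and the attempt/delay values are made up for the example:

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt budget is exhausted.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while ! "$@"; do
        if [ "$i" -ge "$attempts" ]; then
            echo "giving up after $i attempts" >&2
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 0
}

# In the upgrade scenario this would look something like:
#   systemctl start atomic-openshift-master-controllers ||
#       retry 12 5 systemctl is-active --quiet atomic-openshift-master-controllers
```

This tolerates the controllers crashing once while the API server bootstraps, as long as systemd's own holdoff restart eventually brings the unit to the active state within the polling window.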
@sdodson I think we should investigate having systemd handle this ordering, making the controllers service depend on the API service being fully started. I will implement the work-around for now since we're close to the freeze. https://www.freedesktop.org/software/systemd/man/sd_notify.html#
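For reference, a hypothetical sketch of what that systemd-level ordering could look like. This assumes the API unit could use sd_notify (Type=notify) so that "started" actually means "ready to serve"; these directives are not what the current units ship with:

```ini
# atomic-openshift-master-controllers.service (hypothetical excerpt)
[Unit]
# Do not start the controllers until the API unit reports it has started.
Requires=atomic-openshift-master-api.service
After=atomic-openshift-master-api.service

[Service]
# If the controllers still race the API server and exit, let systemd retry.
Restart=on-failure
RestartSec=5s
```

Without Type=notify on the API unit, After= only orders against the fork/exec of the API container, which is exactly why the race exists today.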
PR Created: https://github.com/openshift/openshift-ansible/pull/6027
openshift-ansible-3.7.0-0.197.0
Verified on openshift-ansible-3.7.0-0.197.0.git.0.f40c09c.el7.noarch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188