Description of problem:
The upgrade playbook sometimes stopped and exited at TASK [Start atomic-openshift-master-controllers] in play [Cycle all controller services to force new leader election mode] during upgrade_control_plane:

PLAY [Cycle all controller services to force new leader election mode] *********
META: ran handlers

TASK [Stop atomic-openshift-master-controllers] ********************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml:105
changed: [x.x.x.x] => {"changed": true, "name": "atomic-openshift-master-controllers", "state": "stopped", "status": ........
.......
fatal: [x.x.x.x]: FAILED! => {"changed": false, "msg": "Unable to start service atomic-openshift-master-controllers: Job for atomic-openshift-master-controllers.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-master-controllers.service\" and \"journalctl -xe\" for details.\n"}

But in fact atomic-openshift-master-controllers was running fine when checked manually on the master host.
# systemctl status atomic-openshift-master-controllers.service
● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-01-30 04:00:36 UTC; 2h 34min ago
     Docs: https://github.com/openshift/origin
  Process: 94761 ExecStop=/usr/bin/docker stop atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
  Process: 94793 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 94786 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
 Main PID: 94792 (docker-current)
   Memory: 10.2M
   CGroup: /system.slice/atomic-openshift-master-controllers.service
           └─94792 /usr/bin/docker-current run --rm --privileged --net=host --name atomic-openshift-master-controllers --env-file=/etc/sysconfig/atomic-openshift-master-c...

Version-Release number of the following components:
ansible-2.4.2.0-2.el7.noarch
openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Run the upgrade_control_plane playbook against OCP (all-in-one):
   ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
- name: Cycle all controller services to force new leader election mode
  hosts: oo_masters_to_config
  gather_facts: no
  roles:
  - role: openshift_facts
  tasks:
  - name: Stop {{ openshift_service_type }}-master-controllers
    systemd:
      name: "{{ openshift_service_type }}-master-controllers"
      state: stopped
  - name: Start {{ openshift_service_type }}-master-controllers
    systemd:
      name: "{{ openshift_service_type }}-master-controllers"
      state: started
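Note that the failure is intermittent and the service ends up running anyway, which suggests a transient race during the stop/start cycle (the ExecStartPre `docker rm -f` above exits non-zero while the old container is still being torn down). A minimal sketch of one way to harden the start task against such a race, using the standard `retries`/`until` loop on the systemd module (this is an illustration, not the fix that was merged):

```yaml
# Hypothetical hardening of the "Start" task above: retry the service start
# a few times in case the previous container has not finished tearing down.
- name: Start {{ openshift_service_type }}-master-controllers
  systemd:
    name: "{{ openshift_service_type }}-master-controllers"
    state: started
  register: start_result
  retries: 3
  delay: 10
  until: start_result is succeeded
```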
Created https://github.com/openshift/openshift-ansible/pull/6943, although I couldn't reproduce the crash.
Fix is available in openshift-ansible-3.9.0-0.36.0.git.0.da68f13.el7
Blocked by bz1540464
Version: openshift-ansible-3.9.0-0.38.0.git.0.57e1184.el7.noarch

The original stop-and-start of atomic-openshift-master-controllers was changed to a restart, but that task was skipped, so the master controllers were not restarted after the master was upgraded.

<--snip-->
PLAY [Cycle all controller services to force new leader election mode] ******************************************************************************************************

TASK [Restart master controllers to force new leader election mode] *********************************************************************************************************
skipping: [x.x.x.x] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

TASK [Re-enable master controllers to force new leader election mode] *******************************************************************************************************
skipping: [x.x.x.x] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

PLAY [Upgrade web console] ********************************************************************************
<--snip-->
Weird, that worked fine for me here. Both tasks would be skipped only if openshift_rolling_restart_mode was set to neither 'system' nor 'services', which is probably the culprit here. Could you attach the inventory file and ansible output log?
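Based on that description, the skip logic would look roughly like the sketch below: one task gated on 'services' mode and one on 'system' mode, so an inventory that sets openshift_rolling_restart_mode to anything else (or a mismatched value in the condition) skips both. This is a reconstruction from the comment, not the exact playbook source:

```yaml
# Sketch of the skip conditions as described above (task names taken from
# the log; the exact 'when' expressions in the playbook may differ):
- name: Restart master controllers to force new leader election mode
  service:
    name: "{{ openshift_service_type }}-master-controllers"
    state: restarted
  when: openshift_rolling_restart_mode == 'services'

- name: Re-enable master controllers to force new leader election mode
  service:
    name: "{{ openshift_service_type }}-master-controllers"
    enabled: yes
  when: openshift_rolling_restart_mode == 'system'
```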
(In reply to Vadim Rutkovsky from comment #8)
> Weird, that worked fine for me here. Both tasks would be skipped only if
> openshift_rolling_restart_mode was set to neither 'system' nor 'services',
> which is probably the culprit here.
>
> Could you attach the inventory file and ansible output log?

I think the root cause is that openshift_rolling_restart_mode was not set to "services", which is supposed to be the default.
(In reply to liujia from comment #11)
> (In reply to Vadim Rutkovsky from comment #8)
> > Weird, that worked fine for me here. Both tasks would be skipped only if
> > openshift_rolling_restart_mode was set to neither 'system' nor 'services',
> > which is probably the culprit here.
> >
> > Could you attach the inventory file and ansible output log?
>
> I think the root cause should be openshift_rolling_restart_mode was not set
> to "services" which should be "services" by default.

Right, the code was expecting it to be 'service'. Also the following task had a typo.

Created https://github.com/openshift/openshift-ansible/pull/7052 to resolve this.
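Given that diagnosis (a 'service'/'services' mismatch and no default), the fix presumably amounts to correcting the value and defaulting the variable so the restart runs when the inventory does not set it. A hedged sketch, not the literal diff from the PR:

```yaml
# Hypothetical corrected condition: use the 'services' spelling and apply a
# default, so the restart is no longer skipped on unset inventories.
- name: Restart master controllers to force new leader election mode
  service:
    name: "{{ openshift_service_type }}-master-controllers"
    state: restarted
  when: openshift_rolling_restart_mode | default('services') == 'services'
```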
Fix available in openshift-ansible-3.9.0-0.42.0.git.0.1a9a61b.el7
Verified on openshift-ansible-3.9.0-0.42.0.git.0.1a9a61b.el7.noarch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3748