Bug 1540054 - Should add retry/delay to stop/start master service during upgrade_control_plane
Summary: Should add retry/delay to stop/start master service during upgrade_control_plane
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Vadim Rutkovsky
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-01-30 07:37 UTC by liujia
Modified: 2018-12-13 19:26 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: During upgrade, all masters were restarted simultaneously.
Consequence: In a multi-master environment, leader election could fail.
Fix: A short pause was added between master restarts.
Result: Leader election no longer fails.
Clone Of:
Environment:
Last Closed: 2018-12-13 19:26:51 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3748 0 None None None 2018-12-13 19:26:58 UTC

Description liujia 2018-01-30 07:37:09 UTC
Description of problem:
The upgrade playbook sometimes stopped and exited at TASK [Start atomic-openshift-master-controllers] in play [Cycle all controller services to force new leader election mode] during upgrade_control_plane.

PLAY [Cycle all controller services to force new leader election mode] *********
META: ran handlers

TASK [Stop atomic-openshift-master-controllers] ********************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml:105
changed: [x.x.x.x] => {"changed": true, "name": "atomic-openshift-master-controllers", "state": "stopped", "status": ........
.......
fatal: [x.x.x.x]: FAILED! => {"changed": false, "msg": "Unable to start service atomic-openshift-master-controllers: Job for atomic-openshift-master-controllers.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-master-controllers.service\" and \"journalctl -xe\" for details.\n"}

However, atomic-openshift-master-controllers was actually running normally when checked manually on the master host.

# systemctl status atomic-openshift-master-controllers.service
● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-01-30 04:00:36 UTC; 2h 34min ago
     Docs: https://github.com/openshift/origin
  Process: 94761 ExecStop=/usr/bin/docker stop atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
  Process: 94793 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 94786 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
 Main PID: 94792 (docker-current)
   Memory: 10.2M
   CGroup: /system.slice/atomic-openshift-master-controllers.service
           └─94792 /usr/bin/docker-current run --rm --privileged --net=host --name atomic-openshift-master-controllers --env-file=/etc/sysconfig/atomic-openshift-master-c...


Version-Release number of the following components:
ansible-2.4.2.0-2.el7.noarch
openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Run the upgrade_control_plane playbook against an all-in-one OCP cluster

ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade_control_plane.yml

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
- name: Cycle all controller services to force new leader election mode
  hosts: oo_masters_to_config
  gather_facts: no
  roles:
  - role: openshift_facts
  tasks:
  - name: Stop {{ openshift_service_type }}-master-controllers
    systemd:
      name: "{{ openshift_service_type }}-master-controllers"
      state: stopped
  - name: Start {{ openshift_service_type }}-master-controllers
    systemd:
      name: "{{ openshift_service_type }}-master-controllers"
      state: started
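
The retry/delay requested in the summary could look roughly like the sketch below. It is only an illustration of the idea, not the change made in the linked pull request: the registered variable name start_result and the retries/delay values are assumptions chosen for the example. Older Ansible releases may need the filter form (start_result | succeeded) instead of the test form used here.

```yaml
# Hypothetical sketch: retry the start task instead of failing on the first error.
# "start_result", "retries: 3" and "delay: 10" are illustrative assumptions.
- name: Cycle all controller services to force new leader election mode
  hosts: oo_masters_to_config
  gather_facts: no
  roles:
  - role: openshift_facts
  tasks:
  - name: Stop {{ openshift_service_type }}-master-controllers
    systemd:
      name: "{{ openshift_service_type }}-master-controllers"
      state: stopped
  - name: Start {{ openshift_service_type }}-master-controllers
    systemd:
      name: "{{ openshift_service_type }}-master-controllers"
      state: started
    # Re-run the task until the module reports success, waiting between attempts.
    register: start_result
    until: start_result is succeeded
    retries: 3
    delay: 10
```

With until/retries/delay, the task is reported as failed only after all attempts are exhausted, which would tolerate a transient systemd start error like the one in the log above.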

Comment 1 Vadim Rutkovsky 2018-01-30 17:03:53 UTC
Created https://github.com/openshift/openshift-ansible/pull/6943, although I couldn't reproduce the crash.

Comment 5 Vadim Rutkovsky 2018-02-02 17:52:25 UTC
Fix is available in openshift-ansible-3.9.0-0.36.0.git.0.da68f13.el7

Comment 6 liujia 2018-02-05 04:59:25 UTC
Blocked by bz1540464

Comment 7 liujia 2018-02-06 06:15:38 UTC
Version:
openshift-ansible-3.9.0-0.38.0.git.0.57e1184.el7.noarch

The original stop and start of atomic-openshift-master-controllers was changed to a restart of atomic-openshift-master-controllers, but this task was skipped. The master controllers were not restarted after the master was upgraded.
<--snip-->
PLAY [Cycle all controller services to force new leader election mode] ******************************************************************************************************

TASK [Restart master controllers to force new leader election mode] *********************************************************************************************************
skipping: [x.x.x.x] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

TASK [Re-enable master controllers to force new leader election mode] *******************************************************************************************************
skipping: [x.x.x.x] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

PLAY [Upgrade web console] ********************************************************************************
<--snip-->

Comment 8 Vadim Rutkovsky 2018-02-06 14:00:49 UTC
Weird, that worked fine for me here. Both tasks would be skipped only if openshift_rolling_restart_mode was set to neither 'system' nor 'services', which is probably the culprit here.

Could you attach the inventory file and ansible output log?

Comment 11 liujia 2018-02-07 01:53:48 UTC
(In reply to Vadim Rutkovsky from comment #8)
> Weird, that worked fine for me here. Both tasks would be skipped only if
> openshift_rolling_restart_mode was set to neither 'system' nor 'services',
> which is probably the culprit here.
> 
> Could you attach the inventory file and ansible output log?

I think the root cause is that openshift_rolling_restart_mode was not set to "services", although "services" should be the default.

Comment 12 Vadim Rutkovsky 2018-02-07 15:54:26 UTC
(In reply to liujia from comment #11)
> (In reply to Vadim Rutkovsky from comment #8)
> > Weird, that worked fine for me here. Both tasks would be skipped only if
> > openshift_rolling_restart_mode was set to neither 'system' nor 'services',
> > which is probably the culprit here.
> > 
> > Could you attach the inventory file and ansible output log?
> 
> I think the root cause should be openshift_rolling_restart_mode was not set
> to "services" which should be "services" by default.

Right, the code was expecting it to be 'service'. Also the following task had a typo.

Created https://github.com/openshift/openshift-ansible/pull/7052 to resolve this
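
For illustration, a condition that matches the plural value discussed above might be written as in the sketch below. This is an assumption about the shape of such a task, not the content of the pull request: the task name is taken from the log in comment 7, while the when expression and the default('services') fallback are hypothetical.

```yaml
# Hypothetical sketch: compare against 'services' (plural) and fall back to
# 'services' when the variable is unset. Not the actual change in the PR.
- name: Restart master controllers to force new leader election mode
  systemd:
    name: "{{ openshift_service_type }}-master-controllers"
    state: restarted
  when: openshift_rolling_restart_mode | default('services') == 'services'
```

Comparing against 'service' (singular) instead would evaluate to false for the default value, which is consistent with the "Conditional result was False" skip seen in comment 7.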

Comment 13 Vadim Rutkovsky 2018-02-12 07:59:44 UTC
Fix available in openshift-ansible-3.9.0-0.42.0.git.0.1a9a61b.el7

Comment 14 liujia 2018-02-14 03:55:21 UTC
Verified on openshift-ansible-3.9.0-0.42.0.git.0.1a9a61b.el7.noarch.

Comment 17 errata-xmlrpc 2018-12-13 19:26:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748

