Bug 1576386 - Single ceph monitor upgrade not allowed
Summary: Single ceph monitor upgrade not allowed
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 3.*
Assignee: Sébastien Han
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-09 10:51 UTC by Jose Luis Franco
Modified: 2018-05-10 23:39 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-10 23:39:38 UTC
Embargoed:



Description Jose Luis Franco 2018-05-09 10:51:38 UTC
Description of problem:
The upstream CI job for upgrades consists of a multinode deployment, with a single node for the undercloud and another node for the overcloud, using the following custom roles_data: https://github.com/openstack/tripleo-heat-templates/blob/master/ci/environments/scenario001-multinode-containers.yaml

As a consequence, when upgrading the overcloud, the ceph upgrade step run via "overcloud ceph upgrade" stops in the following task:

2018-05-09 09:41:05,341 p=29987 u=mistral |  TASK [gather facts] ************************************************************
2018-05-09 09:41:05,341 p=29987 u=mistral |  Wednesday 09 May 2018  09:41:05 +0000 (0:00:00.051)       0:00:01.110 *********
2018-05-09 09:41:05,362 p=29987 u=mistral |  skipping: [192.168.24.17] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2018-05-09 09:41:05,386 p=29987 u=mistral |  TASK [gather and delegate facts] ***********************************************
2018-05-09 09:41:05,386 p=29987 u=mistral |  Wednesday 09 May 2018  09:41:05 +0000 (0:00:00.044)       0:00:01.155 *********
2018-05-09 09:41:08,857 p=29987 u=mistral |  ok: [192.168.24.17 -> 192.168.24.17] => (item=192.168.24.17)
2018-05-09 09:41:08,954 p=29987 u=mistral |  TASK [set_fact] ****************************************************************
2018-05-09 09:41:08,954 p=29987 u=mistral |  Wednesday 09 May 2018  09:41:08 +0000 (0:00:03.567)       0:00:04.723 *********
2018-05-09 09:41:08,996 p=29987 u=mistral |  ok: [192.168.24.17] => {"ansible_facts": {"rolling_update": true}, "changed": false, "failed": false}
2018-05-09 09:41:09,013 p=29987 u=mistral |  PLAY [upgrade ceph mon cluster] ************************************************
2018-05-09 09:41:09,131 p=29987 u=mistral |  TASK [set mon_host_count] ******************************************************
2018-05-09 09:41:09,132 p=29987 u=mistral |  Wednesday 09 May 2018  09:41:09 +0000 (0:00:00.177)       0:00:04.900 *********
2018-05-09 09:41:09,182 p=29987 u=mistral |  ok: [192.168.24.17] => {"ansible_facts": {"mon_host_count": "1"}, "changed": false, "failed": false}
2018-05-09 09:41:09,192 p=29987 u=mistral |  TASK [debug] *******************************************************************
2018-05-09 09:41:09,192 p=29987 u=mistral |  Wednesday 09 May 2018  09:41:09 +0000 (0:00:00.060)       0:00:04.961 *********
2018-05-09 09:41:09,241 p=29987 u=mistral |  ok: [192.168.24.17] => {
    "msg": "WARNING - upgrading a ceph cluster with only one monitor node (192.168.24.17)"
}
2018-05-09 09:41:09,250 p=29987 u=mistral |  TASK [fail when single containerized monitor] **********************************
2018-05-09 09:41:09,250 p=29987 u=mistral |  Wednesday 09 May 2018  09:41:09 +0000 (0:00:00.057)       0:00:05.019 *********
2018-05-09 09:41:09,297 p=29987 u=mistral |  fatal: [192.168.24.17]: FAILED! => {"changed": false, "failed": true, "msg": "Upgrades of a single monitor are not supported, also running 1 monitor is not recommended always use 3."}
2018-05-09 09:41:09,299 p=29987 u=mistral |  PLAY RECAP *********************************************************************
2018-05-09 09:41:09,299 p=29987 u=mistral |  192.168.24.17              : ok=5    changed=0    unreachable=0    failed=1
2018-05-09 09:41:09,299 p=29987 u=mistral |  localhost                  : ok=1    changed=0    unreachable=0    failed=0
2018-05-09 09:41:09,299 p=29987 u=mistral |  Wednesday 09 May 2018  09:41:09 +0000 (0:00:00.048)       0:00:05.068 *********
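
For context, the failing task is a guard in ceph-ansible's rolling_update playbook that counts the monitor hosts in the inventory and aborts when only one is found. A minimal sketch of that kind of check (illustrative, not the verbatim upstream tasks; 'mons' is the default ceph-ansible monitor group name, and the failure message is the one shown in the log above):

    # Illustrative sketch only: count hosts in the monitor inventory group
    # and abort the rolling update when only one is present.
    - name: set mon_host_count
      set_fact:
        mon_host_count: "{{ groups['mons'] | length }}"

    - name: fail when single containerized monitor
      fail:
        msg: >-
          Upgrades of a single monitor are not supported, also running
          1 monitor is not recommended always use 3.
      # Abort before any mon container is touched when the count is 1.
      when: mon_host_count | int == 1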

The "easy" solution would be enable a 3controller 3 ceph configuration for the job, however it's not a possibility at the moment as the job is taking almost the timeout time (3 hours) to execute with a single-node configuration.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy the overcloud with the following command: https://logs.rdoproject.org/97/565597/7/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-queens-nv/Zea560056edcb4462888d3acf49b2146d/undercloud/home/jenkins/overcloud-deploy.sh

2. Upgrade the overcloud via:
 - Overcloud upgrade prepare: https://logs.rdoproject.org/97/565597/7/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-queens-nv/Zea560056edcb4462888d3acf49b2146d/undercloud/home/jenkins/overcloud_upgrade_prepare.sh.txt.gz

 - Overcloud upgrade run: https://logs.rdoproject.org/97/565597/7/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-queens-nv/Zea560056edcb4462888d3acf49b2146d/undercloud/home/jenkins/overcloud_upgrade_run-Controller.sh.txt.gz


3. Upgrade ceph via: https://logs.rdoproject.org/97/565597/7/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-queens-nv/Zea560056edcb4462888d3acf49b2146d/undercloud/home/jenkins/ceph-upgrade-run.sh.txt.gz

Actual results:
The ceph upgrade does not start; the playbook fails on the single-monitor check.

Expected results:
Ceph upgrade finishes successfully

Additional info:

Comment 3 Sébastien Han 2018-05-10 23:39:38 UTC
Unfortunately, this is not a configuration we can support. Even if we allowed a single mon, the playbook would fail later.
This has been discussed multiple times already and we are not planning to work on this.

I know the CI resource limits and the time it takes to run an upgrade can be frustrating, but please note that CI should reproduce realistic scenarios (essentially production), and there is no point in testing a deployment with a single monitor, nor its upgrade.

Hope you will understand, thanks.

