Description of problem:

The upstream CI job for upgrades consists of a multinode deployment, with a single node for the undercloud and another node for the overcloud, using the following custom roles_data: https://github.com/openstack/tripleo-heat-templates/blob/master/ci/environments/scenario001-multinode-containers.yaml

As a consequence, when upgrading the overcloud, the ceph upgrade step ("overcloud ceph upgrade") stops in the following task:

2018-05-09 09:41:05,341 p=29987 u=mistral | TASK [gather facts] ************************************************************
2018-05-09 09:41:05,341 p=29987 u=mistral | Wednesday 09 May 2018 09:41:05 +0000 (0:00:00.051) 0:00:01.110 *********
2018-05-09 09:41:05,362 p=29987 u=mistral | skipping: [192.168.24.17] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2018-05-09 09:41:05,386 p=29987 u=mistral | TASK [gather and delegate facts] ***********************************************
2018-05-09 09:41:05,386 p=29987 u=mistral | Wednesday 09 May 2018 09:41:05 +0000 (0:00:00.044) 0:00:01.155 *********
2018-05-09 09:41:08,857 p=29987 u=mistral | ok: [192.168.24.17 -> 192.168.24.17] => (item=192.168.24.17)
2018-05-09 09:41:08,954 p=29987 u=mistral | TASK [set_fact] ****************************************************************
2018-05-09 09:41:08,954 p=29987 u=mistral | Wednesday 09 May 2018 09:41:08 +0000 (0:00:03.567) 0:00:04.723 *********
2018-05-09 09:41:08,996 p=29987 u=mistral | ok: [192.168.24.17] => {"ansible_facts": {"rolling_update": true}, "changed": false, "failed": false}
2018-05-09 09:41:09,013 p=29987 u=mistral | PLAY [upgrade ceph mon cluster] ************************************************
2018-05-09 09:41:09,131 p=29987 u=mistral | TASK [set mon_host_count] ******************************************************
2018-05-09 09:41:09,132 p=29987 u=mistral | Wednesday 09 May 2018 09:41:09 +0000 (0:00:00.177) 0:00:04.900 *********
2018-05-09 09:41:09,182 p=29987 u=mistral | ok: [192.168.24.17] => {"ansible_facts": {"mon_host_count": "1"}, "changed": false, "failed": false}
2018-05-09 09:41:09,192 p=29987 u=mistral | TASK [debug] *******************************************************************
2018-05-09 09:41:09,192 p=29987 u=mistral | Wednesday 09 May 2018 09:41:09 +0000 (0:00:00.060) 0:00:04.961 *********
2018-05-09 09:41:09,241 p=29987 u=mistral | ok: [192.168.24.17] => { "msg": "WARNING - upgrading a ceph cluster with only one monitor node (192.168.24.17)" }
2018-05-09 09:41:09,250 p=29987 u=mistral | TASK [fail when single containerized monitor] **********************************
2018-05-09 09:41:09,250 p=29987 u=mistral | Wednesday 09 May 2018 09:41:09 +0000 (0:00:00.057) 0:00:05.019 *********
2018-05-09 09:41:09,297 p=29987 u=mistral | fatal: [192.168.24.17]: FAILED! => {"changed": false, "failed": true, "msg": "Upgrades of a single monitor are not supported, also running 1 monitor is not recommended always use 3."}
2018-05-09 09:41:09,299 p=29987 u=mistral | PLAY RECAP *********************************************************************
2018-05-09 09:41:09,299 p=29987 u=mistral | 192.168.24.17 : ok=5 changed=0 unreachable=0 failed=1
2018-05-09 09:41:09,299 p=29987 u=mistral | localhost : ok=1 changed=0 unreachable=0 failed=0
2018-05-09 09:41:09,299 p=29987 u=mistral | Wednesday 09 May 2018 09:41:09 +0000 (0:00:00.048) 0:00:05.068 *********

The "easy" solution would be to enable a 3-controller, 3-ceph-node configuration for the job; however, that is not possible at the moment, because the job already takes almost the full timeout (3 hours) to execute with the single-node configuration.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
Deploy the overcloud with the following command: https://logs.rdoproject.org/97/565597/7/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-queens-nv/Zea560056edcb4462888d3acf49b2146d/undercloud/home/jenkins/overcloud-deploy.sh

2. Upgrade the overcloud via:
- Overcloud upgrade prepare: https://logs.rdoproject.org/97/565597/7/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-queens-nv/Zea560056edcb4462888d3acf49b2146d/undercloud/home/jenkins/overcloud_upgrade_prepare.sh.txt.gz
- Overcloud upgrade run: https://logs.rdoproject.org/97/565597/7/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-queens-nv/Zea560056edcb4462888d3acf49b2146d/undercloud/home/jenkins/overcloud_upgrade_run-Controller.sh.txt.gz

3. Upgrade ceph via: https://logs.rdoproject.org/97/565597/7/openstack-check/gate-tripleo-ci-centos-7-container-to-container-upgrades-queens-nv/Zea560056edcb4462888d3acf49b2146d/undercloud/home/jenkins/ceph-upgrade-run.sh.txt.gz

Actual results:
The ceph upgrade never starts; the playbook aborts at the "fail when single containerized monitor" task.

Expected results:
The ceph upgrade finishes successfully.

Additional info:
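For context on what the log above shows: the rolling_update playbook counts the monitor hosts (mon_host_count), warns when there is only one, and refuses to proceed with a rolling update. The sketch below mirrors that guard in Python for illustration only; the names (check_monitor_count, RollingUpdateError) are hypothetical and not the actual ceph-ansible implementation.

```python
# Illustrative sketch of the single-monitor guard seen in the log.
# check_monitor_count and RollingUpdateError are made-up names, not
# the real ceph-ansible tasks.

class RollingUpdateError(Exception):
    pass

def check_monitor_count(mon_hosts, rolling_update=True):
    """Mirror the 'set mon_host_count' + 'fail when single containerized
    monitor' tasks: refuse a rolling update with only one monitor."""
    mon_host_count = len(mon_hosts)
    if mon_host_count == 1:
        print("WARNING - upgrading a ceph cluster with only one "
              "monitor node (%s)" % mon_hosts[0])
        if rolling_update:
            raise RollingUpdateError(
                "Upgrades of a single monitor are not supported; "
                "running 1 monitor is not recommended, always use 3.")
    return mon_host_count

# The CI job deploys a single overcloud node acting as the only monitor,
# so the guard fires:
try:
    check_monitor_count(["192.168.24.17"])
except RollingUpdateError as exc:
    print("fatal: %s" % exc)
```

With three monitor hosts in the inventory, the same check passes silently, which is why the 3-controller layout mentioned above would avoid the failure.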
Unfortunately, this is not a configuration we can support. Even if we allowed a single mon here, the playbook would fail later. This has been discussed multiple times already, and we are not planning to work on it. I know the CI resources and the time it takes to run an upgrade can be frustrating, but please note that CI should reproduce realistic scenarios (production, basically), and there is no point in testing the deployment of a single monitor, nor its upgrade. Hope you will understand, thanks.
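Given the policy above, a job that wants to run the ceph upgrade needs at least 3 monitors. One way to sanity-check a deployment before attempting the upgrade is to count the monitors reported by `ceph mon dump --format json`. The helper below is a hypothetical sketch (monitor_count is not part of any TripleO or ceph-ansible tooling); it only parses the JSON, which has a top-level "mons" list.

```python
import json

# Hypothetical helper: count monitors in the output of
# `ceph mon dump --format json` to check the >= 3 monitor
# recommendation before running the overcloud ceph upgrade.

def monitor_count(mon_dump_json: str) -> int:
    """Return the number of monitors in a `ceph mon dump` JSON blob."""
    return len(json.loads(mon_dump_json).get("mons", []))

# Abbreviated example of the JSON shape for a 3-monitor cluster:
sample = ('{"epoch": 1, "mons": ['
          '{"name": "ctrl-0"}, {"name": "ctrl-1"}, {"name": "ctrl-2"}]}')

count = monitor_count(sample)
print(count)  # 3
assert count >= 3, "rolling upgrade requires at least 3 monitors"
```

On the single-node CI deployment described in this report, the same check would report 1 and fail the assertion, matching the mon_host_count=1 fact in the log.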