Description of problem:

The ceph-enabled FFU from 13 to 16.1 CI job fails in a post-upgrade check that verifies that all haproxy backend services are up and running:

TASK [tripleo-upgrade : Running post upgrade scripts for controller-0] *********
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/upgrade/controller_node_upgrade.yml:2
Thursday 23 July 2020 22:58:08 +0000 (0:00:00.193) 9:26:54.055 *********

changed: [undercloud-0] => (item=haproxy) => {
    "changed": true,
    "cmd": "set -o pipefail && /home/stack/controller-0_post/haproxy.sh",
    "delta": "0:00:06.182722",
    "end": "2020-07-23 18:58:15.137372",
    "item": "haproxy",
    "rc": 0,
    "start": "2020-07-23 18:58:08.954650"
}

STDOUT:

Waiting for haproxy pcs resource to start
3 instances of haproxy-bundle are started

failed: [undercloud-0] (item=haproxy_backend) => {
    "changed": true,
    "cmd": "set -o pipefail && /home/stack/controller-0_post/haproxy_backend.sh",
    "delta": "0:07:05.553062",
    "end": "2020-07-23 19:05:21.477824",
    "item": "haproxy_backend",
    "rc": 1,
    "start": "2020-07-23 18:58:15.924762"
}

STDOUT:

Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
<more of this log>
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
FAILURE: glance_api glance_api is down on controller-1.internalapi.redhat.local controller-2.internalapi.redhat.local

The haproxy backend having trouble starting is the glance_api one. An interesting point is that the very same check works fine on the non-ceph CI job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/128/artifact/.sh/ir-tripleo-ffu-upgrade-run.log

The HAProxy backend statistics show the following (log obtained from a different environment that failed for the same reason; the stats log is not stored by the Jenkins job):

cinder,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
cinder,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,2,1,,0,,2,0,,0,L7OK,200,3,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
cinder,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,2,2,,0,,2,0,,0,L7OK,200,6,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
cinder,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7160,0,,1,2,3,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
cinder,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,3,3,0,,0,8587,0,,1,2,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
glance_api,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,3,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
glance_api,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,3,1,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
glance_api,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8587,8587,,1,3,2,,0,,2,0,,0,L4CON,,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,Layer4 connection problem,,2,5,0,,,,,,http,,,,,,,,
glance_api,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8587,8587,,1,3,3,,0,,2,0,,0,L4CON,,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,Layer4 connection problem,,2,5,0,,,,,,http,,,,,,,,
glance_api,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,1,1,0,,0,8587,0,,1,3,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
haproxy.stats,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,4,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
haproxy.stats,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,0,0,0,,0,8587,,,1,4,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
heat_api,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
heat_api,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,5,1,,0,,2,0,,0,L7OK,200,2,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_api,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,5,2,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_api,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7153,0,,1,5,3,,0,,2,0,,0,L7OK,200,2,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_api,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,3,3,0,,0,8587,0,,1,5,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
heat_cfn,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,6,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
heat_cfn,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,6,1,,0,,2,0,,0,L7OK,200,3,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_cfn,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,6,2,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_cfn,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7151,0,,1,6,3,,0,,2,0,,0,L7OK,200,2,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_cfn,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,3,3,0,,0,8587,0,,1,6,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
horizon,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,7,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
horizon,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7256,0,,1,7,1,,0,,2,0,,0,L7OK,301,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Moved Permanently,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,

Only the glance_api haproxy backend service seems to be down.
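The raw rows above are HAProxy "show stat" CSV output, and only a handful of columns are needed to read them: pxname (0), svname (1), status (17), check_status (36) and last_chk (56). Here is a minimal decoding sketch in Python for the failing glance_api row; the column indices follow the documented HAProxy stats layout and can shift between HAProxy releases, so treat them as an assumption:

# Decode the interesting columns of the failing glance_api row above.
row = ("glance_api,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,"
       "0,0,0,0,DOWN,1,1,0,1,1,8587,8587,,1,3,2,,0,,2,0,,0,L4CON,,0,0,0,0,"
       "0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,"
       "Layer4 connection problem,,2,5,0,,,,,,http,,,,,,,,")
fields = row.split(",")
print("proxy        :", fields[0])   # glance_api
print("server       :", fields[1])   # controller-1.internalapi.redhat.local
print("status       :", fields[17])  # DOWN
print("check_status :", fields[36])  # L4CON: the TCP connect itself failed
print("last_chk     :", fields[56])  # Connection refused

In other words, haproxy cannot even open a TCP connection to the glance_api port on controller-1 and controller-2; this is a layer-4 refusal, not an HTTP-level health-check failure.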
CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/
CI job logs: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/
2.
3.

Actual results:

Expected results:

Additional info:
Just as a note, this BZ is pretty similar to https://bugzilla.redhat.com/show_bug.cgi?id=1850991. The CI job already ran with the fix for bug 1850991 in place, and this new backend service still appeared as failed.
Just to clarify, since the comment above could suggest that this issue is caused by that fix: the glance_api service was already failing before it (in addition to the ceph_dashboard):

Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
FAILURE: ceph_dashboard ceph_dashboard ceph_dashboard ceph_dashboard glance_api glance_api is down on controller-0 controller-1 controller-2 BACKEND controller-1.internalapi.redhat.local controller-2.internalapi.redhat.local

Log: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/85/artifact/.sh/ir-tripleo-ffu-upgrade-run.log
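For reference, the check producing this output loops until every backend server reports UP, or gives up after a deadline. Below is a hypothetical Python equivalent of what haproxy_backend.sh appears to do; the stats socket path, poll interval and deadline are assumptions for illustration, not values taken from the actual script:

#!/usr/bin/env python3
# Poll the HAProxy admin socket until no backend server is DOWN,
# printing the same "Waiting ..." / "FAILURE: ..." style of output.
import socket
import sys
import time

STATS_SOCKET = "/var/lib/haproxy/stats"  # assumed admin-socket path
DEADLINE = time.time() + 420             # the failed run polled for ~7 minutes

def down_servers():
    """Return (proxy, server) pairs whose status column (17) is DOWN."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(STATS_SOCKET)
    sock.sendall(b"show stat\n")
    data = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.close()
    down = []
    for line in data.decode().splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(",")
        if len(fields) > 17 and fields[17].startswith("DOWN"):
            down.append((fields[0], fields[1]))
    return down

while True:
    down = down_servers()
    if not down:
        sys.exit(0)
    if time.time() > DEADLINE:
        for proxy, server in down:
            print("FAILURE: {} is down on {}".format(proxy, server))
        sys.exit(1)
    print("Waiting for haproxy backend services to come up")
    time.sleep(10)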
This seems to be a genuine failure:
http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/controller-1.tar.gz?controller-1/var/log/containers/glance/api.log

2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd [-] Error connecting to ceph cluster.: rados.TimedOut: [errno 110] error connecting to the cluster
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd Traceback (most recent call last):
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd   File "/usr/lib/python3.6/site-packages/glance_store/_drivers/rbd.py", line 273, in get_connection
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd     client.connect(timeout=self.connect_timeout)
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd   File "rados.pyx", line 893, in rados.Rados.connect
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd rados.TimedOut: [errno 110] error connecting to the cluster
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd
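The traceback shows glance_store's rbd driver timing out inside client.connect(). The symptom can be confirmed independently of glance with a minimal rados connectivity probe run from the glance container; the conffile path and the cephx user below are assumptions about this deployment, not values taken from the job:

#!/usr/bin/env python3
# Minimal ceph reachability probe mirroring the failing call path
# (glance_store/_drivers/rbd.py: client.connect(timeout=...)).
import rados

client = rados.Rados(conffile="/etc/ceph/ceph.conf",
                     rados_id="openstack")  # assumed cephx user name
try:
    client.connect(timeout=5)
    print("cluster reachable, fsid:", client.get_fsid())
except rados.TimedOut as exc:
    # errno 110 here matches the error in glance's api.log above
    print("cannot reach ceph cluster:", exc)
finally:
    client.shutdown()

If this probe also times out, the problem sits between the controller and the ceph monitors rather than inside glance itself.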
The documented procedure from bz#1855813 (which is under heavy testing and review) can avoid this bug; adding it as a dependency of this bug.