Bug 1860236 - [OSP13->OSP16.1] Glance-api haproxy backend service fails to start after FFU in ceph environment
Summary: [OSP13->OSP16.1] Glance-api haproxy backend service fails to start after FFU ...
Keywords:
Status: CLOSED DUPLICATE of bug 1855813
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Francesco Pantano
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On: 1855813
Blocks:
 
Reported: 2020-07-24 05:35 UTC by Jose Luis Franco
Modified: 2022-02-15 07:16 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-10 13:35:35 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-12673 (last updated 2022-02-15 07:16:29 UTC)

Description Jose Luis Franco 2020-07-24 05:35:25 UTC
Description of problem:


The Ceph-enabled FFU 13 to 16.1 CI job fails in a post-upgrade check that verifies that all haproxy backend services are up and running:

TASK [tripleo-upgrade : Running post upgrade scripts for controller-0] *********
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/upgrade/controller_node_upgrade.yml:2
Thursday 23 July 2020  22:58:08 +0000 (0:00:00.193)       9:26:54.055 ********* 
changed: [undercloud-0] => (item=haproxy) => {
    "changed": true,
    "cmd": "set -o pipefail && /home/stack/controller-0_post/haproxy.sh",
    "delta": "0:00:06.182722",
    "end": "2020-07-23 18:58:15.137372",
    "item": "haproxy",
    "rc": 0,
    "start": "2020-07-23 18:58:08.954650"
}

STDOUT:

Waiting for haproxy pcs resource to start
3 instances of haproxy-bundle are started

failed: [undercloud-0] (item=haproxy_backend) => {
    "changed": true,
    "cmd": "set -o pipefail && /home/stack/controller-0_post/haproxy_backend.sh",
    "delta": "0:07:05.553062",
    "end": "2020-07-23 19:05:21.477824",
    "item": "haproxy_backend",
    "rc": 1,
    "start": "2020-07-23 18:58:15.924762"
}

STDOUT:

Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
<more of this log>
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
FAILURE: glance_api
glance_api is down on controller-1.internalapi.redhat.local
controller-2.internalapi.redhat.local
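
For context, the failing check essentially polls the haproxy statistics until no backend server is reported down, giving up after a timeout. A minimal sketch of such a poll (this is not the actual haproxy_backend.sh; the admin socket path and the timeout are assumptions) could look like this in Python:

import csv
import io
import socket
import time

STATS_SOCKET = "/var/lib/haproxy/stats"  # assumed haproxy admin socket path
TIMEOUT = 7 * 60                         # the job gave up after roughly 7 minutes

def read_stats():
    """Fetch the CSV produced by haproxy's 'show stat' admin command."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(STATS_SOCKET)
    sock.sendall(b"show stat\n")
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks).decode()

def down_servers(raw_csv):
    """Return (proxy, server) pairs whose status is neither UP nor OPEN."""
    rows = csv.DictReader(io.StringIO(raw_csv.lstrip("# ")))
    return [(r["pxname"], r["svname"]) for r in rows
            if r["status"] not in ("UP", "OPEN", "no check")]

deadline = time.time() + TIMEOUT
down = down_servers(read_stats())
while down and time.time() < deadline:
    print("Waiting for haproxy backend services to come up")
    time.sleep(10)
    down = down_servers(read_stats())
for pxname, svname in down:
    print(f"FAILURE: {pxname} is down on {svname}")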

The haproxy backend that has trouble starting is the glance_api one. An interesting point is that the very same check passes on the non-Ceph CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/128/artifact/.sh/ir-tripleo-ffu-upgrade-run.log

The haproxy backend statistics show the following (log obtained from a different environment that failed for the same reason; the stats log is not stored by the Jenkins job):

cinder,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
cinder,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,2,1,,0,,2,0,,0,L7OK,200,3,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
cinder,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,2,2,,0,,2,0,,0,L7OK,200,6,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
cinder,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7160,0,,1,2,3,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
cinder,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,3,3,0,,0,8587,0,,1,2,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
glance_api,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,3,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
glance_api,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,3,1,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
glance_api,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8587,8587,,1,3,2,,0,,2,0,,0,L4CON,,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,Layer4 connection problem,,2,5,0,,,,,,http,,,,,,,,
glance_api,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8587,8587,,1,3,3,,0,,2,0,,0,L4CON,,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,Layer4 connection problem,,2,5,0,,,,,,http,,,,,,,,
glance_api,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,1,1,0,,0,8587,0,,1,3,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
haproxy.stats,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,4,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
haproxy.stats,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,0,0,0,,0,8587,,,1,4,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
heat_api,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
heat_api,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,5,1,,0,,2,0,,0,L7OK,200,2,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_api,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,5,2,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_api,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7153,0,,1,5,3,,0,,2,0,,0,L7OK,200,2,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_api,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,3,3,0,,0,8587,0,,1,5,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
heat_cfn,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,6,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
heat_cfn,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,6,1,,0,,2,0,,0,L7OK,200,3,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_cfn,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,6,2,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_cfn,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7151,0,,1,6,3,,0,,2,0,,0,L7OK,200,2,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_cfn,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,3,3,0,,0,8587,0,,1,6,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
horizon,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,7,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
horizon,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7256,0,,1,7,1,,0,,2,0,,0,L7OK,301,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Moved Permanently,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,

Only the glance_api haproxy backend service seems to be down.
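
The L4CON / "Connection refused" check status above means the layer-4 health check cannot even open a TCP connection, i.e. nothing is listening on the glance-api backend port on controller-1 and controller-2. A quick way to confirm that from a controller (a sketch; port 9292 is glance-api's usual backend port and should be verified against the haproxy configuration):

import socket

# Hostnames taken from the stats output above; the port is an assumption.
HOSTS = [
    "controller-0.internalapi.redhat.local",
    "controller-1.internalapi.redhat.local",
    "controller-2.internalapi.redhat.local",
]
GLANCE_API_PORT = 9292

for host in HOSTS:
    try:
        # Same kind of TCP connect that haproxy's layer-4 check performs
        with socket.create_connection((host, GLANCE_API_PORT), timeout=5):
            print(f"{host}:{GLANCE_API_PORT} is accepting connections")
    except OSError as exc:
        print(f"{host}:{GLANCE_API_PORT} is unreachable: {exc}")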

CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/

CI job logs: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Run CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Jose Luis Franco 2020-07-24 05:37:21 UTC
Just as a note, this BZ is quite similar to https://bugzilla.redhat.com/show_bug.cgi?id=1850991. The CI job ran with the fix for 1850991 applied, and this new backend service still showed up as failed.

Comment 2 Jose Luis Franco 2020-07-24 05:41:35 UTC
Just to clarify, since the comment above could suggest that this issue is caused by that fix: the glance_api service was already failing before that fix was applied (in addition to ceph_dashboard):

Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
FAILURE: ceph_dashboard
ceph_dashboard
ceph_dashboard
ceph_dashboard
glance_api
glance_api is down on controller-0
controller-1
controller-2
BACKEND
controller-1.internalapi.redhat.local
controller-2.internalapi.redhat.local


Log: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/85/artifact/.sh/ir-tripleo-ffu-upgrade-run.log

Comment 3 Luca Miccini 2020-07-24 05:54:49 UTC
This seems to be a genuine failure:

http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/controller-1.tar.gz?controller-1/var/log/containers/glance/api.log

2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd [-] Error connecting to ceph cluster.: rados.TimedOut: [errno 110] error connecting to the cluster
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd Traceback (most recent call last):
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd   File "/usr/lib/python3.6/site-packages/glance_store/_drivers/rbd.py", line 273, in get_connection
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd     client.connect(timeout=self.connect_timeout)
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd   File "rados.pyx", line 893, in rados.Rados.connect
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd rados.TimedOut: [errno 110] error connecting to the cluster
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd
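
The same timeout can be reproduced outside of glance with a few lines of python3-rados from the glance_api container, mirroring the get_connection() call in the traceback (a sketch; the conffile, user and keyring values are assumptions based on a typical TripleO setup and should be checked against the rbd_store_* options in glance-api.conf):

import rados

# conffile/rados_id/keyring are assumed values; check glance-api.conf.
client = rados.Rados(conffile="/etc/ceph/ceph.conf",
                     rados_id="openstack",
                     conf={"keyring": "/etc/ceph/ceph.client.openstack.keyring"})
try:
    # glance_store's rbd driver issues the same connect() with a timeout
    client.connect(timeout=5)
    print("connected, cluster fsid:", client.get_fsid())
    client.shutdown()
except rados.TimedOut as exc:
    print("cannot reach the ceph cluster (monitors unreachable?):", exc)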

Comment 5 Francesco Pantano 2020-07-27 07:40:16 UTC
The documentation being produced for bz#1855813 (which is under heavy testing and review) can help avoid this bug; adding it as a dependency of this bug.

