Description of problem:

The ceph-enabled FFU from 13 to 16.1 CI job fails in a post-upgrade check that verifies that all haproxy backend services are up and running:

TASK [tripleo-upgrade : Running post upgrade scripts for controller-0] *********
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/upgrade/controller_node_upgrade.yml:2
Thursday 23 July 2020 22:58:08 +0000 (0:00:00.193) 9:26:54.055 *********

changed: [undercloud-0] => (item=haproxy) => {
    "changed": true,
    "cmd": "set -o pipefail && /home/stack/controller-0_post/haproxy.sh",
    "delta": "0:00:06.182722",
    "end": "2020-07-23 18:58:15.137372",
    "item": "haproxy",
    "rc": 0,
    "start": "2020-07-23 18:58:08.954650"
}

STDOUT:

Waiting for haproxy pcs resource to start
3 instances of haproxy-bundle are started

failed: [undercloud-0] (item=haproxy_backend) => {
    "changed": true,
    "cmd": "set -o pipefail && /home/stack/controller-0_post/haproxy_backend.sh",
    "delta": "0:07:05.553062",
    "end": "2020-07-23 19:05:21.477824",
    "item": "haproxy_backend",
    "rc": 1,
    "start": "2020-07-23 18:58:15.924762"
}

STDOUT:

Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
<more of this log>
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
FAILURE: glance_api glance_api is down on controller-1.internalapi.redhat.local controller-2.internalapi.redhat.local

The haproxy backend having trouble starting is the glance_api one. An interesting point is that the very same check works fine on the non-ceph CI job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/128/artifact/.sh/ir-tripleo-ffu-upgrade-run.log

The HAProxy backend statistics show the following (log obtained from a different environment that failed for the same reason; the stats log is not stored by the Jenkins job):

cinder,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
cinder,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,2,1,,0,,2,0,,0,L7OK,200,3,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
cinder,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,2,2,,0,,2,0,,0,L7OK,200,6,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
cinder,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7160,0,,1,2,3,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
cinder,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,3,3,0,,0,8587,0,,1,2,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
glance_api,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,3,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
glance_api,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,3,1,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
glance_api,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8587,8587,,1,3,2,,0,,2,0,,0,L4CON,,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,Layer4 connection problem,,2,5,0,,,,,,http,,,,,,,,
glance_api,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8587,8587,,1,3,3,,0,,2,0,,0,L4CON,,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,Layer4 connection problem,,2,5,0,,,,,,http,,,,,,,,
glance_api,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,1,1,0,,0,8587,0,,1,3,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
haproxy.stats,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,4,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
haproxy.stats,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,0,0,0,,0,8587,,,1,4,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
heat_api,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
heat_api,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,5,1,,0,,2,0,,0,L7OK,200,2,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_api,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,5,2,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_api,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7153,0,,1,5,3,,0,,2,0,,0,L7OK,200,2,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_api,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,3,3,0,,0,8587,0,,1,5,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
heat_cfn,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,6,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
heat_cfn,controller-0.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,6,1,,0,,2,0,,0,L7OK,200,3,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_cfn,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,8587,0,,1,6,2,,0,,2,0,,0,L7OK,200,5,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_cfn,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7151,0,,1,6,3,,0,,2,0,,0,L7OK,200,2,0,0,0,0,0,0,,,,,0,0,,,,,-1,OK,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,
heat_cfn,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,UP,3,3,0,,0,8587,0,,1,6,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,,,,,,,,
horizon,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,7,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,0,0,0,0,0,
horizon,controller-2.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,1,1,7256,0,,1,7,1,,0,,2,0,,0,L7OK,301,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Moved Permanently,,0,0,0,0,,,,Layer7 check passed,,2,5,6,,,,,,http,,,,,,,,

Only the glance_api haproxy backend service seems to be down.
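The raw rows above are HAProxy "show stat" CSV output, and only a handful of columns are needed to read them: pxname (0), svname (1), status (17), check_status (36) and last_chk (56). Here is a minimal decoding sketch in Python for the failing glance_api row; the column indices follow the documented HAProxy stats layout and can shift between HAProxy releases, so treat them as an assumption:

# Decode the interesting columns of the failing glance_api row above.
row = ("glance_api,controller-1.internalapi.redhat.local,0,0,0,0,,0,0,0,,0,,"
       "0,0,0,0,DOWN,1,1,0,1,1,8587,8587,,1,3,2,,0,,2,0,,0,L4CON,,0,0,0,0,"
       "0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,"
       "Layer4 connection problem,,2,5,0,,,,,,http,,,,,,,,")
fields = row.split(",")
print("proxy        :", fields[0])   # glance_api
print("server       :", fields[1])   # controller-1.internalapi.redhat.local
print("status       :", fields[17])  # DOWN
print("check_status :", fields[36])  # L4CON: the TCP connect itself failed
print("last_chk     :", fields[56])  # Connection refused

In other words, haproxy cannot even open a TCP connection to the glance_api port on controller-1 and controller-2; this is a layer-4 refusal, not an HTTP-level health-check failure.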
CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/
CI job logs: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/
2.
3.

Actual results:

Expected results:

Additional info:
Just as a note, this BZ is pretty similar to https://bugzilla.redhat.com/show_bug.cgi?id=1850991. The CI job already ran with the fix for bug 1850991 in place, and this new backend service still appeared as failed.
Just to clarify, since the comment above could suggest that this issue is caused by that fix: the glance_api service was already failing before it (in addition to the ceph_dashboard):

Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
Waiting for haproxy backend services to come up
FAILURE: ceph_dashboard ceph_dashboard ceph_dashboard ceph_dashboard glance_api glance_api is down on controller-0 controller-1 controller-2 BACKEND controller-1.internalapi.redhat.local controller-2.internalapi.redhat.local

Log: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/85/artifact/.sh/ir-tripleo-ffu-upgrade-run.log
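For reference, the check producing this output loops until every backend server reports UP, or gives up after a deadline. Below is a hypothetical Python equivalent of what haproxy_backend.sh appears to do; the stats socket path, poll interval and deadline are assumptions for illustration, not values taken from the actual script:

#!/usr/bin/env python3
# Poll the HAProxy admin socket until no backend server is DOWN,
# printing the same "Waiting ..." / "FAILURE: ..." style of output.
import socket
import sys
import time

STATS_SOCKET = "/var/lib/haproxy/stats"  # assumed admin-socket path
DEADLINE = time.time() + 420             # the failed run polled for ~7 minutes

def down_servers():
    """Return (proxy, server) pairs whose status column (17) is DOWN."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(STATS_SOCKET)
    sock.sendall(b"show stat\n")
    data = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.close()
    down = []
    for line in data.decode().splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(",")
        if len(fields) > 17 and fields[17].startswith("DOWN"):
            down.append((fields[0], fields[1]))
    return down

while True:
    down = down_servers()
    if not down:
        sys.exit(0)
    if time.time() > DEADLINE:
        for proxy, server in down:
            print("FAILURE: {} is down on {}".format(proxy, server))
        sys.exit(1)
    print("Waiting for haproxy backend services to come up")
    time.sleep(10)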
This seems to be a genuine failure:
http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/86/controller-1.tar.gz?controller-1/var/log/containers/glance/api.log

2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd [-] Error connecting to ceph cluster.: rados.TimedOut: [errno 110] error connecting to the cluster
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd Traceback (most recent call last):
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd   File "/usr/lib/python3.6/site-packages/glance_store/_drivers/rbd.py", line 273, in get_connection
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd     client.connect(timeout=self.connect_timeout)
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd   File "rados.pyx", line 893, in rados.Rados.connect
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd rados.TimedOut: [errno 110] error connecting to the cluster
2020-07-23 17:22:53.260 7 ERROR glance_store._drivers.rbd
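The traceback shows glance_store's rbd driver timing out inside client.connect(). The symptom can be confirmed independently of glance with a minimal rados connectivity probe run from the glance container; the conffile path and the cephx user below are assumptions about this deployment, not values taken from the job:

#!/usr/bin/env python3
# Minimal ceph reachability probe mirroring the failing call path
# (glance_store/_drivers/rbd.py: client.connect(timeout=...)).
import rados

client = rados.Rados(conffile="/etc/ceph/ceph.conf",
                     rados_id="openstack")  # assumed cephx user name
try:
    client.connect(timeout=5)
    print("cluster reachable, fsid:", client.get_fsid())
except rados.TimedOut as exc:
    # errno 110 here matches the error in glance's api.log above
    print("cannot reach ceph cluster:", exc)
finally:
    client.shutdown()

If this probe also times out, the problem sits between the controller and the ceph monitors rather than inside glance itself.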
The documented procedure from bz#1855813 (which is under heavy testing and review) can avoid this bug; adding it as a dependency of this bug.