Bug 2224351 - After cephadm adoption, haproxy fails to start when RGW is deployed
Summary: After cephadm adoption, haproxy fails to start when RGW is deployed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.3z5
Assignee: Teoman ONAY
QA Contact: Sayalee
URL:
Whiteboard:
Depends On:
Blocks: 2160009 2104616 2229931 2229959
 
Reported: 2023-07-20 13:54 UTC by Francesco Pantano
Modified: 2024-03-11 12:57 UTC
CC List: 18 users

Fixed In Version: ceph-ansible-6.0.28.6-1.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2224527 2229931 (view as bug list)
Environment:
Last Closed: 2023-08-28 09:40:56 UTC
Embargoed:




Links:
  Red Hat Issue Tracker RHCEPH-7056 (last updated 2023-07-20 14:06:04 UTC)
  Red Hat Product Errata RHBA-2023:4760 (last updated 2023-08-28 09:41:40 UTC)

Description Francesco Pantano 2023-07-20 13:54:07 UTC
Description of problem:


During the FFU from 16.2 to 17.1, when RGW is deployed as part of the Director-deployed
Ceph cluster, the procedure fails on the subsequent stack update.
In particular, Pacemaker cannot start haproxy-bundle because haproxy fails to bind to
the RGW port (8080).
After digging into the environment, we found that RGW had not been redeployed on the
storage network and is bound to * (all interfaces), so it already occupies port 8080 on
the addresses haproxy needs.
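One way to confirm the conflict on a controller node is with the usual ss/cephadm/pcs
checks below (output omitted here since it varies per environment):

# radosgw holding *:8080 instead of an address on the storage network
ss -tlnp | grep ':8080'

# the rgw specs cephadm generated during adoption
ceph orch ls rgw --export

# the failed haproxy-bundle resource as reported by pacemaker
pcs status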

The resulting spec gathered from the adopted cluster shows:


---
service_type: rgw
service_id: controller-0
service_name: rgw.controller-0
placement:
  count_per_host: 1
  hosts:
  - controller-0
spec:
  rgw_frontend_port: 8080
---
service_type: rgw
service_id: controller-1
service_name: rgw.controller-1
placement:
  count_per_host: 1
  hosts:
  - controller-1
spec:
  rgw_frontend_port: 8080
---
service_type: rgw
service_id: controller-2
service_name: rgw.controller-2
placement:
  count_per_host: 1
  hosts:
  - controller-2
spec:
  rgw_frontend_port: 8080


whereas the Director normally builds the RGW spec as follows:

---
service_type: rgw
service_id: rgw
service_name: rgw.rgw
placement:
  hosts:
  - controller-0
  - controller-1
  - controller-2
networks:
- 172.17.3.0/24
spec:
  rgw_frontend_port: 8080
  rgw_realm: default
  rgw_zone: default

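As a manual workaround (a sketch only, not the ceph-ansible fix; the spec file path is
just an example), the per-host rgw services created by the adoption can be removed and a
single Director-style spec carrying the networks entry re-applied, so radosgw binds only
to its storage network address instead of all interfaces:

# remove the per-host services generated by the adoption
ceph orch rm rgw.controller-0
ceph orch rm rgw.controller-1
ceph orch rm rgw.controller-2

# re-apply one spec with the storage network, e.g. the Director-style spec above
# saved locally as /tmp/rgw_spec.yml
ceph orch apply -i /tmp/rgw_spec.yml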

The code responsible for the RGW adoption appears to be [1]; it should take into account
that the three RGW instances were bound to the storage network. The failure was observed
in job [2], which can be used to build a reproducer. A sketch of the kind of spec the
adoption would need to render follows the links below.

[1] https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/cephadm-adopt.yml#L952
[2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Upgrades/job/DFG-storage-ffu-17.1-from-16.2-passed_phase2-3cont_2comp_3ceph-ipv4-ovn_dvr-ceph-nfs-ganesha/
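
For reference, a minimal sketch of a templated (Jinja2) spec that [1] would need to render
so the networks entry survives adoption. This is an assumption about the shape of the fix,
not necessarily what ceph-ansible-6.0.28.6-1.el8cp actually ships; the variable names
(radosgw_address_block, radosgw_frontend_port) and the rgws group follow ceph-ansible's
group_vars conventions but should be treated as assumptions here:

---
service_type: rgw
service_id: rgw
service_name: rgw.rgw
placement:
  hosts:
{% for host in groups['rgws'] %}
  - {{ hostvars[host]['ansible_facts']['hostname'] }}
{% endfor %}
networks:
- {{ radosgw_address_block }}   # storage network CIDR, e.g. 172.17.3.0/24
spec:
  rgw_frontend_port: {{ radosgw_frontend_port | default(8080) }}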

Comment 2 Scott Ostapovicz 2023-07-20 14:04:07 UTC
Too late to assign issues to 6.1 z1!  Retargeting to 6.1 z2.

Comment 8 Scott Ostapovicz 2023-07-24 13:38:03 UTC
Retargeting to 5.3 z5 seems like the right thing to do for this.  Thanks.

Comment 37 errata-xmlrpc 2023-08-28 09:40:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:4760

