Bug 2224351 - After cephadm adoption, haproxy fails to start when RGW is deployed
Summary: After cephadm adoption, haproxy fails to start when RGW is deployed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.3z5
Assignee: Teoman ONAY
QA Contact: Sayalee
URL:
Whiteboard:
Depends On:
Blocks: 2160009 2104616 2229931 2229959
 
Reported: 2023-07-20 13:54 UTC by Francesco Pantano
Modified: 2024-03-11 12:57 UTC
CC List: 18 users

Fixed In Version: ceph-ansible-6.0.28.6-1.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2224527 2229931 (view as bug list)
Environment:
Last Closed: 2023-08-28 09:40:56 UTC
Embargoed:




Links:
  Red Hat Issue Tracker RHCEPH-7056 (last updated 2023-07-20 14:06:04 UTC)
  Red Hat Product Errata RHBA-2023:4760 (last updated 2023-08-28 09:41:40 UTC)

Description Francesco Pantano 2023-07-20 13:54:07 UTC
Description of problem:


During the FFU from 16.2 to 17.1, when RGW is deployed as part of the Director-deployed
Ceph cluster, the procedure fails on the subsequent stack update.
In particular, Pacemaker cannot start haproxy-bundle because haproxy fails to bind to
the RGW port (8080).
After digging into the environment, we found that RGW had not been redeployed on the
storage network and is bound to * (all interfaces), so it already occupies port 8080 on
the addresses haproxy needs.
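One way to confirm the conflict on a controller node is with the usual ss/cephadm/pcs
checks below (output omitted here since it varies per environment):

# radosgw holding *:8080 instead of an address on the storage network
ss -tlnp | grep ':8080'

# the rgw specs cephadm generated during adoption
ceph orch ls rgw --export

# the failed haproxy-bundle resource as reported by pacemaker
pcs status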

The resulting spec gathered from the adopted cluster shows:


---
service_type: rgw
service_id: controller-0
service_name: rgw.controller-0
placement:
  count_per_host: 1
  hosts:
  - controller-0
spec:
  rgw_frontend_port: 8080
---
service_type: rgw
service_id: controller-1
service_name: rgw.controller-1
placement:
  count_per_host: 1
  hosts:
  - controller-1
spec:
  rgw_frontend_port: 8080
---
service_type: rgw
service_id: controller-2
service_name: rgw.controller-2
placement:
  count_per_host: 1
  hosts:
  - controller-2
spec:
  rgw_frontend_port: 8080


whereas the Director normally builds the RGW spec as follows:

---
service_type: rgw
service_id: rgw
service_name: rgw.rgw
placement:
  hosts:
  - controller-0
  - controller-1
  - controller-2
networks:
- 172.17.3.0/24
spec:
  rgw_frontend_port: 8080
  rgw_realm: default
  rgw_zone: default

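As a manual workaround (a sketch only, not the ceph-ansible fix; the spec file path is
just an example), the per-host rgw services created by the adoption can be removed and a
single Director-style spec carrying the networks entry re-applied, so radosgw binds only
to its storage network address instead of all interfaces:

# remove the per-host services generated by the adoption
ceph orch rm rgw.controller-0
ceph orch rm rgw.controller-1
ceph orch rm rgw.controller-2

# re-apply one spec with the storage network, e.g. the Director-style spec above
# saved locally as /tmp/rgw_spec.yml
ceph orch apply -i /tmp/rgw_spec.yml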

The code responsible for the RGW adoption appears to be [1]; it should take into account
that the three RGW instances were bound to the storage network. The failure was observed
in job [2], which can be used to build a reproducer. A sketch of the kind of spec the
adoption would need to render follows the links below.

[1] https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/cephadm-adopt.yml#L952
[2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Upgrades/job/DFG-storage-ffu-17.1-from-16.2-passed_phase2-3cont_2comp_3ceph-ipv4-ovn_dvr-ceph-nfs-ganesha/
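
For reference, a minimal sketch of a templated (Jinja2) spec that [1] would need to render
so the networks entry survives adoption. This is an assumption about the shape of the fix,
not necessarily what ceph-ansible-6.0.28.6-1.el8cp actually ships; the variable names
(radosgw_address_block, radosgw_frontend_port) and the rgws group follow ceph-ansible's
group_vars conventions but should be treated as assumptions here:

---
service_type: rgw
service_id: rgw
service_name: rgw.rgw
placement:
  hosts:
{% for host in groups['rgws'] %}
  - {{ hostvars[host]['ansible_facts']['hostname'] }}
{% endfor %}
networks:
- {{ radosgw_address_block }}   # storage network CIDR, e.g. 172.17.3.0/24
spec:
  rgw_frontend_port: {{ radosgw_frontend_port | default(8080) }}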

Comment 2 Scott Ostapovicz 2023-07-20 14:04:07 UTC
Too late to assign issues to 6.1 z1!  Retargeting to 6.1 z2.

Comment 8 Scott Ostapovicz 2023-07-24 13:38:03 UTC
Retargeting to 5.3 z5 seems like the right thing to do for this.  Thanks.

Comment 37 errata-xmlrpc 2023-08-28 09:40:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:4760

