Bug 2224351

Summary: After cephadm adoption, haproxy fails to start when RGW is deployed
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Francesco Pantano <fpantano>
Component: Ceph-Ansible
Assignee: Teoman ONAY <tonay>
Status: CLOSED ERRATA
QA Contact: Sayalee <saraut>
Severity: high
Docs Contact:
Priority: unspecified
Version: 5.3
CC: adking, aramteke, arcsingh, ceph-eng-bugs, cephqe-warriors, gfidente, gmeno, jbadiapa, jhoylaer, jpretori, kthakre, mcaldeir, mkatari, msaini, saraut, sostapov, tonay, tserlin, vereddy
Target Milestone: ---
Target Release: 5.3z5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-ansible-6.0.28.6-1.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2224527 2229931
Environment:
Last Closed: 2023-08-28 09:40:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2104616, 2160009, 2229931, 2229959

Description Francesco Pantano 2023-07-20 13:54:07 UTC
Description of problem:


During the fast forward upgrade (FFU) from 16.2 to 17.1, when RGW is deployed
as part of a Director-deployed Ceph cluster, the procedure fails on the next
stack update.
Specifically, haproxy-bundle cannot start via Pacemaker because it fails to
bind to the RGW port (8080).
Inspecting the environment shows that, after adoption, RGW was not redeployed
on the storage network and instead listens on all addresses (*).
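
To illustrate why a daemon listening on * blocks haproxy, here is a minimal sketch of the same conflict using plain sockets on Linux. It is not taken from the failing environment: the function name is hypothetical, the loopback address stands in for the address haproxy tries to bind, and an ephemeral port stands in for 8080.

```python
import errno
import socket


def wildcard_blocks_specific_bind(port: int = 0) -> bool:
    """Return True if a listener on 0.0.0.0:<port> prevents a second
    bind on 127.0.0.1:<port> (the expected behaviour on Linux)."""
    # First socket plays the role of RGW listening on all addresses.
    wildcard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        wildcard.bind(("0.0.0.0", port))
        wildcard.listen(1)
        # With port=0 the kernel picks a free port; reuse it below.
        bound_port = wildcard.getsockname()[1]

        # Second socket plays the role of haproxy binding one address.
        specific = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            specific.bind(("127.0.0.1", bound_port))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        finally:
            specific.close()
        return False
    finally:
        wildcard.close()


if __name__ == "__main__":
    print("specific bind blocked:", wildcard_blocks_specific_bind())
```

When RGW is restricted to the storage network instead, the address haproxy needs on port 8080 stays free.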

The resulting spec gathered from the adopted cluster shows:


---
service_type: rgw
service_id: controller-0
service_name: rgw.controller-0
placement:
  count_per_host: 1
  hosts:
  - controller-0
spec:
  rgw_frontend_port: 8080
---
service_type: rgw
service_id: controller-1
service_name: rgw.controller-1
placement:
  count_per_host: 1
  hosts:
  - controller-1
spec:
  rgw_frontend_port: 8080
---
service_type: rgw
service_id: controller-2
service_name: rgw.controller-2
placement:
  count_per_host: 1
  hosts:
  - controller-2
spec:
  rgw_frontend_port: 8080


whereas Director normally deploys RGW with the following spec:

---
service_type: rgw
service_id: rgw
service_name: rgw.rgw
placement:
  hosts:
  - controller-0
  - controller-1
  - controller-2
networks:
- 172.17.3.0/24
spec:
  rgw_frontend_port: 8080
  rgw_realm: default
  rgw_zone: default
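
The difference between the two specs that matters here is the missing `networks` key. As a rough sketch, the check below flags RGW specs that do not pin a network. The specs are transcribed by hand as Python dicts to avoid assuming a YAML parser is installed, and the function and variable names are hypothetical.

```python
from typing import Any, Dict, List


def rgw_specs_without_networks(specs: List[Dict[str, Any]]) -> List[str]:
    """Return the service names of RGW specs that do not set `networks`."""
    return [
        spec.get("service_name", spec.get("service_id", "<unknown>"))
        for spec in specs
        if spec.get("service_type") == "rgw" and not spec.get("networks")
    ]


# Per-host specs as reported by the adopted cluster.
adopted = [
    {
        "service_type": "rgw",
        "service_id": f"controller-{i}",
        "service_name": f"rgw.controller-{i}",
        "placement": {"count_per_host": 1, "hosts": [f"controller-{i}"]},
        "spec": {"rgw_frontend_port": 8080},
    }
    for i in range(3)
]

# Spec as Director builds it.
director = {
    "service_type": "rgw",
    "service_id": "rgw",
    "service_name": "rgw.rgw",
    "placement": {"hosts": ["controller-0", "controller-1", "controller-2"]},
    "networks": ["172.17.3.0/24"],
    "spec": {"rgw_frontend_port": 8080, "rgw_realm": "default", "rgw_zone": "default"},
}

if __name__ == "__main__":
    print(rgw_specs_without_networks(adopted + [director]))
```

All three adopted services are reported, while the Director spec passes the check.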


The code responsible for the RGW adoption appears to be [1]. It should account
for the fact that, before adoption, the three RGW instances were bound to the
storage network.
The failure was observed in job [2], which can be used to build a reproducer.

[1] https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/cephadm-adopt.yml#L952
[2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Upgrades/job/DFG-storage-ffu-17.1-from-16.2-passed_phase2-3cont_2comp_3ceph-ipv4-ovn_dvr-ceph-nfs-ganesha/
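
For discussion only, the sketch below shows one way per-host RGW specs could be folded into a single spec that carries the storage network. It is not the ceph-ansible change and does not reflect how cephadm-adopt.yml is implemented. The function name, the `service_id` default, and the assumption that the caller already knows the storage network are all hypothetical. Realm and zone are left out because the adopted specs do not contain them.

```python
from typing import Any, Dict, List


def consolidate_rgw_specs(
    per_host_specs: List[Dict[str, Any]],
    networks: List[str],
    service_id: str = "rgw",  # hypothetical default
) -> Dict[str, Any]:
    """Merge per-host RGW specs into one spec bound to `networks`."""
    hosts: List[str] = []
    ports = set()
    for spec in per_host_specs:
        # Collect hosts in order, skipping duplicates.
        for host in spec.get("placement", {}).get("hosts", []):
            if host not in hosts:
                hosts.append(host)
        port = spec.get("spec", {}).get("rgw_frontend_port")
        if port is not None:
            ports.add(port)
    # Refuse to guess when the instances disagree on the port.
    if len(ports) != 1:
        raise ValueError(f"expected exactly one frontend port, got {sorted(ports)}")
    return {
        "service_type": "rgw",
        "service_id": service_id,
        "placement": {"hosts": hosts},
        "networks": list(networks),
        "spec": {"rgw_frontend_port": ports.pop()},
    }


if __name__ == "__main__":
    sample = [
        {
            "service_type": "rgw",
            "service_id": f"controller-{i}",
            "placement": {"count_per_host": 1, "hosts": [f"controller-{i}"]},
            "spec": {"rgw_frontend_port": 8080},
        }
        for i in range(3)
    ]
    print(consolidate_rgw_specs(sample, ["172.17.3.0/24"]))
```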

Comment 2 Scott Ostapovicz 2023-07-20 14:04:07 UTC
Too late to assign issues to 6.1 z1!  Retargeting to 6.1 z2.

Comment 8 Scott Ostapovicz 2023-07-24 13:38:03 UTC
Retargeting to 5.3 z5 seems like the right thing to do for this.  Thanks.

Comment 37 errata-xmlrpc 2023-08-28 09:40:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:4760