Bug 2356354

Summary: Skip port conflict check in case of RGW
Product: Red Hat Ceph Storage [Red Hat Storage]
Component: Cephadm
Version: 5.3
Status: POST
Severity: high
Priority: unspecified
Reporter: John Fulton <johfulto>
Assignee: Adam King <adking>
QA Contact: Sayalee <saraut>
CC: cephqe-warriors, mcaldeir, mobisht
Target Milestone: ---
Target Release: 5.3z9
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: If docs needed, set a value

Description John Fulton 2025-03-31 21:33:25 UTC
Please remove this check in the case of RGW:

  https://github.com/ceph/ceph/blob/v16.2.15/src/cephadm/cephadm#L1305

We may not always be using '0.0.0.0'.

In newer versions this is handled better:

  https://github.com/ceph/ceph/blob/v17.2.8/src/cephadm/cephadm#L1422
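For illustration, the difference between the two checks is roughly the following. This is a paraphrased Python sketch of the two behaviors, with made-up function names, not the actual cephadm source:

---
import socket

def port_in_use_v16_style(port):
    """v16.2.x behavior, paraphrased: probe only the wildcard address,
    regardless of which IP the daemon will actually bind."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", port))
        return False
    except OSError:
        return True
    finally:
        s.close()

def port_in_use_v17_style(ip, port):
    """v17.2.x behavior, paraphrased: probe the endpoint the daemon will
    actually bind, falling back to the wildcard only when no IP is known."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip or "0.0.0.0", port))
        return False
    except OSError:
        return True
    finally:
        s.close()
---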

Comment 2 Storage PM bot 2025-03-31 21:33:36 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 4 John Fulton 2025-04-04 12:45:45 UTC
How does this impact OpenStack customers who are running Ceph RGW on their OpenStack control plane?

In OpenStack, Ceph often runs on the controller nodes, which also run haproxy. The controller nodes have multiple interfaces, with haproxy listening on the interface used for API communications. Ceph services should only listen on the storage (ceph public) or storage management (ceph private) networks. However, some Ceph services try to listen on 0.0.0.0, and when they do, they conflict with the ports haproxy has bound on the OpenStack API network.
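The underlying collision is easy to reproduce with plain sockets. The following standalone sketch (using loopback addresses instead of the real deployment IPs) shows why a wildcard bind fails on Linux while a bind to a different specific address succeeds:

---
import socket

# Stand-in for haproxy: hold port 8080 on one specific address.
haproxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
haproxy.bind(("127.0.0.1", 8080))
haproxy.listen()

# Stand-in for the cephadm pre-flight check: a wildcard bind on the same
# port fails even though only one specific address is taken.
check = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    check.bind(("0.0.0.0", 8080))
except OSError as e:
    print("wildcard bind failed:", e)  # [Errno 98] Address already in use

# Stand-in for RGW: binding a *different* specific address still works,
# which is why RGW runs fine once it binds the IP from its spec.
rgw = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
rgw.bind(("127.0.0.2", 8080))  # all of 127/8 is local on Linux
rgw.listen()
print("specific bind succeeded:", rgw.getsockname())
---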

In this specific situation, the RGW spec is configured to have RGW come up on the ceph network:

---
[ceph: root@host42 /]# ceph orch ls --export
(...)
service_type: rgw
service_id: host42
service_name: rgw.host42
placement:
  count_per_host: 1
  hosts:
  - host42
networks:
- 10.1.42.0/24
- 10.0.42.0/24
- 10.2.42.0/24
extra_container_args:
- -v
- /etc/pki/ca-trust:/etc/pki/ca-trust:ro
spec:
  rgw_frontend_port: 8080
---
service_type: rgw
service_id: host43
service_name: rgw.host43
placement:
  count_per_host: 1
  hosts:
  - host43
networks:
- 10.1.42.0/24
- 10.0.42.0/24
- 10.2.42.0/24
extra_container_args:
- -v
- /etc/pki/ca-trust:/etc/pki/ca-trust:ro
spec:
  rgw_frontend_port: 8080
---
service_type: rgw
service_id: host44
service_name: rgw.host44
placement:
  count_per_host: 1
  hosts:
  - host44
networks:
- 10.1.42.0/24
- 10.0.42.0/24
- 10.2.42.0/24
extra_container_args:
- -v
- /etc/pki/ca-trust:/etc/pki/ca-trust:ro
spec:
  rgw_frontend_port: 8080

---
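(For reference, a spec file like the above is typically applied with 'ceph orch apply -i <spec-file>'; the 'networks' list is what should constrain which address the daemon binds.)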

However, error messages in 'ceph health detail' indicate that it is attempting to bind to *:8080, which conflicts with haproxy:

---
[ceph: root@host42 /]# ceph health detail
HEALTH_WARN Failed to place 1 daemon(s); 3 failed cephadm daemon(s)
[WRN] CEPHADM_DAEMON_PLACE_FAIL: Failed to place 1 daemon(s)
    Failed while placing rgw.host42.host42.foo on host42: cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5ffc7906-2722-4602-9478-e2fe6ad3ff49-rgw-host42-host42-foo
/bin/podman: stderr Error: error inspecting object: no such container ceph-5ffc7906-2722-4602-9478-e2fe6ad3ff49-rgw-host42-host42-foo
Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5ffc7906-2722-4602-9478-e2fe6ad3ff49-rgw.host42.host42.foo
/bin/podman: stderr Error: error inspecting object: no such container ceph-5ffc7906-2722-4602-9478-e2fe6ad3ff49-rgw.host42.host42.foo
Deploy daemon rgw.host42.host42.foo ...
Verifying port 8080 ...
Cannot bind to IP 0.0.0.0 port 8080: [Errno 98] Address already in use
ERROR: TCP Port(s) '8080' required for rgw already in use
---
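Before resorting to stopping haproxy, the conflicting listener can be confirmed from the host. A hypothetical diagnostic using the third-party psutil library (run as root so PIDs are visible):

---
import psutil  # third-party; needs root to map sockets to PIDs

# List every LISTEN socket on TCP port 8080 -- here it should point at haproxy.
for conn in psutil.net_connections(kind="tcp"):
    if conn.status == psutil.CONN_LISTEN and conn.laddr.port == 8080:
        name = psutil.Process(conn.pid).name() if conn.pid else "?"
        print(f"{conn.laddr.ip}:{conn.laddr.port} pid={conn.pid} ({name})")
---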

To work around the issue we temporarily stopped haproxy with 'pcs resource disable haproxy-bundle'. Once haproxy was stopped, RGW started on its own and bound to the expected network instead of 0.0.0.0:

---
[ceph: root@host42 /]# ceph orch ps
(...)
rgw.host42.host42.qfeedh  host42  10.0.42.20:8080  running (62s)    58s ago  62s    60.1M        -  16.2.10-275.el8cp  d7a74ab527fa  b60d550cdc91
rgw.host43.host43.ykpwef  host43  10.0.42.21:8080  running (65s)    58s ago  64s    58.9M        -  16.2.10-275.el8cp  d7a74ab527fa  ddea7b33bfc9
rgw.host44.host44.tsepgo  host44  10.0.42.22:8080  running (56s)    51s ago  55s    62.2M        -  16.2.10-275.el8cp  d7a74ab527fa  c1e87e8744ce
---

It appears that as RGW comes up, something first checks port availability on 0.0.0.0:8080 before the daemon actually binds to the network(s) specified in the spec, and if that check fails the daemon is never started.

This BZ tracks removing that unnecessary check, since RGW can start on the IP specified in the spec.
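For illustration only, the requested change would amount to something like this in cephadm's port verification; 'verify_tcp_ports' and its shape are invented for this sketch, not the actual patch:

---
import socket

def _wildcard_port_in_use(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", port))
        return False
    except OSError:
        return True
    finally:
        s.close()

def verify_tcp_ports(daemon_type, ports):
    # RGW binds the IP(s) from its service spec, so the wildcard probe is
    # wrong for it; skip the pre-flight check and let RGW bind its own IP.
    if daemon_type == "rgw":
        return
    busy = [p for p in ports if _wildcard_port_in_use(p)]
    if busy:
        raise RuntimeError(
            f"TCP Port(s) '{','.join(map(str, busy))}' required "
            f"for {daemon_type} already in use")
---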