Bug 2356354
| Summary: | Skip port conflict check in case of RGW | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | John Fulton <johfulto> |
| Component: | Cephadm | Assignee: | Adam King <adking> |
| Status: | POST --- | QA Contact: | Sayalee <saraut> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 5.3 | CC: | cephqe-warriors, mcaldeir, mobisht |
| Target Milestone: | --- | | |
| Target Release: | 5.3z9 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
John Fulton
2025-03-31 21:33:25 UTC
How does this impact OpenStack customers who are running Ceph RGW on their OpenStack control plane?
In OpenStack deployments, Ceph often runs on the controller nodes, which also run haproxy. The controller nodes have multiple interfaces, with haproxy listening on the interface used for API communications. Ceph services should only listen on the storage (ceph public) or storage management (ceph private) networks. However, it appears that some Ceph services try to listen on 0.0.0.0, and when they do, they run into conflicts with the ports haproxy has bound on the OpenStack API network.
In this specific situation, the RGW spec is configured to have RGW come up on the ceph network:
---
[ceph: root@host42 /]# ceph orch ls --export
(...)
service_type: rgw
service_id: host42
service_name: rgw.host42
placement:
count_per_host: 1
hosts:
- host42
networks:
- 10.1.42.0/24
- 10.0.42.0/24
- 10.2.42.0/24
extra_container_args:
- -v
- /etc/pki/ca-trust:/etc/pki/ca-trust:ro
spec:
rgw_frontend_port: 8080
---
service_type: rgw
service_id: host43
service_name: rgw.host43
placement:
count_per_host: 1
hosts:
- host43
networks:
- 10.1.42.0/24
- 10.0.42.0/24
- 10.2.42.0/24
extra_container_args:
- -v
- /etc/pki/ca-trust:/etc/pki/ca-trust:ro
spec:
rgw_frontend_port: 8080
---
service_type: rgw
service_id: host44
service_name: rgw.host44
placement:
count_per_host: 1
hosts:
- host44
networks:
- 10.1.42.0/24
- 10.0.42.0/24
- 10.2.42.0/24
extra_container_args:
- -v
- /etc/pki/ca-trust:/etc/pki/ca-trust:ro
spec:
rgw_frontend_port: 8080
---
However, error messages in 'ceph health detail' indicate that cephadm is attempting to bind on *:8080, which conflicts with haproxy:
---
[ceph: root@host42 /]# ceph health detail
HEALTH_WARN Failed to place 1 daemon(s); 3 failed cephadm daemon(s)
[WRN] CEPHADM_DAEMON_PLACE_FAIL: Failed to place 1 daemon(s)
Failed while placing rgw.host42.host42.foo on host42: cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5ffc7906-2722-4602-9478-e2fe6ad3ff49-rgw-host42-host42-foo
/bin/podman: stderr Error: error inspecting object: no such container ceph-5ffc7906-2722-4602-9478-e2fe6ad3ff49-rgw-host42-host42-foo
Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5ffc7906-2722-4602-9478-e2fe6ad3ff49-rgw.host42.host42.foo
/bin/podman: stderr Error: error inspecting object: no such container ceph-5ffc7906-2722-4602-9478-e2fe6ad3ff49-rgw.host42.host42.foo
Deploy daemon rgw.host42.host42.foo ...
Verifying port 8080 ...
Cannot bind to IP 0.0.0.0 port 8080: [Errno 98] Address already in use
ERROR: TCP Port(s) '8080' required for rgw already in use
---
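The failure mode above can be reproduced without Ceph at all: on Linux, a socket bound to a specific address occupies the port for wildcard (0.0.0.0) binds as well, while a bind to a *different* specific address on the same port still succeeds. The sketch below demonstrates this using loopback addresses as stand-ins (127.0.0.1 for the haproxy/API address, 127.0.0.2 for the RGW/storage address; binding 127.0.0.2 assumes Linux's default 127.0.0.0/8 loopback). It is an illustration of the kernel behavior, not cephadm's actual check code.

```python
import errno
import socket

def bind_or_none(ip, port):
    """Try to bind a TCP socket to ip:port; return the bound socket,
    or None if the address is already in use (Errno 98 / EADDRINUSE)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return s
    except OSError as e:
        s.close()
        if e.errno == errno.EADDRINUSE:
            return None
        raise

# "haproxy" grabs an ephemeral port on one specific address
# (127.0.0.1 stands in for the OpenStack API network IP).
haproxy = bind_or_none("127.0.0.1", 0)
port = haproxy.getsockname()[1]

# What cephadm's pre-deploy check effectively does: test the wildcard.
wildcard = bind_or_none("0.0.0.0", port)
# What RGW would actually do per its spec: bind a different address
# (127.0.0.2 stands in for the storage network IP; assumes Linux loopback /8).
specific = bind_or_none("127.0.0.2", port)

print("wildcard bind ok:", wildcard is not None)  # fails while haproxy holds the port
print("specific bind ok:", specific is not None)  # succeeds despite haproxy
```

This matches the observed behavior: the wildcard pre-check reports the port as taken even though the address RGW is configured to bind remains free.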
To work around this issue we temporarily stopped haproxy using 'pcs resource disable haproxy-bundle'. Once haproxy was stopped, RGW started up on its own and bound to the expected network instead of 0.0.0.0:
---
[ceph: root@host42 /]# ceph orch ps
(...)
rgw.host42.host42.qfeedh host42 10.0.42.20:8080 running (62s) 58s ago 62s 60.1M - 16.2.10-275.el8cp d7a74ab527fa b60d550cdc91
rgw.host43.host43.ykpwef host43 10.0.42.21:8080 running (65s) 58s ago 64s 58.9M - 16.2.10-275.el8cp d7a74ab527fa ddea7b33bfc9
rgw.host44.host44.tsepgo host44 10.0.42.22:8080 running (56s) 51s ago 55s 62.2M - 16.2.10-275.el8cp d7a74ab527fa c1e87e8744ce
---
It appears that something checks for port availability on 0.0.0.0:8080 while RGW is being deployed, before the daemon actually binds to the network(s) specified in the spec. If that check fails, the daemon is never started.
This BZ is to track removing that unnecessary check, since RGW can start on the IP specified in the spec even while the port is in use on another interface.
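The requested behavior amounts to a small guard in the deploy path: only verify that 0.0.0.0:<port> is free when the daemon will actually bind the wildcard address. A minimal sketch follows; the `DeploySpec` type and field names are hypothetical stand-ins, not cephadm's real API.

```python
# Illustrative sketch only: DeploySpec and should_check_wildcard_port are
# hypothetical names, not cephadm's actual classes or functions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DeploySpec:
    service_type: str
    frontend_port: int
    networks: List[str] = field(default_factory=list)

def should_check_wildcard_port(spec: DeploySpec) -> bool:
    """Skip the 0.0.0.0 pre-deploy port check for RGW services that pin
    specific networks, since those daemons bind an IP from the spec's
    networks rather than the wildcard address."""
    return not (spec.service_type == "rgw" and spec.networks)

rgw = DeploySpec("rgw", 8080, networks=["10.0.42.0/24"])
print(should_check_wildcard_port(rgw))  # skipped: networks are pinned
```

With a guard like this, an RGW spec that lists networks would deploy even while haproxy holds the same port on another interface, and the kernel would still reject a genuine conflict at bind time.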