+++ This bug was initially created as a clone of Bug #2233659 +++

Description of problem:

This is part of an OSP 17.1 deployment with Ceph 6. The following error is blocking the grafana container from starting:

Deploy daemon grafana.overcloud-controller-1 ...
Verifying port 3100 ...
Cannot bind to IP :: port 3100: [Errno 98] Address already in use
ERROR: TCP Port(s) '3100' required for grafana already in use

The in-use address is haproxy on a different interface. The config looks good. From "ceph orch ls --export":

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
  - overcloud-controller-1
  - overcloud-controller-2
networks:
- 2001:db8:1:9::/64
- 2001:db8:1:c::/64
- 2001:db8:1:b::/64
- 2001:db8:1:a::/64
- 2001:db8:1:d::/64
- 2001:db8:1:8::/64
spec:
  port: 3100
---

If I understand correctly, the "networks" option should limit binding to interfaces contained there. Here is overcloud-controller-0 interface information showing a valid interface for binding:

overcloud-controller-0]$ grep 2001:db8:1 ip_addr
16: vlan123    inet6 2001:db8:1:8::b5/64 scope global \       valid_lft forever preferred_lft forever

It should only bind to [2001:db8:1:8::b5]:3100.

This also seems to impact other services such as prometheus & alertmanager, likely for the same reason. I'll provide more details and logs in private comments.

Version-Release number of selected component (if applicable):
cephadm-17.2.6-70.el9cp.noarch
ceph 6 deployment

How reproducible:
this environment

Steps to Reproduce:
1. see notes above

Actual results:
grafana daemon attempting to bind to all interfaces and failing.

Expected results:
specific interface based on networks configuration.

Additional info:
In private comments.

--- Additional comment from Matt Flusche on 2023-08-22 20:41:32 UTC ---

SFDC case: 03568800
sosreports if needed: supportshell.cee.redhat.com:/cases/03568800

Let me know if I need to attach specific logs for review.
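The failure mode above can be reproduced outside of cephadm: the pre-deploy check binds the wildcard address (`::` / `0.0.0.0`), so a listener holding the port on any single interface trips [Errno 98], even though the address the service is meant to use is free. A minimal Python sketch of that conflict (illustrative only, not cephadm's actual check code):

```python
import errno
import socket

# Stand-in for haproxy: hold the port on one specific address only.
occupier = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
occupier.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
occupier.listen(1)
port = occupier.getsockname()[1]

# Wildcard-style check, as in "Verifying port 3100 ...": ask whether the
# port is free on *all* interfaces. It is not, so this reports a conflict.
check = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    check.bind(("0.0.0.0", port))
    wildcard_free = True
except OSError as exc:
    wildcard_free = exc.errno != errno.EADDRINUSE
finally:
    check.close()

print(wildcard_free)  # False: the specific bind blocks the wildcard bind
occupier.close()
```

Binding the check socket to the service's intended address instead (here, the vlan123 address) would succeed, which is the behavior the "networks" option seems to promise.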
--- Additional comment from Matt Flusche on 2023-08-22 20:49:21 UTC ---

Note, I obfuscated IPs for the public case:

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - devcloud-controller-0
  - devcloud-controller-1
  - devcloud-controller-2
networks:
- 2605:1c00:50f2:28a9::/64
- 2605:1c00:50f2:28ac::/64
- 2605:1c00:50f2:28ab::/64
- 2605:1c00:50f2:28aa::/64
- 2605:1c00:50f2:28ad::/64
- 2605:1c00:50f2:28a8::/64
spec:
  port: 3000
---

^^ port 3000 here was just a temporary test of switching this port; it should be 3100.

supportshell-1 03568800]$ grep cephadm /cases/03568800/sosreport-20230818-181157/devcloud-controller-0/var/log/messages | grep 3100 | grep grafana | tail -1
Aug 18 17:26:20 devcloud-controller-0 ceph-mon[32652]: Failed while placing grafana.devcloud-controller-1 on devcloud-controller-1: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana-devcloud-controller-1
/bin/podman: stderr Error: inspecting object: no such container ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana-devcloud-controller-1
Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana.devcloud-controller-1
/bin/podman: stderr Error: inspecting object: no such container ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana.devcloud-controller-1
Deploy daemon grafana.devcloud-controller-1 ...
Verifying port 3100 ...
Cannot bind to IP :: port 3100: [Errno 98] Address already in use
ERROR: TCP Port(s) '3100' required for grafana already in use

Showing the current listening haproxy service on a different IP.
supportshell-1 03568800]$ grep 3100 /cases/03568800/sosreport-20230818-181157/devcloud-controller-0/sos_commands/networking/netstat_-W_-neopa
tcp6       0      0 2605:1c00:50f2:2888::30:3100 :::*    LISTEN      0    393147895  241853/haproxy      off (0.00/0/0)

supportshell-1 03568800]$ grep 2605:1c00:50f2:28a8 /cases/03568800/sosreport-20230818-181157/devcloud-controller-0/ip_addr
16: vlan688    inet6 2605:1c00:50f2:28a8::b5/64 scope global \       valid_lft forever preferred_lft forever

--- Additional comment from Adam King on 2023-08-23 17:51:41 UTC ---

IIRC, the "networks" param is currently more for filtering to hosts that have the required networks than for actually having the daemon bind its ports on those specific networks. We have some preliminary work in https://github.com/ceph/ceph/pull/53008 that lets us at least check for conflicts correctly and makes binding to ports on specific IPs work for haproxy in particular, but we still need to follow up and get this working for other use cases. This is definitely something we can take as an RFE, and since it's something we already know is missing, I don't think we need any additional logs or info from the customer.

My biggest concern is actually the use of IPv6. We don't have any testing for IPv6 in the upstream CI, so we only have manual testing for that right now. Either way, we'll see what we can do and will plan this for 7.1 for now (it could potentially be cloned into a 6 release afterward as well).

--- Additional comment from Matt Flusche on 2023-08-25 14:51:41 UTC ---

Hi Adam,

Thanks for looking into this. I've done some lab testing and now I'm more confused about how the interface binding is done.

First I did a generic deployment with a single IPv4 interface, and the port binding worked fine.
---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
networks:
- 172.16.1.0/24
spec:
  port: 3100
---

From the log, it selected the 172.16.1.62 interface:

logger=http.server t=2023-08-24T18:24:28.690157102Z level=info msg="HTTP Server Listen" address=172.16.1.62:3100 protocol=https subUrl= socket=

And we see haproxy & grafana using :3100 on different interfaces, as expected:

[root@overcloud-controller-0 ceph-admin]# ss -tlnp | grep 3100
LISTEN 0      4096     172.16.1.62:3100      0.0.0.0:*    users:(("grafana",pid=473398,fd=7))
LISTEN 0      4096   192.168.2.101:3100      0.0.0.0:*    users:(("haproxy",pid=477438,fd=8))

I even tried with a list of IPv4 networks and it worked fine:

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
networks:
- 172.10.1.0/24
- 172.11.1.0/24
- 172.12.1.0/24
- 172.13.1.0/24
- 172.16.1.0/24
spec:
  port: 3100
---

Then I manually re-configured grafana with:

ceph orch apply -i /root/grafana.yaml

where /root/grafana.yaml has my original single-network config:

cat /root/grafana.yaml
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
networks:
- 172.16.1.0/24
spec:
  port: 3100

However, it would then try to bind to all interfaces:

[ceph: root@overcloud-controller-0 /]# ceph orch ls grafana --format json-pretty
[
  {
    "events": [
      "2023-08-24T22:02:23.577879Z service:grafana [ERROR] \"Failed while placing grafana.overcloud-controller-0 on overcloud-controller-0: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana-overcloud-controller-0\n/bin/podman: stderr Error: inspecting object: no such container ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana-overcloud-controller-0\nNon-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana.overcloud-controller-0\n/bin/podman:
stderr Error: inspecting object: no such container ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana.overcloud-controller-0\nDeploy daemon grafana.overcloud-controller-0 ...\nVerifying port 3100 ...\nCannot bind to IP 0.0.0.0 port 3100: [Errno 98] Address already in use\nERROR: TCP Port(s) '3100' required for grafana already in use\"",
      "2023-08-25T13:11:18.990582Z service:grafana [INFO] \"service was created\""
    ],
    "networks": [
      "172.16.1.0/24"
    ],
    "placement": {
      "hosts": [
        "overcloud-controller-0"
      ]
    },
    "service_name": "grafana",
    "service_type": "grafana",
    "spec": {
      "port": 3100
    },
    "status": {
      "created": "2023-08-25T14:37:21.601722Z",
      "ports": [
        3100
      ],
      "running": 0,
      "size": 1
    }
  }
]

Something else seems to determine how the grafana interface binding is chosen.

--- Additional comment from Francesco Pantano on 2023-10-16 06:51:28 UTC ---

--- Additional comment from Manny on 2023-10-17 19:37:40 UTC ---

Hello @adking ,

We have an active case tied to this BZ. It's already linked to this BZ.

Is the BZ accurate? Meaning, is it indeed a code issue? Is there a workaround? Is this just a procedural issue? If a code issue, can we get it into RHCS 6.1z3? Not looking for a promise.
Just some detail on this cluster:
~~~
$ ceph status

  cluster:
    id:     b32f20ee-a52f-503d-91a1-a1442eb7e7d9
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum devcloud-controller-0,devcloud-controller-2,devcloud-controller-1 (age 3d)
    mgr: devcloud-controller-0.jyayzd(active, since 6d), standbys: devcloud-controller-2.hpzokl, devcloud-controller-1.gifuhs
    osd: 24 osds: 24 up (since 6d), 24 in (since 2w)

  data:
    pools:   4 pools, 97 pgs
    objects: 43.33k objects, 218 GiB
    usage:   657 GiB used, 69 TiB / 70 TiB avail
    pgs:     97 active+clean

  io:
    client:   0 B/s rd, 3.0 KiB/s wr, 0 op/s rd, 0 op/s wr

$ ceph version
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)
~~~

Best regards,
Manny Caldeira
Software Maintenance Engineer
Red Hat Ceph Storage (RHCS)

--- Additional comment from Adam King on 2023-10-18 14:49:10 UTC ---

(In reply to Manny from comment #6)
> Hello @adking ,
>
> We have an active case tied to this BZ. It's already linked to this BZ.
>
> Is the BZ accurate? Meaning, is it indeed a code issue? Is there a
> workaround?
> Is this just a procedural issue?
> If a code issue, can we get it into RHCS 6.1z3? Not looking for a promise.
> Just some detail on this cluster:
> [ceph status and version output quoted above]

I can't commit to it, but I can have a look. It requires two changes: having cephadm only check port availability on the given network, and getting each daemon (prometheus, grafana, etc.) to actually bind only to the correct network. That second part will take a bit more research, so I'm unsure how long it will take.

--- Additional comment from Adam King on 2023-10-18 20:30:09 UTC ---

Early experimental work on this: https://github.com/ceph/ceph/pull/54083. It at least seems to work okay for grafana.

--- Additional comment from Manny on 2023-10-24 01:47:19 UTC ---

(In reply to Adam King from comment #8)
> Early experimental work on this https://github.com/ceph/ceph/pull/54083. At least seems to work okay for grafana.

Hello again Adam,

Good to hear that you've been able to get this working in any context, TY.

Is this an RFE or a bug fix?
Can this be fixed in RHCS 6.1.z-something?
If yes, can we get this BZ cloned so we have a BZ with an accurate target release?
Please let us know, TY

Best regards,
Manny

--- Additional comment from Adam King on 2023-10-24 17:42:48 UTC ---

(In reply to Manny from comment #9)
> Is this an RFE or a bug fix?
> Can this be fixed in RHCS 6.1.z-something?
> If yes, can we get this BZ cloned so we have a BZ with an accurate target release?

I consider this to be an RFE. However, we tend to backport quite a few RFEs on the cephadm side anyway. I don't know when 6.1z3 is meant to release, so I'm unsure whether we can have it there, but you should still be fine to clone it; if we can't make 6.1z3 we can still do 6.2.
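The two-part fix Adam outlines — pick the bind address from the service spec's "networks" list, then check port availability on that specific address rather than the wildcard — can be sketched roughly as follows. The helper names (`pick_bind_ip`, `port_free`) are hypothetical, not taken from the actual PRs:

```python
import ipaddress
import socket

def pick_bind_ip(host_addrs, networks):
    """Return the first host address that falls inside one of the service
    spec's 'networks' CIDRs; None means fall back to the wildcard."""
    nets = [ipaddress.ip_network(n) for n in networks]
    for addr in host_addrs:
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in nets):
            return addr
    return None

def port_free(ip, port):
    """Check availability on the specific address, not the wildcard."""
    fam = socket.AF_INET6 if ":" in ip else socket.AF_INET
    s = socket.socket(fam, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return True
    except OSError:
        return False
    finally:
        s.close()

# With the lab spec from the comments above (networks: 172.16.1.0/24),
# grafana's address is selected and haproxy's address is ignored.
print(pick_bind_ip(["192.168.2.101", "172.16.1.62"], ["172.16.1.0/24"]))
# -> 172.16.1.62
```

With this approach, a haproxy listener on 192.168.2.101:3100 would no longer block a grafana deployment bound to 172.16.1.62:3100, because neither the availability check nor the daemon touches the wildcard address.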
*** Bug 2246434 has been marked as a duplicate of this bug. ***
Missed 6.1 z3 development window. Retargeted to 6.1 z4.
This did not make it to the 6.1 z4 freeze date. Retargeting to 6.1 z5.
*** Bug 2254553 has been marked as a duplicate of this bug. ***