Bug 2246440 - cephadm tries to bind grafana daemon to all (::) interfaces when a valid networks list is provided.
Summary: cephadm tries to bind grafana daemon to all (::) interfaces when valid network...
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 6.1z7
Assignee: Adam King
QA Contact: Mohit Bisht
URL:
Whiteboard:
Duplicates: 2246434 2254553 (view as bug list)
Depends On: 2233659
Blocks: 2160009 1997638 2236231 2254553
 
Reported: 2023-10-26 20:22 UTC by Manny
Modified: 2024-06-14 15:09 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2233659
Clones: 2254553 (view as bug list)
Environment:
Last Closed:
Embargoed:




Links
System ID                                   Last Updated
Red Hat Issue Tracker RHCEPH-7805           2023-10-26 20:23:16 UTC
Red Hat Knowledge Base (Solution) 7041333   2024-02-09 18:51:58 UTC

Description Manny 2023-10-26 20:22:37 UTC
+++ This bug was initially created as a clone of Bug #2233659 +++

Description of problem:


This is part of an OSP 17.1 deployment with Ceph 6; the following error is blocking the grafana container from starting:

Deploy daemon grafana.overcloud-controller-1 ...
Verifying port 3100 ...
Cannot bind to IP :: port 3100: [Errno 98] Address already in use
ERROR: TCP Port(s) '3100' required for grafana already in use

The in-use address belongs to haproxy on a different interface.


The config looks good.  From "ceph orch ls --export"

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
  - overcloud-controller-1
  - overcloud-controller-2
networks:
- 2001:db8:1:9::/64
- 2001:db8:1:c::/64
- 2001:db8:1:b::/64
- 2001:db8:1:a::/64
- 2001:db8:1:d::/64
- 2001:db8:1:8::/64
spec:
  port: 3100
---

If I understand correctly, the "networks" option should limit binding to interfaces on those networks.

Here is overcloud-controller-0 interface information showing a valid interface for binding.

overcloud-controller-0]$ grep 2001:db8:1 ip_addr 
16: vlan123    inet6 2001:db8:1:8::b5/64 scope global \       valid_lft forever preferred_lft forever

It should bind only to [2001:db8:1:8::b5]:3100.
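
For illustration, the failing check can be reproduced with a small Python sketch (an illustration of the symptom only, not cephadm's actual code; the addresses and port are taken from this report): a bind to the wildcard address "::" collides with any listener on that port regardless of interface, while a bind to the host's address in the spec'd network would succeed.

~~~
#!/usr/bin/env python3
# Minimal sketch of a bind check (not cephadm's code). On the affected host,
# haproxy already listens on port 3100 on a different IPv6 address, so binding
# to the wildcard "::" fails with EADDRINUSE even though the spec'd address is free.
import socket

def port_free(addr: str, port: int) -> bool:
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    try:
        s.bind((addr, port))
        return True
    except OSError as e:  # e.g. [Errno 98] Address already in use
        print(f"bind [{addr}]:{port} failed: {e}")
        return False
    finally:
        s.close()

# Values mirroring this report (results are what the affected host would see):
port_free("::", 3100)                # False: collides with haproxy -> deploy aborts
port_free("2001:db8:1:8::b5", 3100)  # True: the address from the spec'd network is free
~~~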

This also seems to impact other services such as prometheus and alertmanager, likely due to the same underlying issue.

I'll provide more details and logs in private comments.


Version-Release number of selected component (if applicable):

cephadm-17.2.6-70.el9cp.noarch
ceph 6 deployment

How reproducible:
this environment


Steps to Reproduce:
1. see notes above

Actual results:

The grafana daemon attempts to bind to all interfaces and fails.


Expected results:

The daemon binds only to the specific interface selected by the networks configuration.

Additional info:

In private comments.

--- Additional comment from Matt Flusche on 2023-08-22 20:41:32 UTC ---

SFDC case: 03568800

sosreports if needed: supportshell.cee.redhat.com:/cases/03568800

Let me know if I need to attach specific logs for review.

--- Additional comment from Matt Flusche on 2023-08-22 20:49:21 UTC ---

Note: I obfuscated IPs for the public case:

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - devcloud-controller-0
  - devcloud-controller-1
  - devcloud-controller-2
networks:
- 2605:1c00:50f2:28a9::/64
- 2605:1c00:50f2:28ac::/64
- 2605:1c00:50f2:28ab::/64
- 2605:1c00:50f2:28aa::/64
- 2605:1c00:50f2:28ad::/64
- 2605:1c00:50f2:28a8::/64
spec:
  port: 3000
---

^^ Port 3000 here was just a temporary test of switching this port; it should be 3100.


supportshell-1 03568800]$ grep cephadm /cases/03568800/sosreport-20230818-181157/devcloud-controller-0/var/log/messages|grep 3100|grep grafana |tail -1
Aug 18 17:26:20 devcloud-controller-0 ceph-mon[32652]: Failed while placing grafana.devcloud-controller-1 on devcloud-controller-1: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana-devcloud-controller-1#012/bin/podman: stderr Error: inspecting object: no such container ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana-devcloud-controller-1#012Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana.devcloud-controller-1#012/bin/podman: stderr Error: inspecting object: no such container ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana.devcloud-controller-1#012Deploy daemon grafana.devcloud-controller-1 ...#012Verifying port 3100 ...#012Cannot bind to IP :: port 3100: [Errno 98] Address already in use#012ERROR: TCP Port(s) '3100' required for grafana already in use

Showing the currently listening haproxy service on a different IP:

supportshell-1 03568800]$ grep 3100 /cases/03568800/sosreport-20230818-181157/devcloud-controller-0/sos_commands/networking/netstat_-W_-neopa 
tcp6       0      0 2605:1c00:50f2:2888::30:3100 :::*                    LISTEN      0          393147895  241853/haproxy       off (0.00/0/0)

supportshell-1 03568800]$ grep 2605:1c00:50f2:28a8 /cases/03568800/sosreport-20230818-181157/devcloud-controller-0/ip_addr 
16: vlan688    inet6 2605:1c00:50f2:28a8::b5/64 scope global \       valid_lft forever preferred_lft forever

--- Additional comment from Adam King on 2023-08-23 17:51:41 UTC ---

Iirc, currently the "networks" param is more for filtering to hosts that have the required networks than actually having the daemon bind its ports on those specific networks. We have some preliminary work in https://github.com/ceph/ceph/pull/53008 that allows us to at least check the conflicts correctly and makes binding to ports on specific IPs work for haproxy in particular, but we still need to follow up and get this working for other use cases. It's definitely something we can take as an RFE, though, and since this is something we know is missing, I don't think we need any additional logs or info from the customer. My biggest concern is actually the use of IPv6: we don't have any testing for IPv6 in the upstream CI, so we only have manual testing for that right now. Either way, we'll see what we can do and will plan this for 7.1 for now (it could potentially be cloned into a 6 release afterward as well).
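
To illustrate the distinction described above, host filtering versus actually choosing a bind address inside one of the spec'd networks could look roughly like the sketch below (pick_bind_ip and the sample data are hypothetical illustrations, not cephadm internals):

~~~
import ipaddress

def pick_bind_ip(host_addrs, spec_networks):
    """Return the first host address that falls inside one of the spec'd
    networks, or None. A sketch of the desired behaviour, not cephadm code."""
    nets = [ipaddress.ip_network(n) for n in spec_networks]
    for addr in host_addrs:
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in nets):
            return addr
    return None

# Hypothetical data based on this report:
host_addrs = ["fe80::1", "2001:db8:1:8::b5"]
spec_networks = ["2001:db8:1:9::/64", "2001:db8:1:8::/64"]
print(pick_bind_ip(host_addrs, spec_networks))  # -> 2001:db8:1:8::b5
~~~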

--- Additional comment from Matt Flusche on 2023-08-25 14:51:41 UTC ---

Hi Adam,

Thanks for looking into this. I've done some lab testing and now I'm more confused about how the interface binding is done.

First I just did a generic deployment with a single ipv4 interface and the port binding worked fine.

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
networks:
- 172.16.1.0/24
spec:
  port: 3100
---

From the log, it selected the 172.16.1.62 interface:

logger=http.server t=2023-08-24T18:24:28.690157102Z level=info msg="HTTP Server Listen" address=172.16.1.62:3100 protocol=https subUrl= socket=

And we see haproxy and grafana using port 3100 on different interfaces, as expected.

[root@overcloud-controller-0 ceph-admin]# ss -tlnp |grep 3100
LISTEN 0      4096     172.16.1.62:3100       0.0.0.0:*    users:(("grafana",pid=473398,fd=7))
LISTEN 0      4096   192.168.2.101:3100       0.0.0.0:*    users:(("haproxy",pid=477438,fd=8))


I even tried with a list of IPv4 networks and it worked fine.

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
networks:
- 172.10.1.0/24
- 172.11.1.0/24
- 172.12.1.0/24
- 172.13.1.0/24
- 172.16.1.0/24
spec:
  port: 3100
---


Then I was manually re-configuring grafana with: ceph orch apply -i /root/grafana.yaml

where /root/grafana.yaml has my original single-network config:

 cat /root/grafana.yaml 
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
networks:
- 172.16.1.0/24
spec:
  port: 3100

However, it would then try to bind to all interfaces:

[ceph: root@overcloud-controller-0 /]# ceph orch ls grafana --format json-pretty

[
  {
    "events": [
      "2023-08-24T22:02:23.577879Z service:grafana [ERROR] \"Failed while placing grafana.overcloud-controller-0 on overcloud-controller-0: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana-overcloud-controller-0\n/bin/podman: stderr Error: inspecting object: no such container ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana-overcloud-controller-0\nNon-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana.overcloud-controller-0\n/bin/podman: stderr Error: inspecting object: no such container ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana.overcloud-controller-0\nDeploy daemon grafana.overcloud-controller-0 ...\nVerifying port 3100 ...\nCannot bind to IP 0.0.0.0 port 3100: [Errno 98] Address already in use\nERROR: TCP Port(s) '3100' required for grafana already in use\"",
      "2023-08-25T13:11:18.990582Z service:grafana [INFO] \"service was created\""
    ],
    "networks": [
      "172.16.1.0/24"
    ],
    "placement": {
      "hosts": [
        "overcloud-controller-0"
      ]
    },
    "service_name": "grafana",
    "service_type": "grafana",
    "spec": {
      "port": 3100
    },
    "status": {
      "created": "2023-08-25T14:37:21.601722Z",
      "ports": [
        3100
      ],
      "running": 0,
      "size": 1
    }
  }
]


There seems to be somewhere else where it determines how to bind the grafana interface.

--- Additional comment from Francesco Pantano on 2023-10-16 06:51:28 UTC ---



--- Additional comment from Manny on 2023-10-17 19:37:40 UTC ---

Hello @adking ,

We have an active case tied to this BZ. It's already linked to this BZ.

 Is the BZ accurate? Meaning, is it indeed a code issue? Is there a workaround?
 Is this just a procedural issue?
 If a code issue, can we get it into RHCS 6.1z3? Not looking for a promise.

Just some detail on this cluster:
~~~
$ ceph status

  cluster:
    id:     b32f20ee-a52f-503d-91a1-a1442eb7e7d9
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum devcloud-controller-0,devcloud-controller-2,devcloud-controller-1 (age 3d)
    mgr: devcloud-controller-0.jyayzd(active, since 6d), standbys: devcloud-controller-2.hpzokl, devcloud-controller-1.gifuhs
    osd: 24 osds: 24 up (since 6d), 24 in (since 2w)

  data:
    pools:   4 pools, 97 pgs
    objects: 43.33k objects, 218 GiB
    usage:   657 GiB used, 69 TiB / 70 TiB avail
    pgs:     97 active+clean

  io:
    client:   0 B/s rd, 3.0 KiB/s wr, 0 op/s rd, 0 op/s wr

$ ceph version
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)
~~~

Best regards,
Manny Caldeira
Software Maintenance Engineer
Red Hat Ceph Storage  (RHCS)

--- Additional comment from Adam King on 2023-10-18 14:49:10 UTC ---

(In reply to Manny from comment #6)
> Hello @adking ,
> 
> We have an active case tied to this BZ. It's already linked to this BZ.
> 
>  Is the BZ accurate? Meaning, is it indeed a code issue? Is there a
> workaround?
>  Is this just a procedural issue?
>  If a code issue, can we get it into RHCS 6.1z3? Not looking for a promise.
> 
> Just some detail on this cluster:
> ~~~
> $ ceph status
> 
>   cluster:
>     id:     b32f20ee-a52f-503d-91a1-a1442eb7e7d9
>     health: HEALTH_OK
> 
>   services:
>     mon: 3 daemons, quorum
> devcloud-controller-0,devcloud-controller-2,devcloud-controller-1 (age 3d)
>     mgr: devcloud-controller-0.jyayzd(active, since 6d), standbys:
> devcloud-controller-2.hpzokl, devcloud-controller-1.gifuhs
>     osd: 24 osds: 24 up (since 6d), 24 in (since 2w)
> 
>   data:
>     pools:   4 pools, 97 pgs
>     objects: 43.33k objects, 218 GiB
>     usage:   657 GiB used, 69 TiB / 70 TiB avail
>     pgs:     97 active+clean
> 
>   io:
>     client:   0 B/s rd, 3.0 KiB/s wr, 0 op/s rd, 0 op/s wr
> 
> $ ceph version
> ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343)
> quincy (stable)
> ~~~
> 
> Best regards,
> Manny Caldeira
> Software Maintenance Engineer
> Red Hat Ceph Storage  (RHCS)

I can't commit to it, but I can have a look. It requires two changes: having cephadm check port availability only on the given network, and getting each daemon (prometheus, grafana, etc.) to actually bind only to the correct network. That second part is the one that will take a bit more research, so I'm unsure how long it will take.
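
Putting those two pieces together, the flow described above would roughly be: pick the bind IP from the spec'd networks first, then verify the port only on that IP and hand it to the daemon's own config. A hedged sketch, reusing the hypothetical port_free and pick_bind_ip helpers from the earlier sketches (none of this is cephadm code):

~~~
def verify_and_select_bind_ip(host_addrs, spec_networks, port):
    """Sketch of the combined behaviour (not cephadm code): check port
    availability only on the address chosen from the spec'd networks,
    instead of on the wildcard ::/0.0.0.0."""
    ip = pick_bind_ip(host_addrs, spec_networks)  # hypothetical helper, sketched earlier
    if ip is None:
        raise RuntimeError("host has no address in the requested networks")
    if not port_free(ip, port):                   # hypothetical helper, sketched earlier
        raise RuntimeError(f"port {port} already in use on {ip}")
    return ip  # e.g. fed into grafana's http_addr so the daemon binds only there
~~~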

--- Additional comment from Adam King on 2023-10-18 20:30:09 UTC ---

Early experimental work on this https://github.com/ceph/ceph/pull/54083. At least seems to work okay for grafana.

--- Additional comment from Manny on 2023-10-24 01:47:19 UTC ---

(In reply to Adam King from comment #8)
> Early experimental work on this https://github.com/ceph/ceph/pull/54083. At least seems to work okay for grafana.

Hello again Adam,

Good to hear that you've been able to get this work in any context, TY.

Is this an RFE or a bug fix?
Can this be fixed in RHCS 6.1.z-something?
If yes, can we get this BZ cloned so we have a BZ with an accurate target release?

Please let us know, TY

Best regards,
Manny

--- Additional comment from Adam King on 2023-10-24 17:42:48 UTC ---

(In reply to Manny from comment #9)
> (In reply to Adam King from comment #8)
> > Early experimental work on this https://github.com/ceph/ceph/pull/54083. At least seems to work okay for grafana.
> 
> Hello again Adam,
> 
> Good to hear that you've been able to get this work in any context, TY.
> 
> Is this an RFE or a bug fix?
> Can this be fixed in RHCS 6.1.z-something?
> If yes, can we get this BZ cloned so we have a BZ with an accurate target
> release?
> 
> Please let us know, TY
> 
> Best regards,
> Manny

I consider this to be an RFE. However, we tend to backport quite a few RFEs on the cephadm side anyway. I don't know when 6.1z3 is meant to release, so I'm unsure whether we can have it ready by then, but you should still be fine to clone it; if we can't make 6.1z3, we can still do 6.2.

Comment 1 Francesco Pantano 2023-11-06 13:18:06 UTC
*** Bug 2246434 has been marked as a duplicate of this bug. ***

Comment 5 Scott Ostapovicz 2023-11-15 04:19:54 UTC
Missed 6.1 z3 development window.  Retargeted to 6.1 z4.

Comment 7 Scott Ostapovicz 2024-01-23 13:53:28 UTC
This did not make it to the 6.1 z4 freeze date.  Retargeting to 6.1 z5.

Comment 12 Erin Peterson 2024-06-14 15:09:14 UTC
*** Bug 2254553 has been marked as a duplicate of this bug. ***

