Bug 2246440 - cephadm tries to bind grafana daemon to all (::) interfaces when a valid networks list is provided.
Summary: cephadm tries to bind grafana daemon to all (::) interfaces when valid network...
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 6.1z7
Assignee: Adam King
QA Contact: Mohit Bisht
URL:
Whiteboard:
Duplicates: 2246434 2254553 (view as bug list)
Depends On: 2233659
Blocks: 2160009 1997638 2236231 2254553
 
Reported: 2023-10-26 20:22 UTC by Manny
Modified: 2024-06-14 15:09 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2233659
Clones: 2254553 (view as bug list)
Environment:
Last Closed:
Embargoed:




Links
System ID                                   Last Updated
Red Hat Issue Tracker RHCEPH-7805           2023-10-26 20:23:16 UTC
Red Hat Knowledge Base (Solution) 7041333   2024-02-09 18:51:58 UTC

Description Manny 2023-10-26 20:22:37 UTC
+++ This bug was initially created as a clone of Bug #2233659 +++

Description of problem:


This is part of an OSP 17.1 deployment with Ceph 6; the following error is blocking the grafana container from starting:

Deploy daemon grafana.overcloud-controller-1 ...
Verifying port 3100 ...
Cannot bind to IP :: port 3100: [Errno 98] Address already in use
ERROR: TCP Port(s) '3100' required for grafana already in use

The in-use address belongs to haproxy on a different interface.


The config looks good.  From "ceph orch ls --export"

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
  - overcloud-controller-1
  - overcloud-controller-2
networks:
- 2001:db8:1:9::/64
- 2001:db8:1:c::/64
- 2001:db8:1:b::/64
- 2001:db8:1:a::/64
- 2001:db8:1:d::/64
- 2001:db8:1:8::/64
spec:
  port: 3100
---

If I understand correctly, the "networks" option should limit binding to interfaces on those networks.

Here is overcloud-controller-0 interface information showing a valid interface for binding.

overcloud-controller-0]$ grep 2001:db8:1 ip_addr 
16: vlan123    inet6 2001:db8:1:8::b5/64 scope global \       valid_lft forever preferred_lft forever

It should bind only to [2001:db8:1:8::b5]:3100.
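
For illustration, the failing check can be reproduced with a small Python sketch (an illustration of the symptom only, not cephadm's actual code; the addresses and port are taken from this report): a bind to the wildcard address "::" collides with any listener on that port regardless of interface, while a bind to the host's address in the spec'd network would succeed.

~~~
#!/usr/bin/env python3
# Minimal sketch of a bind check (not cephadm's code). On the affected host,
# haproxy already listens on port 3100 on a different IPv6 address, so binding
# to the wildcard "::" fails with EADDRINUSE even though the spec'd address is free.
import socket

def port_free(addr: str, port: int) -> bool:
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    try:
        s.bind((addr, port))
        return True
    except OSError as e:  # e.g. [Errno 98] Address already in use
        print(f"bind [{addr}]:{port} failed: {e}")
        return False
    finally:
        s.close()

# Values mirroring this report (results are what the affected host would see):
port_free("::", 3100)                # False: collides with haproxy -> deploy aborts
port_free("2001:db8:1:8::b5", 3100)  # True: the address from the spec'd network is free
~~~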

This also seems to impact other services such as prometheus and alertmanager, likely due to the same underlying issue.

I'll provide more details and logs in private comments.


Version-Release number of selected component (if applicable):

cephadm-17.2.6-70.el9cp.noarch
ceph 6 deployment

How reproducible:
this environment


Steps to Reproduce:
1. see notes above

Actual results:

The grafana daemon attempts to bind to all interfaces and fails.


Expected results:

The daemon binds only to the specific interface selected by the networks configuration.

Additional info:

In private comments.

--- Additional comment from Matt Flusche on 2023-08-22 20:41:32 UTC ---

SFDC case: 03568800

sosreports if needed: supportshell.cee.redhat.com:/cases/03568800

Let me know if I need to attach specific logs for review.

--- Additional comment from Matt Flusche on 2023-08-22 20:49:21 UTC ---

Note: I obfuscated IPs for the public case:

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - devcloud-controller-0
  - devcloud-controller-1
  - devcloud-controller-2
networks:
- 2605:1c00:50f2:28a9::/64
- 2605:1c00:50f2:28ac::/64
- 2605:1c00:50f2:28ab::/64
- 2605:1c00:50f2:28aa::/64
- 2605:1c00:50f2:28ad::/64
- 2605:1c00:50f2:28a8::/64
spec:
  port: 3000
---

^^ Port 3000 here was just a temporary test of switching this port; it should be 3100.


supportshell-1 03568800]$ grep cephadm /cases/03568800/sosreport-20230818-181157/devcloud-controller-0/var/log/messages|grep 3100|grep grafana |tail -1
Aug 18 17:26:20 devcloud-controller-0 ceph-mon[32652]: Failed while placing grafana.devcloud-controller-1 on devcloud-controller-1: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana-devcloud-controller-1#012/bin/podman: stderr Error: inspecting object: no such container ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana-devcloud-controller-1#012Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana.devcloud-controller-1#012/bin/podman: stderr Error: inspecting object: no such container ceph-838e38a9-33cd-592a-946e-14172b49bc30-grafana.devcloud-controller-1#012Deploy daemon grafana.devcloud-controller-1 ...#012Verifying port 3100 ...#012Cannot bind to IP :: port 3100: [Errno 98] Address already in use#012ERROR: TCP Port(s) '3100' required for grafana already in use

Showing the currently listening haproxy service on a different IP:

supportshell-1 03568800]$ grep 3100 /cases/03568800/sosreport-20230818-181157/devcloud-controller-0/sos_commands/networking/netstat_-W_-neopa 
tcp6       0      0 2605:1c00:50f2:2888::30:3100 :::*                    LISTEN      0          393147895  241853/haproxy       off (0.00/0/0)

supportshell-1 03568800]$ grep 2605:1c00:50f2:28a8 /cases/03568800/sosreport-20230818-181157/devcloud-controller-0/ip_addr 
16: vlan688    inet6 2605:1c00:50f2:28a8::b5/64 scope global \       valid_lft forever preferred_lft forever

--- Additional comment from Adam King on 2023-08-23 17:51:41 UTC ---

Iirc, currently the "networks" param is more for filtering to hosts that have the required networks than actually having the daemon bind its ports on those specific networks. We have some preliminary work in https://github.com/ceph/ceph/pull/53008 that allows us to at least check the conflicts correctly and makes binding to ports on specific IPs work for haproxy in particular, but we still need to follow up and get this working for other use cases. It's definitely something we can take as an RFE, though, and since this is something we know is missing, I don't think we need any additional logs or info from the customer. My biggest concern is actually the use of IPv6: we don't have any testing for IPv6 in the upstream CI, so we only have manual testing for that right now. Either way, we'll see what we can do and will plan this for 7.1 for now (it could potentially be cloned into a 6 release afterward as well).
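
To illustrate the distinction described above, host filtering versus actually choosing a bind address inside one of the spec'd networks could look roughly like the sketch below (pick_bind_ip and the sample data are hypothetical illustrations, not cephadm internals):

~~~
import ipaddress

def pick_bind_ip(host_addrs, spec_networks):
    """Return the first host address that falls inside one of the spec'd
    networks, or None. A sketch of the desired behaviour, not cephadm code."""
    nets = [ipaddress.ip_network(n) for n in spec_networks]
    for addr in host_addrs:
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in nets):
            return addr
    return None

# Hypothetical data based on this report:
host_addrs = ["fe80::1", "2001:db8:1:8::b5"]
spec_networks = ["2001:db8:1:9::/64", "2001:db8:1:8::/64"]
print(pick_bind_ip(host_addrs, spec_networks))  # -> 2001:db8:1:8::b5
~~~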

--- Additional comment from Matt Flusche on 2023-08-25 14:51:41 UTC ---

Hi Adam,

Thanks for looking into this. I've done some lab testing and now I'm more confused about how the interface binding is done.

First I just did a generic deployment with a single ipv4 interface and the port binding worked fine.

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
networks:
- 172.16.1.0/24
spec:
  port: 3100
---

From the log, it selected the 172.16.1.62 interface:

logger=http.server t=2023-08-24T18:24:28.690157102Z level=info msg="HTTP Server Listen" address=172.16.1.62:3100 protocol=https subUrl= socket=

And we see haproxy and grafana using port 3100 on different interfaces, as expected.

[root@overcloud-controller-0 ceph-admin]# ss -tlnp |grep 3100
LISTEN 0      4096     172.16.1.62:3100       0.0.0.0:*    users:(("grafana",pid=473398,fd=7))
LISTEN 0      4096   192.168.2.101:3100       0.0.0.0:*    users:(("haproxy",pid=477438,fd=8))


I even tried with a list of IPv4 networks and it worked fine.

---
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
networks:
- 172.10.1.0/24
- 172.11.1.0/24
- 172.12.1.0/24
- 172.13.1.0/24
- 172.16.1.0/24
spec:
  port: 3100
---


Then I was manually re-configuring grafana with: ceph orch apply -i /root/grafana.yaml

where /root/grafana.yaml has my original single-network config:

 cat /root/grafana.yaml 
service_type: grafana
service_name: grafana
placement:
  hosts:
  - overcloud-controller-0
networks:
- 172.16.1.0/24
spec:
  port: 3100

However, it would then try to bind to all interfaces:

[ceph: root@overcloud-controller-0 /]# ceph orch ls grafana --format json-pretty

[
  {
    "events": [
      "2023-08-24T22:02:23.577879Z service:grafana [ERROR] \"Failed while placing grafana.overcloud-controller-0 on overcloud-controller-0: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana-overcloud-controller-0\n/bin/podman: stderr Error: inspecting object: no such container ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana-overcloud-controller-0\nNon-zero exit code 125 from /bin/podman container inspect --format {{.State.Status}} ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana.overcloud-controller-0\n/bin/podman: stderr Error: inspecting object: no such container ceph-5a7cf34b-f958-525d-b742-1610a2eb4d9e-grafana.overcloud-controller-0\nDeploy daemon grafana.overcloud-controller-0 ...\nVerifying port 3100 ...\nCannot bind to IP 0.0.0.0 port 3100: [Errno 98] Address already in use\nERROR: TCP Port(s) '3100' required for grafana already in use\"",
      "2023-08-25T13:11:18.990582Z service:grafana [INFO] \"service was created\""
    ],
    "networks": [
      "172.16.1.0/24"
    ],
    "placement": {
      "hosts": [
        "overcloud-controller-0"
      ]
    },
    "service_name": "grafana",
    "service_type": "grafana",
    "spec": {
      "port": 3100
    },
    "status": {
      "created": "2023-08-25T14:37:21.601722Z",
      "ports": [
        3100
      ],
      "running": 0,
      "size": 1
    }
  }
]


There seems to be somewhere else where it determines how to bind the grafana interface.

--- Additional comment from Francesco Pantano on 2023-10-16 06:51:28 UTC ---



--- Additional comment from Manny on 2023-10-17 19:37:40 UTC ---

Hello @adking ,

We have an active case tied to this BZ. It's already linked to this BZ.

 Is the BZ accurate? Meaning, is it indeed a code issue? Is there a workaround?
 Is this just a procedural issue?
 If a code issue, can we get it into RHCS 6.1z3? Not looking for a promise.

Just some detail on this cluster:
~~~
$ ceph status

  cluster:
    id:     b32f20ee-a52f-503d-91a1-a1442eb7e7d9
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum devcloud-controller-0,devcloud-controller-2,devcloud-controller-1 (age 3d)
    mgr: devcloud-controller-0.jyayzd(active, since 6d), standbys: devcloud-controller-2.hpzokl, devcloud-controller-1.gifuhs
    osd: 24 osds: 24 up (since 6d), 24 in (since 2w)

  data:
    pools:   4 pools, 97 pgs
    objects: 43.33k objects, 218 GiB
    usage:   657 GiB used, 69 TiB / 70 TiB avail
    pgs:     97 active+clean

  io:
    client:   0 B/s rd, 3.0 KiB/s wr, 0 op/s rd, 0 op/s wr

$ ceph version
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)
~~~

Best regards,
Manny Caldeira
Software Maintenance Engineer
Red Hat Ceph Storage  (RHCS)

--- Additional comment from Adam King on 2023-10-18 14:49:10 UTC ---

(In reply to Manny from comment #6)
> Hello @adking ,
> 
> We have an active case tied to this BZ. It's already linked to this BZ.
> 
>  Is the BZ accurate? Meaning, is it indeed a code issue? Is there a
> workaround?
>  Is this just a procedural issue?
>  If a code issue, can we get it into RHCS 6.1z3? Not looking for a promise.
> 
> Just some detail on this cluster:
> ~~~
> $ ceph status
> 
>   cluster:
>     id:     b32f20ee-a52f-503d-91a1-a1442eb7e7d9
>     health: HEALTH_OK
> 
>   services:
>     mon: 3 daemons, quorum
> devcloud-controller-0,devcloud-controller-2,devcloud-controller-1 (age 3d)
>     mgr: devcloud-controller-0.jyayzd(active, since 6d), standbys:
> devcloud-controller-2.hpzokl, devcloud-controller-1.gifuhs
>     osd: 24 osds: 24 up (since 6d), 24 in (since 2w)
> 
>   data:
>     pools:   4 pools, 97 pgs
>     objects: 43.33k objects, 218 GiB
>     usage:   657 GiB used, 69 TiB / 70 TiB avail
>     pgs:     97 active+clean
> 
>   io:
>     client:   0 B/s rd, 3.0 KiB/s wr, 0 op/s rd, 0 op/s wr
> 
> $ ceph version
> ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343)
> quincy (stable)
> ~~~
> 
> Best regards,
> Manny Caldeira
> Software Maintenance Engineer
> Red Hat Ceph Storage  (RHCS)

I can't commit to it, but I can have a look. It requires two changes: having cephadm check port availability only on the given network, and getting each daemon (prometheus, grafana, etc.) to actually bind only to the correct network. That second part is the one that will take a bit more research, so I'm unsure how long it will take.
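
Putting those two pieces together, the flow described above would roughly be: pick the bind IP from the spec'd networks first, then verify the port only on that IP and hand it to the daemon's own config. A hedged sketch, reusing the hypothetical port_free and pick_bind_ip helpers from the earlier sketches (none of this is cephadm code):

~~~
def verify_and_select_bind_ip(host_addrs, spec_networks, port):
    """Sketch of the combined behaviour (not cephadm code): check port
    availability only on the address chosen from the spec'd networks,
    instead of on the wildcard ::/0.0.0.0."""
    ip = pick_bind_ip(host_addrs, spec_networks)  # hypothetical helper, sketched earlier
    if ip is None:
        raise RuntimeError("host has no address in the requested networks")
    if not port_free(ip, port):                   # hypothetical helper, sketched earlier
        raise RuntimeError(f"port {port} already in use on {ip}")
    return ip  # e.g. fed into grafana's http_addr so the daemon binds only there
~~~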

--- Additional comment from Adam King on 2023-10-18 20:30:09 UTC ---

Early experimental work on this https://github.com/ceph/ceph/pull/54083. At least seems to work okay for grafana.

--- Additional comment from Manny on 2023-10-24 01:47:19 UTC ---

(In reply to Adam King from comment #8)
> Early experimental work on this https://github.com/ceph/ceph/pull/54083. At least seems to work okay for grafana.

Hello again Adam,

Good to hear that you've been able to get this work in any context, TY.

Is this an RFE or a bug fix?
Can this be fixed in RHCS 6.1.z-something?
If yes, can we get this BZ cloned so we have a BZ with an accurate target release?

Please let us know, TY

Best regards,
Manny

--- Additional comment from Adam King on 2023-10-24 17:42:48 UTC ---

(In reply to Manny from comment #9)
> (In reply to Adam King from comment #8)
> > Early experimental work on this https://github.com/ceph/ceph/pull/54083. At least seems to work okay for grafana.
> 
> Hello again Adam,
> 
> Good to hear that you've been able to get this work in any context, TY.
> 
> Is this an RFE or a bug fix?
> Can this be fixed in RHCS 6.1.z-something?
> If yes, can we get this BZ cloned so we have a BZ with an accurate target
> release?
> 
> Please let us know, TY
> 
> Best regards,
> Manny

I consider this to be an RFE. However, we tend to backport quite a few RFEs on the cephadm side anyway. I don't know when 6.1z3 is meant to release, so I'm unsure whether we can have it ready by then, but you should still be fine to clone it; if we can't make 6.1z3, we can still do 6.2.

Comment 1 Francesco Pantano 2023-11-06 13:18:06 UTC
*** Bug 2246434 has been marked as a duplicate of this bug. ***

Comment 5 Scott Ostapovicz 2023-11-15 04:19:54 UTC
Missed 6.1 z3 development window.  Retargeted to 6.1 z4.

Comment 7 Scott Ostapovicz 2024-01-23 13:53:28 UTC
This did not make it to the 6.1 z4 freeze date.  Retargeting to 6.1 z5.

Comment 12 Erin Peterson 2024-06-14 15:09:14 UTC
*** Bug 2254553 has been marked as a duplicate of this bug. ***

