Bug 2364869 - [NVMeoF grafana Dashboard] Ceph Health NVMeoF WARNING panel doesn't display the right failed GW count
Summary: [NVMeoF grafana Dashboard] Ceph Health NVMeoF WARNING panel doesn't display the right failed GW count
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NVMeOF
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.1z2
Assignee: Aviv Caro
QA Contact: Manohar Murthy
Docs Contact: ceph-doc-bot
URL:
Whiteboard:
Depends On:
Blocks: 2351689
 
Reported: 2025-05-07 17:01 UTC by Sunil Kumar Nagaraju
Modified: 2025-05-29 09:10 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.NVMe-oF Grafana overview dashboards display wrong status information when multiple gateways are down
If more than one gateway is down (in the `UNAVAILABLE` state), the Grafana charts for `Total Gateways` and `Ceph Health NVMeoF WARNING` display the wrong information. In addition, the `Ceph Health NVMeoF WARNING` graph displays a count of 1 for `NVMEOF_GATEWAY_DOWN`, even when multiple gateways are down. Currently, there is no workaround.
Clone Of:
Environment:
Last Closed:
Embargoed:


Attachments
Ceph Health NVMeoF WARNING invalid count (352.87 KB, image/png)
2025-05-07 17:01 UTC, Sunil Kumar Nagaraju


Links
Red Hat Issue Tracker RHCEPH-11352 (Last Updated: 2025-05-07 17:05:00 UTC)

Description Sunil Kumar Nagaraju 2025-05-07 17:01:56 UTC
Created attachment 2088847
Ceph Health NVMeoF WARNING invalid count

Description of problem:

Currently, under the NVMe-oF Grafana overview dashboard, the `Ceph Health NVMeoF WARNING` panel does not show the exact number of failed gateways.

Instead, it just reflects the presence of the NVMEOF_GATEWAY_DOWN warning in Ceph.

As the health detail output below shows, 5 gateways are down. However, the
`Ceph Health NVMeoF WARNING` panel always indicates the value "1", which represents the NVMEOF_GATEWAY_DOWN warning itself rather than the number of gateways that are DOWN in the system.

[ceph: root@ceph-sunilkumar-81-00-d6k85g-node1-installer /]# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm; 5 gateway(s) are in unavailable state; gateway might be down, try to redeploy.
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon nvmeof.ceph-sunilkumar-81-00-d6k85g-node5.duiuns on host ceph-sunilkumar-81-00-d6k85g-node5 not managed by cephadm
[WRN] NVMEOF_GATEWAY_DOWN: 5 gateway(s) are in unavailable state; gateway might be down, try to redeploy.
    NVMeoF Gateway 'client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node6.xndlff' is unavailable.
    NVMeoF Gateway 'client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node7.rlmenl' is unavailable.
    NVMeoF Gateway 'client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node8.qwbghd' is unavailable.
    NVMeoF Gateway 'client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node9.aagjid' is unavailable.
    NVMeoF Gateway 'client.nvmeof.rbd.group2.ceph-sunilkumar-81-00-d6k85g-node4.iedqbf' is unavailable.
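
For reference, the value the panel is expected to show can be derived from the health check detail rather than from the mere presence of the check. A minimal sketch in Python, assuming the standard JSON form of `ceph health detail -f json` (a top-level "checks" map whose NVMEOF_GATEWAY_DOWN entry carries one "detail" message per unavailable gateway); the helper name is only for illustration:

import json
import subprocess

def count_down_gateways():
    # Run 'ceph health detail' with JSON output and parse it.
    raw = subprocess.run(
        ["ceph", "health", "detail", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    health = json.loads(raw)

    # Assumption: the NVMEOF_GATEWAY_DOWN check (when present) lists one
    # "detail" entry per unavailable gateway, e.g.
    # "NVMeoF Gateway '...' is unavailable."
    check = health.get("checks", {}).get("NVMEOF_GATEWAY_DOWN")
    if not check:
        return 0
    return len(check.get("detail", []))

if __name__ == "__main__":
    # In the state captured above this would print 5, while the
    # 'Ceph Health NVMeoF WARNING' panel shows 1.
    print(count_down_gateways())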



Version-Release number of selected component (if applicable):
IBM Ceph 8.1 19.2.1-167.el9cp 

How reproducible: always


Steps to Reproduce:
1. Deploy IBM Ceph cluster.
2. Configure NVMe-oF with multiple gateways and their entities, from subsystems to namespaces.
3. Go to Block --> NVMeoF --> Gateways --> Overview tab and check the `Ceph Health NVMeoF WARNING` panel; it always shows 1 when NVMEOF_GATEWAY_DOWN is fired.

Actual results:
Invalid failed-gateway count in the dashboard panel.


Additional info:
Attaching a screenshot for reference.


[ceph: root@ceph-sunilkumar-81-00-d6k85g-node1-installer /]# ceph orch host ls
HOST                                          ADDR         LABELS                    STATUS
ceph-sunilkumar-81-00-d6k85g-node1-installer  10.0.67.131  _admin,mon,mgr,installer
ceph-sunilkumar-81-00-d6k85g-node2            10.0.64.157  mon,mgr
ceph-sunilkumar-81-00-d6k85g-node3            10.0.67.187  mon,osd
ceph-sunilkumar-81-00-d6k85g-node4            10.0.66.183  mds,osd
ceph-sunilkumar-81-00-d6k85g-node5            10.0.67.29   mds,osd,rgw
ceph-sunilkumar-81-00-d6k85g-node6            10.0.64.71   nvmeof-gw
ceph-sunilkumar-81-00-d6k85g-node7            10.0.67.24   nvmeof-gw
ceph-sunilkumar-81-00-d6k85g-node8            10.0.66.65   nvmeof-gw
ceph-sunilkumar-81-00-d6k85g-node9            10.0.66.228  nvmeof-gw
9 hosts in cluster
[ceph: root@ceph-sunilkumar-81-00-d6k85g-node1-installer /]# ceph orch ps --daemon-type nvmeof
NAME                                                         HOST                                PORTS                   STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node6.xndlff         ceph-sunilkumar-81-00-d6k85g-node6  *:5500,4420,8009,10008  running (2d)     8m ago   2d     178M        -  1.4.7    96d48c1edeaf  58d7a9d0ace9
nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node7.rlmenl         ceph-sunilkumar-81-00-d6k85g-node7  *:5500,4420,8009,10008  running (2d)     8m ago   2d     176M        -  1.4.7    96d48c1edeaf  f6f7ac8da07b
nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node8.qwbghd         ceph-sunilkumar-81-00-d6k85g-node8  *:5500,4420,8009,10008  running (2d)     8m ago   2d     180M        -  1.4.7    96d48c1edeaf  968828243dc9
nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node9.aagjid         ceph-sunilkumar-81-00-d6k85g-node9  *:5500,4420,8009,10008  running (2d)     8m ago   2d     187M        -  1.4.7    96d48c1edeaf  84ee646d7526
nvmeof.rbd.group2.ceph-sunilkumar-81-00-d6k85g-node4.iedqbf  ceph-sunilkumar-81-00-d6k85g-node4  *:5500,4420,8009,10008  running (2d)     6m ago   2d     192M        -  1.4.7    96d48c1edeaf  e8d379319bb7
nvmeof.rbd.group2.ceph-sunilkumar-81-00-d6k85g-node5.duiuns  ceph-sunilkumar-81-00-d6k85g-node5  *:5500,4420,8009,10008  running (2d)     8m ago   2d     165M        -  1.4.7    96d48c1edeaf  1d1c192f34a6

[ceph: root@ceph-sunilkumar-81-00-d6k85g-node1-installer /]# ceph orch ls
NAME                       PORTS                   RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager               ?:9093,9094                 1/1  8m ago     2d   count:1
ceph-exporter                                          9/9  8m ago     2d   *
crash                                                  9/9  8m ago     2d   *
grafana                    ?:3000                      1/1  8m ago     2d   count:1
mgr                                                    2/2  8m ago     2d   label:mgr
mon                                                    3/3  8m ago     2d   label:mon
node-exporter              ?:9100                      9/9  8m ago     2d   *
nvmeof.rbd                 ?:4420,5500,8009,10008      4/4  7m ago     46h  ceph-sunilkumar-81-00-d6k85g-node6;ceph-sunilkumar-81-00-d6k85g-node7;ceph-sunilkumar-81-00-d6k85g-node8;ceph-sunilkumar-81-00-d6k85g-node9
nvmeof.rbd.group2          ?:4420,5500,8009,10008      2/2  7m ago     46h  ceph-sunilkumar-81-00-d6k85g-node4;ceph-sunilkumar-81-00-d6k85g-node5
osd.all-available-devices                               12  7m ago     2d   *
prometheus                 ?:9095                      1/1  8m ago     2d   count:1

[ceph: root@ceph-sunilkumar-81-00-d6k85g-node1-installer /]# ceph nvme-gw show rbd group1
{
    "epoch": 134,
    "pool": "rbd",
    "group": "group1",
    "features": "LB",
    "rebalance_ana_group": 4,
    "num gws": 4,
    "GW-epoch": 105,
    "Anagrp list": "[ 1 2 3 4 ]",
    "num-namespaces": 18,
    "Created Gateways:": [
        {
            "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node6.xndlff",
            "anagrp-id": 1,
            "num-namespaces": 5,
            "performed-full-startup": 1,
            "Availability": "AVAILABLE",
            "num-listeners": 2,
            "ana states": " 1: ACTIVE ,  2: STANDBY ,  3: STANDBY ,  4: STANDBY "
        },
        {
            "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node7.rlmenl",
            "anagrp-id": 2,
            "num-namespaces": 4,
            "performed-full-startup": 1,
            "Availability": "AVAILABLE",
            "num-listeners": 2,
            "ana states": " 1: STANDBY ,  2: ACTIVE ,  3: STANDBY ,  4: STANDBY "
        },
        {
            "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node8.qwbghd",
            "anagrp-id": 3,
            "num-namespaces": 5,
            "performed-full-startup": 1,
            "Availability": "AVAILABLE",
            "num-listeners": 2,
            "ana states": " 1: STANDBY ,  2: STANDBY ,  3: ACTIVE ,  4: STANDBY "
        },
        {
            "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node9.aagjid",
            "anagrp-id": 4,
            "num-namespaces": 4,
            "performed-full-startup": 1,
            "Availability": "AVAILABLE",
            "num-listeners": 2,
            "ana states": " 1: STANDBY ,  2: STANDBY ,  3: STANDBY ,  4: ACTIVE "
        }
    ]
}
[ceph: root@ceph-sunilkumar-81-00-d6k85g-node1-installer /]# ceph nvme-gw show rbd group2
{
    "epoch": 134,
    "pool": "rbd",
    "group": "group2",
    "features": "LB",
    "rebalance_ana_group": 2,
    "num gws": 2,
    "GW-epoch": 80,
    "Anagrp list": "[ 1 2 ]",
    "num-namespaces": 8,
    "Created Gateways:": [
        {
            "gw-id": "client.nvmeof.rbd.group2.ceph-sunilkumar-81-00-d6k85g-node4.iedqbf",
            "anagrp-id": 1,
            "num-namespaces": 4,
            "performed-full-startup": 1,
            "Availability": "AVAILABLE",
            "num-listeners": 1,
            "ana states": " 1: ACTIVE ,  2: STANDBY "
        },
        {
            "gw-id": "client.nvmeof.rbd.group2.ceph-sunilkumar-81-00-d6k85g-node5.duiuns",
            "anagrp-id": 2,
            "num-namespaces": 4,
            "performed-full-startup": 1,
            "Availability": "AVAILABLE",
            "num-listeners": 1,
            "ana states": " 1: STANDBY ,  2: ACTIVE "
        }
    ]
}
[ceph: root@ceph-sunilkumar-81-00-d6k85g-node1-installer /]# ceph nvme-gw show rbd ''
{
    "epoch": 134,
    "pool": "rbd",
    "group": "",
    "features": "LB",
    "rebalance_ana_group": 4,
    "num gws": 4,
    "GW-epoch": 24,
    "Anagrp list": "[ 1 2 3 4 ]",
    "num-namespaces": 0,
    "Created Gateways:": [
        {
            "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node6.xndlff",
            "anagrp-id": 1,
            "num-namespaces": 0,
            "performed-full-startup": 0,
            "Availability": "UNAVAILABLE",
            "ana states": " 1: STANDBY ,  2: STANDBY ,  3: STANDBY ,  4: STANDBY "
        },
        {
            "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node7.rlmenl",
            "anagrp-id": 2,
            "num-namespaces": 0,
            "performed-full-startup": 0,
            "Availability": "UNAVAILABLE",
            "ana states": " 1: STANDBY ,  2: STANDBY ,  3: STANDBY ,  4: STANDBY "
        },
        {
            "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node8.qwbghd",
            "anagrp-id": 3,
            "num-namespaces": 0,
            "performed-full-startup": 0,
            "Availability": "UNAVAILABLE",
            "ana states": " 1: STANDBY ,  2: STANDBY ,  3: STANDBY ,  4: STANDBY "
        },
        {
            "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-81-00-d6k85g-node9.aagjid",
            "anagrp-id": 4,
            "num-namespaces": 0,
            "performed-full-startup": 0,
            "Availability": "UNAVAILABLE",
            "ana states": " 1: STANDBY ,  2: STANDBY ,  3: STANDBY ,  4: STANDBY "
        }
    ]
}
[ceph: root@ceph-sunilkumar-81-00-d6k85g-node1-installer /]# ceph status
  cluster:
    id:     2e83a2a8-296a-11f0-bf21-fa163e699b11
    health: HEALTH_WARN
            2 stray daemon(s) not managed by cephadm
            4 gateway(s) are in unavailable state; gateway might be down, try to redeploy.

  services:
    mon:                 3 daemons, quorum ceph-sunilkumar-81-00-d6k85g-node1-installer,ceph-sunilkumar-81-00-d6k85g-node2,ceph-sunilkumar-81-00-d6k85g-node3 (age 2d)
    mgr:                 ceph-sunilkumar-81-00-d6k85g-node1-installer.qnxdec(active, since 2d), standbys: ceph-sunilkumar-81-00-d6k85g-node2.qmritp
    osd:                 12 osds: 12 up (since 2d), 12 in (since 2d)
    nvmeof (rbd.):       4 gateways: 0 active ()
    nvmeof (rbd.group1): 4 gateways: 4 active (rbd.ceph-sunilkumar-81-00-d6k85g-node6.xndlff, rbd.ceph-sunilkumar-81-00-d6k85g-node7.rlmenl, rbd.ceph-sunilkumar-81-00-d6k85g-node8.qwbghd, rbd.ceph-sunilkumar-81-00-d6k85g-node9.aagjid)
    nvmeof (rbd.group2): 2 gateways: 2 active (ceph-sunilkumar-81-00-d6k85g-node4.iedqbf, ceph-sunilkumar-81-00-d6k85g-node5.duiuns)

  data:
    pools:   3 pools, 161 pgs
    objects: 10.35k objects, 40 GiB
    usage:   82 GiB used, 158 GiB / 240 GiB avail
    pgs:     161 active+clean

  io:
    client:   26 KiB/s rd, 0 B/s wr, 7 op/s rd, 3 op/s wr
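
As a cross-check, the expected count can also be derived from the `ceph nvme-gw show` output captured above (at the time of the `ceph status` snapshot, the unnamed group reports 4 UNAVAILABLE gateways, matching the health warning). A minimal sketch in Python, assuming each group's JSON output is saved to a file and keeping the key names exactly as printed above; the script name is only for illustration:

import json
import sys

def count_unavailable(path):
    # Parse a saved 'ceph nvme-gw show <pool> <group>' JSON dump.
    with open(path) as f:
        state = json.load(f)
    # Note: the key really is "Created Gateways:" (with a trailing colon)
    # in the output captured above.
    gateways = state.get("Created Gateways:", [])
    return sum(1 for gw in gateways if gw.get("Availability") == "UNAVAILABLE")

if __name__ == "__main__":
    # Example: python3 count_unavailable.py default.json group1.json group2.json
    total = sum(count_unavailable(p) for p in sys.argv[1:])
    print(f"unavailable gateways: {total}")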

