.Cephadm does not emit health warnings when an active NVMe-oF daemon is stopped
Currently, Cephadm does not consider stopped daemons when generating health warnings. Health warnings are only triggered for daemons in a failed state. As a result, Cephadm does not emit a health warning if one or more active NVMe-oF daemons are stopped.
As a workaround, use the ceph orch ps --daemon-type nvmeof command to check the state of all NVMe-oF daemons. Check the values in the REFRESHED column of the output, which shows how long ago Cephadm last checked the state of the daemons. To refresh the information, use the ceph orch ps --refresh command. By default, the information is refreshed every 10 minutes. You can adjust this refresh rate by modifying the mgr/cephadm/daemon_cache_timeout value in seconds. For example, to set the refresh rate to every 5 minutes, use the ceph config set mgr mgr/cephadm/daemon_cache_timeout 300 command.
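For example, a workaround session might look like the following (host and daemon names are illustrative, and the output is abridged):

[ceph: root@host01 /]# ceph orch ps --daemon-type nvmeof
NAME                        HOST    PORTS             STATUS        REFRESHED  AGE
nvmeof.pool1.host02.aaaaaa  host02  *:5500,4420,8009  running (3d)  9s ago     3d
nvmeof.pool1.host03.bbbbbb  host03  *:5500,4420,8009  stopped       7s ago     3d
[ceph: root@host01 /]# ceph orch ps --refresh
[ceph: root@host01 /]# ceph config set mgr mgr/cephadm/daemon_cache_timeout 300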
Description of problem:
Currently, health is HEALTH_OK in ceph -s even though 8 of 32 nvmeof gateways are stopped or in an error state, as shown below. Health should move to HEALTH_WARN, listing the particular daemons that are down as the reason. This has to be backported to 7.1 as well, which is important. I remember having a BZ for this but I cannot find it now.
[ceph: root@tala001 /]# ceph -s
  cluster:
    id:     2573aca6-908d-11ef-ab6d-b4835101e4e4
    health: HEALTH_OK

  services:
    mon:    3 daemons, quorum tala001,tala002,tala003 (age 5d)
    mgr:    tala001.slxshz(active, since 3d), standbys: tala002.giruyl
    osd:    131 osds: 131 up (since 3d), 131 in (since 3d)
    nvmeof: 26 gateways active (26 hosts)

  data:
    pools:   7 pools, 481 pgs
    objects: 3.01M objects, 11 TiB
    usage:   35 TiB used, 333 TiB / 368 TiB avail
    pgs:     481 active+clean

  io:
    client:   5.8 MiB/s rd, 38 op/s rd, 0 op/s wr
[ceph: root@tala001 /]# ceph orch ps| grep nvmeof
nvmeof.nvmeof_pool.group2.tala011.lupkkr tala011 *:5500,4420,8009 running (3d) 9s ago 3d 1482M - d1890c2c521e 6e36b73af16e
nvmeof.nvmeof_pool.group2.tala012.grpqng tala012 *:5500,4420,8009 running (3d) 9s ago 3d 1462M - d1890c2c521e de5d01ef2a4c
nvmeof.nvmeof_pool.group2.tala013.dalrcy tala013 *:5500,4420,8009 running (3d) 9s ago 3d 1480M - d1890c2c521e acf7355ec64c
nvmeof.nvmeof_pool.group2.tala014.xopceq tala014 *:5500,4420,8009 running (3d) 9s ago 3d 1469M - d1890c2c521e 4123ab99d1e0
nvmeof.nvmeof_pool.group2.tala018.vhhxuf tala018 *:5500,4420,8009 running (3d) 9s ago 3d 1484M - d1890c2c521e 265c31cdf6c7
nvmeof.nvmeof_pool.group2.tala019.jlcpwu tala019 *:5500,4420,8009 running (3d) 9s ago 3d 1462M - d1890c2c521e 2e27c518d290
nvmeof.nvmeof_pool.group2.tala021.kzyoyi tala021 *:5500,4420,8009 running (3d) 9s ago 3d 1471M - d1890c2c521e 95ba5472c883
nvmeof.nvmeof_pool.group2.tala022.uttuut tala022 *:5500,4420,8009 running (3d) 9s ago 3d 1486M - d1890c2c521e 45735a968ec0
nvmeof.nvmeof_pool.group3.ceph-scale-2-py5fg8-node1.lkgbtj ceph-scale-2-py5fg8-node1 *:5500,4420,8009 stopped 7s ago 3d - - <unknown> <unknown> <unknown>
nvmeof.nvmeof_pool.group3.ceph-scale-2-py5fg8-node2.ijwedf ceph-scale-2-py5fg8-node2 *:5500,4420,8009 stopped 7s ago 3d - - <unknown> <unknown> <unknown>
nvmeof.nvmeof_pool.group3.ceph-scale-2-py5fg8-node3.obuxzq ceph-scale-2-py5fg8-node3 *:5500,4420,8009 stopped 7s ago 3d - - <unknown> <unknown> <unknown>
nvmeof.nvmeof_pool.group3.ceph-scale-2-py5fg8-node4.pzuoql ceph-scale-2-py5fg8-node4 *:5500,4420,8009 stopped 7s ago 3d - - <unknown> <unknown> <unknown>
nvmeof.nvmeof_pool.group3.tala023.pdxpap tala023 *:5500,4420,8009 running (3d) 8s ago 3d 1669M - d1890c2c521e 88b70bd5c796
nvmeof.nvmeof_pool.group3.tala024.udqraj tala024 *:5500,4420,8009 running (3d) 8s ago 3d 1718M - d1890c2c521e 362492a39537
nvmeof.nvmeof_pool.group3.tala025.wwbssy tala025 *:5500,4420,8009 running (3d) 7s ago 3d 1742M - d1890c2c521e 98918c880d23
nvmeof.nvmeof_pool.group3.tala026.ydnfgn tala026 *:5500,4420,8009 running (3d) 7s ago 3d 1734M - d1890c2c521e 7fea5c629401
nvmeof.nvmeof_pool.group4.ceph-scale-2-py5fg8-node5.cbyzpa ceph-scale-2-py5fg8-node5 *:5500,4420,8009 running (3d) 4s ago 3d 1495M - d1890c2c521e f1f3bbae0c32
nvmeof.nvmeof_pool.group4.ceph-scale-2-py5fg8-node6.gmkwyd ceph-scale-2-py5fg8-node6 *:5500,4420,8009 running (3d) 5s ago 3d 1465M - d1890c2c521e 222c64422150
nvmeof.nvmeof_pool.group4.ceph-scale-2-py5fg8-node7.gdynqu ceph-scale-2-py5fg8-node7 *:5500,4420,8009 running (3d) 6s ago 3d 1469M - d1890c2c521e c11af10aeac3
nvmeof.nvmeof_pool.group4.ceph-scale-2-py5fg8-node8.qdtuns ceph-scale-2-py5fg8-node8 *:5500,4420,8009 running (3d) 6s ago 3d 1464M - d1890c2c521e c757eed19686
nvmeof.nvmeof_pool.group4.ceph-scale-2-py5fg8-node9.kyeeop ceph-scale-2-py5fg8-node9 *:5500,4420,8009 running (3d) 6s ago 3d 1476M - d1890c2c521e 20be8febd0bd
nvmeof.nvmeof_pool.group4.ceph-scale-2-py5fg8-node10.kdruzm ceph-scale-2-py5fg8-node10 *:5500,4420,8009 running (3d) 6s ago 3d 1511M - d1890c2c521e 7efc572dfa78
nvmeof.nvmeof_pool.group4.ceph-scale-2-py5fg8-node11.ivcdsc ceph-scale-2-py5fg8-node11 *:5500,4420,8009 running (3d) 7s ago 3d 1501M - d1890c2c521e 10a8414e7222
nvmeof.nvmeof_pool.group4.ceph-scale-2-py5fg8-node12.oxmyin ceph-scale-2-py5fg8-node12 *:5500,4420,8009 running (3d) 7s ago 3d 1478M - d1890c2c521e 0ac9c1634045
nvmeof.nvmeof_pool.tala003.fjsvct tala003 *:5500,4420,8009 running (24h) 9m ago 5d 647M - d1890c2c521e bbd323193423
nvmeof.nvmeof_pool.tala004.mxjpxs tala004 *:5500,4420,8009 running (24h) 9m ago 5d 643M - d1890c2c521e a5bb5f6b1fbc
nvmeof.nvmeof_pool.tala005.kgsiek tala005 *:5500,4420,8009 running (24h) 9m ago 5d 651M - d1890c2c521e 80b16963c41f
nvmeof.nvmeof_pool.tala006.frwxtb tala006 *:5500,4420,8009 running (24h) 9m ago 5d 647M - d1890c2c521e aa07f514cf10
nvmeof.nvmeof_pool.tala007.rlqngl tala007 *:5500,4420,8009 stopped 9m ago 3d - - <unknown> <unknown> <unknown>
nvmeof.nvmeof_pool.tala008.ohvzfm tala008 *:5500,4420,8009 stopped 9m ago 3d - - <unknown> <unknown> <unknown>
nvmeof.nvmeof_pool.tala009.yivhzu tala009 *:5500,4420,8009 running (24h) 9m ago 3d 644M - d1890c2c521e 4a6e52fc832a
nvmeof.nvmeof_pool.tala010.tkqkmb tala010 *:5500,4420,8009 running (24h) 9m ago 3d 650M - d1890c2c521e 25a1413a1fb4
[ceph: root@tala001 /]# ceph orch ls | grep nvmeof
nvmeof.nvmeof_pool ?:4420,5500,8009 6/8 10m ago 17h tala003;tala004;tala005;tala006;tala007;tala008;tala009;tala010
nvmeof.nvmeof_pool.group2 ?:4420,5500,8009 8/8 17s ago 17h tala011;tala012;tala013;tala014;tala018;tala019;tala021;tala022
nvmeof.nvmeof_pool.group3 ?:4420,5500,8009 4/8 17s ago 17h tala023;tala024;tala025;tala026;ceph-scale-2-py5fg8-node1;ceph-scale-2-py5fg8-node2;ceph-scale-2-py5fg8-node3;ceph-scale-2-py5fg8-node4
nvmeof.nvmeof_pool.group4 ?:4420,5500,8009 8/8 16s ago 17h ceph-scale-2-py5fg8-node12;ceph-scale-2-py5fg8-node11;ceph-scale-2-py5fg8-node10;ceph-scale-2-py5fg8-node9;ceph-scale-2-py5fg8-node8;ceph-scale-2-py5fg8-node7;ceph-scale-2-py5fg8-node6;ceph-scale-2-py5fg8-node5
Version-Release number of selected component (if applicable):
[ceph: root@tala001 /]# ceph version
ceph version 19.2.0-39.el9cp (ade19941ff2892c8fef06386a713d71e27e93a2c) squid (stable)
How reproducible: Always
Steps to Reproduce:
1. Deploy a Ceph cluster on Reef or Squid and deploy the nvmeof service.
2. Bring down a few nvmeof daemons; ceph health does not complain about them (see the example session after this list).
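For illustration, the problem can be demonstrated by stopping one of the gateways listed above and rechecking cluster health (output abridged; the scheduler message wording may differ by release):

[ceph: root@tala001 /]# ceph orch daemon stop nvmeof.nvmeof_pool.tala007.rlqngl
Scheduled to stop nvmeof.nvmeof_pool.tala007.rlqngl on host 'tala007'
[ceph: root@tala001 /]# ceph health
HEALTH_OK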
Actual results: After bringing down a few nvmeof daemons, ceph health does not complain about them and the cluster remains at HEALTH_OK.
Expected results: After bringing down a few nvmeof daemons, ceph health should complain about them and the cluster should move to HEALTH_WARN.
Additional info:
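A fix would be expected to surface stopped gateways through a cephadm health check, analogous to the existing CEPHADM_FAILED_DAEMON warning for daemons in an error state. A hypothetical ceph health detail output after such a fix (the check name and wording here are illustrative assumptions, not the actual implementation) could look like:

[ceph: root@tala001 /]# ceph health detail
HEALTH_WARN 6 stopped cephadm daemon(s)
[WRN] CEPHADM_STOPPED_DAEMON: 6 stopped cephadm daemon(s)
    daemon nvmeof.nvmeof_pool.group3.ceph-scale-2-py5fg8-node1.lkgbtj on ceph-scale-2-py5fg8-node1 is in stopped state
    daemon nvmeof.nvmeof_pool.tala007.rlqngl on tala007 is in stopped state
    ...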