Bug 2182377

Summary: Cluster in Health_err state post removal of Hosts from cluster
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Cephadm
Version: 5.3
Target Release: 7.1
Reporter: Pawan <pdhiran>
Assignee: Adam King <adking>
QA Contact: Mohit Bisht <mobisht>
CC: cephqe-warriors, saraut
Status: ASSIGNED
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description Pawan 2023-03-28 13:21:23 UTC
Description of problem:
The cluster enters HEALTH_ERR state after hosts are removed from the cluster.

# ceph -s
  cluster:
    id:     a6832daa-ccab-11ed-afae-fa163ec3bb1a
    health: HEALTH_ERR
            failed to probe daemons or devices
            Module 'cephadm' has failed: 'ceph-pdhiran-o85tl0-node9'

  services:
    mon: 4 daemons, quorum ceph-pdhiran-o85tl0-node2,ceph-pdhiran-o85tl0-node6,ceph-pdhiran-o85tl0-node7,ceph-pdhiran-o85tl0-node11 (age 41m)
    mgr: ceph-pdhiran-o85tl0-node6.hoglst(active, since 5h), standbys: ceph-pdhiran-o85tl0-node2.zwnqgb
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 64m), 24 in (since 112m)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 449 pgs
    objects: 250 objects, 32 KiB
    usage:   1.0 GiB used, 599 GiB / 600 GiB avail
    pgs:     449 active+clean

[root@ceph-pdhiran-o85tl0-node1-installer ~]#
[root@ceph-pdhiran-o85tl0-node1-installer ~]# ceph orch host ls
HOST                                 ADDR          LABELS                    STATUS
ceph-pdhiran-o85tl0-node1-installer  10.0.210.201  _admin
ceph-pdhiran-o85tl0-node10           10.0.208.231  osd
ceph-pdhiran-o85tl0-node11           10.0.211.10   mon rgw
ceph-pdhiran-o85tl0-node12           10.0.208.219  osd-bak _no_schedule
ceph-pdhiran-o85tl0-node13           10.0.210.103  osd-bak _no_schedule
ceph-pdhiran-o85tl0-node2            10.0.208.239  mon mds alertmanager mgr
ceph-pdhiran-o85tl0-node3            10.0.211.84   osd
ceph-pdhiran-o85tl0-node4            10.0.209.113  osd
ceph-pdhiran-o85tl0-node5            10.0.210.89   osd
ceph-pdhiran-o85tl0-node6            10.0.209.209  mon mds mgr
ceph-pdhiran-o85tl0-node7            10.0.210.41   mon rgw
11 hosts in cluster

# ceph health detail
HEALTH_ERR failed to probe daemons or devices; Module 'cephadm' has failed: 'ceph-pdhiran-o85tl0-node9'
[WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
    host ceph-pdhiran-o85tl0-node9 `cephadm ceph-volume` failed: host address is empty
    host ceph-pdhiran-o85tl0-node9 `cephadm list-networks` failed: host address is empty
    host ceph-pdhiran-o85tl0-node9 `cephadm gather-facts` failed: host address is empty
[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 'ceph-pdhiran-o85tl0-node9'
    Module 'cephadm' has failed: 'ceph-pdhiran-o85tl0-node9'

Version-Release number of selected component (if applicable):
ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)

How reproducible:
1/1

Steps to Reproduce:
1. Deploy an RHCS 5.3 cluster, create pools, and write data.

2. Prepare to remove one OSD host from the cluster.

3. Perform drain operation on the host.
# ceph orch host drain ceph-pdhiran-o85tl0-node9
Scheduled to remove the following daemons from host 'ceph-pdhiran-o85tl0-node9'
type                 id
-------------------- ---------------
crash                ceph-pdhiran-o85tl0-node9
node-exporter        ceph-pdhiran-o85tl0-node9
osd                  14
osd                  4
osd                  9
osd                  19

4. Once the drain is complete, verify that no OSD remove/replace operations are pending, then remove the host from the cluster.
# ceph orch osd rm status -f json

[{"drain_done_at": "2023-03-28T12:02:33.990155Z", "drain_started_at": "2023-03-28T12:02:23.183372Z", "drain_stopped_at": null, "draining": false, "force": false, "hostname": "ceph-pdhiran-o85tl0-node9", "osd_id": 9, "process_started_at": "2023-03-28T12:02:00.879414Z", "replace": false, "started": true, "stopped": false, "zap": false}]
[root@ceph-pdhiran-o85tl0-node1-installer ~]# ceph orch osd rm status -f json

No OSD remove/replace operations reported

# ceph orch host rm ceph-pdhiran-o85tl0-node9
Removed  host 'ceph-pdhiran-o85tl0-node9'

5. After some time, observe that the cluster is in HEALTH_ERR state.
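
The drain-then-remove flow from steps 3 and 4 can be sketched as a single helper. This is only a restatement of the reproduction steps, not a verified fix; it assumes admin access to a live cluster managed by cephadm, and the host name and 30-second poll interval are illustrative:

```shell
# Sketch of the drain-then-remove sequence from this report.
# Assumes admin access to a cluster managed by the cephadm orchestrator.
drain_and_remove_host() {
    host=$1
    ceph orch host drain "$host"
    # Wait until 'ceph orch osd rm status' reports nothing pending before
    # removing the host itself.
    until ceph orch osd rm status | grep -q "No OSD remove/replace operations reported"; do
        sleep 30
    done
    ceph orch host rm "$host"
}
```

Even with this ordering, the report shows the cephadm module can retain a stale reference to the removed host, leaving the cluster in HEALTH_ERR until the active mgr is failed over.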

Actual results:
Cluster is in HEALTH_ERR state after the OSD host is removed.

Expected results:
Cluster reaches HEALTH_OK after the removal operation.

Additional info:
Failing over the active mgr (`ceph mgr fail`) clears the health error and the cluster returns to HEALTH_OK.
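
The workaround can be wrapped as a small helper. This is only a sketch of the commands already named above (the function name is illustrative), assuming admin access to the cluster:

```shell
# Workaround from this report: fail over the active mgr so the cephadm
# module restarts and drops its stale reference to the removed host.
clear_cephadm_host_error() {
    ceph mgr fail
    # After the standby mgr takes over, the cluster should report HEALTH_OK.
    ceph -s
}
```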