Bug 2182377 - Cluster in Health_err state post removal of Hosts from cluster
Summary: Cluster in Health_err state post removal of Hosts from cluster
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 7.1
Assignee: Adam King
QA Contact: Mohit Bisht
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-28 13:21 UTC by Pawan
Modified: 2023-07-06 17:49 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
Red Hat Issue Tracker RHCEPH-6331 (Last Updated: 2023-03-28 13:22:48 UTC)

Description Pawan 2023-03-28 13:21:23 UTC
Description of problem:
The cluster enters HEALTH_ERR state after a host is removed from the cluster.

# ceph -s
  cluster:
    id:     a6832daa-ccab-11ed-afae-fa163ec3bb1a
    health: HEALTH_ERR
            failed to probe daemons or devices
            Module 'cephadm' has failed: 'ceph-pdhiran-o85tl0-node9'

  services:
    mon: 4 daemons, quorum ceph-pdhiran-o85tl0-node2,ceph-pdhiran-o85tl0-node6,ceph-pdhiran-o85tl0-node7,ceph-pdhiran-o85tl0-node11 (age 41m)
    mgr: ceph-pdhiran-o85tl0-node6.hoglst(active, since 5h), standbys: ceph-pdhiran-o85tl0-node2.zwnqgb
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 64m), 24 in (since 112m)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 449 pgs
    objects: 250 objects, 32 KiB
    usage:   1.0 GiB used, 599 GiB / 600 GiB avail
    pgs:     449 active+clean

[root@ceph-pdhiran-o85tl0-node1-installer ~]#
[root@ceph-pdhiran-o85tl0-node1-installer ~]# ceph orch host ls
HOST                                 ADDR          LABELS                    STATUS
ceph-pdhiran-o85tl0-node1-installer  10.0.210.201  _admin
ceph-pdhiran-o85tl0-node10           10.0.208.231  osd
ceph-pdhiran-o85tl0-node11           10.0.211.10   mon rgw
ceph-pdhiran-o85tl0-node12           10.0.208.219  osd-bak _no_schedule
ceph-pdhiran-o85tl0-node13           10.0.210.103  osd-bak _no_schedule
ceph-pdhiran-o85tl0-node2            10.0.208.239  mon mds alertmanager mgr
ceph-pdhiran-o85tl0-node3            10.0.211.84   osd
ceph-pdhiran-o85tl0-node4            10.0.209.113  osd
ceph-pdhiran-o85tl0-node5            10.0.210.89   osd
ceph-pdhiran-o85tl0-node6            10.0.209.209  mon mds mgr
ceph-pdhiran-o85tl0-node7            10.0.210.41   mon rgw
11 hosts in cluster

# ceph health detail
HEALTH_ERR failed to probe daemons or devices; Module 'cephadm' has failed: 'ceph-pdhiran-o85tl0-node9'
[WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
    host ceph-pdhiran-o85tl0-node9 `cephadm ceph-volume` failed: host address is empty
    host ceph-pdhiran-o85tl0-node9 `cephadm list-networks` failed: host address is empty
    host ceph-pdhiran-o85tl0-node9 `cephadm gather-facts` failed: host address is empty
[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 'ceph-pdhiran-o85tl0-node9'
    Module 'cephadm' has failed: 'ceph-pdhiran-o85tl0-node9'

Version-Release number of selected component (if applicable):
ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)

How reproducible:
1/1

Steps to Reproduce:
1. Deploy an RHCS 5.3 cluster, create pools, and write data.
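
A minimal sketch of the pool creation and data write for this step (the pool name and bench duration are placeholders, not the exact values used in this run):

# ceph osd pool create test_pool
# ceph osd pool application enable test_pool rbd
# rados bench -p test_pool 30 write --no-cleanup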

2. Prepare to remove one OSD host from the cluster.
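
One way to confirm which daemons and OSDs are placed on the host before draining it (hostname as used in this reproduction):

# ceph orch ps ceph-pdhiran-o85tl0-node9
# ceph osd tree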

3. Perform drain operation on the host.
# ceph orch host drain ceph-pdhiran-o85tl0-node9
Scheduled to remove the following daemons from host 'ceph-pdhiran-o85tl0-node9'
type                 id
-------------------- ---------------
crash                ceph-pdhiran-o85tl0-node9
node-exporter        ceph-pdhiran-o85tl0-node9
osd                  14
osd                  4
osd                  9
osd                  19

4. Once the drain is complete, verify that no OSD remove/replace operations remain, then remove the host from the cluster (a simple wait-loop sketch follows the output below).
# ceph orch osd rm status -f json

[{"drain_done_at": "2023-03-28T12:02:33.990155Z", "drain_started_at": "2023-03-28T12:02:23.183372Z", "drain_stopped_at": null, "draining": false, "force": false, "hostname": "ceph-pdhiran-o85tl0-node9", "osd_id": 9, "process_started_at": "2023-03-28T12:02:00.879414Z", "replace": false, "started": true, "stopped": false, "zap": false}]
[root@ceph-pdhiran-o85tl0-node1-installer ~]# ceph orch osd rm status -f json

No OSD remove/replace operations reported

# ceph orch host rm ceph-pdhiran-o85tl0-node9
Removed  host 'ceph-pdhiran-o85tl0-node9'
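
For reference, a wait loop along these lines can be used to confirm the drain has finished before issuing the host removal; it only polls the same status command shown above and exits once no removal operations are reported:

# while ceph orch osd rm status -f json | grep -q osd_id; do sleep 30; done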

5. After some time, observe that the cluster is in HEALTH_ERR state.

Actual results:
Cluster enters HEALTH_ERR state after the OSD host removal

Expected results:
Cluster should be HEALTH_OK after the host removal operation

Additional info:
Failing over the active mgr (ceph mgr fail) clears the health error and the cluster returns to HEALTH_OK.
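
For reference, the workaround applied here:

# ceph mgr fail
# ceph -s   # health returns to HEALTH_OK once the standby mgr takes over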

