Description of problem:
Reduced the number of monitors from 5 to 3; one of the removed monitors is reported as stray.

Version-Release number of selected component (if applicable):
ceph version 16.1.0-1323.el8cp

How reproducible:
Tried once

Steps to Reproduce:
1. Bootstrap a cluster
2. Add OSDs and other daemons
3. Reduce the monitor count from the default 5 to 3

Actual results:
[ceph: root@pluto002 /]# ceph orch apply mon 3
Scheduled mon update...
[ceph: root@pluto002 /]# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon mon.pluto009 on host pluto009 not managed by cephadm

Expected results:
Monitors slated for removal should be removed completely, with no stray-daemon warning left behind.

Additional info:
I've been able to reproduce this very consistently:

[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE ID
alertmanager               1/1      15s ago    9m   count:1    0881eb8f169f
crash                      5/5      17s ago    10m  *          54d9b6d18015
grafana                    1/1      15s ago    9m   count:1    80728b29ad3f
mgr                        2/2      16s ago    10m  count:2    54d9b6d18015
mon                        5/5      17s ago    10m  count:5    54d9b6d18015
node-exporter              5/5      17s ago    9m   *          e5a616e4b9cf
osd.all-available-devices  10/10    17s ago    9m   *          54d9b6d18015
prometheus                 1/1      15s ago    9m   count:1    de242295e225

[ceph: root@vm-00 /]# ceph orch apply mon 3
Scheduled mon update...

[ceph: root@vm-00 /]# ceph -s
  cluster:
    id:     a6bfc010-98a7-11eb-b62b-525400eecc6e
    health: HEALTH_WARN
            1 stray daemon(s) not managed by cephadm

  services:
    mon: 3 daemons, quorum vm-00,vm-02,vm-01 (age 31s)
    mgr: vm-00.ihenko(active, since 10m), standbys: vm-04.lryzgm
    osd: 10 osds: 10 up (since 4m), 10 in (since 4m)

  data:
    pools:   1 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   65 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     256 active+clean

[ceph: root@vm-00 /]# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon mon.vm-04 on host vm-04 not managed by cephadm

Also, I checked and it seems the mon.vm-04 daemon was properly shut down. This is just an issue of reporting a "stray" daemon when the daemon doesn't exist at all. I haven't figured out the cause yet, but I'll keep looking.
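To make the symptom concrete, here is a hypothetical, much-simplified sketch of what a stray-daemon check amounts to: cephadm warns when the cluster reports a daemon that the orchestrator did not deploy. The function name `find_stray_daemons` and the data below are illustrative only, not the real cephadm code. In this bug, mon.vm-04 is gone from the host, yet something cluster-side still reports it, so it lands in the "stray" set:

```python
def find_stray_daemons(cluster_daemons, cephadm_daemons):
    """Daemons the cluster reports that cephadm does not manage.

    Simplified model: real cephadm compares daemons seen cluster-side
    (e.g. via the mon map and mgr service map) against its own
    inventory of deployed daemons.
    """
    return sorted(set(cluster_daemons) - set(cephadm_daemons))


# After `ceph orch apply mon 3`, cephadm manages only three mons,
# but the cluster still reports the removed mon.vm-04:
cluster = ["mon.vm-00", "mon.vm-01", "mon.vm-02", "mon.vm-04"]
managed = ["mon.vm-00", "mon.vm-01", "mon.vm-02"]

print(find_stray_daemons(cluster, managed))  # ['mon.vm-04']
```

The warning text in `ceph health detail` is exactly this set, one `stray daemon X on host Y` line per entry; the bug is that mon.vm-04 appears in it even though the daemon was cleanly removed.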
This will miss the dev freeze for 5.1.
Moving to 5.2 for now.
Adam asked me to look at this, and I believe it is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1945266. Looking at the logs in http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/num_mon_to_3/, we hit:

``
7f348ecf8700 time 2021-03-31T14:28:43.421216+0000
/builddir/build/BUILD/ceph-16.1.0-1323-g7e7e1f4e/src/mon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())

ceph version 16.1.0-1323.el8cp (46ac37397f0332c20aceceb8022a1ac1ddf8fa73) pacific (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f349a0693b8]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2765d2) [0x7f349a0695d2]
 3: (Elector::send_peer_ping(int, utime_t const*)+0x448) [0x55a4b92a5868]
 4: (Elector::ping_check(int)+0x30f) [0x55a4b92a618f]
 5: (Context::complete(int)+0xd) [0x55a4b9226fdd]
 6: (SafeTimer::timer_thread()+0x1b7) [0x7f349a157be7]
 7: (SafeTimerThread::entry()+0x11) [0x7f349a1591c1]
 8: /lib64/libpthread.so.0(+0x815a) [0x7f3497b5d15a]
 9: clone()
``

*** This bug has been marked as a duplicate of bug 1945266 ***
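For illustration, a hypothetical Python sketch of the failure mode suggested by the backtrace (the function `peer_addr` and the exact timing are my assumptions, not the actual MonMap/Elector code): the monmap shrinks from 5 ranks to 3, but a timer-driven ping (`Elector::ping_check` -> `Elector::send_peer_ping`) can still target a stale rank index, so the bounds check `ceph_assert(m < ranks.size())` fires:

```python
ranks = ["vm-00", "vm-01", "vm-02", "vm-03", "vm-04"]  # monmap with 5 mons
ranks = ranks[:3]                                      # reduced to 3 mons


def peer_addr(m, ranks):
    """Look up rank m. Mirrors the bounds check ceph_assert(m < ranks.size())
    at MonMap.h:404; this helper itself is illustrative, not Ceph code."""
    assert m < len(ranks), f"rank {m} out of range (monmap has {len(ranks)} ranks)"
    return ranks[m]


# A ping scheduled before the shrink still references old rank 4:
try:
    peer_addr(4, ranks)
except AssertionError as e:
    print("FAILED ceph_assert:", e)
```

If that reading is right, the mon aborts mid-shutdown, which would also explain why cephadm's accounting is left inconsistent and the removed mon shows up as stray.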