Bug 1945272
| Summary: | [orchestrator] one monitor reported as stray after reducing monitors from 5 to 3 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasishta <vashastr> |
| Component: | Cephadm | Assignee: | Adam King <adking> |
| Status: | CLOSED DUPLICATE | QA Contact: | Vasishta <vashastr> |
| Severity: | high | Docs Contact: | Ranjini M N <rmandyam> |
| Priority: | high | | |
| Version: | 5.0 | CC: | adking, gsitlani, ksirivad, pcuzner, rmandyam, sangadi, twilkins |
| Target Milestone: | --- | | |
| Target Release: | 6.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | .The Ceph monitors are reported as stray daemons even after removal from the {storage-product} cluster Cephadm reports the Ceph monitors as stray daemons even though they have been removed from the storage cluster. To work around this issue, run the `ceph mgr fail` command, which allows the manager to restart and clear the error. If there is no standby manager, the `ceph mgr fail` command makes the cluster temporarily unresponsive. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-18 01:26:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1959686 | | |
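The Doc Text workaround above warns that `ceph mgr fail` makes the cluster temporarily unresponsive when no standby manager exists. A minimal, hedged sketch of checking for a standby first, assuming the JSON emitted by `ceph mgr dump` carries a `standbys` list (the helper names here are illustrative, not part of cephadm):

```python
import json
import subprocess

def safe_to_fail_mgr(mgr_dump: dict) -> bool:
    """Return True only if at least one standby mgr exists, so that
    `ceph mgr fail` will not leave the cluster without a manager."""
    return len(mgr_dump.get("standbys", [])) > 0

def clear_stray_warning():
    # Hypothetical helper: fail the active mgr only when a standby can take over.
    dump = json.loads(subprocess.check_output(["ceph", "mgr", "dump"]))
    if safe_to_fail_mgr(dump):
        subprocess.run(["ceph", "mgr", "fail"], check=True)
    else:
        print("no standby mgr; failing now would make the cluster unresponsive")
```

In the `ceph -s` output below, `vm-04.lryzgm` is listed as a standby, so the workaround would have been safe to run on this cluster.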
I've been able to reproduce this very consistently:

```
[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE ID
alertmanager                   1/1  15s ago    9m   count:1    0881eb8f169f
crash                          5/5  17s ago    10m  *          54d9b6d18015
grafana                        1/1  15s ago    9m   count:1    80728b29ad3f
mgr                            2/2  16s ago    10m  count:2    54d9b6d18015
mon                            5/5  17s ago    10m  count:5    54d9b6d18015
node-exporter                  5/5  17s ago    9m   *          e5a616e4b9cf
osd.all-available-devices    10/10  17s ago    9m   *          54d9b6d18015
prometheus                     1/1  15s ago    9m   count:1    de242295e225
```
```
[ceph: root@vm-00 /]# ceph orch apply mon 3
Scheduled mon update...
[ceph: root@vm-00 /]# ceph -s
  cluster:
    id:     a6bfc010-98a7-11eb-b62b-525400eecc6e
    health: HEALTH_WARN
            1 stray daemon(s) not managed by cephadm

  services:
    mon: 3 daemons, quorum vm-00,vm-02,vm-01 (age 31s)
    mgr: vm-00.ihenko(active, since 10m), standbys: vm-04.lryzgm
    osd: 10 osds: 10 up (since 4m), 10 in (since 4m)

  data:
    pools:   1 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   65 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     256 active+clean

[ceph: root@vm-00 /]# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon mon.vm-04 on host vm-04 not managed by cephadm
```
Also, I checked and it seems the mon.vm-04 daemon was properly shut down. This is just an issue of reporting a "stray" daemon when the daemon doesn't exist at all. I haven't figured out the cause yet, but I'll keep looking.
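For context, the `CEPHADM_STRAY_DAEMON` warning amounts to a set difference: daemons the cluster maps still report, minus daemons cephadm's inventory manages. A simplified, hypothetical model (not cephadm's actual code) showing how a stale entry for a removed monitor produces a "stray" report even though the daemon no longer exists:

```python
def find_stray_daemons(cluster_known: set[str], cephadm_managed: set[str]) -> set[str]:
    """Daemons the cluster maps still report but cephadm does not manage."""
    return cluster_known - cephadm_managed

# After `ceph orch apply mon 3`, cephadm stops managing mon.vm-03 and
# mon.vm-04, but a stale map still lists mon.vm-04:
cluster_known = {"mon.vm-00", "mon.vm-01", "mon.vm-02", "mon.vm-04"}
cephadm_managed = {"mon.vm-00", "mon.vm-01", "mon.vm-02"}
print(find_stray_daemons(cluster_known, cephadm_managed))  # {'mon.vm-04'}
```

This matches the symptom above: the warning names a daemon that was already shut down, so the stale side of the comparison, not the host, is at fault.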
This will miss the dev freeze for 5.1; moving to 5.2 for now.

Adam asked me to look at this, and I believe it is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1945266: looking at the logs in http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/num_mon_to_3/, we hit

```
7f348ecf8700 time 2021-03-31T14:28:43.421216+0000
/builddir/build/BUILD/ceph-16.1.0-1323-g7e7e1f4e/src/mon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())

 ceph version 16.1.0-1323.el8cp (46ac37397f0332c20aceceb8022a1ac1ddf8fa73) pacific (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f349a0693b8]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2765d2) [0x7f349a0695d2]
 3: (Elector::send_peer_ping(int, utime_t const*)+0x448) [0x55a4b92a5868]
 4: (Elector::ping_check(int)+0x30f) [0x55a4b92a618f]
 5: (Context::complete(int)+0xd) [0x55a4b9226fdd]
 6: (SafeTimer::timer_thread()+0x1b7) [0x7f349a157be7]
 7: (SafeTimerThread::entry()+0x11) [0x7f349a1591c1]
 8: /lib64/libpthread.so.0(+0x815a) [0x7f3497b5d15a]
 9: clone()
```

*** This bug has been marked as a duplicate of bug 1945266 ***
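The assertion `ceph_assert(m < ranks.size())` in `src/mon/MonMap.h` fires when code indexes a monitor rank that no longer exists after the monmap shrank. A hypothetical Python analogue of that failure mode (not the actual C++ code path): a ping scheduled against rank 4 of a five-monitor map is delivered after the map has been reduced to three entries.

```python
class MonMap:
    """Toy stand-in for Ceph's MonMap: an ordered list of monitor ranks."""

    def __init__(self, ranks):
        self.ranks = list(ranks)

    def get_name(self, m: int) -> str:
        # Mirrors ceph_assert(m < ranks.size()) in src/mon/MonMap.h
        assert m < len(self.ranks), f"rank {m} out of range for {self.ranks}"
        return self.ranks[m]

monmap = MonMap(["vm-00", "vm-01", "vm-02", "vm-03", "vm-04"])
pending_ping_target = 4          # a timer-scheduled ping_check aimed at rank 4
monmap.ranks = monmap.ranks[:3]  # monmap shrinks from 5 mons to 3
try:
    monmap.get_name(pending_ping_target)
except AssertionError:
    print("FAILED ceph_assert(m < ranks.size())")
```

In Ceph the assert aborts the daemon rather than raising an exception, which is consistent with the `Elector::send_peer_ping` / `SafeTimer::timer_thread` frames in the backtrace above.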
Description of problem:
Reduced the number of monitors from 5 to 3; one of the monitors is reported as stray.

Version-Release number of selected component (if applicable):
ceph version 16.1.0-1323.el8cp

How reproducible:
Tried once

Steps to Reproduce:
1. Bootstrap a cluster
2. Add OSDs and other daemons
3. Reduce the monitor count to 3 from the default 5

Actual results:

```
[ceph: root@pluto002 /]# ceph orch apply mon 3
Scheduled mon update...
[ceph: root@pluto002 /]# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon mon.pluto009 on host pluto009 not managed by cephadm
```

Expected results:
The removed monitors should be removed completely.

Additional info: