Bug 1945272
| Summary: | [orchestrator] one monitor reported as stray after reducing monitors from 5 to 3 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasishta <vashastr> |
| Component: | Cephadm | Assignee: | Adam King <adking> |
| Status: | CLOSED DUPLICATE | QA Contact: | Vasishta <vashastr> |
| Severity: | high | Docs Contact: | Ranjini M N <rmandyam> |
| Priority: | high | | |
| Version: | 5.0 | CC: | adking, gsitlani, ksirivad, pcuzner, rmandyam, sangadi, twilkins |
| Target Milestone: | --- | | |
| Target Release: | 6.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | .The Ceph monitors are reported as stray daemons even after removal from the {storage-product} cluster Cephadm reports the Ceph monitors as stray daemons even though they have been removed from the storage cluster. To work around this issue, run the `ceph mgr fail` command, which allows the manager to restart and clear the error. If there is no standby manager, the `ceph mgr fail` command makes the cluster temporarily unresponsive. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-18 01:26:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1959686 | | |
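The Doc Text workaround above warns that `ceph mgr fail` makes the cluster temporarily unresponsive when no standby manager exists. A minimal, hedged sketch of checking for a standby first, assuming the JSON emitted by `ceph mgr dump` carries a `standbys` list (the helper names here are illustrative, not part of cephadm):

```python
import json
import subprocess

def safe_to_fail_mgr(mgr_dump: dict) -> bool:
    """Return True only if at least one standby mgr exists, so that
    `ceph mgr fail` will not leave the cluster without a manager."""
    return len(mgr_dump.get("standbys", [])) > 0

def clear_stray_warning():
    # Hypothetical helper: fail the active mgr only when a standby can take over.
    dump = json.loads(subprocess.check_output(["ceph", "mgr", "dump"]))
    if safe_to_fail_mgr(dump):
        subprocess.run(["ceph", "mgr", "fail"], check=True)
    else:
        print("no standby mgr; failing now would make the cluster unresponsive")
```

In the `ceph -s` output below, `vm-04.lryzgm` is listed as a standby, so the workaround would have been safe to run on this cluster.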
I've been able to reproduce this very consistently:

```
[ceph: root@vm-00 /]# ceph orch ls
NAME                       RUNNING  REFRESHED  AGE  PLACEMENT  IMAGE ID
alertmanager                   1/1  15s ago    9m   count:1    0881eb8f169f
crash                          5/5  17s ago    10m  *          54d9b6d18015
grafana                        1/1  15s ago    9m   count:1    80728b29ad3f
mgr                            2/2  16s ago    10m  count:2    54d9b6d18015
mon                            5/5  17s ago    10m  count:5    54d9b6d18015
node-exporter                  5/5  17s ago    9m   *          e5a616e4b9cf
osd.all-available-devices    10/10  17s ago    9m   *          54d9b6d18015
prometheus                     1/1  15s ago    9m   count:1    de242295e225
```
```
[ceph: root@vm-00 /]# ceph orch apply mon 3
Scheduled mon update...
[ceph: root@vm-00 /]# ceph -s
  cluster:
    id:     a6bfc010-98a7-11eb-b62b-525400eecc6e
    health: HEALTH_WARN
            1 stray daemon(s) not managed by cephadm

  services:
    mon: 3 daemons, quorum vm-00,vm-02,vm-01 (age 31s)
    mgr: vm-00.ihenko(active, since 10m), standbys: vm-04.lryzgm
    osd: 10 osds: 10 up (since 4m), 10 in (since 4m)

  data:
    pools:   1 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   65 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     256 active+clean

[ceph: root@vm-00 /]# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon mon.vm-04 on host vm-04 not managed by cephadm
```
Also, I checked and it seems the mon.vm-04 daemon was properly shut down. This is just an issue of reporting a "stray" daemon when the daemon doesn't exist at all. I haven't figured out the cause yet, but I'll keep looking.
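For context, the `CEPHADM_STRAY_DAEMON` warning amounts to a set difference: daemons the cluster maps still report, minus daemons cephadm's inventory manages. A simplified, hypothetical model (not cephadm's actual code) showing how a stale entry for a removed monitor produces a "stray" report even though the daemon no longer exists:

```python
def find_stray_daemons(cluster_known: set[str], cephadm_managed: set[str]) -> set[str]:
    """Daemons the cluster maps still report but cephadm does not manage."""
    return cluster_known - cephadm_managed

# After `ceph orch apply mon 3`, cephadm stops managing mon.vm-03 and
# mon.vm-04, but a stale map still lists mon.vm-04:
cluster_known = {"mon.vm-00", "mon.vm-01", "mon.vm-02", "mon.vm-04"}
cephadm_managed = {"mon.vm-00", "mon.vm-01", "mon.vm-02"}
print(find_stray_daemons(cluster_known, cephadm_managed))  # {'mon.vm-04'}
```

This matches the symptom above: the warning names a daemon that was already shut down, so the stale side of the comparison, not the host, is at fault.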
This will miss the dev freeze for 5.1; moving to 5.2 for now.

Adam asked me to look at this, and I believe it is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1945266: looking at the logs in http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/num_mon_to_3/, we hit

```
7f348ecf8700 time 2021-03-31T14:28:43.421216+0000
/builddir/build/BUILD/ceph-16.1.0-1323-g7e7e1f4e/src/mon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())

 ceph version 16.1.0-1323.el8cp (46ac37397f0332c20aceceb8022a1ac1ddf8fa73) pacific (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f349a0693b8]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2765d2) [0x7f349a0695d2]
 3: (Elector::send_peer_ping(int, utime_t const*)+0x448) [0x55a4b92a5868]
 4: (Elector::ping_check(int)+0x30f) [0x55a4b92a618f]
 5: (Context::complete(int)+0xd) [0x55a4b9226fdd]
 6: (SafeTimer::timer_thread()+0x1b7) [0x7f349a157be7]
 7: (SafeTimerThread::entry()+0x11) [0x7f349a1591c1]
 8: /lib64/libpthread.so.0(+0x815a) [0x7f3497b5d15a]
 9: clone()
```

*** This bug has been marked as a duplicate of bug 1945266 ***
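The assertion `ceph_assert(m < ranks.size())` in `src/mon/MonMap.h` fires when code indexes a monitor rank that no longer exists after the monmap shrank. A hypothetical Python analogue of that failure mode (not the actual C++ code path): a ping scheduled against rank 4 of a five-monitor map is delivered after the map has been reduced to three entries.

```python
class MonMap:
    """Toy stand-in for Ceph's MonMap: an ordered list of monitor ranks."""

    def __init__(self, ranks):
        self.ranks = list(ranks)

    def get_name(self, m: int) -> str:
        # Mirrors ceph_assert(m < ranks.size()) in src/mon/MonMap.h
        assert m < len(self.ranks), f"rank {m} out of range for {self.ranks}"
        return self.ranks[m]

monmap = MonMap(["vm-00", "vm-01", "vm-02", "vm-03", "vm-04"])
pending_ping_target = 4          # a timer-scheduled ping_check aimed at rank 4
monmap.ranks = monmap.ranks[:3]  # monmap shrinks from 5 mons to 3
try:
    monmap.get_name(pending_ping_target)
except AssertionError:
    print("FAILED ceph_assert(m < ranks.size())")
```

In Ceph the assert aborts the daemon rather than raising an exception, which is consistent with the `Elector::send_peer_ping` / `SafeTimer::timer_thread` frames in the backtrace above.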
Description of problem:
Reduced the number of monitors from 5 to 3; one of the monitors is reported as stray.

Version-Release number of selected component (if applicable):
ceph version 16.1.0-1323.el8cp

How reproducible:
Tried once

Steps to Reproduce:
1. Bootstrap a cluster
2. Add OSDs and other daemons
3. Reduce the monitor count to 3 from the default 5

Actual results:

```
[ceph: root@pluto002 /]# ceph orch apply mon 3
Scheduled mon update...
[ceph: root@pluto002 /]# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon mon.pluto009 on host pluto009 not managed by cephadm
```

Expected results:
The removed monitors should be removed completely.

Additional info: