Bug 2104207

Summary: [Tracker for BZ #2138216] [MetroDR] Monitor crash - ceph_assert(0 == "how did we try and do stretch recovery while we have dead monitor buckets?")

| Field | Value |
| --- | --- |
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Reporter: | Josh Durgin <jdurgin> |
| Component: | ceph |
| Assignee: | Kamoltat (Junior) Sirivadhna <ksirivad> |
| ceph sub component: | RADOS |
| QA Contact: | akarsha <akrai> |
| Status: | CLOSED ERRATA |
| Docs Contact: | |
| Severity: | medium |
| Priority: | medium |
| CC: | akrai, bniver, ebenahar, gfarnum, kramdoss, ksirivad, mbukatov, muagarwa, nojha, odf-bz-bot, pdhiran, rtalur, rzarzyns, sheggodu, vereddy, vumrao |
| Version: | 4.8 |
| Target Milestone: | --- |
| Target Release: | ODF 4.14.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | 4.14.0-126 |
| Doc Type: | No Doc Update |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Clones: | 2138216 (view as bug list) |
| Environment: | |
| Last Closed: | 2023-11-08 18:49:50 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 2138216 |
| Bug Blocks: | |
Description
Josh Durgin
2022-07-05 19:03:44 UTC
This is supposed to be checking for the case when all the monitors from one of the (two) CRUSH buckets we are using as failure domains are still dead, but somehow we ended up trying to go into recovery mode. If I were to guess, either the condition doesn't work quite right when Rook creates new monitors in those zones, or Rook has created new monitors in different zones and that's bypassing some checks?

Hi Greg,

For this particular issue I don't think it is required for the `tiebreaker_monitor` to be down, because it gets filtered out when we are accounting for `dead_mon_buckets` and OSDs in the subtree of the `stretch_bucket_divider` (which in this case is a `zone`) in the function `maybe_go_degraded_stretch_mode()`: https://github.com/ceph/ceph/blob/main/src/mon/Monitor.cc#L6747-L6768.

Two scenarios that can surely cause the crash:

1. Fail a zone, go into `degraded_stretch_mode`, and somehow revive 2 OSDs in the downed zone. This makes the ratio `(osdmap.num_up_osd / (double) osdmap.num_osd)` exceed `cct->_conf.get_val<double>("mon_stretch_cluster_recovery_ratio")`: 4/6 > 0.6, which sends us into the assertion. This is what I think happened in the ODF cluster.

2. Fail a zone and add 3 or more OSDs to the cluster (assuming we don't have a max OSD limit), which also pushes `(osdmap.num_up_osd / (double) osdmap.num_osd)` above `mon_stretch_cluster_recovery_ratio`. Let's say we add 3 more OSDs: that raises the up ratio to 7/11 = 0.64 > 0.6, so we still go into the assertion.

The first scenario is what I think happened with the ODF cluster; I have noticed that Rook automatically tries to restart an OSD when it fails. Since the number of up OSDs came back from 4 to 6, as explained in my analysis above, it causes us to go into the assertion. In a normal vstart cluster, starting back up an OSD that is down and in a failed zone is not possible, since the OSD will try to `log_to_monitor`, but because all the monitors in the zone are down, the process will hang.

The second scenario is also something we should deal with, since it is possible for users to add OSDs.
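To make the failure path above concrete, here is a minimal, self-contained C++ model of the two checks involved. The struct and function names are illustrative only, not the actual Monitor/OSDMonitor code: the up/total OSD ratio alone triggers the jump into recovery stretch mode, while `go_recovery_stretch_mode()` asserts that no monitor bucket is still dead, so reviving OSDs in a zone whose monitors remain down trips the assertion.

```cpp
// Simplified, illustrative model of the decision path described above.
// The field names (num_up_osd, num_osd, recovery_ratio, dead_mon_buckets)
// mirror the values mentioned in the comment; this is NOT the Ceph source.
#include <cassert>
#include <iostream>
#include <map>
#include <set>
#include <string>

struct ClusterState {
  int num_up_osd;
  int num_osd;
  double recovery_ratio;  // models mon_stretch_cluster_recovery_ratio
  std::map<std::string, std::set<std::string>> dead_mon_buckets;  // zone -> dead mons
};

// Models the trigger: only the OSD up/total ratio is consulted, not whether
// any monitor buckets are still dead.
bool osd_ratio_allows_recovery(const ClusterState& s) {
  return static_cast<double>(s.num_up_osd) / s.num_osd > s.recovery_ratio;
}

// Models the monitor's sanity check: recovery must not start while any
// monitor bucket is still dead; this is the assert that fires in the bug.
void go_recovery_stretch_mode(const ClusterState& s) {
  assert(s.dead_mon_buckets.empty() &&
         "how did we try and do stretch recovery while we have dead monitor buckets?");
  std::cout << "entering recovery stretch mode\n";
}

int main() {
  // Scenario 1: a zone is down but Rook revived 2 of its OSDs, so 4 of 6
  // OSDs are up (4/6 ~= 0.67 > 0.6) while zone "pze"'s monitors stay dead.
  ClusterState s{4, 6, 0.6, {{"pze", {"c", "d"}}}};
  if (osd_ratio_allows_recovery(s)) {
    go_recovery_stretch_mode(s);  // aborts here, reproducing the crash
  }
  return 0;
}
```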
I have created a PR (https://github.com/ceph/ceph/pull/47340) with a standalone test along with the solution, and have tested it out. Here is the result for scenario 2:

With Patch:

ceph -s
```
  cluster:
    id:     ea6f50a3-a5b3-4fe5-817e-4ed89e254ed7
    health: HEALTH_WARN
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum a,b,e
            4 osds down
            2 hosts (4 osds) down
            1 zone (4 osds) down
            Degraded data redundancy: 4/8 objects degraded (50.000%), 1 pg degraded, 33 pgs undersized

  services:
    mon: 5 daemons, quorum a,b,e (age 7m), out of quorum: c, d
    mgr: x(active, since 13m), standbys: y, z
    osd: 11 osds: 7 up (since 2m), 11 in (since 3m); 9 remapped pgs

  data:
    pools:   2 pools, 33 pgs
    objects: 2 objects, 577 KiB
    usage:   11 GiB used, 1.1 TiB / 1.1 TiB avail
    pgs:     4/8 objects degraded (50.000%)
             23 active+undersized
             9  active+undersized+remapped
             1  active+undersized+degraded
```

ceph osd tree
```
ID   CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
 -1         1.08569  root default
 -5         0.39435      zone iris
 -9         0.19717          host node-2
  0    hdd  0.09859              osd.0        up   1.00000  1.00000
  1    hdd  0.09859              osd.1        up   1.00000  1.00000
-10         0.19717          host node-3
  2    hdd  0.09859              osd.2        up   1.00000  1.00000
  3    hdd  0.09859              osd.3        up   1.00000  1.00000
 -7         0.39435      zone pze
-11         0.19716          host node-4
  4    hdd  0.09859              osd.4      down   1.00000  1.00000
  5    hdd  0.09859              osd.5      down   1.00000  1.00000
-12         0.19716          host node-5
  6    hdd  0.09859              osd.6      down   1.00000  1.00000
  7    hdd  0.09859              osd.7      down   1.00000  1.00000
 -3         0.29700      host vossi03
  8    hdd  0.09900          osd.8            up   1.00000  1.00000
  9    hdd  0.09900          osd.9            up   1.00000  1.00000
 10    hdd  0.09900          osd.10           up   1.00000  1.00000
```

mon.b.log
```
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).osd e110 update_logger
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 Stretch mode enabled in this map
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader) e10 try_engage_stretch_mode
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 Degraded stretch mode set in this map
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader) e10 set_degraded_stretch_mode
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 prev_num_up_osd: 6
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 osdmap.num_up_osd: 7
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 osdmap.num_osd: 11
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 mon_stretch_cluster_recovery_ratio: 0.6
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).paxosservice(logm 1..352) refresh
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).log v352 update_from_paxos
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).log v352 update_from_paxos version 352 summary v 352
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).paxosservice(monmap 1..10) refresh
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).paxosservice(auth 1..16) refresh
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).auth v16 update_from_paxos
```

See how the ratio `num_up_osd/num_osd` is 7/11 = 0.636, which is more than `mon_stretch_cluster_recovery_ratio` (0.6), yet we don't go into the assertion.

Without Patch:

ceph -s
```
Command did nothing because of the crash in leader monitor. (cons of using vstart)
```

ceph osd tree
```
Command did nothing because of the crash in leader monitor. (cons of using vstart)
```

mon.b.log
```
   -17> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 Stretch mode enabled in this map
   -16> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 try_engage_stretch_mode
   -15> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 Degraded stretch mode set in this map
   -14> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 set_degraded_stretch_mode
   -13> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 prev_num_up_osd: 6
   -12> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 osdmap.num_up_osd: 7
   -11> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 osdmap.num_osd: 11
   -10> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 mon_stretch_cluster_recovery_ratio: 0.6
    -9> 2022-08-03T21:14:48.807+0000 7fd79548f700 10 mon.b@1(leader).osd e114 Enabling recovery stretch mode in this map
    -8> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 go_recovery_stretch_mode
    -7> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 is_leader(): 1
    -6> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 is_degraded_stretch_mode(): 1
    -5> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 is_recovering_stretch_mode(): 0
    -4> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 dead_mon_buckets.size(): 1
    -3> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 dead_mon_buckets: {pze=c,d}
    -2> 2022-08-03T21:14:48.807+0000 7fd797c94700  1 -- [v2:127.0.0.1:40321/0,v1:127.0.0.1:40322/0] <== osd.10 v2:127.0.0.1:6926/3076179 15 ==== mon_subscribe({osdmap=0}) v3 ==== 34+0+0 (secure 0 0 0) 0x5567396f1380 con 0x556738b14000
    -1> 2022-08-03T21:14:48.814+0000 7fd79548f700 -1 ../src/mon/Monitor.cc: In function 'void Monitor::go_recovery_stretch_mode()' thread 7fd79548f700 time 2022-08-03T21:14:48.807548+0000
../src/mon/Monitor.cc: 6718: FAILED ceph_assert(0 == "how did we try and do stretch recovery while we have dead monitor buckets?")

 ceph version 17.0.0-13939-g5bff94917d1 (5bff94917d17ba4d45185766bd2815b1c4a00660) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7fd7a43616c5]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7fd7a43618f7]
 3: (Monitor::go_recovery_stretch_mode()+0x747) [0x556730d4c2d9]
 4: (OSDMonitor::update_from_paxos(bool*)+0x3be7) [0x556730f4db7f]
 5: (PaxosService::refresh(bool*)+0x102) [0x556730eb8ede]
 6: (Monitor::refresh_from_paxos(bool*)+0x360) [0x556730d2d7b4]
 7: (Paxos::do_refresh()+0x4c) [0x556730ea929a]
 8: (Paxos::commit_finish()+0xa98) [0x556730eb3202]
 9: (C_Committed::finish(int)+0x89) [0x556730eb7755]
```

Without the patch, we go into the assertion under the same condition: `num_up_osd/num_osd` is 7/11 = 0.636, which is more than `mon_stretch_cluster_recovery_ratio` (0.6).

Let me know what you think.

I'm on PTO so will have to look at the details more when I get back. But OSDs should be blocked from connecting to monitors which aren't in their stretch zone (technically, their CRUSH bucket which is used as the divider), so OSDs from a dead zone can't come up without at least one of their monitors returning. If an admin/Rook manages to add nodes to a live zone but not add matching ones to the dead zone, I guess that could break the checks? Do we think that's what happened here?
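For readers comparing the two runs above, here is a minimal, hypothetical C++ sketch of the guard behavior the patched monitor exhibits (the actual change lives in PR #47340; the struct and function names here are illustrative only, not the Ceph source): crossing the OSD ratio threshold is no longer enough to enter recovery stretch mode while any monitor bucket is still dead.

```cpp
// Hypothetical illustration of the behavior visible in the "With Patch" log:
// the cluster stays in degraded stretch mode, despite 7/11 > 0.6, because
// zone "pze" still has dead monitors. Not the actual Ceph implementation.
#include <iostream>
#include <map>
#include <set>
#include <string>

struct ClusterState {
  int num_up_osd;
  int num_osd;
  double recovery_ratio;  // models mon_stretch_cluster_recovery_ratio
  std::map<std::string, std::set<std::string>> dead_mon_buckets;  // zone -> dead mons
};

bool should_enter_recovery(const ClusterState& s) {
  bool ratio_ok =
      static_cast<double>(s.num_up_osd) / s.num_osd > s.recovery_ratio;
  // The extra condition: stay in degraded stretch mode until every monitor
  // bucket has at least one live monitor again.
  return ratio_ok && s.dead_mon_buckets.empty();
}

int main() {
  // Same state as the logs above: 7/11 ~= 0.636 > 0.6, but zone "pze"'s
  // monitors c and d are still down, so we do not enter recovery.
  ClusterState degraded{7, 11, 0.6, {{"pze", {"c", "d"}}}};
  std::cout << (should_enter_recovery(degraded) ? "recovery" : "stay degraded")
            << "\n";

  // Once the dead zone's monitors return, recovery can proceed.
  ClusterState healed{7, 11, 0.6, {}};
  std::cout << (should_enter_recovery(healed) ? "recovery" : "stay degraded")
            << "\n";
  return 0;
}
```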
Hi Pawan,

Review this bug and see if you can recreate it on standalone RHCS.

Regards,
Veera

Neha/Veera, do we have a ceph clone tracking the fix for downstream?

I don't think so, 6.1 is fine. Thanks for creating the clone.

Nope, all done. Thanks.

QE is planning to verify this BZ using the steps mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2069405#c0. Talur, if this is not correct or anything else needs to be covered, please let us know.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832