Bug 2104207

Summary: [Tracker for BZ #2138216] [MetroDR] Monitor crash - ceph_assert(0 == "how did we try and do stretch recovery while we have dead monitor buckets?")

| Field | Value |
| --- | --- |
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Reporter: | Josh Durgin <jdurgin> |
| Component: | ceph |
| Assignee: | Kamoltat (Junior) Sirivadhna <ksirivad> |
| ceph sub component: | RADOS |
| QA Contact: | akarsha <akrai> |
| Status: | CLOSED ERRATA |
| Docs Contact: | |
| Severity: | medium |
| Priority: | medium |
| CC: | akrai, bniver, ebenahar, gfarnum, kramdoss, ksirivad, mbukatov, muagarwa, nojha, odf-bz-bot, pdhiran, rtalur, rzarzyns, sheggodu, vereddy, vumrao |
| Version: | 4.8 |
| Target Milestone: | --- |
| Target Release: | ODF 4.14.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | 4.14.0-126 |
| Doc Type: | No Doc Update |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Clones: | 2138216 (view as bug list) |
| Environment: | |
| Last Closed: | 2023-11-08 18:49:50 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 2138216 |
| Bug Blocks: | |
Description
Josh Durgin
2022-07-05 19:03:44 UTC
This is supposed to be checking for the case when all the monitors from one of the (two) CRUSH buckets we are using as failure domains are still dead, but somehow we ended up trying to go into recovery mode. If I were to guess, either the condition doesn't work quite right when Rook creates new monitors in those zones, or Rook has created new monitors in different zones and that's bypassing some checks?

Hi Greg,

For this particular issue I don't think it is required for the `tiebreaker_monitor` to be down, because it gets filtered out when we are accounting for `dead_mon_buckets` and OSDs in the subtree of the `stretch_bucket_divider` (which in this case is a `zone`) in the function `maybe_go_degraded_stretch_mode()`: https://github.com/ceph/ceph/blob/main/src/mon/Monitor.cc#L6747-L6768.

Two scenarios that can surely cause the crash:

1. Fail a zone, go into `degraded_stretch_mode`, and somehow revive 2 OSDs in the downed zone. This makes the ratio `(osdmap.num_up_osd / (double) osdmap.num_osd)` exceed `cct->_conf.get_val<double>("mon_stretch_cluster_recovery_ratio")`: 4/6 > 0.6, which sends us into the assertion. This is what I think happened in the ODF cluster.

2. Fail a zone and add 3 or more OSDs to the cluster (assuming we don't have a max OSD limit), which also pushes `(osdmap.num_up_osd / (double) osdmap.num_osd)` above `mon_stretch_cluster_recovery_ratio`. Let's say we add 3 more OSDs: that raises the up ratio to 7/11 = 0.64 > 0.6, so we still go into the assertion.

The first scenario is what I think happened with the ODF cluster; I have noticed that Rook automatically tries to restart an OSD when it fails. Since the number of up OSDs came back from 4 to 6, as explained in my analysis above, it causes us to go into the assertion. In a normal vstart cluster, starting back up an OSD that is down and in a failed zone is not possible, since the OSD will try to `log_to_monitor`, but because all the monitors in the zone are down, the process will hang.

The second scenario is also something we should deal with, since it is possible for users to add OSDs.
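To make the failure path above concrete, here is a minimal, self-contained C++ model of the two checks involved. The struct and function names are illustrative only, not the actual Monitor/OSDMonitor code: the up/total OSD ratio alone triggers the jump into recovery stretch mode, while `go_recovery_stretch_mode()` asserts that no monitor bucket is still dead, so reviving OSDs in a zone whose monitors remain down trips the assertion.

```cpp
// Simplified, illustrative model of the decision path described above.
// The field names (num_up_osd, num_osd, recovery_ratio, dead_mon_buckets)
// mirror the values mentioned in the comment; this is NOT the Ceph source.
#include <cassert>
#include <iostream>
#include <map>
#include <set>
#include <string>

struct ClusterState {
  int num_up_osd;
  int num_osd;
  double recovery_ratio;  // models mon_stretch_cluster_recovery_ratio
  std::map<std::string, std::set<std::string>> dead_mon_buckets;  // zone -> dead mons
};

// Models the trigger: only the OSD up/total ratio is consulted, not whether
// any monitor buckets are still dead.
bool osd_ratio_allows_recovery(const ClusterState& s) {
  return static_cast<double>(s.num_up_osd) / s.num_osd > s.recovery_ratio;
}

// Models the monitor's sanity check: recovery must not start while any
// monitor bucket is still dead; this is the assert that fires in the bug.
void go_recovery_stretch_mode(const ClusterState& s) {
  assert(s.dead_mon_buckets.empty() &&
         "how did we try and do stretch recovery while we have dead monitor buckets?");
  std::cout << "entering recovery stretch mode\n";
}

int main() {
  // Scenario 1: a zone is down but Rook revived 2 of its OSDs, so 4 of 6
  // OSDs are up (4/6 ~= 0.67 > 0.6) while zone "pze"'s monitors stay dead.
  ClusterState s{4, 6, 0.6, {{"pze", {"c", "d"}}}};
  if (osd_ratio_allows_recovery(s)) {
    go_recovery_stretch_mode(s);  // aborts here, reproducing the crash
  }
  return 0;
}
```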
I have created a PR (https://github.com/ceph/ceph/pull/47340) with a standalone test along with the solution, and have tested it out. Here is the result for scenario 2:

With Patch:

ceph -s
```
  cluster:
    id:     ea6f50a3-a5b3-4fe5-817e-4ed89e254ed7
    health: HEALTH_WARN
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum a,b,e
            4 osds down
            2 hosts (4 osds) down
            1 zone (4 osds) down
            Degraded data redundancy: 4/8 objects degraded (50.000%), 1 pg degraded, 33 pgs undersized

  services:
    mon: 5 daemons, quorum a,b,e (age 7m), out of quorum: c, d
    mgr: x(active, since 13m), standbys: y, z
    osd: 11 osds: 7 up (since 2m), 11 in (since 3m); 9 remapped pgs

  data:
    pools:   2 pools, 33 pgs
    objects: 2 objects, 577 KiB
    usage:   11 GiB used, 1.1 TiB / 1.1 TiB avail
    pgs:     4/8 objects degraded (50.000%)
             23 active+undersized
             9  active+undersized+remapped
             1  active+undersized+degraded
```

ceph osd tree
```
ID   CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
 -1         1.08569  root default
 -5         0.39435      zone iris
 -9         0.19717          host node-2
  0    hdd  0.09859              osd.0        up   1.00000  1.00000
  1    hdd  0.09859              osd.1        up   1.00000  1.00000
-10         0.19717          host node-3
  2    hdd  0.09859              osd.2        up   1.00000  1.00000
  3    hdd  0.09859              osd.3        up   1.00000  1.00000
 -7         0.39435      zone pze
-11         0.19716          host node-4
  4    hdd  0.09859              osd.4      down   1.00000  1.00000
  5    hdd  0.09859              osd.5      down   1.00000  1.00000
-12         0.19716          host node-5
  6    hdd  0.09859              osd.6      down   1.00000  1.00000
  7    hdd  0.09859              osd.7      down   1.00000  1.00000
 -3         0.29700      host vossi03
  8    hdd  0.09900          osd.8            up   1.00000  1.00000
  9    hdd  0.09900          osd.9            up   1.00000  1.00000
 10    hdd  0.09900          osd.10           up   1.00000  1.00000
```

mon.b.log
```
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).osd e110 update_logger
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 Stretch mode enabled in this map
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader) e10 try_engage_stretch_mode
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 Degraded stretch mode set in this map
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader) e10 set_degraded_stretch_mode
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 prev_num_up_osd: 6
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 osdmap.num_up_osd: 7
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 osdmap.num_osd: 11
2022-08-03T19:31:35.270+0000 7f692c65f700 20 mon.b@1(leader).osd e110 mon_stretch_cluster_recovery_ratio: 0.6
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).paxosservice(logm 1..352) refresh
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).log v352 update_from_paxos
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).log v352 update_from_paxos version 352 summary v 352
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).paxosservice(monmap 1..10) refresh
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).paxosservice(auth 1..16) refresh
2022-08-03T19:31:35.270+0000 7f692c65f700 10 mon.b@1(leader).auth v16 update_from_paxos
```

See how the ratio `num_up_osd/num_osd` is 7/11 = 0.636, which is more than `mon_stretch_cluster_recovery_ratio` (0.6), yet we don't go into the assertion.

Without Patch:

ceph -s
```
Command did nothing because of the crash in leader monitor. (cons of using vstart)
```

ceph osd tree
```
Command did nothing because of the crash in leader monitor. (cons of using vstart)
```

mon.b.log
```
   -17> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 Stretch mode enabled in this map
   -16> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 try_engage_stretch_mode
   -15> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 Degraded stretch mode set in this map
   -14> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 set_degraded_stretch_mode
   -13> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 prev_num_up_osd: 6
   -12> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 osdmap.num_up_osd: 7
   -11> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 osdmap.num_osd: 11
   -10> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader).osd e114 mon_stretch_cluster_recovery_ratio: 0.6
    -9> 2022-08-03T21:14:48.807+0000 7fd79548f700 10 mon.b@1(leader).osd e114 Enabling recovery stretch mode in this map
    -8> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 go_recovery_stretch_mode
    -7> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 is_leader(): 1
    -6> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 is_degraded_stretch_mode(): 1
    -5> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 is_recovering_stretch_mode(): 0
    -4> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 dead_mon_buckets.size(): 1
    -3> 2022-08-03T21:14:48.807+0000 7fd79548f700 20 mon.b@1(leader) e10 dead_mon_buckets: {pze=c,d}
    -2> 2022-08-03T21:14:48.807+0000 7fd797c94700  1 -- [v2:127.0.0.1:40321/0,v1:127.0.0.1:40322/0] <== osd.10 v2:127.0.0.1:6926/3076179 15 ==== mon_subscribe({osdmap=0}) v3 ==== 34+0+0 (secure 0 0 0) 0x5567396f1380 con 0x556738b14000
    -1> 2022-08-03T21:14:48.814+0000 7fd79548f700 -1 ../src/mon/Monitor.cc: In function 'void Monitor::go_recovery_stretch_mode()' thread 7fd79548f700 time 2022-08-03T21:14:48.807548+0000
../src/mon/Monitor.cc: 6718: FAILED ceph_assert(0 == "how did we try and do stretch recovery while we have dead monitor buckets?")

 ceph version 17.0.0-13939-g5bff94917d1 (5bff94917d17ba4d45185766bd2815b1c4a00660) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7fd7a43616c5]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7fd7a43618f7]
 3: (Monitor::go_recovery_stretch_mode()+0x747) [0x556730d4c2d9]
 4: (OSDMonitor::update_from_paxos(bool*)+0x3be7) [0x556730f4db7f]
 5: (PaxosService::refresh(bool*)+0x102) [0x556730eb8ede]
 6: (Monitor::refresh_from_paxos(bool*)+0x360) [0x556730d2d7b4]
 7: (Paxos::do_refresh()+0x4c) [0x556730ea929a]
 8: (Paxos::commit_finish()+0xa98) [0x556730eb3202]
 9: (C_Committed::finish(int)+0x89) [0x556730eb7755]
```

Without the patch, we go into the assertion under the same condition: `num_up_osd/num_osd` is 7/11 = 0.636, which is more than `mon_stretch_cluster_recovery_ratio` (0.6).

Let me know what you think.

I'm on PTO so will have to look at the details more when I get back. But OSDs should be blocked from connecting to monitors which aren't in their stretch zone (technically, their CRUSH bucket which is used as the divider), so OSDs from a dead zone can't come up without at least one of their monitors returning. If an admin/Rook manages to add nodes to a live zone but not add matching ones to the dead zone, I guess that could break the checks? Do we think that's what happened here?
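For readers comparing the two runs above, here is a minimal, hypothetical C++ sketch of the guard behavior the patched monitor exhibits (the actual change lives in PR #47340; the struct and function names here are illustrative only, not the Ceph source): crossing the OSD ratio threshold is no longer enough to enter recovery stretch mode while any monitor bucket is still dead.

```cpp
// Hypothetical illustration of the behavior visible in the "With Patch" log:
// the cluster stays in degraded stretch mode, despite 7/11 > 0.6, because
// zone "pze" still has dead monitors. Not the actual Ceph implementation.
#include <iostream>
#include <map>
#include <set>
#include <string>

struct ClusterState {
  int num_up_osd;
  int num_osd;
  double recovery_ratio;  // models mon_stretch_cluster_recovery_ratio
  std::map<std::string, std::set<std::string>> dead_mon_buckets;  // zone -> dead mons
};

bool should_enter_recovery(const ClusterState& s) {
  bool ratio_ok =
      static_cast<double>(s.num_up_osd) / s.num_osd > s.recovery_ratio;
  // The extra condition: stay in degraded stretch mode until every monitor
  // bucket has at least one live monitor again.
  return ratio_ok && s.dead_mon_buckets.empty();
}

int main() {
  // Same state as the logs above: 7/11 ~= 0.636 > 0.6, but zone "pze"'s
  // monitors c and d are still down, so we do not enter recovery.
  ClusterState degraded{7, 11, 0.6, {{"pze", {"c", "d"}}}};
  std::cout << (should_enter_recovery(degraded) ? "recovery" : "stay degraded")
            << "\n";

  // Once the dead zone's monitors return, recovery can proceed.
  ClusterState healed{7, 11, 0.6, {}};
  std::cout << (should_enter_recovery(healed) ? "recovery" : "stay degraded")
            << "\n";
  return 0;
}
```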
Hi Pawan,

Review this bug and see if you can recreate it on standalone RHCS.

Regards,
Veera

Neha/Veera, do we have a ceph clone tracking the fix for downstream?

I don't think so, 6.1 is fine. Thanks for creating the clone.

Nope, all done. Thanks.

QE is planning to verify this BZ using the steps mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2069405#c0. Talur, if this is not correct or anything else needs to be covered, please let us know.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832