Bug 1939617 - [Arbiter] Mons cannot be failed over in stretch mode
Summary: [Arbiter] Mons cannot be failed over in stretch mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Travis Nielsen
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On: 1939007 1939766
Blocks: 1941918
 
Reported: 2021-03-16 17:13 UTC by Travis Nielsen
Modified: 2021-06-01 08:50 UTC
CC List: 14 users

Fixed In Version: 4.7.0-318.ci
Doc Type: No Doc Update
Doc Text:
Clone Of: 1939007
Clones: 1941918
Environment:
Last Closed: 2021-05-19 09:20:45 UTC
Embargoed:


Attachments (Terms of Use)
Logs for the stretch mons when starting up mon.f to failover from mon.b (980.45 KB, application/zip)
2021-04-09 21:17 UTC, Travis Nielsen


Links
System ID Private Priority Status Summary Last Updated
Github openshift rook pull 198 0 None open Bug 1939617: ceph: disable mon failover on stretched cluster 2021-03-22 17:14:58 UTC
Github rook rook pull 7535 0 None open ceph: Set the location on the mon daemon for stretch clusters 2021-04-06 22:33:41 UTC
Red Hat Product Errata RHSA-2021:2041 0 None None None 2021-05-19 09:21:13 UTC

Comment 1 Travis Nielsen 2021-03-16 17:19:25 UTC
This BZ is cloned from the Ceph BZ with the following proposal...

Option 1

If the Ceph BZ can be fixed in time for 4.2z1:
- Rook is updated to set the mon location with a new argument at mon startup time. 

Option 2

If the Ceph BZ cannot be fixed in time for 4.2z1:
- We disable the mon failover functionality in Rook for stretch cluster scenarios. This seems like a reasonable solution for 4.7 since stretch cluster is in tech preview and we are already starting with 5 mons, which have some redundancy naturally built-in.
- Disabling mon failover is a setting on the CephCluster CR that will be set by the OCS operator
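For illustration, here is a minimal sketch of how such a setting could look on the CephCluster CR, assuming the generic Rook health-check knob for mons is the mechanism used; the exact field chosen by the PR/OCS operator and the zone layout shown here may differ:

```
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster   # typical OCS name, shown as an example
  namespace: openshift-storage
spec:
  mon:
    count: 5
    stretchCluster:
      zones:
      - name: arbiter
        arbiter: true
      - name: us-east-2b
      - name: us-east-2c
  healthCheck:
    daemonHealth:
      mon:
        # Illustrative assumption: disabling the mon health check stops the
        # operator from replacing (failing over) mons that stay out of quorum.
        disabled: true
```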

Moving to the Rook component for now, assuming Option 1 is still possible, with Option 2 as the fallback plan, in which case it would move back to the OCS operator.

Comment 2 Martin Bukatovic 2021-03-19 11:50:01 UTC
What is expected to happen to a stretch cluster, when one data zone loses connection to both arbiter and other data zone with mon failover disabled?

Comment 5 Sébastien Han 2021-03-22 15:43:35 UTC
The decision was to go with option 2 since 4.2z1 does not accept anything new.
Will provide a fix soon.

Comment 6 Michael Adam 2021-03-22 17:21:15 UTC
(In reply to Sébastien Han from comment #5)
> The decision was to go with option 2 since 4.2z1 does not accept anything
> new.

Well, kind of right.
We have agreed with the RHCS program to have an async update right after the 4.2z1 release in order to facilitate a few critical bug fixes for OCS etc. So if we had a fix in ceph already, we might still get it in. Not sure.

> Will provide a fix soon.

Comment 7 Mudit Agarwal 2021-03-23 07:08:29 UTC
(In reply to Martin Bukatovic from comment #2)
> What is expected to happen to a stretch cluster, when one data zone loses
> connection to both arbiter and other data zone with mon failover disabled?

I have created a doc BZ to record the behaviour/workaround in this case. 
https://bugzilla.redhat.com/show_bug.cgi?id=1941918

Comment 10 Travis Nielsen 2021-03-30 17:57:13 UTC
(In reply to Martin Bukatovic from comment #2)
> What is expected to happen to a stretch cluster, when one data zone loses
> connection to both arbiter and other data zone with mon failover disabled?

When one data zone loses connection to both the other zones:
- The data zone that still has connection to the arbiter zone will be available for reads/writes
- The data zone without connection to the arbiter zone will remain down until it can connect to the other zones again

When mon failover is disabled, no new mons will be created in place of the failed mons. 

When mon failover is again enabled, new mon(s) could be started up in the failed zone as long as there are nodes available in that zone with connectivity.
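Not part of the original comment, but for anyone reproducing this, the mon/zone state can be checked from the toolbox pod with standard ceph commands (illustrative only):

```
# Which mons are known, and which are currently in quorum
ceph quorum_status -f json-pretty | grep -E '"quorum_names"|"name"'
ceph mon stat

# mon dump also lists each mon's address (and, on stretch-aware builds, its location)
ceph mon dump

# CRUSH view showing which OSDs sit in which zone
ceph osd tree
```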

Comment 11 Martin Bukatovic 2021-03-30 22:17:29 UTC
Tested on a vSphere cluster of 6 worker/storage nodes and 3 master nodes with:

- OCP 4.7.0-0.nightly-2021-03-27-082615
- LSO 4.7.0-202103130041.p0
- OCS 4.7.0-324.ci (latest-stable-47)

During verification steps, I:

- created a StorageCluster in arbiter mode via the OCP Console
- wrote 4 GB via a fio job on a cephfs PV, and another 4 GB on an rbd volume
- selected a worker node for draining on which both a ceph mon and an osd pod
  were running: `compute-0` (the 1st worker node); see the sketch after this list
- marked the node as unschedulable: `oc adm cordon compute-0`
- drained the node:
  `oc adm drain compute-0 --force --delete-local-data --ignore-daemonsets`
- let it run for more than 15 minutes
- and then finally uncordoned the node: `oc adm uncordon compute-0`
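The node-selection step above was done by inspection; a small sketch of how it can be scripted (illustrative, not part of the original verification run):

```
# Print "<node> <pod>" for every mon and OSD pod, so a node hosting both
# a mon and an osd can be picked as the drain target
oc get pods -n openshift-storage -o wide --no-headers \
  | awk '/rook-ceph-(mon-[a-z]+|osd-[0-9]+)-/ {print $7, $1}' \
  | sort
```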

Observations:

- during the drain process, ceph mon-b was evicted from node/compute-0,
  and from that moment ceph mon-b has been out of quorum
- after the drain, ceph mon-b was moved to node/compute-2, where it remains
  running (but out of quorum, as noted above)
- after the uncordon, mon-b is not moved back from node/compute-2, but remains
  there running out of quorum, unlike osd-3, which was down but is now back
  running on its original node/compute-0
- in the end, no pod is in CLBO state, and the ceph cluster has 4 monitors in
  quorum, so it remains operational

Detail of the ceph status from toolbox pod:

```
[root@compute-0 /]# ceph status
  cluster:
    id:     f3fb899a-e6d2-4eee-9984-7d94e1b554c5
    health: HEALTH_WARN
            3496 slow ops, oldest one blocked for 8786 sec, mon.b has slow ops
            1/5 mons down, quorum a,c,d,e
 
  services:
    mon: 5 daemons, quorum a,c,d,e (age 40m), out of quorum: b
    mgr: a(active, since 6h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 101m), 4 in (since 101m)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   10 pools, 272 pgs
    objects: 2.37k objects, 8.1 GiB
    usage:   36 GiB used, 28 GiB / 64 GiB avail
    pgs:     272 active+clean
 
  io:
    client:   4.9 KiB/s rd, 4.2 KiB/s wr, 5 op/s rd, 3 op/s wr
```

This means that the OCS cluster can now survive the use case from the bug.

Besides this, I also tried to deploy a simple machineconfig on all worker
nodes, since during this process MCO drains all nodes one by one, and I had
hit this BZ in that use case, as noted in comment
https://bugzilla.redhat.com/show_bug.cgi?id=1939007#c8

```
$ cat worker-example.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: worker-example
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/example
        contents:
          source: data:text/plain;charset=utf-8;base64,SGVsbG8K
        mode: 0444
        user:
          name: root
        group:
          name: root
        overwrite: true
$ oc create -f worker-example.yaml
```

And again, the cluster survives that without losing quorum or any pod reaching
CLBO status. Monitor quorum is `2/5 mons down, quorum b,d,e` though.

In this state, I was able to run another workload which stored 4 GB on the
cluster without any problem.

Based on this, I'm marking this bug as VERIFIED.

That said, while this workaround works, it also makes the cluster vulnerable
to further disruptions. In particular, disruptions which a stretch cluster is
expected to handle could result in a non-operational cluster that is difficult
to recover. I assume this is a well-known limitation of the workaround proposed
in comment 5.

Comment 12 Martin Bukatovic 2021-03-31 18:44:27 UTC
Since there was some discussion about what exactly should be fixed in this bug, and option 2 from comment 1 is not explained here in more detail, I'm asking Travis to check whether comment 11 describes the expected behaviour of the fix.

Comment 13 Martin Bukatovic 2021-03-31 18:54:26 UTC
I'm asking dev team to tell me which ceph version was present in OCS 4.7.0-324.ci image, so that I can answer the question about 4.2z1.

Comment 14 Martin Bukatovic 2021-03-31 18:55:43 UTC
(In reply to Martin Bukatovic from comment #13)
> I'm asking dev team to tell me which ceph version was present in OCS
> 4.7.0-324.ci image, so that I can answer the question about 4.2z1.

quay.io no longer provides an answer:

https://quay.io/repository///manifest/sha256:73a4413f49c7cb3ef288313eaed37951e04b9ba57a4ec8bac05004f1f4b97b25

reports not found.

Comment 15 Travis Nielsen 2021-03-31 20:47:02 UTC
The unexpected behavior from comment 11 is that the mon was moved to another node after the node drain. When running on LSO, the mon should have node affinity and never move nodes. The mon should wait for the node to come back up instead of rescheduling on a different node. When you have a repro please let me know to take a look.

Comment 16 Martin Bukatovic 2021-03-31 23:26:55 UTC
Based on Travis's reply in comment 15, moving this back to ASSIGNED.

Comment 17 Mudit Agarwal 2021-04-01 05:43:55 UTC
Travis/Sebastien, we have mon failover support in Ceph 4.2z1 now (https://bugzilla.redhat.com/show_bug.cgi?id=1939766)

Shall we enable it back in OCS?

Comment 19 Travis Nielsen 2021-04-01 18:03:06 UTC
For 4.7.0 I would propose we leave mon failover disabled for arbiter. The changes to support the mon failover require another change in rook besides reverting the disabling, which will take more time to test. I don't see mon failover as critical to the feature while it's in tech preview. The important scenario is that stretch continues to work when a data zone goes down, in which case mon failover isn't even possible. 

@Mudit How about we move this to 4.7.z or else 4.8?

Comment 20 Mudit Agarwal 2021-04-05 10:57:03 UTC
My bad, changed the wrong bug by mistake. Reverting it to the original state.

Comment 21 Travis Nielsen 2021-04-05 19:20:15 UTC
After further discussions, let's go ahead and re-enable mon failover for stretch clusters.

Two changes are necessary to reenable mon failover in stretch clusters since the functionality is available from the latest RHCS 4.2z1 RC:
- Revert the commit that disabled mon failover in stretch clusters
- Implement a new mon parameter that sets the location of the mon in the stretch cluster

Comment 22 Travis Nielsen 2021-04-06 22:19:57 UTC
While testing the Rook changes to enable mon failover in a stretch cluster, I'm not able to get the mon to join quorum.

The mon log is showing:

debug 2021-04-06T22:11:06.037+0000 7f2d568fd700 10 mon.i@-1(probing) e14  ready to join, but i'm not in the monmap/my addr is blank/location is wrong, trying to join

All the mons and the failed-over mon are using a location argument like so:

--set-crush-location zone=us-east-2c

The original 5 mons start fine with that argument, but the new mon is not joining quorum.
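For reference, a rough and abridged sketch of how the flag sits on a mon container command line (placeholders only; the real arguments are generated by Rook and can be seen in the pod spec gist linked below):

```
# Hypothetical/abridged -- not the actual generated pod spec
ceph-mon --foreground \
  --id=i \
  --fsid=<cluster-fsid> \
  --public-addr=<mon-ip> \
  --set-crush-location zone=us-east-2c
```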

Here is the full mon log for the mon attempting to join quorum:
https://gist.github.com/travisn/5a25603c13f8fcc8c5da638433333a1a#file-mon-h-log

Here is the full mon pod spec:
https://gist.github.com/travisn/5a25603c13f8fcc8c5da638433333a1a#gistcomment-3695810

@gfarnum Is the location incorrect on the mon, or why would it not be joining quorum?

Comment 23 Travis Nielsen 2021-04-06 22:33:44 UTC
Here are Rook changes to use the new --set-crush-location flag. https://github.com/rook/rook/pull/7535

Comment 24 Greg Farnum 2021-04-09 20:07:16 UTC
(In reply to Travis Nielsen from comment #22)
> While testing the Rook changes to enable mon failover in a stretch cluster,
> I'm not able to get the mon to join quorum.
> 
> The mon log is showing:
> 
> debug 2021-04-06T22:11:06.037+0000 7f2d568fd700 10 mon.i@-1(probing) e14 
> ready to join, but i'm not in the monmap/my addr is blank/location is wrong,
> trying to join

Hmm, that log message is expected initially, at which point the monitor sends an MMonJoin with the updated data that should go into the MonMap and update it. This works in my tests but perhaps you're exercising a conditional I'm not?

Can you grab the log of the monitors which are in quorum?

> 
> All the mons and the failed-over mon are using a location argument like so:
> 
> --set-crush-location zone=us-east-2c

Yeah, that should be good.

Comment 25 Travis Nielsen 2021-04-09 21:17:11 UTC
Created attachment 1770800 [details]
Logs for the stretch mons when starting up mon.f to failover from mon.b

mon.b was failed, and mon.f was started up to replace it with location=us-east-2b. 

The following, from one of the other logs, looks suspicious, but see the attachment for the full mon logs.

debug 2021-04-09T21:08:44.391+0000 7f575afc2700 10 mon.a@0(peon).monmap v9 preprocess_join f at [v2:172.30.1.69:3300/0,v1:172.30.1.69:6789/0]
debug 2021-04-09T21:08:44.391+0000 7f575afc2700 20 is_capable service=mon command= write exec addr v2:172.30.1.69:3300/0 on cap allow *
debug 2021-04-09T21:08:44.391+0000 7f575afc2700 20  allow so far , doing grant allow *
debug 2021-04-09T21:08:44.391+0000 7f575afc2700 20  allow all
debug 2021-04-09T21:08:44.391+0000 7f575afc2700 10 mon.a@0(peon) e9 forward_request won't forward (non-local) mon request mon_join(f [v2:172.30.1.69:3300/0,v1:172.30.1.69:6789/0] {zone=us-east-2b}) v3
debug 2021-04-09T21:08:44.411+0000 7f575afc2700 20 mon.a@0(peon) e9 _ms_dispatch existing session 0x55ef64be4000 for mon.?

Comment 26 Martin Bukatovic 2021-04-12 19:34:57 UTC
(In reply to Travis Nielsen from comment #15)
> The unexpected behavior from comment 11 is that the mon was moved to another
> node after the node drain. When running on LSO, the mon should have node
> affinity and never move nodes. The mon should wait for the node to come back
> up instead of rescheduling on a different node. When you have a repro please
> let me know to take a look.

I tried to repeat the procedure from comment 11, but I failed to do so:

- I selected a worker node for draining, so that there is both ceph mon
  and osd pod running: `compute-0` (the 1st worker node).
- Marked the node as unschedulable: `oc adm cordon compute-0`.
- Drained the node via
  `oc adm drain compute-0 --force --delete-local-data --ignore-daemonsets`

But the drain got stuck on:

```
evicting pod openshift-storage/rook-ceph-mon-a-889497f48-6zrqx
error when evicting pods/"rook-ceph-mon-a-889497f48-6zrqx" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```

Comment 27 Martin Bukatovic 2021-04-13 15:10:35 UTC
Based on Mudit's request, a BZ to track reenablement of arbiter mon failover was opened: BZ 1949165

Comment 28 Travis Nielsen 2021-04-13 15:26:43 UTC
Moving back to ON_QA per separate thread and tracking the re-enabling with the new BZ.

 (In reply to Martin Bukatovic from comment #26)
> (In reply to Travis Nielsen from comment #15)
> > The unexpected behavior from comment 11 is that the mon was moved to another
> > node after the node drain. When running on LSO, the mon should have node
> > affinity and never move nodes. The mon should wait for the node to come back
> > up instead of rescheduling on a different node. When you have a repro please
> > let me know to take a look.
> 
> I tried to repeat the procedure from comment 11, but I failed to do so:
> 
> - I selected a worker node for draining, so that there is both ceph mon
>   and osd pod running: `compute-0` (the 1st worker node).
> - Marked the node as unschedulable: `oc adm cordon compute-0`.
> - Drained the node via
>   `oc adm drain compute-0 --force --delete-local-data --ignore-daemonsets`
> 
> But the drain got stuck on:
> 
> ```
> evicting pod openshift-storage/rook-ceph-mon-a-889497f48-6zrqx
> error when evicting pods/"rook-ceph-mon-a-889497f48-6zrqx" -n
> "openshift-storage" (will retry after 5s): Cannot evict pod as it would
> violate the pod's disruption budget.
> ```

It's expected that a node cannot be drained with a mon if the mons are not fully in quorum. Was there a mon already out of quorum when you tried to drain the node?

Comment 29 Greg Farnum 2021-04-13 15:43:08 UTC
(In reply to Travis Nielsen from comment #25)
> Created attachment 1770800 [details]
> Logs for the stretch mons when starting up mon.f to failover from mon.b
> 
> mon.b was failed, and mon.f was started up to replace it with
> location=us-east-2b. 
> 
> This looks suspicious from one of the other logs, but see attached for the
> full mon logs.
> 
> debug 2021-04-09T21:08:44.391+0000 7f575afc2700 10 mon.a@0(peon).monmap v9
> preprocess_join f at [v2:172.30.1.69:3300/0,v1:172.30.1.69:6789/0]
> debug 2021-04-09T21:08:44.391+0000 7f575afc2700 20 is_capable service=mon
> command= write exec addr v2:172.30.1.69:3300/0 on cap allow *
> debug 2021-04-09T21:08:44.391+0000 7f575afc2700 20  allow so far , doing
> grant allow *
> debug 2021-04-09T21:08:44.391+0000 7f575afc2700 20  allow all
> debug 2021-04-09T21:08:44.391+0000 7f575afc2700 10 mon.a@0(peon) e9
> forward_request won't forward (non-local) mon request mon_join(f
> [v2:172.30.1.69:3300/0,v1:172.30.1.69:6789/0] {zone=us-east-2b}) v3
> debug 2021-04-09T21:08:44.411+0000 7f575afc2700 20 mon.a@0(peon) e9
> _ms_dispatch existing session 0x55ef64be4000 for mon.?

Yep, that's definitely the issue — good eyes. Since this isn't a 4.7 blocker I think it'll have to wait, though, as just getting through tests takes some time and we have other priorities.

Comment 31 Martin Bukatovic 2021-04-15 08:54:33 UTC
(In reply to Travis Nielsen from comment #28)
> It's expected that a node cannot be drained with a mon if the mons are not
> fully in quorum. Was there a mon already out of quorum when you tried to
> drain the node?

There was no problem with mon quorum, if I recall correctly.

Comment 33 Martin Bukatovic 2021-04-15 17:22:31 UTC
OK, so now I'm a little confused.

Could someone from the dev team confirm what is expected to happen when one performs verification steps (as I did in comment 11):

- create a StorageCluster in arbiter mode via the OCP Console
- write some data on a cephfs PV, and another 4 GB on an rbd volume
- select a worker node for draining on which both a ceph mon and an osd pod
  are running: `compute-0` (the 1st worker node)
- mark the node as unschedulable: `oc adm cordon compute-0`
- drain the node:
  `oc adm drain compute-0 --force --delete-local-data --ignore-daemonsets`
- let it run for more than 15 minutes
- and then finally, uncordon the node: `oc adm uncordon compute-0`

I would expect that most observations from comment 11 still hold (with the exception of the issue of which node the mon is deployed on again), and that I should still be able to deploy a simple machineconfig on the worker MCP.

If that is not the case, I will move the BZ right into assigned again, and also try to perform the validation steps to gather more data points.

Comment 34 Travis Nielsen 2021-04-16 18:07:43 UTC
@Martin Correct, the same validation steps should apply. The change with this BZ is that no new mon (such as mon.f) will be created automatically after a mon is down for too long. While the node is down, you would see the mon and osd from that node stay down, since they are not portable in stretch clusters built on LSO. Then when the node is brought back up, you should see the mon and osd pods running again on that node.
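A quick way to confirm that pinning (illustrative commands; mon-b is just an example name, and whether Rook expresses the pin as a nodeSelector or as node affinity may vary by version):

```
# Show any node pinning on the mon deployment's pod template
oc get deployment rook-ceph-mon-b -n openshift-storage \
  -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
oc get deployment rook-ceph-mon-b -n openshift-storage \
  -o jsonpath='{.spec.template.spec.affinity.nodeAffinity}{"\n"}'
```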

Comment 35 Martin Bukatovic 2021-04-20 17:13:38 UTC
I'm trying to verify the problem with:

- OCP: 4.7.0-0.nightly-2021-04-15-110345
- LSO 4.7.0-202104030128.p0
- OCS 4.7.0-353.ci latest-stable-47 (4.7.0-rc5)

I followed the steps from comment 11, but when I started draining a node:

```
$ oc adm drain compute-2 --force --delete-local-data --ignore-daemonsets
```

The process got stuck (in the same way as reported in comment 26):

```
evicting pod openshift-storage/rook-ceph-mon-c-756c8487d6-l7ppf
error when evicting pods/"rook-ceph-mon-c-756c8487d6-l7ppf" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-mon-c-756c8487d6-l7ppf
error when evicting pods/"rook-ceph-mon-c-756c8487d6-l7ppf" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```

Looking at PodDisruptionBudget I see that it prevents the draining:

```
$ oc get PodDisruptionBudget -n openshift-storage                               
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     24h
rook-ceph-mon-pdb                                 N/A             0                 0                     24h
rook-ceph-osd-zone-data-b                         N/A             0                 0                     28m
$ oc get PodDisruptionBudget/rook-ceph-mon-pdb -n openshift-storage -o yaml | tail -11
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: rook-ceph-mon
status:
  currentHealthy: 5
  desiredHealthy: 5
  disruptionsAllowed: 0
  expectedPods: 5
  observedGeneration: 8
```

I don't want to comment on whether this particular configuration of the PodDisruptionBudget is OK, but it prevents me from verifying the bug, because it's not possible to perform the reproducer and check the expected behaviour. Moreover, performing a machine config update would also be blocked by this.

>>> ASSIGNED

Comment 36 Travis Nielsen 2021-04-20 18:33:00 UTC
@Martin Since https://bugzilla.redhat.com/show_bug.cgi?id=1935065 was fixed, nodes are disallowed from being drained if a mon is down. In this test, were any mons down before you tried to drain the node? Or was everything perfectly healthy before you tried to drain the node? If everything was perfectly healthy and the mon PDB disallows a drain, then there would be an issue on 1935065 to follow up on, rather than this one.

To simulate a mon going down to see if mon failover is triggered, you should be able to set the mon deployment replicas to 0 so the mon pod will stop. Did you try that?

Comment 37 Martin Bukatovic 2021-04-22 20:12:26 UTC
(In reply to Travis Nielsen from comment #36)
> @Martin Since https://bugzilla.redhat.com/show_bug.cgi?id=1935065 was fixed,
> nodes are disallowed from being drained if a mon is down. In this test, are
> any mons down down before you tried to drain the node? Or is everything
> perfectly healthy before you try to drain the node? If everything was
> perfectly healthy and the mon PDB disallows a drain, then there would be an
> issue on 1935065 to follow up on, rather than this one.

Looking in my log, I see that I did:

- install storage cluster CR via OCP Console
- stored 1 GB on cephfs based PV, 1GB on rbd PV
- run all net splits we have in a test plan over night
- reproduced BZ 1946592

That said, the cluster was healthy when I started retesting the use case
from this BZ.

I will retest on a fresh cluster.

I agree that there is something else going on, which may require a separate bug.

> To simulate a mon going down to see if mon failover is triggered, you should
> be able to set the mon deployment replicas to 0 so the mon pod will stop.
> Did you try that?

No, I haven't. I could retry.

Comment 38 Martin Bukatovic 2021-04-22 20:28:02 UTC
Here is an observation from retesting this on a fresh cluster, where only the reproducer steps from this bug were performed.

Retested with
=============

OCP 4.7.0-0.nightly-2021-04-21-093400
LSO 4.7.0-202104090228.p0
OCS 4.7.0-353.ci

Observations before the drain
=============================

I selected node compute-2 to be drained.

```
$ oc get pods -n openshift-storage -o wide | grep compute-2 | cut -d' ' -f1 | egrep "(mon|osd)"
rook-ceph-mon-b-5b95bc7c75-zjslp
rook-ceph-osd-3-778966b55f-45rx9
rook-ceph-osd-prepare-ocs-deviceset-arbiter-1-data-08d9jk-hg8qj
$ oc get PodDisruptionBudget -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     78m
rook-ceph-mon-pdb                                 N/A             1                 1                     75m
rook-ceph-osd                                     N/A             1                 1                     75m
$ oc get PodDisruptionBudget/rook-ceph-mon-pdb -n openshift-storage -o yaml | tail -11
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: rook-ceph-mon
status:
  currentHealthy: 5
  desiredHealthy: 4
  disruptionsAllowed: 1
  expectedPods: 5
  observedGeneration: 1
```

After the drain
===============

I see that the affected mon-b was respawned on another node, while quorum stayed at 5:

```
$ oc get pods -n openshift-storage -o wide | grep mon
rook-ceph-mon-a-69445d5c76-nzn45                                  2/2     Running     0          135m   10.128.4.194   compute-0         <none>           <none>
rook-ceph-mon-b-5b95bc7c75-4b8wg                                  2/2     Running     0          52m    10.131.0.112   compute-1         <none>           <none>
rook-ceph-mon-c-65f9cf6744-k82j5                                  2/2     Running     0          134m   10.130.2.13    compute-5         <none>           <none>
rook-ceph-mon-d-78b4db8c7c-t4kbv                                  2/2     Running     0          134m   10.129.2.234   compute-3         <none>           <none>
rook-ceph-mon-e-75f646fdb8-rkxwr                                  2/2     Running     0          134m   10.128.0.52    control-plane-2   <none>           <none>
$  oc get PodDisruptionBudget -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     138m
rook-ceph-mon-pdb                                 N/A             1                 1                     136m
rook-ceph-osd-zone-data-b                         N/A             0                 0                     55m
```

After the uncordon
==================

```
$ oc get pods -n openshift-storage -o wide | grep mon
rook-ceph-mon-a-69445d5c76-nzn45                                  2/2     Running     0          142m    10.128.4.194   compute-0         <none>           <none>
rook-ceph-mon-b-5b95bc7c75-4b8wg                                  2/2     Running     0          59m     10.131.0.112   compute-1         <none>           <none>
rook-ceph-mon-c-65f9cf6744-k82j5                                  2/2     Running     0          141m    10.130.2.13    compute-5         <none>           <none>
rook-ceph-mon-d-78b4db8c7c-t4kbv                                  2/2     Running     0          141m    10.129.2.234   compute-3         <none>           <none>
rook-ceph-mon-e-75f646fdb8-rkxwr                                  2/2     Running     0          141m    10.128.0.52    control-plane-2   <none>           <none>
$  oc get PodDisruptionBudget -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     142m
rook-ceph-mon-pdb                                 N/A             1                 1                     140m
rook-ceph-osd                                     N/A             1                 1                     8s
```

Mon status
==========

According to `ceph_mon_quorum_status`, no mon was out of quorum.

This looks like the mon failover works.

Conclusion
==========

The observation conflicts with the description from comment 34.

Comment 40 Martin Bukatovic 2021-04-22 21:10:09 UTC
(In reply to Martin Bukatovic from comment #37)
> > To simulate a mon going down to see if mon failover is triggered, you should
> > be able to set the mon deployment replicas to 0 so the mon pod will stop.
> > Did you try that?
> 
> No, I haven't. I could retry.

When I do that, after retrying the reproducer as noted in comment 38, I see that
scaling down the given mon doesn't create a new mon deployment that would replace
the scaled-down mon.

```
$ oc scale --replicas 0 deployment/rook-ceph-mon-b -n openshift-storage
deployment.apps/rook-ceph-mon-b scaled
$ oc get pods -n openshift-storage| grep mon
rook-ceph-mon-a-69445d5c76-nzn45                                  2/2     Running     0          3h53m
rook-ceph-mon-c-65f9cf6744-k82j5                                  2/2     Running     0          3h53m
rook-ceph-mon-d-78b4db8c7c-t4kbv                                  2/2     Running     0          3h53m
rook-ceph-mon-e-75f646fdb8-rkxwr                                  2/2     Running     0          3h52m
```

Comment 41 Travis Nielsen 2021-04-23 00:00:40 UTC
(In reply to Martin Bukatovic from comment #38)
> 
> Mon status
> ==========
> 
> According to `ceph_mon_quorum_status`, no mon was out of quorum.
> 
> This looks like the mon failover works.
> 
> Conclusion
> ==========
> 
> The observation conflicts with description from comment 34.

Your observations sound consistent with comment 34. To clarify:
- An existing mon may still be moved to another node if another node in the same zone is available during the node drain. This is not "mon failover", but is just the same mon moving to another node.
- "mon failover" in comment 34 means that a mon with a new name (such as mon-f) will be created, and the down mon (e.g. mon-b) would be destroyed by the operator after replaced.

Moving back to ON_QA so you can mark as VERIFIED if you agree.

Comment 42 Martin Bukatovic 2021-04-23 10:29:09 UTC
Thanks for clarification.

Marking as VERIFIED based on previous evidence (comment 38) and clarification from Travis.

The issues I ran into (comment 37) will be rechecked and, if necessary, a separate BZ will be reported, since they are not related to this bug.

Comment 44 errata-xmlrpc 2021-05-19 09:20:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

