Bug 1946837 - OCS 4.7 Arbiter Mode Cluster becomes stuck when entire zone is shutdown
Summary: OCS 4.7 Arbiter Mode Cluster becomes stuck when entire zone is shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Greg Farnum
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On: 1949166
Blocks:
 
Reported: 2021-04-07 01:01 UTC by Jean-Charles Lopez
Modified: 2021-06-01 08:50 UTC
CC List: 12 users

Fixed In Version: ceph-14.2.11-147.el8cp, ceph-14.2.11-147.el7cp, 4.7.0-351.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1949166 (view as bug list)
Environment:
Last Closed: 2021-05-19 09:21:26 UTC
Embargoed:



Description Jean-Charles Lopez 2021-04-07 01:01:35 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
OCS cluster deployed in Metro DR Stretched Mode (Arbiter mode). When shutting down an entire zone to test service continuity, all the PGs in the cluster become inactive.

4 worker node cluster
1 OSD per node
LSO

Version of all relevant components (if applicable):
Deployed OCS 4.7 RC3
VMware environment (BLR lab)
Client Version: 4.7.4
Server Version: 4.7.2
Kubernetes Version: v1.20.0+5fbfd19

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-339.ci   OpenShift Container Storage   4.7.0-339.ci              Succeeded

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Deploy cluster as mentioned above

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCS 4.7 RC3 using 4 worker nodes, 1 OSD per node
2. Shut down an entire zone (at least 2 OSDs and 1 MON will go down)
3. Check the status of the OCS cluster (see the sketch below)
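
A rough sketch of the check in step 3, run from the rook-ceph toolbox; the zone label, toolbox selector, and namespace are the usual OCS defaults and are assumptions, not values taken from this cluster:

# List the worker nodes in the zone that will be shut down (zone name assumed)
oc get nodes -l topology.kubernetes.io/zone=datacenter1

# After powering that zone's nodes off out-of-band, check Ceph health from the toolbox
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage rsh $TOOLS_POD ceph -s
oc -n openshift-storage rsh $TOOLS_POD ceph osd tree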


Actual results:
The cluster becomes totally unavailable.

Expected results:
Cluster keeps operating using the 2 surviving copies of each PG.

Additional info:
First, note that the OSDs that go down are never marked out of the cluster.

# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                             STATUS REWEIGHT PRI-AFF
 -1       0.39075 root default
 -4       0.19537     zone datacenter1
 -3       0.09769         host perf1-mz8bt-worker-d2hdm
  2   ssd 0.09769             osd.2                         up  1.00000 1.00000
-13       0.09769         host perf1-mz8bt-worker-k68rv
  3   ssd 0.09769             osd.3                         up  1.00000 1.00000
 -8       0.19537     zone datacenter2
 -7       0.09769         host perf1-mz8bt-worker-ntkp8
  0   ssd 0.09769             osd.0                       down        0 1.00000
-11       0.09769         host perf1-mz8bt-worker-qpwsr
  1   ssd 0.09769             osd.1                       down        0 1.00000


# ceph -s
  cluster:
    id:     fed692ff-aec8-4955-98c9-cba480032c9e
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1 MDSs report slow metadata IOs
            Reduced data availability: 272 pgs inactive
            Degraded data redundancy: 1016/2032 objects degraded (50.000%), 158 pgs degraded, 103 pgs undersized
            2/5 mons down, quorum c,d,e

  services:
    mon: 5 daemons, quorum c,d,e (age 2h), out of quorum: a, b
    mgr: a(active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1/1 {0=ocs-storagecluster-cephfilesystem-b=up:replay}
    osd: 4 osds: 2 up (since 3m), 2 in (since 19m)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.b)

  task status:

  data:
    pools:   10 pools, 272 pgs
    objects: 508 objects, 598 MiB
    usage:   3.0 GiB used, 197 GiB / 200 GiB avail
    pgs:     100.000% pgs not active
             1016/2032 objects degraded (50.000%)
             158 undersized+degraded+peered
             114 undersized+peered


CRUSH rule generated for all the pools
rule default_stretch_cluster_rule {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type zone
        step chooseleaf firstn 2 type host
        step emit
}
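
For reference, a hedged sketch of how the rule and pool sizing can be inspected from the toolbox (the rule name and PG id come from the output in this report; running these via the toolbox is an assumption):

# Dump the stretch rule definition shown above
ceph osd crush rule dump default_stretch_cluster_rule

# Show size/min_size and the CRUSH rule assigned to each pool
ceph osd pool ls detail

# Confirm which OSDs a given PG maps to (PG id from the pg dump excerpt below)
ceph pg map 1.78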


Tried changing the CRUSH rule in many ways and marking the OSDs out, but nothing allowed the cluster to come back online.

Here is an excerpt of the PG map
1.78          1                  0        2         0       0  3944448           0          0 3042     3042 undersized+degraded+peered 2021-04-07 00:43:03.961281  68'6914   291:7352 [3,2]          3  [3,2]              3     42'402 2021-04-06 00:53:16.390866             0'0 2021-04-06 00:51:52.548637             0
1.79          1                  0        2         0       0  4194304           0          0   20       20 undersized+degraded+peered 2021-04-07 00:43:03.964491    42'20    290:420 [2,3]          2  [2,3]              2      36'19 2021-04-06 00:54:29.998342             0'0 2021-04-06 00:51:52.548637             0
1.7a          1                  0        2         0       0     8192           0          0  185      185 undersized+degraded+peered 2021-04-07 00:43:03.964235   42'185    290:600 [2,3]          2  [2,3]              2     36'181 2021-04-06 00:53:30.768098             0'0 2021-04-06 00:51:52.548637             0
1.7b          2                  0        4         0       0  4227072           0          0  493      493 undersized+degraded+peered 2021-04-07 00:43:03.965468   42'493    290:806 [2,3]          2  [2,3]              2     36'144 2021-04-06 00:53:00.571403          36'144 2021-04-06 00:53:00.571403             0
1.7c          1                  0        2         0       0  4128768           0          0 1198     1198 undersized+degraded+peered 2021-04-07 00:43:03.965629  68'1198   290:1631 [2,3]          2  [2,3]              2     42'168 2021-04-06 00:53:42.393343             0'0 2021-04-06 00:51:52.548637             0
1.7d          0                  0        0         0       0        0           0          0   18       18          undersized+peered 2021-04-07 00:43:03.966430    42'18    290:325 [2,3]          2  [2,3]              2      36'15 2021-04-06 00:53:33.763293             0'0 2021-04-06 00:51:52.548637             0
1.7e          2                  0        4         0       0  6422528           0          0  382      382 undersized+degraded+peered 2021-04-07 00:43:03.965340   42'382    290:780 [2,3]          2  [2,3]              2      36'22 2021-04-06 00:54:09.280864           36'22 2021-04-06 00:54:09.280864             0
1.7f          0                  0        0         0       0   833536           0          0   38       38          undersized+peered 2021-04-07 00:43:03.968886    42'38    291:358 [3,2]          3  [3,2]              3      36'35 2021-04-06 00:53:55.973299             0'0 2021-04-06 00:51:52.548637             0

10   1 0   2 0 0      1024 0 0  2831  2831
9   16 0  32 0 0      4907 0 0    34    34
8    0 0   0 0 0         0 0 0     0     0
7   22 0  44 0 0         0 0 0 24196 24196
2    8 0  16 0 0         0 0 0 22017 22017
1  214 0 428 0 0 627013701 0 0 72004 72004
3   22 0  44 0 0      2808 0 0    43    43
4    0 0   0 0 0         0 0 0     0     0
5   12 0  24 0 0      2855 0 0  6901  6901
6  213 0 426 0 0      3691 0 0 24252 24252

sum 508 0 1016 0 0 627028986 0 0 152278 152278
OSD_STAT USED     AVAIL   USED_RAW TOTAL   HB_PEERS PG_SUM PRIMARY_PG_SUM
3         504 MiB  99 GiB  1.5 GiB 100 GiB      [2]    272            125
2         504 MiB  99 GiB  1.5 GiB 100 GiB      [3]    272            147
sum      1009 MiB 197 GiB  3.0 GiB 200 GiB

Comment 4 Mudit Agarwal 2021-04-07 06:57:41 UTC
Moving it to rook for initial analysis.

Comment 5 Martin Bukatovic 2021-04-07 18:30:02 UTC
Note that there are a few bugs open about the OCS cluster being unable to recover itself during zone disruption, see:

- BZ 1942680 idle storage cluster in arbiter mode doesn't recover from short network split between one data zone and remaining arbiter and data zone (ON QA)
- BZ 1943596 When Performed zone(zone=a) Power off and Power On, 3 mon pod(zone=b,c) goes in CLBO after node Power off and 2 Osd(zone=a) goes in CLBO after node Power on (MODIFIED)
- BZ 1939617 Mons cannot be failed over in stretch mode (POST)

The QE team hasn't yet reached a point where we can start testing with workloads, as we need to make sure that simple disruption scenarios without a workload are solid first.

Comment 6 Martin Bukatovic 2021-04-07 18:40:01 UTC
The fact that we see 100% of PGs down is definitely a problem. I haven't seen this so far. We will definitely need to retest once the bugs noted in comment 5 are verified.

Comment 7 Travis Nielsen 2021-04-07 18:42:46 UTC
Martin, the fixes for 1942680 and 1943596 were both included in RC3. I was testing stopping and starting nodes and verified that at least the latter BZ is no longer an issue.

Since RC3 has those blockers fixed, could you go ahead and test with a workload now?

Comment 9 Travis Nielsen 2021-04-07 20:10:06 UTC
On the original issue, Annette and I spent some time on the cluster where this issue with 100% unknown PGs was hit.

To get the cluster healthy again, we did the following:
- Start the nodes that were down
- Even after starting all the nodes, the PGs remained unknown and the OSDs were not showing as "up". 
- The "reweight" was showing as 0.0 for the two OSDs that had been down
- We set the reweight on those two OSDs from the toolbox with a command such as: ceph osd reweight osd.0 1.0 (see the sketch after this list)
- Then the PGs immediately became healthy again, the cluster health was restored, and the wordpress app was again responsive
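
The recovery sequence above, expressed as toolbox commands; the reweight command for osd.0 is quoted from this comment, and applying the same to osd.1 is an assumption based on the two OSDs that had been down:

# Check which OSDs still show a reweight of 0 after the zone came back
ceph osd tree

# Restore the reweight on the two OSDs that had been down
ceph osd reweight osd.0 1.0
ceph osd reweight osd.1 1.0

# Watch the PGs return to active and cluster health recover
ceph -s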

Next, we attempted to repro by bringing down the zone again a couple of times.
- We could not repro the PGs being 100% unknown. Each time we brought a zone down, only 50% of the PGs were unavailable, as expected, with the cluster staying responsive.

@gfarnum Does the issue with 100% unknown PGs sound related to the zone going down, or to other stretch cluster behavior? In any case, since we could not repro it, it would not be a blocker for 4.7.

We did observe two issues to consider separately from this BZ:
1. If the mgr was in the zone that was taken down, the ceph status was inaccurate and the ceph osd commands were not responsive for a couple of minutes until Rook moved the mgr to the other zone. This happened automatically with Rook/k8s. There is already an improvement for this in 4.8 where we will have two mgrs, one in each data zone, so this behavior is expected. We may still want to consider backporting this to 4.7.
2. If the application pod was on a killed node, the app could not move to another zone because of the famous "multi-attach error" from the rbd volume. This is the same issue that affects any OCS cluster when a node is unresponsive.
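
A hedged sketch of how the multi-attach symptom typically surfaces on the application side (the namespace and pod names are placeholders, not taken from this cluster's logs):

# The rescheduled pod's events usually include a FailedAttachVolume /
# "Multi-Attach error for volume ..." message while the old node is unreachable
oc -n my-app-namespace describe pod my-app-pod | grep -i -A2 attach
oc -n my-app-namespace get events --sort-by=.lastTimestamp | grep -i multi-attach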

Comment 10 Annette Clewett 2021-04-07 22:21:59 UTC
@Travis 
Just one correction. 

To get the cluster healthy again, we did the following:
- Start the nodes that were down
- Even after starting all the nodes, the PGs remained unknown and the OSDs were showing as "up".

Before manually setting the reweight to 1 (from 0), all four OSDs were "up" and "in". Even so, the cluster did not recover until the reweight was manually changed from 0 to 1 for osd.0 and osd.1.

Comment 15 Mudit Agarwal 2021-04-13 10:04:52 UTC
Greg/Scott, do we have a corresponding RHCS BZ for this, or should I create one?
Providing dev ack as this is expected to be fixed in 4.2z1.

Comment 16 Greg Farnum 2021-04-13 15:16:09 UTC
Created https://bugzilla.redhat.com/show_bug.cgi?id=1949166 in RHCS

Comment 17 Greg Farnum 2021-04-13 19:47:22 UTC
Pushed fix to ceph-4.2-rhel-patches, so it should be available for OCS testing soon.

Comment 18 Greg Farnum 2021-04-14 03:12:51 UTC
> Fixed In Version: ceph-14.2.11-147.el8cp, ceph-14.2.11-147.el7cp

Comment 25 errata-xmlrpc 2021-05-19 09:21:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 26 Martin Bukatovic 2021-05-31 17:43:49 UTC
Dropping needinfo on me from 2021-04-07.

