Description of problem (please be as detailed as possible and provide log snippets):

OCS cluster deployed in Metro DR Stretched Mode (Arbiter mode). When shutting down an entire zone to test service continuity, all the PGs in the cluster become inactive.

4 worker node cluster
1 OSD per node
LSO

Version of all relevant components (if applicable):

Deployed OCS 4.7 RC3
VMware environment (BLR lab)
Client Version: 4.7.4
Server Version: 4.7.2
Kubernetes Version: v1.20.0+5fbfd19

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-339.ci   OpenShift Container Storage   4.7.0-339.ci              Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5

Can this issue be reproduced?
Yes, deploy a cluster as mentioned above.

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCS 4.7 RC3 using 4 worker nodes, 1 OSD per node
2. Shut down an entire zone (at least 2 OSDs and 1 MON will go down)
3. Check the status of the OCS cluster

Actual results:
The cluster becomes totally unavailable.

Expected results:
The cluster keeps operating using the 2 surviving copies of each PG.
Additional info:

First, it is worth noting that the OSDs that go down are never marked out of the cluster.

# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                             STATUS REWEIGHT PRI-AFF
 -1       0.39075 root default
 -4       0.19537     zone datacenter1
 -3       0.09769         host perf1-mz8bt-worker-d2hdm
  2   ssd 0.09769             osd.2                         up  1.00000 1.00000
-13       0.09769         host perf1-mz8bt-worker-k68rv
  3   ssd 0.09769             osd.3                         up  1.00000 1.00000
 -8       0.19537     zone datacenter2
 -7       0.09769         host perf1-mz8bt-worker-ntkp8
  0   ssd 0.09769             osd.0                       down        0 1.00000
-11       0.09769         host perf1-mz8bt-worker-qpwsr
  1   ssd 0.09769             osd.1                       down        0 1.00000

# ceph -s
  cluster:
    id:     fed692ff-aec8-4955-98c9-cba480032c9e
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1 MDSs report slow metadata IOs
            Reduced data availability: 272 pgs inactive
            Degraded data redundancy: 1016/2032 objects degraded (50.000%), 158 pgs degraded, 103 pgs undersized
            2/5 mons down, quorum c,d,e

  services:
    mon: 5 daemons, quorum c,d,e (age 2h), out of quorum: a, b
    mgr: a(active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1/1 {0=ocs-storagecluster-cephfilesystem-b=up:replay}
    osd: 4 osds: 2 up (since 3m), 2 in (since 19m)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.b)

  task status:

  data:
    pools:   10 pools, 272 pgs
    objects: 508 objects, 598 MiB
    usage:   3.0 GiB used, 197 GiB / 200 GiB avail
    pgs:     100.000% pgs not active
             1016/2032 objects degraded (50.000%)
             158 undersized+degraded+peered
             114 undersized+peered

CRUSH rule generated for all the pools:

rule default_stretch_cluster_rule {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type zone
    step chooseleaf firstn 2 type host
    step emit
}

Tried to change the CRUSH rule in many ways and to mark the OSDs out, but nothing allowed the cluster to come back online.
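As a side note, the "50.000% objects degraded" figure is exactly what the replica-4 stretch layout (two copies per zone) predicts when one zone is lost; the real problem is the "100.000% pgs not active" line, not the degradation level. A quick sanity check over the status line above (the parser is my own illustration, not part of this report):

```python
import re

# Line copied verbatim from the `ceph -s` output in this bug.
STATUS_LINE = "Degraded data redundancy: 1016/2032 objects degraded (50.000%)"

def degraded_ratio(line: str) -> float:
    """Extract degraded/total object counts and return the percentage."""
    m = re.search(r"(\d+)/(\d+) objects degraded", line)
    if not m:
        raise ValueError("no degraded-objects counts found")
    degraded, total = int(m.group(1)), int(m.group(2))
    return 100.0 * degraded / total

# With 2 of 4 copies per object in the dead zone, exactly half are degraded.
print(degraded_ratio(STATUS_LINE))  # prints 50.0
```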
Here is an excerpt of the PG map:

1.78 1 0 2 0 0 3944448 0 0 3042 3042 undersized+degraded+peered 2021-04-07 00:43:03.961281 68'6914 291:7352 [3,2] 3 [3,2] 3 42'402 2021-04-06 00:53:16.390866 0'0 2021-04-06 00:51:52.548637 0
1.79 1 0 2 0 0 4194304 0 0 20 20 undersized+degraded+peered 2021-04-07 00:43:03.964491 42'20 290:420 [2,3] 2 [2,3] 2 36'19 2021-04-06 00:54:29.998342 0'0 2021-04-06 00:51:52.548637 0
1.7a 1 0 2 0 0 8192 0 0 185 185 undersized+degraded+peered 2021-04-07 00:43:03.964235 42'185 290:600 [2,3] 2 [2,3] 2 36'181 2021-04-06 00:53:30.768098 0'0 2021-04-06 00:51:52.548637 0
1.7b 2 0 4 0 0 4227072 0 0 493 493 undersized+degraded+peered 2021-04-07 00:43:03.965468 42'493 290:806 [2,3] 2 [2,3] 2 36'144 2021-04-06 00:53:00.571403 36'144 2021-04-06 00:53:00.571403 0
1.7c 1 0 2 0 0 4128768 0 0 1198 1198 undersized+degraded+peered 2021-04-07 00:43:03.965629 68'1198 290:1631 [2,3] 2 [2,3] 2 42'168 2021-04-06 00:53:42.393343 0'0 2021-04-06 00:51:52.548637 0
1.7d 0 0 0 0 0 0 0 0 18 18 undersized+peered 2021-04-07 00:43:03.966430 42'18 290:325 [2,3] 2 [2,3] 2 36'15 2021-04-06 00:53:33.763293 0'0 2021-04-06 00:51:52.548637 0
1.7e 2 0 4 0 0 6422528 0 0 382 382 undersized+degraded+peered 2021-04-07 00:43:03.965340 42'382 290:780 [2,3] 2 [2,3] 2 36'22 2021-04-06 00:54:09.280864 36'22 2021-04-06 00:54:09.280864 0
1.7f 0 0 0 0 0 833536 0 0 38 38 undersized+peered 2021-04-07 00:43:03.968886 42'38 291:358 [3,2] 3 [3,2] 3 36'35 2021-04-06 00:53:55.973299 0'0 2021-04-06 00:51:52.548637 0

Per-pool sums:

10  1   0 2    0 0 1024      0 0 2831   2831
9   16  0 32   0 0 4907      0 0 34     34
8   0   0 0    0 0 0         0 0 0      0
7   22  0 44   0 0 0         0 0 24196  24196
2   8   0 16   0 0 0         0 0 22017  22017
1   214 0 428  0 0 627013701 0 0 72004  72004
3   22  0 44   0 0 2808      0 0 43     43
4   0   0 0    0 0 0         0 0 0      0
5   12  0 24   0 0 2855      0 0 6901   6901
6   213 0 426  0 0 3691      0 0 24252  24252
sum 508 0 1016 0 0 627028986 0 0 152278 152278

OSD_STAT USED     AVAIL   USED_RAW TOTAL   HB_PEERS PG_SUM PRIMARY_PG_SUM
3        504 MiB  99 GiB  1.5 GiB  100 GiB [2]      272    125
2        504 MiB  99 GiB  1.5 GiB  100 GiB [3]      272    147
sum      1009 MiB 197 GiB 3.0 GiB  200 GiB
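Reading the excerpt: every listed PG is "peered" (the surviving replicas on osd.2/osd.3 have agreed on state) but none carry the "active" flag, so no PG can serve I/O, which matches the "100.000% pgs not active" line in `ceph -s`. A small tally over the rows above (the script is my own illustration, not from the report):

```python
from collections import Counter

# (pgid, state) pairs taken from the PG map excerpt above,
# trimmed to the two columns the illustration needs.
PG_ROWS = [
    ("1.78", "undersized+degraded+peered"),
    ("1.79", "undersized+degraded+peered"),
    ("1.7a", "undersized+degraded+peered"),
    ("1.7b", "undersized+degraded+peered"),
    ("1.7c", "undersized+degraded+peered"),
    ("1.7d", "undersized+peered"),
    ("1.7e", "undersized+degraded+peered"),
    ("1.7f", "undersized+peered"),
]

# Count distinct PG states and PGs lacking the "active" flag.
states = Counter(state for _, state in PG_ROWS)
inactive = sum(1 for _, s in PG_ROWS if "active" not in s.split("+"))
print(dict(states))
print(f"{inactive}/{len(PG_ROWS)} PGs in the excerpt are inactive")
```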
Moving it to Rook for initial analysis.
Note that there are a few bugs open about the OCS cluster being unable to recover itself during a zone disruption, see:

- BZ 1942680: idle storage cluster in arbiter mode doesn't recover from short network split between one data zone and remaining arbiter and data zone (ON QA)
- BZ 1943596: When Performed zone(zone=a) Power off and Power On, 3 mon pod(zone=b,c) goes in CLBO after node Power off and 2 Osd(zone=a) goes in CLBO after node Power on (MODIFIED)
- BZ 1939617: Mons cannot be failed over in stretch mode (POST)

The QE team hasn't yet reached the point where we can start testing with workloads, as we need to make sure that simple disruption scenarios without a workload are solid first.
The fact that we see 100% of PGs down is definitely a problem. I haven't seen this so far. We will definitely need to retest once the bugs noted in comment 5 are verified.
Martin, the fixes for 1942680 and 1943596 were both included in RC3. I was testing stopping and starting nodes and at least verified that the latter BZ is no longer an issue. Since RC3 has those blockers fixed, can you go ahead and test with a workload now?
On the original issue, Annette and I spent some time on the cluster where this issue with 100% unknown PGs was hit. To get the cluster healthy again, we did the following:
- Started the nodes that were down.
- Even after starting all the nodes, the PGs remained unknown and the OSDs were not showing as "up".
- The "reweight" was showing as 0.0 for the two OSDs that had been down.
- We set the reweight on those two OSDs from the toolbox with a command such as: ceph osd reweight osd.0 1.0
- Then the PGs immediately became healthy again, cluster health was restored, and the wordpress app was again responsive.

Next, we attempted to repro by bringing down the zone again a couple of times:
- We could not repro the PGs being 100% unknown. Each time we brought a zone down, the PGs were only 50% unavailable as expected, and the cluster stayed responsive.

@gfarnum Does the issue with 100% unknown PGs sound related to the zone going down, or to some other stretch cluster behavior? In any case, if we cannot repro it, it would not be a blocker for 4.7.

We did observe two issues to consider separately from this BZ:
1. If the mgr was in the zone that was taken down, the ceph status was inaccurate and the ceph osd commands were unresponsive for a couple of minutes until Rook moved the mgr to the other zone. This happened automatically with Rook/k8s. There is already an improvement for this in 4.8 where we will have two mgrs, one in each data zone, so this is expected. We may still want to consider backporting that to 4.7.
2. If the application pod was on a killed node, the app could not move to another zone because of the famous "multi-attach error" from the RBD volume. This is the same issue that affects any OCS cluster when a node is unresponsive.
@Travis Just one correction. To get the cluster healthy again, we did the following:
- Started the nodes that were down.
- Even after starting all the nodes, the PGs remained unknown, but the OSDs were showing as "up".

Before manually setting the reweight to 1 (from 0), all four OSDs were "up" and "in". Even so, the cluster did not recover until the reweight was manually changed from 0 to 1 for osd.0 and osd.1.
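The stuck state described here can be spotted mechanically: OSDs reported "up" but still carrying a reweight of 0. A sketch of such a check (my own, assuming the node fields usually found in `ceph osd tree -f json` output; the sample values mirror this cluster):

```python
import json

# Minimal sample shaped like `ceph osd tree -f json` output (field names
# assumed, not taken from this report); values mirror the state seen here:
# all OSDs back "up", but osd.0/osd.1 still at reweight 0.
SAMPLE = json.loads("""
{"nodes": [
  {"id": 2, "name": "osd.2", "type": "osd", "status": "up", "reweight": 1.0},
  {"id": 3, "name": "osd.3", "type": "osd", "status": "up", "reweight": 1.0},
  {"id": 0, "name": "osd.0", "type": "osd", "status": "up", "reweight": 0.0},
  {"id": 1, "name": "osd.1", "type": "osd", "status": "up", "reweight": 0.0}
]}
""")

def stuck_osds(tree: dict) -> list:
    """Return names of OSDs that are up but left at reweight 0."""
    return [n["name"] for n in tree["nodes"]
            if n.get("type") == "osd" and n["status"] == "up"
            and n["reweight"] == 0.0]

for name in stuck_osds(SAMPLE):
    # The manual fix applied on this cluster was, from the toolbox:
    #   ceph osd reweight <name> 1.0
    print(name)
```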
Greg/Scott, do we have a corresponding RHCS BZ for this, or should I create one? Providing dev ack as this is expected to be fixed in 4.2z1.
Created https://bugzilla.redhat.com/show_bug.cgi?id=1949166 in RHCS
Pushed fix to ceph-4.2-rhel-patches, so it should be available for OCS testing soon.
> Fixed In Version: ceph-14.2.11-147.el8cp, ceph-14.2.11-147.el7cp
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041
Dropping needinfo on me from 2021-04-07.