Bug 1950419
| Summary: | [RFE] Change PDB Controller behavior for single OSD failure caused by failed drive | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Jean-Charles Lopez <jelopez> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED ERRATA | QA Contact: | krishnaram Karthick <kramdoss> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.7 | CC: | aclewett, etamir, jelopez, kramdoss, madam, muagarwa, nberry, ocs-bugs, olakra, ratamir, sapillai, tnielsen |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | OCS 4.8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Enhancement | |
| Doc Text: |
.Prevent adding the `noout` flag on the failure domain if an OSD is down due to reasons other than node drain
Previously, when an OSD was down due to disk failure, a `noout` flag was added on the failure domain, which prevented the OSD from being marked out by the standard ceph `mon_osd_down_out_interval`. With this update, when an OSD is down due to reasons other than node drain, such as disk failure, and the PGs are unhealthy, rook creates a blocking PodDisruptionBudget on each of the other failure domains to prevent further node drains on them; the `noout` flag is not set in this case. If the OSD is down but all the PGs are `active+clean`, the cluster is treated as fully healthy: the default PodDisruptionBudget (with `maxUnavailable=1`) is added back and the blocking ones are deleted.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-08-03 18:15:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
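The behavior described in the Doc Text can be summarized as a small decision table. The following Go sketch is illustrative only — the struct and function names (`ClusterState`, `Action`, `decide`) are assumptions for this write-up, not rook's actual code:

```go
package main

import "fmt"

// ClusterState captures the inputs the disruption controller reasons about.
// (Illustrative names; not rook's actual types.)
type ClusterState struct {
	OSDDownFromDrain  bool // an OSD is down because its node is being drained
	OSDDownOtherCause bool // an OSD is down for another reason, e.g. a failed disk
	PGsActiveClean    bool // all placement groups report active+clean
}

// Action is what the controller should do for a given state.
type Action struct {
	SetNooutOnDomain   bool // add the noout flag on the affected failure domain
	CreateBlockingPDBs bool // blocking PDBs (maxUnavailable=0) on the other failure domains
	RestoreDefaultPDB  bool // single default PDB with maxUnavailable=1
}

// decide sketches the fixed behavior: noout is only used for node drains;
// a disk failure with unhealthy PGs blocks further drains without noout;
// a fully clean cluster gets the default PDB back.
func decide(s ClusterState) Action {
	switch {
	case s.OSDDownFromDrain:
		return Action{SetNooutOnDomain: true, CreateBlockingPDBs: true}
	case s.OSDDownOtherCause && !s.PGsActiveClean:
		// Disk failure: block drains elsewhere, but leave noout unset so the
		// standard ceph mon_osd_down_out_interval can mark the OSD out.
		return Action{CreateBlockingPDBs: true}
	default:
		// All PGs active+clean: treat the cluster as fully healthy.
		return Action{RestoreDefaultPDB: true}
	}
}

func main() {
	diskFailure := ClusterState{OSDDownOtherCause: true}
	fmt.Println(decide(diskFailure).CreateBlockingPDBs) // true
	fmt.Println(decide(diskFailure).SetNooutOnDomain)   // false
}
```

The tests below exercise exactly these three branches against the scenarios run in this bug.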
Just needs to be backported downstream.

This is in the latest resync to release-4.8.

LGTM, thanks.

Test 1:
========
3 zones, 1 node per zone and 2 OSDs per node; Fail OSD1 from node 1
Behavior seen:
==========
no blocking pdbs created
pgs are active+clean; no-out flag wasn't set (as expected)
output:
========
oc get csv
NAME DISPLAY VERSION REPLACES PHASE
ocs-operator.v4.8.0-450.ci OpenShift Container Storage 4.8.0-450.ci Succeeded
oc rsh rook-ceph-tools-64d88c9b9f-jbmts ceph -s
cluster:
id: 1c7cc447-0457-4b99-908a-2eb8446b640b
health: HEALTH_WARN
1 daemons have recently crashed
services:
mon: 3 daemons, quorum a,b,c (age 2h)
mgr: a(active, since 2h)
mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
osd: 6 osds: 5 up (since 66m), 5 in (since 56m)
data:
pools: 3 pools, 288 pgs
objects: 556 objects, 1.4 GiB
usage: 8.4 GiB used, 2.5 TiB / 2.5 TiB avail
pgs: 288 active+clean
oc rsh rook-ceph-tools-64d88c9b9f-jbmts ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 3.00000 root default
-5 3.00000 region us-east-2
-4 1.00000 zone us-east-2a
-3 0.50000 host ocs-deviceset-1-data-0wh4nf
0 ssd 0.50000 osd.0 down 0 1.00000
-17 0.50000 host ocs-deviceset-2-data-1prdmr
3 ssd 0.50000 osd.3 up 1.00000 1.00000
-10 1.00000 zone us-east-2b
-9 0.50000 host ocs-deviceset-0-data-06khrt
2 ssd 0.50000 osd.2 up 1.00000 1.00000
-21 0.50000 host ocs-deviceset-0-data-1cbb5n
5 ssd 0.50000 osd.5 up 1.00000 1.00000
-14 1.00000 zone us-east-2c
-19 0.50000 host ocs-deviceset-1-data-1p9hsj
4 ssd 0.50000 osd.4 up 1.00000 1.00000
-13 0.50000 host ocs-deviceset-2-data-07npwz
1 ssd 0.50000 osd.1 up 1.00000 1.00000
oc get pdb -n openshift-storage
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 3h
rook-ceph-mon-pdb N/A 1 1 178m
rook-ceph-osd N/A 1 0 76m
Test 2:
========
3 zones, 1 node per zone and 2 OSDs per node; Fail OSD1 and OSD2 from node 1
i.e., fail all OSDs on a zone
Behavior seen:
==============
blocking pdbs created
pgs are not active+clean
drain on other node was blocked
output:
========
oc rsh rook-ceph-tools-64d88c9b9f-jbmts ceph -s
cluster:
id: 1c7cc447-0457-4b99-908a-2eb8446b640b
health: HEALTH_WARN
1 osds down
2 hosts (2 osds) down
1 zone (2 osds) down
Degraded data redundancy: 900/2700 objects degraded (33.333%), 144 pgs degraded, 288 pgs undersized
2 daemons have recently crashed
services:
mon: 3 daemons, quorum a,b,c (age 5h)
mgr: a(active, since 5h)
mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
osd: 6 osds: 4 up (since 46m), 5 in (since 3h)
data:
pools: 3 pools, 288 pgs
objects: 900 objects, 2.7 GiB
usage: 12 GiB used, 2.5 TiB / 2.5 TiB avail
pgs: 900/2700 objects degraded (33.333%)
144 active+undersized+degraded
144 active+undersized
io:
client: 1.2 KiB/s rd, 109 KiB/s wr, 2 op/s rd, 2 op/s wr
oc get pdb -n openshift-storage
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 5h3m
rook-ceph-mon-pdb N/A 1 1 5h2m
rook-ceph-osd-zone-us-east-2b N/A 0 0 46m
rook-ceph-osd-zone-us-east-2c N/A 0 0 46m
$ oc adm drain ip-10-0-183-99.us-east-2.compute.internal
node/ip-10-0-183-99.us-east-2.compute.internal cordoned
error: unable to drain node "ip-10-0-183-99.us-east-2.compute.internal", aborting command...
There are pending nodes to be drained:
ip-10-0-183-99.us-east-2.compute.internal
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-rhvnv, openshift-cluster-node-tuning-operator/tuned-8mhdc, openshift-dns/dns-default-zrbvr, openshift-dns/node-resolver-dp489, openshift-image-registry/node-ca-rptwl, openshift-ingress-canary/ingress-canary-jnzxz, openshift-machine-config-operator/machine-config-daemon-7zktc, openshift-monitoring/node-exporter-hvlg5, openshift-multus/multus-additional-cni-plugins-5bvcg, openshift-multus/multus-ngk2d, openshift-multus/network-metrics-daemon-nwmp4, openshift-network-diagnostics/network-check-target-7nqcn, openshift-sdn/sdn-fk4f9, openshift-storage/csi-cephfsplugin-76zkm, openshift-storage/csi-rbdplugin-vkgjl
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-image-registry/image-registry-54c4758b4d-lst42, openshift-monitoring/prometheus-adapter-dcc8d9658-pk9ph, openshift-monitoring/prometheus-k8s-0, openshift-storage/csi-cephfsplugin-provisioner-78d7667cb8-dxmfd, openshift-storage/rook-ceph-mgr-a-85bbdf4f54-tdpt4, openshift-storage/rook-ceph-osd-2-599cbff5c-pp6dq, openshift-storage/rook-ceph-osd-5-77f99cb98-772v4
cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): openshift-marketplace/ocs-catalogsource-pwps9
snippet of rook-operator logs:
2021-07-13 15:01:50.860692 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2021-07-13 15:01:50.860778 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2021-07-13 15:01:51.195883 I | clusterdisruption-controller: osd is down in failure domain "us-east-2a" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:144} {StateName:active+undersized Count:144}]"
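The "pgs are not active+clean" gate in the log line above can be sketched as a predicate over the per-state PG counts the controller prints. `PGState` and `allClean` are hypothetical names for illustration, not rook's actual code:

```go
package main

import "fmt"

// PGState mirrors the per-state counts printed by the
// clusterdisruption-controller, e.g.
// "PGs: [{StateName:active+undersized+degraded Count:144} {StateName:active+undersized Count:144}]".
type PGState struct {
	StateName string
	Count     int
}

// allClean is a hypothetical helper showing the gate in the log above:
// blocking PDBs remain until every PG is active+clean.
func allClean(states []PGState) bool {
	for _, s := range states {
		if s.Count > 0 && s.StateName != "active+clean" {
			return false
		}
	}
	return true
}

func main() {
	// Counts taken from the Test 2 operator log above.
	degraded := []PGState{
		{StateName: "active+undersized+degraded", Count: 144},
		{StateName: "active+undersized", Count: 144},
	}
	fmt.Println(allClean(degraded))                         // false
	fmt.Println(allClean([]PGState{{"active+clean", 288}})) // true
}
```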
Test 3:
========
Failed one OSD from zone 1 node 1; no blocking pdbs created
Failed one more OSD from zone 1 node 2; blocking pdbs created
Waited for pgs to be active+clean; blocking pdbs were removed
No `noout` flag was set
oc rsh rook-ceph-tools-64d88c9b9f-bh4db ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 6.00000 root default
-5 6.00000 region us-east-2
-4 2.00000 zone us-east-2a
-3 0.50000 host ocs-deviceset-0-data-09mvqv
0 ssd 0.50000 osd.0 down 1.00000 1.00000
-19 0.50000 host ocs-deviceset-1-data-1q5d87
4 ssd 0.50000 osd.4 up 1.00000 1.00000
-31 0.50000 host ocs-deviceset-1-data-3mvqbh
10 ssd 0.50000 osd.10 up 1.00000 1.00000
-27 0.50000 host ocs-deviceset-2-data-25q882
8 ssd 0.50000 osd.8 down 1.00000 1.00000
-10 2.00000 zone us-east-2b
-23 0.50000 host ocs-deviceset-0-data-2c2r4s
6 ssd 0.50000 osd.6 up 1.00000 1.00000
-29 0.50000 host ocs-deviceset-0-data-3pcq88
9 ssd 0.50000 osd.9 up 1.00000 1.00000
-9 0.50000 host ocs-deviceset-2-data-0j95zf
1 ssd 0.50000 osd.1 up 1.00000 1.00000
-17 0.50000 host ocs-deviceset-2-data-16cbbl
5 ssd 0.50000 osd.5 up 1.00000 1.00000
-14 2.00000 zone us-east-2c
-21 0.50000 host ocs-deviceset-0-data-1r4dqw
3 ssd 0.50000 osd.3 up 1.00000 1.00000
-13 0.50000 host ocs-deviceset-1-data-0zcbfz
2 ssd 0.50000 osd.2 up 1.00000 1.00000
-25 0.50000 host ocs-deviceset-1-data-26qd87
7 ssd 0.50000 osd.7 up 1.00000 1.00000
-33 0.50000 host ocs-deviceset-2-data-3cr9f9
11 ssd 0.50000 osd.11 up 1.00000 1.00000
oc get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 6h49m
rook-ceph-mon-pdb N/A 1 1 6h47m
rook-ceph-osd-zone-us-east-2b N/A 0 0 2m32s
rook-ceph-osd-zone-us-east-2c N/A 0 0 2m32s
drain on a node in zone 2 failed:
oc adm drain ip-10-0-180-242.us-east-2.compute.internal
node/ip-10-0-180-242.us-east-2.compute.internal cordoned
error: unable to drain node "ip-10-0-180-242.us-east-2.compute.internal", aborting command...
There are pending nodes to be drained:
ip-10-0-180-242.us-east-2.compute.internal
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-bpmpd, openshift-cluster-node-tuning-operator/tuned-25trr, openshift-dns/dns-default-zbj5v, openshift-dns/node-resolver-psdkz, openshift-image-registry/node-ca-cmq2b, openshift-ingress-canary/ingress-canary-tgb5j, openshift-machine-config-operator/machine-config-daemon-6sfhb, openshift-monitoring/node-exporter-6f5jp, openshift-multus/multus-additional-cni-plugins-rbrnj, openshift-multus/multus-zqgqx, openshift-multus/network-metrics-daemon-zvc9z, openshift-network-diagnostics/network-check-target-sxjc7, openshift-sdn/sdn-t759j, openshift-storage/csi-cephfsplugin-np746, openshift-storage/csi-rbdplugin-mqwtw
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-storage/rook-ceph-osd-6-8649977c5-d26k4, openshift-storage/rook-ceph-osd-9-65b8f95c55-6smrz
[krishnaramkarthickramdoss@localhost ~]$ oc adm uncordon ip-10-0-180-242.us-east-2.compute.internal
oc rsh rook-ceph-tools-64d88c9b9f-bh4db ceph -s
cluster:
id: 3f490417-f3ff-40c8-88ee-df83300233a3
health: HEALTH_WARN
2 osds down
2 hosts (2 osds) down
Degraded data redundancy: 2209/12840 objects degraded (17.204%), 76 pgs degraded, 151 pgs undersized
services:
mon: 3 daemons, quorum a,b,c (age 6h)
mgr: a(active, since 6h)
mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
osd: 12 osds: 10 up (since 2m), 12 in (since 2h)
data:
pools: 3 pools, 288 pgs
objects: 4.28k objects, 16 GiB
usage: 57 GiB used, 5.9 TiB / 6 TiB avail
pgs: 2209/12840 objects degraded (17.204%)
137 active+clean
76 active+undersized+degraded
75 active+undersized
io:
client: 3.2 KiB/s rd, 154 KiB/s wr, 2 op/s rd, 1 op/s wr
(In reply to krishnaram Karthick from comment #21)
> Test 1:
> ========
> 3 zones, 1 node per zone and 2 OSDs per node; Fail OSD1 from node 1
>
> Behavior seen:
> ==========
> no blocking pdbs created

So `Test 1` and `Test 3` are suggesting that blocking pdbs are not created after failing the first OSD. Do PGs never go into a degraded state after failing the first disk?

When a disk (OSD) fails, ceph takes a few seconds to recognise that. It won't be instantaneous. So PGs will be in a degraded state, and at that time blocking pdbs should be created. Yes, the `noout` flag won't be set. And once the PGs are active+clean, the blocking pdbs should be deleted.

> pgs are active+clean; no-out flag wasn't set (as expected)

(In reply to Santosh Pillai from comment #25)
> So `Test 1` and `Test 3` are suggesting that blocking pdbs are not created
> after failing the first OSD. Do PGs never go in degraded state after failing
> the first disk?

I don't see PDBs getting created after the first OSD failure, and PGs do go into the degraded state. Do we see this behavior because there were 2 OSDs in all of the tests run?

(In reply to krishnaram Karthick from comment #26)
> I don't see PDBs getting created after the first OSD failure and PGs do go
> into the degraded state.

Can you provide the rook operator logs of this state (one OSD is down, PGs are degraded, and blocking PDBs are not getting created)? And maybe the cluster setup in this state as well, if possible.

> Do we see this behavior because there were 2 OSDs in all of the tests run?

I've only tested with a single OSD on each node in a three node cluster.

I tried another test with 3 zones; 1 node per zone and 1 osd per node.
Failed one OSD; pgs were degraded; no blocking pdbs were seen
Attaching operator logs as requested & providing the cluster details to Santosh in private chat.
oc rsh rook-ceph-tools-bd9b467... localhost.localdomain: Wed Jul 21 13:06:17 2021
cluster:
id: 44805ffa-79f4-4349-9913-9b2e0dc52bfa
health: HEALTH_WARN
1 osds down
1 host (1 osds) down
1 zone (1 osds) down
Degraded data redundancy: 4166/12498 objects degraded (33.333%), 47 pgs degraded, 96 pgs undersized
1 daemons have recently crashed
services:
mon: 3 daemons, quorum a,b,c (age 95m)
mgr: a(active, since 94m)
mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
osd: 3 osds: 2 up (since 13m), 3 in (since 94m)
data:
pools: 3 pools, 96 pgs
objects: 4.17k objects, 15 GiB
oc get pdb localhost.localdomain: Wed Jul 21 13:09:09 2021
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 99m
rook-ceph-mon-pdb N/A 1 1 97m
rook-ceph-osd N/A 1 0 98m
Raised a new bug for comment#28 - https://bugzilla.redhat.com/show_bug.cgi?id=1984396

Moving the RFE to verified based on the tests I ran above. A new bug has been raised for the issue discussed in comment#28.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003 |
Description of problem (please be detailed as possible and provide log snippets):

If a single drive fails, rook will generate a temporary PDB to mark the node as draining. In doing so, it also sets the noout flag on the rack containing the node where the OSD runs, preventing the OSD from being marked out using the standard ceph mon_osd_down_out_interval value, in favor of a timer-based condition.

Version of all relevant components (if applicable): OCS 4.7

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? Yes, as this would impact the behavior of the cluster if another node were failing in a different availability zone.

Is there any workaround available to the best of your knowledge? Yes. Remove the OSD from the cluster once it is marked out and the cluster has rebalanced. It took over 30 minutes to reach this condition on an almost empty cluster. Without doing this, the temporary PDB will remain forever.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 5

Is this issue reproducible? Yes

Can this issue be reproduced from the UI? No

If this is a regression, please provide more details to justify this: No

Steps to Reproduce:
1. Deploy a cluster
2. Pull one physical drive, or destroy one drive if in the cloud, to force an OSD to crash
3. Observe PDBs

Actual results:

OSD 9 is the failed OSD, as we pulled the drive out of the chassis.
HEALTH_WARN 1 osds down; Degraded data redundancy: 1339/34677 objects degraded (3.861%), 135 pgs degraded, 140 pgs undersized; 1 daemons have recently crashed
OSD_DOWN 1 osds down
osd.9 (root=default,rack=rack0,host=e4n1-fbond) is down
PG_DEGRADED Degraded data redundancy: 1339/34677 objects degraded (3.861%), 135 pgs degraded, 140 pgs undersized

After nearly 40 minutes (iirc), here is the status of the cluster:

cluster:
id: 81670748-ad37-4e3c-b0d2-b1bb47eb76f9
health: HEALTH_WARN
1 daemons have recently crashed
services:
mon: 3 daemons, quorum a,b,c (age 26h)
mgr: a(active, since 7d)
mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
osd: 24 osds: 23 up (since 46m), 23 in (since 5m)
rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)
task status:
scrub status:
mds.ocs-storagecluster-cephfilesystem-a: idle
mds.ocs-storagecluster-cephfilesystem-b: idle
data:
pools: 10 pools, 1136 pgs
objects: 11.57k objects, 22 GiB
usage: 86 GiB used, 83 TiB / 84 TiB avail
pgs: 1136 active+clean
io:
client: 1.6 KiB/s rd, 286 KiB/s wr, 2 op/s rd, 20 op/s wr

Rook operator log:

2021-04-15 20:21:07.173323 I | op-k8sutil: deployment "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a" did not change, nothing to update
2021-04-15 20:21:07.173348 I | cephclient: getting or creating ceph auth key "mds.ocs-storagecluster-cephfilesystem-b"
2021-04-15 20:21:07.360541 I | ceph-object-controller: Multisite for object-store: realm=ocs-storagecluster-cephobjectstore, zonegroup=ocs-storagecluster-cephobjectstore, zone=ocs-storagecluster-cephobjectstore
2021-04-15 20:21:07.360576 I | ceph-object-controller: multisite configuration for object-store ocs-storagecluster-cephobjectstore is complete
2021-04-15 20:21:07.360591 I | ceph-object-controller: creating object store "ocs-storagecluster-cephobjectstore" in namespace "openshift-storage"
2021-04-15 20:21:07.360621 I | cephclient: getting or creating ceph auth key "client.rgw.ocs.storagecluster.cephobjectstore.a"
2021-04-15 20:21:07.519568 I | op-mds: deployment for mds rook-ceph-mds-ocs-storagecluster-cephfilesystem-b already exists. updating if needed
2021-04-15 20:21:07.528440 I | op-k8sutil: deployment "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b" did not change, nothing to update
2021-04-15 20:21:07.677862 I | ceph-object-controller: object store "ocs-storagecluster-cephobjectstore" deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a" started
2021-04-15 20:21:07.695213 I | ceph-object-controller: object store "ocs-storagecluster-cephobjectstore" deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a" already exists. updating if needed
2021-04-15 20:21:07.701912 I | op-k8sutil: deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a" did not change, nothing to update
2021-04-15 20:21:07.706015 I | ceph-object-controller: config map for object pool ocs-storagecluster-cephobjectstore already exists, not overwriting
2021-04-15 20:21:07.706030 I | cephclient: getting or creating ceph auth key "client.rgw.ocs.storagecluster.cephobjectstore.b"
2021-04-15 20:21:08.018744 I | ceph-object-controller: object store "ocs-storagecluster-cephobjectstore" deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b" started
2021-04-15 20:21:08.037712 I | ceph-object-controller: object store "ocs-storagecluster-cephobjectstore" deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b" already exists. updating if needed
2021-04-15 20:21:08.082079 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:08.125761 I | op-k8sutil: deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b" did not change, nothing to update
2021-04-15 20:21:08.129802 I | ceph-object-controller: config map for object pool ocs-storagecluster-cephobjectstore already exists, not overwriting
2021-04-15 20:21:09.002703 I | ceph-object-controller: created object store "ocs-storagecluster-cephobjectstore" in namespace "openshift-storage"
2021-04-15 20:21:09.002732 I | ceph-object-controller: starting rgw healthcheck
2021-04-15 20:21:09.329544 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:09.909850 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:11.542374 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:24.926847 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:25.509704 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
^C
[root@e1n1 ~]# oc get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 7d17h
rook-ceph-mon-pdb 2 N/A 1 7d17h
rook-ceph-osd-rack-rack1 N/A 0 0 82m
rook-ceph-osd-rack-rack2 N/A 0 0 82m

Removed OSD 9 from the cluster. Rook tries to redeploy, but the prepare pod stays in Pending:

# oc get pods -n openshift-storage -o wide | grep rook-ceph-osd
rook-ceph-osd-0-7596f7cf56-5sp9s 1/1 Running 0 27h 9.254.6.17 e6n1.fbond <none> <none>
rook-ceph-osd-1-7494684958-hfdst 1/1 Running 0 27h 9.254.6.13 e6n1.fbond <none> <none>
rook-ceph-osd-10-7c8b6b9475-nmnzx 1/1 Running 0 7d17h 9.254.10.15 e5n1.fbond <none> <none>
rook-ceph-osd-11-77dbfc8f65-zqq82 1/1 Running 0 7d17h 9.254.8.17 e4n1.fbond <none> <none>
rook-ceph-osd-12-5576cc5d6d-2t482 1/1 Running 0 6d4h 9.254.10.94 e5n1.fbond <none> <none>
rook-ceph-osd-13-74bcddf45c-m7qs8 1/1 Running 0 7d17h 9.254.8.14 e4n1.fbond <none> <none>
rook-ceph-osd-14-5c574f99dd-vf4bs 1/1 Running 0 2d3h 9.254.10.149 e5n1.fbond <none> <none>
rook-ceph-osd-15-74595cb9d8-tkm7j 1/1 Running 0 7d17h 9.254.8.9 e4n1.fbond <none> <none>
rook-ceph-osd-16-7c74b48b95-82t44 1/1 Running 0 7d17h 9.254.8.13 e4n1.fbond <none> <none>
rook-ceph-osd-17-7d6b994766-wmwsj 1/1 Running 0 7d17h 9.254.10.7 e5n1.fbond <none> <none>
rook-ceph-osd-18-84d8f79d9-7pjcq 1/1 Running 0 7d17h 9.254.8.12 e4n1.fbond <none> <none>
rook-ceph-osd-19-6954d494-4pjgq 1/1 Running 0 6d1h 9.254.10.128 e5n1.fbond <none> <none>
rook-ceph-osd-2-75f67cf787-khxcw 1/1 Running 0 27h 9.254.6.2 e6n1.fbond <none> <none>
rook-ceph-osd-20-784595c559-95ck4 1/1 Running 0 7d17h 9.254.10.14 e5n1.fbond <none> <none>
rook-ceph-osd-21-6d86fc9cbd-hhqcj 1/1 Running 0 7d17h 9.254.10.9 e5n1.fbond <none> <none>
rook-ceph-osd-22-68f59c745b-xsqjn 1/1 Running 0 7d17h 9.254.8.16 e4n1.fbond <none> <none>
rook-ceph-osd-23-668958c697-68m4j 1/1 Running 0 7d 9.254.10.73 e5n1.fbond <none> <none>
rook-ceph-osd-3-5bf4dc7f9-kgc26 1/1 Running 0 27h 9.254.6.16 e6n1.fbond <none> <none>
rook-ceph-osd-4-5dd99fcddd-l5trd 1/1 Running 0 27h 9.254.6.4 e6n1.fbond <none> <none>
rook-ceph-osd-5-6d5c7bbcbb-xkcrb 1/1 Running 0 27h 9.254.6.15 e6n1.fbond <none> <none>
rook-ceph-osd-6-64b749d5bd-cmh7l 1/1 Running 0 27h 9.254.6.3 e6n1.fbond <none> <none>
rook-ceph-osd-7-bdfff75c9-7pxbt 1/1 Running 0 27h 9.254.6.14 e6n1.fbond <none> <none>
rook-ceph-osd-8-5959d87cb8-rt7mk 1/1 Running 0 7d17h 9.254.8.11 e4n1.fbond <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-12-xb2wp-z5lq4 0/1 Completed 0 6d4h 9.254.10.92 e5n1.fbond <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-14-vl89q-9pbbt 0/1 Completed 0 2d3h 9.254.10.147 e5n1.fbond <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-19-fcck4-t5qvz 0/1 Completed 0 6d1h 9.254.10.126 e5n1.fbond <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-23-xv6c4-m8cgc 0/1 Completed 0 7d 9.254.10.71 e5n1.fbond <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-8-kjj2r-m6tsx 0/1 Pending 0 17m <none> <none> <none> <none>

However, PDBs are now back to normal:

# oc get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 7d18h
rook-ceph-mon-pdb 2 N/A 1 7d18h
rook-ceph-osd N/A 1 1 2m38s

Another OSD or node can now fail anywhere; the data is fully protected.

Expected results:

Additional info:

Annette and I had a call with Travis, who suggested that the PDB controller get a new code path to handle a node failure and a single OSD failure differently. When a single OSD failure is detected, we would simply rely on the ceph auto-out interval so that the cluster can rebalance in a timely manner.