Bug 1950419

Summary: [RFE] Change PDB Controller behavior for single OSD failure caused by failed drive
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Jean-Charles Lopez <jelopez>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: CLOSED ERRATA
QA Contact: krishnaram Karthick <kramdoss>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.7
CC: aclewett, etamir, jelopez, kramdoss, madam, muagarwa, nberry, ocs-bugs, olakra, ratamir, sapillai, tnielsen
Target Milestone: ---
Keywords: FutureFeature
Target Release: OCS 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
.Prevent adding the `noout` flag on the failure domain when an OSD is down for reasons other than node drain
Previously, when an OSD was down due to a disk failure, a `noout` flag was added on the failure domain. This prevented the OSD from being marked out by the standard ceph `mon_osd_down_out_interval` setting. With this update, when an OSD is down for a reason other than node drain (for example, a disk failure) and the PGs are unhealthy, rook creates blocking PodDisruptionBudgets on the other failure domains to prevent further node drains on them; the `noout` flag is not set in this case. If the OSD is down but all the PGs are `active+clean`, the cluster is treated as fully healthy: the default PodDisruptionBudget (with maxUnavailable=1) is added back and the blocking ones are deleted.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-03 18:15:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jean-Charles Lopez 2021-04-16 15:41:57 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
If a single drive fails, rook generates a temporary PDB to mark the node as draining. In doing so, it also sets the noout flag on the rack containing the node where the OSD runs, preventing the OSD from being marked out using the standard ceph mon_osd_down_out_interval value, in favor of a timer based condition.
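To illustrate why the rack-level noout flag is a problem here, this is a hypothetical sketch of ceph's auto-out decision (illustrative names only, not actual ceph code): with noout set on the failure domain, the mon never marks the down OSD out, so rebalancing never starts on its own.

```python
# Hypothetical model of ceph's auto-out behavior; not real ceph code.
MON_OSD_DOWN_OUT_INTERVAL = 600  # ceph default: mark a down OSD "out" after 10 minutes

def should_mark_out(seconds_down: float, noout_on_failure_domain: bool) -> bool:
    """Return True if the mon would auto-mark a down OSD out."""
    if noout_on_failure_domain:
        # The noout flag set by the PDB controller suppresses auto-out
        # entirely, so recovery waits until the flag is removed.
        return False
    return seconds_down >= MON_OSD_DOWN_OUT_INTERVAL
```

Without noout the OSD would be marked out after 10 minutes and backfill would begin; with noout set on the rack, it is never marked out, which is the behavior this RFE asks to change for plain disk failures.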

Version of all relevant components (if applicable):
OCS 4.7

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, as this would impact the behavior of the cluster if another node failed in a different availability zone.

Is there any workaround available to the best of your knowledge?
Yes. Remove the OSD from the cluster once it is marked out and the cluster has rebalanced. It took over 30 minutes to reach this condition on an almost empty cluster. Without doing this, the temporary PDB remains forever.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Deploy a cluster
2. Pull one physical drive or destroy one drive if in cloud to force OSD to crash
3. Observe PDBs


Actual results:
OSD 9 is the failed OSD as we pulled the drive out of the chassis.

HEALTH_WARN 1 osds down; Degraded data redundancy: 1339/34677 objects degraded (3.861%), 135 pgs degraded, 140 pgs undersized; 1 daemons have recently crashed
OSD_DOWN 1 osds down
    osd.9 (root=default,rack=rack0,host=e4n1-fbond) is down
PG_DEGRADED Degraded data redundancy: 1339/34677 objects degraded (3.861%), 135 pgs degraded, 140 pgs undersized


After nearly 40 minutes (iirc), here is the status of the cluster:

  cluster:
    id:     81670748-ad37-4e3c-b0d2-b1bb47eb76f9
    health: HEALTH_WARN
            1 daemons have recently crashed
  services:
    mon: 3 daemons, quorum a,b,c (age 26h)
    mgr: a(active, since 7d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 24 osds: 23 up (since 46m), 23 in (since 5m)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
  data:
    pools:   10 pools, 1136 pgs
    objects: 11.57k objects, 22 GiB
    usage:   86 GiB used, 83 TiB / 84 TiB avail
    pgs:     1136 active+clean
  io:
    client:   1.6 KiB/s rd, 286 KiB/s wr, 2 op/s rd, 20 op/s wr

Rook operator log

2021-04-15 20:21:07.173323 I | op-k8sutil: deployment "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a" did not change, nothing to update
2021-04-15 20:21:07.173348 I | cephclient: getting or creating ceph auth key "mds.ocs-storagecluster-cephfilesystem-b"
2021-04-15 20:21:07.360541 I | ceph-object-controller: Multisite for object-store: realm=ocs-storagecluster-cephobjectstore, zonegroup=ocs-storagecluster-cephobjectstore, zone=ocs-storagecluster-cephobjectstore
2021-04-15 20:21:07.360576 I | ceph-object-controller: multisite configuration for object-store ocs-storagecluster-cephobjectstore is complete
2021-04-15 20:21:07.360591 I | ceph-object-controller: creating object store "ocs-storagecluster-cephobjectstore" in namespace "openshift-storage"
2021-04-15 20:21:07.360621 I | cephclient: getting or creating ceph auth key "client.rgw.ocs.storagecluster.cephobjectstore.a"
2021-04-15 20:21:07.519568 I | op-mds: deployment for mds rook-ceph-mds-ocs-storagecluster-cephfilesystem-b already exists. updating if needed
2021-04-15 20:21:07.528440 I | op-k8sutil: deployment "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b" did not change, nothing to update
2021-04-15 20:21:07.677862 I | ceph-object-controller: object store "ocs-storagecluster-cephobjectstore" deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a" started
2021-04-15 20:21:07.695213 I | ceph-object-controller: object store "ocs-storagecluster-cephobjectstore" deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a" already exists. updating if needed
2021-04-15 20:21:07.701912 I | op-k8sutil: deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a" did not change, nothing to update
2021-04-15 20:21:07.706015 I | ceph-object-controller: config map for object pool ocs-storagecluster-cephobjectstore already exists, not overwriting
2021-04-15 20:21:07.706030 I | cephclient: getting or creating ceph auth key "client.rgw.ocs.storagecluster.cephobjectstore.b"
2021-04-15 20:21:08.018744 I | ceph-object-controller: object store "ocs-storagecluster-cephobjectstore" deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b" started
2021-04-15 20:21:08.037712 I | ceph-object-controller: object store "ocs-storagecluster-cephobjectstore" deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b" already exists. updating if needed
2021-04-15 20:21:08.082079 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:08.125761 I | op-k8sutil: deployment "rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b" did not change, nothing to update
2021-04-15 20:21:08.129802 I | ceph-object-controller: config map for object pool ocs-storagecluster-cephobjectstore already exists, not overwriting
2021-04-15 20:21:09.002703 I | ceph-object-controller: created object store "ocs-storagecluster-cephobjectstore" in namespace "openshift-storage"
2021-04-15 20:21:09.002732 I | ceph-object-controller: starting rgw healthcheck
2021-04-15 20:21:09.329544 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:09.909850 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:11.542374 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:24.926847 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
2021-04-15 20:21:25.509704 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. currently draining: "rack0". pg health: "all PGs in cluster are clean"
^C
[root@e1n1 ~]# oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     7d17h
rook-ceph-mon-pdb                                 2               N/A               1                     7d17h
rook-ceph-osd-rack-rack1                          N/A             0                 0                     82m
rook-ceph-osd-rack-rack2                          N/A             0                 0                     82m

Removed OSD 9 from the cluster

Rook tries to redeploy the OSD, but the prepare pod stays in Pending:

# oc get pods -n openshift-storage -o wide | grep rook-ceph-osd
rook-ceph-osd-0-7596f7cf56-5sp9s                                  1/1     Running     0          27h     9.254.6.17     e6n1.fbond   <none>           <none>
rook-ceph-osd-1-7494684958-hfdst                                  1/1     Running     0          27h     9.254.6.13     e6n1.fbond   <none>           <none>
rook-ceph-osd-10-7c8b6b9475-nmnzx                                 1/1     Running     0          7d17h   9.254.10.15    e5n1.fbond   <none>           <none>
rook-ceph-osd-11-77dbfc8f65-zqq82                                 1/1     Running     0          7d17h   9.254.8.17     e4n1.fbond   <none>           <none>
rook-ceph-osd-12-5576cc5d6d-2t482                                 1/1     Running     0          6d4h    9.254.10.94    e5n1.fbond   <none>           <none>
rook-ceph-osd-13-74bcddf45c-m7qs8                                 1/1     Running     0          7d17h   9.254.8.14     e4n1.fbond   <none>           <none>
rook-ceph-osd-14-5c574f99dd-vf4bs                                 1/1     Running     0          2d3h    9.254.10.149   e5n1.fbond   <none>           <none>
rook-ceph-osd-15-74595cb9d8-tkm7j                                 1/1     Running     0          7d17h   9.254.8.9      e4n1.fbond   <none>           <none>
rook-ceph-osd-16-7c74b48b95-82t44                                 1/1     Running     0          7d17h   9.254.8.13     e4n1.fbond   <none>           <none>
rook-ceph-osd-17-7d6b994766-wmwsj                                 1/1     Running     0          7d17h   9.254.10.7     e5n1.fbond   <none>           <none>
rook-ceph-osd-18-84d8f79d9-7pjcq                                  1/1     Running     0          7d17h   9.254.8.12     e4n1.fbond   <none>           <none>
rook-ceph-osd-19-6954d494-4pjgq                                   1/1     Running     0          6d1h    9.254.10.128   e5n1.fbond   <none>           <none>
rook-ceph-osd-2-75f67cf787-khxcw                                  1/1     Running     0          27h     9.254.6.2      e6n1.fbond   <none>           <none>
rook-ceph-osd-20-784595c559-95ck4                                 1/1     Running     0          7d17h   9.254.10.14    e5n1.fbond   <none>           <none>
rook-ceph-osd-21-6d86fc9cbd-hhqcj                                 1/1     Running     0          7d17h   9.254.10.9     e5n1.fbond   <none>           <none>
rook-ceph-osd-22-68f59c745b-xsqjn                                 1/1     Running     0          7d17h   9.254.8.16     e4n1.fbond   <none>           <none>
rook-ceph-osd-23-668958c697-68m4j                                 1/1     Running     0          7d      9.254.10.73    e5n1.fbond   <none>           <none>
rook-ceph-osd-3-5bf4dc7f9-kgc26                                   1/1     Running     0          27h     9.254.6.16     e6n1.fbond   <none>           <none>
rook-ceph-osd-4-5dd99fcddd-l5trd                                  1/1     Running     0          27h     9.254.6.4      e6n1.fbond   <none>           <none>
rook-ceph-osd-5-6d5c7bbcbb-xkcrb                                  1/1     Running     0          27h     9.254.6.15     e6n1.fbond   <none>           <none>
rook-ceph-osd-6-64b749d5bd-cmh7l                                  1/1     Running     0          27h     9.254.6.3      e6n1.fbond   <none>           <none>
rook-ceph-osd-7-bdfff75c9-7pxbt                                   1/1     Running     0          27h     9.254.6.14     e6n1.fbond   <none>           <none>
rook-ceph-osd-8-5959d87cb8-rt7mk                                  1/1     Running     0          7d17h   9.254.8.11     e4n1.fbond   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-12-xb2wp-z5lq4         0/1     Completed   0          6d4h    9.254.10.92    e5n1.fbond   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-14-vl89q-9pbbt         0/1     Completed   0          2d3h    9.254.10.147   e5n1.fbond   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-19-fcck4-t5qvz         0/1     Completed   0          6d1h    9.254.10.126   e5n1.fbond   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-23-xv6c4-m8cgc         0/1     Completed   0          7d      9.254.10.71    e5n1.fbond   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-8-kjj2r-m6tsx          0/1     Pending     0          17m     <none>         <none>       <none>           <none>

However, the PDBs are now back to normal:
# oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     7d18h
rook-ceph-mon-pdb                                 2               N/A               1                     7d18h
rook-ceph-osd                                     N/A             1                 1                     2m38s

Another OSD or node can now fail anywhere; the data is fully protected.

Expected results:


Additional info:

Annette and I had a call with Travis, who suggested adding a new code path to the PDB controller so that a node failure and a single OSD failure are handled differently. When a single OSD failure is detected, we would simply rely on the ceph auto-out interval so that the cluster can rebalance in a timely manner.
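The suggested code path, combined with the behavior described in the Doc Text, could be sketched roughly as follows. This is a hypothetical Python model of the decision, not rook's actual Go implementation; the inputs (`node_is_drained`, `pgs_active_clean`) and the returned actions are assumptions for illustration.

```python
def reconcile_osd_down(node_is_drained: bool, pgs_active_clean: bool) -> dict:
    """Sketch of what the PDB controller could do for a down OSD; not rook's real API."""
    if node_is_drained:
        # Genuine node drain: protect the draining failure domain with noout
        # and block drains on the other failure domains.
        return {"set_noout": True, "blocking_pdbs": True}
    if not pgs_active_clean:
        # Disk failure with unhealthy PGs: still block further drains elsewhere,
        # but leave noout unset so mon_osd_down_out_interval can mark the OSD out.
        return {"set_noout": False, "blocking_pdbs": True}
    # Disk failure but all PGs active+clean: treat the cluster as healthy and
    # keep only the default PDB (maxUnavailable=1).
    return {"set_noout": False, "blocking_pdbs": False}
```

The key difference from the OCS 4.7 behavior is the first branch: only an actual drain sets noout, so a plain disk failure falls through to ceph's normal auto-out timer.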

Comment 7 Travis Nielsen 2021-05-11 14:44:15 UTC
Just needs to be backported downstream

Comment 8 Travis Nielsen 2021-05-17 15:43:06 UTC
This is in the latest resync to release-4.8

Comment 20 Mudit Agarwal 2021-07-12 06:00:26 UTC
LGTM, thanks

Comment 21 krishnaram Karthick 2021-07-13 13:53:24 UTC
Test 1:
========
3 zones, 1 node per zone and 2 OSDs per node; Fail OSD1 from node 1

Behavior seen:
==========
no blocking pdbs created
pgs are active+clean; noout flag wasn't set (as expected)


output:
========
oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-450.ci   OpenShift Container Storage   4.8.0-450.ci              Succeeded


oc rsh rook-ceph-tools-64d88c9b9f-jbmts ceph -s
  cluster:
    id:     1c7cc447-0457-4b99-908a-2eb8446b640b
    health: HEALTH_WARN
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 6 osds: 5 up (since 66m), 5 in (since 56m)
 
  data:
    pools:   3 pools, 288 pgs
    objects: 556 objects, 1.4 GiB
    usage:   8.4 GiB used, 2.5 TiB / 2.5 TiB avail
    pgs:     288 active+clean


oc rsh rook-ceph-tools-64d88c9b9f-jbmts ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                                    STATUS REWEIGHT PRI-AFF 
 -1       3.00000 root default                                                         
 -5       3.00000     region us-east-2                                                 
 -4       1.00000         zone us-east-2a                                              
 -3       0.50000             host ocs-deviceset-1-data-0wh4nf                         
  0   ssd 0.50000                 osd.0                          down        0 1.00000 
-17       0.50000             host ocs-deviceset-2-data-1prdmr                         
  3   ssd 0.50000                 osd.3                            up  1.00000 1.00000 
-10       1.00000         zone us-east-2b                                              
 -9       0.50000             host ocs-deviceset-0-data-06khrt                         
  2   ssd 0.50000                 osd.2                            up  1.00000 1.00000 
-21       0.50000             host ocs-deviceset-0-data-1cbb5n                         
  5   ssd 0.50000                 osd.5                            up  1.00000 1.00000 
-14       1.00000         zone us-east-2c                                              
-19       0.50000             host ocs-deviceset-1-data-1p9hsj                         
  4   ssd 0.50000                 osd.4                            up  1.00000 1.00000 
-13       0.50000             host ocs-deviceset-2-data-07npwz                         
  1   ssd 0.50000                 osd.1                            up  1.00000 1.00000 


oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     3h
rook-ceph-mon-pdb                                 N/A             1                 1                     178m
rook-ceph-osd                                     N/A             1                 0                     76m

Comment 22 krishnaram Karthick 2021-07-13 15:05:22 UTC
Test 2:
========
3 zones, 1 node per zone and 2 OSDs per node; Fail OSD1 and OSD2 from node 1 
i.e., fail all OSDs on a zone

Behavior seen:
==============
blocking pdbs created
pgs are not active+clean
drain on other node was blocked


output:
========

oc rsh rook-ceph-tools-64d88c9b9f-jbmts ceph -s
  cluster:
    id:     1c7cc447-0457-4b99-908a-2eb8446b640b
    health: HEALTH_WARN
            1 osds down
            2 hosts (2 osds) down
            1 zone (2 osds) down
            Degraded data redundancy: 900/2700 objects degraded (33.333%), 144 pgs degraded, 288 pgs undersized
            2 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 5h)
    mgr: a(active, since 5h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 6 osds: 4 up (since 46m), 5 in (since 3h)
 
  data:
    pools:   3 pools, 288 pgs
    objects: 900 objects, 2.7 GiB
    usage:   12 GiB used, 2.5 TiB / 2.5 TiB avail
    pgs:     900/2700 objects degraded (33.333%)
             144 active+undersized+degraded
             144 active+undersized
 
  io:
    client:   1.2 KiB/s rd, 109 KiB/s wr, 2 op/s rd, 2 op/s wr


oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     5h3m
rook-ceph-mon-pdb                                 N/A             1                 1                     5h2m
rook-ceph-osd-zone-us-east-2b                     N/A             0                 0                     46m
rook-ceph-osd-zone-us-east-2c                     N/A             0                 0                     46m


$ oc adm drain ip-10-0-183-99.us-east-2.compute.internal
node/ip-10-0-183-99.us-east-2.compute.internal cordoned
error: unable to drain node "ip-10-0-183-99.us-east-2.compute.internal", aborting command...

There are pending nodes to be drained:
 ip-10-0-183-99.us-east-2.compute.internal
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-rhvnv, openshift-cluster-node-tuning-operator/tuned-8mhdc, openshift-dns/dns-default-zrbvr, openshift-dns/node-resolver-dp489, openshift-image-registry/node-ca-rptwl, openshift-ingress-canary/ingress-canary-jnzxz, openshift-machine-config-operator/machine-config-daemon-7zktc, openshift-monitoring/node-exporter-hvlg5, openshift-multus/multus-additional-cni-plugins-5bvcg, openshift-multus/multus-ngk2d, openshift-multus/network-metrics-daemon-nwmp4, openshift-network-diagnostics/network-check-target-7nqcn, openshift-sdn/sdn-fk4f9, openshift-storage/csi-cephfsplugin-76zkm, openshift-storage/csi-rbdplugin-vkgjl
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-image-registry/image-registry-54c4758b4d-lst42, openshift-monitoring/prometheus-adapter-dcc8d9658-pk9ph, openshift-monitoring/prometheus-k8s-0, openshift-storage/csi-cephfsplugin-provisioner-78d7667cb8-dxmfd, openshift-storage/rook-ceph-mgr-a-85bbdf4f54-tdpt4, openshift-storage/rook-ceph-osd-2-599cbff5c-pp6dq, openshift-storage/rook-ceph-osd-5-77f99cb98-772v4
cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): openshift-marketplace/ocs-catalogsource-pwps9

snippet of rook-operator logs:
2021-07-13 15:01:50.860692 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2021-07-13 15:01:50.860778 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2021-07-13 15:01:51.195883 I | clusterdisruption-controller: osd is down in failure domain "us-east-2a" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:144} {StateName:active+undersized Count:144}]"

Comment 24 krishnaram Karthick 2021-07-15 16:31:28 UTC
Test 3:

Failed one osd from zone 1 node 1; no pdb created
Failed one more osd from zone 1 node 2; blocking pdbs created
Waited for pgs to be active+clean; blocking pdbs were removed
No noout flags were set

oc rsh rook-ceph-tools-64d88c9b9f-bh4db ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                                    STATUS REWEIGHT PRI-AFF 
 -1       6.00000 root default                                                         
 -5       6.00000     region us-east-2                                                 
 -4       2.00000         zone us-east-2a                                              
 -3       0.50000             host ocs-deviceset-0-data-09mvqv                         
  0   ssd 0.50000                 osd.0                          down  1.00000 1.00000 
-19       0.50000             host ocs-deviceset-1-data-1q5d87                         
  4   ssd 0.50000                 osd.4                            up  1.00000 1.00000 
-31       0.50000             host ocs-deviceset-1-data-3mvqbh                         
 10   ssd 0.50000                 osd.10                           up  1.00000 1.00000 
-27       0.50000             host ocs-deviceset-2-data-25q882                         
  8   ssd 0.50000                 osd.8                          down  1.00000 1.00000 
-10       2.00000         zone us-east-2b                                              
-23       0.50000             host ocs-deviceset-0-data-2c2r4s                         
  6   ssd 0.50000                 osd.6                            up  1.00000 1.00000 
-29       0.50000             host ocs-deviceset-0-data-3pcq88                         
  9   ssd 0.50000                 osd.9                            up  1.00000 1.00000 
 -9       0.50000             host ocs-deviceset-2-data-0j95zf                         
  1   ssd 0.50000                 osd.1                            up  1.00000 1.00000 
-17       0.50000             host ocs-deviceset-2-data-16cbbl                         
  5   ssd 0.50000                 osd.5                            up  1.00000 1.00000 
-14       2.00000         zone us-east-2c                                              
-21       0.50000             host ocs-deviceset-0-data-1r4dqw                         
  3   ssd 0.50000                 osd.3                            up  1.00000 1.00000 
-13       0.50000             host ocs-deviceset-1-data-0zcbfz                         
  2   ssd 0.50000                 osd.2                            up  1.00000 1.00000 
-25       0.50000             host ocs-deviceset-1-data-26qd87                         
  7   ssd 0.50000                 osd.7                            up  1.00000 1.00000 
-33       0.50000             host ocs-deviceset-2-data-3cr9f9                         
 11   ssd 0.50000                 osd.11                           up  1.00000 1.00000 


oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     6h49m
rook-ceph-mon-pdb                                 N/A             1                 1                     6h47m
rook-ceph-osd-zone-us-east-2b                     N/A             0                 0                     2m32s
rook-ceph-osd-zone-us-east-2c                     N/A             0                 0                     2m32s

drain on a node in zone 2 failed:
oc adm drain ip-10-0-180-242.us-east-2.compute.internal
node/ip-10-0-180-242.us-east-2.compute.internal cordoned
error: unable to drain node "ip-10-0-180-242.us-east-2.compute.internal", aborting command...

There are pending nodes to be drained:
 ip-10-0-180-242.us-east-2.compute.internal
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-bpmpd, openshift-cluster-node-tuning-operator/tuned-25trr, openshift-dns/dns-default-zbj5v, openshift-dns/node-resolver-psdkz, openshift-image-registry/node-ca-cmq2b, openshift-ingress-canary/ingress-canary-tgb5j, openshift-machine-config-operator/machine-config-daemon-6sfhb, openshift-monitoring/node-exporter-6f5jp, openshift-multus/multus-additional-cni-plugins-rbrnj, openshift-multus/multus-zqgqx, openshift-multus/network-metrics-daemon-zvc9z, openshift-network-diagnostics/network-check-target-sxjc7, openshift-sdn/sdn-t759j, openshift-storage/csi-cephfsplugin-np746, openshift-storage/csi-rbdplugin-mqwtw
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-storage/rook-ceph-osd-6-8649977c5-d26k4, openshift-storage/rook-ceph-osd-9-65b8f95c55-6smrz
[krishnaramkarthickramdoss@localhost ~]$ oc adm uncordon ip-10-0-180-242.us-east-2.compute.internal


oc rsh rook-ceph-tools-64d88c9b9f-bh4db ceph -s
  cluster:
    id:     3f490417-f3ff-40c8-88ee-df83300233a3
    health: HEALTH_WARN
            2 osds down
            2 hosts (2 osds) down
            Degraded data redundancy: 2209/12840 objects degraded (17.204%), 76 pgs degraded, 151 pgs undersized
 
  services:
    mon: 3 daemons, quorum a,b,c (age 6h)
    mgr: a(active, since 6h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 12 osds: 10 up (since 2m), 12 in (since 2h)
 
  data:
    pools:   3 pools, 288 pgs
    objects: 4.28k objects, 16 GiB
    usage:   57 GiB used, 5.9 TiB / 6 TiB avail
    pgs:     2209/12840 objects degraded (17.204%)
             137 active+clean
             76  active+undersized+degraded
             75  active+undersized
 
  io:
    client:   3.2 KiB/s rd, 154 KiB/s wr, 2 op/s rd, 1 op/s wr

Comment 25 Santosh Pillai 2021-07-16 04:23:44 UTC
(In reply to krishnaram Karthick from comment #21)
> Test 1:
> ========
> 3 zones, 1 node per zone and 2 OSDs per node; Fail OSD1 from node 1
> 
> Behavior seen:
> ==========
> no blocking pdbs created

So `Test 1` and `Test 3` suggest that blocking pdbs are not created after failing the first OSD. Do the PGs never go into a degraded state after failing the first disk?


When a disk (OSD) fails, ceph takes a few seconds to recognise it; it won't be instantaneous. So the PGs will be in a degraded state, and at that time blocking pdbs should be created. Yes, the `noout` flag won't be set. And once the PGs are active+clean, the blocking pdbs should be deleted. 


> pgs are active+clean; no-out flag wasn't set (as expected)
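The lifecycle described in comment 25 can be sketched as a tiny event replay (a hypothetical simulation for illustration, not rook code): blocking PDBs should appear for the transient degraded window after a disk failure and disappear once the PGs return to active+clean.

```python
def pdb_state(events):
    """Replay (osd_down, pgs_clean) observations and report whether blocking
    PDBs exist after each reconcile. Purely illustrative."""
    states = []
    blocking = False
    for osd_down, pgs_clean in events:
        if osd_down and not pgs_clean:
            blocking = True   # degraded window: create blocking PDBs
        elif pgs_clean:
            blocking = False  # active+clean: delete them, restore the default PDB
        states.append(blocking)
    return states

# disk fails -> PGs degraded -> blocking PDBs; data recovers -> removed again
timeline = [(False, True), (True, False), (True, False), (True, True)]
# pdb_state(timeline) -> [False, True, True, False]
```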

Comment 26 krishnaram Karthick 2021-07-19 05:19:28 UTC
(In reply to Santosh Pillai from comment #25)
> (In reply to krishnaram Karthick from comment #21)
> > Test 1:
> > ========
> > 3 zones, 1 node per zone and 2 OSDs per node; Fail OSD1 from node 1
> > 
> > Behavior seen:
> > ==========
> > no blocking pdbs created
> 
> So `Test 1` and `Test 3` are suggesting that blocking pdbs are not created
> after failing the first OSD. Do PGs never go in degraded state after failing
> the first disk?

I don't see PDBs getting created after the first OSD failure and PGs do go into the degraded state.
Do we see this behavior because there were 2 OSDs in all of the tests run? 

> 
> 
> When a disk (OSD) is failed and ceph takes few seconds to recognise that. It
> won't be instantaneous. So PGs will be in degraded state and at that time
> blocking pdbs should be created. Yes, `no-out` flag won't be set. And once
> the PGs are active+clean, the blocking pdbs should be deleted. 
> 
> 
> > pgs are active+clean; no-out flag wasn't set (as expected)

Comment 27 Santosh Pillai 2021-07-19 05:31:30 UTC
(In reply to krishnaram Karthick from comment #26)
> (In reply to Santosh Pillai from comment #25)
> > (In reply to krishnaram Karthick from comment #21)
> > > Test 1:
> > > ========
> > > 3 zones, 1 node per zone and 2 OSDs per node; Fail OSD1 from node 1
> > > 
> > > Behavior seen:
> > > ==========
> > > no blocking pdbs created
> > 
> > So `Test 1` and `Test 3` are suggesting that blocking pdbs are not created
> > after failing the first OSD. Do PGs never go in degraded state after failing
> > the first disk?
> 
> I don't see PDBs getting created after the first OSD failure and PGs do go
> into the degraded state.

Can you provide the rook operator logs of this state (one OSD is down and PGs are degraded and blocking PDBs are not getting created). 
And maybe the cluster setup in this state as well, if possible. 

> Do we see this behavior because there were 2 OSDs in all of the tests run? 
I've only tested with single OSD on each node in a three node cluster. 

> > 
> > When a disk (OSD) is failed and ceph takes few seconds to recognise that. It
> > won't be instantaneous. So PGs will be in degraded state and at that time
> > blocking pdbs should be created. Yes, `no-out` flag won't be set. And once
> > the PGs are active+clean, the blocking pdbs should be deleted. 
> > 
> > 
> > > pgs are active+clean; no-out flag wasn't set (as expected)

Comment 28 krishnaram Karthick 2021-07-21 08:10:40 UTC
I tried another test with 3 zones, 1 node per zone and 1 osd per node.
Failed one OSD; pgs were degraded; no blocking pdbs were seen

Attaching operator logs as requested & providing the cluster details to Santosh in private chat. 

oc rsh rook-ceph-tools-bd9b467...  localhost.localdomain: Wed Jul 21 13:06:17 2021

  cluster:
    id:     44805ffa-79f4-4349-9913-9b2e0dc52bfa
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            1 zone (1 osds) down
            Degraded data redundancy: 4166/12498 objects degraded (33.333%), 47 pgs degraded, 96 pgs undersized
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 95m)
    mgr: a(active, since 94m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 2 up (since 13m), 3 in (since 94m)

  data:
    pools:   3 pools, 96 pgs
    objects: 4.17k objects, 15 GiB



oc get pdb                         localhost.localdomain: Wed Jul 21 13:09:09 2021

NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     99m
rook-ceph-mon-pdb                                 N/A             1                 1                     97m
rook-ceph-osd                                     N/A             1                 0                     98m

Comment 30 krishnaram Karthick 2021-07-21 11:12:53 UTC
Raised a new bug for comment#28 - https://bugzilla.redhat.com/show_bug.cgi?id=1984396

Comment 31 krishnaram Karthick 2021-07-21 11:21:46 UTC
Moving the RFE to verified based on the tests I ran above. 
A new bug has been raised for the issue discussed in comment#28

Comment 33 errata-xmlrpc 2021-08-03 18:15:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003