Bug 1984396
| Summary: | Failing the only OSD of a node on a 3 node cluster doesn't create blocking PDBs | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | krishnaram Karthick <kramdoss> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED ERRATA | QA Contact: | Anna Sandler <asandler> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | jelopez, madam, muagarwa, ocs-bugs, odf-bz-bot, sapillai, sostapov, tnielsen |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | ODF 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.9.0-158.ci | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-12-13 17:44:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
krishnaram Karthick
2021-07-21 11:11:00 UTC
Santosh, can you take a look?

I'll look into it this week.

How's it looking?

Tested it with Rook on a 3-node minikube cluster by deleting the disk from VirtualBox. Observed the following:
1. The OSD pod (for which the disk was removed) went into CrashLoopBackOff (CLBO) state:
```
oc get pods -n rook-ceph -o wide | grep osd
rook-ceph-osd-0-77b7459f77-l27r2 0/1 CrashLoopBackOff 5 19m 10.244.2.9 minikube-m03 <none> <none>
rook-ceph-osd-1-749b5fbd74-5gg47 1/1 Running 0 19m 10.244.3.8 minikube-m04 <none> <none>
rook-ceph-osd-2-85984c996d-6rxht 1/1 Running 0 19m 10.244.1.10 minikube-m02 <none> <none>
rook-ceph-osd-prepare-minikube-m02-gtxrs 0/1 Completed 0 20m 10.244.1.9 minikube-m02 <none> <none>
rook-ceph-osd-prepare-minikube-m03-5lr6b 0/1 Completed 0 20m 10.244.2.7 minikube-m03 <none> <none>
rook-ceph-osd-prepare-minikube-m04-29hhd 0/1 Completed 0 20m 10.244.3.7 minikube-m04 <none> <none>
```
2. Ceph status was degraded:
```
Every 2.0s: ceph status rook-ceph-tools-78cdfd976c-dmj98: Tue Sep 7 08:39:04 2021
cluster:
id: cff26850-1cfc-4542-8bed-bb19c42523e9
health: HEALTH_WARN
1 osds down
1 host (1 osds) down
Degraded data redundancy: 224/672 objects degraded (33.333%), 34 pgs degraded, 81 pgs undersized
1 daemons have recently crashed
services:
mon: 3 daemons, quorum a,b,c (age 21m)
mgr: a(active, since 20m)
osd: 3 osds: 2 up (since 4m), 3 in (since 21m)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
pools: 8 pools, 81 pgs
objects: 224 objects, 9.8 KiB
usage: 34 MiB used, 30 GiB / 30 GiB avail
pgs: 224/672 objects degraded (33.333%)
47 active+undersized
34 active+undersized+degraded
```
3. Blocking PDBs were created successfully on the other failure domains (nodes):
```
Every 2.0s: oc get pdb -n rook-ceph localhost.localdomain: Tue Sep 7 14:09:39 2021
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
rook-ceph-mon-pdb N/A 1 1 22m
rook-ceph-osd-host-minikube-m02 N/A 0 0 5m15s
rook-ceph-osd-host-minikube-m04 N/A 0 0 5m15s
```
rook logs:
```
2021-09-07 08:34:24.245486 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2021-09-07 08:34:24.845075 I | clusterdisruption-controller: osd is down in failure domain "minikube-m03" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:52} {StateName:stale+active+clean Count:29}]"
2021-09-07 08:34:24.853990 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-minikube-m02" with maxUnavailable=0 for "host" failure domain "minikube-m02"
2021-09-07 08:34:24.865325 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-minikube-m04" with maxUnavailable=0 for "host" failure domain "minikube-m04"
2021-09-07 08:34:24.888968 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
```
So this looks as expected (a sketch of the shape of these blocking PDBs follows below).
I'll test it a few more times to see if there is any inconsistency in the behavior.
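For reference, here is a minimal Go sketch of what creating such a temporary blocking PDB can look like against the Kubernetes API. This is not the actual Rook source: the PDB name format mirrors the names in the output above, but the label selector keys (`app`, `topology-location-<type>`) are assumptions for illustration.

```
// Sketch only (NOT the actual Rook code): building a temporary "blocking"
// PodDisruptionBudget with maxUnavailable=0 for one host failure domain, so
// that no OSD pod on that host can be removed by a voluntary disruption
// (e.g. a node drain) while PGs are not yet active+clean.
package sketch

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

// createBlockingPDB creates a PDB such as "rook-ceph-osd-host-minikube-m02"
// that allows zero disruptions for the OSD pods of one failure domain.
func createBlockingPDB(ctx context.Context, c kubernetes.Interface, namespace, fdType, fdName string) error {
	maxUnavailable := intstr.FromInt(0) // zero voluntary disruptions allowed
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			// e.g. rook-ceph-osd-host-minikube-m02
			Name:      fmt.Sprintf("rook-ceph-osd-%s-%s", fdType, fdName),
			Namespace: namespace,
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				// Hypothetical selector; Rook's real label keys may differ.
				MatchLabels: map[string]string{
					"app":                         "rook-ceph-osd",
					"topology-location-" + fdType: fdName,
				},
			},
		},
	}
	_, err := c.PolicyV1().PodDisruptionBudgets(namespace).Create(ctx, pdb, metav1.CreateOptions{})
	return err
}
```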
I was able to reproduce this bug on an OpenShift AWS instance when deploying Rook with the OCS operator.

The possible root cause is this line of code: https://github.com/rook/rook/blob/ab728e0183c92e059af7d663b287b00e95d6e175/pkg/operator/ceph/disruption/clusterdisruption/osd.go#L525

Rook checks whether an OSD is down by looking at `ReadyReplicas` in the OSD deployment. When an OSD pod goes into CLBO because of a disk failure, there is a delay before the deployment's `Status.ReadyReplicas` is updated to 0: the pod is already in CLBO, but the `ReadyReplicas` count is still 1 when Rook checks it. This delay causes Rook to miss that any OSD is down at all, so no blocking PDBs are created for the other failure domains and only the default PDB remains, with its allowed disruptions count set to 0. (Note: this delay was observed on OpenShift AWS instances, not on local minikube instances of Rook.)

One possible solution is to reconcile again whenever the allowed disruptions count in the default PDB is 0; a sketch of this idea appears at the end of this report.

This should just be in POST, right?

Yeah, sorry. Only the upstream patch is ready. It should be POST.

Santosh, could you open the downstream backport PR?

Tested on an OCP + OCS cluster: detached the volumes manually from the AWS console, and blocking PDBs were created as expected.

```
[asandler@fedora ~]$ oc get pods -A | grep osd
openshift-storage    rook-ceph-osd-0-55f5495846-bpmgx    1/2    CrashLoopBackOff    3 (41s ago)    87m
[asandler@fedora ~]$ oc get pdb -A
openshift-storage    rook-ceph-osd-zone-us-east-2b    N/A    0    0    73s
openshift-storage    rook-ceph-osd-zone-us-east-2c    N/A    0    0    73s
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086
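To make the race and the proposed fix concrete, here is a minimal Go sketch. It is not the actual Rook implementation: `Deployment.Status.ReadyReplicas` and the PDB's `Status.DisruptionsAllowed` are the real Kubernetes API fields, but the function names and surrounding logic are hypothetical illustrations of the check and the proposed requeue.

```
// Sketch of the racy check and the proposed workaround. NOT the actual Rook
// implementation; only the Kubernetes API fields used here are real.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	policyv1 "k8s.io/api/policy/v1"
)

// osdIsDown mirrors the check described above: an OSD is treated as down
// once its deployment reports zero ready replicas. Right after a disk
// failure the pod is already in CrashLoopBackOff, but Status.ReadyReplicas
// can still read 1 until the readiness state propagates, so a single
// reconcile pass can miss the failure entirely.
func osdIsDown(d *appsv1.Deployment) bool {
	return d.Status.ReadyReplicas == 0
}

// shouldRequeue captures the proposed fix: if no OSD looks down but the
// default PDB already reports zero allowed disruptions, the ReadyReplicas
// view may simply be stale, so schedule another reconcile rather than
// concluding that everything is healthy.
func shouldRequeue(downOSDs int, defaultPDB *policyv1.PodDisruptionBudget) bool {
	return downOSDs == 0 && defaultPDB.Status.DisruptionsAllowed == 0
}
```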