Bug 1984396 - Failing the only OSD of a node on a 3 node cluster doesn't create blocking PDBs
Summary: Failing the only OSD of a node on a 3 node cluster doesn't create blocking PDBs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: Santosh Pillai
QA Contact: Anna Sandler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-21 11:11 UTC by krishnaram Karthick
Modified: 2023-08-09 17:03 UTC
CC: 8 users

Fixed In Version: v4.9.0-158.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-13 17:44:54 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage rook pull 20 0 None open Bug 1984396: ceph: reconcile osd pdb if allowed disruption is 0 2021-09-21 16:12:40 UTC
Github rook rook pull 8698 0 None Draft ceph: reconcile osd pdb if allowed disruption is 0 2021-09-13 10:51:29 UTC
Red Hat Product Errata RHSA-2021:5086 0 None None None 2021-12-13 17:45:16 UTC

Description krishnaram Karthick 2021-07-21 11:11:00 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When the only OSD on a node of a 3-node OCS cluster fails, no blocking PDBs are created on the other two nodes even though PGs are unhealthy. Ideally, in such a situation, blocking PDBs should be created so that node drains on the non-failing nodes are blocked.

See https://bugzilla.redhat.com/show_bug.cgi?id=1950419#c28 for more details.


Version of all relevant components (if applicable):
oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-456.ci   OpenShift Container Storage   4.8.0-456.ci              Succeeded



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
NA

Is there any workaround available to the best of your knowledge?
Replace the failed OSD


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA


Steps to Reproduce:
[test cluster; 3 ocs worker nodes with one OSD on each node]
1. Create an OCP + OCS cluster with 3 master and 3 worker nodes
2. Create an RBD volume and write data (this is to ensure that PGs are degraded when an OSD fails)
3. Fail one of the OSDs on a node (I force-detached the disk from AWS; see the sketch after this list)
4. Watch for any change in the PDBs
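
Step 3 above was done manually from the AWS console. As an illustration only, here is a hedged sketch of force-detaching an OSD's EBS volume with aws-sdk-go; the region and volume ID are placeholders, not values taken from this bug:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// The session picks up credentials from the environment or shared config.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-2")}))
	svc := ec2.New(sess)

	// Force-detach the EBS volume backing the OSD so its pod goes into CrashLoopBackOff.
	out, err := svc.DetachVolume(&ec2.DetachVolumeInput{
		VolumeId: aws.String("vol-0123456789abcdef0"), // placeholder: the failed OSD's volume ID
		Force:    aws.Bool(true),
	})
	if err != nil {
		log.Fatalf("failed to detach volume: %v", err)
	}
	fmt.Printf("detach state: %s\n", aws.StringValue(out.State))
}
```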


Actual results:
No blocking PDBs are created

Expected results:
Blocking PDBs should be created

Additional info:

Comment 2 Travis Nielsen 2021-08-02 15:37:37 UTC
Santosh, can you take a look?

Comment 3 Santosh Pillai 2021-08-03 04:38:13 UTC
I'll look into it this week.

Comment 4 Travis Nielsen 2021-08-23 15:56:21 UTC
How's it looking?

Comment 5 Santosh Pillai 2021-09-07 08:43:21 UTC
Tested it with rook on a 3-node minikube cluster by deleting the disk from VirtualBox. Observed the following:

1. The osd pod (for which the disk was removed) went into CrashLoopBackOff (CLBO) state:

```
 oc get pods -n rook-ceph -o wide | grep osd
rook-ceph-osd-0-77b7459f77-l27r2                         0/1     CrashLoopBackOff   5          19m   10.244.2.9       minikube-m03   <none>           <none>
rook-ceph-osd-1-749b5fbd74-5gg47                         1/1     Running            0          19m   10.244.3.8       minikube-m04   <none>           <none>
rook-ceph-osd-2-85984c996d-6rxht                         1/1     Running            0          19m   10.244.1.10      minikube-m02   <none>           <none>
rook-ceph-osd-prepare-minikube-m02-gtxrs                 0/1     Completed          0          20m   10.244.1.9       minikube-m02   <none>           <none>
rook-ceph-osd-prepare-minikube-m03-5lr6b                 0/1     Completed          0          20m   10.244.2.7       minikube-m03   <none>           <none>
rook-ceph-osd-prepare-minikube-m04-29hhd                 0/1     Completed          0          20m   10.244.3.7       minikube-m04   <none>           <none>
```

2. Ceph Status was degraded:

```
Every 2.0s: ceph status                                                              rook-ceph-tools-78cdfd976c-dmj98: Tue Sep  7 08:39:04 2021

  cluster:
    id:     cff26850-1cfc-4542-8bed-bb19c42523e9
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 224/672 objects degraded (33.333%), 34 pgs degraded, 81 pgs undersized
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 21m)
    mgr: a(active, since 20m)
    osd: 3 osds: 2 up (since 4m), 3 in (since 21m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   8 pools, 81 pgs
    objects: 224 objects, 9.8 KiB
    usage:   34 MiB used, 30 GiB / 30 GiB avail
    pgs:     224/672 objects degraded (33.333%)
             47 active+undersized
             34 active+undersized+degraded
```

3. Blocking PDBs were created successfully for the other failure domains (nodes):

```
Every 2.0s: oc get pdb -n rook-ceph                                                                    localhost.localdomain: Tue Sep  7 14:09:39 2021

NAME                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mon-pdb                 N/A             1                 1                     22m
rook-ceph-osd-host-minikube-m02   N/A             0                 0                     5m15s
rook-ceph-osd-host-minikube-m04   N/A             0                 0                     5m15s
```


rook logs:

```
2021-09-07 08:34:24.245486 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2021-09-07 08:34:24.845075 I | clusterdisruption-controller: osd is down in failure domain "minikube-m03" and pgs are not active+clean. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:52} {StateName:stale+active+clean Count:29}]"
2021-09-07 08:34:24.853990 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-minikube-m02" with maxUnavailable=0 for "host" failure domain "minikube-m02"
2021-09-07 08:34:24.865325 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-minikube-m04" with maxUnavailable=0 for "host" failure domain "minikube-m04"
2021-09-07 08:34:24.888968 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd

```

So this looks expected. 

I'll test it a few more times to see if there is any inconsistency in the behavior.
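
For reference, a minimal sketch of what such a temporary blocking PDB (maxUnavailable=0 scoped to one failure domain) could look like when built with client-go types. The name format and label keys are assumptions inferred from the log lines above, not the exact rook implementation:

```go
package clusterdisruption

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// blockingOSDPDB builds a PDB that allows zero voluntary disruptions for the OSDs in the
// given failure domain (e.g. host "minikube-m02"), which blocks node drains on that node.
func blockingOSDPDB(namespace, failureDomainType, failureDomainName string) *policyv1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(0)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			// e.g. "rook-ceph-osd-host-minikube-m02", as seen in the `oc get pdb` output above.
			Name:      "rook-ceph-osd-" + failureDomainType + "-" + failureDomainName,
			Namespace: namespace,
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{
					// Assumed labels: "app" plus a topology label keyed by the failure domain type.
					"app": "rook-ceph-osd",
					"topology-location-" + failureDomainType: failureDomainName,
				},
			},
		},
	}
}
```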

Comment 6 Santosh Pillai 2021-09-13 09:52:22 UTC
I was able to reproduce this bug on an OpenShift AWS instance when deploying rook with the OCS operator.

The possible root cause is this line of code: https://github.com/rook/rook/blob/ab728e0183c92e059af7d663b287b00e95d6e175/pkg/operator/ceph/disruption/clusterdisruption/osd.go#L525

Rook checks whether an OSD is down by looking at `ReadyReplicas` in the OSD deployment. When the OSD pod goes into CLBO due to a disk failure, there is a delay before the deployment's `Status.ReadyReplicas` drops to 0: although the pod is in CLBO, the `ReadyReplicas` count is still 1 when rook checks it. Because of this delay, rook misses that an OSD is down at all, so no blocking PDBs are created for the other failure domains. Only the default PDB remains, with its `AllowedDisruptions` count set to 0.

(Note: this delay was observed on OpenShift AWS instances, not on local minikube instances of rook.)

One possible solution is to reconcile again whenever the `AllowedDisruptions` count in the default PDB is 0.
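
A minimal sketch of that idea, assuming a controller-runtime style reconciler; the constant, function name, and requeue interval below are illustrative assumptions, not the actual rook patch:

```go
package clusterdisruption

import (
	"context"
	"time"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// defaultPDBName is the default PDB covering all OSDs (name taken from the logs above).
const defaultPDBName = "rook-ceph-osd"

// requeueIfDisruptionsExhausted re-runs the disruption reconcile after a short delay
// when the default OSD PDB allows no disruptions, so a late drop of ReadyReplicas on a
// failed OSD deployment is not missed.
func requeueIfDisruptionsExhausted(ctx context.Context, c client.Client, namespace string) (ctrl.Result, error) {
	pdb := &policyv1.PodDisruptionBudget{}
	err := c.Get(ctx, types.NamespacedName{Name: defaultPDBName, Namespace: namespace}, pdb)
	if apierrors.IsNotFound(err) {
		// The default PDB is absent, which means blocking PDBs are already in place.
		return ctrl.Result{}, nil
	}
	if err != nil {
		return ctrl.Result{}, err
	}
	if pdb.Status.DisruptionsAllowed == 0 {
		// An OSD may be down even though its deployment still reports ReadyReplicas=1;
		// reconcile again instead of trusting the possibly stale deployment status.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```

The point of the requeue is that a later reconcile will see `ReadyReplicas` drop to 0 on the failed OSD deployment and can then create the blocking PDBs for the other failure domains.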

Comment 7 Travis Nielsen 2021-09-13 13:55:13 UTC
This should just be in POST, right?

Comment 8 Santosh Pillai 2021-09-13 14:36:15 UTC
Yeah, sorry. Only the upstream patch is ready. Should be POST.

Comment 9 Travis Nielsen 2021-09-20 15:07:15 UTC
Santosh, could you open the downstream backport PR?

Comment 16 Anna Sandler 2021-10-19 23:50:11 UTC
Tested on an OCP + OCS cluster.
Detached the volumes manually from the AWS console and blocking PDBs were created as expected.

[asandler@fedora ~]$ oc get pods -A | grep osd
openshift-storage                                  rook-ceph-osd-0-55f5495846-bpmgx                                      1/2     CrashLoopBackOff       3 (41s ago)    87m

[asandler@fedora ~]$ oc get pdb -A
openshift-storage                      rook-ceph-osd-zone-us-east-2b                     N/A             0                 0                     73s
openshift-storage                      rook-ceph-osd-zone-us-east-2c                     N/A             0                 0                     73s

Comment 18 errata-xmlrpc 2021-12-13 17:44:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086

