Bug 2193206 - [RDR] Ceph status stuck in warning due to OSD having {NOUP,NODOWN,NOIN,NOOUT} flags during node failure tests
Summary: [RDR] Ceph status stuck in warning due to OSD having {NOUP,NODOWN,NOIN,NOOUT}...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Santosh Pillai
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-04 17:11 UTC by Sidhant Agrawal
Modified: 2023-08-09 17:03 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-08 21:17:26 UTC
Embargoed:



Description Sidhant Agrawal 2023-05-04 17:11:19 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On an RDR setup, while running node failure tests (one worker node failure at a time), it was observed that on one of the managed clusters, Ceph health did not recover even after 3+ hours, with the below message in the ceph status output:
```
health: HEALTH_WARN
        1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
```
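
For triage, a sketch of how the flagged entity can typically be identified from the rook-ceph toolbox pod (these commands are not from this report and assume a recent Ceph release):

```
# Identify which OSD or CRUSH node/device class carries the NOUP/NODOWN/NOIN/NOOUT flags.
ceph health detail                                 # expands the HEALTH_WARN into per-entity detail
ceph osd dump -f json-pretty | grep -A 5 -i flags  # coarse filter for crush_node_flags / device_class_flags, if set
```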

Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-04-21-084440
ODF: 4.13.0-178
ACM: 2.7.3
Submariner: 0.14.3

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, Ceph is stuck in the HEALTH_WARN state

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Configure an RDR setup consisting of 3 OCP clusters (Hub, C1, C2)
2. Deploy an application containing 20 PVCs/Pods on C1
3. Fail the C1 cluster node (power off the VM) where the ramen-dr-cluster-operator pod is running
4. Wait for the old ramen-dr-cluster-operator pod to be deleted and a new pod to start
5. Start the node and wait for it to come up
6. Wait for ODF, DR and Submariner related pods to reach Running state
7. Check that the mirroring status is OK
8. Repeat steps 3 to 7 on cluster C2
9. Verify that Ceph health is OK at the end, with a wait time of 40 minutes (see the example check below)
10. Repeat steps 2 to 9 for the rbd-mirror pod node
11. Observe that Ceph health does not become OK on the C1 cluster

This issue was found during automation of node failure tests. PR: github.com/red-hat-storage/ocs-ci/pull/6675
There are more tests in the above PR, but the issue was hit after executing the steps above.
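
For reference, the Ceph health checks in steps 9 and 11 can be run from the rook-ceph toolbox pod, roughly as follows (a sketch only; the `app=rook-ceph-tools` label and `openshift-storage` namespace are assumed from a default ODF deployment):

```
# Query Ceph health on the managed cluster via the toolbox pod.
TOOLS_POD=$(kubectl -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
kubectl -n openshift-storage exec "$TOOLS_POD" -- ceph health detail
kubectl -n openshift-storage exec "$TOOLS_POD" -- ceph status
```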

Actual results:
Ceph health stuck in HEALTH_WARN due to "1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set"

Expected results:
Ceph health should recover and become OK

Additional Info:

Last node failure event details on the cluster C1:
12:08:38 - Power off compute-1 node
12:10:48 - Power on compute-1 node
12:11:35 - compute-1 node reached status Ready
12:27:54 - Started checking for Ceph health OK
13:07:36 - Ceph health not recovered, test failed at this point.

The cluster was kept in the same state for a few more hours, but it did not recover.

> Ceph health at 13:11:37

```
Thu May  4 13:11:37 UTC 2023
  cluster:
    id:     6e74d685-a732-409e-8631-174096f34641
    health: HEALTH_WARN
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

  services:
    mon:        3 daemons, quorum a,b,d (age 49m)
    mgr:        a(active, since 62m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 59m), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 3.15k objects, 5.3 GiB
    usage:   17 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     185 active+clean

  io:
    client:   4.1 KiB/s rd, 4.8 KiB/s wr, 5 op/s rd, 0 op/s wr
```

> Ceph status at 16:44:12

```
Thu May  4 16:44:12 UTC 2023
  cluster:
    id:     6e74d685-a732-409e-8631-174096f34641
    health: HEALTH_WARN
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

  services:
    mon:        3 daemons, quorum a,b,d (age 4h)
    mgr:        a(active, since 4h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 4h), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 3.15k objects, 5.3 GiB
    usage:   18 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     185 active+clean

  io:
    client:   5.1 KiB/s rd, 9.8 KiB/s wr, 6 op/s rd, 1 op/s wr
```

> Ceph status at 17:03:21
```
Thu May  4 17:03:21 UTC 2023
  cluster:
    id:     6e74d685-a732-409e-8631-174096f34641
    health: HEALTH_WARN
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

  services:
    mon:        3 daemons, quorum a,b,d (age 4h)
    mgr:        a(active, since 4h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 4h), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 3.15k objects, 5.3 GiB
    usage:   18 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     185 active+clean

  io:
    client:   853 B/s rd, 8.3 KiB/s wr, 1 op/s rd, 1 op/s wr
```

Comment 5 Santosh Pillai 2023-05-05 12:25:08 UTC
`ceph status` suggests that all PGs are active+clean, but the last output from the rook operator suggests that the PGs are not active+clean.

C2:
Ceph status:
```
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
sh-5.1$ ceph status 
  cluster:
    id:     9add0d42-5298-4e69-bf61-f6cc93bebbf1
    health: HEALTH_OK
 
  services:
    mon:        3 daemons, quorum a,b,c (age 24h)
    mgr:        a(active, since 44h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 24h), 3 in (since 7d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 1.35k objects, 1.4 GiB
    usage:   5.1 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   2.1 KiB/s rd, 12 KiB/s wr, 3 op/s rd, 1 op/s wr
```


Rook operator log:
```
ean Count:146} {StateName:active+clean+snaptrim_wait Count:19} {StateName:active+clean+snaptrim Count:4}]"
2023-05-05 08:33:31.199305 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:157} {StateName:active+clean+snaptrim_wait Count:10} {StateName:active+clean+snaptrim Count:2}]"
```


The rook operators on the clusters (C1 and C2) seem to be stuck and not proceeding further; the last log entries are from around the timestamp 2023-05-05 08:33:31.
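
For reference, the per-PG state breakdown the operator log is based on (active+clean vs. the snaptrim states) can be inspected directly, for example (a sketch; the `pgs_brief` column layout is assumed):

```
# Count PGs by state to compare against the operator's "pg health" message.
ceph pg dump pgs_brief 2>/dev/null | awk 'NR > 1 {print $2}' | sort | uniq -c
```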

Comment 6 Santosh Pillai 2023-05-05 13:13:20 UTC
Scratch my above comment #5

On C1:

- noout was set on rack1 at 12:10:19 after the OSD was drained.
rack1-noout-last-set-at: "2023-05-04T12:10:19Z"

- The OSD came back and all PGs were active+clean around 12:18
2023-05-04 12:18:57.582955 I | clusterdisruption-controller: all PGs are active+clean. Restoring default OSD pdb settings

- Around the same time, one of the mons was failing over; rook failed to get the osd dump and thus couldn't remove the noout flag.
2023-05-04 12:18:57.580928 E | clusterdisruption-controller: failed to update maintenance noout in cluster "openshift-storage/ocs-storagecluster-cephcluster". failed to get osddump for reconciling maintenance noout in namespace openshift-storage: failed to get osd dump: exit status 1

Currently we are just logging this error (https://github.com/rook/rook/blob/c2abb7f006fe9f5c75b2a8f0e60e4dd933d15eb9/pkg/operator/ceph/disruption/clusterdisruption/osd.go#L381) and not handling/returning it, so a quick fix could be to return this error so that the reconcile is retried and the flag gets cleared on a later pass.
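
As a possible manual mitigation in the meantime (a sketch, not verified on this cluster; the flag name and rack come from the annotation/log above), the leftover group flag could be cleared from the toolbox:

```
# Confirm the leftover flag on the CRUSH node and clear it (rack1 per the log above).
ceph osd dump -f json-pretty | grep -A 5 crush_node_flags
ceph osd unset-group noout rack1
```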

Comment 7 krishnaram Karthick 2023-05-08 10:09:31 UTC
This potentially breaks our automation, and it is important to have a fix for RDR.
So, proposing this bug as a blocker and targeting it for 4.13.

Comment 8 Santosh Pillai 2023-05-08 10:58:15 UTC
(In reply to krishnaram Karthick from comment #7)
> This potentially breaks our automation and it is important to have fix for
> RDR.
> So, proposing this bug as a blocker and targeting it for 4.13.
I have also asked for a repro.

