Bug 2193206
| Summary: | [RDR] Ceph status stuck in warning due to OSD having {NOUP,NODOWN,NOIN,NOOUT} flags during node failure tests | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sidhant Agrawal <sagrawal> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.13 | CC: | amagrawa, kramdoss, muagarwa, ocs-bugs, odf-bz-bot, tnielsen |
| Target Milestone: | --- | Keywords: | Automation |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-08 21:17:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
`ceph status` suggests that all the PGs are active+clean.
But the last output from the rook operator suggests that the PGs are not all active+clean.
C2:
Ceph status:
```
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
sh-5.1$ ceph status
  cluster:
    id:     9add0d42-5298-4e69-bf61-f6cc93bebbf1
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum a,b,c (age 24h)
    mgr:        a(active, since 44h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 24h), 3 in (since 7d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 1.35k objects, 1.4 GiB
    usage:   5.1 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     169 active+clean

  io:
    client:   2.1 KiB/s rd, 12 KiB/s wr, 3 op/s rd, 1 op/s wr
```
Rook operator log:
```
ean Count:146} {StateName:active+clean+snaptrim_wait Count:19} {StateName:active+clean+snaptrim Count:4}]"
2023-05-05 08:33:31.199305 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1 rack2]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:157} {StateName:active+clean+snaptrim_wait Count:10} {StateName:active+clean+snaptrim Count:2}]"
```
The rook operator on both clusters (C1 and C2) seems to be stuck and not proceeding further; the last log entries are around the timestamp `2023-05-05 08:33:3`.
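As an aside on the log output above: the operator reports the cluster as "not fully clean" even though every PG listed is in some active+clean variant, which is what a strict check for the exact state string `active+clean` would produce (PGs in `active+clean+snaptrim` or `active+clean+snaptrim_wait` would not match). The snippet below is only a hypothetical illustration of such a check, built around the state counts from the log; it is not Rook's actual implementation.

```go
package main

import "fmt"

// PgStateCount mirrors the shape of the entries printed in the operator log
// above, e.g. {StateName:active+clean+snaptrim_wait Count:10}.
type PgStateCount struct {
	StateName string
	Count     int
}

// clusterFullyClean returns true only when every PG is in the exact
// "active+clean" state. Under this strict rule, PGs that are merely
// trimming snapshots (active+clean+snaptrim*) make the cluster count
// as "not fully clean", even while `ceph status` reports HEALTH_OK.
func clusterFullyClean(states []PgStateCount) bool {
	for _, s := range states {
		if s.Count > 0 && s.StateName != "active+clean" {
			return false
		}
	}
	return true
}

func main() {
	// State counts taken from the C1 log line above.
	states := []PgStateCount{
		{StateName: "active+clean", Count: 157},
		{StateName: "active+clean+snaptrim_wait", Count: 10},
		{StateName: "active+clean+snaptrim", Count: 2},
	}
	fmt.Println("fully clean:", clusterFullyClean(states)) // prints: fully clean: false
}
```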
Scratch my above comment #5.

On C1:
- noout was set on rack1 at 12:10:19 after the OSD was drained:
  rack1-noout-last-set-at: "2023-05-04T12:10:19Z"
- The OSD came back and all PGs were active+clean around 12:18:
  2023-05-04 12:18:57.582955 I | clusterdisruption-controller: all PGs are active+clean. Restoring default OSD pdb settings
- Around the same time, one of the mons was failing over and rook failed to get the osd dump, so it couldn't remove the noout flag:
  2023-05-04 12:18:57.580928 E | clusterdisruption-controller: failed to update maintenance noout in cluster "openshift-storage/ocs-storagecluster-cephcluster". failed to get osddump for reconciling maintenance noout in namespace openshift-storage: failed to get osd dump: exit status 1

Currently we are just logging this error (https://github.com/rook/rook/blob/c2abb7f006fe9f5c75b2a8f0e60e4dd933d15eb9/pkg/operator/ceph/disruption/clusterdisruption/osd.go#L381) and not handling/returning it. So a quick fix can be to return this error.

This potentially breaks our automation and it is important to have a fix for RDR. So, proposing this bug as a blocker and targeting it for 4.13.

(In reply to krishnaram Karthick from comment #7)
> This potentially breaks our automation and it is important to have a fix for RDR.
> So, proposing this bug as a blocker and targeting it for 4.13.

I have also asked for a repro.
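To make the proposed quick fix concrete (return the osd-dump error instead of only logging it, so the reconcile is retried and the noout flag is eventually cleared), here is a minimal sketch. The function, parameter, and type names are hypothetical stand-ins, not the actual Rook code at the linked line.

```go
package main

import (
	"errors"
	"fmt"
)

// osdDump is a minimal stand-in for the parsed output of `ceph osd dump`.
type osdDump struct {
	// crushNodeFlags maps a CRUSH node (e.g. "rack1") to the flags set on it.
	crushNodeFlags map[string][]string
}

// reconcileMaintenanceNoout sketches the idea behind the proposed fix:
// if fetching the OSD dump fails (for example during a mon failover),
// return the error so the controller requeues and retries, instead of
// only logging it and silently leaving the noout flag set.
func reconcileMaintenanceNoout(getDump func() (*osdDump, error), unsetNoout func(node string) error) error {
	dump, err := getDump()
	if err != nil {
		// Per the comment above, this failure was previously only logged;
		// returning it is the "quick fix" so the flag is reconciled on a later pass.
		return fmt.Errorf("failed to get osd dump for reconciling maintenance noout: %w", err)
	}
	for node, flags := range dump.crushNodeFlags {
		for _, f := range flags {
			if f == "noout" {
				if err := unsetNoout(node); err != nil {
					return fmt.Errorf("failed to unset noout on %q: %w", node, err)
				}
			}
		}
	}
	return nil
}

func main() {
	// Simulate the failure seen in the log: the osd dump cannot be fetched.
	err := reconcileMaintenanceNoout(
		func() (*osdDump, error) { return nil, errors.New("exit status 1") },
		func(node string) error { return nil },
	)
	fmt.Println(err) // the caller (reconciler) can now requeue on this error
}
```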
Description of problem (please be detailed as possible and provide log snippets):

On a RDR setup, while running some node failure tests (1 worker node failure at a time), it was observed that for one of the managed clusters, Ceph health did not recover even after 3+ hours, with the below message in ceph status:
```
health: HEALTH_WARN
        1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
```

Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-04-21-084440
ODF: 4.13.0-178
ACM: 2.7.3
Submariner: 0.14.3

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, Ceph stuck in HEALTH_WARN state

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Configure a RDR setup consisting of 3 OCP clusters (Hub, C1, C2)
2. Deploy an application containing 20 PVCs/Pods on C1
3. Fail the C1 cluster node (power off the VM) where the ramen-dr-cluster-operator pod is running
4. Wait for the old ramen-dr-cluster-operator pod to be deleted and a new pod to start
5. Start the node and wait for the node to come up
6. Wait for ODF, DR and submariner related pods to reach Running state
7. Check that the mirroring status is OK
8. Repeat steps 3 to 7 on cluster C2
9. Verify Ceph health is OK at the end, with a wait time of 40 minutes
10. Repeat steps 2 to 9 again for the rbd-mirror pod's node
11. Observe that Ceph health does not become OK on the C1 cluster

This issue was found during automation of node failure tests.
PR: github.com/red-hat-storage/ocs-ci/pull/6675
There are more tests in the above PR, but the issue was hit after executing the above steps.

Actual results:
Ceph health stuck in HEALTH_WARN due to "1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set"

Expected results:
Ceph health should recover and become OK

Additional Info:
Last node failure event details on the cluster C1:
12:08:38 - Power off compute-1 node
12:10:48 - Power on compute-1 node
12:11:35 - compute-1 node reached status Ready
12:27:54 - Started checking for Ceph health OK
13:07:36 - Ceph health not recovered, test failed at this point.
The cluster was kept in the same state for a few more hours, but it did not recover.

> Ceph health at 13:11:37
```
Thu May 4 13:11:37 UTC 2023
  cluster:
    id:     6e74d685-a732-409e-8631-174096f34641
    health: HEALTH_WARN
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

  services:
    mon:        3 daemons, quorum a,b,d (age 49m)
    mgr:        a(active, since 62m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 59m), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 3.15k objects, 5.3 GiB
    usage:   17 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     185 active+clean

  io:
    client:   4.1 KiB/s rd, 4.8 KiB/s wr, 5 op/s rd, 0 op/s wr
```

> Ceph status at 16:44:12
```
Thu May 4 16:44:12 UTC 2023
  cluster:
    id:     6e74d685-a732-409e-8631-174096f34641
    health: HEALTH_WARN
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

  services:
    mon:        3 daemons, quorum a,b,d (age 4h)
    mgr:        a(active, since 4h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 4h), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 3.15k objects, 5.3 GiB
    usage:   18 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     185 active+clean

  io:
    client:   5.1 KiB/s rd, 9.8 KiB/s wr, 6 op/s rd, 1 op/s wr
```

> Ceph status at 17:03:21
```
Thu May 4 17:03:21 UTC 2023
  cluster:
    id:     6e74d685-a732-409e-8631-174096f34641
    health: HEALTH_WARN
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

  services:
    mon:        3 daemons, quorum a,b,d (age 4h)
    mgr:        a(active, since 4h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 4h), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 3.15k objects, 5.3 GiB
    usage:   18 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     185 active+clean

  io:
    client:   853 B/s rd, 8.3 KiB/s wr, 1 op/s rd, 1 op/s wr
```
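For diagnosing this state, the sketch below checks which CRUSH node still carries the maintenance flags reported in the HEALTH_WARN above, by running `ceph osd dump --format json` through a toolbox pod. It assumes a toolbox deployment named `rook-ceph-tools` in the `openshift-storage` namespace, and that the osd dump JSON exposes the per-CRUSH-node flags under a `crush_node_flags` key; adjust the names for your environment. Given the analysis in the earlier comment, this would presumably show noout still set on rack1.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Run `ceph osd dump --format json` via the toolbox pod.
	// Namespace and deployment name are assumptions for an ODF cluster
	// with the rook-ceph-tools deployment enabled.
	out, err := exec.Command(
		"kubectl", "-n", "openshift-storage",
		"exec", "deploy/rook-ceph-tools", "--",
		"ceph", "osd", "dump", "--format", "json",
	).Output()
	if err != nil {
		log.Fatalf("ceph osd dump failed: %v", err)
	}

	// Only the field we care about; the real osd dump has many more keys.
	var dump struct {
		CrushNodeFlags map[string][]string `json:"crush_node_flags"`
	}
	if err := json.Unmarshal(out, &dump); err != nil {
		log.Fatalf("failed to parse osd dump: %v", err)
	}

	if len(dump.CrushNodeFlags) == 0 {
		fmt.Println("no per-CRUSH-node flags set")
		return
	}
	for node, flags := range dump.CrushNodeFlags {
		fmt.Printf("CRUSH node %q has flags %v\n", node, flags)
	}
}
```

If a flag is indeed stuck, it can presumably be cleared by hand from the toolbox with something like `ceph osd unset-group noout rack1`, though the bug tracked here is about Rook failing to do this automatically after the failed osd dump.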