Bug 2177864
| Summary: | [5.2 ceph cluster] Both MDS are stuck in replay | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Brett Hull <bhull> |
| Component: | CephFS | Assignee: | Venky Shankar <vshankar> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Hemanth Kumar <hyelloji> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.2 | CC: | ceph-eng-bugs, cephqe-warriors, gfarnum, snipp, vshankar |
| Target Milestone: | --- | | |
| Target Release: | 6.1z1 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-07-12 01:42:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description of problem:

Both MDS ranks 0 and 1 are stuck in replay and keep failing over to the standby MDS daemons.

Prior to the issue:

1) Both MDS were active.
2) Kicked off an MDS scrub via:
   a) ceph tell mds.0 scrub start / recursive,repair

Attempted to pause the MDS scrubs:
- This MDS paused quickly: ceph-mds.root.host7.oqqvka.asok
- This one is still trying to pause: ceph-mds.root.host3.rdnzhn.asok

Tried to abort after the pause had been running for 10 minutes; now it is just stuck in pausing+aborting:

    {
        "status": "PAUSING+ABORTING (2959255 inodes in the stack)",
        "scrubs": {
            "382d27c9-2a0f-472e-bc68-3149557ff890": {
                "path": "/",
                "tag": "382d27c9-2a0f-472e-bc68-3149557ff890",
                "options": "recursive,repair"
            }
        }
    }

The pausing/aborting would not clear on its own. We ended up failing both MDS, after which all MDS scrubbing had stopped. We kicked off the MDS scrub again and it finished for both MDS. We also did a compaction of the omap for each SSD.

We have mounted 70 clients and restarted their rsyncs, so we are back up and appear stable.

    ceph fs status
    root - 70 clients
    ====
    RANK  STATE   MDS                ACTIVITY       DNS    INOS   DIRS   CAPS
     0    active  root.host1.amvgfe  Reqs:  667 /s  8258k  8250k  230k   283k
     1    active  root.host3.rdnzhn  Reqs: 2610 /s  4333k  4333k  28.7k  376k
         POOL         TYPE      USED   AVAIL
      cephfs.meta    metadata   45.0G  12.8T
      cephfs.data    data        128T   244T
    STANDBY MDS
    root.host7.oqqvka
    root.host5.lbrqru
    MDS version: ceph version 16.2.8-85.el8cp (0bdc6db9a80af40dd496b05674a938d406a9f6f5) pacific (stable)

Version-Release number of selected component (if applicable):
MDS version: ceph version 16.2.8-85.el8cp

How reproducible:
Not reliably reproducible, but we have seen it twice so far:
1) 21-Feb-2023 - both MDS were stuck in replay.
2) 07-Mar-2023 - one MDS stuck in replay:
   a) mds.1 was stuck in replay for approx. 15 minutes,
   b) then went to reconnect, rejoin, and active,
   c) and finished.

Steps to Reproduce:
N/A

Actual results:
MDS stuck in replay. Ceph-only cluster.

Expected results:
MDS do not become stuck.

Additional info:
Similar if not the same behavior was reported in BZ 2107110, which was closed with not enough data.
This causes CephFS to be unavailable; users are impacted.
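For reference, below is a minimal sketch of the scrub and recovery command sequence described above, assuming a filesystem named `root` and MDS rank 0/1 as in the report; the OSD ID is a placeholder and the exact targets used on this cluster are not confirmed by the report.

```sh
# Start a recursive, repairing scrub at the filesystem root (step 2a above).
ceph tell mds.0 scrub start / recursive,repair

# Check progress; this is where the report saw
# "PAUSING+ABORTING (2959255 inodes in the stack)".
ceph tell mds.0 scrub status

# Pause, then abort, the in-progress scrub.
ceph tell mds.0 scrub pause
ceph tell mds.0 scrub abort

# When the scrub state would not clear, the reporter failed both active
# MDS ranks so the standbys could take over (failover is also what puts
# the replacement MDS into journal replay).
ceph mds fail 0
ceph mds fail 1

# Re-run the scrub after failover and confirm the filesystem state.
ceph tell mds.0 scrub start / recursive,repair
ceph fs status root

# Omap (RocksDB) compaction on each OSD, matching "compaction of the
# omap for each ssd"; <id> is a placeholder for each OSD ID.
ceph tell osd.<id> compact
```

The scrub tag/UUID shown by `scrub status` (382d27c9-... in the output above) identifies the same scrub operation across these commands.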