Bug 2064849
| Summary: | [GSS] 1 pg recovery_unfound | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | khover |
| Component: | ceph | Assignee: | Josh Durgin <jdurgin> |
| Status: | CLOSED NOTABUG | QA Contact: | Prasad Desala <tdesala> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.8 | CC: | bniver, hnallurv, jdurgin, madam, muagarwa, nberry, ocs-bugs, odf-bz-bot, sostapov |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-25 17:39:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
list_unfound pg query in attachment @jdurgin. Not sure what can be garnered here since the environment has been restored, but please take a look.

Given that there are only 3 OSDs and they are all back in the environment, there's not a lot we can learn about what happened here. Since they are hosted on ESXi, I'd check the disk configuration there to ensure it is safe in the case of a sudden shutdown, e.g. that disk caching is disabled and writes are sent directly to hardware. The provided pg info is all for one PG, but it suggests a few writes to each object were lost. These are RBD data objects, so the impact is limited to particular PVs. There's nothing we can do at this point to recover those writes, so go ahead with `mark_unfound_lost revert`.

@jdurgin Thanks for looking at this. I'll have the customer run `mark_unfound_lost revert`. We can close this BZ.
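For reference, a minimal sketch of that revert procedure, assuming the commands are run from a node or toolbox pod with access to the `ceph` CLI; the PG ID `1.e` is only an example taken from the health output in the description below, so substitute the IDs your cluster actually reports:

```sh
# Identify the PGs that currently report unfound objects.
ceph health detail

# Inspect the unfound objects for a given PG (1.e used as an example).
ceph pg 1.e list_unfound

# Roll the unfound objects back to their last known-good version.
# This discards the lost writes and cannot be undone.
ceph pg 1.e mark_unfound_lost revert
```

`mark_unfound_lost revert` rolls each unfound object back to a previous version where one exists; the `delete` variant forgets the objects entirely and is only appropriate when no prior version is available.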
Description of problem (please be detailed as possible and provide log snippets):

Customer description of how the OSD nodes ended up in the current state: the "outage" was when two ESXi hosts went down and one host lost some networking capability. The root cause is unknown (I don't manage the environment), but the environment has since been restored. Cluster nodes were shut down for a few hours, but all pods and nodes recovered after starting them up again.

    HEALTH_ERR 33/275032 objects unfound (0.012%); Possible data damage: 15 pgs recovery_unfound; Degraded data redundancy: 65014/825096 objects degraded (7.880%), 15 pgs degraded, 15 pgs undersized; 15 pgs not deep-scrubbed in time; 15 pgs not scrubbed in time; 3 daemons have recently crashed
    OBJECT_UNFOUND 33/275032 objects unfound (0.012%)
        pg 1.e has 2 unfound objects
        pg 1.d has 1 unfound objects
        pg 1.9 has 3 unfound objects
        pg 1.8 has 3 unfound objects
        pg 1.7 has 5 unfound objects
        pg 1.1 has 1 unfound objects
        pg 1.2 has 1 unfound objects
        pg 1.3 has 3 unfound objects
        pg 1.4 has 2 unfound objects
        pg 1.5 has 2 unfound objects
        pg 1.10 has 1 unfound objects
        pg 1.17 has 3 unfound objects
        pg 1.1a has 1 unfound objects
        pg 1.1d has 3 unfound objects
        pg 1.1f has 2 unfound objects

We want to run the `ceph pg $PGID mark_unfound_lost revert` command but wanted to double-check with engineering before doing so, as this is a potential data loss scenario (see the sketch after the template fields below for applying it to each affected PG).

Version of all relevant components (if applicable):
OCS version: 4.8.8
ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
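Since 15 PGs reported unfound objects, the revert would have to be applied per PG. A minimal sketch, assuming the `ceph` CLI is reachable (e.g. from the rook-ceph toolbox pod) and that `ceph health detail` prints the `pg <id> has N unfound objects` lines in the format shown above:

```sh
# Revert unfound objects in every PG that currently reports them.
# The awk pattern assumes the "pg 1.e has 2 unfound objects" line format.
for pgid in $(ceph health detail | awk '$1 == "pg" && /unfound/ {print $2}'); do
    echo "Reverting unfound objects in pg ${pgid}"
    ceph pg "${pgid}" mark_unfound_lost revert
done
```

Each revert rolls the affected objects back to their most recent surviving version, which matches the limited impact expected here (a few lost writes per RBD data object).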