Bug 2064849 - [GSS] 1 pg recovery_unfound
Summary: [GSS] 1 pg recovery_unfound
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Josh Durgin
QA Contact: Prasad Desala
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-16 18:43 UTC by khover
Modified: 2023-08-09 16:37 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-25 17:39:30 UTC
Embargoed:


Attachments

Description khover 2022-03-16 18:43:26 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Customer's description of how the OSD nodes reached the current state:

The "outage" was where a two ESXi hosts went down, and one host lost some networking capabilities. The root cause is unknown (I don't manage the environment) but the environment has been restored since. Cluster nodes were shut down for a few hours, but all pods and nodes recovered after starting them up again.

HEALTH_ERR 33/275032 objects unfound (0.012%); Possible data damage: 15 pgs recovery_unfound; Degraded data redundancy: 65014/825096 objects degraded (7.880%), 15 pgs degraded, 15 pgs undersized; 15 pgs not deep-scrubbed in time; 15 pgs not scrubbed in time; 3 daemons have recently crashed
OBJECT_UNFOUND 33/275032 objects unfound (0.012%)
    pg 1.e has 2 unfound objects
    pg 1.d has 1 unfound objects
    pg 1.9 has 3 unfound objects
    pg 1.8 has 3 unfound objects
    pg 1.7 has 5 unfound objects
    pg 1.1 has 1 unfound objects
    pg 1.2 has 1 unfound objects
    pg 1.3 has 3 unfound objects
    pg 1.4 has 2 unfound objects
    pg 1.5 has 2 unfound objects
    pg 1.10 has 1 unfound objects
    pg 1.17 has 3 unfound objects
    pg 1.1a has 1 unfound objects
    pg 1.1d has 3 unfound objects
    pg 1.1f has 2 unfound objects

We want to run the 'ceph pg $PGID mark_unfound_lost revert' command, but wanted to double-check with engineering before doing so, as this is a potential data-loss scenario.
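
Before reverting, the scope can be inspected with standard Ceph commands; a minimal sketch, using pg 1.e from the health output above as an example:

    ceph health detail          # full list of PGs with unfound objects
    ceph pg 1.e list_unfound    # names and versions of the unfound objects in one PG
    ceph pg 1.e query           # peering state and which OSDs were probed
    ceph crash ls               # details on the 3 recently crashed daemons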

Version of all relevant components (if applicable):

OCS Version is : 4.8.8

ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 2 khover 2022-03-16 18:45:50 UTC
The list_unfound and pg query output are in the attachment.

Comment 4 Scott Ostapovicz 2022-03-17 15:59:53 UTC
@jdurgin. Not sure what can be garnered here since the environment has been restored, but please take a look.

Comment 5 Josh Durgin 2022-03-17 23:59:06 UTC
Given that there are only 3 OSDs and they are all back in the environment, there's not a lot we can learn about what happened here. Since they are hosted on ESXi, I'd check the disk configuration there to ensure it is safe in the case of a sudden shutdown, e.g. that disk caching is disabled and writes are sent directly to hardware.
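
As a guest-side sanity check (the authoritative setting lives on the ESXi side, which we can't see from the cluster), the OSD nodes can at least report whether their virtual disks advertise a volatile write cache; a sketch, assuming a reasonably recent kernel that exposes this sysfs attribute:

    # "write back" means acknowledged writes may sit in a volatile cache
    # and can be lost on sudden power-off unless the hypervisor makes
    # them durable underneath; "write through" is safe.
    for dev in /sys/block/sd*/queue/write_cache; do
        echo "$dev: $(cat "$dev")"
    done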

The provided pg info is all for one pg, but it suggests a few writes to each object were lost - rbd data objects - so the impact is limited to particular PVs. There's nothing we can do at this point to recover those writes, so go ahead with 'ceph pg $PGID mark_unfound_lost revert'.
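
A minimal sketch of the revert itself, assuming the 15 PG IDs from the HEALTH_ERR output in the description; 'revert' rolls each unfound object back to its last-known version, or deletes it if the object was newly created:

    for pg in 1.e 1.d 1.9 1.8 1.7 1.1 1.2 1.3 1.4 1.5 1.10 1.17 1.1a 1.1d 1.1f; do
        ceph pg "$pg" mark_unfound_lost revert
    done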

Comment 6 khover 2022-03-18 13:31:05 UTC
@jdurgin

Thanks for looking at this.

I'll have the customer run 'ceph pg $PGID mark_unfound_lost revert'.
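
A short verification sketch for afterwards, assuming recovery completes normally:

    ceph health detail       # the unfound/recovery_unfound entries should clear
    ceph -s                  # expect HEALTH_OK once recovery and scrubs catch up
    ceph crash archive-all   # acknowledge the 3 old daemon crashes so that warning clears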


We can close this BZ

