Bug 2064849
| Summary: | [GSS] 1 pg recovery_unfound | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | khover |
| Component: | ceph | Assignee: | Josh Durgin <jdurgin> |
| Status: | CLOSED NOTABUG | QA Contact: | Prasad Desala <tdesala> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.8 | CC: | bniver, hnallurv, jdurgin, madam, muagarwa, nberry, ocs-bugs, odf-bz-bot, sostapov |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-25 17:39:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
list_unfound pg query in attachment @jdurgin. Not sure what can be garnered here since the environment has been restored, but please take a look.

Given that there are only 3 OSDs and they are all back in the environment, there's not a lot we can learn about what happened here. Since they are hosted on ESXi, I'd check the disk configuration there to ensure it is safe in the case of a sudden shutdown, e.g. that disk caching is disabled and writes are sent directly to hardware. The provided pg info is all for one PG, but it suggests a few writes to each object were lost. These are RBD data objects, so the impact is limited to particular PVs. There's nothing we can do at this point to recover those writes, so go ahead with `mark_unfound_lost revert`.

@jdurgin Thanks for looking at this. I'll have the customer run `mark_unfound_lost revert`. We can close this BZ.
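For reference, a minimal sketch of that revert procedure, assuming the commands are run from a node or toolbox pod with access to the `ceph` CLI; the PG ID `1.e` is only an example taken from the health output in the description below, so substitute the IDs your cluster actually reports:

```sh
# Identify the PGs that currently report unfound objects.
ceph health detail

# Inspect the unfound objects for a given PG (1.e used as an example).
ceph pg 1.e list_unfound

# Roll the unfound objects back to their last known-good version.
# This discards the lost writes and cannot be undone.
ceph pg 1.e mark_unfound_lost revert
```

`mark_unfound_lost revert` rolls each unfound object back to a previous version where one exists; the `delete` variant forgets the objects entirely and is only appropriate when no prior version is available.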
Description of problem (please be detailed as possible and provide log snippets):

Customer description of how the OSD nodes ended up in the current state: the "outage" was when two ESXi hosts went down and one host lost some networking capability. The root cause is unknown (I don't manage the environment), but the environment has since been restored. Cluster nodes were shut down for a few hours, but all pods and nodes recovered after starting them up again.

    HEALTH_ERR 33/275032 objects unfound (0.012%); Possible data damage: 15 pgs recovery_unfound; Degraded data redundancy: 65014/825096 objects degraded (7.880%), 15 pgs degraded, 15 pgs undersized; 15 pgs not deep-scrubbed in time; 15 pgs not scrubbed in time; 3 daemons have recently crashed
    OBJECT_UNFOUND 33/275032 objects unfound (0.012%)
        pg 1.e has 2 unfound objects
        pg 1.d has 1 unfound objects
        pg 1.9 has 3 unfound objects
        pg 1.8 has 3 unfound objects
        pg 1.7 has 5 unfound objects
        pg 1.1 has 1 unfound objects
        pg 1.2 has 1 unfound objects
        pg 1.3 has 3 unfound objects
        pg 1.4 has 2 unfound objects
        pg 1.5 has 2 unfound objects
        pg 1.10 has 1 unfound objects
        pg 1.17 has 3 unfound objects
        pg 1.1a has 1 unfound objects
        pg 1.1d has 3 unfound objects
        pg 1.1f has 2 unfound objects

We want to run the `ceph pg $PGID mark_unfound_lost revert` command but wanted to double-check with engineering before doing so, as this is a potential data loss scenario (see the sketch after the template fields below for applying it to each affected PG).

Version of all relevant components (if applicable):
OCS version: 4.8.8
ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
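Since 15 PGs reported unfound objects, the revert would have to be applied per PG. A minimal sketch, assuming the `ceph` CLI is reachable (e.g. from the rook-ceph toolbox pod) and that `ceph health detail` prints the `pg <id> has N unfound objects` lines in the format shown above:

```sh
# Revert unfound objects in every PG that currently reports them.
# The awk pattern assumes the "pg 1.e has 2 unfound objects" line format.
for pgid in $(ceph health detail | awk '$1 == "pg" && /unfound/ {print $2}'); do
    echo "Reverting unfound objects in pg ${pgid}"
    ceph pg "${pgid}" mark_unfound_lost revert
done
```

Each revert rolls the affected objects back to their most recent surviving version, which matches the limited impact expected here (a few lost writes per RBD data object).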