1459123 – Pg repair corrupts all shards if the primary shard (or the "selected" copy) is corrupted

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RADOS
Sub Component:
Version:	2.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	2.*
Assignee:	David Zafman
QA Contact:	Parikshith
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-06-06 11:08 UTC by Parikshith
Modified:	2018-10-31 21:29 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-10-31 21:29:42 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	20186	0	None	None	None	2017-06-13 01:23:56 UTC

Description Parikshith 2017-06-06 11:08:02 UTC

Description of problem: 

Pg repair corrupts all shards if the xattr (or data) in the primary shard (or the "selected" copy) is corrupted.

Note: this bz covers the 'latent' issue highlighted in https://bugzilla.redhat.com/show_bug.cgi?id=1457097#c10.

Version-Release number of selected component (if applicable): 10.2.7-24.el7cp RHEL


How reproducible:

1. Created a pool and wrote some data.
2. Picked one of the shards (the primary) and corrupted the xattr of one of its objects.
3. Scrubbed the pg; the scrub found the inconsistent object.
4. Ran a pg repair, which successfully repaired the object.
5. Ran scrub again on the pg; it found the same object inconsistent again.

2017-06-03 20:09:04.877818 osd.0 [INF] 3.17 scrub starts
2017-06-03 20:09:04.879714 osd.0 [ERR] scrub 3.17 3:e98a0d29:::benchmark_data_aircobra.lab.eng.blr.redhat.c_129845_object53:head can't decode 'snapset' attr buffer::malformed_input: void SnapSet::decode(ceph::buffer::list::iterator&) unknown encoding version > 2
2017-06-03 20:09:04.879782 osd.0 [ERR] 3.17 scrub 1 errors

6. Ran a pg repair again; this time it did not repair the object.

2017-06-03 20:17:57.929809 osd.0 [INF] 3.17 repair starts
2017-06-03 20:17:59.446822 osd.0 [ERR] repair 3.17 3:e98a0d29:::benchmark_data_aircobra.lab.eng.blr.redhat.c_129845_object53:head can't decode 'snapset' attr buffer::malformed_input: void SnapSet::decode(ceph::buffer::list::iterator&) unknown encoding version > 2
2017-06-03 20:17:59.446855 osd.0 [ERR] 3.17 repair 1 errors, 0 fixed

After the repair, all copies have the xattr (or data) value of the primary shard. If the primary (or the "selected" copy) is somehow corrupted, all the other copies end up corrupted as well.
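
To make the failure mode concrete, here is a toy Python model of the behavior described above. It is only a sketch and not Ceph's repair code: the function names, the "good-snapset" placeholder value, and osd.7 are made up for illustration. The majority-vote function is included purely as a contrast case and is not a description of how Ceph picks an authoritative copy.

```python
from collections import Counter


def repair_from_selected(shards, selected):
    """Toy repair: overwrite every shard with the selected shard's value."""
    value = shards[selected]
    return {osd: value for osd in shards}


def repair_by_majority(shards):
    """Contrast case: overwrite every shard with the most common value."""
    value, _count = Counter(shards.values()).most_common(1)[0]
    return {osd: value for osd in shards}


# Hypothetical state before repair: the primary (osd.0) holds the corrupted
# snapset xattr; the other two replicas hold a placeholder "good" value.
shards = {
    "osd.0": "0000000xxxx000000000",
    "osd.4": "good-snapset",
    "osd.7": "good-snapset",
}

after_selected = repair_from_selected(shards, "osd.0")
after_majority = repair_by_majority(shards)

print(after_selected)  # every shard now carries the corrupted value
print(after_majority)  # every shard now carries the placeholder good value
```

In this model, repairing from the corrupted primary leaves no good copy behind, which matches the osd.4 dump below where the replica carries the corrupted snapset value.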

[osd.4]
getfattr -d -m . ./benchmark\\udata\\uaircobra.lab.eng.blr.redhat.c\\u129845\\uobject53__head_94B05197__3
# file: benchmark\134udata\134uaircobra.lab.eng.blr.redhat.c\134u129845\134uobject53__head_94B05197__3
security.selinux="system_u:object_r:ceph_var_lib_t:s0"
user.ceph._=0sDwghAQAABANdAAAAAAAAADwAAABiZW5jaG1hcmtfZGF0YV9haXJjb2JyYS5sYWIuZW5nLmJsci5yZWRoYXQuY18xMjk4NDVfb2JqZWN0NTP+/////////5dRsJQAAAAAAAMAAAAAAAAABgMcAAAAAwAAAAAAAAD/////AAAAAAAAAAD//////////wAAAAANAAAAAAAAACkAAAAAAAAAAAAAAAAAAAACAhUAAAAInBAAAAAAAAA2AAAAAAAAAAAAAAAAAEAAAAAAAD/JMllPeq4bAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
user.ceph._@1=0sAAAAAAAAAAANAAAAAAAAAAAAAAAAAAAAADQAAAA/yTJZEXQXHCNKM4n/////
user.ceph.snapset="0000000xxxx000000000"
user.cephos.spill_out=0sMAA=
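
For reading the dump above: getfattr prints binary values base64-encoded with a `0s` prefix, hex-encoded values with a `0x` prefix, and printable values as quoted text. The snapset attribute showing up as a quoted string is therefore the corrupted value itself. The Python sketch below decodes such values back to bytes; the helper name `decode_getfattr_value` is made up for this example, and backslash escapes inside quoted text are not handled.

```python
import base64


def decode_getfattr_value(raw):
    """Decode one attribute value, as printed by getfattr, into bytes."""
    if raw.startswith("0s"):
        # base64-encoded value
        return base64.b64decode(raw[2:])
    if raw.startswith("0x"):
        # hex-encoded value
        return bytes.fromhex(raw[2:])
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        # quoted text value (escape sequences are not interpreted here)
        return raw[1:-1].encode()
    raise ValueError("unrecognized getfattr value encoding: %r" % raw)


# Values copied from the osd.4 dump above.
spill_out = decode_getfattr_value("0sMAA=")
snapset = decode_getfattr_value('"0000000xxxx000000000"')

print(spill_out)
print(snapset, len(snapset))
```

Comparing the decoded bytes of the same attribute across OSDs is one way to check whether a replica still differs from the primary after a repair.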

Actual results:


Expected results:


Additional info:

Comment 3 Ian Colle 2017-06-06 14:02:50 UTC

Existing bad behavior - fix in async

Comment 6 Greg Farnum 2018-10-31 21:29:42 UTC

As the linked PR got merged to the upstream release stream and this is fixed in subsequent releases, I am closing this bugzilla.
