Bug 1459123 - PG repair corrupts all shards if the primary shard (or the "selected" copy) is corrupted
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 2.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 2.*
Assignee: David Zafman
QA Contact: Parikshith
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-06-06 11:08 UTC by Parikshith
Modified: 2018-10-31 21:29 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-31 21:29:42 UTC
Embargoed:



Links
System                    ID     Private  Priority  Status  Summary  Last Updated
Ceph Project Bug Tracker  20186  0        None      None    None     2017-06-13 01:23:56 UTC

Description Parikshith 2017-06-06 11:08:02 UTC
Description of problem: 

PG repair corrupts all shards if the xattr (or data) in the primary shard (or the "selected" copy) is corrupted.

Note: this bz covers the 'latent' issue highlighted in https://bugzilla.redhat.com/show_bug.cgi?id=1457097#c10.

Version-Release number of selected component (if applicable): 10.2.7-24.el7cp RHEL


How reproducible:

1. Created a pool and wrote some data.
2. Picked one of the shards (the primary) and corrupted an xattr on one of its objects.
3. Scrubbed the PG; the scrub found the inconsistent object.
4. Ran a pg repair, which reported that the object was repaired successfully.
5. Ran a scrub on the PG again; it found the same object inconsistent again.

2017-06-03 20:09:04.877818 osd.0 [INF] 3.17 scrub starts
2017-06-03 20:09:04.879714 osd.0 [ERR] scrub 3.17 3:e98a0d29:::benchmark_data_aircobra.lab.eng.blr.redhat.c_129845_object53:head can't decode 'snapset' attr buffer::malformed_input: void SnapSet::decode(ceph::buffer::list::iterator&) unknown encoding version > 2
2017-06-03 20:09:04.879782 osd.0 [ERR] 3.17 scrub 1 errors

6. Ran a pg repair again; this time it did not repair the object.

2017-06-03 20:17:57.929809 osd.0 [INF] 3.17 repair starts
2017-06-03 20:17:59.446822 osd.0 [ERR] repair 3.17 3:e98a0d29:::benchmark_data_aircobra.lab.eng.blr.redhat.c_129845_object53:head can't decode 'snapset' attr buffer::malformed_input: void SnapSet::decode(ceph::buffer::list::iterator&) unknown encoding version > 2
2017-06-03 20:17:59.446855 osd.0 [ERR] 3.17 repair 1 errors, 0 fixed
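
The failed repair is visible only in the "N errors, M fixed" summary line. A minimal sketch of how such lines could be checked automatically follows. The regular expression is inferred from the log lines quoted in this report, not from any documented Ceph log format, and the function names are hypothetical.

```python
import re

# Assumption: summary lines look like the ones quoted in this report, e.g.
#   "... osd.0 [ERR] 3.17 scrub 1 errors"
#   "... osd.0 [ERR] 3.17 repair 1 errors, 0 fixed"
SUMMARY_RE = re.compile(
    r"\[ERR\] (?P<pgid>\d+\.[0-9a-f]+) (?P<op>scrub|repair) "
    r"(?P<errors>\d+) errors(?:, (?P<fixed>\d+) fixed)?"
)


def summarize(log_lines):
    """Return (pgid, op, errors, fixed) tuples; fixed is None for scrub lines."""
    results = []
    for line in log_lines:
        m = SUMMARY_RE.search(line)
        if m:
            fixed = m.group("fixed")
            results.append((
                m.group("pgid"),
                m.group("op"),
                int(m.group("errors")),
                int(fixed) if fixed is not None else None,
            ))
    return results


def unrepaired(log_lines):
    """Return the PG ids whose repair pass fixed fewer errors than it found."""
    return [
        pgid
        for pgid, op, errors, fixed in summarize(log_lines)
        if op == "repair" and fixed is not None and fixed < errors
    ]


log = [
    "2017-06-03 20:09:04.879782 osd.0 [ERR] 3.17 scrub 1 errors",
    "2017-06-03 20:17:59.446855 osd.0 [ERR] 3.17 repair 1 errors, 0 fixed",
]
print(unrepaired(log))
```

With the two summary lines from this report as input, the sketch flags PG 3.17 as not repaired.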

After the repair, every copy carries the xattr (or data) value of the primary shard. So if the primary (or the "selected" copy) is corrupted, all of the other copies end up corrupted as well.
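
The propagation described above can be illustrated with a toy model. This is only a sketch of the reported behavior, not Ceph's repair code. The shard names, the "valid-snapset" placeholder and the majority-based alternative are hypothetical and are included purely for contrast.

```python
from collections import Counter


def repair_from_selected(shards, selected):
    """Overwrite every shard with the selected copy (the behavior reported here)."""
    value = shards[selected]
    return {name: value for name in shards}


def repair_by_majority(shards):
    """Hypothetical alternative: overwrite every shard with the most common value."""
    value, _count = Counter(shards.values()).most_common(1)[0]
    return {name: value for name in shards}


GOOD = "valid-snapset"          # placeholder for an intact xattr value
BAD = "0000000xxxx000000000"    # the corrupted snapset value from this report

# Hypothetical three-way replicated object whose primary copy is corrupted.
shards = {"primary": BAD, "replica-a": GOOD, "replica-b": GOOD}

after_selected = repair_from_selected(shards, "primary")
after_majority = repair_by_majority(shards)

print(after_selected)  # every copy now holds the corrupted value
print(after_majority)  # every copy now holds the intact value
```

In this model, copying from the selected shard turns one bad copy into three, which matches what the report observes on osd.4 after the repair.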

[osd.4]
getfattr -d -m . ./benchmark\\udata\\uaircobra.lab.eng.blr.redhat.c\\u129845\\uobject53__head_94B05197__3
# file: benchmark\134udata\134uaircobra.lab.eng.blr.redhat.c\134u129845\134uobject53__head_94B05197__3
security.selinux="system_u:object_r:ceph_var_lib_t:s0"
user.ceph._=0sDwghAQAABANdAAAAAAAAADwAAABiZW5jaG1hcmtfZGF0YV9haXJjb2JyYS5sYWIuZW5nLmJsci5yZWRoYXQuY18xMjk4NDVfb2JqZWN0NTP+/////////5dRsJQAAAAAAAMAAAAAAAAABgMcAAAAAwAAAAAAAAD/////AAAAAAAAAAD//////////wAAAAANAAAAAAAAACkAAAAAAAAAAAAAAAAAAAACAhUAAAAInBAAAAAAAAA2AAAAAAAAAAAAAAAAAEAAAAAAAD/JMllPeq4bAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
user.ceph._@1=0sAAAAAAAAAAANAAAAAAAAAAAAAAAAAAAAADQAAAA/yTJZEXQXHCNKM4n/////
user.ceph.snapset="0000000xxxx000000000"
user.cephos.spill_out=0sMAA=
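
For readers comparing dumps between OSDs, here is a minimal sketch for decoding the values printed above. It relies on the getfattr output conventions (a 0s prefix for base64, a 0x prefix for hex, double quotes for text). The function names are hypothetical, and backslash escapes inside quoted values are not handled.

```python
import base64
import binascii


def decode_getfattr_value(raw):
    """Decode one value as printed by `getfattr -d` into bytes."""
    if raw.startswith("0s"):
        return base64.b64decode(raw[2:])      # base64-encoded binary value
    if raw.startswith(("0x", "0X")):
        return binascii.unhexlify(raw[2:])    # hex-encoded binary value
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1].encode()             # quoted text (escapes not handled)
    raise ValueError("unrecognized getfattr value: %r" % raw)


def parse_dump(text):
    """Map attribute names to decoded bytes, skipping blank and '# file:' lines."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, raw = line.partition("=")
        attrs[name] = decode_getfattr_value(raw)
    return attrs


# Three of the attributes from the osd.4 dump in this report.
dump = '''
security.selinux="system_u:object_r:ceph_var_lib_t:s0"
user.ceph.snapset="0000000xxxx000000000"
user.cephos.spill_out=0sMAA=
'''

attrs = parse_dump(dump)
print(attrs["user.ceph.snapset"])
print(attrs["user.cephos.spill_out"])
```

Decoded this way, user.ceph.snapset on osd.4 is a plain 20-byte text string rather than an encoded binary structure, which is consistent with the "can't decode 'snapset' attr" errors in the scrub and repair logs. Its first byte is 0x30 (ASCII "0"). Reading that byte as the cause of the "unknown encoding version > 2" message is an assumption and is not stated in the report.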

Actual results:


Expected results:


Additional info:

Comment 3 Ian Colle 2017-06-06 14:02:50 UTC
Existing bad behavior - fix in async

Comment 6 Greg Farnum 2018-10-31 21:29:42 UTC
As the linked PR got merged to the upstream release stream and this is fixed in subsequent releases, I am closing this bugzilla.

