Bug 1459123 - Pg repair corrupts all shards if the primary shard (or the "selected" copy) is corrupted
Status: ASSIGNED
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 2.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 2.5
Assigned To: David Zafman
QA Contact: Parikshith
Docs Contact:
Depends On:
Blocks:
 
Reported: 2017-06-06 07:08 EDT by Parikshith
Modified: 2017-08-14 17:11 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 20186 None None None 2017-06-12 21:23 EDT

Description Parikshith 2017-06-06 07:08:02 EDT
Description of problem: 

Pg repair corrupts all shards if the xattr (or data) in the primary shard (or the "selected" copy) is corrupted.

Note: this bz covers the 'latent' issue highlighted in https://bugzilla.redhat.com/show_bug.cgi?id=1457097#c10.

Version-Release number of selected component (if applicable): 10.2.7-24.el7cp RHEL


How reproducible:

1. Created a pool and wrote some data.
2. Picked one of the shards (the primary) and corrupted the xattr of one of its objects.
3. Scrubbed the PG; the scrub found the inconsistent object.
4. Ran a pg repair, which successfully repaired the object.
5. Ran scrub again on the PG; it found the same object inconsistent again.

2017-06-03 20:09:04.877818 osd.0 [INF] 3.17 scrub starts
2017-06-03 20:09:04.879714 osd.0 [ERR] scrub 3.17 3:e98a0d29:::benchmark_data_aircobra.lab.eng.blr.redhat.c_129845_object53:head can't decode 'snapset' attr buffer::malformed_input: void SnapSet::decode(ceph::buffer::list::iterator&) unknown encoding version > 2
2017-06-03 20:09:04.879782 osd.0 [ERR] 3.17 scrub 1 errors

6. Ran pg repair again; this time it did not repair the object.

2017-06-03 20:17:57.929809 osd.0 [INF] 3.17 repair starts
2017-06-03 20:17:59.446822 osd.0 [ERR] repair 3.17 3:e98a0d29:::benchmark_data_aircobra.lab.eng.blr.redhat.c_129845_object53:head can't decode 'snapset' attr buffer::malformed_input: void SnapSet::decode(ceph::buffer::list::iterator&) unknown encoding version > 2
2017-06-03 20:17:59.446855 osd.0 [ERR] 3.17 repair 1 errors, 0 fixed
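
The "unknown encoding version > 2" message in the logs above is consistent with a version check failing on garbage bytes. The following is a minimal Python sketch of that idea, not Ceph's actual decoder. It assumes (this report does not confirm it) that the encoded snapset starts with a one-byte struct version followed by a one-byte compat version, and that the decoder rejects input whose compat version is higher than the version it supports. The names decode_header and MalformedInput are hypothetical.

```python
from typing import Tuple


class MalformedInput(Exception):
    """Hypothetical stand-in for a decoder's malformed-input error."""


def decode_header(buf: bytes, supported_version: int) -> Tuple[int, int]:
    """Read an ASSUMED 2-byte header: (struct_version, compat_version).

    Raises MalformedInput when the data claims to need a newer decoder
    than `supported_version`.
    """
    if len(buf) < 2:
        raise MalformedInput("buffer too short for version header")
    struct_v, struct_compat = buf[0], buf[1]
    if supported_version < struct_compat:
        raise MalformedInput(f"unknown encoding version > {supported_version}")
    return struct_v, struct_compat


# The snapset value shown in the getfattr output of this report is plain
# ASCII, so its first two bytes are both ord("0") == 48, far above 2.
corrupted = b"0000000xxxx000000000"
try:
    decode_header(corrupted, supported_version=2)
    error = None
except MalformedInput as exc:
    error = str(exc)
print(error)
```

Under these assumptions, any snapset overwritten with printable text would fail the same check on every scrub or repair, which matches the repeated error in steps 5 and 6.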

After the repair, all copies have the xattr (or data) value of the primary shard. If the primary (or the "selected" copy) is corrupted, all other copies end up corrupted as well.
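
The effect described above can be illustrated with a toy model in Python. This is only a sketch of the policy as described in this report, not Ceph's repair code. osd.0 (the OSD reporting the scrub in the logs) is taken to be the primary, osd.7 is a hypothetical third replica, and repair_by_majority is included purely for comparison; it is not claimed to be the fix.

```python
from collections import Counter
from typing import Dict


def repair_from_primary(shards: Dict[str, bytes], primary: str) -> Dict[str, bytes]:
    """Overwrite every copy with the primary's value (behavior described here)."""
    return {osd: shards[primary] for osd in shards}


def repair_by_majority(shards: Dict[str, bytes]) -> Dict[str, bytes]:
    """Comparison only: overwrite every copy with the most common value."""
    winner, _count = Counter(shards.values()).most_common(1)[0]
    return {osd: winner for osd in shards}


good = b"valid-snapset"          # placeholder for an intact xattr value
bad = b"0000000xxxx000000000"    # corrupted value from this report

# osd.0 is assumed to be the primary; osd.7 is hypothetical.
shards = {"osd.0": bad, "osd.4": good, "osd.7": good}

after_primary = repair_from_primary(shards, primary="osd.0")
after_majority = repair_by_majority(shards)
print(after_primary)
print(after_majority)
```

With the primary-wins policy, the single bad copy replaces both good ones, so no intact copy remains for a later repair to use.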

[osd.4]
getfattr -d -m . ./benchmark\\udata\\uaircobra.lab.eng.blr.redhat.c\\u129845\\uobject53__head_94B05197__3
# file: benchmark\134udata\134uaircobra.lab.eng.blr.redhat.c\134u129845\134uobject53__head_94B05197__3
security.selinux="system_u:object_r:ceph_var_lib_t:s0"
user.ceph._=0sDwghAQAABANdAAAAAAAAADwAAABiZW5jaG1hcmtfZGF0YV9haXJjb2JyYS5sYWIuZW5nLmJsci5yZWRoYXQuY18xMjk4NDVfb2JqZWN0NTP+/////////5dRsJQAAAAAAAMAAAAAAAAABgMcAAAAAwAAAAAAAAD/////AAAAAAAAAAD//////////wAAAAANAAAAAAAAACkAAAAAAAAAAAAAAAAAAAACAhUAAAAInBAAAAAAAAA2AAAAAAAAAAAAAAAAAEAAAAAAAD/JMllPeq4bAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
user.ceph._@1=0sAAAAAAAAAAANAAAAAAAAAAAAAAAAAAAAADQAAAA/yTJZEXQXHCNKM4n/////
user.ceph.snapset="0000000xxxx000000000"
user.cephos.spill_out=0sMAA=
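
To inspect values like the ones above, note that getfattr prints binary values with a 0s prefix (base64) or a 0x prefix (hex), and text values in double quotes. The following Python sketch parses such a dump into raw bytes. The function names are hypothetical, the "# file:" line in the sample is a placeholder, and backslash escapes inside quoted values are not handled.

```python
import base64
from typing import Dict


def decode_getfattr_value(raw: str) -> bytes:
    """Decode one value as printed by `getfattr -d`."""
    if raw.startswith("0s"):   # base64-encoded binary value
        return base64.b64decode(raw[2:])
    if raw.startswith("0x"):   # hex-encoded binary value
        return bytes.fromhex(raw[2:])
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        # Quoted text value; escape sequences are not handled in this sketch.
        return raw[1:-1].encode()
    return raw.encode()


def parse_getfattr_dump(text: str) -> Dict[str, bytes]:
    """Map attribute names to decoded bytes, skipping comments and blanks."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")  # split on the first '=' only
        attrs[name] = decode_getfattr_value(value)
    return attrs


dump = '''\
# file: example
user.ceph.snapset="0000000xxxx000000000"
user.cephos.spill_out=0sMAA=
'''
attrs = parse_getfattr_dump(dump)
print(attrs["user.cephos.spill_out"])
print(attrs["user.ceph.snapset"][:2])
```

Decoded this way, the snapset on osd.4 is visibly printable text rather than a binary encoding, which makes the corrupted copy easy to spot when comparing shards.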

Actual results:


Expected results:


Additional info:
Comment 3 Ian Colle 2017-06-06 10:02:50 EDT
Existing bad behavior - fix in async
