Description of problem:
Pg repair corrupts all shards if the xattr (or data) in the primary shard (or the "selected" copy) is corrupted.

Note: this bz covers the 'latent' issue highlighted in https://bugzilla.redhat.com/show_bug.cgi?id=1457097#c10.

Version-Release number of selected component (if applicable):
10.2.7-24.el7cp RHEL

How reproducible:

1. Created a pool and wrote some data.
2. Picked one of the shards (the primary) and corrupted the xattr of one of its objects.
3. Scrubbed the pg; it found the inconsistent object.
4. Did a pg repair, which successfully repaired the object.
5. Ran scrub again on the pg; it found the same object inconsistent again.

2017-06-03 20:09:04.877818 osd.0 [INF] 3.17 scrub starts
2017-06-03 20:09:04.879714 osd.0 [ERR] scrub 3.17 3:e98a0d29:::benchmark_data_aircobra.lab.eng.blr.redhat.c_129845_object53:head can't decode 'snapset' attr buffer::malformed_input: void SnapSet::decode(ceph::buffer::list::iterator&) unknown encoding version > 2
2017-06-03 20:09:04.879782 osd.0 [ERR] 3.17 scrub 1 errors

6. Did a pg repair again; this time it did not repair the object.

2017-06-03 20:17:57.929809 osd.0 [INF] 3.17 repair starts
2017-06-03 20:17:59.446822 osd.0 [ERR] repair 3.17 3:e98a0d29:::benchmark_data_aircobra.lab.eng.blr.redhat.c_129845_object53:head can't decode 'snapset' attr buffer::malformed_input: void SnapSet::decode(ceph::buffer::list::iterator&) unknown encoding version > 2
2017-06-03 20:17:59.446855 osd.0 [ERR] 3.17 repair 1 errors, 0 fixed

After the repair, all copies have the xattr (or data) value of the primary shard. If the primary (or the "selected" copy) is somehow corrupted, all of its copies will be corrupted as well.

[osd.4] getfattr -d -m . ./benchmark\\udata\\uaircobra.lab.eng.blr.redhat.c\\u129845\\uobject53__head_94B05197__3
# file: benchmark\134udata\134uaircobra.lab.eng.blr.redhat.c\134u129845\134uobject53__head_94B05197__3
security.selinux="system_u:object_r:ceph_var_lib_t:s0"
user.ceph._=0sDwghAQAABANdAAAAAAAAADwAAABiZW5jaG1hcmtfZGF0YV9haXJjb2JyYS5sYWIuZW5nLmJsci5yZWRoYXQuY18xMjk4NDVfb2JqZWN0NTP+/////////5dRsJQAAAAAAAMAAAAAAAAABgMcAAAAAwAAAAAAAAD/////AAAAAAAAAAD//////////wAAAAANAAAAAAAAACkAAAAAAAAAAAAAAAAAAAACAhUAAAAInBAAAAAAAAA2AAAAAAAAAAAAAAAAAEAAAAAAAD/JMllPeq4bAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
user.ceph._@1=0sAAAAAAAAAAANAAAAAAAAAAAAAAAAAAAAADQAAAA/yTJZEXQXHCNKM4n/////
user.ceph.snapset="0000000xxxx000000000"
user.cephos.spill_out=0sMAA=

Actual results:

Expected results:

Additional info:
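To make the failure mode above concrete, here is a toy Python model of it. This is only a sketch and not Ceph's repair code: the Shard class, the decodes() check (first byte treated as an encoding version that must be <= 2), the "good" snapset bytes and the third OSD id (5) are all made up for illustration. Only osd.0 as primary, osd.4 as a replica and the corrupted snapset string come from this report.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    osd: int        # OSD id holding this copy
    snapset: bytes  # stand-in for the user.ceph.snapset xattr value


def decodes(snapset: bytes) -> bool:
    # Toy validity check (assumption, not SnapSet::decode):
    # treat the first byte as an encoding version that must be <= 2.
    return len(snapset) > 0 and snapset[0] <= 2


def naive_repair(shards: List[Shard], primary_osd: int) -> None:
    # Models the behavior reported here: the primary's value is pushed
    # to every copy without checking that it is valid.
    source = next(s for s in shards if s.osd == primary_osd)
    for s in shards:
        s.snapset = source.snapset


def checked_repair(shards: List[Shard]) -> bool:
    # One possible alternative (hypothetical): only copies that decode may
    # serve as the source, and the most common valid value wins.
    candidates = [s.snapset for s in shards if decodes(s.snapset)]
    if not candidates:
        return False  # nothing trustworthy to repair from
    value, _count = Counter(candidates).most_common(1)[0]
    for s in shards:
        s.snapset = value
    return True


def make_shards() -> List[Shard]:
    good = b"\x02\x01\x00\x00"    # made-up "valid" value
    bad = b"0000000xxxx000000000"  # corrupted value from this report
    return [Shard(0, bad), Shard(4, good), Shard(5, good)]


if __name__ == "__main__":
    a = make_shards()
    naive_repair(a, primary_osd=0)
    print("naive:  ", [decodes(s.snapset) for s in a])

    b = make_shards()
    ok = checked_repair(b)
    print("checked:", ok, [decodes(s.snapset) for s in b])
```

In this model, naive_repair() leaves no copy that still decodes, which matches the second scrub and repair failing above, while checked_repair() restores the primary from the replicas.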
Existing bad behavior - fix in async
As the linked PR was merged into the upstream release stream and this is fixed in subsequent releases, I am closing this bugzilla.