Description of problem:
========================
As part of verifying bug 1403840 - [GSS] xattr 'replica.split-brain-status' shows the file is in data split-brain but "heal split-brain latest-mtime" fails - I came up with a test case to sanity-check resolution of split-brain using the bigger-file option. When I tried it, the resolution failed: while the data heal itself completed successfully, the CLI reported the split-brain resolution as failed. After that, the file shows up as pending heal.

Ravi (Dev.) debugged this and found that it is because of pending metadata heals.

Following is the information:

[root@dhcp35-37 rep2]# gluster v heal rep2 split-brain bigger-file /testbig
Healing /testbig failed: File not in split-brain.
Volume heal failed.

[root@dhcp35-37 rep2]# gluster v heal rep2 info
Brick 10.70.35.116:/rhs/brick1/rep2
/testbig - Is in split-brain

[root@dhcp35-37 rep2]# gluster v heal rep2 split-brain bigger-file /testbig
Healing /testbig failed: File not in split-brain.
Volume heal failed.
===> Because of this, the file is seen as heal pending.

I have tested with two files and both end up with pending heal:

[root@dhcp35-37 rep2]# gluster v heal rep2 info
Brick 10.70.35.116:/rhs/brick1/rep2
/bigfile
/testbig
Status: Connected
Number of entries: 2

Brick 10.70.35.239:/rhs/brick1/rep2
Status: Connected
Number of entries: 0

[root@dhcp35-37 rep2]#
/bigfile
Status: Connected
Number of entries: 2

Brick 10.70.35.239:/rhs/brick1/rep2
/testbig - Is in split-brain
Status: Connected
Number of entries: 1

Backend bricks ==> heal is successful:

[root@dhcp35-116 ~]# md5sum /rhs/brick1/rep2/testbig
031bf15433a0c324c3c36b03b4ea384c  /rhs/brick1/rep2/testbig

[root@dhcp35-239 ~]# md5sum /rhs/brick1/rep2/testbig
031bf15433a0c324c3c36b03b4ea384c  /rhs/brick1/rep2/testbig

Volume Name: rep2
Type: Replicate
Volume ID: 778d60b1-981b-4a33-9ed7-a7c09a389fa4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.35.116:/rhs/brick1/rep2
Brick2: 10.70.35.239:/rhs/brick1/rep2
Options Reconfigured:
cluster.self-heal-daemon: disable
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

[root@dhcp35-37 rep2]# gluster v status rep2
Status of volume: rep2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.116:/rhs/brick1/rep2         49160     0          Y       18035
Brick 10.70.35.239:/rhs/brick1/rep2         49160     0          Y       15909

Task Status of Volume rep2
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-37 rep2]#

Version-Release number of selected component (if applicable):
=================================
3.8.4-11

How reproducible:
====================
Mostly

Steps to Reproduce:
1. Have a 1x2 (replicate) volume.
2. Disable client-side and server-side healing.
3. Write a file f1 with, say, 1 line.
4. Bring down brick b1 and append 10 lines to f1.
5. Bring b1 back up, then bring down b2.
6. Delete the data of f1 using "> f1".
7. Bring b2 back online.
8. The file is now seen in split-brain.
9. Try to resolve it using the bigger-file option; the resolution fails (a rough shell sketch of these steps follows).
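A rough shell sketch of the steps above, assuming a 1x2 volume named rep2, bricks on node1/node2 at /rhs/brick1/rep2, and a client mount at /mnt/rep2. The node names, the mount point, the file path /f1 and the way bricks are taken down (killing their glusterfsd PIDs and restarting them with "start force") are my assumptions for illustration, not taken verbatim from this bug:

# on any cluster node: disable client- and server-side self-heal
gluster volume set rep2 cluster.self-heal-daemon disable
gluster volume set rep2 cluster.data-self-heal off
gluster volume set rep2 cluster.metadata-self-heal off
gluster volume set rep2 cluster.entry-self-heal off

# on the client: create f1 with one line
echo "line 0" > /mnt/rep2/f1

# on node1: bring down brick b1 by killing its glusterfsd process
kill "$(gluster volume status rep2 | awk '/node1:\/rhs\/brick1\/rep2/ {print $NF}')"

# on the client: append 10 lines (only b2 receives them)
for i in $(seq 1 10); do echo "line $i" >> /mnt/rep2/f1; done

# on any node: restart the dead brick, then kill b2 the same way on node2
gluster volume start rep2 force
kill "$(gluster volume status rep2 | awk '/node2:\/rhs\/brick1\/rep2/ {print $NF}')"

# on the client: truncate f1, so b1 now holds the smaller copy
> /mnt/rep2/f1

# on any node: bring b2 back online
gluster volume start rep2 force

# the file should now be reported in split-brain
gluster volume heal rep2 info

# attempt the resolution that fails in this bug
gluster volume heal rep2 split-brain bigger-file /f1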
The RCA is given in https://bugzilla.redhat.com/show_bug.cgi?id=1403840#c13. Basically, when a file is in data split-brain *and* has pending metadata heals (but is not in metadata split-brain), and the bigger-file option of the CLI is used, then after healing the data split-brain it also tries to heal the metadata, which fails because the metadata is not in split-brain. Hence the CLI reports that the file is not in split-brain.
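One way to see the state described above is to dump the AFR changelog xattrs on the backend copies of the file (the brick path below is taken from the outputs in the description; the layout shown - data/metadata/entry pending counters packed as three 32-bit values - is the standard AFR changelog format, and the sample value is illustrative, not from this bug):

# run on each brick node
getfattr -d -m . -e hex /rhs/brick1/rep2/testbig

# e.g. trusted.afr.rep2-client-1=0x 00000002 00000001 00000000
#                                   (data)   (meta)   (entry)   [hex groups separated for readability]
# Data split-brain: both bricks carry a non-zero data counter blaming the other.
# Pending metadata heal without metadata split-brain: only one brick carries a
# non-zero metadata counter. The bigger-file path heals the data part, then
# trips over this metadata-only pending state and reports "File not in split-brain".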
Based on comment 2, my understanding is that this BZ is already addressed and there's no work required on this bug? Any specific reason why this BZ is hanging around for so many months now?
It looks like Ravi has context about this bug. Moving it to him.
(In reply to Atin Mukherjee from comment #6)
> Based on comment 2, my understanding is that this BZ is already addressed
> and there's no work required on this bug? Any specific reason why this BZ is
> hanging around for so many months now?

BZ 1403840 is for a different issue and does not fix this bug.