Description of problem:
=======================
(always reproducible on a distrep volume)
In a distributed-replicated volume, when I set the favorite child policy to ctime and there is a data heal pending together with a metadata split-brain, the split-brains are resolved inconsistently: some files select one brick as the source and the remaining files select the other brick.

Version-Release number of selected component (if applicable):
============
3.8.4-13

How reproducible:
==============
always on a distrep volume

Steps to Reproduce:
===================
1. Create a 2x2 volume with n1 hosting b1,b3 and n2 hosting b2,b4, where b1-b2 are a replica pair and b3-b4 are a replica pair.
2. Set the favorite child policy to ctime.
3. Mount the volume on 3 clients, say c1, c2 and c3, as below:
   c1 can see only n1
   c2 can see only n2
   c3 can see both
4. Disable the heal daemon.
5. From c3, create ten files, say file{1..10}, in a directory. These files must be seen by all clients and are distributed across both dht subvols.
6. From c2, append some data to all files, so that the files now have a data heal pending.
7. Wait for 2 minutes and check from c1 by issuing ls or ll. The data WOULD NOT be synced, as the heal daemon is disabled and there is a network partition, whereas the xattrs of the files on the bricks will have pending data attributes on n2 marked for n1.
8. Now do a metadata change on all files, say "chmod 0000 file*".
9. Wait for 2 minutes, then from c2 do a metadata change, say "chmod 0777 file*". This means there is a data heal pending with n2 as source and n1 as sink, and a metadata split-brain, as n1 blames n2 and vice versa.
10. Enable the heal daemon and trigger the heal command.
Now below is what happens:

Expected:
=========
On all bricks of both nodes, all files must have permissions 0777, as this was the last change and hence has the latest ctime, and the data must be synced.

What is seen or current behavior:
=================
While the data syncing happens well, the metadata healing based on the policy is not happening as expected, with shd logs logging "conflict" as below
(all files were supposed to have -rwxrwxrwx., but only some files have it)

[root@dhcp35-116 ~]# ll /rhs/brick*/distrep/dir9/
/rhs/brick1/distrep/dir9/:
total 5000
-rwxrwxrwx. 2 root root 1024000 Jan 28 20:46 data10
-rwxrwxrwx. 2 root root 1024000 Jan 28 20:46 data2
----------. 2 root root 1024000 Jan 28 20:46 data3
-rwxrwxrwx. 2 root root 1024000 Jan 28 20:46 data4
----------. 2 root root 1024000 Jan 28 20:46 data9

/rhs/brick2/distrep/dir9/:
total 5000
----------. 2 root root 1024000 Jan 28 20:46 data1
----------. 2 root root 1024000 Jan 28 20:46 data5
----------. 2 root root 1024000 Jan 28 20:46 data6
----------. 2 root root 1024000 Jan 28 20:46 data7
----------. 2 root root 1024000 Jan 28 20:46 data8

[2017-01-28 15:23:33.890828] I [MSGID: 108026] [afr-self-heal-common.c:1174:afr_log_selfheal] 0-distrep-replicate-0: Completed data selfheal on 5fa7202f-4c19-461c-9be1-1fd620b870d9. sources=[1] sinks=0
[2017-01-28 15:23:33.890929] I [MSGID: 108026] [afr-self-heal-common.c:1174:afr_log_selfheal] 0-distrep-replicate-1: Completed data selfheal on ac459fec-0bfe-4988-be08-462e27091167. sources=[1] sinks=0
[2017-01-28 15:23:33.895133] W [MSGID: 108042] [afr-self-heal-common.c:828:afr_mark_split_brain_source_sinks_by_policy] 0-distrep-replicate-0: Source distrep-client-1 selected as authentic to resolve conflicting data in file (gfid:5fa7202f-4c19-461c-9be1-1fd620b870d9) by CTIME (1024000 bytes @ 2017-01-28 20:46:59 mtime, 2017-01-28 20:53:33 ctime).
[2017-01-28 15:23:33.895851] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-distrep-replicate-0: performing metadata selfheal on 5fa7202f-4c19-461c-9be1-1fd620b870d9
[2017-01-28 15:23:33.896491] W [MSGID: 108042] [afr-self-heal-common.c:828:afr_mark_split_brain_source_sinks_by_policy] 0-distrep-replicate-1: Source distrep-client-2 selected as authentic to resolve conflicting data in file (gfid:64bd8fa9-3cd8-4fe7-9be8-12d735e6d996) by CTIME (1024000 bytes @ 2017-01-28 20:53:33 mtime, 2017-01-28 20:53:33 ctime).
[2017-01-28 15:23:33.896857] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-distrep-replicate-1: performing metadata selfheal on 64bd8fa9-3cd8-4fe7-9be8-12d735e6d996
[2017-01-28 15:23:33.898893] I [MSGID: 108026] [afr-self-heal-common.c:1174:afr_log_selfheal] 0-distrep-replicate-0: Completed metadata selfheal on 5fa7202f-4c19-461c-9be1-1fd620b870d9. sources=[1] sinks=0
[2017-01-28 15:23:33.900558] I [MSGID: 108026] [afr-self-heal-common.c:1174:afr_log_selfheal] 0-distrep-replicate-1: Completed metadata selfheal on 64bd8fa9-3cd8-4fe7-9be8-12d735e6d996. sources=[0] sinks=1
[2017-01-28 15:23:33.909158] W [MSGID: 108042] [afr-self-heal-common.c:828:afr_mark_split_brain_source_sinks_by_policy] 0-distrep-replicate-0: Source distrep-client-0 selected as authentic to resolve conflicting data in file (gfid:fbe34602-d1ac-47c4-a2db-69cc958a75ab) by CTIME (1024000 bytes @ 2017-01-28 20:53:33 mtime, 2017-01-28 20:53:33 ctime).
[2017-01-28 15:23:33.910249] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-distrep-replicate-0: performing metadata selfheal on fbe34602-d1ac-47c4-a2db-69cc958a75ab
[2017-01-28 15:23:33.916371] I [MSGID: 108026] [afr-self-heal-common.c:1174:afr_log_selfheal] 0-distrep-replicate-0: Completed metadata selfheal on fbe34602-d1ac-47c4-a2db-69cc958a75ab. sources=[0] sinks=1
[2017-01-28 15:23:33.946447] I [MSGID: 108026] [afr-self-heal-common.c:1174:afr_log_selfheal] 0-distrep-replicate-1: Completed data selfheal on fefaf3f3-275f-44df-bb16-6798b7beb677. sources=[1] sinks=0
[2017-01-28 15:23:33.953614] W [MSGID: 108042] [afr-self-heal-common.c:828:afr_mark_split_brain_source_sinks_by_policy] 0-distrep-replicate-1: Source distrep-client-2 selected as authentic to resolve conflicting data in file (gfid:5bbefe9a-26db-4168-b523-68d243ddbaa3) by CTIME (1024000 bytes @ 2017-01-28 20:53:33 mtime, 2017-01-28 20:53:33 ctime).
[2017-01-28 15:23:33.954040] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-distrep-replicate-1: performing metadata selfheal on 5bbefe9a-26db-4168-b523-68d243ddbaa3
[2017-01-28 15:23:33.957485] I [MSGID: 108026] [afr-self-heal-common.c:1174:afr_log_selfheal] 0-distrep-replicate-0: Completed data selfheal on d076f826-25ec-44db-8312-8f5b8828f25b. sources=[1] sinks=0
[2017-01-28 15:23:33.958104] I [MSGID: 108026] [afr-self-heal-common.c:1174:afr_log_selfheal] 0-distrep-replicate-1: Completed metadata selfheal on 5bbefe9a-26db-4168-b523-68d243ddbaa3. sources=[0] sinks=1
[2017-01-28 15:23:33.964549] W [MSGID: 108042] [afr-self-heal-common.c:828:afr_mark_split_brain_source_sinks_by_policy] 0-distrep-replicate-0: Source distrep-client-0 selected as authentic to resolve conflicting data in file (gfid:8264dcd2-1988-43f0-a060-e29bd491c101) by CTIME (1024000 bytes @ 2017-01-28 20:53:33 mtime, 2017-01-28 20:53:33 ctime).
[2017-01-28 15:23:33.965289] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-distrep-replicate-0: performing metadata selfheal on 8264dcd2-1:
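To make the inconsistency in the shd log above easier to see, the MSGID 108042 warnings can be tallied per replicate subvolume. The following is only a rough sketch: the function name `tally_policy_sources`, the regular expression and the `SAMPLE_LINES` list are illustrative assumptions, not part of glusterfs, and the sample lines are abridged copies of the warnings in this report (the size and timestamp details after "by CTIME" are cut).

```python
import re
from collections import Counter

# Assumed pattern for the policy-selection warning (MSGID 108042) as it
# appears in this report; other glusterfs versions may word it differently.
POLICY_RE = re.compile(
    r"\] (?P<subvol>\S+): Source (?P<source>\S+) selected as authentic "
    r"to resolve conflicting data in file \(gfid:(?P<gfid>[0-9a-f-]+)\) "
    r"by (?P<policy>\w+)"
)

_FUNC = ("[afr-self-heal-common.c:828:"
         "afr_mark_split_brain_source_sinks_by_policy]")

# Abridged copies of the five warnings quoted in the report.
SAMPLE_LINES = [
    f"[2017-01-28 15:23:33.895133] W [MSGID: 108042] {_FUNC} 0-distrep-replicate-0: "
    "Source distrep-client-1 selected as authentic to resolve conflicting data "
    "in file (gfid:5fa7202f-4c19-461c-9be1-1fd620b870d9) by CTIME",
    f"[2017-01-28 15:23:33.896491] W [MSGID: 108042] {_FUNC} 0-distrep-replicate-1: "
    "Source distrep-client-2 selected as authentic to resolve conflicting data "
    "in file (gfid:64bd8fa9-3cd8-4fe7-9be8-12d735e6d996) by CTIME",
    f"[2017-01-28 15:23:33.909158] W [MSGID: 108042] {_FUNC} 0-distrep-replicate-0: "
    "Source distrep-client-0 selected as authentic to resolve conflicting data "
    "in file (gfid:fbe34602-d1ac-47c4-a2db-69cc958a75ab) by CTIME",
    f"[2017-01-28 15:23:33.953614] W [MSGID: 108042] {_FUNC} 0-distrep-replicate-1: "
    "Source distrep-client-2 selected as authentic to resolve conflicting data "
    "in file (gfid:5bbefe9a-26db-4168-b523-68d243ddbaa3) by CTIME",
    f"[2017-01-28 15:23:33.964549] W [MSGID: 108042] {_FUNC} 0-distrep-replicate-0: "
    "Source distrep-client-0 selected as authentic to resolve conflicting data "
    "in file (gfid:8264dcd2-1988-43f0-a060-e29bd491c101) by CTIME",
]


def tally_policy_sources(lines):
    """Count how often each client was picked as source, per subvolume."""
    counts = Counter()
    for line in lines:
        match = POLICY_RE.search(line)
        if match:  # lines that are not policy warnings are skipped
            counts[(match.group("subvol"), match.group("source"))] += 1
    return counts


if __name__ == "__main__":
    for (subvol, source), n in sorted(tally_policy_sources(SAMPLE_LINES).items()):
        print(f"{subvol}: {source} chosen {n} time(s)")
```

On the quoted warnings this shows 0-distrep-replicate-0 choosing both distrep-client-0 and distrep-client-1 as source for different files, while 0-distrep-replicate-1 chooses distrep-client-2 each time. To run it against a real log, pass the open log file object instead of `SAMPLE_LINES`.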
Created attachment 1245410
test logs
Volume Name: distrep
Type: Distributed-Replicate
Volume ID: df5319f0-d889-4030-bb39-b8a41936a726
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick1/distrep
Brick2: 10.70.35.116:/rhs/brick1/distrep
Brick3: 10.70.35.37:/rhs/brick2/distrep
Brick4: 10.70.35.116:/rhs/brick2/distrep
Options Reconfigured:
cluster.favorite-child-policy: ctime
cluster.self-heal-daemon: enable
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@dhcp35-196 glusterfs]#
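For reference, the behaviour this report expects with cluster.favorite-child-policy set to ctime can be modelled in a few lines. This is a toy model under stated assumptions, not AFR's actual implementation: `BrickReply` and `pick_source_by_ctime` are hypothetical names, and the ctime values are made-up numbers that only preserve the ordering from the reproduction steps (chmod 0000 first, chmod 0777 about two minutes later).

```python
from dataclasses import dataclass


@dataclass
class BrickReply:
    """Hypothetical stand-in for the stat data one brick returns for a file."""
    brick: str
    ctime: float  # seconds; only the ordering matters in this sketch
    mode: int     # permission bits


def pick_source_by_ctime(replies):
    """Return the reply with the latest ctime (first one wins on a tie)."""
    return max(replies, key=lambda reply: reply.ctime)


if __name__ == "__main__":
    replies = [
        BrickReply(brick="b1", ctime=1000.0, mode=0o000),  # earlier chmod 0000
        BrickReply(brick="b2", ctime=1120.0, mode=0o777),  # later chmod 0777
    ]
    source = pick_source_by_ctime(replies)
    print(f"expected source: {source.brick}, mode {oct(source.mode)}")
```

Under this model every file would end up with mode 0777 on both bricks of its replica pair, which is what the listing in the description fails to show.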
Needinfo hasn't been addressed for more than a year now. Is this bug still valid in the latest releases? Can we please have an assessment on this RHBZ?