Description of problem:
Data loss after self-healing of files written during a brick crash.

Version-Release number of selected component (if applicable):
2.0u4

How reproducible:
-) About 20% of the time.

Steps to Reproduce:

## failure test for writing.
[root@client-a1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if=/dev/zero of="/glusterfs/benchmark${LOOP}.dmp" bs=64k count="$(( 1024 / 64 * 100 ))"; done; sync;
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.53119 s, 11.0 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.59298 s, 10.9 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.68409 s, 10.8 MB/s
<PROCESSING>

## crash one storage brick while the writes are still in progress.
[root@brick-a1 ~]# sysctl kernel.sysrq;
kernel.sysrq = 0
[root@brick-a1 ~]# sysctl -w kernel.sysrq=1;
kernel.sysrq = 1
[root@brick-a1 ~]# echo c > /proc/sysrq-trigger; ## WARNING: crashes the system!

[root@client-a1 ~]# <STILL PROCESSING>
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 36.0099 s, 2.9 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.62357 s, 10.9 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.67722 s, 10.8 MB/s

[root@client-a1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if="/glusterfs/benchmark${LOOP}.dmp" bs=64k status=noxfer | md5sum -; done;
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -

[root@brick-a2 ~]# gluster volume heal 'glusterfs' info;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 0

Brick brick-b1:/glusterfs
Number of entries: 2
/benchmark5.dmp
/benchmark4.dmp

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0

## reboot the crashed storage brick and log in again.
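An aside on the checksums: every file is written from /dev/zero with bs=64k and count=1600, i.e. exactly 104857600 zero bytes, so all six md5sums are necessarily identical. That expected value can be confirmed locally, without touching the volume (a minimal sketch, no GlusterFS involved):

```shell
#!/bin/sh
# Hash 104857600 zero bytes (the exact size dd writes above) locally
# to confirm the per-file checksum the read loop should report.
SUM="$(dd if=/dev/zero bs=64k count="$(( 1024 / 64 * 100 ))" 2>/dev/null | md5sum | awk '{print $1}')"
echo "$SUM"
# Prints 2f282b84e7e608d5852449ed940bfc51, the value shown six times above.
```

Any deviation from this value on a later read therefore indicates corrupted or truncated data, not a benign difference.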
[root@brick-a1 ~]# service glusterd status;
glusterd is stopped
[root@brick-a1 ~]# service glusterd start;
Starting glusterd:                                         [  OK  ]
[root@brick-a1 ~]# gluster peer status;
...

## "10.8. Triggering Self-Heal on Replicate"
## "In replicate module, previously you had to manually trigger a self-heal when a brick goes offline and comes back online,
## to bring all the replicas in sync. Now the pro-active self-heal daemon runs in the background,
## diagnoses issues and automatically initiates self-healing every 10 minutes on the files which require healing."
## https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.0/html/Administration_Guide/sect-User_Guide-Managing_Volumes-Self_heal.html

## superfluous
[root@brick-a1 ~]# time gluster volume heal 'glusterfs';
Heal operation on volume glusterfs has been successful

real 0m0.086s
user 0m0.053s
sys 0m0.010s

## non-functional?
[root@brick-a1 ~]# time gluster volume heal 'glusterfs' full;
Heal operation on volume glusterfs has been successful

real 0m0.077s
user 0m0.056s
sys 0m0.007s

[root@brick-a1 ~]# gluster volume heal 'glusterfs' info;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 0

Brick brick-b1:/glusterfs
Number of entries: 0

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0

## note that split-brain does *not* *yet* list any failures.
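Because the pro-active self-heal daemon only kicks in every 10 minutes, one ends up re-running `gluster volume heal <vol> info` by hand until the pending entries drain. That wait can be scripted; this is a hedged sketch (`wait_until_clean` is an ad-hoc helper, not a gluster command, and the parsing assumes the `Number of entries:` output format shown above):

```shell
#!/bin/sh
# Sketch: run an info command repeatedly until every "Number of entries:"
# line reports 0, or give up after TRIES iterations (one second apart).
# wait_until_clean is a hypothetical helper, not part of the gluster CLI.
wait_until_clean() {
    CMD="$1"; TRIES="${2:-30}"; i=0
    while [ "$i" -lt "$TRIES" ]; do
        # Clean when no "Number of entries:" line carries a non-zero count.
        if ! sh -c "$CMD" | grep 'Number of entries:' | grep -qv ': 0$'; then
            return 0
        fi
        i=$((i + 1)); sleep 1
    done
    return 1
}
# Usage, e.g. on a brick node:
#   wait_until_clean "gluster volume heal glusterfs info" 60
```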
[root@brick-a1 ~]# gluster volume heal 'glusterfs' info split-brain;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 0

Brick brick-b1:/glusterfs
Number of entries: 0

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0

[root@brick-a1 ~]# gluster volume heal 'glusterfs' info healed;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 1
at                    path on brick
-----------------------------------
2013-07-10 14:27:44 <gfid:d3cc0fb2-b834-4f0a-94ef-a89a5681fd1f>

Brick brick-b1:/glusterfs
Number of entries: 2
at                    path on brick
-----------------------------------
2013-07-10 14:28:19 /benchmark4.dmp
2013-07-10 14:28:17 /benchmark5.dmp

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0

## TODO: I/O error bug.
[root@client-a1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if="/glusterfs/benchmark${LOOP}.dmp" of=/dev/null bs=64k; done;
dd: opening `/glusterfs/benchmark1.dmp': Input/output error
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.50641 s, 11.0 MB/s
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000238259 s, 0.0 kB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.56144 s, 11.0 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.66634 s, 10.8 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.40601 s, 11.1 MB/s

## TODO: I/O error bug.
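The read loop above only surfaces failures through dd's first error message on stderr. When repeating this test it helps to collect exactly which files return EIO. A hedged sketch (`check_reads` is an ad-hoc helper; the directory argument is a placeholder for the mount point):

```shell
#!/bin/sh
# Sketch: read every benchmark file and collect those that fail (e.g. with
# the Input/output error seen above) instead of relying on dd's stderr.
# check_reads is an ad-hoc helper; pass the mount point, e.g. /glusterfs.
check_reads() {
    DIR="$1"; FAILED=""
    for F in "$DIR"/benchmark*.dmp; do
        if ! dd if="$F" of=/dev/null bs=64k 2>/dev/null; then
            FAILED="$FAILED $F"
        fi
    done
    # Non-zero exit and a summary line if anything was unreadable.
    [ -z "$FAILED" ] || { echo "unreadable:$FAILED"; return 1; }
}
```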
[root@client-b1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if="/glusterfs/benchmark${LOOP}.dmp" of=/dev/null bs=64k; done;
dd: opening `/glusterfs/benchmark1.dmp': Input/output error
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 1.03918 s, 101 MB/s
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000434988 s, 0.0 kB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 0.456711 s, 230 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 0.443782 s, 236 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 1.01025 s, 104 MB/s

[root@client-a1 ~]# ls -las /glusterfs/;
total 434204
     0 drwxr-xr-x.  3 root root       172 10. Jul 15:28 .
     4 dr-xr-xr-x. 24 root root      4096  9. Jul 16:12 ..
 24600 -rw-r--r--.  1 root root  25190400 10. Jul 16:00 benchmark1.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark2.dmp
     0 -rw-r--r--.  1 root root         0 10. Jul 16:00 benchmark3.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark4.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark5.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:02 benchmark6.dmp

[root@client-b1 ~]# ls -las /glusterfs/;
total 512004
     0 drwxr-xr-x.  3 root root       172 10. Jul 15:28 .
     4 dr-xr-xr-x. 24 root root      4096  9. Jul 15:00 ..
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark1.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark2.dmp
     0 -rw-r--r--.  1 root root         0 10. Jul 16:00 benchmark3.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark4.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark5.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:02 benchmark6.dmp

[root@client-a1 ~]# umount /glusterfs/;
[root@client-a1 ~]# mount /glusterfs/;
unknown option _netdev (ignored)
[root@client-a1 ~]# ls -las /glusterfs/;
total 367772
     0 drwxr-xr-x.  3 root root       172 10. Jul 15:28 .
     4 dr-xr-xr-x. 24 root root      4096  9. Jul 16:12 ..
 24600 -rw-r--r--.  1 root root  25190400 10. Jul 16:00 benchmark1.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark2.dmp
     0 -rw-r--r--.  1 root root         0 10. Jul 16:00 benchmark3.dmp
 35968 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark4.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark5.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:02 benchmark6.dmp

[root@client-b1 ~]# umount /glusterfs/;
[root@client-b1 ~]# mount /glusterfs/;
unknown option _netdev (ignored)
[root@client-b1 ~]# ls -las /glusterfs/;
total 445572
     0 drwxr-xr-x.  3 root root       172 10. Jul 15:28 .
     4 dr-xr-xr-x. 24 root root      4096  9. Jul 15:00 ..
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark1.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark2.dmp
     0 -rw-r--r--.  1 root root         0 10. Jul 16:00 benchmark3.dmp
 35968 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark4.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark5.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:02 benchmark6.dmp

[root@client-a1 ~]# grep -i -- 'failed' /var/log/glusterfs/glusterfs-.log | tail -n10;
[2013-07-10 16:01:46.060097] W [client3_1-fops.c:1545:client3_1_finodelk_cbk] 0-glusterfs-client-0: remote operation failed: Transport endpoint is not connected
[2013-07-10 16:01:47.050480] E [socket.c:1715:socket_connect_finish] 0-glusterfs-client-0: connection to 10.128.61.60:24010 failed (No route to host)
[2013-07-10 16:02:00.062492] E [socket.c:1715:socket_connect_finish] 0-glusterfs: connection to 10.128.61.60:24007 failed (No route to host)
[2013-07-10 16:30:08.242893] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background data gfid self-heal failed on /benchmark1.dmp
[2013-07-10 16:30:08.258225] W [afr-open.c:213:afr_open] 0-glusterfs-replicate-0: failed to open as split brain seen, returning EIO
[2013-07-10 16:35:15.111284] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background data gfid self-heal failed on /benchmark1.dmp
[2013-07-10 16:35:15.111672] W [afr-open.c:213:afr_open] 0-glusterfs-replicate-0: failed to open as split brain seen, returning EIO
[2013-07-10 16:35:19.286342] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background data gfid self-heal failed on /benchmark1.dmp
[2013-07-10 16:38:02.270848] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background data gfid self-heal failed on /benchmark1.dmp
[2013-07-10 16:39:16.333892] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background data gfid self-heal failed on /benchmark1.dmp

## note that NOW split-brain shows files.
[root@brick-a1 ~]# gluster volume heal 'glusterfs' info split-brain;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 0

Brick brick-b1:/glusterfs
Number of entries: 2
at                    path on brick
-----------------------------------
2013-07-10 14:42:19 /benchmark1.dmp
2013-07-10 14:41:40 /benchmark1.dmp

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0

## TODO: is file recovery possible? nope? data loss.
[root@client-a1 ~]# LC_ALL=C rm -f /glusterfs/benchmark*.dmp;
[root@client-a1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if=/dev/zero of="/glusterfs/benchmark${LOOP}.dmp" bs=64k count="$(( 1024 / 64 * 100 ))"; done; sync;

## When I tried to reproduce this error, it was not the initial storage brick crash that caused the data loss/inconsistency,
## but the moment the auto-healing feature triggered.
## Only after the crashed node re-joined and "auto-self-healed" the data did the data become wrong.

## Another error that I triggered:
## after 'brick-a1' has crashed, some glusterfs commands fail to work,
## because the crashed storage node is never declared dead; here '10.128.61.60', 5 minutes after the crash:
[root@brick-a2 ~]# gluster peer status;
Number of Peers: 3

Hostname: brick-b2.mydomain.tld
Uuid: 21f2d087-21a9-4e62-adb6-56c6b69e43a4
State: Peer in Cluster (Connected)

Hostname: brick-b1.mydomain.tld
Uuid: 7dc0358d-54be-4a51-8791-0a9b6c7dcddf
State: Peer in Cluster (Connected)

Hostname: 10.128.61.60
Uuid: 168a3d9b-a9fd-4f44-a5d8-72ee724d24bf
State: Peer in Cluster (Connected)

[root@brick-a2 ~]# gluster volume heal 'glusterfs' info;
operation failed

## These tests were done with slightly different GlusterFS server and client versions, the client newer than the server.
[root@brick-a1 ~]# yum -C list installed glusterfs;
glusterfs.x86_64    3.3.0.7rhs-1.el6rhs    @anaconda-InstallationRepo-201304110052.x86_64/2.0

[root@client-a1 ~]# yum -C list installed glusterfs glusterfs-fuse;
glusterfs.x86_64         3.3.0-22.el6    @/glusterfs-3.3.0-22.el6.x86_64
glusterfs-fuse.x86_64    3.3.0-22.el6    @/glusterfs-fuse-3.3.0-22.el6.x86_64

[root@brick-a2 ~]# grep -ri -- 'Server and Client lk-version numbers are not same' /var/log/glusterfs/glustershd.log | tail -n1;
[2013-07-10 15:17:41.804042] I [client-handshake.c:1423:client_setvolume_cbk] 0-glusterfs-client-0: Server and Client lk-version numbers are not same, reopening the fds

Actual results:
-) Data loss of files written during the brick crash, AFTER the crashed node returns and self-healing has been triggered.

Expected results:
-) The crashed brick syncs its missing files *from* its replica, not *to* its replica.

Additional info:
-) I was able to reproduce this error in only about 20% of the cases; all other self-heals worked correctly. Race condition?
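Given the divergent `ls` listings and the differing read results between client-a1 and client-b1 above, a per-file checksum comparison between two views of the same data (two client mounts, or two brick backends) makes the divergence explicit. A hedged sketch; `diff_dirs` is an ad-hoc helper and the paths in the usage note are placeholders:

```shell
#!/bin/sh
# Sketch: md5sum the benchmark files in two directories and diff the results.
# Exits non-zero (and prints the differing lines) when the contents diverge,
# as they do between the two clients above. diff_dirs is an ad-hoc helper.
diff_dirs() {
    A="$1"; B="$2"
    ( cd "$A" && md5sum *.dmp ) > /tmp/sums.a
    ( cd "$B" && md5sum *.dmp ) > /tmp/sums.b
    diff /tmp/sums.a /tmp/sums.b
}
# Usage, e.g.:
#   diff_dirs /glusterfs /mnt/other-client-view
```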
Daniel,
This seems very much like the bug we fixed for Bigbend: https://bugzilla.redhat.com/show_bug.cgi?id=928784. The only part we still need to figure out in that bug is why sometimes both bricks get disconnected, but other than that this bug should be fixed with Bigbend. Did you get a chance to try this experiment on Bigbend?
Pranith.
I tried to re-create this issue but was not able to. I am attaching the output from the run I did. After running the steps mentioned in the description, I ended up with the following output; even the md5sums match. Moving it to ON_QA.

root@pranithk-vm1 - /mnt/r2
12:36:50 :( ⚡ ls -l *
-rw-r--r-- 1 root root 1048576000 Oct  2 11:54 benchmark1.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:54 benchmark2.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:55 benchmark3.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:56 benchmark4.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:56 benchmark5.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:56 benchmark6.dmp

root@pranithk-vm1 - /mnt/r2
12:36:54 :) ⚡ ls -l /brick/r2_0/
total 6144000
-rw-r--r-- 2 root root 1048576000 Oct  2 11:54 benchmark1.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:54 benchmark2.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:55 benchmark3.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 benchmark4.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 benchmark5.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 benchmark6.dmp

root@pranith-vm2 - ~
00:24:29 :) ⚡ ls -l /brick/r2_1/*
-rw-r--r-- 2 root root 1048576000 Oct  2 23:51 /brick/r2_1/benchmark1.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 23:51 /brick/r2_1/benchmark2.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:55 /brick/r2_1/benchmark3.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 /brick/r2_1/benchmark4.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 /brick/r2_1/benchmark5.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 /brick/r2_1/benchmark6.dmp
Tested the case on "glusterfs 3.4.0.36rhs built on Oct 22 2013 10:56:18" with the steps mentioned in the Description. Unable to re-create this issue. Moving it to the VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1769.html