Bug 991174 - Data Loss after Self Healing
Summary: Data Loss after Self Healing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact: spandura
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-08-01 19:07 UTC by Daniel Peess
Modified: 2013-11-27 15:30 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.4.0.34rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-27 15:30:54 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:1769 0 normal SHIPPED_LIVE Red Hat Storage 2.1 enhancement and bug fix update #1 2013-11-27 20:17:39 UTC

Description Daniel Peess 2013-08-01 19:07:16 UTC
Description of problem:
Data loss after self-healing of files that were being written when a brick crashed.

Version-Release number of selected component (if applicable):
2.0u4

How reproducible:
-) About 20% of the time.

Steps to Reproduce:

## failure test for writing.

[root@client-a1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if=/dev/zero of="/glusterfs/benchmark${LOOP}.dmp" bs=64k count="$(( 1024 / 64 * 100 ))"; done; sync;
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.53119 s, 11.0 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.59298 s, 10.9 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.68409 s, 10.8 MB/s
<PROCESSING>

[root@brick-a1 ~]# sysctl kernel.sysrq;
kernel.sysrq = 0
[root@brick-a1 ~]# sysctl -w kernel.sysrq=1;
kernel.sysrq = 1
[root@brick-a1 ~]# echo c > /proc/sysrq-trigger; ## WARNING: crashes system!
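
## a gentler way to simulate the same brick failure (a sketch, not what was done above)
## is to kill only the brick process instead of crashing the whole node; <PID> is a
## placeholder for the brick's PID taken from the status output:
gluster volume status 'glusterfs'         ## lists every brick with its port and PID
kill -KILL <PID>                          ## stop only the glusterfsd serving brick-a1:/glusterfs
gluster volume start 'glusterfs' force    ## later: respawn the killed brick process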

[root@client-a1 ~]# <STILL PROCESSING>
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 36.0099 s, 2.9 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.62357 s, 10.9 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.67722 s, 10.8 MB/s

[root@client-a1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if="/glusterfs/benchmark${LOOP}.dmp" bs=64k status=noxfer | md5sum -; done;
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -
2f282b84e7e608d5852449ed940bfc51  -
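
## a sketch for checking the files against a reference checksum instead of comparing the
## sums above by eye (assumes the same 100 MiB-of-zeros pattern written earlier):
REF="$(LC_ALL=C dd if=/dev/zero bs=64k count=1600 2>/dev/null | md5sum | cut -d' ' -f1)";
for LOOP in {1..6}; do
    SUM="$(LC_ALL=C dd if="/glusterfs/benchmark${LOOP}.dmp" bs=64k 2>/dev/null | md5sum | cut -d' ' -f1)";
    [ "${SUM}" = "${REF}" ] || echo "MISMATCH: benchmark${LOOP}.dmp (${SUM})";
done;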

[root@brick-a2 ~]# gluster volume heal 'glusterfs' info;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 0

Brick brick-b1:/glusterfs
Number of entries: 2
/benchmark5.dmp
/benchmark4.dmp

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0
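
## a sketch for watching the two pending entries above drain once the crashed brick is
## back online (assumes the volume name 'glusterfs' used throughout):
while gluster volume heal 'glusterfs' info | grep -q '^Number of entries: [1-9]'; do
    sleep 10;
done; echo "heal queue empty";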

## reboot the crashed storage brick and log in again.
[root@brick-a1 ~]# service glusterd status;
glusterd is stopped
[root@brick-a1 ~]# service glusterd start;
Starting glusterd:                                         [  OK  ]
[root@brick-a1 ~]# gluster peer status;
...

## "10.8. Triggering Self-Heal on Replicate"
## "In replicate module, previously you had to manually trigger a self-heal when a brick goes offline and comes back online,
## to bring all the replicas in sync. Now the pro-active self-heal daemon runs in the background, 
## diagnoses issues and automatically initiates self-healing every 10 minutes on the files which require healing."
## https://access.redhat.com/site/documentation/en-US/Red_Hat_Storage/2.0/html/Administration_Guide/sect-User_Guide-Managing_Volumes-Self_heal.html

## superfluous
[root@brick-a1 ~]# time gluster volume heal 'glusterfs';
Heal operation on volume glusterfs has been successful
real    0m0.086s
user    0m0.053s
sys     0m0.010s

## non-functional?
[root@brick-a1 ~]# time gluster volume heal 'glusterfs' full;
Heal operation on volume glusterfs has been successful
real    0m0.077s
user    0m0.056s
sys     0m0.007s
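
## since both heal commands return immediately, a sketch for checking whether the
## self-heal daemon actually came back up after the reboot:
gluster volume status 'glusterfs'    ## should list a "Self-heal Daemon" entry per node
pgrep -fl glustershd                 ## the shd process itself; no output means it is not running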

[root@brick-a1 ~]# gluster volume heal 'glusterfs' info;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 0

Brick brick-b1:/glusterfs
Number of entries: 0

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0

## note that split-brain does *not* *yet* list any failures.
[root@brick-a1 ~]# gluster volume heal 'glusterfs' info split-brain;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 0

Brick brick-b1:/glusterfs
Number of entries: 0

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0

[root@brick-a1 ~]# gluster volume heal 'glusterfs' info healed;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 1
at                    path on brick
-----------------------------------
2013-07-10 14:27:44 <gfid:d3cc0fb2-b834-4f0a-94ef-a89a5681fd1f>

Brick brick-b1:/glusterfs
Number of entries: 2
at                    path on brick
-----------------------------------
2013-07-10 14:28:19 /benchmark4.dmp
2013-07-10 14:28:17 /benchmark5.dmp

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0

## TODO: I/O error bug.
[root@client-a1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if="/glusterfs/benchmark${LOOP}.dmp" of=/dev/null bs=64k; done;
dd: opening `/glusterfs/benchmark1.dmp': Input/output error
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.50641 s, 11.0 MB/s
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000238259 s, 0.0 kB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.56144 s, 11.0 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.66634 s, 10.8 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 9.40601 s, 11.1 MB/s

## TODO: I/O error bug.
[root@client-b1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if="/glusterfs/benchmark${LOOP}.dmp" of=/dev/null bs=64k; done;
dd: opening `/glusterfs/benchmark1.dmp': Input/output error
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 1.03918 s, 101 MB/s
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000434988 s, 0.0 kB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 0.456711 s, 230 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 0.443782 s, 236 MB/s
1600+0 records in
1600+0 records out
104857600 bytes (105 MB) copied, 1.01025 s, 104 MB/s
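
## a sketch for making these read failures machine-detectable instead of spotting the
## "Input/output error" lines by eye:
for LOOP in {1..6}; do
    LC_ALL=C dd if="/glusterfs/benchmark${LOOP}.dmp" of=/dev/null bs=64k 2>/dev/null \
        || echo "READ FAILED: benchmark${LOOP}.dmp";
done;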

[root@client-a1 ~]# ls -las /glusterfs/;
total 434204
     0 drwxr-xr-x.  3 root root       172 10. Jul 15:28 .
     4 dr-xr-xr-x. 24 root root      4096  9. Jul 16:12 ..
 24600 -rw-r--r--.  1 root root  25190400 10. Jul 16:00 benchmark1.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark2.dmp
     0 -rw-r--r--.  1 root root         0 10. Jul 16:00 benchmark3.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark4.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark5.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:02 benchmark6.dmp

[root@client-b1 ~]# ls -las /glusterfs/;
total 512004
     0 drwxr-xr-x.  3 root root       172 10. Jul 15:28 .
     4 dr-xr-xr-x. 24 root root      4096  9. Jul 15:00 ..
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark1.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark2.dmp
     0 -rw-r--r--.  1 root root         0 10. Jul 16:00 benchmark3.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark4.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark5.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:02 benchmark6.dmp

[root@client-a1 ~]# umount /glusterfs/;
[root@client-a1 ~]# mount /glusterfs/;
unknown option _netdev (ignored)
[root@client-a1 ~]# ls -las /glusterfs/;
total 367772
     0 drwxr-xr-x.  3 root root       172 10. Jul 15:28 .
     4 dr-xr-xr-x. 24 root root      4096  9. Jul 16:12 ..
 24600 -rw-r--r--.  1 root root  25190400 10. Jul 16:00 benchmark1.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark2.dmp
     0 -rw-r--r--.  1 root root         0 10. Jul 16:00 benchmark3.dmp
 35968 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark4.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark5.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:02 benchmark6.dmp

[root@client-b1 ~]# umount /glusterfs/;
[root@client-b1 ~]# mount /glusterfs/;
unknown option _netdev (ignored)
[root@client-b1 ~]# ls -las /glusterfs/;
total 445572
     0 drwxr-xr-x.  3 root root       172 10. Jul 15:28 .
     4 dr-xr-xr-x. 24 root root      4096  9. Jul 15:00 ..
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark1.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:00 benchmark2.dmp
     0 -rw-r--r--.  1 root root         0 10. Jul 16:00 benchmark3.dmp
 35968 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark4.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:28 benchmark5.dmp
102400 -rw-r--r--.  1 root root 104857600 10. Jul 16:02 benchmark6.dmp
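
## the two client mounts disagree about the size of benchmark1.dmp; a sketch for comparing
## the backend copies directly, assuming brick-a1 and brick-b1 form the replica pair holding
## that file and password-less ssh is available:
for HOST in brick-a1 brick-b1; do
    echo "== ${HOST} ==";
    ssh "root@${HOST}" 'ls -l /glusterfs/benchmark1.dmp; md5sum /glusterfs/benchmark1.dmp';
done;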

[root@client-a1 ~]# grep -i -- 'failed' /var/log/glusterfs/glusterfs-.log | tail -n10;
[2013-07-10 16:01:46.060097] W [client3_1-fops.c:1545:client3_1_finodelk_cbk] 0-glusterfs-client-0: remote operation failed: Transport endpoint is not connected
[2013-07-10 16:01:47.050480] E [socket.c:1715:socket_connect_finish] 0-glusterfs-client-0: connection to 10.128.61.60:24010 failed (No route to host)
[2013-07-10 16:02:00.062492] E [socket.c:1715:socket_connect_finish] 0-glusterfs: connection to 10.128.61.60:24007 failed (No route to host)
[2013-07-10 16:30:08.242893] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background  data gfid self-heal failed on /benchmark1.dmp
[2013-07-10 16:30:08.258225] W [afr-open.c:213:afr_open] 0-glusterfs-replicate-0: failed to open as split brain seen, returning EIO
[2013-07-10 16:35:15.111284] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background  data gfid self-heal failed on /benchmark1.dmp
[2013-07-10 16:35:15.111672] W [afr-open.c:213:afr_open] 0-glusterfs-replicate-0: failed to open as split brain seen, returning EIO
[2013-07-10 16:35:19.286342] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background  data gfid self-heal failed on /benchmark1.dmp
[2013-07-10 16:38:02.270848] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background  data gfid self-heal failed on /benchmark1.dmp
[2013-07-10 16:39:16.333892] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-glusterfs-replicate-0: background  data gfid self-heal failed on /benchmark1.dmp
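
## the EIO together with the "split brain seen" messages points at conflicting AFR
## changelogs; a sketch for inspecting them on each replica (run on the brick servers
## against the brick path /glusterfs, not against the client mount):
getfattr -d -m . -e hex /glusterfs/benchmark1.dmp
## non-zero trusted.afr.glusterfs-client-* counters on both copies, each blaming the
## other, are the data split-brain signature.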

## note that NOW split-brain shows files.
[root@brick-a1 ~]# gluster volume heal 'glusterfs' info split-brain;
Heal operation on volume glusterfs has been successful

Brick brick-a1:/glusterfs
Number of entries: 0

Brick brick-b1:/glusterfs
Number of entries: 2
at                    path on brick
-----------------------------------
2013-07-10 14:42:19 /benchmark1.dmp
2013-07-10 14:41:40 /benchmark1.dmp

Brick brick-a2:/glusterfs
Number of entries: 0

Brick brick-b2:/glusterfs
Number of entries: 0

## TODO: file recovery possible? nope? data loss.
[root@client-a1 ~]# LC_ALL=C rm -f /glusterfs/benchmark*.dmp;
[root@client-a1 ~]# for LOOP in {1..6}; do LC_ALL=C dd if=/dev/zero of="/glusterfs/benchmark${LOOP}.dmp" bs=64k count="$(( 1024 / 64 * 100 ))"; done; sync;
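
## for the record, a rough sketch of one manual recovery approach, assuming one replica
## still holds a complete copy: on the brick whose copy you decide to discard, remove the
## file plus its gfid hard link, then re-trigger the heal (the <gfid> path is illustrative;
## take the real value from getfattr):
getfattr -n trusted.gfid -e hex /glusterfs/benchmark1.dmp   ## note the gfid of the bad copy
rm /glusterfs/benchmark1.dmp
rm /glusterfs/.glusterfs/<g1>/<g2>/<gfid>                   ## <g1>/<g2> = first two bytes of the gfid
gluster volume heal 'glusterfs'                             ## let the surviving copy be synced back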

## when I tried to reproduce this error, it was not the initial storage brick crash that caused the data loss/inconsistency,
## but the moment the auto-healing feature triggered.
## only after the crashed node re-joined and the data was "auto-self-healed" did it become wrong.

## another error that I triggered:
## after 'brick-a1' crashed, some glusterfs commands fail to work,
## because the crashed storage node is never declared dead; here '10.128.61.60' is still listed as connected 5 minutes after the crash:

[root@brick-a2 ~]# gluster peer status;
Number of Peers: 3

Hostname: brick-b2.mydomain.tld
Uuid: 21f2d087-21a9-4e62-adb6-56c6b69e43a4
State: Peer in Cluster (Connected)

Hostname: brick-b1.mydomain.tld
Uuid: 7dc0358d-54be-4a51-8791-0a9b6c7dcddf
State: Peer in Cluster (Connected)

Hostname: 10.128.61.60
Uuid: 168a3d9b-a9fd-4f44-a5d8-72ee724d24bf
State: Peer in Cluster (Connected)

[root@brick-a2 ~]# gluster volume heal 'glusterfs' info;
operation failed
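
## peer status still claims 10.128.61.60 is connected even though heal info fails; a sketch
## for checking whether its glusterd port really answers (24007 is the glusterd port already
## seen failing in the client log above):
timeout 5 bash -c 'exec 3<>/dev/tcp/10.128.61.60/24007' 2>/dev/null \
    && echo "glusterd on 10.128.61.60 reachable" \
    || echo "glusterd on 10.128.61.60 NOT reachable";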

## these tests were done with slightly different GlusterFS server and client versions, the client being newer than the server.
[root@brick-a1 ~]# yum -C list installed glusterfs;
glusterfs.x86_64                   3.3.0.7rhs-1.el6rhs                    @anaconda-InstallationRepo-201304110052.x86_64/2.0
[root@client-a1 ~]# yum -C list installed glusterfs glusterfs-fuse;
glusterfs.x86_64                                3.3.0-22.el6                            @/glusterfs-3.3.0-22.el6.x86_64
glusterfs-fuse.x86_64                           3.3.0-22.el6                            @/glusterfs-fuse-3.3.0-22.el6.x86_64
[root@brick-a2 ~]# grep -ri -- 'Server and Client lk-version numbers are not same' /var/log/glusterfs/glustershd.log | tail -n1;
[2013-07-10 15:17:41.804042] I [client-handshake.c:1423:client_setvolume_cbk] 0-glusterfs-client-0: Server and Client lk-version numbers are not same, reopening the fds


Actual results:
-) Data loss of files written during the brick crash, AFTER the crashed node returns and self-healing has been triggered.

Expected results:
-) The crashed brick syncs its missing files *from* its replica, not *to* its replica.

Additional info:
-) I was able to reproduce this error in only about 20% of the cases; all other self-heals worked correctly. Race condition?

Comment 4 Pranith Kumar K 2013-09-26 06:54:36 UTC
Daniel,
      This looks very much like the bug we fixed for Bigbend: https://bugzilla.redhat.com/show_bug.cgi?id=928784. The only part we still need to figure out in that bug is why both bricks sometimes get disconnected, but other than that this bug should be fixed with Bigbend. Did you get a chance to try this experiment on Bigbend?

Pranith.

Comment 5 Pranith Kumar K 2013-10-03 07:29:52 UTC
I tried to re-create this issue but was not able to. I am attaching the output from the run I did.

After running the steps mentioned in the description, I ended up with the following output. Even the md5sums match. Moving it to ON_QA.

root@pranithk-vm1 - /mnt/r2 
12:36:50 :( ⚡ ls -l *
-rw-r--r-- 1 root root 1048576000 Oct  2 11:54 benchmark1.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:54 benchmark2.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:55 benchmark3.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:56 benchmark4.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:56 benchmark5.dmp
-rw-r--r-- 1 root root 1048576000 Oct  2 11:56 benchmark6.dmp

root@pranithk-vm1 - /mnt/r2 
12:36:54 :) ⚡ ls -l /brick/r2_0/
total 6144000
-rw-r--r-- 2 root root 1048576000 Oct  2 11:54 benchmark1.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:54 benchmark2.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:55 benchmark3.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 benchmark4.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 benchmark5.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 benchmark6.dmp

root@pranith-vm2 - ~ 
00:24:29 :) ⚡ ls -l /brick/r2_1/*
-rw-r--r-- 2 root root 1048576000 Oct  2 23:51 /brick/r2_1/benchmark1.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 23:51 /brick/r2_1/benchmark2.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:55 /brick/r2_1/benchmark3.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 /brick/r2_1/benchmark4.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 /brick/r2_1/benchmark5.dmp
-rw-r--r-- 2 root root 1048576000 Oct  2 11:56 /brick/r2_1/benchmark6.dmp

Comment 6 spandura 2013-10-24 09:00:58 UTC
Tested the case on "glusterfs 3.4.0.36rhs built on Oct 22 2013 10:56:18" with the steps mentioned in the "Description". Unable to recreate this issue. Moving it to the VERIFIED state.

Comment 8 errata-xmlrpc 2013-11-27 15:30:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html

