Bug 764681 (GLUSTER-2949) - self-heal hangs
Summary: self-heal hangs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-2949
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-05-26 15:03 UTC by Raghavendra Bhat
Modified: 2014-02-04 17:28 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Raghavendra Bhat 2011-05-26 12:05:29 UTC
This is due to a blocked-locks issue. With two clients, one replicate client acquires the entry lock on one of the servers and gets blocked on the other server; the other client does the same in reverse, acquiring the lock on the second server and getting blocked on the first. Below is the statedump information from the two servers showing the blocked locks.
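
For illustration, this is the classic AB-BA lock-ordering deadlock. The following stand-alone C sketch is a hypothetical analogy only (not GlusterFS code): two pthread mutexes stand in for the entry locks on the two bricks, and two threads stand in for the two clients. Each thread ends up ACTIVE on one lock and BLOCKED on the other, which is exactly the pattern visible in the statedumps below.

/* Hypothetical analogy of the hang, not GlusterFS code.
 * brick_a/brick_b stand in for the entrylk on server 1 and server 2.
 * Build with: gcc -pthread deadlock_sketch.c
 * The program hangs by design. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t brick_a = PTHREAD_MUTEX_INITIALIZER; /* entry lock on server 1 */
static pthread_mutex_t brick_b = PTHREAD_MUTEX_INITIALIZER; /* entry lock on server 2 */

static void *client_1 (void *arg)
{
        (void) arg;
        pthread_mutex_lock (&brick_a);            /* ACTIVE on server 1  */
        printf ("client 1: holds brick A, waiting for brick B\n");
        sleep (1);                                /* let client 2 win brick B */
        pthread_mutex_lock (&brick_b);            /* BLOCKED on server 2 */
        printf ("client 1: never reached\n");
        return NULL;
}

static void *client_2 (void *arg)
{
        (void) arg;
        pthread_mutex_lock (&brick_b);            /* ACTIVE on server 2  */
        printf ("client 2: holds brick B, waiting for brick A\n");
        sleep (1);                                /* let client 1 win brick A */
        pthread_mutex_lock (&brick_a);            /* BLOCKED on server 1 */
        printf ("client 2: never reached\n");
        return NULL;
}

int main (void)
{
        pthread_t t1, t2;

        pthread_create (&t1, NULL, client_1, NULL);
        pthread_create (&t2, NULL, client_2, NULL);
        pthread_join (t1, NULL);                  /* never returns: deadlock */
        pthread_join (t2, NULL);
        return 0;
}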



[xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1]
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.gfid=92e77d0e-a64e-493b-9754-f6640e05711f
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.nlookup=7
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.ref=3
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.ino=1687888
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.ia_type=2
xlator.feature.locks.inode.1687888.mandatory=0
xlator.feature.locks.entrylk-count.1687888.entrylk-count=2
xlator.feature.locks.lock-dump.domain.domain=mirror-replicate-0
xlator.feature.locks.lock-dump.domain.entrylk.entrylk[0](ACTIVE)= ENTRYLK_WRLCK on (null) owner=47614754272372, transport=0xc1ced00
xlator.feature.locks.lock-dump.domain.entrylk.entrylk[1](BLOCKED)= ENTRYLK_WRLCK on (null) owner=47940836762464, transport=0xc1cd220, state = Blocked



[xlator.protocol.server.conn.1.bound_xl./export/mirror.active.8]
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.8.gfid=92e77d0e-a64e-493b-9754-f6640e05711f
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.8.nlookup=7
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.8.ref=7
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.8.ino=2326709
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.8.ia_type=2
xlator.feature.locks.inode.2326709.mandatory=0
xlator.feature.locks.entrylk-count.2326709.entrylk-count=2
xlator.feature.locks.lock-dump.domain.domain=mirror-replicate-0
xlator.feature.locks.lock-dump.domain.entrylk.entrylk[0](ACTIVE)= ENTRYLK_WRLCK on (null) owner=47940836762464, transport=0x1ad5b0a0
xlator.feature.locks.lock-dump.domain.entrylk.entrylk[1](BLOCKED)= ENTRYLK_WRLCK on (null) owner=47614754272372, transport=0x1adb6900, state = Blocked

Comment 1 Raghavendra Bhat 2011-05-26 15:03:10 UTC
This is the issue reported on community.gluster.com:



I've been able to duplicate it a few times now. I'm running 4 v3.2.0-1 servers in a distributed-replicated setup. I'm testing from 2 clients: one is one of the gluster servers, and the other is a separate host on the same network.

Here's my basic setup.

host1:/var/gluster/data <-> host2:/var/gluster/data in one replicated pair.
host3:/var/gluster/data <-> host4:/var/gluster/data in the other replicated pair.

Mount point on the clients = /mnt/gluster
Clients: host1, test1
Client type: gluster native client
OS: Debian Linux
kernel: 2.6.32 for all hosts

To test:
*umount /mnt/gluster on host1, stop its glusterd, kill its glusterfs procs.
*rm -rf /var/gluster/data/test_dir
*start glusterd, mount -t glusterfs host1:/vol /mnt/gluster
*Then from host1, perform a find /mnt/gluster -type f to trigger autoheal.
*At the same time, from test1, perform the same find command, because in production, multiple clients may be performing concurrent finds, even while self-healing.

Usually this isn't a problem, but in 3 of the past 5 tests I've seen the finds hang. There are no obvious error messages in the logs; the clients just seem to be waiting for something. Neither uptime nor top shows a spike in load. It looks to me like the entire host is just blocked and waiting.

To recover from this, on host1 I run umount /mnt/gluster -f. Once this is done, the find on host1 dies and the find on test1 continues. I actually need to umount a second time on host1 to really unmount it; if I don't, df -h shows "transport not connected".

From host1, here is strace output on the glusterfs process when the client hangs:

epoll_wait(3, {{EPOLLIN, {u32=7, u64=17179869191}}}, 257, 4294967295) = 1
readv(7, [{"\200\0\0 "..., 4}], 1)      = 4
readv(7, [{"\0\16\254D\0\0\0\1"..., 8}], 1) = 8
readv(7, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\0\0\0\v"..., 24}], 1) = 24
writev(7, [{"\200\0\0\354"..., 4}, {"\0\16\254F\0\0\0\0\0\0\0\2\0\23\320\5\0\0\0016\0\0\0\35\0\0\0\5\0\0\0X\0"..., 128}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\1\0\0\0\1\0\0\0\0\177"..., 108}], 3) = 240
epoll_wait(3, {{EPOLLIN, {u32=7, u64=17179869191}}}, 257, 4294967295) = 1
readv(7, [{"\200\0\0 "..., 4}], 1)      = 4
readv(7, [{"\0\16\254E\0\0\0\1"..., 8}], 1) = 8
readv(7, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 24}], 1) = 24
epoll_wait(3, {{EPOLLIN, {u32=7, u64=17179869191}}}, 257, 4294967295) = 1
readv(7, [{"\200\0\0 "..., 4}], 1)      = 4
readv(7, [{"\0\16\254G\0\0\0\1"..., 8}], 1) = 8
readv(7, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 24}], 1) = 24
epoll_wait(3, {{EPOLLIN, {u32=7, u64=17179869191}}}, 257, 4294967295) = 1
readv(7, [{"\200\0\0 "..., 4}], 1)      = 4
readv(7, [{"\0\16\254H\0\0\0\1"..., 8}], 1) = 8
readv(7, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 24}], 1) = 24
epoll_wait(3, {{EPOLLIN, {u32=7, u64=17179869191}}}, 257, 4294967295) = 1
readv(7, [{"\200\0\0 "..., 4}], 1)      = 4
readv(7, [{"\0\16\254I\0\0\0\1"..., 8}], 1) = 8
readv(7, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 24}], 1) = 24
epoll_wait(3,

from mnt-gluster.log
[2011-05-06 19:09:33.316525] W [dict.c:437:dict_ref] (-->/usr/lib/glusterfs/3.2.0/xlator/protocol/client.so(client3_1_fstat_cbk+0xf3) [0x7f8b7a2d7a93] (-->/usr/lib/glusterfs/3.2.0/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x17d) [0x7f8b7a08529d] (-->/usr/lib/glusterfs/3.2.0/xlator/cluster/replicate.so(afr_sh_data_fix+0x228) [0x7f8b7a084f68]))) 0-dict: dict is NULL
[2011-05-06 19:09:33.318185] I [afr-common.c:581:afr_lookup_collect_xattr] 0-data-replicate-1: data self-heal is pending for /test2/0/test_0.log.gz.
[2011-05-06 19:09:33.318351] I [dht-common.c:739:dht_lookup_everywhere_cbk] 0-data-dht: deleting stale linkfile /test2/0/test_0.log.gz on data-replicate-1
[2011-05-06 19:09:33.322903] I [client3_1-fops.c:1226:client3_1_inodelk_cbk] 0-data-client-3: remote operation failed: No such file or directory
[2011-05-06 19:09:33.322958] I [afr-lk-common.c:568:afr_unlock_inodelk_cbk] 0-data-replicate-1: /test2/0/test_0.log.gz: unlock failed No such file or directory
[2011-05-06 19:09:33.325567] I [client3_1-fops.c:1226:client3_1_inodelk_cbk] 0-data-client-2: remote operation failed: No such file or directory
[2011-05-06 19:09:33.325607] I [afr-lk-common.c:568:afr_unlock_inodelk_cbk] 0-data-replicate-1: /test2/0/test_0.log.gz: unlock failed No such file or directory
[2011-05-06 19:09:33.326709] I [client3_1-fops.c:1640:client3_1_setattr_cbk] 0-data-client-3: remote operation failed: No such file or directory
[2011-05-06 19:09:33.326750] I [afr-self-heal-data.c:106:afr_sh_data_flush_cbk] 0-data-replicate-1: flush or setattr failed on /test2/0/test_0.log.gz on subvolume data-client-3: No such file or directory
[2011-05-06 19:09:33.327193] I [client3_1-fops.c:1640:client3_1_setattr_cbk] 0-data-client-2: remote operation failed: No such file or directory
[2011-05-06 19:09:33.327225] I [afr-self-heal-data.c:106:afr_sh_data_flush_cbk] 0-data-replicate-1: flush or setattr failed on /test2/0/test_0.log.gz on subvolume data-client-2: No such file or directory

strace on the client machine when it hangs:
epoll_wait(3, {{EPOLLIN, {u32=11, u64=11}}}, 257, 4294967295) = 1
readv(11, [{"\200\1\0\210"..., 4}], 1)  = 4
readv(11, [{"\0g\372\244\0\0\0\1"..., 8}], 1) = 8
readv(11, [{"\0\0\0\0"..., 4}], 1)      = 4
readv(11, [{"\0\0\0\0\0\0\0\0"..., 8}], 1) = 8
readv(11, [{"\0\0\0\0"..., 4}], 1)      = 4
readv(11, [{"\0\1\0\0\0\0\0\0\213\"]D\247-MH\237\345\245\2015-r\340\0\0\0\0\4\223\300\n\0"..., 112}], 1) = 112
readv(11, [{"'\355(lG\311\30\333w\243<\353]O\245\337H\360\v\363\257-?\331i89\5\201\267\362\\\263"..., 65536}], 1) = 38240
readv(11, 0xb26a88, 1)                  = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, {{EPOLLIN, {u32=11, u64=11}}}, 257, 4294967295) = 1
readv(11, [{"\3232~B\37\6=rq\31\377\256Q/\241\212\37\0329\213[i\315hnA\1\232M\351\323\250\251"..., 27296}], 1) = 8688
readv(11, 0xb26a88, 1)                  = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, {{EPOLLIN, {u32=11, u64=11}}}, 257, 4294967295) = 1
readv(11, [{"\364\303y\271\265S\263\2\251.\226m-\245\253\217\177\337\226\223q=\17>\247N\257\224\250J[\302W"..., 18608}], 1) = 18608
epoll_wait(3, {{EPOLLIN, {u32=11, u64=11}}}, 257, 4294967295) = 1
readv(11, [{"\200\1\0\210"..., 4}], 1)  = 4
readv(11, [{"\0g\372\246\0\0\0\1"..., 8}], 1) = 8
readv(11, [{"\0\0\0\0"..., 4}], 1)      = 4
readv(11, [{"\0\0\0\0\0\0\0\0"..., 8}], 1) = 8
readv(11, [{"\0\0\0\0"..., 4}], 1)      = 4
readv(11, [{"\0\1\0\0\0\0\0\0\213\"]D\247-MH\237\345\245\2015-r\340\0\0\0\0\4\223\300\n\0"..., 112}], 1) = 112
readv(11, [{"U'i^\5\335\36\2778\356\235\34\5/\333\234\347ij2i#A\3008\216vK\300\251\307\371g"..., 65536}], 1) = 17452
readv(11, 0xb26a88, 1)                  = -1 EAGAIN (Resource temporarily unavailable)

The issue has also been reproduced in-house using the method described in this report.

Comment 2 Anand Avati 2011-05-30 11:23:19 UTC
PATCH: http://patches.gluster.com/patch/7305 in master (cluster/afr: Send Non-blocking lock in non-blocking entrylk)

Comment 3 Anand Avati 2011-05-30 11:23:24 UTC
PATCH: http://patches.gluster.com/patch/7306 in master (features/locks: Unwind with SUCCESS on non-blocking entrylk success)

Comment 4 Anand Avati 2011-05-31 13:11:13 UTC
PATCH: http://patches.gluster.com/patch/7318 in release-3.1 (cluster/afr: Send Non-blocking lock in non-blocking entrylk)

Comment 5 Anand Avati 2011-05-31 13:11:18 UTC
PATCH: http://patches.gluster.com/patch/7319 in release-3.1 (features/locks: Unwind with SUCCESS on non-blocking entrylk success)

Comment 6 Anand Avati 2011-05-31 13:12:24 UTC
PATCH: http://patches.gluster.com/patch/7314 in release-3.2 (cluster/afr: Send Non-blocking lock in non-blocking entrylk)

Comment 7 Anand Avati 2011-05-31 13:12:29 UTC
PATCH: http://patches.gluster.com/patch/7315 in release-3.2 (features/locks: Unwind with SUCCESS on non-blocking entrylk success)

Comment 8 Pranith Kumar K 2011-06-01 00:13:27 UTC
The bug is caused by a dead-lock between blocking lock attempts.
The blocking-lock APIs in afr avoid this situation by trying the lock on each brick one by one, in order from 0 to child_count. Only non-blocking locks are attempted in parallel on all the nodes; because they do not block, a dead-lock cannot happen.
   This dead-lock occurred because the non-blocking entrylk API was attempting a blocking lock internally.

The fix is to attempt a non-blocking lock inside the non-blocking entrylk path.
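
A minimal sketch of the two locking strategies described above, again with pthread mutexes standing in for the per-brick entry locks (an illustration under those assumptions, not the actual afr code): the non-blocking path must only ever use trylock, and if any brick is busy it releases whatever it has already taken; the caller then falls back to the blocking path, which acquires the bricks in a fixed index order so two clients can never hold them in opposite order.

/* Hypothetical sketch, not actual afr code.
 * brick_lk[i] stands in for the entry lock on brick i. */
#include <pthread.h>

#define CHILD_COUNT 2

static pthread_mutex_t brick_lk[CHILD_COUNT] = {
        PTHREAD_MUTEX_INITIALIZER,
        PTHREAD_MUTEX_INITIALIZER,
};

/* Try every brick without ever sleeping.  Returns 0 on success, -1 if any
 * brick was busy (all partially acquired locks are released first). */
static int nonblocking_entrylk (void)
{
        int i, j;

        for (i = 0; i < CHILD_COUNT; i++) {
                if (pthread_mutex_trylock (&brick_lk[i]) != 0) {
                        for (j = 0; j < i; j++)
                                pthread_mutex_unlock (&brick_lk[j]);
                        return -1;      /* caller retries in blocking mode */
                }
        }
        return 0;
}

/* Blocking fallback: acquire bricks one by one in a fixed order (0 to
 * child_count-1), so two clients can never each hold one brick while
 * waiting for the other. */
static void blocking_entrylk (void)
{
        int i;

        for (i = 0; i < CHILD_COUNT; i++)
                pthread_mutex_lock (&brick_lk[i]);
}

static void entrylk_unlock_all (void)
{
        int i;

        for (i = 0; i < CHILD_COUNT; i++)
                pthread_mutex_unlock (&brick_lk[i]);
}

int main (void)
{
        if (nonblocking_entrylk () != 0)
                blocking_entrylk ();    /* ordered acquisition, deadlock-free */
        /* ... self-heal work would happen here ... */
        entrylk_unlock_all ();
        return 0;
}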

Comment 9 Raghavendra Bhat 2011-06-01 07:36:27 UTC
The find command run concurrently from 2 glusterfs clients no longer hangs.

Comment 10 Anand Avati 2011-06-20 00:41:20 UTC
PATCH: http://patches.gluster.com/patch/7269 in master (print the owner and the transport for the blocked locks in server statedump)

