After some tests on glusterfs-3.3.0qa5 upgraded to the latest git pull, I killed all glusterd, glusterfs, and glusterfsd processes and restarted glusterd. There were two mount points, both FUSE. I tried to empty the volume by running rm -rf on both mount points. One of the rm -rf invocations hung, and on the other, rm -rf kept reporting that it was removing the same directory every time.

A statedump of the servers indicated blocked entrylks (the same blocked lock appears four times in the dump):

xlator.feature.locks.lock-dump.domain.entrylk.entrylk[1](BLOCKED)= ENTRYLK_WRLCK on net.c owner=47512409627632, transport=0xf2cba0, state = Blocked

The logs indicate that the entries of a directory failed to get self-healed. afr might have held the lock on the directory and not unlocked it after its entries failed to self-heal.
(In reply to comment #0)
Looking at the logs, this seems to be the issue:

*) The entrylk was sent without a gfid:
[2011-08-22 08:03:28.420914] I [server-resolve.c:571:server_resolve] 0-mirror-server: pure path resolution for /playground/run14600/linux-2.6.31.1/drivers/firewire (ENTRYLK)

*) The entrylk failed on the server:
[2011-08-22 08:03:28.451662] I [server3_1-fops.c:341:server_entrylk_cbk] 0-mirror-server: 77698: ENTRYLK /playground/run14600/linux-2.6.31.1/drivers/firewire (0) ==> -1 (No such file or directory)

*) /75/glusterfs/client-.log:[2011-08-22 08:03:28.451776] E [afr-lk-common.c:668:afr_unlock_entrylk_cbk] 0-mirror-replicate-0: /playground/run14600/linux-2.6.31.1/drivers/firewire: unlock failed on 0, reason: No such file or directory

*) According to the client log, the entrylk that failed is the unlock.

*) The client on 76 has entrylk failures on this entry after 08:03:28.

*) brick-0 has the following stale lock:
[xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1]
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.gfid=9835035b-e6a7-400c-b138-7c2bd8d873cf
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.nlookup=2
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.ref=3
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.ino=-1978601478
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.ia_type=2
xlator.feature.locks.inode.-1978601478.mandatory=0
xlator.feature.locks.entrylk-count.-1978601478.entrylk-count=2
xlator.feature.locks.lock-dump.domain.domain=mirror-replicate-0
xlator.feature.locks.lock-dump.domain.entrylk.entrylk[0](ACTIVE)= ENTRYLK_WRLCK on (null) owner=47512409645496, transport=0xf2cba0
xlator.feature.locks.lock-dump.domain.entrylk.entrylk[1](BLOCKED)= ENTRYLK_WRLCK on net.c owner=47512409627632, transport=0xf2cba0, state = Blocked

*) Observe that there is a blocked lock contending for net.c:

pranithk @ ~/code-reading/kernel
15:26:36 :) $ find . -name net.c
./drivers/vhost/net.c
./drivers/staging/pohmelfs/net.c
./drivers/firewire/net.c <<--------
./net/tipc/net.c

All this happened because of the entrylk without a gfid. So most probably the entrylk lock happened with one gfid whereas the unlock happened with another gfid.
(In reply to comment #1)
This confirms my assumption. Looking more at the client log on 75: entry self-heal is triggered on the same directory twice in a row. The first self-heal starts failing after the entry lock is taken, and its unlock fails; the second self-heal then fails because it cannot acquire the locks.

First self-heal:
[2011-08-22 08:03:28.417360] I [afr-common.c:1225:afr_launch_self_heal] 0-mirror-replicate-0: background meta-data entry self-heal triggered. path: /playground/run14600/linux-2.6.31.1/drivers/firewire
[2011-08-22 08:03:28.447512] E [client3_1-fops.c:232:client3_1_mknod_cbk] 0-mirror-client-0: remote operation failed: No such file or directory
[2011-08-22 08:03:28.447553] E [afr-self-heal-entry.c:1108:afr_sh_entry_impunge_newfile_cbk] 0-mirror-replicate-0: creation of /playground/run14600/linux-2.6.31.1/drivers/firewire/core-device.c on mirror-client-0 failed (No such file or directory)
....
....
heal-entry.c:130:afr_sh_entry_erase_pending_cbk] 0-mirror-replicate-0: /playground/run14600/linux-2.6.31.1/drivers/firewire: failed to erase pending xattrs on mirror-client-0 (No such file or directory)
[2011-08-22 08:03:28.451750] I [client3_1-fops.c:1303:client3_1_entrylk_cbk] 0-mirror-client-0: remote operation failed: No such file or directory
[2011-08-22 08:03:28.451776] E [afr-lk-common.c:668:afr_unlock_entrylk_cbk] 0-mirror-replicate-0: /playground/run14600/linux-2.6.31.1/drivers/firewire: unlock failed on 0, reason: No such file or directory
[2011-08-22 08:03:28.451831] I [afr-self-heal-common.c:2022:afr_self_heal_completion_cbk] 0-mirror-replicate-0: background meta-data entry self-heal completed on /playground/run14600/linux-2.6.31.1/drivers/firewire

Second self-heal:
[2011-08-22 08:03:28.452415] I [afr-dir-read.c:124:afr_examine_dir_readdir_cbk] 0-mirror-replicate-0: /playground/run14600/linux-2.6.31.1/drivers/firewire: failed to do opendir on mirror-client-0
[2011-08-22 08:03:28.452876] I [afr-dir-read.c:175:afr_examine_dir_readdir_cbk] 0-mirror-replicate-0: entry self-heal triggered. path: /playground/run14600/linux-2.6.31.1/drivers/firewire, reason: checksums of directory differ, forced merge option set
[2011-08-22 08:03:28.452909] I [afr-common.c:1225:afr_launch_self_heal] 0-mirror-replicate-0: background entry self-heal triggered. path: /playground/run14600/linux-2.6.31.1/drivers/firewire
[2011-08-22 08:03:28.453344] E [afr-self-heal-entry.c:2319:afr_sh_post_nonblocking_entry_cbk] 0-mirror-replicate-0: Non Blocking entrylks failed for /playground/run14600/linux-2.6.31.1/drivers/firewire.
[2011-08-22 08:03:28.453381] E [afr-self-heal-common.c:2019:afr_self_heal_completion_cbk] 0-mirror-replicate-0: background entry self-heal failed on /playground/run14600/linux-2.6.31.1/drivers/firewire
(In reply to comment #2)
Here is what is happening:

1) On client 1, entry self-heal is triggered on the directory 'firewire'.
2) Self-heal acquires the entry lock on (firewire, NULL).
3) Now an rmdir for this directory comes in from client 2.
4) On server1 the rmdir succeeds; on server2 it fails because the directory is not empty.
5) On server1, since the rmdir succeeded, the dentry gets unset and inode_unlink happens.
6) Now, on client1, the self-heal fails because the directory no longer exists on server1.
7) Self-heal tries to UNLOCK the self-heal lock (firewire, NULL). This succeeds on server2 but fails on server1: since the gfid was absent in the loc and loc->inode, the server falls back to pure path resolution, which fails because the entry no longer exists.

So on brick1 there is a stale lock on (firewire, NULL). All further operations on this entry that supply the gfid in the inode/loc get blocked, because inode-find still finds the unlinked inode on server1, and that inode carries the stale lock.
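The race described above can be sketched with a tiny model of one brick's lock table (hypothetical names and structures for illustration only, not GlusterFS code). The key point: a lock or unlock sent without a gfid must re-resolve the path, so once the rmdir removes the dentry, the unlock's path resolution fails with ENOENT and the ACTIVE lock is left behind.

```python
import errno

class Brick:
    """Minimal model of one brick: a dentry table plus held entry locks."""
    def __init__(self, dentries):
        self.dentries = set(dentries)   # paths that currently resolve
        self.locks = set()              # held entry locks, keyed by path

    def resolve(self, path):
        # "Pure path resolution": only succeeds while the dentry exists.
        return path in self.dentries

    def entrylk_lock(self, path):
        if not self.resolve(path):
            return -errno.ENOENT
        self.locks.add(path)
        return 0

    def entrylk_unlock(self, path):
        # The bug: with no gfid in the request, the unlock also needs
        # path resolution. If that fails, the held lock is never released.
        if not self.resolve(path):
            return -errno.ENOENT
        self.locks.discard(path)
        return 0

    def rmdir(self, path):
        # rmdir unlinks the dentry; the inode (and any locks on it) lives on.
        self.dentries.discard(path)

server1 = Brick({"firewire"})
assert server1.entrylk_lock("firewire") == 0              # self-heal takes the lock
server1.rmdir("firewire")                                 # rmdir from the other client wins
assert server1.entrylk_unlock("firewire") == -errno.ENOENT  # unlock fails: ENOENT
assert "firewire" in server1.locks                        # stale lock blocks later entrylks
```

With gfid-based resolution (the direction of the fix merged as "protocol/client: Be strict about gfids in fop req"), the unlock could presumably still find the unlinked inode by gfid after the dentry is gone, so the lock would be released instead of leaking.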
CHANGE: http://review.gluster.com/770 (protocol/client: Be strict about gfids in fop req) merged in master by Vijay Bellur (vijay)
This bug was already fixed as part of bug GLUSTER-3760. I added asserts in the code path so it won't happen again.
Verified with glusterfs-3.3.0qa40: rm -rf from two clients no longer hangs.