After some tests on glusterfs-3.3.0qa5 upgraded to the latest git pull, I killed all glusterd, glusterfs, and glusterfsd processes and restarted glusterd. There were two mount points, both FUSE. I tried to empty the volume by running rm -rf on both mount points. One of the rm -rf invocations hung, and on the other, rm -rf kept reporting that it was removing the same directory every time.

A statedump of the servers indicated blocked entrylks (the same blocked lock appears four times in the dump):

xlator.feature.locks.lock-dump.domain.entrylk.entrylk[1](BLOCKED)= ENTRYLK_WRLCK on net.c owner=47512409627632, transport=0xf2cba0, state = Blocked

The logs indicate that the entries of a directory failed to get self-healed. afr might have held the lock on the directory and not unlocked it after its entries failed to self-heal.
(In reply to comment #0)
Looking at the logs, this seems to be the issue:

*) The entrylk was sent without a gfid:
[2011-08-22 08:03:28.420914] I [server-resolve.c:571:server_resolve] 0-mirror-server: pure path resolution for /playground/run14600/linux-2.6.31.1/drivers/firewire (ENTRYLK)

*) The entrylk failed on the server:
[2011-08-22 08:03:28.451662] I [server3_1-fops.c:341:server_entrylk_cbk] 0-mirror-server: 77698: ENTRYLK /playground/run14600/linux-2.6.31.1/drivers/firewire (0) ==> -1 (No such file or directory)

*) /75/glusterfs/client-.log:[2011-08-22 08:03:28.451776] E [afr-lk-common.c:668:afr_unlock_entrylk_cbk] 0-mirror-replicate-0: /playground/run14600/linux-2.6.31.1/drivers/firewire: unlock failed on 0, reason: No such file or directory

*) According to the client log, the entrylk that failed is the unlock.

*) The client on 76 has entrylk failures on this entry after 08:03:28.

*) brick-0 has the following stale lock:
[xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1]
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.gfid=9835035b-e6a7-400c-b138-7c2bd8d873cf
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.nlookup=2
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.ref=3
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.ino=-1978601478
xlator.protocol.server.conn.1.bound_xl./export/mirror.active.1.ia_type=2
xlator.feature.locks.inode.-1978601478.mandatory=0
xlator.feature.locks.entrylk-count.-1978601478.entrylk-count=2
xlator.feature.locks.lock-dump.domain.domain=mirror-replicate-0
xlator.feature.locks.lock-dump.domain.entrylk.entrylk[0](ACTIVE)= ENTRYLK_WRLCK on (null) owner=47512409645496, transport=0xf2cba0
xlator.feature.locks.lock-dump.domain.entrylk.entrylk[1](BLOCKED)= ENTRYLK_WRLCK on net.c owner=47512409627632, transport=0xf2cba0, state = Blocked

*) Observe that there is a blocked lock contending for net.c:

pranithk @ ~/code-reading/kernel
15:26:36 :) $ find . -name net.c
./drivers/vhost/net.c
./drivers/staging/pohmelfs/net.c
./drivers/firewire/net.c <<--------
./net/tipc/net.c

All this happened because of the entrylk without a gfid. So most probably the entrylk lock happened with one gfid whereas the unlock happened with another gfid.
(In reply to comment #1)
This confirms my assumption. Looking more at the client log on 75: entry self-heal is triggered on the same directory twice in a row. The first self-heal starts failing after the entry lock is taken, and its unlock fails; the second self-heal then fails because it cannot acquire the locks.

First self-heal:
[2011-08-22 08:03:28.417360] I [afr-common.c:1225:afr_launch_self_heal] 0-mirror-replicate-0: background meta-data entry self-heal triggered. path: /playground/run14600/linux-2.6.31.1/drivers/firewire
[2011-08-22 08:03:28.447512] E [client3_1-fops.c:232:client3_1_mknod_cbk] 0-mirror-client-0: remote operation failed: No such file or directory
[2011-08-22 08:03:28.447553] E [afr-self-heal-entry.c:1108:afr_sh_entry_impunge_newfile_cbk] 0-mirror-replicate-0: creation of /playground/run14600/linux-2.6.31.1/drivers/firewire/core-device.c on mirror-client-0 failed (No such file or directory)
....
....
heal-entry.c:130:afr_sh_entry_erase_pending_cbk] 0-mirror-replicate-0: /playground/run14600/linux-2.6.31.1/drivers/firewire: failed to erase pending xattrs on mirror-client-0 (No such file or directory)
[2011-08-22 08:03:28.451750] I [client3_1-fops.c:1303:client3_1_entrylk_cbk] 0-mirror-client-0: remote operation failed: No such file or directory
[2011-08-22 08:03:28.451776] E [afr-lk-common.c:668:afr_unlock_entrylk_cbk] 0-mirror-replicate-0: /playground/run14600/linux-2.6.31.1/drivers/firewire: unlock failed on 0, reason: No such file or directory
[2011-08-22 08:03:28.451831] I [afr-self-heal-common.c:2022:afr_self_heal_completion_cbk] 0-mirror-replicate-0: background meta-data entry self-heal completed on /playground/run14600/linux-2.6.31.1/drivers/firewire

Second self-heal:
[2011-08-22 08:03:28.452415] I [afr-dir-read.c:124:afr_examine_dir_readdir_cbk] 0-mirror-replicate-0: /playground/run14600/linux-2.6.31.1/drivers/firewire: failed to do opendir on mirror-client-0
[2011-08-22 08:03:28.452876] I [afr-dir-read.c:175:afr_examine_dir_readdir_cbk] 0-mirror-replicate-0: entry self-heal triggered. path: /playground/run14600/linux-2.6.31.1/drivers/firewire, reason: checksums of directory differ, forced merge option set
[2011-08-22 08:03:28.452909] I [afr-common.c:1225:afr_launch_self_heal] 0-mirror-replicate-0: background entry self-heal triggered. path: /playground/run14600/linux-2.6.31.1/drivers/firewire
[2011-08-22 08:03:28.453344] E [afr-self-heal-entry.c:2319:afr_sh_post_nonblocking_entry_cbk] 0-mirror-replicate-0: Non Blocking entrylks failed for /playground/run14600/linux-2.6.31.1/drivers/firewire.
[2011-08-22 08:03:28.453381] E [afr-self-heal-common.c:2019:afr_self_heal_completion_cbk] 0-mirror-replicate-0: background entry self-heal failed on /playground/run14600/linux-2.6.31.1/drivers/firewire
(In reply to comment #2)
Here is what is happening:

1) On client 1, entry self-heal is triggered on the directory 'firewire'.
2) Self-heal acquires the entry lock on (firewire, NULL).
3) Now an rmdir for this directory comes in from client 2.
4) On server1 the rmdir succeeds; on server2 it fails because the directory is not empty.
5) On server1, since the rmdir succeeded, the dentry gets unset and inode_unlink happens.
6) Now, on client1, the self-heal fails because the directory no longer exists on server1.
7) Self-heal tries to UNLOCK the self-heal lock (firewire, NULL). This succeeds on server2 but fails on server1: since the gfid was absent in the loc and loc->inode, the server falls back to pure path resolution, which fails because the entry no longer exists.

So on brick1 there is a stale lock on (firewire, NULL). All further operations on this entry that supply the gfid in the inode/loc get blocked, because inode-find still finds the unlinked inode on server1, and that inode carries the stale lock.
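The race described above can be sketched with a tiny model of one brick's lock table (hypothetical names and structures for illustration only, not GlusterFS code). The key point: a lock or unlock sent without a gfid must re-resolve the path, so once the rmdir removes the dentry, the unlock's path resolution fails with ENOENT and the ACTIVE lock is left behind.

```python
import errno

class Brick:
    """Minimal model of one brick: a dentry table plus held entry locks."""
    def __init__(self, dentries):
        self.dentries = set(dentries)   # paths that currently resolve
        self.locks = set()              # held entry locks, keyed by path

    def resolve(self, path):
        # "Pure path resolution": only succeeds while the dentry exists.
        return path in self.dentries

    def entrylk_lock(self, path):
        if not self.resolve(path):
            return -errno.ENOENT
        self.locks.add(path)
        return 0

    def entrylk_unlock(self, path):
        # The bug: with no gfid in the request, the unlock also needs
        # path resolution. If that fails, the held lock is never released.
        if not self.resolve(path):
            return -errno.ENOENT
        self.locks.discard(path)
        return 0

    def rmdir(self, path):
        # rmdir unlinks the dentry; the inode (and any locks on it) lives on.
        self.dentries.discard(path)

server1 = Brick({"firewire"})
assert server1.entrylk_lock("firewire") == 0              # self-heal takes the lock
server1.rmdir("firewire")                                 # rmdir from the other client wins
assert server1.entrylk_unlock("firewire") == -errno.ENOENT  # unlock fails: ENOENT
assert "firewire" in server1.locks                        # stale lock blocks later entrylks
```

With gfid-based resolution (the direction of the fix merged as "protocol/client: Be strict about gfids in fop req"), the unlock could presumably still find the unlinked inode by gfid after the dentry is gone, so the lock would be released instead of leaking.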
CHANGE: http://review.gluster.com/770 (protocol/client: Be strict about gfids in fop req) merged in master by Vijay Bellur (vijay)
This bug was already fixed as part of bug GLUSTER-3760. I added asserts in the code path so it won't happen again.
Verified with glusterfs-3.3.0qa40: rm -rf from two clients no longer hangs.