Description of problem:
Replicate volume with replica count 2; one more replica copy was then added to the volume, making the replica count 3. One FUSE client and one NFS client: the FUSE client was running rdd and the NFS client was running a sanity script which runs different tools. rdd hung.

The server statedump indicates there are 10 active locks and 1 blocked lock:

[root@node131 ~]# grep ACTIVE /tmp/export-xfs-mirror.17583.dump | sort | uniq
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0, pid=553248952, lk-owner=ffffff20047f0000, transport=0x734fc90, granted at Fri Feb 3 02
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0, pid=553255204, lk-owner=2401ff20047f0000, transport=0x734fc90, granted at Fri Feb 3 02
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0, pid=553255204, lk-owner=2401ff20047f0000, transport=0x734fc90, granted at Fri Feb 3 02
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0, pid=553267708, lk-owner=ff31ff20047f0000, transport=0x734fc90, granted at Fri Feb 3 02
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0, pid=553269792, lk-owner=203aff20047f0000, transport=0x734fc90, granted at Fri Feb 3 02
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0, pid=553292716, lk-owner=ffffff20047f0000, transport=0x734fc90, granted at Fri Feb 3 02
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0, pid=553326060, lk-owner=ff15ff20047f0000, transport=0x734fc90, granted at Fri Feb 3 02
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0, pid=553326060, lk-owner=ff15ff20047f0000, transport=0x734fc90, granted at Fri Feb 3 02
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=0, len=0, pid=553344816, lk-owner=305fff20047f0000, transport=0x734fc90, granted at Fri Feb 3 02
inodelk.inodelk[0](ACTIVE)=type=WRITE, start=3841196032, len=131072, pid=18446744073709551615, lk-owner=4cff70ffff7f0000, transport=0x95e350,2

[root@node131 ~]# grep ACTIVE /tmp/export-xfs-mirror.17583.dump | sort | uniq | wc -l
10

And the blocked lock:

inodelk.inodelk[1](BLOCKED)=type=WRITE, start=0, len=0, pid=553346900, lk-owner=5467ff20047f0000, transport=0x734fc90, blocked at Fri Feb 02

The transport of the last active lock (inodelk.inodelk[0](ACTIVE)=type=WRITE, start=3841196032, len=131072, pid=18446744073709551615, lk-owner=4cff70ffff7f0000, transport=0x95e350,2) has been freed. It seems to be a race in this part of the code:

        LOCK (&ltable->lock);
        {
                list_splice_init (&ltable->inodelk_lockers, &inodelk_lockers);
                list_splice_init (&ltable->entrylk_lockers, &entrylk_lockers);
        }
        UNLOCK (&ltable->lock);

        GF_FREE (ltable);

After splicing the lists of inodelks and entrylks and then unlocking the ltable, new locks might be added to the table before the next statement (freeing the ltable) runs. Those lockers are lost with the freed table, leading to other locks getting blocked forever.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The contents of the freed transport:

(gdb) p (server_connection_t *)0x95e350
$4 = (server_connection_t *) 0x95e350
(gdb) p *$4
$5 = {list = {next = 0x0, prev = 0x0}, id = 0x0, ref = 33, lock = {__data = {__lock = 553255204, __count = 32516, __owner = -1034375448, __nusers = 52, __kind = 32, __spins = 0, __list = {__prev = 0x31, __next = 0x100000000}}, __size = "$\001\372 \004\177\000\000\350\256X\302\064\000\000\000 \000\000\000\000\000\000\000\061", '\000' <repeats 11 times>, "\001\000}
(gdb) p *$4->ltable
Cannot access memory at address 0x600000004
(gdb) p *$4->xl
There is no member named xl.
(gdb) p *$4->bound_xl
Cannot access memory at address 0xa
(gdb)

And this is the statedump information showing the number of allocations of the locker structure, which is the same as the number of active locks:

[protocol/server.mirror-server - usage-type 88 memusage]
size=11280
num_allocs=10
max_size=4104792
max_num_allocs=3639
total_allocs=2065069
KP, could you, Pranith, or Junaid have a look at it? Let me know if it's still valid.
CHANGE: http://review.gluster.com/2957 (protocol/server: Avoid race in add/del locker, connection_cleanup) merged in master by Anand Avati (avati)
Checked with glusterfs-3.3.0qa40. This does not happen now.