Description of problem: 3 replica volume. 3 fuse and 3 nfs clients running different tests such as ping_pong, fs-perf test, rdd, sanity etc. All the tests were stopped and tried to empty the mount point by doing rm -rf simultaneously from all the mount points. rm -rf hung. statedump of one of the hung fuse clients is attached. Statedumps of other hung clients are also similar to the attached one. Blocked locks are found in one of the servers. [xlator.features.locks.mirror-locks.inode] path=/in mandatory=0 inodelk-count=21760 lock-dump.domain.domain=mirror-replicate-0 inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=278396928, len=65536, pid = 573247788, owner=2c112b2200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012 , granted at Tue Apr 10 20:23:59 2012 inodelk.inodelk[1](ACTIVE)=type=WRITE, whence=0, start=278331392, len=65536, pid = 573216236, owner=ec952a2200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012 , granted at Tue Apr 10 20:23:59 2012 inodelk.inodelk[2](ACTIVE)=type=WRITE, whence=0, start=278265856, len=65536, pid = 573184684, owner=ac1a2a2200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012 , granted at Tue Apr 10 20:23:59 2012 inodelk.inodelk[3](ACTIVE)=type=WRITE, whence=0, start=278200320, len=65536, pid = 573153132, owner=6c9f292200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012 , granted at Tue Apr 10 20:23:59 2012 . . . . . . . granted at Mon Apr 9 11:19:34 2012 inodelk.inodelk[238](ACTIVE)=type=WRITE, whence=0, start=279576576, len=65536, pid = 635224364, owner=2cc1dc2500000000, transport=0x1071440, , blocked at Mon Apr 9 11:19:34 2012 , granted at Mon Apr 9 11:19:34 2012 inodelk.inodelk[239](ACTIVE)=type=WRITE, whence=0, start=279511040, len=65536, pid = 635193660, owner=3c49dc2500000000, transport=0x1071440, , blocked at Mon Apr 9 11:19:11 2012 , granted at Mon Apr 9 11:19:11 2012 inodelk.inodelk[240](BLOCKED)=type=WRITE, whence=0, start=272171008, len=65536, pid = 462329244, owner=9c958e1b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[241](BLOCKED)=type=WRITE, whence=0, start=272302080, len=65536, pid = 462360668, owner=5c108f1b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[242](BLOCKED)=type=WRITE, whence=0, start=272236544, len=65536, pid = 462392220, owner=9c8b8f1b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[243](BLOCKED)=type=WRITE, whence=0, start=272433152, len=65536, pid = 462423772, owner=dc06901b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[244](BLOCKED)=type=WRITE, whence=0, start=272367616, len=65536, pid = 462455324, owner=1c82901b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[245](BLOCKED)=type=WRITE, whence=0, start=272826368, len=65536, pid = 462486876, owner=5cfd901b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[246](BLOCKED)=type=WRITE, whence=0, start=272760832, len=65536, pid = 462518428, owner=9c78911b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[247](BLOCKED)=type=WRITE, whence=0, start=272695296, len=65536, pid = 462549980, owner=dcf3911b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[248](BLOCKED)=type=WRITE, whence=0, start=272629760, len=65536, pid = 462581532, owner=1c6f921b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[249](BLOCKED)=type=WRITE, whence=0, start=272564224, len=65536, pid = 462613084, owner=5cea921b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[250](BLOCKED)=type=WRITE, whence=0, start=272498688, len=65536, pid = 462644636, owner=9c65931b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012 inodelk.inodelk[251](BLOCKED)=type=WRITE, whence=0, start=272957440, len=65536, pid = 462676188, owner=dce0931b00000000, transport=0x1071440, : Blocked locks are on the file /in. Its xattrs from different bricks. 10.16.156.9: getfattr -d -m . -e hex /export/mirror/in getfattr: Removing leading '/' from absolute path names # file: export/mirror/in security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000 trusted.afr.mirror-client-0=0x000000040000000000000000 trusted.afr.mirror-client-1=0x000000040000000000000000 trusted.afr.mirror-client-2=0x000000040000000000000000 trusted.gfid=0x1b3d8252959f4d5496e84b743a3788f7 10.16.156.12: getfattr -d -m . -e hex /export/mirror/in getfattr: Removing leading '/' from absolute path names # file: export/mirror/in security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000 trusted.afr.mirror-client-0=0x000006c50000000000000000 trusted.afr.mirror-client-1=0x000000000000000000000000 trusted.afr.mirror-client-2=0x000000000000000000000000 trusted.gfid=0x1b3d8252959f4d5496e84b743a3788f7 10.16.156.15: getfattr -d -m . -e hex /export/mirror/in getfattr: Removing leading '/' from absolute path names # file: export/mirror/in security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000 trusted.afr.mirror-client-0=0x000006c50000000000000000 trusted.afr.mirror-client-1=0x000000000000000000000000 trusted.afr.mirror-client-2=0x000000000000000000000000 trusted.gfid=0x1b3d8252959f4d5496e84b743a3788f7 log file entries indicating split brain on one of the clients. [2012-04-05 21:18:36.064423] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154211189: LOOKUP() /run2381/pd/l6 => -1 (Input/output err or) [2012-04-05 21:19:00.215765] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possi ble split-brain [2012-04-05 21:19:00.215824] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154224473: LOOKUP() /run2381/pd/l8 => -1 (Input/output err or) [2012-04-05 21:19:00.218262] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possi ble split-brain [2012-04-05 21:19:00.218387] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154224474: LOOKUP() /run2381/pd/l6 => -1 (Input/output err or) [2012-04-05 23:25:03.194912] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possi ble split-brain [2012-04-05 23:25:03.194980] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961907: LOOKUP() /run2381/pd/l8 => -1 (Input/output err or) [2012-04-05 23:25:03.202947] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possi ble split-brain [2012-04-05 23:25:03.202987] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961909: LOOKUP() /run2381/pd/l8 => -1 (Input/output err or) [2012-04-05 23:25:03.207872] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possi ble split-brain [2012-04-05 23:25:03.207929] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961911: LOOKUP() /run2381/pd/l6 => -1 (Input/output err or) [2012-04-05 23:25:03.210765] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possi ble split-brain [2012-04-05 23:25:03.210805] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961913: LOOKUP() /run2381/pd/l6 => -1 (Input/output err or) [2012-04-05 23:28:04.030693] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possible split-brain [2012-04-05 23:28:04.030763] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154972434: LOOKUP() /run2381/pd/l8 => -1 (Input/output error) [2012-04-05 23:28:04.031765] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possible split-brain [2012-04-05 23:28:04.031799] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154972435: LOOKUP() /run2381/pd/l6 => -1 (Input/output error) [2012-04-10 22:03:17.608616] C [client-handshake.c:135:rpc_client_ping_timer_expired] 0-mirror-client-0: server 10.16.156.9:24009 has not responded in the last 42 seconds, disconnecting. [2012-04-10 22:03:17.612761] E [rpc-clnt.c:382:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x13c) [0x7f8f125fcc02] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x155) [0x7f8f125fc121] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f) [0x7f8f125fbb67]))) 0-mirror-client-0: forced unwinding frame type(GlusterFS 3.1) op(ENTRYLK(31)) called at 2012-04-10 21:37:02.172024 (xid=0x122886139x) [2012-04-10 22:03:17.612793] W [client3_1-fops.c:1589:client3_1_entrylk_cbk] 0-mirror-client-0: remote operation failed: Transport endpoint is not connected [2012-04-10 22:03:17.612911] E [rpc-clnt.c:382:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x13c) [0x7f8f125fcc02] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x155) [0x7f8f125fc121] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f: [2012-04-05 15:22:26.522674] I [afr-common.c:1178:afr_detect_self_heal_by_iatt] 0-mirror-replicate-0: size differs for /in [2012-04-05 15:22:26.522718] I [afr-common.c:1178:afr_detect_self_heal_by_iatt] 0-mirror-replicate-0: size differs for /in [2012-04-05 06:38:23.560282] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 119986583: LOOKUP() /run2381/pd/l8 => -1 (Input/output error) [2012-04-05 06:38:23.561196] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possible split-brain [2012-04-05 06:38:23.561238] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 119986584: LOOKUP() /run2381/pd/l6 => -1 (Input/output error) [2012-04-05 06:38:29.290984] I [afr-common.c:1329:afr_launch_self_heal] 0-mirror-replicate-0: background data self-heal triggered. path: /in, reason: lookup detected pending operations Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: gluster v i Volume Name: mirror Type: Replicate Volume ID: e6423147-ee12-453f-bcf6-2fb09a9087c5 Status: Started Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: 10.16.156.9:/export/mirror Brick2: 10.16.156.12:/export/mirror Brick3: 10.16.156.15:/export/mirror Options Reconfigured: features.lock-heal: off performance.flush-behind: off performance.stat-prefetch: off performance.client-io-threads: on cluster.self-heal-daemon: on [root@gqas004 ~]#
According to the following log, it seems like the server epoll thread hung. There was a dead-lock bug in statedump. We need to check if this is due to that. Let me know if you have the statedumps of the necessary brick. [client-handshake.c:135:rpc_client_ping_timer_expired] 0-mirror-client-0: server 10.16.156.9:24009 has not responded in the last 42 seconds, disconnecting.
Not consistently reproducible. Removing it from the blocker list.
Closing the bug as we are not able to re-create the problem anymore.