Bug 811551 - [glusterfs-3.3.0qa33]: rm -rf hung on the mount point
Summary: [glusterfs-3.3.0qa33]: rm -rf hung on the mount point
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-04-11 11:57 UTC by Raghavendra Bhat
Modified: 2012-06-11 11:16 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-11 11:16:59 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Raghavendra Bhat 2012-04-11 11:57:07 UTC
Description of problem:
3 replica volume. 3 fuse and 3 nfs clients running different tests such as ping_pong, fs-perf test, rdd, sanity etc. All the tests were stopped and tried to empty the mount point by doing rm -rf simultaneously from all the mount points. rm -rf hung.

statedump of one of the hung fuse clients is attached. Statedumps of other hung clients are also similar to the attached one.

Blocked locks are found in one of the servers.

[xlator.features.locks.mirror-locks.inode]
path=/in
mandatory=0
inodelk-count=21760
lock-dump.domain.domain=mirror-replicate-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=278396928, len=65536, pid = 573247788, owner=2c112b2200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012
, granted at Tue Apr 10 20:23:59 2012

inodelk.inodelk[1](ACTIVE)=type=WRITE, whence=0, start=278331392, len=65536, pid = 573216236, owner=ec952a2200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012
, granted at Tue Apr 10 20:23:59 2012

inodelk.inodelk[2](ACTIVE)=type=WRITE, whence=0, start=278265856, len=65536, pid = 573184684, owner=ac1a2a2200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012
, granted at Tue Apr 10 20:23:59 2012

inodelk.inodelk[3](ACTIVE)=type=WRITE, whence=0, start=278200320, len=65536, pid = 573153132, owner=6c9f292200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012
, granted at Tue Apr 10 20:23:59 2012

.
.
.
.
.
.
.
 granted at Mon Apr  9 11:19:34 2012

inodelk.inodelk[238](ACTIVE)=type=WRITE, whence=0, start=279576576, len=65536, pid = 635224364, owner=2cc1dc2500000000, transport=0x1071440, ,
 blocked at Mon Apr  9 11:19:34 2012
, granted at Mon Apr  9 11:19:34 2012

inodelk.inodelk[239](ACTIVE)=type=WRITE, whence=0, start=279511040, len=65536, pid = 635193660, owner=3c49dc2500000000, transport=0x1071440, ,
 blocked at Mon Apr  9 11:19:11 2012
, granted at Mon Apr  9 11:19:11 2012

inodelk.inodelk[240](BLOCKED)=type=WRITE, whence=0, start=272171008, len=65536, pid = 462329244, owner=9c958e1b00000000, transport=0x1071440, 
, blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[241](BLOCKED)=type=WRITE, whence=0, start=272302080, len=65536, pid = 462360668, owner=5c108f1b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[242](BLOCKED)=type=WRITE, whence=0, start=272236544, len=65536, pid = 462392220, owner=9c8b8f1b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[243](BLOCKED)=type=WRITE, whence=0, start=272433152, len=65536, pid = 462423772, owner=dc06901b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[244](BLOCKED)=type=WRITE, whence=0, start=272367616, len=65536, pid = 462455324, owner=1c82901b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[245](BLOCKED)=type=WRITE, whence=0, start=272826368, len=65536, pid = 462486876, owner=5cfd901b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[246](BLOCKED)=type=WRITE, whence=0, start=272760832, len=65536, pid = 462518428, owner=9c78911b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[247](BLOCKED)=type=WRITE, whence=0, start=272695296, len=65536, pid = 462549980, owner=dcf3911b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[248](BLOCKED)=type=WRITE, whence=0, start=272629760, len=65536, pid = 462581532, owner=1c6f921b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[249](BLOCKED)=type=WRITE, whence=0, start=272564224, len=65536, pid = 462613084, owner=5cea921b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[250](BLOCKED)=type=WRITE, whence=0, start=272498688, len=65536, pid = 462644636, owner=9c65931b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[251](BLOCKED)=type=WRITE, whence=0, start=272957440, len=65536, pid = 462676188, owner=dce0931b00000000, transport=0x1071440, :


Blocked locks are on the file /in. Its xattrs from different bricks.

10.16.156.9:
getfattr -d -m . -e hex /export/mirror/in
getfattr: Removing leading '/' from absolute path names
# file: export/mirror/in
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.mirror-client-0=0x000000040000000000000000
trusted.afr.mirror-client-1=0x000000040000000000000000
trusted.afr.mirror-client-2=0x000000040000000000000000
trusted.gfid=0x1b3d8252959f4d5496e84b743a3788f7

10.16.156.12:
getfattr -d -m . -e hex /export/mirror/in
getfattr: Removing leading '/' from absolute path names
# file: export/mirror/in
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.mirror-client-0=0x000006c50000000000000000
trusted.afr.mirror-client-1=0x000000000000000000000000
trusted.afr.mirror-client-2=0x000000000000000000000000
trusted.gfid=0x1b3d8252959f4d5496e84b743a3788f7

10.16.156.15:
getfattr -d -m . -e hex /export/mirror/in
getfattr: Removing leading '/' from absolute path names
# file: export/mirror/in
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.mirror-client-0=0x000006c50000000000000000
trusted.afr.mirror-client-1=0x000000000000000000000000
trusted.afr.mirror-client-2=0x000000000000000000000000
trusted.gfid=0x1b3d8252959f4d5496e84b743a3788f7

log file entries indicating split brain on one of the clients.
[2012-04-05 21:18:36.064423] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154211189: LOOKUP() /run2381/pd/l6 => -1 (Input/output err
or)
[2012-04-05 21:19:00.215765] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possi
ble split-brain
[2012-04-05 21:19:00.215824] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154224473: LOOKUP() /run2381/pd/l8 => -1 (Input/output err
or)
[2012-04-05 21:19:00.218262] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possi
ble split-brain
[2012-04-05 21:19:00.218387] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154224474: LOOKUP() /run2381/pd/l6 => -1 (Input/output err
or)
[2012-04-05 23:25:03.194912] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possi
ble split-brain
[2012-04-05 23:25:03.194980] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961907: LOOKUP() /run2381/pd/l8 => -1 (Input/output err
or)
[2012-04-05 23:25:03.202947] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possi
ble split-brain
[2012-04-05 23:25:03.202987] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961909: LOOKUP() /run2381/pd/l8 => -1 (Input/output err
or)
[2012-04-05 23:25:03.207872] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possi
ble split-brain
[2012-04-05 23:25:03.207929] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961911: LOOKUP() /run2381/pd/l6 => -1 (Input/output err
or)
[2012-04-05 23:25:03.210765] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possi
ble split-brain
[2012-04-05 23:25:03.210805] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961913: LOOKUP() /run2381/pd/l6 => -1 (Input/output err
or)
[2012-04-05 23:28:04.030693] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possible split-brain
[2012-04-05 23:28:04.030763] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154972434: LOOKUP() /run2381/pd/l8 => -1 (Input/output error)
[2012-04-05 23:28:04.031765] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possible split-brain
[2012-04-05 23:28:04.031799] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154972435: LOOKUP() /run2381/pd/l6 => -1 (Input/output error)
[2012-04-10 22:03:17.608616] C [client-handshake.c:135:rpc_client_ping_timer_expired] 0-mirror-client-0: server 10.16.156.9:24009 has not responded in the last 42 seconds, disconnecting.
[2012-04-10 22:03:17.612761] E [rpc-clnt.c:382:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x13c) [0x7f8f125fcc02] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x155) [0x7f8f125fc121] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f) [0x7f8f125fbb67]))) 0-mirror-client-0: forced unwinding frame type(GlusterFS 3.1) op(ENTRYLK(31)) called at 2012-04-10 21:37:02.172024 (xid=0x122886139x)
[2012-04-10 22:03:17.612793] W [client3_1-fops.c:1589:client3_1_entrylk_cbk] 0-mirror-client-0: remote operation failed: Transport endpoint is not connected
[2012-04-10 22:03:17.612911] E [rpc-clnt.c:382:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x13c) [0x7f8f125fcc02] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x155) [0x7f8f125fc121] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f:

[2012-04-05 15:22:26.522674] I [afr-common.c:1178:afr_detect_self_heal_by_iatt] 0-mirror-replicate-0: size differs for /in 
[2012-04-05 15:22:26.522718] I [afr-common.c:1178:afr_detect_self_heal_by_iatt] 0-mirror-replicate-0: size differs for /in 

[2012-04-05 06:38:23.560282] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 119986583: LOOKUP() /run2381/pd/l8 => -1 (Input/output error)
[2012-04-05 06:38:23.561196] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possible split-brain
[2012-04-05 06:38:23.561238] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 119986584: LOOKUP() /run2381/pd/l6 => -1 (Input/output error)
[2012-04-05 06:38:29.290984] I [afr-common.c:1329:afr_launch_self_heal] 0-mirror-replicate-0: background  data self-heal triggered. path: /in, reason: lookup detected pending operations

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
 gluster v i
 
Volume Name: mirror
Type: Replicate
Volume ID: e6423147-ee12-453f-bcf6-2fb09a9087c5
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.156.9:/export/mirror
Brick2: 10.16.156.12:/export/mirror
Brick3: 10.16.156.15:/export/mirror
Options Reconfigured:
features.lock-heal: off
performance.flush-behind: off
performance.stat-prefetch: off
performance.client-io-threads: on
cluster.self-heal-daemon: on
[root@gqas004 ~]#

Comment 1 Pranith Kumar K 2012-05-05 11:02:03 UTC
According to the following log, it seems like the server epoll thread hung.
There was a dead-lock bug in statedump. We need to check if this is due to that.
Let me know if you have the statedumps of the necessary brick.
[client-handshake.c:135:rpc_client_ping_timer_expired] 0-mirror-client-0:
server 10.16.156.9:24009 has not responded in the last 42 seconds,
disconnecting.

Comment 2 Vijay Bellur 2012-05-18 12:45:29 UTC
Not consistently reproducible. Removing it from the blocker list.

Comment 3 Pranith Kumar K 2012-06-11 11:16:59 UTC
Closing the bug as we are not able to re-create the problem anymore.


Note You need to log in before you can comment on or make changes to this bug.