Bug 811551 - [glusterfs-3.3.0qa33]: rm -rf hung on the mount point
Status: CLOSED WORKSFORME
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: Pranith Kumar K
Depends On:
Blocks:
Reported: 2012-04-11 07:57 EDT by Raghavendra Bhat
Modified: 2012-06-11 07:16 EDT
2 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-11 07:16:59 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Raghavendra Bhat 2012-04-11 07:57:07 EDT
Description of problem:
On a 3-way replica volume, 3 FUSE and 3 NFS clients were running different tests such as ping_pong, fs-perf, rdd, and sanity. After all the tests were stopped, we tried to empty the mount point by running rm -rf simultaneously from all the mount points. The rm -rf hung.

A statedump of one of the hung FUSE clients is attached; statedumps of the other hung clients are similar.

Blocked locks were found on one of the servers.

[xlator.features.locks.mirror-locks.inode]
path=/in
mandatory=0
inodelk-count=21760
lock-dump.domain.domain=mirror-replicate-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=278396928, len=65536, pid = 573247788, owner=2c112b2200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012, granted at Tue Apr 10 20:23:59 2012

inodelk.inodelk[1](ACTIVE)=type=WRITE, whence=0, start=278331392, len=65536, pid = 573216236, owner=ec952a2200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012, granted at Tue Apr 10 20:23:59 2012

inodelk.inodelk[2](ACTIVE)=type=WRITE, whence=0, start=278265856, len=65536, pid = 573184684, owner=ac1a2a2200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012, granted at Tue Apr 10 20:23:59 2012

inodelk.inodelk[3](ACTIVE)=type=WRITE, whence=0, start=278200320, len=65536, pid = 573153132, owner=6c9f292200000000, transport=0x1071440, , blocked at Tue Apr 10 20:23:59 2012, granted at Tue Apr 10 20:23:59 2012

(... intermediate entries elided ...)

inodelk.inodelk[238](ACTIVE)=type=WRITE, whence=0, start=279576576, len=65536, pid = 635224364, owner=2cc1dc2500000000, transport=0x1071440, , blocked at Mon Apr  9 11:19:34 2012, granted at Mon Apr  9 11:19:34 2012

inodelk.inodelk[239](ACTIVE)=type=WRITE, whence=0, start=279511040, len=65536, pid = 635193660, owner=3c49dc2500000000, transport=0x1071440, , blocked at Mon Apr  9 11:19:11 2012, granted at Mon Apr  9 11:19:11 2012

inodelk.inodelk[240](BLOCKED)=type=WRITE, whence=0, start=272171008, len=65536, pid = 462329244, owner=9c958e1b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[241](BLOCKED)=type=WRITE, whence=0, start=272302080, len=65536, pid = 462360668, owner=5c108f1b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[242](BLOCKED)=type=WRITE, whence=0, start=272236544, len=65536, pid = 462392220, owner=9c8b8f1b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[243](BLOCKED)=type=WRITE, whence=0, start=272433152, len=65536, pid = 462423772, owner=dc06901b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[244](BLOCKED)=type=WRITE, whence=0, start=272367616, len=65536, pid = 462455324, owner=1c82901b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[245](BLOCKED)=type=WRITE, whence=0, start=272826368, len=65536, pid = 462486876, owner=5cfd901b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[246](BLOCKED)=type=WRITE, whence=0, start=272760832, len=65536, pid = 462518428, owner=9c78911b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[247](BLOCKED)=type=WRITE, whence=0, start=272695296, len=65536, pid = 462549980, owner=dcf3911b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[248](BLOCKED)=type=WRITE, whence=0, start=272629760, len=65536, pid = 462581532, owner=1c6f921b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[249](BLOCKED)=type=WRITE, whence=0, start=272564224, len=65536, pid = 462613084, owner=5cea921b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[250](BLOCKED)=type=WRITE, whence=0, start=272498688, len=65536, pid = 462644636, owner=9c65931b00000000, transport=0x1071440, , blocked at Wed Apr 11 01:18:35 2012

inodelk.inodelk[251](BLOCKED)=type=WRITE, whence=0, start=272957440, len=65536, pid = 462676188, owner=dce0931b00000000, transport=0x1071440, :


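Counting lock states in a dump this size is easier scripted. A minimal sketch, assuming the inodelk line format shown above; the two sample lines are copied from the excerpt and the lock_states helper is illustrative, not a Gluster tool:

```python
import re
from collections import Counter

# Two illustrative lines in the statedump's inodelk format (not the full dump).
sample = (
    "inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=278396928, "
    "len=65536, pid = 573247788, owner=2c112b2200000000, transport=0x1071440, , "
    "blocked at Tue Apr 10 20:23:59 2012, granted at Tue Apr 10 20:23:59 2012\n"
    "inodelk.inodelk[240](BLOCKED)=type=WRITE, whence=0, start=272171008, "
    "len=65536, pid = 462329244, owner=9c958e1b00000000, transport=0x1071440, , "
    "blocked at Wed Apr 11 01:18:35 2012\n"
)

def lock_states(dump_text):
    """Tally ACTIVE vs BLOCKED inodelk entries in a statedump excerpt."""
    return Counter(re.findall(r"inodelk\.inodelk\[\d+\]\((ACTIVE|BLOCKED)\)", dump_text))

print(lock_states(sample))  # Counter({'ACTIVE': 1, 'BLOCKED': 1})
```

Run against the full dump, this separates the 21760 inodelk entries into granted and queued locks at a glance.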
The blocked locks are on the file /in. Its xattrs, as seen from the different bricks:

10.16.156.9:
getfattr -d -m . -e hex /export/mirror/in
getfattr: Removing leading '/' from absolute path names
# file: export/mirror/in
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.mirror-client-0=0x000000040000000000000000
trusted.afr.mirror-client-1=0x000000040000000000000000
trusted.afr.mirror-client-2=0x000000040000000000000000
trusted.gfid=0x1b3d8252959f4d5496e84b743a3788f7

10.16.156.12:
getfattr -d -m . -e hex /export/mirror/in
getfattr: Removing leading '/' from absolute path names
# file: export/mirror/in
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.mirror-client-0=0x000006c50000000000000000
trusted.afr.mirror-client-1=0x000000000000000000000000
trusted.afr.mirror-client-2=0x000000000000000000000000
trusted.gfid=0x1b3d8252959f4d5496e84b743a3788f7

10.16.156.15:
getfattr -d -m . -e hex /export/mirror/in
getfattr: Removing leading '/' from absolute path names
# file: export/mirror/in
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.mirror-client-0=0x000006c50000000000000000
trusted.afr.mirror-client-1=0x000000000000000000000000
trusted.afr.mirror-client-2=0x000000000000000000000000
trusted.gfid=0x1b3d8252959f4d5496e84b743a3788f7
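As an aside, the trusted.afr.* values above can be decoded mechanically. Assuming the usual AFR changelog layout (three big-endian 32-bit counters for pending data, metadata, and entry operations), brick .9 records 4 pending data operations against every client, while .12 and .15 record 0x6c5 = 1733 against client-0, which is consistent with the split-brain warnings further below. A minimal sketch; decode_afr is an illustrative helper, not a Gluster tool:

```python
def decode_afr(hex_value):
    """Decode a trusted.afr.* changelog xattr: three big-endian 32-bit
    counters for pending data, metadata, and entry operations."""
    raw = bytes.fromhex(hex_value[2:])  # strip the leading "0x"
    data, meta, entry = (int.from_bytes(raw[i:i + 4], "big") for i in (0, 4, 8))
    return {"data": data, "metadata": meta, "entry": entry}

# Values from the bricks above:
print(decode_afr("0x000000040000000000000000"))  # {'data': 4, 'metadata': 0, 'entry': 0}
print(decode_afr("0x000006c50000000000000000"))  # {'data': 1733, 'metadata': 0, 'entry': 0}
```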

Log file entries indicating split-brain on one of the clients:
[2012-04-05 21:18:36.064423] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154211189: LOOKUP() /run2381/pd/l6 => -1 (Input/output error)
[2012-04-05 21:19:00.215765] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possible split-brain
[2012-04-05 21:19:00.215824] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154224473: LOOKUP() /run2381/pd/l8 => -1 (Input/output error)
[2012-04-05 21:19:00.218262] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possible split-brain
[2012-04-05 21:19:00.218387] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154224474: LOOKUP() /run2381/pd/l6 => -1 (Input/output error)
[2012-04-05 23:25:03.194912] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possible split-brain
[2012-04-05 23:25:03.194980] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961907: LOOKUP() /run2381/pd/l8 => -1 (Input/output error)
[2012-04-05 23:25:03.202947] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possible split-brain
[2012-04-05 23:25:03.202987] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961909: LOOKUP() /run2381/pd/l8 => -1 (Input/output error)
[2012-04-05 23:25:03.207872] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possible split-brain
[2012-04-05 23:25:03.207929] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961911: LOOKUP() /run2381/pd/l6 => -1 (Input/output error)
[2012-04-05 23:25:03.210765] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possible split-brain
[2012-04-05 23:25:03.210805] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154961913: LOOKUP() /run2381/pd/l6 => -1 (Input/output error)
[2012-04-05 23:28:04.030693] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l8: Possible split-brain
[2012-04-05 23:28:04.030763] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154972434: LOOKUP() /run2381/pd/l8 => -1 (Input/output error)
[2012-04-05 23:28:04.031765] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possible split-brain
[2012-04-05 23:28:04.031799] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 154972435: LOOKUP() /run2381/pd/l6 => -1 (Input/output error)
[2012-04-10 22:03:17.608616] C [client-handshake.c:135:rpc_client_ping_timer_expired] 0-mirror-client-0: server 10.16.156.9:24009 has not responded in the last 42 seconds, disconnecting.
[2012-04-10 22:03:17.612761] E [rpc-clnt.c:382:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x13c) [0x7f8f125fcc02] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x155) [0x7f8f125fc121] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f) [0x7f8f125fbb67]))) 0-mirror-client-0: forced unwinding frame type(GlusterFS 3.1) op(ENTRYLK(31)) called at 2012-04-10 21:37:02.172024 (xid=0x122886139x)
[2012-04-10 22:03:17.612793] W [client3_1-fops.c:1589:client3_1_entrylk_cbk] 0-mirror-client-0: remote operation failed: Transport endpoint is not connected
[2012-04-10 22:03:17.612911] E [rpc-clnt.c:382:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x13c) [0x7f8f125fcc02] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x155) [0x7f8f125fc121] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f:

[2012-04-05 15:22:26.522674] I [afr-common.c:1178:afr_detect_self_heal_by_iatt] 0-mirror-replicate-0: size differs for /in 
[2012-04-05 15:22:26.522718] I [afr-common.c:1178:afr_detect_self_heal_by_iatt] 0-mirror-replicate-0: size differs for /in 

[2012-04-05 06:38:23.560282] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 119986583: LOOKUP() /run2381/pd/l8 => -1 (Input/output error)
[2012-04-05 06:38:23.561196] W [afr-self-heal-data.c:823:afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: /run2381/pd/l6: Possible split-brain
[2012-04-05 06:38:23.561238] W [fuse-bridge.c:272:fuse_entry_cbk] 0-glusterfs-fuse: 119986584: LOOKUP() /run2381/pd/l6 => -1 (Input/output error)
[2012-04-05 06:38:29.290984] I [afr-common.c:1329:afr_launch_self_heal] 0-mirror-replicate-0: background  data self-heal triggered. path: /in, reason: lookup detected pending operations
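Tallying which paths the client flagged can also be scripted. A minimal sketch, assuming the log format above; the embedded sample lines are copied from the excerpt and split_brain_paths is an illustrative helper:

```python
import re
from collections import Counter

# Illustrative lines in the client log's format.
log = (
    "[2012-04-05 23:25:03.194912] W [afr-self-heal-data.c:823:"
    "afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: "
    "/run2381/pd/l8: Possible split-brain\n"
    "[2012-04-05 23:25:03.207872] W [afr-self-heal-data.c:823:"
    "afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: "
    "/run2381/pd/l6: Possible split-brain\n"
    "[2012-04-05 23:28:04.030693] W [afr-self-heal-data.c:823:"
    "afr_lookup_select_read_child_by_txn_type] 0-mirror-replicate-0: "
    "/run2381/pd/l8: Possible split-brain\n"
)

def split_brain_paths(log_text):
    """Tally paths flagged 'Possible split-brain' in an AFR client log."""
    return Counter(re.findall(r": (\S+): Possible split-brain", log_text))

print(split_brain_paths(log))  # Counter({'/run2381/pd/l8': 2, '/run2381/pd/l6': 1})
```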

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Output of gluster volume info:
 
Volume Name: mirror
Type: Replicate
Volume ID: e6423147-ee12-453f-bcf6-2fb09a9087c5
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.156.9:/export/mirror
Brick2: 10.16.156.12:/export/mirror
Brick3: 10.16.156.15:/export/mirror
Options Reconfigured:
features.lock-heal: off
performance.flush-behind: off
performance.stat-prefetch: off
performance.client-io-threads: on
cluster.self-heal-daemon: on
Comment 1 Pranith Kumar K 2012-05-05 07:02:03 EDT
According to the following log, it seems the server's epoll thread hung.
There was a deadlock bug in statedump; we need to check whether this hang was caused by it.
Let me know if you have the statedumps of the relevant brick.
[client-handshake.c:135:rpc_client_ping_timer_expired] 0-mirror-client-0:
server 10.16.156.9:24009 has not responded in the last 42 seconds,
disconnecting.
Comment 2 Vijay Bellur 2012-05-18 08:45:29 EDT
Not consistently reproducible. Removing it from the blocker list.
Comment 3 Pranith Kumar K 2012-06-11 07:16:59 EDT
Closing the bug, as we are no longer able to recreate the problem.
