Bug 922809 - Rename fails in dbench when a brick is taken down.
Summary: Rename fails in dbench when a brick is taken down.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-03-18 15:05 UTC by Pranith Kumar K
Modified: 2014-04-17 19:21 UTC (History)
1 user (show)

Fixed In Version: glusterfs-3.3.2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-04-17 19:21:39 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Pranith Kumar K 2013-03-18 15:05:34 UTC
Description of problem:
Hi,

We just ran into drench dying on one of our test runs.
We execute a dbench each on 2 machines.
We use the following parameters: dbench 6 -t 60 -D $DIRECTORY (host specific, they each write in a separate one)
The directories are on a mountpoint connected using glusterfs 3.3.1 (3.3.1-ubuntu1~lucid8 from https://launchpad.net/~semiosis/+archive/ubuntu-glusterfs-3.3)

This is how dbench died:

I, [2013-03-16T05:34:03.176890 #13121]  INFO -- : [710] rename /mnt/gfs/something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/NEWPCB.PPT /mnt/gfs/something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/PPTB1E4.TMP failed (No such file or directory) - expected NT_STATUS_OK


These are the logs at the time. They are a bit noisy, the matching message is emphasised using *****:

[2013-03-16 05:34:03.082813] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f274
[2013-03-16 05:34:03.082813] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.082813] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/PPTB1E4.TMP (b49d6051-93f6-4eca-b161-865a5bea964b)
[2013-03-16 05:34:03.082813] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f4cc
[2013-03-16 05:34:03.082813] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f6c0
[2013-03-16 05:34:03.082813] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f468
[2013-03-16 05:34:03.082813] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f33c
[2013-03-16 05:34:03.082813] W [client3_1-fops.c:881:client3_1_flush_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.092814] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/ZD16.BMP (73e3b099-48cd-4e76-8049-c64bf8f63500)
[2013-03-16 05:34:03.092814] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/NEWPCB.PPT (ba53fb9f-0648-4794-aaa9-bba9331b52cb)
[2013-03-16 05:34:03.092814] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/PCBENCHM.PPT (a0c96e9a-4d4a-4984-9892-ff0b2ecbb7e3)
[2013-03-16 05:34:03.092814] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client4/~dmtmp/PWRPNT/PPTB1E4.TMP (2b8f1677-6376-4286-a381-8f4897bc9f4a)
[2013-03-16 05:34:03.092814] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f594
[2013-03-16 05:34:03.092814] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f3a0
[2013-03-16 05:34:03.092814] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f2d8
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client4/~dmtmp/PWRPNT/ZD16.BMP (eafa5f6a-fe12-4b9c-a5b9-386f2ff2123f)
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client4/~dmtmp/PWRPNT/NEWPCB.PPT (8c99ede1-3782-49f0-b544-00f4ec3beb9b)
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client4/~dmtmp/PWRPNT/PCBENCHM.PPT (a725ede8-bc10-42a1-9622-55afad13f9f7)
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:881:client3_1_flush_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:881:client3_1_flush_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:881:client3_1_flush_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:881:client3_1_flush_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:881:client3_1_flush_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:881:client3_1_flush_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.112816] W [client3_1-fops.c:881:client3_1_flush_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.132819] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.132819] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.132819] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.132819] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.142820] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/NEWPCB.PPT (ba53fb9f-0648-4794-aaa9-bba9331b52cb)
[2013-03-16 05:34:03.142820] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f788
[2013-03-16 05:34:03.142820] W [client3_1-fops.c:418:client3_1_open_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/NEWPCB.PPT (ba53fb9f-0648-4794-aaa9-bba9331b52cb)
[2013-03-16 05:34:03.142820] W [client3_1-fops.c:881:client3_1_flush_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.142820] W [client3_1-fops.c:2546:client3_1_opendir_cbk] 0-remote9: remote operation failed: No such file or directory. Path: /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT (6512393c-65b8-4d86-ae78-8a12eb2be395)
[2013-03-16 05:34:03.172824] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-remote9: remote operation failed: No such file or directory
[2013-03-16 05:34:03.172824] W [fuse-bridge.c:1516:fuse_rename_cbk] 0-glusterfs-fuse: 11218: /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/NEWPCB.PPT -> /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/PPTB1E4.TMP => -1 (No such file or directory)
[2013-03-16 05:34:03.232831] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f404
[2013-03-16 05:34:03.242832] I [afr-open.c:318:afr_openfd_fix_open_cbk] 0-replicate0: fd for /something.example.com_1363412031/clients/client0/~dmtmp/PWRPNT/PCBENCHM.PPT opened successfully on subvolume remote9
[2013-03-16 05:34:03.252834] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f5f8
[2013-03-16 05:34:03.262835] I [afr-open.c:318:afr_openfd_fix_open_cbk] 0-replicate0: fd for /something.example.com_1363412031/clients/client3/~dmtmp/PWRPNT/PCBENCHM.PPT opened successfully on subvolume remote9

***********************
[2013-03-16 05:34:03.172824] W [fuse-bridge.c:1516:fuse_rename_cbk] 0-glusterfs-fuse: 11218: /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/NEWPCB.PPT -> /something.example.com_1363412031/clients/client2/~dmtmp/PWRPNT/PPTB1E4.TMP => -1 (No such file or directory)
***********************

[2013-03-16 05:34:03.232831] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f404
[2013-03-16 05:34:03.242832] I [afr-open.c:318:afr_openfd_fix_open_cbk] 0-replicate0: fd for /something.example.com_1363412031/clients/client0/~dmtmp/PWRPNT/PCBENCHM.PPT opened successfully on subvolume remote9
[2013-03-16 05:34:03.252834] I [afr-inode-write.c:428:afr_open_fd_fix] 0-replicate0: Opening fd 0x7f1adb67f5f8
[2013-03-16 05:34:03.262835] I [afr-open.c:318:afr_openfd_fix_open_cbk] 0-replicate0: fd for /something.example.com_1363412031/clients/client3/~dmtmp/PWRPNT/PCBENCHM.PPT opened successfully on subvolume remote9
[2013-03-16 05:36:21.547011] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-remote8: server 10.245.15.65:24007 has not responded in the last 42 seconds, disconnecting.
[2013-03-16 05:36:21.547011] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-remote9: server 10.196.239.242:24007 has not responded in the last 42 seconds, disconnecting.
[2013-03-16 05:36:21.547011] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1adab0a048] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1adab09d00] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1adab0976e]))) 0-remote8: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2013-03-16 05:35:10.750385 (xid=0x18942x)
[2013-03-16 05:36:21.547011] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-remote8: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-03-16 05:36:21.547011] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1adab0a048] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1adab09d00] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1adab0976e]))) 0-remote8: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2013-03-16 05:35:18.191110 (xid=0x18943x)
[2013-03-16 05:36:21.547011] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-remote8: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-03-16 05:36:21.547011] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1adab0a048] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1adab09d00] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1adab0976e]))) 0-remote8: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2013-03-16 05:35:39.543151 (xid=0x18944x)

Cheers,
Marc

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Anand Avati 2013-04-10 08:26:36 UTC
COMMIT: http://review.gluster.org/4689 committed in release-3.3 by Vijay Bellur (vbellur) 
------
commit a216f5f44675bfe189c318171dbc50e1c19bfc26
Author: Anand Avati <avati>
Date:   Tue Apr 9 17:22:01 2013 -0700

    tests: fix further issues with bug-874498.t
    
    The failure of bug-874498.t seems to be a "bug" in glustershd.
    The situation seems to be when both subvolumes of a replica are
    "local" to glustershd, and in such cases glustershd is sensitive
    to the order in which the subvols come up.
    
    The core of the issue itself is that, without the patch (#4784),
    self-heal daemon completes the processing of index and no entries
    are left inside the xattrop index after a few seconds of volume
    start force. However with the patch, the stale "backing file"
    (against which index performs link()) is left. The likely reason
    is that an "INDEX" based crawl is not happening against the subvol
    when this patch is applied.
    
    Before #4784 patch, the order in which subvols came up was :
    
      [2013-04-09 22:55:35.117679] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-0: Connected to 10.3.129.13:49156, attached to remote volume '/d/backends/brick1'.
      ...
      [2013-04-09 22:55:35.118399] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-1: Connected to 10.3.129.13:49157, attached to remote volume '/d/backends/brick2'.
    
    However, with the patch, the order is reversed:
    
      [2013-04-09 22:53:34.945370] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-1: Connected to 10.3.129.13:49153, attached to remote volume '/d/backends/brick2'.
      ...
      [2013-04-09 22:53:34.950966] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-0: Connected to 10.3.129.13:49152, attached to remote volume '/d/backends/brick1'.
    
    The index in brick2 has the list of files/gfid to heal. It appears
    to be the case that when brick1 is the first subvol to be detected
    as coming up, somehow an INDEX based crawl is clearing all the
    index entries in brick2, but if brick2 comes up as the first subvol,
    then the backing file is left stale.
    
    Also, doing a "gluster volume heal full" seems to leave out stale
    backing files too. As the crawl is performed on the namespace and
    the backing file is never encountered there to get cleared out.
    
    So the interim (possibly permanent) fix is to have the script issue
    a regular self-heal command (and not a "full" one).
    
    The failure of the script itself is non-critical. The data files are
    all healed, and it is just the backing file which is left behind. The
    stale backing file too gets cleared in the next index based healing,
    either triggered manually or after 10mins.
    
    Change-Id: I5deb79652ef449b7e88684311e804a8a2aa4725d
    BUG: 874498
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/4798
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Jeff Darcy <jdarcy>

Comment 3 Niels de Vos 2014-04-17 19:21:39 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.3.2, please reopen this bug report.

The fix for this bug likely to be included in all future GlusterFS releases i.e. release > 3.3.2. In the same line the recent release i.e. glusterfs-3.4.3 [1] and glusterfs-3.5.0 [2] likely to have the fix. You can verify this by reading the comments in this bug report and checking for comments mentioning "committed in release-3.4" and/or "committed in release-3.5".

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137


Note You need to log in before you can comment on or make changes to this bug.