Bug 1139535 - nfs: coredump for lookup after trying to resolve gfid mismatch for a file
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-nfs
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Importance: urgent high
Target Milestone: ---
Target Release: RHGS 3.0.1
Assignee: Shyamsundar
QA Contact: Saurabh
Depends On: 1139213 1142654
Blocks: 1139774
Reported: 2014-09-09 07:21 UTC by Saurabh
Modified: 2016-01-19 06:14 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.6.0.29-1
Doc Type: Bug Fix
Last Closed: 2014-09-29 07:46:39 UTC


Attachments
coredump of nfs server (3.27 MB, application/octet-stream), attached 2014-09-09 07:21 UTC by Saurabh


Links
Red Hat Product Errata RHBA-2014:1315 (normal, SHIPPED_LIVE): Red Hat Storage 3.0 enhancement and bug fix update #1, last updated 2014-09-29 11:46:00 UTC

Description Saurabh 2014-09-09 07:21:04 UTC
Created attachment 935559 [details]
coredump of nfs server

Description of problem:
I tried to resolve a gfid mismatch for a file and then ran "ls" on that same file. The ls command hung and the NFS server crashed.

gluster volume info,
[root@nfs4 ~]# gluster volume info dist-rep
 
Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 9a80c514-ddd4-45cb-b2b6-d745fd438db8
Status: Started
Snap Volume: no
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.156:/bricks/d1r1
Brick2: 10.70.37.81:/bricks/d1r2
Brick3: 10.70.37.50:/bricks/d2r1
Brick4: 10.70.37.95:/bricks/d2r2
Brick5: 10.70.37.156:/bricks/d3r1
Brick6: 10.70.37.81:/bricks/d3r2
Brick7: 10.70.37.50:/bricks/d4r1
Brick8: 10.70.37.95:/bricks/d4r2
Brick9: 10.70.37.156:/bricks/d5r1
Brick10: 10.70.37.81:/bricks/d5r2
Brick11: 10.70.37.50:/bricks/d6r1
Brick12: 10.70.37.95:/bricks/d6r2
Brick13: 10.70.37.156:/bricks/d1r1-add3
Brick14: 10.70.37.81:/bricks/d1r2-add3
Options Reconfigured:
features.quota-deem-statfs: on
features.quota: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable



The procedure followed to resolve the gfid mismatch was as follows (the exact commands appear in comment 2 below):

1. Find the gfid on both bricks of the replica pair using the getfattr command.

2. If the gfid differs between the bricks, then on one brick of the pair:
   a. first delete the gfid entry from the .glusterfs/indices/xattrop directory;
   b. then delete the file itself from the same brick where step 2a was executed.

3. Check the gfid again using getfattr; the gfid now matches on both bricks.


After the above steps, I ran "ls <filename>" on the mount point.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.28-1.el6rhs.x86_64

How reproducible:
Seen once so far.

Actual results:

nfs.log,

signal received: 11
time of crash: 
2014-09-08 20:11:48
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.28
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x391c21ff06]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x391c23a59f]
/lib64/libc.so.6[0x3c6be326b0]
/usr/lib64/glusterfs/3.6.0.28/xlator/nfs/server.so(nfs_fop_lookup_cbk+0x225)[0x7fc08287f2e5]
/usr/lib64/glusterfs/3.6.0.28/xlator/debug/io-stats.so(io_stats_lookup_cbk+0x113)[0x7fc082ad5e73]
/usr/lib64/glusterfs/3.6.0.28/xlator/cluster/distribute.so(dht_lookup_everywhere_done+0x668)[0x7fc082f1cba8]
/usr/lib64/glusterfs/3.6.0.28/xlator/cluster/distribute.so(dht_lookup_everywhere_cbk+0x403)[0x7fc082f215b3]
/usr/lib64/glusterfs/3.6.0.28/xlator/cluster/replicate.so(afr_lookup_cbk+0x558)[0x7fc0831a4438]
/usr/lib64/glusterfs/3.6.0.28/xlator/protocol/client.so(client3_3_lookup_cbk+0x647)[0x7fc0833e3267]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x391ca0e9c5]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f)[0x391ca0fe4f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x391ca0b668]
/usr/lib64/glusterfs/3.6.0.28/rpc-transport/socket.so(+0x9275)[0x7fc08839e275]
/usr/lib64/glusterfs/3.6.0.28/rpc-transport/socket.so(+0xac5d)[0x7fc08839fc5d]
/usr/lib64/libglusterfs.so.0[0x391c276367]
/usr/sbin/glusterfs(main+0x603)[0x407e93]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3c6be1ed5d]
/usr/sbin/glusterfs[0x4049a9]


(gdb) bt
#0  nfs_fop_lookup_cbk (frame=0x7fc09085b740, cookie=0x1f38790, this=<value optimized out>, op_ret=<value optimized out>, 
    op_errno=<value optimized out>, inode=<value optimized out>, buf=0x7fc07fd7916c, xattr=0x2919ecc, postparent=0x7fc07fd7939c) at nfs-fops.c:427
#1  0x00007fc082ad5e73 in io_stats_lookup_cbk (frame=0x7fc090a2635c, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=0, 
    inode=0x0, buf=0x7fc07fd7916c, xdata=0x2919ecc, postparent=0x7fc07fd7939c) at io-stats.c:1505
#2  0x00007fc082f1cba8 in dht_lookup_everywhere_done (frame=0x7fc090a243c8, this=<value optimized out>) at dht-common.c:1370
#3  0x00007fc082f215b3 in dht_lookup_everywhere_cbk (frame=0x7fc090a243c8, cookie=<value optimized out>, this=0x1f34bc0, 
    op_ret=<value optimized out>, op_errno=<value optimized out>, inode=0x7fc081be3b98, buf=0x7fc07f36ba58, xattr=0x2919ecc, 
    postparent=0x7fc07f36bac8) at dht-common.c:1577
#4  0x00007fc0831a4438 in afr_lookup_done (frame=0x7fbfffffffc6, cookie=0x7fff5390a5f8, this=0x1f332e0, op_ret=<value optimized out>, 
    op_errno=43234112, inode=0x8, buf=0x7fff5390a850, xattr=0x2919ecc, postparent=0x7fff5390a7e0) at afr-common.c:2319
#5  afr_lookup_cbk (frame=0x7fbfffffffc6, cookie=0x7fff5390a5f8, this=0x1f332e0, op_ret=<value optimized out>, op_errno=43234112, inode=0x8, 
    buf=0x7fff5390a850, xattr=0x2919ecc, postparent=0x7fff5390a7e0) at afr-common.c:2560
#6  0x00007fc0833e3267 in client3_3_lookup_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, 
    myframe=0x7fc090a37424) at client-rpc-fops.c:2767
#7  0x000000391ca0e9c5 in rpc_clnt_handle_reply (clnt=0x1feefd0, pollin=0x29091b0) at rpc-clnt.c:763
#8  0x000000391ca0fe4f in rpc_clnt_notify (trans=<value optimized out>, mydata=0x1fef000, event=<value optimized out>, data=<value optimized out>)
    at rpc-clnt.c:891
#9  0x000000391ca0b668 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:512
#10 0x00007fc08839e275 in socket_event_poll_in (this=0x1ffebd0) at socket.c:2127
#11 0x00007fc08839fc5d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1ffebd0, poll_in=1, poll_out=0, 
    poll_err=0) at socket.c:2240
#12 0x000000391c276367 in event_dispatch_epoll_handler (event_pool=0x1efb4d0) at event-epoll.c:384
#13 event_dispatch_epoll (event_pool=0x1efb4d0) at event-epoll.c:445
#14 0x0000000000407e93 in main (argc=11, argv=0x7fff5390bce8) at glusterfsd.c:2023


Expected results:
nfs should not have crashed.

Additional info:

coredump is attached

Comment 2 Saurabh 2014-09-09 07:35:13 UTC
The file in question is fileOM70. Below are the operations performed to resolve the gfid mismatch, plus one check afterwards: once the operations are done on node1, the gfid for fileOM70 is the same on both nodes.


node1,

[root@nfs1 xattrop]# getfattr -m . -e hex -d /bricks/d5r1/qa1dir/dir1/fileOM70
getfattr: Removing leading '/' from absolute path names
# file: bricks/d5r1/qa1dir/dir1/fileOM70
trusted.afr.dist-rep-client-8=0x000000000000000000000000
trusted.afr.dist-rep-client-9=0x000000030000000500000000
trusted.gfid=0x863e04a5d3f84fd1b051aa2af8f01629
trusted.glusterfs.dht.linkto=0x646973742d7265702d7265706c69636174652d3500
trusted.glusterfs.quota.f44114ea-d488-4e03-9e78-319f2782e6cc.contri=0x0000000000000000
trusted.pgfid.f44114ea-d488-4e03-9e78-319f2782e6cc=0x00000001

[root@nfs1 xattrop]# 
[root@nfs1 xattrop]# ls 86
863e04a5-d3f8-4fd1-b051-aa2af8f01629  86f3261a-d704-48e0-84ac-6d228116d679  
[root@nfs1 xattrop]# cd 86
-bash: cd: 86: No such file or directory
[root@nfs1 xattrop]# ls -l 863e04a5-d3f8-4fd1-b051-aa2af8f01629
---------- 129 root root 0 Sep  8 01:59 863e04a5-d3f8-4fd1-b051-aa2af8f01629
[root@nfs1 xattrop]# rm 863e04a5-d3f8-4fd1-b051-aa2af8f01629
rm: remove regular empty file `863e04a5-d3f8-4fd1-b051-aa2af8f01629'? y
[root@nfs1 xattrop]# cd /bricks/d5r1/qa1dir/dir1/
[root@nfs1 dir1]# rm fileOM70
rm: remove regular file `fileOM70'? y
[root@nfs1 dir1]# pwd
/bricks/d5r1/qa1dir/dir1
[root@nfs1 dir1]# getfattr -m . -e hex -d fileOM70
# file: fileOM70
trusted.afr.dist-rep-client-8=0x000000000000000000000000
trusted.afr.dist-rep-client-9=0x000000000000000000000000
trusted.gfid=0x4cebf288b3ee4b7a8b362cb348ac6a08
trusted.glusterfs.quota.f44114ea-d488-4e03-9e78-319f2782e6cc.contri=0x0000000000000000
trusted.pgfid.f44114ea-d488-4e03-9e78-319f2782e6cc=0x00000001






node2,


[root@nfs2 dir1]# getfattr -m . -e hex -d fileOM70
# file: fileOM70
trusted.afr.dist-rep-client-8=0x000000010000000100000000
trusted.afr.dist-rep-client-9=0x000000000000000000000000
trusted.gfid=0x4cebf288b3ee4b7a8b362cb348ac6a08
trusted.glusterfs.dht.linkto=0x646973742d7265702d7265706c69636174652d3500
trusted.glusterfs.quota.f44114ea-d488-4e03-9e78-319f2782e6cc.contri=0x0000000000000000
trusted.pgfid.f44114ea-d488-4e03-9e78-319f2782e6cc=0x00000001



Sharing this to give more information.

Comment 3 Saurabh 2014-09-09 10:13:47 UTC
(gdb) 
#0  nfs_fop_lookup_cbk (frame=0x7f48afdf76b4, cookie=0x20a04f0, this=<value optimized out>, op_ret=<value optimized out>, 
    op_errno=<value optimized out>, inode=<value optimized out>, buf=0x7f489f3fa1f4, xattr=0x7f48af9c33f8, postparent=0x7f489f3fa424)
    at nfs-fops.c:427
#1  0x00007f48a20d4e73 in io_stats_lookup_cbk (frame=0x7f48affc6000, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=0, 
    inode=0x0, buf=0x7f489f3fa1f4, xdata=0x7f48af9c33f8, postparent=0x7f489f3fa424) at io-stats.c:1505
#2  0x00007f48a251bba8 in dht_lookup_everywhere_done (frame=0x7f48affc5dfc, this=<value optimized out>) at dht-common.c:1370
#3  0x00007f48a25205b3 in dht_lookup_everywhere_cbk (frame=0x7f48affc5dfc, cookie=<value optimized out>, this=0x209c920, 
    op_ret=<value optimized out>, op_errno=<value optimized out>, inode=0x7f48a1104268, buf=0x7f489e92a444, xattr=0x7f48af9c33f8, 
    postparent=0x7f489e92a4b4) at dht-common.c:1577
#4  0x00007f48a27a3438 in afr_lookup_done (frame=0x7f47ffffffc6, cookie=0x7fff9396f418, this=0x209b040, op_ret=<value optimized out>, 
    op_errno=36828304, inode=0x8, buf=0x7fff9396f670, xattr=0x7f48af9c33f8, postparent=0x7fff9396f600) at afr-common.c:2319
#5  afr_lookup_cbk (frame=0x7f47ffffffc6, cookie=0x7fff9396f418, this=0x209b040, op_ret=<value optimized out>, op_errno=36828304, inode=0x8, 
    buf=0x7fff9396f670, xattr=0x7f48af9c33f8, postparent=0x7fff9396f600) at afr-common.c:2560
#6  0x00007f48a29e2267 in client3_3_lookup_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, 
    myframe=0x7f48affc5f54) at client-rpc-fops.c:2767
#7  0x000000391ca0e9c5 in rpc_clnt_handle_reply (clnt=0x217d370, pollin=0x206d770) at rpc-clnt.c:763
#8  0x000000391ca0fe4f in rpc_clnt_notify (trans=<value optimized out>, mydata=0x217d3a0, event=<value optimized out>, data=<value optimized out>)
    at rpc-clnt.c:891
#9  0x000000391ca0b668 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:512
#10 0x00007f48a833f275 in socket_event_poll_in (this=0x218cf70) at socket.c:2127
#11 0x00007f48a8340c5d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x218cf70, poll_in=1, poll_out=0, 
    poll_err=0) at socket.c:2240
#12 0x000000391c276367 in event_dispatch_epoll_handler (event_pool=0x20624d0) at event-epoll.c:384
#13 event_dispatch_epoll (event_pool=0x20624d0) at event-epoll.c:445
#14 0x0000000000407e93 in main (argc=11, argv=0x7fff93970b08) at glusterfsd.c:2023

(gdb) frame 0 
#0  nfs_fop_lookup_cbk (frame=0x7f48afdf76b4, cookie=0x20a04f0, this=<value optimized out>, op_ret=<value optimized out>, 
    op_errno=<value optimized out>, inode=<value optimized out>, buf=0x7f489f3fa1f4, xattr=0x7f48af9c33f8, postparent=0x7f489f3fa424)
    at nfs-fops.c:427
427	        nfs_fop_restore_root_ino (local, op_ret, buf, NULL, NULL, postparent);

(gdb) print frame 
$11 = (call_frame_t *) 0x7f48afdf76b4

(gdb) print *$
$12 = {root = 0x7f48afdf704c, parent = 0x0, next = 0x0, prev = 0x0, local = 0x0, this = 0x20a1860, ret = 0, ref_count = -1, lock = 1, cookie = 0x0, 
  complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x0, wind_to = 0x0, 
  unwind_from = 0x0, unwind_to = 0x0}

(gdb) print frame.local
$13 = (void *) 0x0

int32_t
nfs_fop_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                    int32_t op_ret, int32_t op_errno, inode_t *inode,
                    struct iatt *buf, dict_t *xattr, struct iatt *postparent)
{
        struct nfs_fop_local    *local = NULL;
        fop_lookup_cbk_t        progcbk;

        if (op_ret == 0) {
                nfs_fix_generation(this,inode);
        }

        nfl_to_prog_data (local, progcbk, frame);
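        /* nfl_to_prog_data presumably fills "local" from frame->local
         * (assumption; the macro body is not shown here). The gdb
         * output above shows frame->local == 0x0, so "local" is still
         * NULL when the macro below dereferences it. */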
====>>> nfs_fop_restore_root_ino (local, op_ret, buf, NULL, NULL, postparent);
        if (progcbk)
                progcbk (frame, cookie, this, op_ret, op_errno, inode, buf,
                         xattr, postparent);

        nfs_stack_destroy (local, frame);
        return 0;
}


/* Use the state saved by the previous macro to funge the ino in the appropriate
 * structure.
 */
#define nfs_fop_restore_root_ino(locl, fopret, preattr, postattr, prepar, postpar)  \
        do {                                                                \
                if (fopret == -1)                                           \
                        break;                                              \
                if ((locl)->rootinode) {                                    \
                        if ((preattr)) {                                    \
                                ((struct iatt *)(preattr))->ia_ino = 1;     \
                                ((struct iatt *)(preattr))->ia_dev = 0;     \
                        }                                                   \
                        if ((postattr)) {                                   \
                                ((struct iatt *)(postattr))->ia_ino = 1;    \
                                ((struct iatt *)(postattr))->ia_dev = 0;    \
                        }                                                   \
                } else if ((locl)->rootparentinode) {                       \
                        if ((prepar)) {                                     \
                                ((struct iatt *)(prepar))->ia_ino = 1;      \
                                ((struct iatt *)(prepar))->ia_dev = 0;      \
                        }                                                   \
                        if ((postpar)) {                                    \
                                ((struct iatt *)(postpar))->ia_ino = 1;     \
                                ((struct iatt *)(postpar))->ia_dev = 0;     \
                        }                                                   \
                }                                                           \
        } while (0)          

Basically, local being NULL [(void *)0] causes the SEGV.

It is strange that frame->local is NULL here; that needs more investigation. Since this is in the callback code path, frame->local being NULL may have been caused by AFR/DHT, but that is not certain.
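To make the crash mechanism concrete, here is a minimal, self-contained sketch of the failing pattern. The types and names (toy_iatt, toy_nfs_local) are illustrative stand-ins, not the real struct iatt / struct nfs_fop_local definitions: for a successful lookup op_ret is 0, so the fopret == -1 guard in the macro does not break out, and the very first use of local, (locl)->rootinode, dereferences a NULL pointer, which matches the "signal received: 11" above.

#include <stdlib.h>

struct toy_iatt {
        unsigned long ia_ino;
        unsigned long ia_dev;
};

struct toy_nfs_local {
        int rootinode;
        int rootparentinode;
};

int
main (void)
{
        struct toy_nfs_local *local  = NULL;     /* frame->local == 0x0, as in the core */
        struct toy_iatt       buf    = { 0, 0 };
        int                   op_ret = 0;        /* the lookup itself succeeded */

        /* Same shape as nfs_fop_restore_root_ino (local, op_ret, buf,
         * NULL, NULL, postparent): the fopret == -1 early break does
         * not fire, so the next line reads local->rootinode through a
         * NULL pointer and the process receives SIGSEGV. */
        if (op_ret != -1) {
                if (local->rootinode) {          /* crash happens here */
                        buf.ia_ino = 1;
                        buf.ia_dev = 0;
                }
        }
        return EXIT_SUCCESS;
}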

Comment 4 Shyamsundar 2014-09-09 16:48:42 UTC
Root caused this as follows:

In dht_lookup_everywhere_done, at line 1194 we call DHT_STACK_UNWIND,
and in the same if condition we then go ahead and execute goto
unwind_hashed_and_cached, which at line 1371 calls another UNWIND.

Pointers to the problem:
- In the cores, the ref count for the frames becomes -1, which
indicates the frame was dereferenced twice (a double unref of sorts).

From (gdb) f 2, inspecting frame, frame->parent, and
frame->parent->parent shows this ref count.

- In all the cores, the statement that caused the UNWIND occurs at the
"unwind_hashed_and_cached:" label in dht_lookup_everywhere_done, and we
can get there only from the code path mentioned above.

So the problem is a double UNWIND in a particular code path, and that
is what causes this core dump.

BTW, bug #1139213 has the same root cause.
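
For reference, the effect of such a double UNWIND can be modeled with a tiny self-contained program. This is a hypothetical toy (toy_frame, toy_unwind), not the real call_frame_t / STACK_UNWIND machinery: each unwind drops the frame's reference and releases its local state on the last unref, so unwinding the same frame twice drives ref_count to -1 and leaves local NULL, exactly the frame state captured in the cores.

#include <stdio.h>
#include <stdlib.h>

struct toy_frame {
        int   ref_count;
        void *local;                    /* per-fop state */
};

/* Toy stand-in for STACK_UNWIND: drop one reference and free the
 * frame's local state once the last reference is gone. */
static void
toy_unwind (struct toy_frame *frame)
{
        frame->ref_count--;
        if (frame->ref_count <= 0) {
                free (frame->local);
                frame->local = NULL;
        }
}

int
main (void)
{
        struct toy_frame frame = { 1, malloc (16) };

        toy_unwind (&frame);            /* first UNWIND (the line 1194 call) */
        toy_unwind (&frame);            /* second UNWIND via the
                                         * unwind_hashed_and_cached path:
                                         * ref_count is now -1 and local
                                         * is NULL, as in the core dumps */

        printf ("ref_count = %d, local = %p\n",
                frame.ref_count, frame.local);
        return EXIT_SUCCESS;
}

In the real code path the second UNWIND also runs callbacks on a frame whose state has already been torn down, which is presumably how nfs_fop_lookup_cbk came to see frame->local == NULL in comment 3.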

Comment 6 Anush Shetty 2014-09-18 07:05:30 UTC
Seeing this issue on an RHS-Hadoop FUSE mount as well; it makes the mount unusable. Remounting the client didn't help.


(gdb) bt
#0  0x00007f099f2f6c16 in dht_lookup_everywhere_done (frame=0x7f09ad230a60, this=0x155a370) at dht-common.c:1234
#1  0x00007f099f2fb5b3 in dht_lookup_everywhere_cbk (frame=0x7f09ad230a60, cookie=<value optimized out>, this=0x155a370, op_ret=<value optimized out>, op_errno=<value optimized out>, 
    inode=0x7f099d2b98c0, buf=0x7f099de20444, xattr=0x7f09acc2bf9c, postparent=0x7f099de204b4) at dht-common.c:1577
#2  0x00007f099f57e438 in afr_lookup_done (frame=0x7f08ffffffc6, cookie=0x7fff70f87cf8, this=0x1558950, op_ret=<value optimized out>, op_errno=24135216, inode=0x8, buf=0x7fff70f87f50, 
    xattr=0x7f09acc2bf9c, postparent=0x7fff70f87ee0) at afr-common.c:2319
#3  afr_lookup_cbk (frame=0x7f08ffffffc6, cookie=0x7fff70f87cf8, this=0x1558950, op_ret=<value optimized out>, op_errno=24135216, inode=0x8, buf=0x7fff70f87f50, xattr=0x7f09acc2bf9c, 
    postparent=0x7fff70f87ee0) at afr-common.c:2560
#4  0x00007f099f7bd267 in client3_3_lookup_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f09ad2311c4) at client-rpc-fops.c:2767
#5  0x0000003b5280e9c5 in rpc_clnt_handle_reply (clnt=0x1614640, pollin=0x16f8490) at rpc-clnt.c:763
#6  0x0000003b5280fe4f in rpc_clnt_notify (trans=<value optimized out>, mydata=0x1614670, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:891
#7  0x0000003b5280b668 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#8  0x00007f099fdf6275 in socket_event_poll_in (this=0x1624240) at socket.c:2127
#9  0x00007f099fdf7c5d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1624240, poll_in=1, poll_out=0, poll_err=0) at socket.c:2240
#10 0x0000003b52476367 in event_dispatch_epoll_handler (event_pool=0x15204d0) at event-epoll.c:384
#11 event_dispatch_epoll (event_pool=0x15204d0) at event-epoll.c:445
#12 0x0000000000407e93 in main (argc=7, argv=0x7fff70f893e8) at glusterfsd.c:2023

Comment 8 Pranith Kumar K 2014-09-20 11:03:39 UTC
1. Created the following volume:
Volume Name: r2d7
Type: Distributed-Replicate
Volume ID: 30b55ed4-bc0e-4dbc-bb51-6389ee26c51e
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.16.159.65:/rhs/brick1/c
Brick2: 10.16.159.117:/rhs/brick1/c
Brick3: 10.16.159.65:/rhs/brick2/c
Brick4: 10.16.159.117:/rhs/brick2/c
Brick5: 10.16.159.65:/rhs/brick3/c
Brick6: 10.16.159.117:/rhs/brick3/c
Brick7: 10.16.159.65:/rhs/brick4/c
Brick8: 10.16.159.117:/rhs/brick4/c
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

2. Set quota options
[root@dhcp159-65 ~]# gluster volume quota r2d7 enable
volume quota : success
[root@dhcp159-65 ~]# gluster volume quota r2d7 limit-usage / 10GB
volume quota : success
[root@dhcp159-65 ~]# gluster volume set r2d7 features.quota-deem-statfs on
volume set: success

3. Mount the volume over NFS.
4. Create files in gfid split-brain:
killall glusterd glusterfsd glusterfs - on 10.16.159.65
Execute the following commands
for i in {1..100}; do touch $i; done
for i in {1..100}; do mv $i ${i}_1; done

killall glusterd glusterfsd glusterfs - on 10.16.159.117
Execute 'glusterd' on 10.16.159.65
Execute the following commands on the nfs mount
for i in {1..100}; do touch $i; done
for i in {1..100}; do mv $i ${i}_1; done

Execute 'glusterd' on 10.16.159.117

The following errors appear when the files are accessed:
ls: cannot access 13_1: Input/output error
ls: cannot access 24_1: Input/output error
ls: cannot access 26_1: Input/output error

The following errors are seen in the logs:
[2014-09-20 10:48:20.208246] I [afr-self-heal-common.c:2016:afr_sh_find_fresh_parents] 0-r2d7-replicate-2: /: Performing conservative merge
[2014-09-20 10:48:20.209835] E [afr-self-heal-common.c:2869:afr_log_self_heal_completion_status] 0-r2d7-replicate-2:  gfid or missing entry self heal  failed,   on /100_1
[2014-09-20 10:48:20.209870] W [nfs3.c:1230:nfs3svc_lookup_cbk] 0-nfs: e05da27d: /100_1 => -1 (Input/output error)
[2014-09-20 10:48:20.209893] W [nfs3-helpers.c:3470:nfs3_log_newfh_res] 0-nfs-nfsv3: XID: e05da27d, LOOKUP: NFS: 5(I/O error), POSIX: 5(Input/output error), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000

5. Pick a file and resolve gfid-split-brain as mentioned in the bug:
on 10.16.159.65
[root@dhcp159-65 ~]# getfattr -d -m. -e hex /rhs/brick1/c/98_1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/c/98_1
trusted.afr.r2d7-client-0=0x000000000000000000000000
trusted.afr.r2d7-client-1=0x000000010000000200000000
trusted.gfid=0x8f8bc8dc633d4f2d8a7c69c26fa9bb9f
trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri=0x0000000000000000

[root@dhcp159-65 ~]# rm -f /rhs/brick1/c/.glusterfs/8f/8b/8f8bc8dc-633d-4f2d-8a7c-69c26fa9bb9f 
[root@dhcp159-65 ~]# rm -f /rhs/brick1/c/.glusterfs/indices/xattrop/8f8bc8dc-633d-4f2d-8a7c-69c26fa9bb9f 
rm -f /rhs/brick1/c/98_1

[root@dhcp159-116 mnt]# ls -l 98_1
-rw-r--r--. 1 root root 0 Sep 20 06:37 98_1
[root@dhcp159-116 mnt]# ls -l 98_1
-rw-r--r--. 1 root root 0 Sep 20 06:37 98_1
[root@dhcp159-116 mnt]# ls -l 98_1
-rw-r--r--. 1 root root 0 Sep 20 06:37 98_1
[root@dhcp159-116 mnt]# ls -l 98_1
-rw-r--r--. 1 root root 0 Sep 20 06:37 98_1
[root@dhcp159-116 mnt]# ls -l 98_1
-rw-r--r--. 1 root root 0 Sep 20 06:37 98_1
[root@dhcp159-116 mnt]# ls -l 98_1
-rw-r--r--. 1 root root 0 Sep 20 06:37 98_1

The file above was present on only one of the subvolumes. Some files are present on more than one subvolume (a dht link file on one, the data file on another); let's do the same procedure for one of those as well.

[root@dhcp159-65 ~]# ls /rhs/brick{1,2}/c
/rhs/brick1/c:
1_1   16_1  25_1  31_1  4_1   47_1  5_1   53_1  59_1  62_1  66_1  73_1  8_1   9_1   96_1
11_1  20_1  26_1  33_1  41_1  48_1  51_1  54_1  6_1   64_1  70_1  77_1  86_1  91_1  97_1
13_1  24_1  29_1  36_1  46_1  49_1  52_1  58_1  61_1  65_1  72_1  78_1  87_1  94_1  98_1

/rhs/brick2/c:
10_1  12_1  20_1  23_1  30_1  33_1  37_1  43_1  49_1  62_1  69_1  72_1  79_1  81_1  85_1  93_1
1_1   18_1  2_1   27_1  3_1   34_1  38_1  44_1  51_1  63_1  7_1   76_1  80_1  82_1  86_1  94_1
11_1  19_1  21_1  28_1  32_1  35_1  40_1  46_1  57_1  68_1  71_1  78_1  8_1   83_1  88_1  99_1

1_1 is what we will choose.

[root@dhcp159-65 ~]# rm -f /rhs/brick1/c/.glusterfs/6a/4a/6a4ae6ce-27c3-41be-9cb1-6cda3698c3c1 
[root@dhcp159-65 ~]# rm -f /rhs/brick1/c/.glusterfs/indices/xattrop/6a4ae6ce-27c3-41be-9cb1-6cda3698c3c1 
[root@dhcp159-65 ~]# rm -f /rhs/brick1/c/1_1
[root@dhcp159-65 ~]# 

[root@dhcp159-117 ~]# rm -f /rhs/brick2/c/1_1
[root@dhcp159-117 ~]# rm -f /rhs/brick2/c/.glusterfs/c7/0e/c70e2c16-95f4-45cd-82ab-1ff9036be419 
[root@dhcp159-117 ~]# rm -f /rhs/brick2/c/.glusterfs/indices/xattrop/c70e2c16-95f4-45cd-82ab-1ff9036be419 
[root@dhcp159-117 ~]# 

Access the file on the mount:
[root@dhcp159-116 mnt]# ls -l 1_1
ls: cannot access 1_1: Stale file handle

The following logs are seen on accessing the file:
[2014-09-20 10:57:46.901117] I [afr-self-heal-common.c:2016:afr_sh_find_fresh_parents] 0-r2d7-replicate-1: /: Performing conservative merge
[2014-09-20 10:57:46.902842] E [afr-self-heal-common.c:2869:afr_log_self_heal_completion_status] 0-r2d7-replicate-1:  gfid or missing entry self heal  failed,   on /1_1
[2014-09-20 10:57:46.902889] W [nfs3.c:1230:nfs3svc_lookup_cbk] 0-nfs: e95da27d: /1_1 => -1 (Input/output error)
[2014-09-20 10:57:46.902917] W [nfs3-helpers.c:3470:nfs3_log_newfh_res] 0-nfs-nfsv3: XID: e95da27d, LOOKUP: NFS: 5(I/O error), POSIX: 5(Input/output error), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000
[2014-09-20 11:02:25.473839] W [MSGID: 109009] [dht-common.c:1691:dht_lookup_linkfile_cbk] 0-r2d7-dht: /1_1: gfid different on data file on r2d7-replicate-0, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = c70e2c16-95f4-45cd-82ab-1ff9036be419 
[2014-09-20 11:02:25.477010] W [MSGID: 109009] [dht-common.c:1433:dht_lookup_everywhere_cbk] 0-r2d7-dht: /1_1: gfid differs on subvolume r2d7-replicate-0, gfid local = 6a4ae6ce-27c3-41be-9cb1-6cda3698c3c1, gfid node = c70e2c16-95f4-45cd-82ab-1ff9036be419
[2014-09-20 11:02:25.477226] W [MSGID: 109009] [dht-common.c:1433:dht_lookup_everywhere_cbk] 0-r2d7-dht: /1_1: gfid differs on subvolume r2d7-replicate-1, gfid local = c70e2c16-95f4-45cd-82ab-1ff9036be419, gfid node = 6a4ae6ce-27c3-41be-9cb1-6cda3698c3c1
[2014-09-20 11:02:25.477655] W [nfs3.c:1230:nfs3svc_lookup_cbk] 0-nfs: eb5da27d: /1_1 => -1 (Stale file handle)
[2014-09-20 11:02:25.477724] W [nfs3-helpers.c:3470:nfs3_log_newfh_res] 0-nfs-nfsv3: XID: eb5da27d, LOOKUP: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000
[2014-09-20 11:02:25.482840] W [nfs3.c:1230:nfs3svc_lookup_cbk] 0-nfs: ec5da27d: /1_1 => -1 (Stale file handle)
[2014-09-20 11:02:25.482913] W [nfs3-helpers.c:3470:nfs3_log_newfh_res] 0-nfs-nfsv3: XID: ec5da27d, LOOKUP: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000
[2014-09-20 11:02:25.485264] I [afr-self-heal-common.c:2869:afr_log_self_heal_completion_status] 0-r2d7-replicate-1:  metadata self heal  is successfully completed, backgroung data self heal  is successfully completed,  data self heal from r2d7-client-2  to sinks  r2d7-client-3, with 0 bytes on r2d7-client-2, 0 bytes on r2d7-client-3,  data - Pending matrix:  [ [ 0 2 ] [ 0 0 ] ]  metadata self heal from source r2d7-client-2 to r2d7-client-3,  metadata - Pending matrix:  [ [ 0 4 ] [ 0 0 ] ], on /1_1

No cores are seen.

Comment 9 Saurabh 2014-09-23 09:28:26 UTC
As per https://bugzilla.redhat.com/show_bug.cgi?id=1139535#c8,
the verification has been done; credit for the verification goes to Pranith.

Thanks,
Saurabh

Comment 11 errata-xmlrpc 2014-09-29 07:46:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1315.html

