Bug 864401

Summary: [glusterfs-3.3.1qa3]: glusterfs client asserted
Product: [Community] GlusterFS Reporter: Raghavendra Bhat <rabhat>
Component: posixAssignee: Raghavendra Bhat <rabhat>
Status: CLOSED CURRENTRELEASE QA Contact: Sachidananda Urs <surs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: pre-releaseCC: esandeen, fharshav, gluster-bugs, jahernan, sdharane, shaines
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.4.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 895841 (view as bug list) Environment:
Last Closed: 2013-07-24 17:30:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 895841    

Description Raghavendra Bhat 2012-10-09 10:57:59 UTC
Description of problem:
2x2 distributed replicate volume. ran some tests on fuse and nfs mounts. unmounted nfs mount. was cded into fuse mount and the mount point was idle for 3-4 days. Later when mount point was accessed the fuse client process asserted with the below backtrace.

Core was generated by `/usr/local/sbin/glusterfs --volfile-id=mirror --volfile-server=10.70.36.4 /mnt/'.
Program terminated with signal 6, Aborted.
#0  0x0000003a68432885 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.12.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.5.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x0000003a68432885 in raise () from /lib64/libc.so.6
#1  0x0000003a68434065 in abort () from /lib64/libc.so.6
#2  0x0000003a6842b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003a6842bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f99a8e7d68f in afr_lookup_update_lk_counts (local=0x7f99a390ec40, this=0xb3ddb0, child_index=1, xattr=0x0)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:1122
#5  0x00007f99a8e80520 in afr_lookup_handle_success (local=0x7f99a390ec40, this=0xb3ddb0, child_index=1, op_ret=0, 
    op_errno=0, inode=0x7f99a2a8304c, buf=0x7fffe9b6d450, xattr=0x0, postparent=0x7fffe9b6d3e0)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:2005
#6  0x00007f99a8e8062d in afr_lookup_cbk (frame=0x7f99ac6c7cf8, cookie=0x1, this=0xb3ddb0, op_ret=0, op_errno=0, 
    inode=0x7f99a2a8304c, buf=0x7fffe9b6d450, xattr=0x0, postparent=0x7fffe9b6d3e0)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:2034
#7  0x00007f99a90ca44a in client3_1_lookup_cbk (req=0x7f99a327b4dc, iov=0x7f99a327b51c, count=1, myframe=0x7f99ac6c845c)
    at ../../../../../xlators/protocol/client/src/client3_1-fops.c:2636
#8  0x00007f99ad665a3a in rpc_clnt_handle_reply (clnt=0xb73b90, pollin=0x1b84430)
    at ../../../../rpc/rpc-lib/src/rpc-clnt.c:786
#9  0x00007f99ad665dd7 in rpc_clnt_notify (trans=0xb83720, mydata=0xb73bc0, event=RPC_TRANSPORT_MSG_RECEIVED, 
    data=0x1b84430) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:905
#10 0x00007f99ad661ec8 in rpc_transport_notify (this=0xb83720, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x1b84430)
    at ../../../../rpc/rpc-lib/src/rpc-transport.c:489
#11 0x00007f99a9f0f250 in socket_event_poll_in (this=0xb83720) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1677
#12 0x00007f99a9f0f7d4 in socket_event_handler (fd=13, idx=4, data=0xb83720, poll_in=1, poll_out=0, poll_err=0)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:1792
#13 0x00007f99ad8bcd78 in event_dispatch_epoll_handler (event_pool=0xb25510, events=0xb335e0, i=0)
    at ../../../libglusterfs/src/event.c:785
#14 0x00007f99ad8bcf9b in event_dispatch_epoll (event_pool=0xb25510) at ../../../libglusterfs/src/event.c:847
#15 0x00007f99ad8bd326 in event_dispatch (event_pool=0xb25510) at ../../../libglusterfs/src/event.c:947
#16 0x00000000004085a7 in main (argc=4, argv=0x7fffe9b6db48) at ../../../glusterfsd/src/glusterfsd.c:1689
(gdb)  f 4
#4  0x00007f99a8e7d68f in afr_lookup_update_lk_counts (local=0x7f99a390ec40, this=0xb3ddb0, child_index=1, xattr=0x0)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:1122
1122            GF_ASSERT (xattr);
(gdb) p xattr
$1 = (dict_t *) 0x0
(gdb) f 7
#7  0x00007f99a90ca44a in client3_1_lookup_cbk (req=0x7f99a327b4dc, iov=0x7f99a327b51c, count=1, myframe=0x7f99ac6c845c)
    at ../../../../../xlators/protocol/client/src/client3_1-fops.c:2636
2636            CLIENT_STACK_UNWIND (lookup, frame, rsp.op_ret, rsp.op_errno, inode,
(gdb) l
2631                    else
2632                            gf_log (this->name, GF_LOG_TRACE, "not found on remote node");
2633
2634            }
2635
2636            CLIENT_STACK_UNWIND (lookup, frame, rsp.op_ret, rsp.op_errno, inode,
2637                                 &stbuf, xdata, &postparent);
2638
2639            if (xdata)
2640                    dict_unref (xdata);
(gdb) p xdata
$2 = (dict_t *) 0x0
(gdb) 




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:

glusterfs client asserted due to NULL xattr

Expected results:

glusterfs client should not crash

Additional info:

 gluster volume info

Volume Name: mirror
Type: Distributed-Replicate
Volume ID: c059eea3-74a5-496e-bd17-db651fb910c6
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.36.3:/mnt/export/mirror
Brick2: 10.70.36.4:/mnt/export/mirror
Brick3: 10.70.36.3:/mnt/export2/mirror
Brick4: 10.70.36.4:/mnt/export2/mirror


[2012-10-03 21:26:25.960328] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-mirror-client-0: remote operation failed: No such
 file or directory
[2012-10-03 21:26:25.960537] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-mirror-client-3: remote operation failed: No such
 file or directory
[2012-10-03 21:26:25.960723] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-mirror-client-1: remote operation failed: No such
 file or directory
[2012-10-03 21:26:25.960890] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-mirror-client-2: remote operation failed: No such
 file or directory
[2012-10-07 03:16:02.424353] I [glusterfsd.c:889:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2012-10-07 03:16:02.444727] I [glusterfsd-mgmt.c:1569:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2012-10-08 19:47:30
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1qa3

Comment 1 Raghavendra Bhat 2012-10-10 08:51:55 UTC
The reason for the crash is because of some problem in the backend xfs filesystem. upon lstat()  was returning 5 instead of 0 or -1. Thus posix xlator did not create the dictionary which contains the xattrs information of lookup and thus returned NULL dictionary.


This is the o/p of dmesg:

XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.

Comment 2 Vijay Bellur 2012-10-11 02:13:48 UTC
CHANGE: http://review.gluster.org/4056 (storage/posix: return -1 if lstat call returns non zero value apart from -1) merged in master by Anand Avati (avati)

Comment 3 Vijay Bellur 2012-10-12 01:08:30 UTC
CHANGE: http://review.gluster.org/4054 (storage/posix: return -1 if lstat call returns non zero value apart from -1) merged in release-3.3 by Anand Avati (avati)

Comment 4 Harshavardhana 2013-02-08 06:01:57 UTC
(In reply to comment #1)
> The reason for the crash is because of some problem in the backend xfs
> filesystem. upon lstat()  was returning 5 instead of 0 or -1. Thus posix
> xlator did not create the dictionary which contains the xattrs information
> of lookup and thus returned NULL dictionary.
> 
> 
> This is the o/p of dmesg:
> 
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.

do you know how to reproduced this issue? did you see XFS produce messages like 
"xfs_iunlink_remove: xfs_inotobp() returned error 22." and then XFS shuts itself down?

If you know how to reproduce, possible provide me the testbed / reproducer details or what set of operations were being performed on the XFS while this occurred?

Thanks
-Harsha

Comment 5 Raghavendra Bhat 2013-02-08 06:12:04 UTC
Both the machines user were vms. I had run some tests and then had left the setup as it is for few days (may b a week). Had not run any command on those machines during the idle time. And when I had run the tests there was no error. But when I again logged into the machine after a week the fuse client crashed due to the above error after entering a command on the fuse mount.

I dont know how that xfs issue came. But in one of my other testings where I was doing removing the cable of the disk or if the mount is via losetup then removing the actual file used for loopback, I had got above errors again.

Comment 6 Eric Sandeen 2013-04-08 19:20:07 UTC
"upon lstat()  was returning 5 instead of 0 or -1. "

Which lstat() is this - glibc, or something in the gluster code?

5 is EIO which is the errno xfs is returning from lstat after it shuts down, but whateverteh lstat implementation is should have been putting that into errno and returning -1.  Let's not lose track of *that* bug . . . checking for other values returned from a syscall seems like papering over the root cause of the bug I think.

Comment 7 Eric Sandeen 2013-04-08 19:23:22 UTC
IOW, this is what glibc's lstat does in the face of a shut-down xfs filesystem:

# mount /dev/sdb1 /mnt/test
# touch /mnt/test/file
# /mnt/test2/git/xfstests/src/godown /mnt/test
# strace -e lstat stat /mnt/test/file
lstat("/mnt/test/file", 0x7fffc9ae2480) = -1 EIO (Input/output error)
stat: cannot stat `/mnt/test/file': Input/output error
# dmesg tail -n 2
[359908.052905] XFS (sdb1): xfs_log_force: error 5 returned.
[359938.103901] XFS (sdb1): xfs_log_force: error 5 returned.

-Eric

Comment 8 Eric Sandeen 2013-04-08 19:39:36 UTC
And looking at glusterfs code, it does seem to be doing the right thing in the lstat wappers.  I don't see anything in this bug that indicates lstat *did* return 5.  Are you sure of this?

Thanks,
-Eric

Comment 9 Raghavendra Bhat 2013-04-09 05:58:36 UTC
Earlier glusterfs (storage/posix xlator which talks to the backend) was checking the return value of lstat () syscall and was treating it as a failure if lstat return value is -1. When thig bug was found, lstat () returning 5 instead of -1 (dmesg also said some xfs error, as mentioned in the above comments). Since glusterfs was treating -1 return value as the failure of the syscall, it assumed the lookup call was successful and returned 0 instead of -1.

Other components of glusterfs (replicate xlator) checked the return value to be 0 and started accessing some structures which did not contain the valid information and thus segfaulted.

Comment 10 Eric Sandeen 2013-04-09 15:28:47 UTC
(In reply to comment #9)
> Earlier glusterfs (storage/posix xlator which talks to the backend) was
> checking the return value of lstat () syscall and was treating it as a
> failure if lstat return value is -1. When thig bug was found, lstat ()
> returning 5 instead of -1

How was it determined that lstat returned 5?

> (dmesg also said some xfs error, as mentioned in
> the above comments). 

dmesg may say "error 5 returned" but that does *not* mean that's what lstat returned.  That's what an internal kernel function returned, and it should have been translated into an *errno* of 5, i.e. EIO; see comment #7 above.

Do you have traces of lstat actually returning 5?  If so that is a glibc bug which must be addressed, but I do not see it in my testing.

Comment 11 Eric Sandeen 2013-04-09 23:17:33 UTC
Ok actually RHEL6 does have this bug.  I wish gluster folks had notified us:

xfs_vn_getattr returns positive not negative errno:

        if (XFS_FORCED_SHUTDOWN(mp))
                return XFS_ERROR(EIO);

see also bug #604089

Comment 12 Eric Sandeen 2013-04-09 23:47:00 UTC
Ok.  The XFS bug was fixed in bug #859242

commit 2e8ecba563cfeff97584b578043b2e7ea7a31498
Author: Dave Chinner <dchinner>
Date:   Mon Sep 24 14:21:30 2012 -0400

    [fs] xfs: Return -EIO when xfs_vn_getattr() failed

    Upstream commit: ed32201e65e15f3e6955cb84cbb544b08f81e5a5

    An attribute of inode can be fetched via xfs_vn_getattr() in XFS.
    Currently it returns EIO, not negative value, when it failed.  As a
    result, the system call returns not negative value even though an
    error occured. The stat(2), ls and mv commands cannot handle this
    error and do not work correctly.
    
    This patch fixes this bug, and returns -EIO, not EIO when an error
    is detected in xfs_vn_getattr().

    Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu>
    Reviewed-by: Christoph Hellwig <hch>
    Signed-off-by: Alex Elder <aelder>
    Signed-off-by: Jarod Wilson <jarod>


Sorry for all the noise, but when gluster finds fs problems we really need to know about them, even if you also work around it in the short term.

Thanks,
-Eric