Description of problem:
2x2 distributed-replicate volume. Ran some tests on FUSE and NFS mounts, then unmounted the NFS mount. A shell was cd'ed into the FUSE mount, and the mount point was left idle for 3-4 days. Later, when the mount point was accessed, the FUSE client process asserted with the backtrace below.

Core was generated by `/usr/local/sbin/glusterfs --volfile-id=mirror --volfile-server=10.70.36.4 /mnt/'.
Program terminated with signal 6, Aborted.
#0  0x0000003a68432885 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.12.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.5.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x0000003a68432885 in raise () from /lib64/libc.so.6
#1  0x0000003a68434065 in abort () from /lib64/libc.so.6
#2  0x0000003a6842b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003a6842bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f99a8e7d68f in afr_lookup_update_lk_counts (local=0x7f99a390ec40, this=0xb3ddb0, child_index=1, xattr=0x0) at ../../../../../xlators/cluster/afr/src/afr-common.c:1122
#5  0x00007f99a8e80520 in afr_lookup_handle_success (local=0x7f99a390ec40, this=0xb3ddb0, child_index=1, op_ret=0, op_errno=0, inode=0x7f99a2a8304c, buf=0x7fffe9b6d450, xattr=0x0, postparent=0x7fffe9b6d3e0) at ../../../../../xlators/cluster/afr/src/afr-common.c:2005
#6  0x00007f99a8e8062d in afr_lookup_cbk (frame=0x7f99ac6c7cf8, cookie=0x1, this=0xb3ddb0, op_ret=0, op_errno=0, inode=0x7f99a2a8304c, buf=0x7fffe9b6d450, xattr=0x0, postparent=0x7fffe9b6d3e0) at ../../../../../xlators/cluster/afr/src/afr-common.c:2034
#7  0x00007f99a90ca44a in client3_1_lookup_cbk (req=0x7f99a327b4dc, iov=0x7f99a327b51c, count=1, myframe=0x7f99ac6c845c) at ../../../../../xlators/protocol/client/src/client3_1-fops.c:2636
#8  0x00007f99ad665a3a in rpc_clnt_handle_reply (clnt=0xb73b90, pollin=0x1b84430) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:786
#9  0x00007f99ad665dd7 in rpc_clnt_notify (trans=0xb83720, mydata=0xb73bc0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x1b84430) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:905
#10 0x00007f99ad661ec8 in rpc_transport_notify (this=0xb83720, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x1b84430) at ../../../../rpc/rpc-lib/src/rpc-transport.c:489
#11 0x00007f99a9f0f250 in socket_event_poll_in (this=0xb83720) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1677
#12 0x00007f99a9f0f7d4 in socket_event_handler (fd=13, idx=4, data=0xb83720, poll_in=1, poll_out=0, poll_err=0) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1792
#13 0x00007f99ad8bcd78 in event_dispatch_epoll_handler (event_pool=0xb25510, events=0xb335e0, i=0) at ../../../libglusterfs/src/event.c:785
#14 0x00007f99ad8bcf9b in event_dispatch_epoll (event_pool=0xb25510) at ../../../libglusterfs/src/event.c:847
#15 0x00007f99ad8bd326 in event_dispatch (event_pool=0xb25510) at ../../../libglusterfs/src/event.c:947
#16 0x00000000004085a7 in main (argc=4, argv=0x7fffe9b6db48) at ../../../glusterfsd/src/glusterfsd.c:1689

(gdb) f 4
#4  0x00007f99a8e7d68f in afr_lookup_update_lk_counts (local=0x7f99a390ec40, this=0xb3ddb0, child_index=1, xattr=0x0) at ../../../../../xlators/cluster/afr/src/afr-common.c:1122
1122            GF_ASSERT (xattr);
(gdb) p xattr
$1 = (dict_t *) 0x0
(gdb) f 7
#7  0x00007f99a90ca44a in client3_1_lookup_cbk (req=0x7f99a327b4dc, iov=0x7f99a327b51c, count=1, myframe=0x7f99ac6c845c) at ../../../../../xlators/protocol/client/src/client3_1-fops.c:2636
2636            CLIENT_STACK_UNWIND (lookup, frame, rsp.op_ret, rsp.op_errno, inode,
(gdb) l
2631            else
2632                    gf_log (this->name, GF_LOG_TRACE, "not found on remote node");
2633
2634    }
2635
2636            CLIENT_STACK_UNWIND (lookup, frame, rsp.op_ret, rsp.op_errno, inode,
2637                                 &stbuf, xdata, &postparent);
2638
2639            if (xdata)
2640                    dict_unref (xdata);
(gdb) p xdata
$2 = (dict_t *) 0x0
(gdb)

Version-Release number of selected component (if applicable):
glusterfs 3.3.1qa3

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
glusterfs client asserted due to NULL xattr

Expected results:
glusterfs client should not crash

Additional info:

gluster volume info

Volume Name: mirror
Type: Distributed-Replicate
Volume ID: c059eea3-74a5-496e-bd17-db651fb910c6
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.36.3:/mnt/export/mirror
Brick2: 10.70.36.4:/mnt/export/mirror
Brick3: 10.70.36.3:/mnt/export2/mirror
Brick4: 10.70.36.4:/mnt/export2/mirror

[2012-10-03 21:26:25.960328] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-mirror-client-0: remote operation failed: No such file or directory
[2012-10-03 21:26:25.960537] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-mirror-client-3: remote operation failed: No such file or directory
[2012-10-03 21:26:25.960723] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-mirror-client-1: remote operation failed: No such file or directory
[2012-10-03 21:26:25.960890] W [client3_1-fops.c:474:client3_1_stat_cbk] 0-mirror-client-2: remote operation failed: No such file or directory
[2012-10-07 03:16:02.424353] I [glusterfsd.c:889:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2012-10-07 03:16:02.444727] I [glusterfsd-mgmt.c:1569:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing

pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2012-10-08 19:47:30
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1qa3
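For context, the assert in frame #4 fires because the lookup reply reached AFR with xattr == NULL. Purely as an illustration of a defensive alternative (the actual fix, per the changes below, went into the storage/posix xlator instead), AFR could treat a NULL xattr as a degraded reply rather than asserting. This is a minimal sketch based only on the signature visible in the backtrace, not the merged code:

    /* Hypothetical defensive variant of the asserting function in frame #4.
     * Signature taken from the backtrace; the body is illustrative only. */
    static void
    afr_lookup_update_lk_counts (afr_local_t *local, xlator_t *this,
                                 int child_index, dict_t *xattr)
    {
            if (!xattr) {
                    /* Instead of GF_ASSERT (xattr): log and skip the
                     * lock-count bookkeeping for this child. */
                    gf_log (this->name, GF_LOG_WARNING,
                            "lookup reply from child %d carried no xattr dict",
                            child_index);
                    return;
            }
            /* ... existing lock-count handling would follow here ... */
    }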
The reason for the crash is a problem in the backend XFS filesystem: lstat() was returning 5 instead of 0 or -1. As a result, the posix xlator did not create the dictionary that carries the xattr information for the lookup, and returned a NULL dictionary.

This is the output of dmesg:

XFS (dm-8): xfs_log_force: error 5 returned.
XFS (dm-7): xfs_log_force: error 5 returned.
[the two messages above repeat, alternating between dm-8 and dm-7]
CHANGE: http://review.gluster.org/4056 (storage/posix: return -1 if lstat call returns non zero value apart from -1) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4054 (storage/posix: return -1 if lstat call returns non zero value apart from -1) merged in release-3.3 by Anand Avati (avati)
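The patch title suggests the change clamps any non-zero lstat() return value to -1 before the posix xlator decides whether the entry exists. A minimal sketch of that kind of normalization, with illustrative names (see the review links above for the actual merged code):

    #include <errno.h>
    #include <sys/stat.h>

    /* Illustrative helper: never let a broken kernel's positive return
     * value (e.g. 5/EIO, as seen in this bug) be mistaken for success. */
    static int
    posix_lstat_checked (const char *real_path, struct stat *buf,
                         int *op_errno)
    {
            int ret = lstat (real_path, buf);

            if (ret != 0) {
                    *op_errno = errno;
                    ret = -1;   /* clamp 5, or any other non-zero value, to -1 */
            }
            return ret;
    }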
(In reply to comment #1)
> The reason for the crash is a problem in the backend XFS filesystem:
> lstat() was returning 5 instead of 0 or -1. As a result, the posix
> xlator did not create the dictionary that carries the xattr information
> for the lookup, and returned a NULL dictionary.
>
> This is the output of dmesg:
>
> XFS (dm-8): xfs_log_force: error 5 returned.
> XFS (dm-7): xfs_log_force: error 5 returned.
> [...]

Do you know how to reproduce this issue? Did you see XFS produce messages like "xfs_iunlink_remove: xfs_inotobp() returned error 22." and then shut itself down?

If you know how to reproduce it, could you provide the testbed/reproducer details, or the set of operations that were being performed on the XFS filesystem while this occurred?

Thanks
-Harsha
Both machines used were VMs. I had run some tests and then left the setup as-is for a few days (maybe a week). I had not run any commands on those machines during the idle time, and there were no errors while the tests were running. When I logged back into the machine after about a week and entered a command on the FUSE mount, the FUSE client crashed with the error above.

I don't know how the XFS issue arose. However, in other tests, where I either pulled the disk's cable or, for a loopback (losetup) mount, removed the actual file backing the loop device, I have seen the same XFS errors.
"upon lstat() was returning 5 instead of 0 or -1. " Which lstat() is this - glibc, or something in the gluster code? 5 is EIO which is the errno xfs is returning from lstat after it shuts down, but whateverteh lstat implementation is should have been putting that into errno and returning -1. Let's not lose track of *that* bug . . . checking for other values returned from a syscall seems like papering over the root cause of the bug I think.
IOW, this is what glibc's lstat does in the face of a shut-down XFS filesystem:

# mount /dev/sdb1 /mnt/test
# touch /mnt/test/file
# /mnt/test2/git/xfstests/src/godown /mnt/test
# strace -e lstat stat /mnt/test/file
lstat("/mnt/test/file", 0x7fffc9ae2480) = -1 EIO (Input/output error)
stat: cannot stat `/mnt/test/file': Input/output error
# dmesg | tail -n 2
[359908.052905] XFS (sdb1): xfs_log_force: error 5 returned.
[359938.103901] XFS (sdb1): xfs_log_force: error 5 returned.

-Eric
And looking at the glusterfs code, it does seem to be doing the right thing in the lstat wrappers. I don't see anything in this bug that indicates lstat *did* return 5. Are you sure of this?

Thanks,
-Eric
Earlier, glusterfs (the storage/posix xlator, which talks to the backend) checked the return value of the lstat() syscall and treated the call as a failure only if the return value was -1. When this bug was hit, lstat() returned 5 instead of -1 (dmesg also reported XFS errors, as mentioned in the comments above). Since glusterfs treated only a -1 return value as failure of the syscall, it assumed the lookup call had succeeded and returned 0 instead of -1. Other components of glusterfs (the replicate xlator) saw the 0 return value and started accessing structures that did not contain valid information, and thus segfaulted.
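To make the failure mode concrete, here is a tiny self-contained demonstration of how the old "ret == -1" test is fooled when a syscall leaks a positive errno as its return value (illustrative only; broken_lstat() simulates the RHEL6 XFS shutdown bug, it is not a real API):

    #include <stdio.h>

    /* Simulates the broken kernel behaviour: a positive errno (5/EIO)
     * leaks out as the syscall return value instead of -1. */
    static int
    broken_lstat (void)
    {
            return 5;
    }

    int
    main (void)
    {
            int ret = broken_lstat ();

            if (ret == -1)
                    printf ("old check: failure detected\n");
            else
                    printf ("old check: treated as success (the bug)\n");

            if (ret != 0)
                    printf ("fixed check: failure detected\n");
            return 0;
    }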
(In reply to comment #9)
> Earlier, glusterfs (the storage/posix xlator, which talks to the
> backend) checked the return value of the lstat() syscall and treated
> the call as a failure only if the return value was -1. When this bug
> was hit, lstat() returned 5 instead of -1

How was it determined that lstat returned 5?

> (dmesg also reported XFS errors, as mentioned in the comments above).

dmesg may say "error 5 returned", but that does *not* mean that's what lstat returned. That's what an internal kernel function returned, and it should have been translated into an *errno* of 5, i.e. EIO; see comment #7 above.

Do you have traces of lstat actually returning 5? If so, that is a glibc bug which must be addressed, but I do not see it in my testing.
Ok, actually RHEL6 does have this bug. I wish the gluster folks had notified us: xfs_vn_getattr() returns a positive, not negative, errno:

	if (XFS_FORCED_SHUTDOWN(mp))
		return XFS_ERROR(EIO);

See also bug #604089.
Ok. The XFS bug was fixed in bug #859242:

commit 2e8ecba563cfeff97584b578043b2e7ea7a31498
Author: Dave Chinner <dchinner>
Date:   Mon Sep 24 14:21:30 2012 -0400

    [fs] xfs: Return -EIO when xfs_vn_getattr() failed

    Upstream commit: ed32201e65e15f3e6955cb84cbb544b08f81e5a5

    An attribute of an inode can be fetched via xfs_vn_getattr() in XFS.
    Currently it returns EIO, not a negative value, when it fails. As a
    result, the system call returns a non-negative value even though an
    error occurred. The stat(2), ls and mv commands cannot handle this
    error and do not work correctly.

    This patch fixes this bug: it returns -EIO, not EIO, when an error is
    detected in xfs_vn_getattr().

    Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu>
    Reviewed-by: Christoph Hellwig <hch>
    Signed-off-by: Alex Elder <aelder>
    Signed-off-by: Jarod Wilson <jarod>

Sorry for all the noise, but when gluster finds fs problems we really need to know about them, even if you also work around them in the short term.

Thanks,
-Eric
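For completeness, the shape of the one-line kernel fix as the commit message describes it (sketched from the description above, not copied from the kernel tree; see upstream commit ed32201e for the authoritative diff):

	/* xfs_vn_getattr() must hand the VFS a negative errno on shutdown. */
	if (XFS_FORCED_SHUTDOWN(mp))
		return -XFS_ERROR(EIO);   /* was: return XFS_ERROR(EIO); */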