Bug 1668304

Summary: [RHHI-V] Fuse mount crashed while creating the preallocated image
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: sharding
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: olim, rcyriac, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vdas
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.z Batch Update 3
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.12.2-40
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1668309, 1669077
Environment:
Last Closed: 2019-02-04 07:41:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1668309, 1669077, 1669382

Description SATHEESARAN 2019-01-22 12:07:58 UTC
Description of problem:
------------------------
Fuse mount crashed while creating a preallocated image of size 1TB on an arbitrated replicate volume

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.4.3 nightly ( glusterfs-3.12.2-38.el7rhgs )
RHV 4.2.8

How reproducible:
-----------------
1 out of 2 times

Steps to Reproduce:
-------------------
1. Create an arbitrated replicate volume
2. Fuse mount it
3. Create a preallocated image of size 1TB
# qemu-img create -f qcow2 -o preallocation=falloc /mnt/test1/vm1.img 1072G

Actual results:
---------------
Fuse mount crashed with a segfault

Expected results:
-----------------
Preallocated image should be created successfully

Comment 2 SATHEESARAN 2019-01-22 12:11:03 UTC
Backtrace
-----------

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --volfile-server=localhost --volfile-id=/data /mnt/test1'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fb1233e0a7a in __inode_get_xl_index (xlator=0x7fb1100143b0, inode=0x7fb104026258) at inode.c:455
455	        if ((inode->_ctx[xlator->xl_id].xl_key != NULL) &&
(gdb) bt
#0  0x00007fb1233e0a7a in __inode_get_xl_index (xlator=0x7fb1100143b0, inode=0x7fb104026258) at inode.c:455
#1  __inode_ref (inode=inode@entry=0x7fb104026258) at inode.c:537
#2  0x00007fb1233e0b81 in inode_ref (inode=inode@entry=0x7fb104026258) at inode.c:581
#3  0x00007fb1233f5d2b in __fd_create (inode=inode@entry=0x7fb104026258, pid=pid@entry=0) at fd.c:633
#4  0x00007fb1233f6f4a in __fd_anonymous (inode=inode@entry=0x7fb104026258, flags=flags@entry=2) at fd.c:779
#5  0x00007fb1233f729d in fd_anonymous (inode=0x7fb104026258) at fd.c:803
#6  0x00007fb115161534 in shard_post_lookup_fsync_handler (frame=0x7fb0c05eb178, this=0x7fb1100143b0) at shard.c:5936
#7  0x00007fb11514913c in shard_lookup_base_file (frame=frame@entry=0x7fb0c05eb178, this=this@entry=0x7fb1100143b0, loc=loc@entry=0x7fb10800a158, 
    handler=handler@entry=0x7fb115161030 <shard_post_lookup_fsync_handler>) at shard.c:1746
#8  0x00007fb1151544c3 in shard_fsync (frame=0x7fb0c05eb178, this=0x7fb1100143b0, fd=0x7fb0f800eb78, datasync=1, xdata=0x0) at shard.c:6015
#9  0x00007fb114f30189 in wb_fsync_helper (frame=0x7fb0f80022e8, this=0x7fb1100159d0, fd=0x7fb0f800eb78, datasync=1, xdata=0x0) at write-behind.c:1974
#10 0x00007fb1233f5b15 in call_resume_keep_stub (stub=0x7fb0f80250f8) at call-stub.c:2582
#11 0x00007fb114f35a69 in wb_do_winds (wb_inode=wb_inode@entry=0x7fb0f800dd70, tasks=tasks@entry=0x7fb10dae7510) at write-behind.c:1672
#12 0x00007fb114f35b7b in wb_process_queue (wb_inode=wb_inode@entry=0x7fb0f800dd70) at write-behind.c:1709
#13 0x00007fb114f35c57 in wb_fulfill_cbk (frame=0x7fb0f8010a58, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, 
    postbuf=0x7fb0f8003670, xdata=0x7fb0c065ee98) at write-behind.c:1054
#14 0x00007fb115156840 in shard_common_inode_write_success_unwind (fop=<optimized out>, frame=0x7fb0f80019b8, op_ret=65536) at shard.c:903
#15 0x00007fb115156bc0 in shard_common_inode_write_post_update_size_handler (frame=<optimized out>, this=<optimized out>) at shard.c:5214
#16 0x00007fb115147cc0 in shard_update_file_size (frame=frame@entry=0x7fb0f80019b8, this=this@entry=0x7fb1100143b0, fd=0x7fb0f800eb78, loc=loc@entry=0x0, 
    handler=handler@entry=0x7fb115156ba0 <shard_common_inode_write_post_update_size_handler>) at shard.c:1201
#17 0x00007fb11515e811 in shard_common_inode_write_do_cbk (frame=frame@entry=0x7fb0f80019b8, cookie=0x7fb0f800eb78, this=0x7fb1100143b0, op_ret=op_ret@entry=65536, 
    op_errno=op_errno@entry=0, pre=pre@entry=0x7fb0f8029730, post=post@entry=0x7fb0f80297a0, xdata=xdata@entry=0x7fb0c065ee98) at shard.c:5326
#18 0x00007fb1153d467e in dht_writev_cbk (frame=0x7fb0f80021d8, cookie=<optimized out>, this=<optimized out>, op_ret=65536, op_errno=0, prebuf=0x7fb0f8029730, postbuf=0x7fb0f80297a0, 
    xdata=0x7fb0c065ee98) at dht-inode-write.c:119
#19 0x00007fb115630b32 in afr_writev_unwind (frame=frame@entry=0x7fb0f8004888, this=this@entry=0x7fb11000fff0) at afr-inode-write.c:246
#20 0x00007fb11563105e in afr_writev_wind_cbk (frame=0x7fb0f800bd08, cookie=<optimized out>, this=0x7fb11000fff0, op_ret=<optimized out>, op_errno=<optimized out>, 
    prebuf=<optimized out>, postbuf=0x7fb10dae7990, xdata=0x7fb0c065ee98) at afr-inode-write.c:406
#21 0x00007fb1158a7ffa in client3_3_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fb0f802d148) at client-rpc-fops.c:838
#22 0x00007fb123198b30 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fb11004a940, pollin=pollin@entry=0x7fb10bb95520) at rpc-clnt.c:778
#23 0x00007fb123198ed3 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fb11004a970, event=<optimized out>, data=0x7fb10bb95520) at rpc-clnt.c:971
#24 0x00007fb123194c33 in rpc_transport_notify (this=this@entry=0x7fb11004ac90, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fb10bb95520) at rpc-transport.c:552
#25 0x00007fb117d89576 in socket_event_poll_in (this=this@entry=0x7fb11004ac90, notify_handled=<optimized out>) at socket.c:2322
#26 0x00007fb117d8bb1c in socket_event_handler (fd=11, idx=4, gen=1, data=0x7fb11004ac90, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#27 0x00007fb12342ee84 in event_dispatch_epoll_handler (event=0x7fb10dae7e80, event_pool=0x56476beb1ec0) at event-epoll.c:583
#28 event_dispatch_epoll_worker (data=0x56476bf0b1d0) at event-epoll.c:659
#29 0x00007fb12222fdd5 in start_thread (arg=0x7fb10dae8700) at pthread_create.c:307
#30 0x00007fb121af7ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Comment 3 SATHEESARAN 2019-01-22 12:12:24 UTC
Errors as seen in the fuse mount log:
[2019-01-22 11:29:05.487058] E [shard.c:5935:shard_post_lookup_fsync_handler] (-->/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x134c3) [0x7fb1151544c3] -->/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x813c) [0x7fb11514913c] -->/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x2083e) [0x7fb11516183e] ) 0-: Assertion failed: fsync_count > 0
pending frames:
frame : type(0) op(0)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-01-22 11:29:05
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x9d)[0x7fb1233d0e8d]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fb1233daf54]
/lib64/libc.so.6(+0x36280)[0x7fb121a30280]
/lib64/libglusterfs.so.0(+0x3ba7a)[0x7fb1233e0a7a]
/lib64/libglusterfs.so.0(inode_ref+0x21)[0x7fb1233e0b81]
/lib64/libglusterfs.so.0(+0x50d2b)[0x7fb1233f5d2b]
/lib64/libglusterfs.so.0(+0x51f4a)[0x7fb1233f6f4a]
/lib64/libglusterfs.so.0(fd_anonymous+0x2d)[0x7fb1233f729d]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x20534)[0x7fb115161534]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x813c)[0x7fb11514913c]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x134c3)[0x7fb1151544c3]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x3189)[0x7fb114f30189]
/lib64/libglusterfs.so.0(call_resume_keep_stub+0x75)[0x7fb1233f5b15]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x8a69)[0x7fb114f35a69]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x8b7b)[0x7fb114f35b7b]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x8c57)[0x7fb114f35c57]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x15840)[0x7fb115156840]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x15bc0)[0x7fb115156bc0]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x6cc0)[0x7fb115147cc0]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x1d811)[0x7fb11515e811]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x6967e)[0x7fb1153d467e]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x22b32)[0x7fb115630b32]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x2305e)[0x7fb11563105e]
/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x18ffa)[0x7fb1158a7ffa]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fb123198b30]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a3)[0x7fb123198ed3]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fb123194c33]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7576)[0x7fb117d89576]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b1c)[0x7fb117d8bb1c]
/lib64/libglusterfs.so.0(+0x89e84)[0x7fb12342ee84]
/lib64/libpthread.so.0(+0x7dd5)[0x7fb12222fdd5]
/lib64/libc.so.6(clone+0x6d)[0x7fb121af7ead]
---------

Comment 5 Krutika Dhananjay 2019-01-23 16:37:55 UTC
This seems to happen when the VM image creation fills the lru list of inodes maintained by the shard translator, whose size is 16K.
In any case, the bug isn't related to the shard deletion patch. The crash is seen during the fsync that follows the writes, and at the moment it appears to be caused by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1583462.

Managed to recreate the bug with a smaller data set:
1. Set the shard-lru-limit option to a small value, say 100. (In fact, the very reason I added this option in https://bugzilla.redhat.com/show_bug.cgi?id=1603118#c13 was to make it easier to test this code path, since it is a bit complicated and prone to races.)

2. Execute the qemu-img command from the "Description" above to create a smaller file (say 10G); the crash is reproduced.

Made some progress: I identified that this bug is hit when the shard is part of the fsync list but NOT the lru list. This violates an assumption made at the time of writing that patch, namely that if a shard is part of the fsync list, it is guaranteed to be part of the lru list as well, which is where it gets ref'd. But that is not what happened here.

So no lru list presence means no ref. The inode is destroyed in memory at some point with its "ctx" object still on the fsync list. When this list is traversed during fsync, the client crashes because that memory has already been freed.
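
To illustrate, here is a minimal toy model of that scenario (hypothetical names only, not the real shard.c/inode.c code): the fsync list points at the shard's ctx without holding a ref, so once the last ref is dropped the list entry dangles.

/* toy_fsync_list.c - hypothetical, simplified model of the crash above;
 * the types and functions are made up for illustration and are not the
 * real shard/inode code. Build with: gcc -std=c99 toy_fsync_list.c */
#include <stdio.h>
#include <stdlib.h>

struct toy_shard_ctx {
    char                  name[32];
    int                   ref;        /* stand-in for the inode refcount          */
    int                   destroyed;  /* marks where the real client frees memory */
    struct toy_shard_ctx *fsync_next; /* membership in the fsync list             */
};

static struct toy_shard_ctx *fsync_list;  /* head of the fsync list */

static struct toy_shard_ctx *
toy_ctx_new(const char *name)
{
    struct toy_shard_ctx *c = calloc(1, sizeof(*c));
    snprintf(c->name, sizeof(c->name), "%s", name);
    c->ref = 1;  /* ref held by whoever is writing to this shard right now */
    return c;
}

/* The shard is queued for a later fsync, but no ref is taken -- and since
 * it is not on the lru list either, nothing else holds one. */
static void
toy_add_to_fsync_list(struct toy_shard_ctx *c)
{
    c->fsync_next = fsync_list;
    fsync_list = c;
}

/* At some point the last ref is dropped; the real client would free the
 * inode and its ctx here, with the ctx still on the fsync list. */
static void
toy_drop_last_ref(struct toy_shard_ctx *c)
{
    if (--c->ref == 0)
        c->destroyed = 1;  /* the real code frees the memory at this point */
}

/* Traversing the fsync list now touches destroyed memory -- in the real
 * client this is the segfault in __inode_get_xl_index() in the backtrace. */
static void
toy_fsync_all(void)
{
    for (struct toy_shard_ctx *c = fsync_list; c; c = c->fsync_next)
        printf("fsync %s%s\n", c->name,
               c->destroyed ? "  <-- use-after-free in the real client" : "");
}

int
main(void)
{
    struct toy_shard_ctx *shard = toy_ctx_new("shard-42");
    toy_add_to_fsync_list(shard);  /* queued for fsync, not ref'd         */
    toy_drop_last_ref(shard);      /* write fop completes, ctx destroyed  */
    toy_fsync_all();               /* application sends fsync -> crash    */
    free(shard);
    return 0;
}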

What's left is to identify the specific case where this might have happened. Calling it a day for now. Will resume investigation tomorrow.

-Krutika

Comment 6 Krutika Dhananjay 2019-01-24 06:33:26 UTC
Found the issue.

When an inode write fop (in this case FALLOCATE) needs to write to more shards than the lru list can hold, some of the shards participating in that fop get evicted from the lru list and are later added to the fsync list (because these shards are modified and need to be fsync'd when the application sends one), but without holding a ref.
Because of this, at the end of the fop, the inode and its ctx get unref'd and destroyed while still being part of the fsync list.

When the application does send an fsync later, the fsync list is traversed and the (now) dangling address is accessed, leading to a crash.

The fix is simple: also take a ref on the shard at the time it is added to the fsync list, and unref it once it has been fsync'd.
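
A minimal toy sketch of that ref/unref pattern (hypothetical names only, not the actual patch):

/* toy_fsync_fix.c - hypothetical sketch of the fix described above
 * (illustrative names only): the fsync list now holds a ref of its own,
 * so the ctx cannot be destroyed while it is still queued for fsync. */
#include <stdio.h>
#include <stdlib.h>

struct toy_shard_ctx {
    int                   ref;
    struct toy_shard_ctx *fsync_next;
};

static struct toy_shard_ctx *fsync_list;

static void
toy_ctx_unref(struct toy_shard_ctx *c)
{
    if (--c->ref == 0) {
        printf("destroying ctx %p (nothing needs it any more)\n", (void *)c);
        free(c);
    }
}

/* Fixed behaviour: queuing for fsync takes a ref of its own. */
static void
toy_add_to_fsync_list(struct toy_shard_ctx *c)
{
    c->ref++;
    c->fsync_next = fsync_list;
    fsync_list = c;
}

/* The fsync list's ref is released only after the fsync has been wound
 * for that shard. */
static void
toy_fsync_all(void)
{
    struct toy_shard_ctx *c = fsync_list, *next;
    fsync_list = NULL;
    while (c) {
        next = c->fsync_next;
        printf("fsync ctx %p (still alive, ref=%d)\n", (void *)c, c->ref);
        toy_ctx_unref(c);
        c = next;
    }
}

int
main(void)
{
    struct toy_shard_ctx *shard = calloc(1, sizeof(*shard));
    shard->ref = 1;                /* ref held by the in-flight write fop       */
    toy_add_to_fsync_list(shard);  /* ref -> 2                                  */
    toy_ctx_unref(shard);          /* write fop completes: ref -> 1, NOT freed  */
    toy_fsync_all();               /* fsync winds, then ref -> 0, freed safely  */
    return 0;
}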

I need to make this change and perform some tests (to be sure the new ref doesn't cause memory leaks in case it is not unref'd wherever it needs to be), after which I'll send out the patch upstream.

-Krutika

Comment 7 Krutika Dhananjay 2019-01-24 10:22:19 UTC
https://review.gluster.org/c/glusterfs/+/22091

Comment 10 Krutika Dhananjay 2019-01-24 14:20:59 UTC
Downstream patch - https://code.engineering.redhat.com/gerrit/#/c/161397/

(Upstream patch yet to be merged)

-Krutika

Comment 12 SATHEESARAN 2019-01-29 10:16:09 UTC
Tested with glusterfs-3.12.2-40.el7rhgs using the steps described in comment 0:

1. Created many fallocated images on Gluster volumes
2. Created fallocated images in parallel
3. Also tried the steps mentioned in comment 5

No issues observed

Comment 14 errata-xmlrpc 2019-02-04 07:41:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263