Bug 1669077 - [ovirt-gluster] Fuse mount crashed while creating the preallocated image
Summary: [ovirt-gluster] Fuse mount crashed while creating the preallocated image
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: sharding
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Krutika Dhananjay
QA Contact: bugs@gluster.org
URL:
Whiteboard:
Depends On: 1668304
Blocks: 1668309 1669382
 
Reported: 2019-01-24 09:36 UTC by Krutika Dhananjay
Modified: 2019-03-25 16:33 UTC
CC List: 1 user

Fixed In Version: glusterfs-6.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1668304
: 1669382 (view as bug list)
Environment:
Last Closed: 2019-01-24 18:15:43 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
Gluster.org Gerrit 22091: features/shard: Ref shard inode while adding to fsync list (status: Merged, last updated: 2019-01-24 18:15:42 UTC)

Description Krutika Dhananjay 2019-01-24 09:36:03 UTC
+++ This bug was initially created as a clone of Bug #1668304 +++

Description of problem:
------------------------
The FUSE mount crashed while creating a preallocated 1 TB image on an arbitrated replicate volume.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.4.3 nightly (glusterfs-3.12.2-38.el7rhgs)
RHV 4.2.8

How reproducible:
-----------------
1 out of 2 times

Steps to Reproduce:
-------------------
1. Create an arbitrated replicate volume
2. FUSE mount it
3. Create a preallocated image of size 1 TB:
# qemu-img create -f qcow2 -o preallocation=falloc /mnt/test1/vm1.img 1072G

Actual results:
---------------
The FUSE mount crashed with a segmentation fault.

Expected results:
-----------------
The preallocated image should be created successfully.

--- Additional comment from SATHEESARAN on 2019-01-22 12:10:19 UTC ---

1. Cluster info
----------------
There are 3 nodes in the cluster


2. Volume info
---------------
[root@ ~]# gluster volume info data
 
Volume Name: data
Type: Replicate
Volume ID: 7eb49e90-e2b6-4f8f-856e-7108212dbb72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: server1:/gluster_bricks/data/data
Brick2: server2:/gluster_bricks/data/data
Brick3: server3:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
cluster.enable-shared-storage: enable


--- Additional comment from SATHEESARAN on 2019-01-22 12:11:03 UTC ---

Backtrace
-----------

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --volfile-server=localhost --volfile-id=/data /mnt/test1'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fb1233e0a7a in __inode_get_xl_index (xlator=0x7fb1100143b0, inode=0x7fb104026258) at inode.c:455
455	        if ((inode->_ctx[xlator->xl_id].xl_key != NULL) &&
(gdb) bt
#0  0x00007fb1233e0a7a in __inode_get_xl_index (xlator=0x7fb1100143b0, inode=0x7fb104026258) at inode.c:455
#1  __inode_ref (inode=inode@entry=0x7fb104026258) at inode.c:537
#2  0x00007fb1233e0b81 in inode_ref (inode=inode@entry=0x7fb104026258) at inode.c:581
#3  0x00007fb1233f5d2b in __fd_create (inode=inode@entry=0x7fb104026258, pid=pid@entry=0) at fd.c:633
#4  0x00007fb1233f6f4a in __fd_anonymous (inode=inode@entry=0x7fb104026258, flags=flags@entry=2) at fd.c:779
#5  0x00007fb1233f729d in fd_anonymous (inode=0x7fb104026258) at fd.c:803
#6  0x00007fb115161534 in shard_post_lookup_fsync_handler (frame=0x7fb0c05eb178, this=0x7fb1100143b0) at shard.c:5936
#7  0x00007fb11514913c in shard_lookup_base_file (frame=frame@entry=0x7fb0c05eb178, this=this@entry=0x7fb1100143b0, loc=loc@entry=0x7fb10800a158, 
    handler=handler@entry=0x7fb115161030 <shard_post_lookup_fsync_handler>) at shard.c:1746
#8  0x00007fb1151544c3 in shard_fsync (frame=0x7fb0c05eb178, this=0x7fb1100143b0, fd=0x7fb0f800eb78, datasync=1, xdata=0x0) at shard.c:6015
#9  0x00007fb114f30189 in wb_fsync_helper (frame=0x7fb0f80022e8, this=0x7fb1100159d0, fd=0x7fb0f800eb78, datasync=1, xdata=0x0) at write-behind.c:1974
#10 0x00007fb1233f5b15 in call_resume_keep_stub (stub=0x7fb0f80250f8) at call-stub.c:2582
#11 0x00007fb114f35a69 in wb_do_winds (wb_inode=wb_inode@entry=0x7fb0f800dd70, tasks=tasks@entry=0x7fb10dae7510) at write-behind.c:1672
#12 0x00007fb114f35b7b in wb_process_queue (wb_inode=wb_inode@entry=0x7fb0f800dd70) at write-behind.c:1709
#13 0x00007fb114f35c57 in wb_fulfill_cbk (frame=0x7fb0f8010a58, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, 
    postbuf=0x7fb0f8003670, xdata=0x7fb0c065ee98) at write-behind.c:1054
#14 0x00007fb115156840 in shard_common_inode_write_success_unwind (fop=<optimized out>, frame=0x7fb0f80019b8, op_ret=65536) at shard.c:903
#15 0x00007fb115156bc0 in shard_common_inode_write_post_update_size_handler (frame=<optimized out>, this=<optimized out>) at shard.c:5214
#16 0x00007fb115147cc0 in shard_update_file_size (frame=frame@entry=0x7fb0f80019b8, this=this@entry=0x7fb1100143b0, fd=0x7fb0f800eb78, loc=loc@entry=0x0, 
    handler=handler@entry=0x7fb115156ba0 <shard_common_inode_write_post_update_size_handler>) at shard.c:1201
#17 0x00007fb11515e811 in shard_common_inode_write_do_cbk (frame=frame@entry=0x7fb0f80019b8, cookie=0x7fb0f800eb78, this=0x7fb1100143b0, op_ret=op_ret@entry=65536, 
    op_errno=op_errno@entry=0, pre=pre@entry=0x7fb0f8029730, post=post@entry=0x7fb0f80297a0, xdata=xdata@entry=0x7fb0c065ee98) at shard.c:5326
#18 0x00007fb1153d467e in dht_writev_cbk (frame=0x7fb0f80021d8, cookie=<optimized out>, this=<optimized out>, op_ret=65536, op_errno=0, prebuf=0x7fb0f8029730, postbuf=0x7fb0f80297a0, 
    xdata=0x7fb0c065ee98) at dht-inode-write.c:119
#19 0x00007fb115630b32 in afr_writev_unwind (frame=frame@entry=0x7fb0f8004888, this=this@entry=0x7fb11000fff0) at afr-inode-write.c:246
#20 0x00007fb11563105e in afr_writev_wind_cbk (frame=0x7fb0f800bd08, cookie=<optimized out>, this=0x7fb11000fff0, op_ret=<optimized out>, op_errno=<optimized out>, 
    prebuf=<optimized out>, postbuf=0x7fb10dae7990, xdata=0x7fb0c065ee98) at afr-inode-write.c:406
#21 0x00007fb1158a7ffa in client3_3_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fb0f802d148) at client-rpc-fops.c:838
#22 0x00007fb123198b30 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fb11004a940, pollin=pollin@entry=0x7fb10bb95520) at rpc-clnt.c:778
#23 0x00007fb123198ed3 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fb11004a970, event=<optimized out>, data=0x7fb10bb95520) at rpc-clnt.c:971
#24 0x00007fb123194c33 in rpc_transport_notify (this=this@entry=0x7fb11004ac90, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fb10bb95520) at rpc-transport.c:552
#25 0x00007fb117d89576 in socket_event_poll_in (this=this@entry=0x7fb11004ac90, notify_handled=<optimized out>) at socket.c:2322
#26 0x00007fb117d8bb1c in socket_event_handler (fd=11, idx=4, gen=1, data=0x7fb11004ac90, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#27 0x00007fb12342ee84 in event_dispatch_epoll_handler (event=0x7fb10dae7e80, event_pool=0x56476beb1ec0) at event-epoll.c:583
#28 event_dispatch_epoll_worker (data=0x56476bf0b1d0) at event-epoll.c:659
#29 0x00007fb12222fdd5 in start_thread (arg=0x7fb10dae8700) at pthread_create.c:307
#30 0x00007fb121af7ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

--- Additional comment from SATHEESARAN on 2019-01-22 12:12:24 UTC ---

Errors as seen in the fuse mount log:
[2019-01-22 11:29:05.487058] E [shard.c:5935:shard_post_lookup_fsync_handler] (-->/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x134c3) [0x7fb1151544c3] -->/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x813c) [0x7fb11514913c] -->/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x2083e) [0x7fb11516183e] ) 0-: Assertion failed: fsync_count > 0
pending frames:
frame : type(0) op(0)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-01-22 11:29:05
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x9d)[0x7fb1233d0e8d]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fb1233daf54]
/lib64/libc.so.6(+0x36280)[0x7fb121a30280]
/lib64/libglusterfs.so.0(+0x3ba7a)[0x7fb1233e0a7a]
/lib64/libglusterfs.so.0(inode_ref+0x21)[0x7fb1233e0b81]
/lib64/libglusterfs.so.0(+0x50d2b)[0x7fb1233f5d2b]
/lib64/libglusterfs.so.0(+0x51f4a)[0x7fb1233f6f4a]
/lib64/libglusterfs.so.0(fd_anonymous+0x2d)[0x7fb1233f729d]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x20534)[0x7fb115161534]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x813c)[0x7fb11514913c]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x134c3)[0x7fb1151544c3]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x3189)[0x7fb114f30189]
/lib64/libglusterfs.so.0(call_resume_keep_stub+0x75)[0x7fb1233f5b15]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x8a69)[0x7fb114f35a69]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x8b7b)[0x7fb114f35b7b]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x8c57)[0x7fb114f35c57]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x15840)[0x7fb115156840]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x15bc0)[0x7fb115156bc0]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x6cc0)[0x7fb115147cc0]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x1d811)[0x7fb11515e811]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x6967e)[0x7fb1153d467e]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x22b32)[0x7fb115630b32]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x2305e)[0x7fb11563105e]
/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x18ffa)[0x7fb1158a7ffa]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fb123198b30]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a3)[0x7fb123198ed3]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fb123194c33]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7576)[0x7fb117d89576]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b1c)[0x7fb117d8bb1c]
/lib64/libglusterfs.so.0(+0x89e84)[0x7fb12342ee84]
/lib64/libpthread.so.0(+0x7dd5)[0x7fb12222fdd5]
/lib64/libc.so.6(clone+0x6d)[0x7fb121af7ead]
---------

--- Additional comment from Krutika Dhananjay on 2019-01-23 16:37:55 UTC ---

So this seems to happen when the VM image creation fills up the lru list (size 16K) maintained by the shard translator.
In any case, the bug isn't related to the shard deletion patch. The crash is seen during the fsync that follows the writes, and at the moment it appears to be caused by the Dalton fix @ https://bugzilla.redhat.com/show_bug.cgi?id=1583462.

Managed to recreate the bug with a smaller data set:
1. Set the shard-lru-limit option to a small value, say 100. (In fact, the very reason I added this option in https://bugzilla.redhat.com/show_bug.cgi?id=1603118#c13 was to enable easier testing of this code path, since it is a bit complicated and prone to races.)

2. Execute the qemu-img command from the "Description" above to create a smaller file (say 10G), and there is a crash. Example commands are shown below.
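
For illustration, the smaller reproducer might look like this on the "data" volume from the comments above; the exact option key (features.shard-lru-limit) is an assumption based on the option name mentioned above, not copied from the original report:

# gluster volume set data features.shard-lru-limit 100
# qemu-img create -f qcow2 -o preallocation=falloc /mnt/test1/vm1.img 10G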

Made some progress in identifying that this bug is hit when the shard is part of the fsync list but NOT of the lru list. This defies an assumption made at the time of writing that patch, namely that if a shard is part of the fsync list, it is guaranteed to also be part of the lru list, which is where it gets ref'd. That is not what happened here.

So no lru-list presence => no ref. The inode is destroyed in memory at some point while its "ctx" object is still on the fsync list. When that list is traversed during fsync, the client crashes because the memory has already been freed.

What's left is to identify the specific case where this might have happened. Calling it a day for now. Will resume investigation tomorrow.

-Krutika

--- Additional comment from Krutika Dhananjay on 2019-01-24 06:33:26 UTC ---

Found the issue.

When an inode write fop (in this case FALLOCATE) needs to write to more shards than the lru list can hold, some of the shards participating in the fop get evicted from the lru list and are later added to the fsync list (because they were modified and need to be fsync'd when the application sends one), but without holding a ref.
Because of this, at the end of the fop the inode and its ctx get unref'd and destroyed while still being part of the fsync list.

When the application does send an fsync later, the fsync list is traversed and the now-dangling address is accessed, leading to the crash.

The fix is simple: also take a ref on the shard inode at the time it is added to the fsync list, and drop that ref once it has been fsync'd (see the sketch below).
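
A minimal sketch of that ref-counting pattern, under the assumption that the shard translator links a per-shard context into a to-fsync list; inode_ref()/inode_unref() are the libglusterfs APIs visible in the backtrace above, while the struct and helper names here are illustrative and not the actual patch:

#include <glusterfs/inode.h>    /* inode_t, inode_ref(), inode_unref() */
#include <glusterfs/list.h>     /* struct list_head, list_add_tail(), list_del_init() */

/* Hypothetical per-shard context linking the shard inode into the fsync list. */
struct shard_fsync_entry {
        struct list_head to_fsync_list;
        inode_t         *inode;
};

static void
fsync_list_add (struct shard_fsync_entry *entry, struct list_head *fsync_list)
{
        /* Hold a ref for as long as the shard inode sits on the fsync list,
         * so lru eviction can no longer destroy it underneath us. */
        entry->inode = inode_ref (entry->inode);
        list_add_tail (&entry->to_fsync_list, fsync_list);
}

static void
fsync_list_remove (struct shard_fsync_entry *entry)
{
        list_del_init (&entry->to_fsync_list);
        inode_unref (entry->inode);    /* drop the ref taken at add time */
}

Before this change, only shards present on the lru list held a ref, so a shard evicted from the lru list but still queued for fsync could be freed and later dereferenced during the fsync traversal.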

Need to make this change and perform some tests (to be sure the new ref doesn't cause memory leaks in case it is not unref'd wherever it needs to be), after which I'll send the patch upstream.

-Krutika

Comment 1 Worker Ant 2019-01-24 18:15:43 UTC
REVIEW: https://review.gluster.org/22091 (features/shard: Ref shard inode while adding to fsync list) merged (#4) on master by Xavi Hernandez

Comment 2 Shyamsundar 2019-03-25 16:33:11 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/

