Bug 1405308 - [compound fops] fuse mount crashed when VM installation is in progress & one of the brick killed
Summary: [compound fops] fuse mount crashed when VM installation is in progress & one of the brick killed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Krutika Dhananjay
QA Contact:
URL:
Whiteboard:
Depends On: 1405299
Blocks: Gluster-HC-2
 
Reported: 2016-12-16 07:30 UTC by Krutika Dhananjay
Modified: 2017-03-08 10:23 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.9.1
Clone Of: 1405299
Environment:
Last Closed: 2017-03-08 10:23:02 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Krutika Dhananjay 2016-12-16 07:30:58 UTC
+++ This bug was initially created as a clone of Bug #1405299 +++

Description of problem:
-----------------------
The fuse mount crashed while VM installation was in progress on a VM image file residing on the replica 3 volume and one of the bricks was killed.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------


How reproducible:
-----------------
1/1

Steps to Reproduce:
--------------------
1. Create a replica 3 volume with compound-fops and granular-entry-heal enabled
2. Optimize the volume for VM store with shard-block-size set to 4MB
3. Fuse mount the volume on the RHEL 7.3 client/hypervisor
4. Create a sparse VM image file on the fuse-mounted volume
5. Start OS installation on the VM with RHEL 7.3 server
6. While VM installation is in progress, kill one of the bricks of the volume (see the command sketch below)
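
For reference, a rough command-level sketch of the steps above is given below. The gluster CLI invocations themselves are standard, but the volume name, hostnames, brick paths, mount point, and image name are placeholders taken from the volume info later in this report; the exact commands used in this run were not recorded, so treat this as an illustrative sketch rather than the verbatim reproducer.

# gluster volume create rep3vol replica 3 server1:/gluster/brick1/b1 server2:/gluster/brick1/b1 server3:/gluster/brick1/b1
# gluster volume set rep3vol group virt
# gluster volume set rep3vol cluster.use-compound-fops on
# gluster volume set rep3vol cluster.granular-entry-heal enable
# gluster volume set rep3vol features.shard-block-size 4MB
# gluster volume start rep3vol

On the RHEL 7.3 client/hypervisor, mount the volume and create a sparse image for the guest:

# mount -t glusterfs 10.70.37.138:/rep3vol /mnt/rep3vol
# qemu-img create -f raw /mnt/rep3vol/vm1.img 20G

Start the RHEL 7.3 guest installation against that image, then on one of the servers look up a brick PID and kill it:

# gluster volume status rep3vol
# kill -9 <brick-pid>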

Actual results:
--------------
Fuse mount crashed/core dumped

Expected results:
------------------
No process should crash.

--- Additional comment from SATHEESARAN on 2016-12-16 02:16:03 EST ---

Backtrace:
----------
Core was generated by `/usr/sbin/glusterfs --volfile-server=10.70.37.138 --volfile-id=/rep3vol /mnt/re'.
Program terminated with signal 11, Segmentation fault.
#0  afr_pre_op_writev_cbk (frame=0x7f24e25d2974, cookie=0x1, this=0x7f24d000a7b0, op_ret=<optimized out>, op_errno=<optimized out>, data=<optimized out>, xdata=0x0) at afr-transaction.c:1255
1255	                write_args_cbk = &args_cbk->rsp_list[1];
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-26.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libselinux-2.5-6.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64
(gdb) bt
#0  afr_pre_op_writev_cbk (frame=0x7f24e25d2974, cookie=0x1, this=0x7f24d000a7b0, op_ret=<optimized out>, op_errno=<optimized out>, data=<optimized out>, xdata=0x0) at afr-transaction.c:1255
#1  0x00007f24d6e91dd7 in client3_3_compound_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f24e25ceea8) at client-rpc-fops.c:3214
#2  0x00007f24e48ad785 in saved_frames_unwind (saved_frames=saved_frames@entry=0x7f24c4001620) at rpc-clnt.c:369
#3  0x00007f24e48ad86e in saved_frames_destroy (frames=frames@entry=0x7f24c4001620) at rpc-clnt.c:386
#4  0x00007f24e48aefd4 in rpc_clnt_connection_cleanup (conn=conn@entry=0x7f24d007cf18) at rpc-clnt.c:556
#5  0x00007f24e48af864 in rpc_clnt_handle_disconnect (conn=0x7f24d007cf18, clnt=0x7f24d007cec0) at rpc-clnt.c:881
#6  rpc_clnt_notify (trans=<optimized out>, mydata=0x7f24d007cf18, event=RPC_TRANSPORT_DISCONNECT, data=0x7f24d008cc10) at rpc-clnt.c:937
#7  0x00007f24e48ab883 in rpc_transport_notify (this=this@entry=0x7f24d008cc10, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f24d008cc10) at rpc-transport.c:537
#8  0x00007f24d9173302 in socket_event_poll_err (this=0x7f24d008cc10) at socket.c:1179
#9  socket_event_handler (fd=<optimized out>, idx=4, data=0x7f24d008cc10, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:2404
#10 0x00007f24e4b3f4f0 in event_dispatch_epoll_handler (event=0x7f24cfffee80, event_pool=0x7f24e5b41f00) at event-epoll.c:571
#11 event_dispatch_epoll_worker (data=0x7f24d003f420) at event-epoll.c:674
#12 0x00007f24e3946dc5 in start_thread (arg=0x7f24cffff700) at pthread_create.c:308
#13 0x00007f24e328b73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


--- Additional comment from SATHEESARAN on 2016-12-16 02:22:15 EST ---

Volume information
# gluster volume info rep3vol
 
Volume Name: rep3vol
Type: Replicate
Volume ID: 28e00021-7773-48f5-a31f-c9f8f2db0a2d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/brick1/b1
Brick2: server2:/gluster/brick1/b1
Brick3: server3:/gluster/brick1/b1
Options Reconfigured:
cluster.use-compound-fops: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 107
storage.owner-uid: 107
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
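
The two options implicated in this report, cluster.use-compound-fops and cluster.granular-entry-heal, show up above under "Options Reconfigured". On a reproduction setup, individual values can also be queried directly; gluster volume get is a standard CLI command on this release line, and the option names below are the ones listed above:

# gluster volume get rep3vol cluster.use-compound-fops
# gluster volume get rep3vol cluster.granular-entry-heal
# gluster volume get rep3vol features.shard-block-size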

--- Additional comment from SATHEESARAN on 2016-12-16 02:23:37 EST ---

Krutika has root-caused (RCA'd) the issue and found that patch [1] was missed in the backport, which caused this crash.

[1] - http://review.gluster.org/#/c/15482/9

@Krutika, Requesting you to provide the detailed RCA
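
For anyone who wants to inspect or test the missing change [1] locally before an updated build is available, patch set 9 of Gerrit change 15482 can presumably be fetched and cherry-picked onto the release-3.9 branch. The clone URL and the refs/changes naming below follow the usual GitHub-mirror and Gerrit conventions and are assumptions, not commands taken from this report:

# git clone https://github.com/gluster/glusterfs.git && cd glusterfs
# git checkout release-3.9
# git fetch http://review.gluster.org/glusterfs refs/changes/82/15482/9
# git cherry-pick FETCH_HEAD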

Comment 1 Worker Ant 2016-12-16 07:49:43 UTC
REVIEW: http://review.gluster.org/16161 (protocol/client: fix op_errno handling, was unused variable) posted (#1) for review on release-3.9 by Krutika Dhananjay (kdhananj)

Comment 2 Worker Ant 2016-12-18 12:57:53 UTC
COMMIT: http://review.gluster.org/16161 committed in release-3.9 by Kaleb KEITHLEY (kkeithle) 
------
commit 45431070d742ac9398b41efd23c1ea500a639669
Author: Kaleb S. KEITHLEY <kkeithle>
Date:   Tue Sep 13 05:57:32 2016 -0400

    protocol/client: fix op_errno handling, was unused variable
    
            Backport of: http://review.gluster.org/15482
    
    see comment in patch set one. Match the general logic flow of the
    other fop-cbks and eliminate the unused variable and its associated
    warning
    
    also see comment in patch set seven, re: correct handling of
    client_process_response(); and the associated BZ
    https://bugzilla.redhat.com/show_bug.cgi?id=1376328
    
    http://review.gluster.org/14085 fixes a "pragma leak" where the
    generated rpc/xdr headers have a pair of pragmas that disable these
    warnings. With the warnings disabled, many unused variables have
    crept into the code base.
    
    And 14085 won't pass its own smoke test until all these warnings are
    fixed.
    
    Change-Id: I9958a70b56023258921960410f9b641505fd4387
    BUG: 1405308
    Signed-off-by: Kaleb S. KEITHLEY <kkeithle>
    Reviewed-on: http://review.gluster.org/16161
    Tested-by: Krutika Dhananjay <kdhananj>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 3 Kaushal 2017-03-08 10:23:02 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.9.1, please open a new bug report.

glusterfs-3.9.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-January/029725.html
[2] https://www.gluster.org/pipermail/gluster-users/
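
To confirm that a given client or server is actually running the fixed build, checking the installed packages and the binary version is usually sufficient; the package names below assume the stock RPM packaging and may differ on other distributions:

# rpm -q glusterfs glusterfs-fuse glusterfs-server
# glusterfs --version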

