Bug 961615 - Hypervisor mount crashed after rebalance
Summary: Hypervisor mount crashed after rebalance
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On: 960368
Blocks: 1110456
 
Reported: 2013-05-10 05:36 UTC by Pranith Kumar K
Modified: 2014-06-19 13:59 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.4.0
Clone Of: 960368
Environment:
virt rhev integration
Last Closed: 2013-07-24 17:39:52 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments:

Description Pranith Kumar K 2013-05-10 05:36:45 UTC
+++ This bug was initially created as a clone of Bug #960368 +++

Description of problem:
A distributed-replicate volume had 5 VMs on it with I/O running. After a couple of add-brick and rebalance operations, the client crashed.

Version-Release number of selected component (if applicable):
[root@rhs1-bb ~]# rpm -qa| grep gluster
glusterfs-server-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-devel-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.3rhs-1.el6rhs.x86_64




How reproducible:


Steps to Reproduce:
1. I had a 6x2 distributed-replicate volume serving as virtual machine storage.
2. Added one pair of bricks and initiated a rebalance.
3. Once the rebalance was over, I hit bug 960046.
4. Restarted the VMs and added 4 more bricks (2 more pairs).
5. Started rebalance again, which led to bug 960046 (VMs got paused). After some time the storage domain was down; further investigation revealed a client crash on the hypervisor.
  

volume info
============

Volume Name: vstore
Type: Distributed-Replicate
Volume ID: e8fe6a61-6345-41f0-9329-a802b051a026
Status: Started
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: 10.70.37.76:/brick1/vs1
Brick2: 10.70.37.133:/brick1/vs1
Brick3: 10.70.37.76:/brick2/vs2
Brick4: 10.70.37.133:/brick2/vs2
Brick5: 10.70.37.76:/brick3/vs3
Brick6: 10.70.37.133:/brick3/vs3
Brick7: 10.70.37.76:/brick4/vs4
Brick8: 10.70.37.133:/brick4/vs4
Brick9: 10.70.37.76:/brick5/vs5
Brick10: 10.70.37.133:/brick5/vs5
Brick11: 10.70.37.76:/brick6/vs6
Brick12: 10.70.37.133:/brick6/vs6
Brick13: 10.70.37.134:/brick1/vs1
Brick14: 10.70.37.59:/brick1/vs1
Brick15: 10.70.37.134:/brick2/vs7
Brick16: 10.70.37.59:/brick2/vs7
Brick17: 10.70.37.134:/brick3/vs8
Brick18: 10.70.37.59:/brick3/vs8
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: on




bt
===
(gdb) bt
#0  0x000000362780c170 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007fa3304de4c3 in fd_ref (fd=0x2d54124) at fd.c:447
#2  0x00007fa32e8afd4e in fuse_resolve_fd (state=<value optimized out>) at fuse-resolve.c:461
#3  fuse_resolve (state=<value optimized out>) at fuse-resolve.c:622
#4  0x00007fa32e8b026e in fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:665
#5  0x00007fa32e8b02b8 in fuse_resolve_and_resume (state=0x7fa31c038cb0, fn=0x7fa32e8b87a0 <fuse_flush_resume>) at fuse-resolve.c:705
#6  0x00007fa32e8c66f8 in fuse_thread_proc (data=0x100fb70) at fuse-bridge.c:4562
#7  0x0000003627807851 in start_thread () from /lib64/libpthread.so.0
#8  0x00000036274e890d in clone () from /lib64/libc.so.6




Program terminated with signal 11, Segmentation fault.
#0  0x000000362780c170 in pthread_spin_lock () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6_4.2.x86_64 libcom_err-1.41.12-14.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 openssl-1.0.0-27.el6_4.2.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) thread apply all bt full

Thread 7 (Thread 0x7fa32c682700 (LWP 15004)):
#0  0x000000362780ed2d in nanosleep () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fa3304cfd4c in gf_timer_proc (ctx=0xff3010) at timer.c:173
        now = 1367905197921268
        now_tv = {tv_sec = 1367905197, tv_usec = 921268}
        event = 0x7fa3240008e0
        reg = 0x10292b0
        sleepts = {tv_sec = 1, tv_nsec = 0}
        __FUNCTION__ = "gf_timer_proc"
#2  0x0000003627807851 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00000036274e890d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 6 (Thread 0x7fa32dea6700 (LWP 15000)):
#0  0x000000362780b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fa3304eab1f in syncenv_task (proc=0x101b4f0) at syncop.c:335
        env = 0x101b4f0
        task = 0x0
        sleep_till = {tv_sec = 1367905700, tv_nsec = 0}
        ret = <value optimized out>
#2  0x00007fa3304ee8d0 in syncenv_processor (thdata=0x101b4f0) at syncop.c:414
        env = 0x101b4f0
        proc = 0x101b4f0
        task = <value optimized out>
#3  0x0000003627807851 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00000036274e890d in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 5 (Thread 0x7fa32e8a7700 (LWP 14999)):
#0  0x000000362780f2a5 in sigwait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000040523b in glusterfs_sigwaiter (arg=<value optimized out>) at glusterfsd.c:1572
        set = {__val = {18947, 0 <repeats 15 times>}}
        ret = <value optimized out>
        sig = 10
#2  0x0000003627807851 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00000036274e890d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x7fa32335d700 (LWP 15008)):
#0  0x000000362780e54d in read () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fa32e8b3ab3 in notify_kernel_loop (data=<value optimized out>) at fuse-bridge.c:3550
        this = <value optimized out>
        priv = 0x10109e0
        fouh = <value optimized out>
        rv = <value optimized out>
        inval_buf = '\000' <repeats 287 times>
        __FUNCTION__ = "notify_kernel_loop"
#2  0x0000003627807851 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00000036274e890d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x7fa33006e700 (LWP 14998)):
#0  0x00000036274e8f03 in epoll_wait () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fa330500817 in event_dispatch_epoll (event_pool=0x100eea0) at event-epoll.c:428
        events = 0x102a050
        i = <value optimized out>
        ret = 0
        __FUNCTION__ = "event_dispatch_epoll"
#2  0x00000000004066c6 in main (argc=4, argv=0x7fff277ad7c8) at glusterfsd.c:1902
        ctx = 0xff3010
        ret = 0
        cmdlinestr = "/usr/sbin/glusterfs --volfile-id=vstore --volfile-server=10.70.37.76 /rhev/data-center/mnt/10.70.37.76:vstore", '\000' <repeats 3986 times>
        __FUNCTION__ = "main"

Thread 2 (Thread 0x7fa32d4a5700 (LWP 15001)):
#0  0x000000362780b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fa3304eab1f in syncenv_task (proc=0x101b8b0) at syncop.c:335
        env = 0x101b4f0
        task = 0x0
        sleep_till = {tv_sec = 1367905700, tv_nsec = 0}
        ret = <value optimized out>
#2  0x00007fa3304ee8d0 in syncenv_processor (thdata=0x101b8b0) at syncop.c:414
        env = 0x101b4f0
        proc = 0x101b8b0
        task = <value optimized out>
#3  0x0000003627807851 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00000036274e890d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7fa323d5e700 (LWP 15007)):
#0  0x000000362780c170 in pthread_spin_lock () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fa3304de4c3 in fd_ref (fd=0x2d54124) at fd.c:447
        refed_fd = 0x0
        __FUNCTION__ = "fd_ref"
#2  0x00007fa32e8afd4e in fuse_resolve_fd (state=<value optimized out>) at fuse-resolve.c:461
        activefd = 0x2d54124
        this = <value optimized out>
        ret = 0
        fd_migration_error = 0 '\000'
        resolve = 0x7fa31c038db0
        basefd = 0x1482d30
        active_subvol = 0x0
        basefd_ctx = 0x7fa31c0314a0
#3  fuse_resolve (state=<value optimized out>) at fuse-resolve.c:622
        resolve = 0x7fa31c038db0
#4  0x00007fa32e8b026e in fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:665
        __FUNCTION__ = "fuse_resolve_all"
#5  0x00007fa32e8b02b8 in fuse_resolve_and_resume (state=0x7fa31c038cb0, fn=0x7fa32e8b87a0 <fuse_flush_resume>) at fuse-resolve.c:705
No locals.
#6  0x00007fa32e8c66f8 in fuse_thread_proc (data=0x100fb70) at fuse-bridge.c:4562
        mount_point = 0x0
        this = 0x100fb70
        priv = 0x10109e0
        res = <value optimized out>
        iobuf = 0x100e6a0
        finh = 0x7fa31c03b970
        iov_in = {{iov_base = 0x7fa31c03b970, iov_len = 80}, {iov_base = 0x7fa32f96d000, iov_len = 131072}}
        msg = <value optimized out>
        fuse_ops = 0x7fa32ead2920
        pfd = {{fd = 5, events = 25, revents = 1}, {fd = 7, events = 25, revents = 1}}
        mount_finished = _gf_true
        __FUNCTION__ = "fuse_thread_proc"
#7  0x0000003627807851 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#8  0x00000036274e890d in clone () from /lib64/libc.so.6
No symbol table info available.
 



Attaching the sosreport.

--- Additional comment from shylesh on 2013-05-07 02:27:16 EDT ---

sosreports@

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/960368/

--- Additional comment from RHEL Product and Program Management on 2013-05-07 02:44:31 EDT ---

Since this issue was entered in bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from shishir gowda on 2013-05-07 06:40:25 EDT ---

Please provide access to the host where the crash was seen

--- Additional comment from shylesh on 2013-05-07 07:06:55 EDT ---

(In reply to comment #3)
> Please provide access to the host where the crash was seen

Hypervisor: rhs-gp-srv4.lab.eng.blr.redhat.com
Please let me know once you are done with access.

--- Additional comment from shishir gowda on 2013-05-07 08:35:07 EDT ---

Looks like a flush call has come in on a freed fd

(gdb) bt
#0  0x000000362780c170 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007fa3304de4c3 in fd_ref (fd=0x2d54124) at fd.c:447
#2  0x00007fa32e8afd4e in fuse_resolve_fd (state=<value optimized out>) at fuse-resolve.c:461
#3  fuse_resolve (state=<value optimized out>) at fuse-resolve.c:622
#4  0x00007fa32e8b026e in fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:665
#5  0x00007fa32e8b02b8 in fuse_resolve_and_resume (state=0x7fa31c038cb0, fn=0x7fa32e8b87a0 <fuse_flush_resume>) at fuse-resolve.c:705
#6  0x00007fa32e8c66f8 in fuse_thread_proc (data=0x100fb70) at fuse-bridge.c:4562
#7  0x0000003627807851 in start_thread () from /lib64/libpthread.so.0
#8  0x00000036274e890d in clone () from /lib64/libc.so.6
(gdb) f 1
#1  0x00007fa3304de4c3 in fd_ref (fd=0x2d54124) at fd.c:447
447	        LOCK (&fd->inode->lock);

(gdb) p fd->inode
$12 = (struct _inode *) 0xaaaaaaaa

(gdb) p *fd
$13 = {pid = 0, flags = 0, refcount = 0, inode_list = {next = 0x2d54134, prev = 0x2d54134}, inode = 0xaaaaaaaa, lock = 1, _ctx = 0x7fa31c03a090, xl_count = 32, lk_ctx = 0x7fa31c012230, anonymous = _gf_true}  <=====ref count is zero

(gdb) f 5
#5  0x00007fa32e8b02b8 in fuse_resolve_and_resume (state=0x7fa31c038cb0, fn=0x7fa32e8b87a0 <fuse_flush_resume>) at fuse-resolve.c:705
705	        fuse_resolve_all (state);


(gdb) p activefd 
$15 = (fd_t *) 0x2d54124

(gdb) p *basefd
$17 = {pid = 8614, flags = 573442, refcount = 4, inode_list = {next = 0x7fa31b48f370, prev = 0x7fa31b48f370}, inode = 0x7fa31b48f338, lock = 0, _ctx = 0x7fa31c029d70, xl_count = 26, lk_ctx = 0x7fa31c00a5b0, 
  anonymous = _gf_false}
(gdb) p *basefd->inode
$18 = {table = 0x1482930, gfid = "K\362\247w\304\030G>\226\222\343q\343\270:\t", lock = 1, nlookup = 51, fd_count = 4294967247, ref = 1, ia_type = IA_IFREG, fd_list = {next = 0x1482d40, prev = 0x1482d40}, 
  dentry_list = {next = 0x7fa31b1ee3c8, prev = 0x7fa31b1ee3c8}, hash = {next = 0x7fa31b1270a0, prev = 0x7fa31b1270a0}, list = {next = 0x7fa31b48f91c, prev = 0x7fa31b48f7e4}, _ctx = 0x7fa31c036100}
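
For illustration, here is a minimal, self-contained sketch (not GlusterFS source; the types and helpers are simplified stand-ins for fd_t, fd_ref() and fd_unref()) of why a FLUSH on such an fd faults inside pthread_spin_lock(): one unref too many drives the refcount to zero, the fd is destroyed, and its inode pointer is left dangling, matching the poisoned 0xaaaaaaaa value seen above.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct inode {
    pthread_spinlock_t lock;
} inode_t;

typedef struct fd {
    int      refcount;
    inode_t *inode;
} fd_t;

/* The real fd_ref() grabs fd->inode->lock first (fd.c:447); that dereference
 * is what faults once the fd has already been destroyed. */
static fd_t *fd_ref(fd_t *fd)
{
    pthread_spin_lock(&fd->inode->lock);
    fd->refcount++;
    pthread_spin_unlock(&fd->inode->lock);
    return fd;
}

static void fd_destroy(fd_t *fd)
{
    /* Mimic what the core shows: after destruction the inode pointer no
     * longer points at a valid inode (the 0xaaaaaaaa value above). */
    fd->inode = (inode_t *)(uintptr_t)0xaaaaaaaa;
    free(fd);
}

static void fd_unref(fd_t *fd)
{
    if (--fd->refcount == 0)
        fd_destroy(fd);
}

int main(void)
{
    inode_t *inode = calloc(1, sizeof(*inode));
    fd_t    *fd    = calloc(1, sizeof(*fd));

    pthread_spin_init(&inode->lock, PTHREAD_PROCESS_PRIVATE);
    fd->inode    = inode;
    fd->refcount = 1;   /* reference held by the fuse fd context */

    fd_ref(fd);         /* dht takes its own reference into its local state   */
    fd_unref(fd);       /* the EXTRA unref in the migration-completion check  */
    fd_unref(fd);       /* final unref from local wipe: refcount hits 0 and   */
                        /* the fd is destroyed while fuse still holds it      */

    printf("fd destroyed; the next FLUSH would call fd_ref() on the dangling\n"
           "pointer and crash in pthread_spin_lock(), as in frame #1 above\n");
    /* fd_ref(fd);  <-- the crash shown in the backtrace */

    pthread_spin_destroy(&inode->lock);
    free(inode);
    return 0;
}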

--- Additional comment from shishir gowda on 2013-05-07 08:56:10 EDT ---

(gdb) p *(xlator_t *)fd->_ctx[2].xl_key
$30 = {name = 0x14a3790 "vstore", type = 0x14a37b0 "debug/io-stats", next = 0x14dc3a0, prev = 0x0, parents = 0x0, children = 0x14a6570, options = 0x7fa32ecbc8a0, dlhandle = 0x1041c30, fops = 0x7fa32b17a5c0, 
  cbks = 0x7fa32b17a860, dumpops = 0x0, volume_options = {next = 0x14a36c0, prev = 0x14a36c0}, fini = 0x7fa32af67f40 <fini>, init = 0x7fa32af683a0 <init>, reconfigure = 0x7fa32af68690 <reconfigure>, 
  mem_acct_init = 0x7fa32af68620 <mem_acct_init>, notify = 0x7fa32af70da0 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 46 times>}, 
  history = 0x0, ctx = 0xff3010, graph = 0x14a5550, itable = 0x2c1d030, init_succeeded = 1 '\001', private = 0x14dd6e0, mem_acct = {num_types = 0, rec = 0x0}, winds = 1, switched = 0 '\000', local_pool = 0x0, 
  is_autoloaded = _gf_false}
(gdb) p *(xlator_t *)fd->_ctx[1].xl_key
$31 = {name = 0x149f340 "vstore-open-behind", type = 0x14a3770 "performance/open-behind", next = 0x14dba00, prev = 0x14dcd40, parents = 0x14a64d0, children = 0x14a3700, options = 0x7fa32ecbc814, 
  dlhandle = 0x1040bc0, fops = 0x7fa32b3824c0, cbks = 0x7fa32b382760, dumpops = 0x7fa32b382780, volume_options = {next = 0x14a36e0, prev = 0x14a36e0}, fini = 0x7fa32b17e650 <fini>, 
  init = 0x7fa32b17e690 <init>, reconfigure = 0x7fa32b17e7e0 <reconfigure>, mem_acct_init = 0x7fa32b17e860 <mem_acct_init>, notify = 0x7fa3304c1190 <default_notify>, loglevel = GF_LOG_NONE, latencies = {{
      min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 46 times>}, history = 0x0, ctx = 0xff3010, graph = 0x14a5550, itable = 0x0, init_succeeded = 1 '\001', private = 0x149f370, mem_acct = {
    num_types = 0, rec = 0x0}, winds = 0, switched = 0 '\000', local_pool = 0x0, is_autoloaded = _gf_false}

--- Additional comment from shishir gowda on 2013-05-07 09:28:14 EDT ---

Looks like an open-behind xlator issue.

--- Additional comment from RHEL Product and Program Management on 2013-05-08 03:01:49 EDT ---

This bug report previously had all acks and release flag approved.
However since at least one of its acks has been changed, the
release flag has been reset to ? by the bugbot (pm-rhel).  The
ack needs to become approved before the release flag can become
approved again.

Comment 1 Pranith Kumar K 2013-05-10 06:49:55 UTC
If the following test script is executed, we can see that the fuse mount crashes:
#!/bin/bash

. $(dirname $0)/../include.rc
. $(dirname $0)/../volume.rc

#This test tests that an extra fd_unref does not happen in rebalance
#migration completion check code path in dht

cleanup;

TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 $H0:$B0/${V0}0 $H0:$B0/${V0}1
TEST $CLI volume set $V0 performance.quick-read off
TEST $CLI volume set $V0 performance.io-cache off
TEST $CLI volume set $V0 performance.write-behind off
TEST $CLI volume set $V0 performance.stat-prefetch off
TEST $CLI volume set $V0 performance.read-ahead off
TEST $CLI volume start $V0
TEST glusterfs --volfile-id=/$V0 --volfile-server=$H0 $M0 --attribute-timeout=0 --entry-timeout=0
TEST touch $M0/1
#This rename creates a link file for 10 in the other volume.
TEST mv $M0/1 $M0/10
#Let's keep writing to the file which will trigger rebalance completion check
dd if=/dev/zero of=$M0/10 bs=1k &
bg_pid=$!
#Now rebalance force will migrate file '10'
TEST $CLI volume rebalance $V0 start force
EXPECT_WITHIN 60 "completed" rebalance_status_field $V0
#If the bug exists mount would have crashed by now
TEST ls $M0
kill -9 $bg_pid > /dev/null 2>&1
wait > /dev/null 2>&1
cleanup

Comment 2 Anand Avati 2013-05-10 06:58:07 UTC
REVIEW: http://review.gluster.org/4974 (cluster/dht: Don't do extra unref in dht-migration checks) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 3 Anand Avati 2013-05-15 06:41:40 UTC
REVIEW: http://review.gluster.org/4974 (cluster/dht: Don't do extra unref in dht-migration checks) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 4 Anand Avati 2013-05-17 03:47:33 UTC
COMMIT: http://review.gluster.org/4974 committed in master by Vijay Bellur (vbellur) 
------
commit 2acc8c1f04d0376bedf36f1f746e542422344fa6
Author: Pranith Kumar K <pkarampu>
Date:   Fri May 10 10:09:13 2013 +0530

    cluster/dht: Don't do extra unref in dht-migration checks
    
    Problem:
    syncop_open used to perform a ref in syncop_open_cbk so the extra
    unref was needed but now syncop_open_cbk does not take a ref so no
    need to do extra unref.
    
    Fix:
    remove the extra fd_unref and let dht_local_wipe do the final unref.
    
    Change-Id: Ibe8f9a678d456a0c7bff175306068b5cd297ecc4
    BUG: 961615
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/4974
    Reviewed-by: Raghavendra Bhat <raghavendra>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Reviewed-by: Amar Tumballi <amarts>
    Tested-by: Gluster Build System <jenkins.com>
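
For illustration, a minimal, self-contained model of the refcount bookkeeping described above (not the actual dht source; the helper names are stand-ins for syncop_open, the dht migration-completion check and dht_local_wipe): since syncop_open_cbk no longer takes a reference, the unref that used to balance it drops a reference the fuse layer still owns, and the fix leaves the final release to the local-wipe path.

#include <assert.h>
#include <stdio.h>

typedef struct { int refcount; } fd_t;

static void fd_ref(fd_t *fd)   { fd->refcount++; }
static void fd_unref(fd_t *fd) { fd->refcount--; }

/* syncop_open() used to take an extra ref in syncop_open_cbk; it no longer
 * does, so nothing here bumps the refcount. */
static void syncop_open_stub(fd_t *fd) { (void)fd; }

/* Old check: kept the unref that once balanced syncop_open_cbk's ref; now it
 * just drops a reference the caller still owns. */
static void migration_complete_check_old(fd_t *fd)
{
    syncop_open_stub(fd);
    fd_unref(fd);                    /* the extra unref removed by the fix */
}

/* Fixed check: no extra unref; the local-wipe path releases dht's ref. */
static void migration_complete_check_new(fd_t *fd)
{
    syncop_open_stub(fd);
}

static void local_wipe(fd_t *fd)     /* dht_local_wipe analogue */
{
    fd_unref(fd);
}

int main(void)
{
    fd_t fd = { .refcount = 1 };     /* reference held by the fuse layer */
    fd_ref(&fd);                     /* dht takes its ref into local->fd */
    migration_complete_check_new(&fd);
    local_wipe(&fd);                 /* dht's ref released here          */
    assert(fd.refcount == 1);        /* fuse's reference survives        */
    printf("refcount after fixed path: %d\n", fd.refcount);

    /* Old path, for contrast: the extra unref leaves fuse's ref gone, so a
     * later fd_ref() would touch a destroyed fd (the crash in this bug). */
    fd_t fd2 = { .refcount = 1 };
    fd_ref(&fd2);
    migration_complete_check_old(&fd2);
    local_wipe(&fd2);
    printf("refcount after buggy path: %d (fuse's ref is gone)\n", fd2.refcount);
    return 0;
}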

Comment 5 Anand Avati 2014-06-10 16:53:10 UTC
REVIEW: http://review.gluster.org/8029 (cluster/dht: Don't do extra unref in dht-migration checks) posted (#1) for review on release-3.4 by Vijay Bellur (vbellur)

Comment 6 Anand Avati 2014-06-18 12:17:57 UTC
COMMIT: http://review.gluster.org/8029 committed in release-3.4 by Kaleb KEITHLEY (kkeithle) 
------
commit 2b789331dc933b186360fc8cbffb06289ee60ee9
Author: Vijay Bellur <vbellur>
Date:   Tue Jun 10 22:21:28 2014 +0530

    cluster/dht: Don't do extra unref in dht-migration checks
    
    Problem:
    syncop_open used to perform a ref in syncop_open_cbk so the extra
    unref was needed but now syncop_open_cbk does not take a ref so no
    need to do extra unref.
    
    Fix:
    remove the extra fd_unref and let dht_local_wipe do the final unref.
    
    Change-Id: Ibe8f9a678d456a0c7bff175306068b5cd297ecc4
    BUG: 961615
    Signed-off-by: Pranith Kumar K <pkarampu>
    Signed-off-by: Vijay Bellur <vbellur>
    Reviewed-on: http://review.gluster.org/8029
    Tested-by: Gluster Build System <jenkins.com>
    Tested-by: Joe Julian <joe>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>

Comment 7 Justin Clift 2014-06-19 13:56:43 UTC
Should this be applied to release-3.5 branch as well?

Only seeing mention of master and release-3.4 branches in this BZ.

Comment 8 Vijay Bellur 2014-06-19 13:59:42 UTC
(In reply to Justin Clift from comment #7)
> Should this be applied to release-3.5 branch as well?
> 
> Only seeing mention of master and release-3.4 branches in this BZ.

Already fixed in 3.5.0.

