Bug 960368
| Summary: | Hypervisor mount crashed after rebalance | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | shylesh <shmohan> |
| Component: | glusterfs | Assignee: | Pranith Kumar K <pkarampu> |
| Status: | CLOSED ERRATA | QA Contact: | shylesh <shmohan> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | unspecified | CC: | amarts, grajaiya, pkarampu, rhs-bugs, sdharane, sgowda, surs, vbellur |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.4.0.6rhs-1 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | virt rhev integration |
| : | 961615 (view as bug list) | | |
| Last Closed: | 2013-09-23 22:35:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 923534, 961615 | | |
Description
shylesh
2013-05-07 06:25:44 UTC
Please provide access to the host where the crash was seen.

Looks like a flush call has come in on a freed fd:
(gdb) bt
#0 0x000000362780c170 in pthread_spin_lock () from /lib64/libpthread.so.0
#1 0x00007fa3304de4c3 in fd_ref (fd=0x2d54124) at fd.c:447
#2 0x00007fa32e8afd4e in fuse_resolve_fd (state=<value optimized out>) at fuse-resolve.c:461
#3 fuse_resolve (state=<value optimized out>) at fuse-resolve.c:622
#4 0x00007fa32e8b026e in fuse_resolve_all (state=<value optimized out>) at fuse-resolve.c:665
#5 0x00007fa32e8b02b8 in fuse_resolve_and_resume (state=0x7fa31c038cb0, fn=0x7fa32e8b87a0 <fuse_flush_resume>) at fuse-resolve.c:705
#6 0x00007fa32e8c66f8 in fuse_thread_proc (data=0x100fb70) at fuse-bridge.c:4562
#7 0x0000003627807851 in start_thread () from /lib64/libpthread.so.0
#8 0x00000036274e890d in clone () from /lib64/libc.so.6
(gdb) f 1
#1 0x00007fa3304de4c3 in fd_ref (fd=0x2d54124) at fd.c:447
447 LOCK (&fd->inode->lock);
(gdb) p fd->inode
$12 = (struct _inode *) 0xaaaaaaaa
(gdb) p *fd
$13 = {pid = 0, flags = 0, refcount = 0, inode_list = {next = 0x2d54134, prev = 0x2d54134}, inode = 0xaaaaaaaa, lock = 1, _ctx = 0x7fa31c03a090, xl_count = 32, lk_ctx = 0x7fa31c012230, anonymous = _gf_true} <=====ref count is zero
(gdb) f 5
#5 0x00007fa32e8b02b8 in fuse_resolve_and_resume (state=0x7fa31c038cb0, fn=0x7fa32e8b87a0 <fuse_flush_resume>) at fuse-resolve.c:705
705 fuse_resolve_all (state);
(gdb) p activefd
$15 = (fd_t *) 0x2d54124
(gdb) p *basefd
$17 = {pid = 8614, flags = 573442, refcount = 4, inode_list = {next = 0x7fa31b48f370, prev = 0x7fa31b48f370}, inode = 0x7fa31b48f338, lock = 0, _ctx = 0x7fa31c029d70, xl_count = 26, lk_ctx = 0x7fa31c00a5b0,
anonymous = _gf_false}
(gdb) p *basefd->inode
$18 = {table = 0x1482930, gfid = "K\362\247w\304\030G>\226\222\343q\343\270:\t", lock = 1, nlookup = 51, fd_count = 4294967247, ref = 1, ia_type = IA_IFREG, fd_list = {next = 0x1482d40, prev = 0x1482d40},
dentry_list = {next = 0x7fa31b1ee3c8, prev = 0x7fa31b1ee3c8}, hash = {next = 0x7fa31b1270a0, prev = 0x7fa31b1270a0}, list = {next = 0x7fa31b48f91c, prev = 0x7fa31b48f7e4}, _ctx = 0x7fa31c036100}
(gdb) p *(xlator_t *)fd->_ctx[2].xl_key
$30 = {name = 0x14a3790 "vstore", type = 0x14a37b0 "debug/io-stats", next = 0x14dc3a0, prev = 0x0, parents = 0x0, children = 0x14a6570, options = 0x7fa32ecbc8a0, dlhandle = 0x1041c30, fops = 0x7fa32b17a5c0,
cbks = 0x7fa32b17a860, dumpops = 0x0, volume_options = {next = 0x14a36c0, prev = 0x14a36c0}, fini = 0x7fa32af67f40 <fini>, init = 0x7fa32af683a0 <init>, reconfigure = 0x7fa32af68690 <reconfigure>,
mem_acct_init = 0x7fa32af68620 <mem_acct_init>, notify = 0x7fa32af70da0 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 46 times>},
history = 0x0, ctx = 0xff3010, graph = 0x14a5550, itable = 0x2c1d030, init_succeeded = 1 '\001', private = 0x14dd6e0, mem_acct = {num_types = 0, rec = 0x0}, winds = 1, switched = 0 '\000', local_pool = 0x0,
is_autoloaded = _gf_false}
(gdb) p *(xlator_t *)fd->_ctx[1].xl_key
$31 = {name = 0x149f340 "vstore-open-behind", type = 0x14a3770 "performance/open-behind", next = 0x14dba00, prev = 0x14dcd40, parents = 0x14a64d0, children = 0x14a3700, options = 0x7fa32ecbc814,
dlhandle = 0x1040bc0, fops = 0x7fa32b3824c0, cbks = 0x7fa32b382760, dumpops = 0x7fa32b382780, volume_options = {next = 0x14a36e0, prev = 0x14a36e0}, fini = 0x7fa32b17e650 <fini>,
init = 0x7fa32b17e690 <init>, reconfigure = 0x7fa32b17e7e0 <reconfigure>, mem_acct_init = 0x7fa32b17e860 <mem_acct_init>, notify = 0x7fa3304c1190 <default_notify>, loglevel = GF_LOG_NONE, latencies = {{
min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 46 times>}, history = 0x0, ctx = 0xff3010, graph = 0x14a5550, itable = 0x0, init_succeeded = 1 '\001', private = 0x149f370, mem_acct = {
num_types = 0, rec = 0x0}, winds = 0, switched = 0 '\000', local_pool = 0x0, is_autoloaded = _gf_false}
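For reference, here is a minimal standalone sketch (plain C, not glusterfs code; my_fd, my_fd_ref and my_fd_unref are illustrative names only) of the failure mode the dump above suggests: an unmatched unref drops the count to zero, the object is poisoned and freed, and a later ref dereferences the freed memory, which is what fuse_flush hit inside fd_ref.

#include <stdint.h>
#include <stdlib.h>

struct my_inode { int lock; };

struct my_fd {
        int              refcount;
        struct my_inode *inode;
};

static struct my_fd *
my_fd_ref (struct my_fd *fd)
{
        /* The real fd_ref first takes LOCK (&fd->inode->lock); on a freed,
         * poisoned fd that is the pthread_spin_lock() crash in frame #0. */
        fd->refcount++;
        return fd;
}

static void
my_fd_unref (struct my_fd *fd)
{
        if (--fd->refcount == 0) {
                /* Poison on destroy, the way 0xaaaaaaaa showed up in gdb. */
                fd->inode = (struct my_inode *) (uintptr_t) 0xaaaaaaaa;
                free (fd);
        }
}

int
main (void)
{
        struct my_inode *inode = calloc (1, sizeof (*inode));
        struct my_fd    *fd    = calloc (1, sizeof (*fd));

        fd->inode    = inode;
        fd->refcount = 2;       /* one ref held by fuse, one by an xlator */

        my_fd_unref (fd);       /* the xlator drops its ref: fine         */
        my_fd_unref (fd);       /* an unmatched extra unref frees the fd  */
        my_fd_ref (fd);         /* fuse flush arrives: use-after-free     */

        free (inode);
        return 0;
}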
Looks like an open-behind xlator issue.

I looked at the fd_unrefs present in the different xlators in the mount graph and saw one unmatched fd_unref in dht. I am able to recreate a crash at a different place, again in fd_ref but in the fuse_writev code path, using the following script:
#!/bin/bash
. $(dirname $0)/../include.rc
. $(dirname $0)/../volume.rc
#This test checks that an extra fd_unref does not happen in the dht
#rebalance-migration completion check code path
cleanup;
TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 $H0:$B0/${V0}0 $H0:$B0/${V0}1
TEST $CLI volume set $V0 performance.quick-read off
TEST $CLI volume set $V0 performance.io-cache off
TEST $CLI volume set $V0 performance.write-behind off
TEST $CLI volume set $V0 performance.stat-prefetch off
TEST $CLI volume set $V0 performance.read-ahead off
TEST $CLI volume start $V0
TEST glusterfs --volfile-id=/$V0 --volfile-server=$H0 $M0 --attribute-timeout=0 --entry-timeout=0
TEST touch $M0/1
#This rename creates a link file for 10 on the other subvolume.
TEST mv $M0/1 $M0/10
#Let's keep writing to the file, which will trigger the migration completion check
dd if=/dev/zero of=$M0/10 bs=1k &
bg_pid=$!
#Now rebalance force will migrate file '10'
TEST $CLI volume rebalance $V0 start force
EXPECT_WITHIN 60 "completed" rebalance_status_field $V0
#If the bug exists, the mount would have crashed by now
TEST ls $M0
kill -9 $bg_pid > /dev/null 2>&1
wait > /dev/null 2>&1
cleanup;
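This is not the actual dht patch, but a minimal sketch of the discipline that avoids this class of bug: an asynchronous check path takes its own reference before doing its work and releases exactly that reference, so it can never drop one owned by the caller. All names below (my_fd, migration_complete_check) are hypothetical.

#include <stdlib.h>

struct my_fd { int refcount; };

static struct my_fd *
my_fd_ref (struct my_fd *fd)
{
        fd->refcount++;
        return fd;
}

static void
my_fd_unref (struct my_fd *fd)
{
        if (--fd->refcount == 0)
                free (fd);
}

/* Hypothetical stand-in for a migration-completion check: it pins the fd
 * for the duration of its own work and drops only that pin. */
static void
migration_complete_check (struct my_fd *fd)
{
        struct my_fd *pinned = my_fd_ref (fd);   /* take our own reference */

        /* ... re-open the file on the destination subvolume, update the
         * fd context, etc. ... */

        my_fd_unref (pinned);                    /* release only our own   */
}

int
main (void)
{
        struct my_fd *fd = calloc (1, sizeof (*fd));

        fd->refcount = 1;                /* the caller's reference         */
        migration_complete_check (fd);   /* balanced: count is still 1     */
        my_fd_unref (fd);                /* the caller drops its reference */

        return 0;
}

With the unrefs balanced this way, the fuse-side reference stays valid until fuse itself releases the fd.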
Verified on 3.4.0.8rhs-1.el6rhs.x86_64.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html