Bug 1231635
| Summary: | glusterd crashed when testing heal full on replaced disks | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | spandura |
| Component: | glusterd | Assignee: | Anand Nekkunti <anekkunt> |
| Status: | CLOSED ERRATA | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.1 | CC: | amukherj, anekkunt, asrivast, atalur, kparthas, nlevinki, nsathyan, rcyriac, sasundar, senaik, spandura, ssampat, vagarwal, vbellur |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | GlusterD | | |
| Fixed In Version: | glusterfs-3.7.1-8 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1232693 1233041 (view as bug list) | Environment: | |
| Last Closed: | 2015-07-29 05:02:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1202842, 1232693, 1233041 | | |
Description
spandura
2015-06-15 06:46:55 UTC
Shwetha spoke to me about this before raising this bug. The sos-reports for this beaker run are missing for some unknown reason, so we did not have enough information to confirm the cause of this bug. Shwetha volunteered to run the test again and check whether it fails again; hopefully we can get the sos-reports if it happens again. Until then we cannot proceed further with the bug. Shwetha, could you provide the steps performed to hit this issue?

RCA: the crash is mainly due to rcu_read_unlock() being called from a different thread, i.e. the RCU read lock is taken in one thread and the unlock is issued from another. To avoid this I sent patch http://review.gluster.org/#/c/10285/14, but a few places were still missed. Upstream mainline patch http://review.gluster.org/#/c/11276/ has been posted for review. (A minimal standalone sketch of this lock/unlock rule is included at the end of this report.)

I have observed this crash in my setup too. I cannot say exactly which steps led to the crash, but I was doing the following operations:

1. On a volume that was mounted on a FUSE client, I tried to disable quota and unmount the volume. The `umount` command was stuck and did not complete for a long time.
2. I tried to stop the volume, but the stop request timed out and failed.
3. After the volume stop, I was unable to see volume status; it failed with "Another transaction in progress".
4. The volume was now in an inconsistent state: it was shown as started in the volume info output, but attempts to stop it failed with a "volume not in started state" message.
5. I removed the volume directory from /var/lib/glusterd/vols and tried to restart glusterd. The restart failed and I checked the logs; glusterd had crashed.

From the glusterd logs:

```
[2015-07-03 07:43:07.474874] E [MSGID: 106069] [glusterd-volgen.c:1021:volgen_write_volfile] 0-management: failed to create volfile /var/lib/glusterd/vols/rep/rep-snapd.vol
[2015-07-03 07:43:07.474925] E [MSGID: 106069] [glusterd-snapd-svc.c:254:glusterd_snapdsvc_start] 0-management: Couldn't create snapd volfile for volume: rep
[2015-07-03 07:43:07.475000] E [MSGID: 106113] [glusterd-snapd-svc.c:341:glusterd_snapdsvc_restart] 0-management: Couldn't start snapd for vol: rep
[2015-07-03 07:43:07.562023] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a1b83030-3890-45fc-9489-0815341722a3
[2015-07-03 07:43:08.153177] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-03 07:43:08.304427] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down

pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-07-03 07:43:08
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7fe6234c8826]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7fe6234e83ef]
/lib64/libc.so.6(+0x38072326a0)[0x7fe621e676a0]
/usr/lib64/liburcu-bp.so.1(rcu_read_unlock_bp+0x16)[0x7fe617925de6]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_friend_update+0x701)[0x7fe617ed0bd1]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fe617eb5a0f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x295)[0x7fe623291ee5]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x7fe623292123]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7fe623293ad8]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xa255)[0x7fe616557255]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xbe4d)[0x7fe616558e4d]
/usr/lib64/libglusterfs.so.0(+0x89970)[0x7fe62352c970]
/lib64/libpthread.so.0(+0x3807607a51)[0x7fe6225b3a51]
/lib64/libc.so.6(clone+0x6d)[0x7fe621f1d96d]
```

The patch is available upstream, so moving the bug to POST.

Version: glusterfs-3.7.1-7.el6rhs.x86_64

Faced a similar glusterd crash while probing another node.

```
[2015-07-06 20:53:06.046749] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 3ed85019-8616-4581-8115-22a2554dea26
[2015-07-06 20:53:06.046775] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-06 20:53:06.048525] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c1cd644-75ea-4d2d-b228-09e25827cd45
[2015-07-06 20:53:06.048562] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend

pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-07-06 20:53:06
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7fd560ce7826]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7fd560d073ef]
/lib64/libc.so.6(+0x3da84326a0)[0x7fd55f6866a0]
/usr/lib64/liburcu-bp.so.1(rcu_read_unlock_bp+0x16)[0x7fd555144de6]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(+0x58a57)[0x7fd5556f4a57]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_friend_sm+0x189)[0x7fd5556f4479]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_cli_probe+0x14f)[0x7fd5556f0a6f]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fd5556d4a6f]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7fd560d2f1f2]
/lib64/libc.so.6(+0x3da84438f0)[0x7fd55f6978f0]
```

gdb backtrace:

```
(gdb) bt
#0  0x00007fd555144de6 in rcu_read_unlock_bp () from /usr/lib64/liburcu-bp.so.1
#1  0x00007fd5556f4a57 in glusterd_ac_send_friend_update (event=<value optimized out>, ctx=<value optimized out>) at glusterd-sm.c:592
#2  0x00007fd5556f4479 in glusterd_friend_sm () at glusterd-sm.c:1257
#3  0x00007fd5556f0a6f in __glusterd_handle_cli_probe (req=0x7fd561eb45cc) at glusterd-handler.c:1220
#4  0x00007fd5556d4a6f in glusterd_big_locked_handler (req=0x7fd561eb45cc, actor_fn=0x7fd5556f0920 <__glusterd_handle_cli_probe>) at glusterd-handler.c:83
#5  0x00007fd560d2f1f2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#6  0x00007fd55f6978f0 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
```

Since this issue has been hit at least 3 times as of now, and the fix is available upstream, proposing this issue as a blocker for RHGS 3.1.

As suggested by Dev (Anand Nekkunti), this is a rarely hit race. The probability of hitting it is expected to increase when there are more activated snapshots and a peer probe is performed. I tried a case where snapshot creation was running in one script while peer probe/deprobe was happening concurrently.
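A rough sketch of what such a concurrent workload can look like is below. This is not the actual QA script: the volume name `testvol`, the peer hostname `server2`, and the iteration counts are assumptions, snapshot creation additionally requires thin-provisioned LVM bricks, and the peer being probed/detached is assumed to hold no bricks of the volume so that detach can succeed.

```bash
#!/bin/bash
# Hypothetical reproduction of the verification workload: snapshot creation
# in the background while peers are probed/detached in a loop, exercising the
# friend-update path seen in the backtraces above.

# Background loop: keep creating snapshots of the (assumed) volume testvol.
(
    for i in $(seq 1 100); do
        gluster snapshot create snap_"$i" testvol
    done
) &

# Foreground loop: concurrently probe and detach an (assumed) brick-less peer.
for i in $(seq 1 50); do
    gluster peer probe server2
    gluster peer detach server2
done

wait
```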
With this workload I created 100+ snapshots along with peer probing/deprobing, and no glusterd crash was found. I also tried the disk-replacement procedure and triggered 'heal full' on the replaced disks; that works well and there are no issues. Marking this bug as VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
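As background to the RCA noted earlier (the RCU read lock taken in one thread and rcu_read_unlock() called from another), here is a minimal standalone sketch of the read-side rule involved. It is illustrative only and is not glusterd code; it assumes liburcu-bp is installed and is built with something like `gcc rcu_sketch.c -o rcu_sketch -lurcu-bp -lpthread`.

```c
/*
 * Illustrative only -- not glusterd code. With liburcu-bp, the unprefixed
 * rcu_read_lock()/rcu_read_unlock() map to the _bp variants seen in the
 * backtraces (rcu_read_unlock_bp). The bulletproof flavor registers reader
 * threads lazily, so no explicit rcu_register_thread() call is needed.
 */
#include <urcu-bp.h>
#include <pthread.h>
#include <stdio.h>

static int shared_value = 42;   /* stand-in for RCU-protected data */

static void *reader(void *arg)
{
        (void)arg;

        rcu_read_lock();                  /* correct: lock ...              */
        printf("read %d\n", shared_value);
        rcu_read_unlock();                /* ... and unlock in the SAME thread */

        /*
         * The crash in this bug corresponds to the broken pattern: thread A
         * calls rcu_read_lock() and a different thread later issues the
         * matching rcu_read_unlock(). liburcu keeps the read-side nesting
         * state per thread, so the cross-thread unlock manipulates the wrong
         * (possibly unregistered) reader state and can segfault inside
         * rcu_read_unlock_bp(), which is what the backtraces above show.
         */
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, reader, NULL);
        pthread_join(t, NULL);
        return 0;
}
```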