Bug 1631418 - glusterd crash in regression build
Summary: glusterd crash in regression build
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.4.z Batch Update 2
Assignee: Sanju
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Depends On: 1627610 1633552
Blocks:
 
Reported: 2018-09-20 14:31 UTC by Sanju
Modified: 2018-12-17 17:07 UTC
CC: 9 users

Fixed In Version: glusterfs-3.12.2-27
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1627610
Environment:
Last Closed: 2018-12-17 17:07:04 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3827 0 None None None 2018-12-17 17:07:16 UTC

Description Sanju 2018-09-20 14:31:57 UTC
+++ This bug was initially created as a clone of Bug #1627610 +++

Description of problem:
glusterd generated a core in a regression build.

Core and build archived at http://builder103.cloud.gluster.org/archived_builds/build-install-regression-test-burn-in-4085.tar.bz2

Output of "t a a bt" (thread apply all backtrace) is:
(gdb) t a a bt

Thread 8 (LWP 17974):
#0  0x00007f50e6ebf56d in nanosleep () from ./lib64/libc.so.6
#1  0x00007f50e6ebf404 in sleep () from ./lib64/libc.so.6
#2  0x00007f50e881a346 in pool_sweeper (arg=0x0)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/mem-pool.c:470
#3  0x00007f50e762fe25 in start_thread () from ./lib64/libpthread.so.0
#4  0x00007f50e6ef8bad in clone () from ./lib64/libc.so.6

Thread 7 (LWP 17975):
#0  0x00007f50e6e76e6d in __GI__IO_default_xsputn () from ./lib64/libc.so.6
#1  0x00007f50e6e74862 in __GI__IO_file_xsputn () from ./lib64/libc.so.6
#2  0x00007f50e6e46ed3 in vfprintf () from ./lib64/libc.so.6
#3  0x00007f50e6e4d297 in fprintf () from ./lib64/libc.so.6
#4  0x00007f50e882b444 in gf_store_save_value (fd=19, key=0x7f50dd365ed0 "brick-id", 
    value=0x92689f "patchy-client-1")
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/store.c:360
#5  0x00007f50dd27b5a6 in glusterd_store_brickinfo_write (fd=19, brickinfo=0x922ca0)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:403
#6  0x00007f50dd27b8b5 in glusterd_store_perform_brick_store (brickinfo=0x922ca0)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:461
#7  0x00007f50dd27bc19 in glusterd_store_brickinfo (volinfo=0x902290, brickinfo=0x922ca0, 
    brick_count=1, vol_fd=16)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:535
#8  0x00007f50dd27edbf in glusterd_store_brickinfos (volinfo=0x902290, vol_fd=16)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1373
#9  0x00007f50dd27fa35 in glusterd_store_perform_volume_store (volinfo=0x902290)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1613
#10 0x00007f50dd2801f9 in glusterd_store_volinfo (volinfo=0x902290, ac=GLUSTERD_VOLINFO_VER_AC_NONE)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1806
#11 0x00007f50dd258a76 in glusterd_restart_bricks (opaque=0x0)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-utils.c:6422
#12 0x00007f50e883111e in synctask_wrap ()
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/syncop.c:375
#13 0x00007f50e6e42030 in ?? () from ./lib64/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 6 (LWP 18004):
#0  0x00007f50e7633995 in pthread_cond_wait@@GLIBC_2.3.2 () from ./lib64/libpthread.so.0
#1  0x00007f50dd2f293b in hooks_worker (args=0x8aa500)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-hooks.c:538
#2  0x00007f50e762fe25 in start_thread () from ./lib64/libpthread.so.0
#3  0x00007f50e6ef8bad in clone () from ./lib64/libc.so.6

Thread 5 (LWP 17976):
#0  0x00007f50e7633d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from ./lib64/libpthread.so.0
#1  0x00007f50e883195d in syncenv_task (proc=0x8a2150)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/syncop.c:605
#2  0x00007f50e8831bf8 in syncenv_processor (thdata=0x8a2150)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/syncop.c:697
#3  0x00007f50e762fe25 in start_thread () from ./lib64/libpthread.so.0
#4  0x00007f50e6ef8bad in clone () from ./lib64/libc.so.6

Thread 4 (LWP 17972):
#0  0x00007f50e7636f3d in nanosleep () from ./lib64/libpthread.so.0
#1  0x00007f50e87f2f75 in gf_timer_proc (data=0x8a0ba0)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/timer.c:202
#2  0x00007f50e762fe25 in start_thread () from ./lib64/libpthread.so.0
#3  0x00007f50e6ef8bad in clone () from ./lib64/libc.so.6

Thread 3 (LWP 17971):
#0  0x00007f50e7630f97 in pthread_join () from ./lib64/libpthread.so.0
#1  0x00007f50e885b443 in event_dispatch_epoll (event_pool=0x898c30)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/event-epoll.c:777
#2  0x00007f50e8818b11 in event_dispatch (event_pool=0x898c30)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/event.c:124
#3  0x000000000040b585 in main (argc=11, argv=0x7fff88c79968)
    at /home/jenkins/root/workspace/regression-test-burn-in/glusterfsd/src/glusterfsd.c:2785

Thread 2 (LWP 17973):
#0  0x00007f50e7637461 in sigwait () from ./lib64/libpthread.so.0
#1  0x000000000040a744 in glusterfs_sigwaiter (arg=0x7fff88c78710)
    at /home/jenkins/root/workspace/regression-test-burn-in/glusterfsd/src/glusterfsd.c:2327
#2  0x00007f50e762fe25 in start_thread () from ./lib64/libpthread.so.0
#3  0x00007f50e6ef8bad in clone () from ./lib64/libc.so.6

Thread 1 (LWP 18005):
#0  0x00007f50e6e30277 in raise () from ./lib64/libc.so.6
#1  0x00007f50e6e31968 in abort () from ./lib64/libc.so.6
#2  0x00007f50e6e29096 in __assert_fail_base () from ./lib64/libc.so.6
#3  0x00007f50e6e29142 in __assert_fail () from ./lib64/libc.so.6
#4  0x00007f50e882b341 in gf_store_save_value (fd=0, key=0x91bff0 "performance.client-io-threads", 
    value=0x8bcc40 "off")
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/store.c:344
#5  0x00007f50dd27ca69 in _storeopts (this=0x8bde78, key=0x91bff0 "performance.client-io-threads", 
    value=0x8f7938, data=0x8f8f50)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:773
#6  0x00007f50e87d43ce in dict_foreach_match (dict=0x8bde78, 
    match=0x7f50e87d425e <dict_match_everything>, match_data=0x0, action=0x7f50dd27c819 <_storeopts>, 
    action_data=0x8f8f50)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/dict.c:1415
#7  0x00007f50e87d42ba in dict_foreach (dict=0x8bde78, fn=0x7f50dd27c819 <_storeopts>, data=0x8f8f50)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/dict.c:1373
#8  0x00007f50dd27e211 in glusterd_store_volinfo_write (fd=8, volinfo=0x902290)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1157
#9  0x00007f50dd27fa19 in glusterd_store_perform_volume_store (volinfo=0x902290)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1609
#10 0x00007f50dd2801f9 in glusterd_store_volinfo (volinfo=0x902290, ac=GLUSTERD_VOLINFO_VER_AC_NONE)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1806
#11 0x00007f50dd346d3d in glusterd_do_volume_quorum_action (this=0x8aa500, volinfo=0x902290, 
    meets_quorum=true)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-server-quorum.c:403
#12 0x00007f50dd346f92 in glusterd_do_quorum_action ()
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-server-quorum.c:459
#13 0x00007f50dd22bd75 in glusterd_friend_sm ()
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-sm.c:1518
#14 0x00007f50dd21bd46 in __glusterd_handle_incoming_friend_req (req=0x7f50cc003868)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-handler.c:2565
#15 0x00007f50dd214393 in glusterd_big_locked_handler (req=0x7f50cc003868, 
    actor_fn=0x7f50dd21bbb7 <__glusterd_handle_incoming_friend_req>)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-handler.c:80
#16 0x00007f50dd21bd77 in glusterd_handle_incoming_friend_req (req=0x7f50cc003868)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-handler.c:2575
#17 0x00007f50e859c878 in rpcsvc_handle_rpc_call (svc=0x8bd100, trans=0x7f50cc000dc0, 
    msg=0x7f50cc003710)
    at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-lib/src/rpcsvc.c:721
#18 0x00007f50e859cbcb in rpcsvc_notify (trans=0x7f50cc000dc0, mydata=0x8bd100, 
    event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f50cc003710)
    at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-lib/src/rpcsvc.c:815
#19 0x00007f50e85a2895 in rpc_transport_notify (this=0x7f50cc000dc0, event=RPC_TRANSPORT_MSG_RECEIVED, 
    data=0x7f50cc003710)
    at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-lib/src/rpc-transport.c:547
#20 0x00007f50da45a3ff in socket_event_poll_in (this=0x7f50cc000dc0, notify_handled=true)
    at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-transport/socket/src/socket.c:2589
#21 0x00007f50da45b3e0 in socket_event_handler (fd=7, idx=4, gen=1, data=0x7f50cc000dc0, poll_in=1, 
    poll_out=0, poll_err=0)
    at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-transport/socket/src/socket.c:3000
#22 0x00007f50e885ae91 in event_dispatch_epoll_handler (event_pool=0x898c30, event=0x7f50d3ffeea0)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/event-epoll.c:609
#23 0x00007f50e885b18a in event_dispatch_epoll_worker (data=0x8fa9a0)
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/event-epoll.c:685
#24 0x00007f50e762fe25 in start_thread () from ./lib64/libpthread.so.0
#25 0x00007f50e6ef8bad in clone () from ./lib64/libc.so.6
(gdb) 

Actual results:


Expected results:


Additional info:

--- Additional comment from Sanju on 2018-09-11 08:36:41 IST ---

Root Cause:
From Thread 7:
#10 0x00007f50dd2801f9 in glusterd_store_volinfo (volinfo=0x902290, ac=GLUSTERD_VOLINFO_VER_AC_NONE)
at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1806

From Thread 1:
#10 0x00007f50dd2801f9 in glusterd_store_volinfo (volinfo=0x902290, ac=GLUSTERD_VOLINFO_VER_AC_NONE)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1806

From the above snippets of the "t a a bt" output, we can see that Thread 7 and Thread 1 are operating on the same volinfo structure (0x902290).

Source code for glusterd_store_volinfo_write:
int32_t
glusterd_store_volinfo_write (int fd, glusterd_volinfo_t *volinfo)
{
        int32_t                         ret = -1;
        gf_store_handle_t              *shandle = NULL;
        GF_ASSERT (fd > 0);
        GF_ASSERT (volinfo);
        GF_ASSERT (volinfo->shandle);

        shandle = volinfo->shandle;
        ret = glusterd_volume_exclude_options_write (fd, volinfo);
        if (ret)
                goto out;

        /* fd is published into the shared handle here ... */
        shandle->fd = fd;
        dict_foreach (volinfo->dict, _storeopts, shandle);

        dict_foreach (volinfo->gsync_slaves, _storeslaves, shandle);
        /* ... and reset to 0 here: any other thread that reaches
         * _storeopts through the same shandle after this point sees
         * fd == 0 and fails the fd > 0 assertion. */
        shandle->fd = 0;
out:
        gf_msg_debug (THIS->name, 0, "Returning %d", ret);
        return ret;
}

At Thread 1,
#8  0x00007f50dd27e211 in glusterd_store_volinfo_write (fd=8, volinfo=0x902290)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1157

glusterd_store_volinfo_write calls _storeopts, which in turn calls gf_store_save_value. _storeopts also has an assertion check for fd > 0. In glusterd_store_volinfo_write the fd value is 8.

#4  0x00007f50e882b341 in gf_store_save_value (fd=0, key=0x91bff0 "performance.client-io-threads", 
    value=0x8bcc40 "off")
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/store.c:344
From the above we can see that the fd value is 0 by the time gf_store_save_value runs, which trips the assertion.

At Thread 7, 
#8  0x00007f50dd27edbf in glusterd_store_brickinfos (volinfo=0x902290, vol_fd=16)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1373
#9  0x00007f50dd27fa35 in glusterd_store_perform_volume_store (volinfo=0x902290)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1613
#10 0x00007f50dd2801f9 in glusterd_store_volinfo (volinfo=0x902290, ac=GLUSTERD_VOLINFO_VER_AC_NONE)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-store.c:1806
#11 0x00007f50dd258a76 in glusterd_restart_bricks (opaque=0x0)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-utils.c:6422
#12 0x00007f50e883111e in synctask_wrap ()
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/syncop.c:375
#13 0x00007f50e6e42030 in ?? () from ./lib64/libc.so.6
#14 0x0000000000000000 in ?? ()

In this stack, glusterd_store_perform_volume_store calls glusterd_store_brickinfos. Before that, glusterd_store_perform_volume_store calls glusterd_store_volinfo_write, which resets shandle->fd to 0 on its way out.

So Thread 7 reset the fd value to 0, whereas Thread 1 expects fd > 0.
This is happening because glusterd_restart_bricks runs in a separate synctask; glusterd_restart_bricks is visible in the Thread 7 backtrace.

A possible solution is to acquire a lock before writing, making the update a critical section. Need to explore the solution further.
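The race described above can be reduced to a minimal sketch (names invented for illustration; this is not the actual glusterd code): the fd is published into a shared handle, used, then reset to 0, and nothing stops a second thread from reading the handle during or after that window.

```c
#include <assert.h>

/* Hypothetical reduced model of the crash. A store handle's fd is
 * shared between threads via a common pointer. */
struct shandle { int fd; };

/* Mirrors gf_store_save_value's precondition: fd must be valid.
 * Thread 1 aborted on exactly this kind of assertion with fd == 0. */
static int save_value(int fd) {
    assert(fd > 0);
    return 0;
}

/* Mirrors glusterd_store_volinfo_write: between the two assignments,
 * every thread holding a pointer to the same handle can observe the
 * intermediate fd values. */
static int volinfo_write(struct shandle *sh, int fd) {
    sh->fd = fd;                 /* window opens: fd published      */
    int ret = save_value(sh->fd);
    sh->fd = 0;                  /* window closes: a second thread  *
                                  * that reads sh->fd from here on  *
                                  * sees 0 and trips the assertion  *
                                  * inside save_value()             */
    return ret;
}
```

Single-threaded, the sequence is harmless; the bug only appears when two store paths (here, the friend-request handler and the restart-bricks synctask) interleave on the same handle.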

--- Additional comment from Sanju on 2018-09-11 08:51:02 IST ---

Link to the regression build in which core generated is:
https://build.gluster.org/job/regression-test-burn-in/4085/

--- Additional comment from Worker Ant on 2018-09-11 14:40:22 IST ---

REVIEW: https://review.gluster.org/21150 (glusterd: acquire write lock to update volinfo structure) posted (#1) for review on master by Sanju Rakonde

--- Additional comment from Worker Ant on 2018-09-18 09:39:37 IST ---

COMMIT: https://review.gluster.org/21150 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterd: acquire lock to update volinfo structure

Problem: With commit cb0339f92, we are using a separate synctask
for restart_bricks. There can be a situation where two threads
are accessing the same volinfo structure at the same time and
updating volinfo structure. This can lead volinfo to have
inconsistent values and assertion failures because of unexpected
values.

Solution: While updating the volinfo structure, acquire a
store_volinfo_lock, and release the lock only when the thread
completed its critical section part.

Fixes: bz#1627610
Signed-off-by: Sanju Rakonde <srakonde>

Change-Id: I545e4e2368e3285d8f7aa28081ff4448abb72f5d
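The committed approach, as described above, can be sketched like this (invented names, simplified structures; a sketch of the pattern, not the actual patch): every writer takes a per-volinfo store lock, so the publish/use/reset sequence on shandle->fd is atomic with respect to other store operations on the same volinfo.

```c
#include <assert.h>
#include <pthread.h>

struct shandle { int fd; };

/* Hypothetical volinfo with the lock the fix introduces. */
struct volinfo {
    struct shandle  handle;
    pthread_mutex_t store_volinfo_lock;
};

static int save_value(int fd) {
    assert(fd > 0);
    return 0;
}

/* With the mutex held for the whole critical section, no other
 * store path can observe handle.fd mid-update (neither the freshly
 * published fd nor the transient 0). */
static int store_volinfo(struct volinfo *v, int fd) {
    int ret;
    pthread_mutex_lock(&v->store_volinfo_lock);
    v->handle.fd = fd;
    ret = save_value(v->handle.fd);
    v->handle.fd = 0;
    pthread_mutex_unlock(&v->store_volinfo_lock);
    return ret;
}
```

Concurrent callers now serialize on the per-volinfo lock instead of racing on the shared handle, which removes the window that produced the fd == 0 assertion failure.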

Comment 9 errata-xmlrpc 2018-12-17 17:07:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3827

