Bug 1603063 - ./tests/bugs/glusterd/validating-server-quorum.t generated a core
Summary: ./tests/bugs/glusterd/validating-server-quorum.t generated a core
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-19 05:10 UTC by Mohit Agrawal
Modified: 2018-10-23 15:14 UTC
CC List: 2 users

Fixed In Version: glusterfs-5.0
Clone Of:
Environment:
Last Closed: 2018-10-23 15:14:52 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Mohit Agrawal 2018-07-19 05:10:49 UTC
Description of problem:
The regression test ./tests/bugs/glusterd/validating-server-quorum.t generated a glusterd core dump.

Version-Release number of selected component (if applicable):


How reproducible:
Run ./tests/bugs/glusterd/validating-server-quorum.t. For details, refer to
https://build.gluster.org/job/centos7-regression/1804/console

Steps to Reproduce:
1. Run ./tests/bugs/glusterd/validating-server-quorum.t

Actual results:
glusterd crashes and dumps core while ./tests/bugs/glusterd/validating-server-quorum.t is running.

Expected results:
Test case ./tests/bugs/glusterd/validating-server-quorum.t should pass without crashing glusterd.

Additional info:

Comment 1 Atin Mukherjee 2018-07-23 03:10:29 UTC
I had a look at https://build.gluster.org/job/regression-test-burn-in/4044/consoleFull which was a recent report of the same crash. Apparently, during replace brick, the glusterd process crashed due to a NULL dst_brickinfo while it was trying to resolve that brick through glusterd_resolve_brick().
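
For context, frames #0-#4 in the trace below are the signature of a failed assert(): glusterd_resolve_brick() was invoked with brickinfo=0x0 and its non-NULL check aborted the process. A minimal, compilable sketch of that pattern (resolve_brick and brickinfo_t are hypothetical stand-ins, not the actual glusterd source):

/* Minimal illustration of an assert-driven abort like frames #0-#4. */
#include <assert.h>
#include <stddef.h>

typedef struct brickinfo {
    char hostname[1025];
    char path[4096];
} brickinfo_t;

static void resolve_brick(brickinfo_t *brickinfo)
{
    /* A NULL argument trips the assert: __assert_fail() -> abort()
     * -> raise(), exactly the top of the backtrace. */
    assert(brickinfo != NULL);
    /* ... resolve brickinfo->hostname to a peer ... */
}

int main(void)
{
    resolve_brick(NULL); /* reproduces the abort pattern */
    return 0;
}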

(gdb) bt
#0  0x00007f523fcf4277 in raise () from ./lib64/libc.so.6
#1  0x00007f523fcf5968 in abort () from ./lib64/libc.so.6
#2  0x00007f523fced096 in __assert_fail_base () from ./lib64/libc.so.6
#3  0x00007f523fced142 in __assert_fail () from ./lib64/libc.so.6
#4  0x00007f523610bb3e in glusterd_resolve_brick (brickinfo=0x0)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-utils.c:1134
#5  0x00007f5236190c79 in glusterd_op_replace_brick (dict=0x7f5228000e78, rsp_dict=0x7f5228015a28)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-replace-brick.c:521
#6  0x00007f52361e58cb in gd_mgmt_v3_commit_fn (op=GD_OP_REPLACE_BRICK, dict=0x7f5228000e78, 
    op_errstr=0x7f5224239338, op_errno=0x7f522423932c, rsp_dict=0x7f5228015a28)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-mgmt.c:310
#7  0x00007f52361e30c9 in glusterd_handle_commit_fn (req=0x7f52240041e8)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-mgmt-handler.c:609
#8  0x00007f52360d923b in glusterd_big_locked_handler (req=0x7f52240041e8, 
    actor_fn=0x7f52361e2dee <glusterd_handle_commit_fn>)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-handler.c:80
#9  0x00007f52361e4178 in glusterd_handle_commit (req=0x7f52240041e8)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-mgmt-handler.c:993
#10 0x00007f52416f2ee4 in synctask_wrap ()
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/syncop.c:375
#11 0x00007f523fd06030 in ?? () from ./lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
(gdb) p src_brickinfo
No symbol "src_brickinfo" in current context.
(gdb) p dst_brickinfo
No symbol "dst_brickinfo" in current context.
(gdb) f 5
#5  0x00007f5236190c79 in glusterd_op_replace_brick (dict=0x7f5228000e78, rsp_dict=0x7f5228015a28)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-replace-brick.c:521
521	/home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-replace-brick.c: No such file or directory.
(gdb) p src_brickinfo
$1 = (glusterd_brickinfo_t *) 0x7f521c02ae70
(gdb) p *src_brickinfo
$2 = {hostname = "127.1.1.2", '\000' <repeats 1014 times>, 
  path = "/d/backends/2/patchy2", '\000' <repeats 4074 times>, 
  real_path = "/d/backends/2/patchy2", '\000' <repeats 4074 times>, 
  device_path = '\000' <repeats 4095 times>, 
  mount_dir = "/backends/2/patchy2", '\000' <repeats 4076 times>, 
  brick_id = "patchy-client-1", '\000' <repeats 1008 times>, fstype = '\000' <repeats 254 times>, 
  mnt_opts = '\000' <repeats 1023 times>, brick_list = {next = 0x7f521c035970, prev = 0x7f521c029d70}, 
  uuid = "\311\333d\361\323\313Fm\244c\177\321\030Ld\t", port = 0, rdma_port = 0, logfile = 0x0, 
  shandle = 0x7f521c03d6a0, status = GF_BRICK_STOPPED, rpc = 0x0, decommissioned = 0, 
  vg = '\000' <repeats 4095 times>, caps = 0, snap_status = 0, group = 0, 
  jbr_uuid = '\000' <repeats 15 times>, statfs_fsid = 0, fs_share_count = 0, port_registered = false, 
  start_triggered = false, restart_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, 
      __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 39 times>, __align = 0}}
(gdb) p dst_brickinfo
$3 = (glusterd_brickinfo_t *) 0x0

The only way dst_brickinfo could have been overwritten here is if a friend import was happening at the same time through a separate synctask, and that is indeed the case here, as thread 9 shows.

Thread 9 (LWP 23474):
#0  0x00007f523fd8356d in nanosleep () from ./lib64/libc.so.6
#1  0x00007f523fd83404 in sleep () from ./lib64/libc.so.6
#2  0x00007f523620113e in glusterd_proc_stop (proc=0x7f5241a360d8, sig=15, flags=4)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-proc-mgmt.c:114
#3  0x00007f52362023f0 in glusterd_svc_stop (svc=0x7f5241a340c0, sig=15)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-svc-mgmt.c:239
#4  0x00007f52362034c8 in glusterd_shdsvc_manager (svc=0x7f5241a340c0, data=0x0, flags=2)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-shd-svc.c:126
#5  0x00007f5236205da3 in glusterd_svcs_manager (volinfo=0x0)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-svc-helper.c:126
#6  0x00007f5236118b7a in glusterd_import_friend_volumes_synctask (opaque=0x7f5224003548)
    at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-utils.c:4817
#7  0x00007f52416f2ee4 in synctask_wrap ()
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/syncop.c:375
#8  0x00007f523fd06030 in ?? () from ./lib64/libc.so.6
#9  0x0000000000000000 in ?? ()

This seems to be a race; however, I'm still not sure why we are hitting it so frequently now.
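
To make the suspected interleaving concrete, here is a minimal sketch of the race shape (hypothetical names and deliberately simplified state; not glusterd code). Thread A plays the replace-brick commit path and thread B the friend-import synctask that momentarily leaves the shared brick pointer NULL:

/* Build with: cc -pthread race_sketch.c */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct brickinfo { char path[64]; } brickinfo_t;

/* Shared pointer, standing in for the volinfo's dst brick entry. */
static brickinfo_t *dst_brickinfo;

/* Thread A: the commit path; looks up the brick and asserts it exists. */
static void *commit_replace_brick(void *arg)
{
    usleep(1000);                    /* let the import win the race */
    brickinfo_t *b = dst_brickinfo;  /* may observe NULL */
    assert(b != NULL);               /* the crash seen in frame #4 */
    return NULL;
}

/* Thread B: the friend import; rebuilds volume state, leaving the
 * shared pointer NULL for a moment. */
static void *import_friend_volumes(void *arg)
{
    brickinfo_t *old = dst_brickinfo;
    dst_brickinfo = NULL;            /* the dangerous window */
    free(old);
    usleep(5000);
    dst_brickinfo = calloc(1, sizeof(*dst_brickinfo));
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    dst_brickinfo = calloc(1, sizeof(*dst_brickinfo));
    pthread_create(&b, NULL, import_friend_volumes, NULL);
    pthread_create(&a, NULL, commit_replace_brick, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}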

Comment 2 Worker Ant 2018-07-28 05:46:08 UTC
REVIEW: https://review.gluster.org/20584 (glusterd: block operations when volume importing is in progress) posted (#1) for review on master by Atin Mukherjee

Comment 3 Worker Ant 2018-08-10 09:06:50 UTC
REVIEW: https://review.gluster.org/20693 (glusterd: compare friend data within mutex) posted (#4) for review on master by Atin Mukherjee

Comment 4 Worker Ant 2018-08-10 09:07:49 UTC
REVIEW: https://review.gluster.org/20658 (tests: fix brick check ordering in validating-server-quorum.t) posted (#2) for review on master by Atin Mukherjee

Comment 5 Worker Ant 2018-08-13 03:02:12 UTC
COMMIT: https://review.gluster.org/20693 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterd: compare friend data within mutex

During the friend handshake, if glusterd receives more than one friend
update, it is quite possible that two threads end up working on two
different volinfo references, and glusterd ends up updating the store
with an old volinfo reference. The same was observed while debugging a
glusterd crash from the validating-server-quorum.t test file in the
line-coverage regression.

The solution is to run glusterd_compare_friend_data under a mutex.

Test:

As the crash was more visible in the line-coverage run (lcov does some
instrumentation and exposes the races), 6 manual lcov runs were
triggered, starting from https://build.gluster.org/job/line-coverage/443
to https://build.gluster.org/job/line-coverage/449/, and no crash was
observed in validating-server-quorum.t.

Change-Id: I86fce473a76fd24742d51bf17a685d28b90a8941
Fixes: bz#1603063
Signed-off-by: Atin Mukherjee <amukherj>
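
A minimal sketch of the serialization the commit message above describes (hypothetical names; the real change guards glusterd_compare_friend_data, per the review linked in comment 3):

/* Serialize friend-data comparison/import so concurrent handshake
 * threads cannot interleave on the same volinfo. */
#include <pthread.h>

static pthread_mutex_t import_mutex = PTHREAD_MUTEX_INITIALIZER;

static int compare_friend_data(void *peer_data)
{
    int ret;

    (void)peer_data;
    pthread_mutex_lock(&import_mutex);
    {
        /* Compare the peer's volumes against ours and import any
         * updates; no other handshake thread can rewrite the same
         * volinfo while we hold the lock. */
        ret = 0; /* placeholder for the real comparison */
    }
    pthread_mutex_unlock(&import_mutex);

    return ret;
}

int main(void)
{
    return compare_friend_data(0);
}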

Comment 6 Worker Ant 2018-08-13 13:44:21 UTC
COMMIT: https://review.gluster.org/20658 committed in master by "Shyamsundar Ranganathan" <srangana> with a commit message- tests: fix brick check orders

Fix the brick checks for validating-server-quorum.t & quorum-validation.t,
and make the brick_up_status_1 function more generic.

Also fix a timing issue in
bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t

Change-Id: I797ef4cec5b160aafa979bae7151b1e99fcb48ac
Updates: bz#1603063
Signed-off-by: Atin Mukherjee <amukherj>

Comment 7 Shyamsundar 2018-10-23 15:14:52 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

