Description of problem:
./tests/bugs/glusterd/validating-server-quorum.t crashes in regression runs.

Version-Release number of selected component (if applicable):

How reproducible:
Run ./tests/bugs/glusterd/validating-server-quorum.t. For more details please refer to https://build.gluster.org/job/centos7-regression/1804/console

Steps to Reproduce:
1.
2.
3.

Actual results:
./tests/bugs/glusterd/validating-server-quorum.t crashes.

Expected results:
Test case ./tests/bugs/glusterd/validating-server-quorum.t should not crash.

Additional info:
I had a look at https://build.gluster.org/job/regression-test-burn-in/4044/consoleFull which was a recent report of the same crash. Apparently, during replace brick, the glusterd process crashed due to a NULL dst_brickinfo while it was trying to resolve this brick through glusterd_resolve_brick ().

(gdb) bt
#0  0x00007f523fcf4277 in raise () from ./lib64/libc.so.6
#1  0x00007f523fcf5968 in abort () from ./lib64/libc.so.6
#2  0x00007f523fced096 in __assert_fail_base () from ./lib64/libc.so.6
#3  0x00007f523fced142 in __assert_fail () from ./lib64/libc.so.6
#4  0x00007f523610bb3e in glusterd_resolve_brick (brickinfo=0x0) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-utils.c:1134
#5  0x00007f5236190c79 in glusterd_op_replace_brick (dict=0x7f5228000e78, rsp_dict=0x7f5228015a28) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-replace-brick.c:521
#6  0x00007f52361e58cb in gd_mgmt_v3_commit_fn (op=GD_OP_REPLACE_BRICK, dict=0x7f5228000e78, op_errstr=0x7f5224239338, op_errno=0x7f522423932c, rsp_dict=0x7f5228015a28) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-mgmt.c:310
#7  0x00007f52361e30c9 in glusterd_handle_commit_fn (req=0x7f52240041e8) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-mgmt-handler.c:609
#8  0x00007f52360d923b in glusterd_big_locked_handler (req=0x7f52240041e8, actor_fn=0x7f52361e2dee <glusterd_handle_commit_fn>) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-handler.c:80
#9  0x00007f52361e4178 in glusterd_handle_commit (req=0x7f52240041e8) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-mgmt-handler.c:993
#10 0x00007f52416f2ee4 in synctask_wrap () at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/syncop.c:375
#11 0x00007f523fd06030 in ?? () from ./lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
(gdb) p src_brickinfo
No symbol "src_brickinfo" in current context.
(gdb) p dst_brickinfo
No symbol "dst_brickinfo" in current context.
(gdb) f 5
#5  0x00007f5236190c79 in glusterd_op_replace_brick (dict=0x7f5228000e78, rsp_dict=0x7f5228015a28) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-replace-brick.c:521
521     /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-replace-brick.c: No such file or directory.
(gdb) p src_brickinfo
$1 = (glusterd_brickinfo_t *) 0x7f521c02ae70
(gdb) p *src_brickinfo
$2 = {hostname = "127.1.1.2", '\000' <repeats 1014 times>, path = "/d/backends/2/patchy2", '\000' <repeats 4074 times>, real_path = "/d/backends/2/patchy2", '\000' <repeats 4074 times>, device_path = '\000' <repeats 4095 times>, mount_dir = "/backends/2/patchy2", '\000' <repeats 4076 times>, brick_id = "patchy-client-1", '\000' <repeats 1008 times>, fstype = '\000' <repeats 254 times>, mnt_opts = '\000' <repeats 1023 times>, brick_list = {next = 0x7f521c035970, prev = 0x7f521c029d70}, uuid = "\311\333d\361\323\313Fm\244c\177\321\030Ld\t", port = 0, rdma_port = 0, logfile = 0x0, shandle = 0x7f521c03d6a0, status = GF_BRICK_STOPPED, rpc = 0x0, decommissioned = 0, vg = '\000' <repeats 4095 times>, caps = 0, snap_status = 0, group = 0, jbr_uuid = '\000' <repeats 15 times>, statfs_fsid = 0, fs_share_count = 0, port_registered = false, start_triggered = false, restart_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}
(gdb) p dst_brickinfo
$3 = (glusterd_brickinfo_t *) 0x0

The only way dst_brickinfo could get overwritten here is if a friend import is happening at the same time through a separate synctask, which is the case here in thread 9.

Thread 9 (LWP 23474):
#0  0x00007f523fd8356d in nanosleep () from ./lib64/libc.so.6
#1  0x00007f523fd83404 in sleep () from ./lib64/libc.so.6
#2  0x00007f523620113e in glusterd_proc_stop (proc=0x7f5241a360d8, sig=15, flags=4) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-proc-mgmt.c:114
#3  0x00007f52362023f0 in glusterd_svc_stop (svc=0x7f5241a340c0, sig=15) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-svc-mgmt.c:239
#4  0x00007f52362034c8 in glusterd_shdsvc_manager (svc=0x7f5241a340c0, data=0x0, flags=2) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-shd-svc.c:126
#5  0x00007f5236205da3 in glusterd_svcs_manager (volinfo=0x0) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-svc-helper.c:126
#6  0x00007f5236118b7a in glusterd_import_friend_volumes_synctask (opaque=0x7f5224003548) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/mgmt/glusterd/src/glusterd-utils.c:4817
#7  0x00007f52416f2ee4 in synctask_wrap () at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/syncop.c:375
#8  0x00007f523fd06030 in ?? () from ./lib64/libc.so.6
#9  0x0000000000000000 in ?? ()

This seems to be a race; however, I'm still not sure why we are hitting it so frequently now.
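To make the suspected interleaving easier to picture, here is a minimal, self-contained C model of the race described above (not the real glusterd code): one thread plays the commit path that dereferences a brick pointer taken from the shared volinfo, while another plays the friend-import synctask that republishes the volinfo. All structure and function names in it are illustrative stand-ins.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct brickinfo { char path[64]; };
struct volinfo   { struct brickinfo *bricks[2]; };

static struct volinfo *volume;   /* shared pointer, deliberately unprotected */

static void *commit_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        struct volinfo *v = volume;          /* snapshot of the shared pointer */
        struct brickinfo *dst = v->bricks[1];
        if (dst == NULL) {                   /* may observe a half-built volinfo */
            fprintf(stderr, "dst brickinfo is NULL - the real code would assert here\n");
            exit(1);
        }
    }
    return NULL;
}

static void *import_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        struct volinfo *v = calloc(1, sizeof(*v));
        v->bricks[0] = calloc(1, sizeof(struct brickinfo));
        volume = v;                          /* published before bricks[1] is set */
        v->bricks[1] = calloc(1, sizeof(struct brickinfo));
        /* the old volinfo is leaked on purpose: freeing it here would turn
         * the race into a use-after-free instead of a NULL dereference */
    }
    return NULL;
}

int main(void)
{
    volume = calloc(1, sizeof(*volume));
    volume->bricks[0] = calloc(1, sizeof(struct brickinfo));
    volume->bricks[1] = calloc(1, sizeof(struct brickinfo));

    pthread_t t1, t2;
    pthread_create(&t1, NULL, commit_thread, NULL);
    pthread_create(&t2, NULL, import_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("no race observed in this run");
    return 0;
}

Compiled with cc -pthread, the commit thread may, depending on scheduling and compiler ordering, observe the half-published volinfo and bail out on the NULL second brick, which mirrors the NULL dst_brickinfo that the assert caught above.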
REVIEW: https://review.gluster.org/20584 (glusterd: block operations when volume importing is in progress) posted (#1) for review on master by Atin Mukherjee
REVIEW: https://review.gluster.org/20693 (glusterd: compare friend data within mutex) posted (#4) for review on master by Atin Mukherjee
REVIEW: https://review.gluster.org/20658 (tests: fix brick check ordering in validating-server-quorum.t) posted (#2) for review on master by Atin Mukherjee
COMMIT: https://review.gluster.org/20693 committed in master by "Atin Mukherjee" <amukherj> with a commit message-

glusterd: compare friend data within mutex

During friend handshake, if glusterd receives more than one friend update, it may very well happen that two threads end up working on two different volinfo references and glusterd ends up updating the store with an old volinfo reference. The same was observed while debugging the glusterd crash from the validating-server-quorum.t test file in the line-coverage regression.

Solution is to run glusterd_compare_friend_data under a mutex.

Test: As the crash was more visible in the line-coverage run (given lcov does some instrumentation and exposes the races), 6 manual lcov runs were triggered starting from https://build.gluster.org/job/line-coverage/443 to https://build.gluster.org/job/line-coverage/449/ and no crash was observed from validating-server-quorum.t

Change-Id: I86fce473a76fd24742d51bf17a685d28b90a8941
Fixes: bz#1603063
Signed-off-by: Atin Mukherjee <amukherj>
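For reference, a minimal sketch of the serialisation pattern the commit message describes: the whole compare-and-import step runs under a single mutex so that concurrent friend updates can no longer interleave on the same in-memory state. The identifiers below (handle_friend_update, compare_and_import, import_mutex) are invented for the example and do not correspond to the actual glusterd functions touched by the patch.

#include <pthread.h>
#include <stdio.h>

/* Illustrative stand-ins for peer data and the stored volume state;
 * none of these names come from the real glusterd sources. */
struct peer_data { int version; };

static int stored_version;
static pthread_mutex_t import_mutex = PTHREAD_MUTEX_INITIALIZER;

static int compare_and_import(struct peer_data *pd)
{
    /* "import" the peer data only if it is newer than what we hold */
    if (pd->version > stored_version)
        stored_version = pd->version;
    return 0;
}

int handle_friend_update(struct peer_data *pd)
{
    int ret;

    /* the whole compare+import step runs under one mutex, so two
     * concurrent friend updates cannot interleave on the same state */
    pthread_mutex_lock(&import_mutex);
    ret = compare_and_import(pd);
    pthread_mutex_unlock(&import_mutex);

    return ret;
}

int main(void)
{
    struct peer_data pd = { .version = 2 };
    handle_friend_update(&pd);
    printf("stored version after import: %d\n", stored_version);
    return 0;
}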
COMMIT: https://review.gluster.org/20658 committed in master by "Shyamsundar Ranganathan" <srangana> with a commit message-

tests: fix brick check orders

Fix brick checks for validating-server-quorum.t & quorum-validation.t ...and make brick_up_status_1 function more generic. Also fix a timing issue in bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t

Change-Id: I797ef4cec5b160aafa979bae7151b1e99fcb48ac
Updates: bz#1603063
Signed-off-by: Atin Mukherjee <amukherj>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/