+++ This bug was initially created as a clone of Bug #1722131 +++

Description of problem:
During an in-service upgrade, glusterd on the upgraded node crashed with a backtrace when the 'gluster vol status' command was issued from a non-upgraded node. The upgrade scenario is from glusterfs-5 (or lower) to glusterfs-6.

Version-Release number of selected component (if applicable):
glusterfs 5 to glusterfs 6 upgrade

How reproducible:
3/3

Steps to Reproduce:
1. On a three-node cluster (N1, N2, N3), create 2020 replicate (1x3) volumes and start them (brick-mux enabled).
2. Mount 3 volumes and run continuous IO from 3 different clients.
3. Upgrade node N1.
4. While heal is in progress on node N1, run 'gluster volume status' on node N2, which is yet to be upgraded.

Actual results:
glusterd crashed with a backtrace:

[2019-06-19 11:13:56.506826] I [MSGID: 106499] [glusterd-handler.c:4497:__glusterd_handle_status_volume] 0-management: Received status volume req for volume testvol_-997
[2019-06-19 11:13:56.512662] I [MSGID: 106499] [glusterd-handler.c:4497:__glusterd_handle_status_volume] 0-management: Received status volume req for volume testvol_-998
[2019-06-19 11:13:56.518409] I [MSGID: 106499] [glusterd-handler.c:4497:__glusterd_handle_status_volume] 0-management: Received status volume req for volume testvol_-999
[2019-06-19 11:14:37.732442] E [MSGID: 101005] [dict.c:2852:dict_serialized_length_lk] 0-dict: value->len (-1162167622) < 0 [Invalid argument]
[2019-06-19 11:14:37.732483] E [MSGID: 106130] [glusterd-handler.c:2633:glusterd_op_commit_send_resp] 0-management: failed to get serialized length of dict

pending frames:
frame : type(0) op(0)

patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-06-19 11:14:37
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.0
/lib64/libglusterfs.so.0(+0x27240)[0x7f7b5c38a240]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f7b5c394c64]
/lib64/libc.so.6(+0x363f0)[0x7f7b5a9c63f0]
/lib64/libpthread.so.0(pthread_mutex_lock+0x0)[0x7f7b5b1cad00]
/lib64/libglusterfs.so.0(__gf_free+0x12c)[0x7f7b5c3b64cc]
/lib64/libglusterfs.so.0(+0x1b889)[0x7f7b5c37e889]
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x478f8)[0x7f7b504c58f8]
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x44514)[0x7f7b504c2514]
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x1d19e)[0x7f7b5049b19e]
/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24dce)[0x7f7b504a2dce]
/lib64/libglusterfs.so.0(+0x66610)[0x7f7b5c3c9610]
/lib64/libc.so.6(+0x48180)[0x7f7b5a9d8180]
---------

Expected results:
glusterd should not crash.
REVIEW: https://review.gluster.org/22939 (glusterd: conditionally clear txn_opinfo in stage op) posted (#1) for review on master by Atin Mukherjee
Root cause: In a heterogeneous cluster (nodes on mixed versions during an in-service upgrade), the above patch ends up clearing the txn_opinfo during the staging phase. However, since the originator node is still running an older version, it goes on to initiate a commit op, and that commit op then accesses the already freed txn_opinfo.
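To illustrate the idea behind making the clean-up conditional, here is a minimal, self-contained C sketch. It is not the actual glusterd code; the struct and function names (txn_opinfo, stage_op, commit_op, originator_is_old, skip_locking) are illustrative assumptions that only model why an unconditional free in the stage phase breaks heterogeneous clusters.

/*
 * Minimal sketch: clear the per-transaction opinfo in the stage phase
 * only when no commit op will follow.  All names here are hypothetical
 * stand-ins for glusterd's real structures.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for glusterd's per-transaction opinfo. */
struct txn_opinfo {
    char *op_dict;       /* serialized request payload                    */
    bool  skip_locking;  /* set for volume-status style transactions      */
};

/*
 * Stage-phase handler.  'originator_is_old' models a peer that is still
 * on the pre-upgrade version and will therefore still send a commit op.
 */
static void
stage_op(struct txn_opinfo **opinfo, bool originator_is_old)
{
    /* ...stage-phase validation work would happen here... */

    /*
     * Free the opinfo here only when no commit op can follow.  Freeing
     * it unconditionally is the use-after-free: an old-version
     * originator still fires a commit op afterwards.
     */
    if (!originator_is_old && (*opinfo)->skip_locking) {
        free((*opinfo)->op_dict);
        free(*opinfo);
        *opinfo = NULL;
    }
}

/* Commit-phase handler, still issued by old-version originators. */
static void
commit_op(struct txn_opinfo *opinfo)
{
    if (!opinfo) {
        fprintf(stderr, "commit op: txn opinfo already freed!\n");
        return;
    }
    printf("commit op: serializing '%s'\n", opinfo->op_dict);
    /* The opinfo is cleared here, at the end of the transaction. */
    free(opinfo->op_dict);
    free(opinfo);
}

int
main(void)
{
    struct txn_opinfo *opinfo = calloc(1, sizeof(*opinfo));
    opinfo->op_dict = strdup("volume status testvol");
    opinfo->skip_locking = true;

    /* Heterogeneous cluster: the originator is still on the old version. */
    stage_op(&opinfo, true);   /* opinfo is kept alive               */
    commit_op(opinfo);         /* commit op can still serialize it   */
    return 0;
}

With the conditional in place, the stage phase keeps the opinfo alive whenever an older originator may still send a commit op, and the commit phase remains responsible for the final clean-up.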
REVIEW: https://review.gluster.org/22939 (glusterd: conditionally clear txn_opinfo in stage op) merged (#2) on master by Atin Mukherjee