Description of problem:

When a bunch of gluster-block devices were deleted, gluster-blockd crashed and a core file was generated.

The following operation was performed:

oc delete pvc/mongodb-{21..50}

This in turn calls the heketi API, which starts deleting the block devices. After deleting mongodb-{21..28}, the gluster-blockd service crashed.

(gdb) bt
#0  0x00007ff84a9051f7 in raise () from /lib64/libc.so.6
#1  0x00007ff84a9068e8 in abort () from /lib64/libc.so.6
#2  0x00007ff84a944f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007ff84a94c619 in _int_free () from /lib64/libc.so.6
#4  0x00007ff84a9fc40a in xdr_string () from /lib64/libc.so.6
#5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
#7  0x000000000040b6ea in glusterBlockDeleteRemote (data=0x7ff83c003220) at block_svc_routines.c:652
#8  0x00007ff84bc09e25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007ff84a9c834d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):

sh-4.2# rpm -qa | grep 'block'
gluster-block-0.2.1-12.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Delete the PVCs from OCP, which in turn deletes the gluster-block devices via heketi.
(gdb) fr 5
#5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
251         if (!xdr_string (xdrs, &objp->out, ~0))
(gdb) p objp->out
$1 = 0x0
(gdb) fr 6
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
(gdb) p reply
$2 = (blockResponse *) 0x628a60 <clnt_res.5117>
(gdb) p *reply
$3 = {exit = 0, out = 0x0, offset = 0, xdata = {xdata_len = 0, xdata_val = 0x0}}

reply has 'out' as NULL, which is leading to the crash.
(In reply to Pranith Kumar K from comment #2)
> (gdb) fr 5
> #5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
> 251         if (!xdr_string (xdrs, &objp->out, ~0))
> (gdb) p objp->out
> $1 = 0x0
> (gdb) fr 6
> #6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
> 429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
> (gdb) p reply
> $2 = (blockResponse *) 0x628a60 <clnt_res.5117>
> (gdb) p *reply
> $3 = {exit = 0, out = 0x0, offset = 0, xdata = {xdata_len = 0, xdata_val = 0x0}}
>
> reply has 'out' as NULL, which is leading to the crash.

krk/Prasanna,

I went through the code. The logs on the machine the RPC above was sent to, i.e. 10.70.46.197, indicate that the delete request was successful:

[2017-09-18 06:58:56.940633] INFO: delete request, blockname=blockvol_1d99e82e2ff4e31d74cc581e2e18bd13d filename=648fe67a-8151-4304-88fc-1722ed0f135d [at block_svc_routines.c+2392 :<block_delete_1_svc>]
[2017-09-18 06:58:59.539747] INFO: command exit code, 0 [at block_svc_routines.c+2435 :<block_delete_1_svc>]

Whereas in the core, the memory is zeroed out.

I have one theory: sending the requests in parallel leads to a problem where the static memory is overwritten by one thread while another thread is still using it, because the generated code doesn't have any synchronization.

Example:

blockResponse *
block_delete_1(blockDelete *argp, CLIENT *clnt)
{
        static blockResponse clnt_res;     <<<<<<-------- Same memory is used by everyone

        memset((char *)&clnt_res, 0, sizeof(clnt_res));     <<<<<---- Here memset is happening
        if (clnt_call (clnt, BLOCK_DELETE,
                (xdrproc_t) xdr_blockDelete, (caddr_t) argp,
                (xdrproc_t) xdr_blockResponse, (caddr_t) &clnt_res,
                TIMEOUT) != RPC_SUCCESS) {
                return (NULL);
        }
        return (&clnt_res);     <<<<<---- ptr to this memory is returned.
}

So while Thread-1 has just executed "return (&clnt_res);", another thread could be doing "memset((char *)&clnt_res, 0, sizeof(clnt_res));".

This seems to be a day-1 gluster-blockd bug from the looks of it.
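To make the interleaving concrete, here is a small standalone program (hypothetical illustration only, not gluster-block code; the names fake_call/worker/response are made up) showing why a stub that returns a pointer to a function-local static buffer is unsafe when two delete threads call it concurrently:

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct response {
        char out[32];
};

/* Mimics an rpcgen-style stub: returns a pointer to a function-local static. */
static struct response *
fake_call(const char *msg)
{
        static struct response res;              /* shared by every caller */

        memset(&res, 0, sizeof(res));            /* clobbers any in-flight result */
        usleep(1000);                            /* stand-in for the RPC round trip */
        snprintf(res.out, sizeof(res.out), "%s", msg);
        return &res;                             /* pointer to the shared memory */
}

static void *
worker(void *arg)
{
        const char *msg = arg;
        struct response *r = fake_call(msg);

        usleep(1000);                            /* another thread may memset *r here */
        printf("sent '%s', got back '%s'\n", msg, r->out);
        return NULL;
}

int
main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, (void *)"delete blockvol_A");
        pthread_create(&t2, NULL, worker, (void *)"delete blockvol_B");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

With both workers racing on the same static buffer, the first thread can observe an empty or overwritten result; in gluster-blockd the same pattern corrupts the reply that clnt_freeres() later tries to free.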
Here is the stack trace which confirms the theory:

(gdb) t 1
[Switching to thread 1 (Thread 0x7ff829886700 (LWP 4431))]
#0  0x00007ff84a9051f7 in raise () from /lib64/libc.so.6
(gdb) fr 6
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
(gdb) p reply
*******************
$1 = (blockResponse *) 0x628a60 <clnt_res.5117>   <<<----- response which the rpc returned
*******************

(gdb) t 5
[Switching to thread 5 (Thread 0x7ff829085700 (LWP 4432))]
#0  0x00007ff84a9bda3d in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ff84a9bda3d in poll () from /lib64/libc.so.6
#1  0x00007ff84a9f61c9 in readtcp () from /lib64/libc.so.6
#2  0x00007ff84a9f04b1 in xdrrec_getbytes () from /lib64/libc.so.6
#3  0x00007ff84a9f06c1 in xdrrec_getlong () from /lib64/libc.so.6
#4  0x00007ff84a9fb959 in xdr_u_long () from /lib64/libc.so.6
#5  0x00007ff84a9ef251 in xdr_replymsg () from /lib64/libc.so.6
#6  0x00007ff84a9f5f8f in clnttcp_call () from /lib64/libc.so.6
#7  0x000000000041c819 in block_delete_1 (argp=argp@entry=0x7ff829084cf0, clnt=clnt@entry=0x7ff82000a570) at block_clnt.c:33
#8  0x0000000000409518 in glusterBlockCallRPC_1 (host=0x7ff83c1e1280 "10.70.46.203", cobj=cobj@entry=0x7ff829084cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829084cdf, out=out@entry=0x7ff83c003270) at block_svc_routines.c:398
#9  0x000000000040b6ea in glusterBlockDeleteRemote (data=0x7ff83c003250) at block_svc_routines.c:652
#10 0x00007ff84bc09e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff84a9c834d in clone () from /lib64/libc.so.6
(gdb) l block_delete_1
24              return (&clnt_res);
25      }
26
27      blockResponse *
28      block_delete_1(blockDelete *argp, CLIENT *clnt)
29      {
30              static blockResponse clnt_res;
31
32              memset((char *)&clnt_res, 0, sizeof(clnt_res));
33              if (clnt_call (clnt, BLOCK_DELETE,
(gdb) fr 7
#7  0x000000000041c819 in block_delete_1 (argp=argp@entry=0x7ff829084cf0, clnt=clnt@entry=0x7ff82000a570) at block_clnt.c:33
33          if (clnt_call (clnt, BLOCK_DELETE,
*******************
(gdb) p &clnt_res
$2 = (blockResponse *) 0x628a60 <clnt_res.5117>   <<<---- it is in use by thread-5 as well, which would have done the memset at line 32 above and is waiting for the response.
*******************
Here is the patch: https://review.gluster.org/#/c/18314/
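For reference, this kind of race can be avoided at the client-stub level by having the caller own the result buffer, similar in spirit to the MT-safe stubs that rpcgen -M generates. The sketch below only illustrates that approach and is not the patch linked above; the header name "block.h" and the TIMEOUT value are assumptions, while blockDelete, blockResponse, xdr_blockDelete, xdr_blockResponse and BLOCK_DELETE are taken from the generated gluster-block XDR/client code seen in the backtraces:

#include <rpc/rpc.h>
#include <string.h>
#include "block.h"   /* assumed name of the generated header declaring blockDelete/blockResponse/BLOCK_DELETE */

/* Placeholder timeout; the generated client code keeps its own TIMEOUT. */
static struct timeval TIMEOUT = { 25, 0 };

enum clnt_stat
block_delete_1_r(blockDelete *argp, blockResponse *clnt_res, CLIENT *clnt)
{
        /* clnt_res is allocated by the caller (e.g. on its own stack), so
         * concurrent delete threads no longer share one static buffer. */
        memset((char *)clnt_res, 0, sizeof(*clnt_res));
        return clnt_call(clnt, BLOCK_DELETE,
                         (xdrproc_t) xdr_blockDelete, (caddr_t) argp,
                         (xdrproc_t) xdr_blockResponse, (caddr_t) clnt_res,
                         TIMEOUT);
}

Each glusterBlockDeleteRemote() thread would then pass its own blockResponse, and clnt_freeres() would operate on memory no other thread can clobber.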
Performed loop deletion multiple times on build gluster-block-0.2.1-13 with tcmu-runner-1.2.0-15. Basic sanity testing has also been done on the said build from the RHGS gluster-block side. Changing the QA contact for the final verification.
Changing the QA contact back, as there is already a CNS tracker for this bug. This issue has not been hit on the RHGS gluster-block side, and attempts to reproduce it were unsuccessful. Hence, we will wait for the CNS tracker to be closed before confidently moving the present bug to its final state.
Moving this to VERIFIED, as the dependent bug has been validated in the CNS environment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2773