Bug 1492556
| Summary: | [CNS env] - gluster-blockd crashed when a bunch of block devices were deleted | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | krishnaram Karthick <kramdoss> |
| Component: | gluster-block | Assignee: | Prasanna Kumar Kalever <prasanna.kalever> |
| Status: | CLOSED ERRATA | QA Contact: | Sweta Anandpara <sanandpa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.3 | CC: | amukherj, pkarampu, rcyriac, rhs-bugs, sanandpa, storage-qa-internal |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | gluster-block-0.2.1-13.el7rhgs | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1492599 (view as bug list) | Environment: | |
| Last Closed: | 2017-09-21 04:22:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1417151, 1492599 | | |
Description
krishnaram Karthick
2017-09-18 07:19:14 UTC
(gdb) fr 5
#5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
251         if (!xdr_string (xdrs, &objp->out, ~0))
(gdb) p objp->out
$1 = 0x0
(gdb) fr 6
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
(gdb) p reply
$2 = (blockResponse *) 0x628a60 <clnt_res.5117>
(gdb) p *reply
$3 = {exit = 0, out = 0x0, offset = 0, xdata = {xdata_len = 0, xdata_val = 0x0}}

reply has 'out' as NULL, which is leading to the crash.

(In reply to Pranith Kumar K from comment #2)
> (gdb) fr 5
> #5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
> 251         if (!xdr_string (xdrs, &objp->out, ~0))
> (gdb) p objp->out
> $1 = 0x0
> (gdb) fr 6
> #6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
> 429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
> (gdb) p reply
> $2 = (blockResponse *) 0x628a60 <clnt_res.5117>
> (gdb) p *reply
> $3 = {exit = 0, out = 0x0, offset = 0, xdata = {xdata_len = 0, xdata_val = 0x0}}
>
> reply has 'out' as NULL, which is leading to the crash.

krk/Prasanna, I went through the code. The logs on the machine the RPC above was sent to (10.70.46.197) indicate that the delete request was successful:

[2017-09-18 06:58:56.940633] INFO: delete request, blockname=blockvol_1d99e82e2ff4e31d74cc581e2e18bd13d filename=648fe67a-8151-4304-88fc-1722ed0f135d [at block_svc_routines.c+2392 :<block_delete_1_svc>]
[2017-09-18 06:58:59.539747] INFO: command exit code, 0 [at block_svc_routines.c+2435 :<block_delete_1_svc>]

whereas in the core the memory is zeroed out. I have one theory: sending the requests in parallel leads to a problem where the static memory is overwritten by one thread while another thread is still using it, because the generated code has no synchronization.

Example:

blockResponse *
block_delete_1(blockDelete *argp, CLIENT *clnt)
{
        static blockResponse clnt_res;                    <<<<<<-------- the same memory is used by every caller

        memset((char *)&clnt_res, 0, sizeof(clnt_res));   <<<<<---- here the memset is happening
        if (clnt_call (clnt, BLOCK_DELETE,
                (xdrproc_t) xdr_blockDelete, (caddr_t) argp,
                (xdrproc_t) xdr_blockResponse, (caddr_t) &clnt_res,
                TIMEOUT) != RPC_SUCCESS) {
                return (NULL);
        }
        return (&clnt_res);                               <<<<<---- a pointer to this memory is returned
}

So while Thread-1 has just executed "return (&clnt_res);", another thread could be doing "memset((char *)&clnt_res, 0, sizeof(clnt_res));". This seems to be a day-1 gluster-blockd bug from the looks of it.
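For illustration only (this is not gluster-block code): a minimal, self-contained sketch of the same class of bug, where two threads share one static result buffer and one thread's memset wipes the reply the other thread is still reading, matching the zeroed blockResponse seen in the core above. All names here are hypothetical.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct response {
    int   exit_code;
    char *out;
};

/* Mimics the rpcgen-style stub: every caller gets a pointer to the same
 * static object, which is memset at the start of each call. */
static struct response *fake_rpc_call(const char *msg)
{
    static struct response clnt_res;           /* shared by all threads */

    memset(&clnt_res, 0, sizeof(clnt_res));    /* can wipe a reply another
                                                  thread still holds */
    usleep(1000);                              /* widen the race window */
    clnt_res.exit_code = 0;
    clnt_res.out = strdup(msg);                /* leaked on purpose; sketch only */
    return &clnt_res;                          /* pointer to the shared memory */
}

static void *worker(void *arg)
{
    struct response *reply = fake_rpc_call(arg);

    usleep(1000);                              /* the other thread may memset now */
    /* reply->out can already be NULL again, like 'out = 0x0' in the core dump */
    printf("out=%s\n", reply->out ? reply->out : "(null!)");
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, "delete block-1");
    pthread_create(&t2, NULL, worker, "delete block-2");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Built with gcc -pthread, one of the workers can intermittently print (null!) because the sibling's memset lands between the return and the read, which is the same window the parallel glusterBlockDeleteRemote threads fall into.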
Here is the stack trace which confirms the theory:

(gdb) t 1
[Switching to thread 1 (Thread 0x7ff829886700 (LWP 4431))]
#0  0x00007ff84a9051f7 in raise () from /lib64/libc.so.6
(gdb) fr 6
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
(gdb) p reply
*******************
$1 = (blockResponse *) 0x628a60 <clnt_res.5117>   <<<----- response which the rpc returned
*******************
(gdb) t 5
[Switching to thread 5 (Thread 0x7ff829085700 (LWP 4432))]
#0  0x00007ff84a9bda3d in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ff84a9bda3d in poll () from /lib64/libc.so.6
#1  0x00007ff84a9f61c9 in readtcp () from /lib64/libc.so.6
#2  0x00007ff84a9f04b1 in xdrrec_getbytes () from /lib64/libc.so.6
#3  0x00007ff84a9f06c1 in xdrrec_getlong () from /lib64/libc.so.6
#4  0x00007ff84a9fb959 in xdr_u_long () from /lib64/libc.so.6
#5  0x00007ff84a9ef251 in xdr_replymsg () from /lib64/libc.so.6
#6  0x00007ff84a9f5f8f in clnttcp_call () from /lib64/libc.so.6
#7  0x000000000041c819 in block_delete_1 (argp=argp@entry=0x7ff829084cf0, clnt=clnt@entry=0x7ff82000a570) at block_clnt.c:33
#8  0x0000000000409518 in glusterBlockCallRPC_1 (host=0x7ff83c1e1280 "10.70.46.203", cobj=cobj@entry=0x7ff829084cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829084cdf, out=out@entry=0x7ff83c003270) at block_svc_routines.c:398
#9  0x000000000040b6ea in glusterBlockDeleteRemote (data=0x7ff83c003250) at block_svc_routines.c:652
#10 0x00007ff84bc09e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff84a9c834d in clone () from /lib64/libc.so.6
(gdb) l block_delete_1
24              return (&clnt_res);
25      }
26
27      blockResponse *
28      block_delete_1(blockDelete *argp, CLIENT *clnt)
29      {
30              static blockResponse clnt_res;
31
32              memset((char *)&clnt_res, 0, sizeof(clnt_res));
33              if (clnt_call (clnt, BLOCK_DELETE,
(gdb) fr 7
#7  0x000000000041c819 in block_delete_1 (argp=argp@entry=0x7ff829084cf0, clnt=clnt@entry=0x7ff82000a570) at block_clnt.c:33
33              if (clnt_call (clnt, BLOCK_DELETE,
*******************
(gdb) p &clnt_res
$2 = (blockResponse *) 0x628a60 <clnt_res.5117>   <<<---- the same clnt_res is in use by thread 5 as well; it would have done the memset at line 32 above and is now waiting for its response.
*******************

Here is the patch: https://review.gluster.org/#/c/18314/ (a generic sketch of the caller-allocated calling pattern that avoids the shared static buffer follows at the end of this report).

Have done loop deletion multiple times on the build gluster-block-0.2.1-13 and tcmu-runner-1.2.0-15. Basic sanity has also been done on the said build from the RHGS gluster-block side. Changing the QA contact for the final verification.

Changing the QA contact back, as there is already a CNS tracker for this bug. This issue has not been hit on the RHGS gluster-block side, and attempts to reproduce it were also in vain. Hence, will wait for the CNS tracker to get closed before confidently moving the present bug to its final state.

Moving this to verified as the dependent bug is validated in the CNS environment.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2773
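Note for context: the merged change is the Gerrit review linked above; the sketch below is not that patch. It only illustrates the general MT-safe calling convention (the style rpcgen emits with its -M option), where the caller owns the result buffer so the per-host delete threads no longer share a static blockResponse. The wrapper name, the "block.h" header name, and the 120-second timeout are assumptions.

```c
/* Sketch only -- not the change merged at https://review.gluster.org/#/c/18314/.
 * It shows a caller-allocated (MT-safe) variant of the generated stub. */
#include <rpc/rpc.h>
#include <string.h>

#include "block.h"   /* assumed rpcgen header declaring blockDelete,
                        blockResponse, BLOCK_DELETE and the xdr_* routines */

/* Each delete thread passes its own blockResponse (e.g. on its stack),
 * so a memset in one thread cannot wipe the reply another thread holds. */
static int
block_delete_mt(CLIENT *clnt, blockDelete *argp, blockResponse *clnt_res)
{
    struct timeval tout = { 120, 0 };          /* assumed timeout value */

    memset(clnt_res, 0, sizeof(*clnt_res));    /* only this caller's buffer */
    if (clnt_call(clnt, BLOCK_DELETE,
                  (xdrproc_t) xdr_blockDelete, (caddr_t) argp,
                  (xdrproc_t) xdr_blockResponse, (caddr_t) clnt_res,
                  tout) != RPC_SUCCESS) {
        return -1;
    }
    return 0;   /* caller later releases XDR-allocated fields with clnt_freeres() */
}
```

Each glusterBlockDeleteRemote-style worker would then declare its own blockResponse locally, call the wrapper, read reply.out, and finish with clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)&reply), instead of holding a pointer into the shared static clnt_res.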