Bug 1492556

Summary: [CNS env] - gluster-blockd crashed when a bunch of block devices were deleted
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: krishnaram Karthick <kramdoss>
Component: gluster-block
Assignee: Prasanna Kumar Kalever <prasanna.kalever>
Status: CLOSED ERRATA
QA Contact: Sweta Anandpara <sanandpa>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: amukherj, pkarampu, rcyriac, rhs-bugs, sanandpa, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: gluster-block-0.2.1-13.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1492599 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:22:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1417151, 1492599

Description krishnaram Karthick 2017-09-18 07:19:14 UTC
Description of problem:
When a bunch of gluster-block devices were deleted, gluster-blockd crashed and a core file was generated.

The following operation was performed:

oc delete pvc/mongodb-{21..50}

This in turn calls the heketi API, which starts deleting the block devices. After mongodb-{21..28} had been deleted, the gluster-blockd service crashed.

(gdb) bt
#0  0x00007ff84a9051f7 in raise () from /lib64/libc.so.6
#1  0x00007ff84a9068e8 in abort () from /lib64/libc.so.6
#2  0x00007ff84a944f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007ff84a94c619 in _int_free () from /lib64/libc.so.6
#4  0x00007ff84a9fc40a in xdr_string () from /lib64/libc.so.6
#5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, 
    out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
#7  0x000000000040b6ea in glusterBlockDeleteRemote (data=0x7ff83c003220) at block_svc_routines.c:652
#8  0x00007ff84bc09e25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007ff84a9c834d in clone () from /lib64/libc.so.6


Version-Release number of selected component (if applicable):
sh-4.2# rpm -qa | grep 'block'
gluster-block-0.2.1-12.el7rhgs.x86_64


How reproducible:
1/1

Steps to Reproduce:
1. Delete PVCs from OCP, which in turn deletes the gluster-block devices via heketi.

Comment 2 Pranith Kumar K 2017-09-18 07:36:01 UTC
(gdb) fr 5
#5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
251              if (!xdr_string (xdrs, &objp->out, ~0))
(gdb) p objp->out
$1 = 0x0
(gdb) fr 6
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, 
    out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
(gdb) p reply 
$2 = (blockResponse *) 0x628a60 <clnt_res.5117>
(gdb) p *reply
$3 = {exit = 0, out = 0x0, offset = 0, xdata = {xdata_len = 0, xdata_val = 0x0}}

reply has 'out' as NULL, which is what leads to the crash.

Comment 3 Pranith Kumar K 2017-09-18 08:05:06 UTC
(In reply to Pranith Kumar K from comment #2)
> (gdb) fr 5
> #5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48,
> objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
> 251              if (!xdr_string (xdrs, &objp->out, ~0))
> (gdb) p objp->out
> $1 = 0x0
> (gdb) fr 6
> #6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150
> "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV,
> rpc_sent=rpc_sent@entry=0x7ff829885cdf, 
>     out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
> 429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char
> *)reply)) {
> (gdb) p reply 
> $2 = (blockResponse *) 0x628a60 <clnt_res.5117>
> (gdb) p *reply
> $3 = {exit = 0, out = 0x0, offset = 0, xdata = {xdata_len = 0, xdata_val =
> 0x0}}
> 
> reply has 'out' as NULL, which is leading to the crash.

krk/Prasanna,
     I went through the code. As per the logs on the machine the above RPC was sent to, i.e. 10.70.46.197, the delete request was successful:
[2017-09-18 06:58:56.940633] INFO: delete request, blockname=blockvol_1d99e82e2ff4e31d74cc581e2e18bd13d filename=648fe67a-8151-4304-88fc-1722ed0f135d [at block_svc_routines.c+2392 :<block_delete_1_svc>]
[2017-09-18 06:58:59.539747] INFO: command exit code, 0 [at block_svc_routines.c+2435 :<block_delete_1_svc>]

Whereas in the core, the memory is zeroed out.

I have one theory: I think sending the requests in parallel leads to a problem where the static memory is overwritten by one thread while another thread is still using it, since the generated code doesn't have any synchronization:

Example:
blockResponse *
block_delete_1(blockDelete *argp, CLIENT *clnt)
{
        static blockResponse clnt_res; <<<<<<-------- Same memory is used by everyone

        memset((char *)&clnt_res, 0, sizeof(clnt_res)); <<<<<---- Here memset is happening
        if (clnt_call (clnt, BLOCK_DELETE,
                (xdrproc_t) xdr_blockDelete, (caddr_t) argp,
                (xdrproc_t) xdr_blockResponse, (caddr_t) &clnt_res,
                TIMEOUT) != RPC_SUCCESS) {
                return (NULL);
        }
        return (&clnt_res); <<<<<---- ptr to this memory is returned.
}

So while Thread-1 has just been handed the pointer via "return (&clnt_res);", another thread could be doing "memset((char *)&clnt_res, 0, sizeof(clnt_res));".

This seems to be a day-1 gluster-blockd bug from the looks of it.
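To illustrate the hazard in isolation, here is a minimal standalone toy (hypothetical code, not taken from gluster-block) that mimics the generated stub's static result buffer; depending on scheduling, one thread can observe its reply zeroed or replaced by another caller:

/* Toy demonstration of why a static result buffer in an rpcgen-style stub is
 * unsafe when the stub is called from multiple threads. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct response { char *out; };

static struct response *
fake_stub(const char *msg)
{
        static struct response clnt_res;          /* shared by every caller */

        memset(&clnt_res, 0, sizeof(clnt_res));   /* wipes the other caller's reply */
        usleep(1000);                             /* pretend the RPC takes time */
        clnt_res.out = (char *)msg;
        return &clnt_res;                         /* everyone gets the same address */
}

static void *
worker(void *arg)
{
        struct response *r = fake_stub((char *)arg);
        usleep(1000);                             /* another thread may memset by now */
        printf("out = %s\n", r->out ? r->out : "(null)");   /* may print "(null)" */
        return NULL;
}

int
main(void)
{
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, "reply-1");
        pthread_create(&t2, NULL, worker, "reply-2");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}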

Here is the stack trace which confirms the theory:
(gdb) t 1
[Switching to thread 1 (Thread 0x7ff829886700 (LWP 4431))]
#0  0x00007ff84a9051f7 in raise () from /lib64/libc.so.6
(gdb) fr 6
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, 
    out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
(gdb) p reply

*******************
$1 = (blockResponse *) 0x628a60 <clnt_res.5117> <<<----- response which the rpc returned
*******************

(gdb) t 5
[Switching to thread 5 (Thread 0x7ff829085700 (LWP 4432))]
#0  0x00007ff84a9bda3d in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ff84a9bda3d in poll () from /lib64/libc.so.6
#1  0x00007ff84a9f61c9 in readtcp () from /lib64/libc.so.6
#2  0x00007ff84a9f04b1 in xdrrec_getbytes () from /lib64/libc.so.6
#3  0x00007ff84a9f06c1 in xdrrec_getlong () from /lib64/libc.so.6
#4  0x00007ff84a9fb959 in xdr_u_long () from /lib64/libc.so.6
#5  0x00007ff84a9ef251 in xdr_replymsg () from /lib64/libc.so.6
#6  0x00007ff84a9f5f8f in clnttcp_call () from /lib64/libc.so.6
#7  0x000000000041c819 in block_delete_1 (argp=argp@entry=0x7ff829084cf0, clnt=clnt@entry=0x7ff82000a570) at block_clnt.c:33
#8  0x0000000000409518 in glusterBlockCallRPC_1 (host=0x7ff83c1e1280 "10.70.46.203", cobj=cobj@entry=0x7ff829084cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829084cdf, 
    out=out@entry=0x7ff83c003270) at block_svc_routines.c:398
#9  0x000000000040b6ea in glusterBlockDeleteRemote (data=0x7ff83c003250) at block_svc_routines.c:652
#10 0x00007ff84bc09e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff84a9c834d in clone () from /lib64/libc.so.6
(gdb) l block_delete_1
24              return (&clnt_res);
25      }
26
27      blockResponse *
28      block_delete_1(blockDelete *argp, CLIENT *clnt)
29      {
30              static blockResponse clnt_res;
31
32              memset((char *)&clnt_res, 0, sizeof(clnt_res));
33              if (clnt_call (clnt, BLOCK_DELETE,
(gdb) fr 7
#7  0x000000000041c819 in block_delete_1 (argp=argp@entry=0x7ff829084cf0, clnt=clnt@entry=0x7ff82000a570) at block_clnt.c:33
33              if (clnt_call (clnt, BLOCK_DELETE,
*******************
(gdb) p &clnt_res
$2 = (blockResponse *) 0x628a60 <clnt_res.5117> <<<---- it is in use by thread-5 as well, which would have done the memset at line 32 above and is now waiting for the response.
*******************

Comment 6 Prasanna Kumar Kalever 2017-09-18 15:51:34 UTC
Here is the patch:
https://review.gluster.org/#/c/18314/
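For context, a minimal sketch of the general idea behind such a fix, assuming the approach is to stop sharing the rpcgen-generated static result buffer (the review above is the authoritative change; the wrapper name below is hypothetical). Each caller passes its own blockResponse and invokes clnt_call() directly, so concurrent deletes no longer stomp on each other's reply:

/* Hypothetical sketch, not the actual patch: a re-entrant variant of the
 * generated stub in which the caller owns the result buffer. TIMEOUT is the
 * same timeout value used by the generated client code. */
static enum clnt_stat
block_delete_1_r(blockDelete *argp, blockResponse *res, CLIENT *clnt)
{
        memset(res, 0, sizeof(*res));   /* per-caller buffer, nothing static */
        return clnt_call(clnt, BLOCK_DELETE,
                         (xdrproc_t) xdr_blockDelete, (caddr_t) argp,
                         (xdrproc_t) xdr_blockResponse, (caddr_t) res,
                         TIMEOUT);
}

With a per-caller buffer, each thread can release its decoded reply (e.g. via clnt_freeres() on its own copy) without racing other threads.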

Comment 9 Sweta Anandpara 2017-09-19 12:14:28 UTC
Loop deletion was done multiple times on the build gluster-block-0.2.1-13 with tcmu-runner-1.2.0-15. Basic sanity testing has also been done on the said build from the RHGS gluster-block side.

Changing the qa contact for the final verification.

Comment 10 Sweta Anandpara 2017-09-20 06:17:11 UTC
Changing the qa contact back as there is already a CNS tracker for this bug.

This issue has not been hit on the RHGS gluster-block side; attempts to reproduce it also went in vain.

Hence, we will wait for the CNS tracker to get closed before confidently moving the present bug to its final state.

Comment 11 Sweta Anandpara 2017-09-20 07:39:03 UTC
Moving this to verified as the dependent bug is validated in CNS environment.

Comment 13 errata-xmlrpc 2017-09-21 04:22:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2773