Description of problem:

When a bunch of gluster-block devices were deleted, gluster-blockd crashed and a core file was generated.

The following operation was performed:

oc delete pvc/mongodb-{21..50}

This in turn calls the heketi API, which starts deleting the block devices. After deleting mongodb-{21..28}, the gluster-blockd service crashed.

(gdb) bt
#0  0x00007ff84a9051f7 in raise () from /lib64/libc.so.6
#1  0x00007ff84a9068e8 in abort () from /lib64/libc.so.6
#2  0x00007ff84a944f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007ff84a94c619 in _int_free () from /lib64/libc.so.6
#4  0x00007ff84a9fc40a in xdr_string () from /lib64/libc.so.6
#5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
#7  0x000000000040b6ea in glusterBlockDeleteRemote (data=0x7ff83c003220) at block_svc_routines.c:652
#8  0x00007ff84bc09e25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007ff84a9c834d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):

sh-4.2# rpm -qa | grep 'block'
gluster-block-0.2.1-12.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Delete the PVCs from OCP, which in turn deletes the gluster-block devices via heketi.
(gdb) fr 5
#5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
251         if (!xdr_string (xdrs, &objp->out, ~0))
(gdb) p objp->out
$1 = 0x0
(gdb) fr 6
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
(gdb) p reply
$2 = (blockResponse *) 0x628a60 <clnt_res.5117>
(gdb) p *reply
$3 = {exit = 0, out = 0x0, offset = 0, xdata = {xdata_len = 0, xdata_val = 0x0}}

reply has 'out' as NULL, which is leading to the crash.
(In reply to Pranith Kumar K from comment #2)
> (gdb) fr 5
> #5  0x00000000004047c1 in xdr_blockResponse (xdrs=0x7ff81c03fb48, objp=0x628a60 <clnt_res.5117>) at block_xdr.c:251
> 251         if (!xdr_string (xdrs, &objp->out, ~0))
> (gdb) p objp->out
> $1 = 0x0
> (gdb) fr 6
> #6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
> 429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
> (gdb) p reply
> $2 = (blockResponse *) 0x628a60 <clnt_res.5117>
> (gdb) p *reply
> $3 = {exit = 0, out = 0x0, offset = 0, xdata = {xdata_len = 0, xdata_val = 0x0}}
>
> reply has 'out' as NULL, which is leading to the crash.

krk/Prasanna,

I went through the code. The logs on the machine the RPC above was sent to, i.e. 10.70.46.197, indicate that the delete request was successful:

[2017-09-18 06:58:56.940633] INFO: delete request, blockname=blockvol_1d99e82e2ff4e31d74cc581e2e18bd13d filename=648fe67a-8151-4304-88fc-1722ed0f135d [at block_svc_routines.c+2392 :<block_delete_1_svc>]
[2017-09-18 06:58:59.539747] INFO: command exit code, 0 [at block_svc_routines.c+2435 :<block_delete_1_svc>]

Whereas in the core, the memory is zeroed out.

I have one theory: sending the requests in parallel leads to a problem where the static memory is overwritten by one thread while another thread is still using it, because the generated code doesn't have any synchronization.

Example:

blockResponse *
block_delete_1(blockDelete *argp, CLIENT *clnt)
{
        static blockResponse clnt_res;     <<<<<<-------- Same memory is used by everyone

        memset((char *)&clnt_res, 0, sizeof(clnt_res));     <<<<<---- Here memset is happening
        if (clnt_call (clnt, BLOCK_DELETE,
                (xdrproc_t) xdr_blockDelete, (caddr_t) argp,
                (xdrproc_t) xdr_blockResponse, (caddr_t) &clnt_res,
                TIMEOUT) != RPC_SUCCESS) {
                return (NULL);
        }
        return (&clnt_res);     <<<<<---- ptr to this memory is returned.
}

So while Thread-1 has just executed "return (&clnt_res);", another thread could be doing "memset((char *)&clnt_res, 0, sizeof(clnt_res));".

This seems to be a day-1 gluster-blockd bug from the looks of it.
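To make the interleaving concrete, here is a small standalone program (hypothetical illustration only, not gluster-block code; the names fake_call/worker/response are made up) showing why a stub that returns a pointer to a function-local static buffer is unsafe when two delete threads call it concurrently:

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct response {
        char out[32];
};

/* Mimics an rpcgen-style stub: returns a pointer to a function-local static. */
static struct response *
fake_call(const char *msg)
{
        static struct response res;              /* shared by every caller */

        memset(&res, 0, sizeof(res));            /* clobbers any in-flight result */
        usleep(1000);                            /* stand-in for the RPC round trip */
        snprintf(res.out, sizeof(res.out), "%s", msg);
        return &res;                             /* pointer to the shared memory */
}

static void *
worker(void *arg)
{
        const char *msg = arg;
        struct response *r = fake_call(msg);

        usleep(1000);                            /* another thread may memset *r here */
        printf("sent '%s', got back '%s'\n", msg, r->out);
        return NULL;
}

int
main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, (void *)"delete blockvol_A");
        pthread_create(&t2, NULL, worker, (void *)"delete blockvol_B");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

With both workers racing on the same static buffer, the first thread can observe an empty or overwritten result; in gluster-blockd the same pattern corrupts the reply that clnt_freeres() later tries to free.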
Here is the stack trace which confirms the theory:

(gdb) t 1
[Switching to thread 1 (Thread 0x7ff829886700 (LWP 4431))]
#0  0x00007ff84a9051f7 in raise () from /lib64/libc.so.6
(gdb) fr 6
#6  0x000000000040941c in glusterBlockCallRPC_1 (host=0x7ff83c1e1150 "10.70.46.197", cobj=cobj@entry=0x7ff829885cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829885cdf, out=out@entry=0x7ff83c003240) at block_svc_routines.c:429
429         if (!clnt_freeres(clnt, (xdrproc_t)xdr_blockResponse, (char *)reply)) {
(gdb) p reply
*******************
$1 = (blockResponse *) 0x628a60 <clnt_res.5117>   <<<----- response which the rpc returned
*******************

(gdb) t 5
[Switching to thread 5 (Thread 0x7ff829085700 (LWP 4432))]
#0  0x00007ff84a9bda3d in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ff84a9bda3d in poll () from /lib64/libc.so.6
#1  0x00007ff84a9f61c9 in readtcp () from /lib64/libc.so.6
#2  0x00007ff84a9f04b1 in xdrrec_getbytes () from /lib64/libc.so.6
#3  0x00007ff84a9f06c1 in xdrrec_getlong () from /lib64/libc.so.6
#4  0x00007ff84a9fb959 in xdr_u_long () from /lib64/libc.so.6
#5  0x00007ff84a9ef251 in xdr_replymsg () from /lib64/libc.so.6
#6  0x00007ff84a9f5f8f in clnttcp_call () from /lib64/libc.so.6
#7  0x000000000041c819 in block_delete_1 (argp=argp@entry=0x7ff829084cf0, clnt=clnt@entry=0x7ff82000a570) at block_clnt.c:33
#8  0x0000000000409518 in glusterBlockCallRPC_1 (host=0x7ff83c1e1280 "10.70.46.203", cobj=cobj@entry=0x7ff829084cf0, opt=opt@entry=DELETE_SRV, rpc_sent=rpc_sent@entry=0x7ff829084cdf, out=out@entry=0x7ff83c003270) at block_svc_routines.c:398
#9  0x000000000040b6ea in glusterBlockDeleteRemote (data=0x7ff83c003250) at block_svc_routines.c:652
#10 0x00007ff84bc09e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ff84a9c834d in clone () from /lib64/libc.so.6
(gdb) l block_delete_1
24              return (&clnt_res);
25      }
26
27      blockResponse *
28      block_delete_1(blockDelete *argp, CLIENT *clnt)
29      {
30              static blockResponse clnt_res;
31
32              memset((char *)&clnt_res, 0, sizeof(clnt_res));
33              if (clnt_call (clnt, BLOCK_DELETE,
(gdb) fr 7
#7  0x000000000041c819 in block_delete_1 (argp=argp@entry=0x7ff829084cf0, clnt=clnt@entry=0x7ff82000a570) at block_clnt.c:33
33          if (clnt_call (clnt, BLOCK_DELETE,
*******************
(gdb) p &clnt_res
$2 = (blockResponse *) 0x628a60 <clnt_res.5117>   <<<---- it is in use by thread-5 as well, which would have done the memset at line 32 above and is waiting for the response.
*******************
Here is the patch: https://review.gluster.org/#/c/18314/
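For reference, this kind of race can be avoided at the client-stub level by having the caller own the result buffer, similar in spirit to the MT-safe stubs that rpcgen -M generates. The sketch below only illustrates that approach and is not the patch linked above; the header name "block.h" and the TIMEOUT value are assumptions, while blockDelete, blockResponse, xdr_blockDelete, xdr_blockResponse and BLOCK_DELETE are taken from the generated gluster-block XDR/client code seen in the backtraces:

#include <rpc/rpc.h>
#include <string.h>
#include "block.h"   /* assumed name of the generated header declaring blockDelete/blockResponse/BLOCK_DELETE */

/* Placeholder timeout; the generated client code keeps its own TIMEOUT. */
static struct timeval TIMEOUT = { 25, 0 };

enum clnt_stat
block_delete_1_r(blockDelete *argp, blockResponse *clnt_res, CLIENT *clnt)
{
        /* clnt_res is allocated by the caller (e.g. on its own stack), so
         * concurrent delete threads no longer share one static buffer. */
        memset((char *)clnt_res, 0, sizeof(*clnt_res));
        return clnt_call(clnt, BLOCK_DELETE,
                         (xdrproc_t) xdr_blockDelete, (caddr_t) argp,
                         (xdrproc_t) xdr_blockResponse, (caddr_t) clnt_res,
                         TIMEOUT);
}

Each glusterBlockDeleteRemote() thread would then pass its own blockResponse, and clnt_freeres() would operate on memory no other thread can clobber.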
Performed loop deletion multiple times on build gluster-block-0.2.1-13 with tcmu-runner-1.2.0-15. Basic sanity testing has also been done on the said build from the RHGS gluster-block side. Changing the QA contact for the final verification.
Changing the QA contact back, as there is already a CNS tracker for this bug. This issue has not been hit on the RHGS gluster-block side, and attempts to reproduce it were unsuccessful. Hence, we will wait for the CNS tracker to be closed before confidently moving the present bug to its final state.
Moving this to VERIFIED, as the dependent bug has been validated in the CNS environment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2773