+++ This bug was initially created as a clone of Bug #1452102 +++

Description of problem:

Observed crashes in regression tests for two different patches executing two different tests:

1 - https://build.gluster.org/job/centos6-regression/4604/consoleFull
2 - https://build.gluster.org/job/centos6-regression/4670/consoleFull

Both are crashing in the same function:

[New Thread 25997]
07:32:45 [Thread debugging using libthread_db enabled]
07:32:45 Core was generated by `/build/install/sbin/glusterfs -s 127.1.1.1 --volfile-id rebalance/patchy --xlat'.
07:32:45 Program terminated with signal 11, Segmentation fault.
07:32:45 #0  0x00007fe59a3553d4 in dht_selfheal_dir_setattr (frame=0x7fe580004c40, loc=0x7fe580001028, stbuf=0x7fe5800010b8, valid=16777215, layout=0x7fe590020220) at /home/jenkins/root/workspace/centos6-regression/xlators/cluster/dht/src/dht-selfheal.c:1180
07:32:45 1180                            gf_msg_trace (this->name, 0,
07:32:45
07:32:45 Thread 12 (Thread 0x7fe599570700 (LWP 25997)):
07:32:45 #0  0x00007fe5a6dd01c3 in epoll_wait () from /lib64/libc.so.6
07:32:45 No symbol table info available.
07:32:45 #1  0x00007fe5a820050e in event_dispatch_epoll_worker (data=0x7fe594039970) at /home/jenkins/root/workspace/centos6-regression/libglusterfs/src/event-epoll.c:638
07:32:45         event = {events = 1, data = {ptr = 0x700000005, fd = 5, u32 = 5, u64 = 30064771077}}
07:32:45         ret = 0
07:32:45         ev_data = 0x7fe594039970
07:32:45         event_pool = 0x241efc0
07:32:45         myindex = 2
07:32:45         timetodie = 0
07:32:45         __FUNCTION__ = "event_dispatch_epoll_worker"
07:32:45 #2  0x00007fe5a7467aa1 in start_thread () from /lib64/libpthread.so.0
07:32:45 No symbol table info available.
07:32:45 #3  0x00007fe5a6dcfbcd in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):

How reproducible:
Frequently seen during regression.

Steps to Reproduce:
1.
2.
3.

Actual results:
A crash was seen in the tests.

Expected results:
There should not be any crash.

Additional info:

--- Additional comment from Susant Kumar Palai on 2017-05-18 07:09:16 EDT ---

Assigning this to Nithya, as she is working on this.
--- Additional comment from Nithya Balachandran on 2017-05-19 05:49:30 EDT ---

RCA:

(gdb) bt
#0  0x00007fd6bcfbdfb8 in dht_selfheal_dir_setattr (frame=0x7fd6a8000e50, loc=0x7fd6a8000f68, stbuf=0x7fd6a8000ff8, valid=-1, layout=0x7fd6b8003500) at /home/jenkins/root/workspace/centos6-regression/xlators/cluster/dht/src/dht-selfheal.c:1180
#1  0x00007fd6bcfbf994 in dht_selfheal_dir_mkdir (frame=0x7fd6a8000e50, loc=0x7fd6a8000f68, layout=0x7fd6b8003500, force=0) at /home/jenkins/root/workspace/centos6-regression/xlators/cluster/dht/src/dht-selfheal.c:1512
#2  0x00007fd6bcfc162a in dht_selfheal_directory (frame=0x7fd6a8000e50, dir_cbk=0x7fd6bcfd1b3c <dht_lookup_selfheal_cbk>, loc=0x7fd6a8000f68, layout=0x7fd6b8003500) at /home/jenkins/root/workspace/centos6-regression/xlators/cluster/dht/src/dht-selfheal.c:2167
#3  0x00007fd6bcfd4d98 in dht_lookup_dir_cbk (frame=0x7fd6a8000e50, cookie=0x7fd6b800a860, this=0x7fd6b800cb50, op_ret=0, op_errno=22, inode=0x7fd6a8000c50, stbuf=0x7fd6bdeba840, xattr=0x7fd6b801c700, postparent=0x7fd6bdeba7d0) at /home/jenkins/root/workspace/centos6-regression/xlators/cluster/dht/src/dht-common.c:928
#4  0x00007fd6bd2868ec in client3_3_lookup_cbk (req=0x7fd6b8001cf0, iov=0x7fd6b8001d30, count=1, myframe=0x7fd6b8003870) at /home/jenkins/root/workspace/centos6-regression/xlators/protocol/client/src/client-rpc-fops.c:2867
#5  0x00007fd6ca90984d in rpc_clnt_handle_reply (clnt=0x7fd6b8021ee0, pollin=0x7fd6b8001ac0) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-lib/src/rpc-clnt.c:778
#6  0x00007fd6ca909e17 in rpc_clnt_notify (trans=0x7fd6b80220b0, mydata=0x7fd6b8021f10, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7fd6b8001ac0) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-lib/src/rpc-clnt.c:971
#7  0x00007fd6ca905dac in rpc_transport_notify (this=0x7fd6b80220b0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7fd6b8001ac0) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-lib/src/rpc-transport.c:538
#8  0x00007fd6bf6eb58a in socket_event_poll_in (this=0x7fd6b80220b0, notify_handled=_gf_true) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-transport/socket/src/socket.c:2315
#9  0x00007fd6bf6ebbd5 in socket_event_handler (fd=12, idx=6, gen=1, data=0x7fd6b80220b0, poll_in=1, poll_out=0, poll_err=0) at /home/jenkins/root/workspace/centos6-regression/rpc/rpc-transport/socket/src/socket.c:2467
#10 0x00007fd6cabb4d9e in event_dispatch_epoll_handler (event_pool=0x119ffc0, event=0x7fd6bdebae70) at /home/jenkins/root/workspace/centos6-regression/libglusterfs/src/event-epoll.c:572
#11 0x00007fd6cabb50a0 in event_dispatch_epoll_worker (data=0x11e70b0) at /home/jenkins/root/workspace/centos6-regression/libglusterfs/src/event-epoll.c:648
#12 0x00007fd6c9e1caa1 in start_thread () from ./lib64/libpt

(gdb) f 0
#0  0x00007fd6bcfbdfb8 in dht_selfheal_dir_setattr (frame=0x7fd6a8000e50, loc=0x7fd6a8000f68, stbuf=0x7fd6a8000ff8, valid=-1, layout=0x7fd6b8003500) at /home/jenkins/root/workspace/centos6-regression/xlators/cluster/dht/src/dht-selfheal.c:1180
1180                            gf_msg_trace (this->name, 0,
(gdb) l
1175            gf_uuid_copy (loc->gfid, local->gfid);
1176
1177            local->call_cnt = missing_attr;
1178            for (i = 0; i < layout->cnt; i++) {
1179                    if (layout->list[i].err == -1) {
1180                            gf_msg_trace (this->name, 0,
1181                                          "%s: setattr on subvol %s, gfid = %s",
1182                                          loc->path, layout->list[i].xlator->name,
1183                                          uuid_utoa(loc->gfid));
1184
1185                            STACK_WIND (frame, dht_selfheal_dir_setattr_cbk,
1186                                        layout->list[i].xlator,
1187                                        layout->list[i].xlator->fops->setattr,
1188                                        loc, stbuf, valid, NULL);
(gdb) p i
$2 = 60
(gdb) p layout->list[i].xlator->name
Cannot access memory at address 0x0
(gdb) p layout
$3 = (dht_layout_t *) 0x7fd6b8003500
(gdb) p *layout
$4 = {spread_cnt = -1379869184, cnt = 8378078, preset = 25198592, commit_hash = 8378040, gen = 41216, type = 8377856, ref = 20480, search_unhashed = _gf_false, list = 0x7fd6b8003520}
(gdb) p *frame
$5 = {root = 0x100000000, parent = 0x100000001, frames = {next = 0x7fd6a8000ea8, prev = 0x7fd6a8000eb0}, local = 0x0, this = 0x0, ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x7fd6a8000eb0, __next = 0x0}}, __size = '\000' <repeats 24 times>, "\260\016\000\250\326\177\000\000\000\000\000\000\000\000\000", __align = 0}}, cookie = 0x0, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 140559918309968, tv_usec = 140559918311376}, end = {tv_sec = 1, tv_usec = 0}, wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}

It looks like both the frame and the layout have been freed, yet the layout is still being accessed. I am unable to reproduce the problem on my local setup. However, re-reading layout->cnt in the for loop is not a good idea, so I am replacing it with a local variable, which I think should fix this problem.

--- Additional comment from Worker Ant on 2017-05-19 05:55:05 EDT ---

REVIEW: https://review.gluster.org/17343 (cluster/dht: Fix crash in dht_selfheal_dir_setattr) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-05-19 06:41:51 EDT ---

REVIEW: https://review.gluster.org/17343 (cluster/dht: Fix crash in dht_selfheal_dir_setattr) posted (#2) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-05-19 10:48:18 EDT ---

REVIEW: https://review.gluster.org/17343 (cluster/dht: Fix crash in dht_selfheal_dir_setattr) posted (#3) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-05-19 15:44:56 EDT ---

COMMIT: https://review.gluster.org/17343 committed in master by Shyamsundar Ranganathan (srangana)
------
commit 17784aaa311494e4538c616f02bf95477ae781bc
Author: N Balachandran <nbalacha>
Date:   Fri May 19 15:22:12 2017 +0530

    cluster/dht: Fix crash in dht_selfheal_dir_setattr

    Use a local variable to store the call cnt used in the for loop
    for the STACK_WIND so as not to access local which may be freed
    by STACK_UNWIND after all fops return.

    Change-Id: I24f49b6dbd29a2b706e388e2f6d5196c0f80afc5
    BUG: 1452102
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/17343
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
    CentOS-regression: Gluster Build System <jenkins.org>
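For context, the commit above amounts to snapshotting the loop bound into a stack variable before dispatching any asynchronous fops, because once the last wound fop returns, the unwind path can free frame->local and, with it, the layout that the loop condition keeps re-reading. Below is a minimal, self-contained sketch of that pattern; struct layout, dispatch_op and selfheal_setattr are hypothetical stand-ins for illustration only, not the actual GlusterFS sources (in the real translator the dispatch is STACK_WIND and the memory is released via STACK_UNWIND).

#include <stdio.h>

struct layout {
        int  cnt;     /* number of subvolume entries            */
        int *err;     /* err[i] == -1 means entry i needs repair */
};

/* Stand-in for STACK_WIND: dispatch one asynchronous op. In the real
 * code the completion callback may run on another thread and, after the
 * last wound op returns, free the frame-local data and the layout. */
static void dispatch_op(int idx)
{
        printf("dispatching setattr for entry %d\n", idx);
}

static void selfheal_setattr(struct layout *layout)
{
        /* BUGGY pattern: rechecking layout->cnt on every iteration reads
         * memory that may already have been freed after the last dispatch:
         *
         *   for (int i = 0; i < layout->cnt; i++) { ... }
         *
         * FIXED pattern: copy the bound into a stack variable up front so
         * the loop condition never touches the (possibly freed) layout. */
        int cnt = layout->cnt;

        for (int i = 0; i < cnt; i++) {
                if (layout->err[i] == -1)
                        dispatch_op(i);
        }
}

int main(void)
{
        int err[3] = { -1, 0, -1 };
        struct layout l = { .cnt = 3, .err = err };

        selfheal_setattr(&l);
        return 0;
}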
On glusterfs version 3.8.4-35.el7rhgs.x86_64, ran the tests "./tests/bugs/glusterd/bug-1245045-remove-brick-validation.t" and "./tests/basic/distribute/rebal-all-nodes-migrate.t" in a loop 100 times and could not reproduce the issue. Moving this BZ to (conditionally) Verified considering:
- There is no consistent way to reproduce this issue.
- The same tests passed without any crash.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774