Bug 1609724

Summary: brick (glusterfsd) crashed in quota_lookup
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Nag Pavan Chilakam <nchilaka>
Component: quota    Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.4    CC: apaladug, nchilaka, ndevos, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, sunnikri, vdas
Target Milestone: ---    Keywords: Reopened
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.12.2-16 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-04 06:51:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1503137    
Attachments:
  bt and taabt (flags: none)
  node2_ brick_crash_possibly_on_node_reboot (flags: none)
  [WIP] quota: prevent segfault in quota_lookup() (flags: none)

Description Nag Pavan Chilakam 2018-07-30 09:07:34 UTC
Description of problem:
========================
A brick crashed at some point during a non-functional testing cycle.


Version-Release number of selected component (if applicable):
--------------
3.12.2-14

How reproducible:
=================
hit it once

Steps to Reproduce:
1. Created a 6-node setup.
2. Created about 54 arbiter volumes in a daisy-chain fashion to distribute the arbiter bricks across all nodes.
3. Started pumping I/O on the first 8 volumes from 8 different clients (yara).
4. Set cluster.max-bricks-per-process to 40.
5. Had another 100 volumes created (cercei).
6. Ran a script to create another set of new volumes (54, ollena), start them, and after some time delete them; this continues for about 200 iterations.
7. Rebooted n2.

Left the system running over the weekend.

Actual results:
=================
brick crashed on n3


The bt is very large, hence attaching it.

Comment 2 Nag Pavan Chilakam 2018-07-30 09:09:56 UTC
Created attachment 1471465 [details]
bt and taabt

Comment 3 Nag Pavan Chilakam 2018-07-30 09:25:33 UTC
Created attachment 1471480 [details]
node2_ brick_crash_possibly_on_node_reboot

Comment 5 Niels de Vos 2018-07-30 09:44:21 UTC
uw_init_context_t is part of libunwind. This is not the actual cause of the crash.

I think the segfault from comment #2 got triggered in quota_lookup.

#12034 gf_print_trace (signum=11, ctx=0x557e4d5c2010) at common-utils.c:762
#12035 <signal handler called>
#12036 0x00007f54b290fb61 in quota_lookup (frame=0x7f542813e640, this=0x7f5482205c80, loc=0x7f53037fd880, xattr_req=0x0) at quota.c:1663
#12037 0x00007f54b26e1e53 in io_stats_lookup (frame=frame@entry=0x7f542813e530, this=this@entry=0x7f54820085e0, loc=loc@entry=0x7f53037fd880, xdata=xdata@entry=0x0) at io-stats.c:2723
#12038 0x00007f54c72342cd in default_lookup (frame=0x7f542813e530, this=<optimized out>, loc=0x7f53037fd880, xdata=0x0) at defaults.c:2572


This is similar to the one in the trace from comment #3:

#12035 <signal handler called>
---Type <return> to continue, or q <return> to quit---
#12036 0x00007f78ac7b1b61 in quota_lookup (frame=0x7f7878140530, this=0x7f750b8edcc0, loc=0x7f77f77fd880, 
    xattr_req=0x0) at quota.c:1663


Reassigning to the quota component so that those maintainers can have a look.

Comment 6 Niels de Vos 2018-07-30 09:58:24 UTC
Created attachment 1471485 [details]
[WIP] quota: prevent segfault in quota_lookup()

Comment 8 Niels de Vos 2018-07-30 10:05:57 UTC
Hi Sanoj,

please have a look at the attached patch, this might fix the problem. It would need further corrections, but I think the main issue should be addressed with this.
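
For reference, the guard could conceptually look like the fragment below, placed at the top of quota_lookup(). This is only a sketch of the idea, not the attached patch itself; the use of quota_priv_t, STACK_WIND_TAIL and FIRST_CHILD follows the usual xlator conventions and the real change may look different:

    /* Sketch only: bail out early if fini() has already torn down
     * this->private, instead of dereferencing a NULL priv. */
    quota_priv_t *priv = this->private;

    if (priv == NULL) {
            STACK_WIND_TAIL (frame, FIRST_CHILD (this),
                             FIRST_CHILD (this)->fops->lookup,
                             loc, xattr_req);
            return 0;
    }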

Nag, how reliable can you reproduce this?

Comment 9 Nag Pavan Chilakam 2018-07-30 10:38:27 UTC
(In reply to Niels de Vos from comment #8)
> Hi Sanoj,
> 
> please have a look at the attached patch, this might fix the problem. It
> would need further corrections, but I think the main issue should be
> addressed with this.
> 
> Nag, how reliable can you reproduce this?


Not sure, as this was a test run over 2 days, but what alarms me here is that this crash was seen on 3 different nodes on the same setup, at different times.

Comment 10 Sanoj Unnikrishnan 2018-07-30 11:10:34 UTC
From an initial look, it seems the priv struct of quota was NULL. However, this is supposed to be allocated and non-NULL even if quota is not enabled.
Continuing to investigate.

Comment 11 Nag Pavan Chilakam 2018-07-30 12:27:39 UTC
(In reply to Niels de Vos from comment #5)
> uw_init_context_t is part of libunwind. This is not the actual cause of the
> crash.
> 
> I think the segfault from comment #2 got triggered in quota_lookup.
> 
> #12034 gf_print_trace (signum=11, ctx=0x557e4d5c2010) at common-utils.c:762
> #12035 <signal handler called>
> #12036 0x00007f54b290fb61 in quota_lookup (frame=0x7f542813e640,
> this=0x7f5482205c80, loc=0x7f53037fd880, xattr_req=0x0) at quota.c:1663
> #12037 0x00007f54b26e1e53 in io_stats_lookup
> (frame=frame@entry=0x7f542813e530, this=this@entry=0x7f54820085e0,
> loc=loc@entry=0x7f53037fd880, xdata=xdata@entry=0x0) at io-stats.c:2723
> #12038 0x00007f54c72342cd in default_lookup (frame=0x7f542813e530,
> this=<optimized out>, loc=0x7f53037fd880, xdata=0x0) at defaults.c:2572
> 
> 
> This is similar to the one in the trace from comment #3:
> 
> #12035 <signal handler called>
> ---Type <return> to continue, or q <return> to quit---
> #12036 0x00007f78ac7b1b61 in quota_lookup (frame=0x7f7878140530,
> this=0x7f750b8edcc0, loc=0x7f77f77fd880, 
>     xattr_req=0x0) at quota.c:1663
> 
> 
> Reassigning to the quota component so that those maintainers can have a look.

Thanks Niels. Changing the title accordingly; feel free to correct it, if required, on further analysis.

Comment 12 Sanoj Unnikrishnan 2018-08-01 10:34:36 UTC
From the bt:

#12028 _gf_msg (domain=domain@entry=0x7f78bd1c8682 "", file=file@entry=0x7f78bd1cc14f "mem-pool.c", 
#12029 0x00007f78bd14f7e6 in gf_mem_set_acct_info (xl=xl@entry=0x7f750b8edcc0, 
#12030 0x00007f78bd14f9b9 in __gf_malloc (size=size@entry=33, type=type@entry=39, 
#12031 0x00007f78bd14fcb3 in gf_vasprintf (string_ptr=0x7f77f77fbfc0, format=0x7f78bd1c91d2 "package-string: %s", 
#12032 0x00007f78bd14fda4 in gf_asprintf (string_ptr=string_ptr@entry=0x7f77f77fbfc0, 
#12033 0x00007f78bd131bd8 in gf_dump_config_flags () at common-utils.c:699
#12034 gf_print_trace (signum=11, ctx=0x559d67c3d010) at common-utils.c:762
#12035 <signal handler called>
#12036 0x00007f78ac7b1b61 in quota_lookup (frame=0x7f7878140530, this=0x7f750b8edcc0, loc=0x7f77f77fd880, 
#12037 0x00007f78ac583e53 in io_stats_lookup (frame=frame@entry=0x7f7878140420, this=this@entry=0x7f750b919320, 
#12038 0x00007f78bd1a72cd in default_lookup (frame=0x7f7878140420, this=<optimized out>, loc=0x7f77f77fd880, 
#12039 0x00007f78bd1647a6 in syncop_lookup (subvol=subvol@entry=0x7f750b919ee0, loc=loc@entry=0x7f77f77fd880, 
#12040 0x00007f78ac14c5c6 in server_first_lookup (this=0x7f78a802b8a0, client=client@entry=0x7f7878007040, 
#12041 0x00007f78ac14d231 in server_setvolume (req=0x7f78780017f0) at server-handshake.c:913
#12042 0x00007f78bcee9955 in rpcsvc_handle_rpc_call (svc=0x7f78a803f730, trans=trans@entry=0x7f7878000f70, 
#12043 0x00007f78bcee9b3b in rpcsvc_notify (trans=0x7f7878000f70, mydata=<optimized out>, event=<optimized out>, 
#12044 0x00007f78bceeba73 in rpc_transport_notify (this=this@entry=0x7f7878000f70, 
#12045 0x00007f78b1d11566 in socket_event_poll_in (this=this@entry=0x7f7878000f70, notify_handled=<optimized out>)
#12046 0x00007f78b1d13b0c in socket_event_handler (fd=336, idx=59, gen=3640, data=0x7f7878000f70, poll_in=1, 
#12047 0x00007f78bd1854c4 in event_dispatch_epoll_handler (event=0x7f77f77fde80, event_pool=0x559d67c74210)
#12048 event_dispatch_epoll_worker (data=0x7f787f2f31d0) at event-epoll.c:659
#12049 0x00007f78bbf86dd5 in start_thread (arg=0x7f77f77fe700) at pthread_create.c:308
#12050 0x00007f78bb84fb3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) p *this
$3 = {name = 0x7f75c9d90480 "11-ollena-10-quota", type = 0x7f750b95cd70 "features/quota", instance_name = 0x0, next = 0x7f750b8ed100, prev = 0x7f750b919320, parents = 0x7f750b8fce60, 
  children = 0x7f750b96f040, options = 0x0, dlhandle = 0x7f78a8027880, fops = 0x7f78ac9d10c0 <fops>, cbks = 0x7f78ac9d1060 <cbks>, dumpops = 0x7f78ac9d1000 <dumpops>, volume_options = {
    next = 0x7f750b8edd20, prev = 0x7f750b8edd20}, fini = 0x7f78ac7ca280 <fini>, init = 0x7f78ac7c9c60 <init>, reconfigure = 0x7f78ac7ca060 <reconfigure>, 
  mem_acct_init = 0x7f78ac7c9be0 <mem_acct_init>, notify = 0x7f78bd1c7090 <default_notify>, loglevel = GF_LOG_NONE, client_latency = 0, latencies = {{min = 0, max = 0, total = 0, std = 0, 
      mean = 0, count = 0} <repeats 55 times>}, history = 0x0, ctx = 0x559d67c3d010, graph = 0x7f75c9db5290, itable = 0x0, init_succeeded = 1 '\001', private = 0x0, mem_acct = 0x0, 
  winds = 0, switched = 0 '\000', local_pool = 0x0, is_autoloaded = _gf_false, volfile_id = 0x0, xl_id = 17, cleanup_starting = 1, call_cleanup = 1}
(gdb) 

We see this->private = 0x0.
We initialize the private even when quota is disabled; it is allocated in init() and freed in fini().
Notice that cleanup_starting = 1 and call_cleanup = 1, which indicates that fini() was in progress.
So it looks like there was an in-flight quota_lookup while fini() was being invoked on the quota translator:
fini() ran before the lookup and freed the private, causing a segfault when the lookup accessed it.
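
To make the interleaving concrete, here is a small standalone toy program, not GlusterFS code (the struct names, fake_fini and fake_lookup are made up for illustration): one thread plays fini() and frees/clears the private data, the other plays an in-flight lookup that dereferences it without a check, the way quota_lookup does.

    /* Build with: gcc -pthread uaf-race.c -o uaf-race */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct priv   { int is_quota_on; };
    struct xlator { struct priv *private; };

    static struct xlator xl;

    static void *fake_fini (void *arg)
    {
            /* fini(): free the private data and clear the pointer while
             * a lookup may still be in flight. */
            struct priv *p = xl.private;
            xl.private = NULL;
            free (p);
            return NULL;
    }

    static void *fake_lookup (void *arg)
    {
            usleep (1000);                 /* let fini() win the race */
            struct priv *p = xl.private;   /* no NULL check, like quota_lookup */
            printf ("is_quota_on = %d\n", p->is_quota_on); /* segfaults when fini() won */
            return NULL;
    }

    int main (void)
    {
            pthread_t t1, t2;

            xl.private = calloc (1, sizeof (*xl.private));
            pthread_create (&t1, NULL, fake_lookup, NULL);
            pthread_create (&t2, NULL, fake_fini, NULL);
            pthread_join (t1, NULL);
            pthread_join (t2, NULL);
            return 0;
    }

Here the window is forced open with usleep(); on the brick it depends entirely on scheduling, which fits the race being hard to hit on demand.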

I tried recreating this race with gdb as well as with code-level instrumentation; however, I am not seeing any crash.



The backtrace here is very long; the reason is that within the logging function triggered by GF_ASSERT, we assert again using GF_ASSERT.
We need to guard GF_ASSERT against such recursive asserts.
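
One possible shape for such a guard, sketched here with a made-up GUARDED_ASSERT macro and flag rather than the real GF_ASSERT: a per-thread flag that suppresses an assert raised from inside the first assert's own logging path.

    #include <stdio.h>

    /* Sketch of a re-entrancy guard for an assert-style macro
     * (__thread is the GCC/Clang thread-local extension). */
    static __thread int assert_in_progress = 0;

    #define GUARDED_ASSERT(expr)                                        \
            do {                                                        \
                    if (!(expr) && !assert_in_progress) {               \
                            /* the flag suppresses asserts raised   */  \
                            /* from inside this logging path, which */  \
                            /* is what produced the huge trace      */  \
                            assert_in_progress = 1;                     \
                            fprintf (stderr,                            \
                                     "Assertion failed: %s (%s:%d)\n",  \
                                     #expr, __FILE__, __LINE__);        \
                            assert_in_progress = 0;                     \
                    }                                                   \
            } while (0)

The real GF_ASSERT would still want to log or dump once; the sketch only shows the re-entrancy check.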

Comment 13 Sanoj Unnikrishnan 2018-08-07 06:50:36 UTC
From the trace above, we see that it was called in the server_first_lookup() path, and during cleanup, since cleanup_starting = 1.

The patch https://review.gluster.org/#/c/20427/4 addresses this issue by avoiding server_first_lookup() if cleanup has started.

The relevant code snippet is below:

        if (client->bound_xl->cleanup_starting) {
                op_ret = -1;
                op_errno = EAGAIN;
                ret = dict_set_str (reply, "ERROR",
                                    "cleanup flag is set for xlator "
                                    "before call first_lookup Try again later");
        } else {
                op_ret = server_first_lookup (this, client, reply);
                if (op_ret == -1)
                        op_errno = ENOENT;
        }
 
Hence marking this bug as a duplicate of bug 1597627.

*** This bug has been marked as a duplicate of bug 1597627 ***

Comment 20 errata-xmlrpc 2018-09-04 06:51:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607