Bug 1609724
Summary: brick (glusterfsd) crashed in quota_lookup

| Field | Value |
|---|---|
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | quota |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | rhgs-3.4 |
| Target Milestone | --- |
| Target Release | RHGS 3.4.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Keywords | Reopened |
| Fixed In Version | glusterfs-3.12.2-16 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2018-09-04 06:51:13 UTC |
| Reporter | Nag Pavan Chilakam <nchilaka> |
| Assignee | Mohit Agrawal <moagrawa> |
| QA Contact | Nag Pavan Chilakam <nchilaka> |
| CC | apaladug, nchilaka, ndevos, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, sunnikri, vdas |
| Bug Blocks | 1503137 |
Description
Nag Pavan Chilakam
2018-07-30 09:07:34 UTC
Created attachment 1471465 [details]
bt and taabt
Created attachment 1471480 [details]
node2_ brick_crash_possibly_on_node_reboot
uw_init_context_t is part of libunwind. This is not the actual cause of the crash.

I think the segfault from comment #2 got triggered in quota_lookup:

```
#12034 gf_print_trace (signum=11, ctx=0x557e4d5c2010) at common-utils.c:762
#12035 <signal handler called>
#12036 0x00007f54b290fb61 in quota_lookup (frame=0x7f542813e640, this=0x7f5482205c80,
    loc=0x7f53037fd880, xattr_req=0x0) at quota.c:1663
#12037 0x00007f54b26e1e53 in io_stats_lookup (frame=frame@entry=0x7f542813e530,
    this=this@entry=0x7f54820085e0, loc=loc@entry=0x7f53037fd880, xdata=xdata@entry=0x0)
    at io-stats.c:2723
#12038 0x00007f54c72342cd in default_lookup (frame=0x7f542813e530, this=<optimized out>,
    loc=0x7f53037fd880, xdata=0x0) at defaults.c:2572
```

This is similar to the one in the trace from comment #3:

```
#12035 <signal handler called>
#12036 0x00007f78ac7b1b61 in quota_lookup (frame=0x7f7878140530, this=0x7f750b8edcc0,
    loc=0x7f77f77fd880, xattr_req=0x0) at quota.c:1663
```

Reassigning to the quota component so that those maintainers can have a look.

Created attachment 1471485 [details]
[WIP] quota: prevent segfault in quota_lookup()
Hi Sanoj,

please have a look at the attached patch, this might fix the problem. It would need further corrections, but I think the main issue should be addressed with this.

Nag, how reliable can you reproduce this?

(In reply to Niels de Vos from comment #8)
> Nag, how reliable can you reproduce this?

Not sure, as this was a test run over 2 days, but what alarms me here is that this crash was seen on 3 different nodes, on the same setup, at different times.

From an initial look, it seems the priv struct of quota was NULL. However, this is supposed to be allocated and non-NULL even if quota was not enabled. Continuing.
Thanks Niels, changing the title accordingly. Feel free to correct it, if required, on further analysis.

From the bt:

```
#12028 _gf_msg (domain=domain@entry=0x7f78bd1c8682 "", file=file@entry=0x7f78bd1cc14f "mem-pool.c",
#12029 0x00007f78bd14f7e6 in gf_mem_set_acct_info (xl=xl@entry=0x7f750b8edcc0,
#12030 0x00007f78bd14f9b9 in __gf_malloc (size=size@entry=33, type=type@entry=39,
#12031 0x00007f78bd14fcb3 in gf_vasprintf (string_ptr=0x7f77f77fbfc0, format=0x7f78bd1c91d2 "package-string: %s",
#12032 0x00007f78bd14fda4 in gf_asprintf (string_ptr=string_ptr@entry=0x7f77f77fbfc0,
#12033 0x00007f78bd131bd8 in gf_dump_config_flags () at common-utils.c:699
#12034 gf_print_trace (signum=11, ctx=0x559d67c3d010) at common-utils.c:762
#12035 <signal handler called>
#12036 0x00007f78ac7b1b61 in quota_lookup (frame=0x7f7878140530, this=0x7f750b8edcc0, loc=0x7f77f77fd880,
#12037 0x00007f78ac583e53 in io_stats_lookup (frame=frame@entry=0x7f7878140420, this=this@entry=0x7f750b919320,
#12038 0x00007f78bd1a72cd in default_lookup (frame=0x7f7878140420, this=<optimized out>, loc=0x7f77f77fd880,
#12039 0x00007f78bd1647a6 in syncop_lookup (subvol=subvol@entry=0x7f750b919ee0, loc=loc@entry=0x7f77f77fd880,
#12040 0x00007f78ac14c5c6 in server_first_lookup (this=0x7f78a802b8a0, client=client@entry=0x7f7878007040,
#12041 0x00007f78ac14d231 in server_setvolume (req=0x7f78780017f0) at server-handshake.c:913
#12042 0x00007f78bcee9955 in rpcsvc_handle_rpc_call (svc=0x7f78a803f730, trans=trans@entry=0x7f7878000f70,
#12043 0x00007f78bcee9b3b in rpcsvc_notify (trans=0x7f7878000f70, mydata=<optimized out>, event=<optimized out>,
#12044 0x00007f78bceeba73 in rpc_transport_notify (this=this@entry=0x7f7878000f70,
#12045 0x00007f78b1d11566 in socket_event_poll_in (this=this@entry=0x7f7878000f70, notify_handled=<optimized out>)
#12046 0x00007f78b1d13b0c in socket_event_handler (fd=336, idx=59, gen=3640, data=0x7f7878000f70, poll_in=1,
#12047 0x00007f78bd1854c4 in event_dispatch_epoll_handler (event=0x7f77f77fde80, event_pool=0x559d67c74210)
#12048 event_dispatch_epoll_worker (data=0x7f787f2f31d0) at event-epoll.c:659
#12049 0x00007f78bbf86dd5 in start_thread (arg=0x7f77f77fe700) at pthread_create.c:308
#12050 0x00007f78bb84fb3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
```

```
(gdb) p *this
$3 = {name = 0x7f75c9d90480 "11-ollena-10-quota", type = 0x7f750b95cd70 "features/quota",
  instance_name = 0x0, next = 0x7f750b8ed100, prev = 0x7f750b919320, parents = 0x7f750b8fce60,
  children = 0x7f750b96f040, options = 0x0, dlhandle = 0x7f78a8027880, fops = 0x7f78ac9d10c0 <fops>,
  cbks = 0x7f78ac9d1060 <cbks>, dumpops = 0x7f78ac9d1000 <dumpops>,
  volume_options = {next = 0x7f750b8edd20, prev = 0x7f750b8edd20}, fini = 0x7f78ac7ca280 <fini>,
  init = 0x7f78ac7c9c60 <init>, reconfigure = 0x7f78ac7ca060 <reconfigure>,
  mem_acct_init = 0x7f78ac7c9be0 <mem_acct_init>, notify = 0x7f78bd1c7090 <default_notify>,
  loglevel = GF_LOG_NONE, client_latency = 0,
  latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 55 times>},
  history = 0x0, ctx = 0x559d67c3d010, graph = 0x7f75c9db5290, itable = 0x0,
  init_succeeded = 1 '\001', private = 0x0, mem_acct = 0x0, winds = 0, switched = 0 '\000',
  local_pool = 0x0, is_autoloaded = _gf_false, volfile_id = 0x0, xl_id = 17,
  cleanup_starting = 1, call_cleanup = 1}
(gdb)
```

We see this->private = 0. We initialize the private even when quota is disabled: it is allocated in init() and freed in fini(). Notice that cleanup_starting = 1 and call_cleanup = 1, which indicates that fini() was in progress.

So it looks like there was an in-flight quota_lookup while fini() was being invoked on the quota translator: fini() ran before the lookup and freed the private, causing a segfault when the lookup accessed it. I tried recreating this race with gdb as well as with code-level instrumentation, but I am not seeing any crash.
The backtrace here is very long. The reason is that within the logging function triggered on GF_ASSERT, we again assert using GF_ASSERT. We need to guard GF_ASSERT against such recursive asserts.

From the trace above, we see that it was called in the server_first_lookup() path, and during cleanup, as cleanup_starting = 1. The patch https://review.gluster.org/#/c/20427/4 addresses this issue by avoiding server_first_lookup() if cleanup has started. Find the relevant code snippet below:

```c
if (client->bound_xl->cleanup_starting) {
        op_ret = -1;
        op_errno = EAGAIN;
        ret = dict_set_str (reply, "ERROR",
                            "cleanup flag is set for xlator "
                            "before call first_lookup Try again later");
} else {
        op_ret = server_first_lookup (this, client, reply);
        if (op_ret == -1)
                op_errno = ENOENT;
}
```

Hence marking this bug as a duplicate of 1597627.

*** This bug has been marked as a duplicate of bug 1597627 ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607