Bug 1293594 - Segmentation fault in '_Unwind_Backtrace ()'
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: glusterfs
Hardware: x86_64 Linux
Priority: unspecified  Severity: medium
Target Milestone: pre-dev-freeze
Target Release: ---
Assigned To: sankarshan
Marco Bill-Peter
Depends On:
Blocks: 1413146
Reported: 2015-12-22 05:16 EST by Soumya Koduri
Modified: 2018-06-28 10:26 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
build-install-core (13.20 MB, application/x-bzip)
2015-12-22 05:19 EST, Soumya Koduri

Description Soumya Koduri 2015-12-22 05:16:28 EST
Description of problem:

While running a few regression tests of GlusterFS, we occasionally hit a segmentation fault in libgcc with the backtrace below:

Program terminated with signal 11, Segmentation fault.
#0  0x00007f800426f867 in ?? () from ./lib64/libgcc_s.so.1
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6_5.1.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6_4.6.x86_64 libcom_err-1.41.12-18.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 openssl-1.0.1e-15.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00007f800426f867 in ?? () from ./lib64/libgcc_s.so.1
#1  0x00007f8004270119 in _Unwind_Backtrace () from ./lib64/libgcc_s.so.1
#2  0x00007f800fb46936 in backtrace () from ./lib64/libc.so.6
#3  0x00007f8010ee6f73 in _gf_msg_backtrace_nomem (level=GF_LOG_ALERT, stacksize=200)
    at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/logging.c:1090
#4  0x00007f8010eecd38 in gf_print_trace (signum=11, ctx=0xda7010)
    at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/common-utils.c:740
#5  0x00000000004098d6 in glusterfsd_print_trace (signum=11)
    at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/glusterfsd/src/glusterfsd.c:2033
#6  <signal handler called>
#7  0x00007f7fff1aa561 in ?? ()
#8  0x00007f80101c6a51 in start_thread () from ./lib64/libpthread.so.0
#9  0x00007f800fb3093d in clone () from ./lib64/libc.so.6

#readelf -s lib64/libgcc_s.so.1 | grep Unwind_Backtrace
    43: 00000000000100a0   183 FUNC    GLOBAL DEFAULT   12 _Unwind_Backtrace@@GCC_3.3

This is not consistently reproducible but often happens when there is another thread trying to do some cleanup and exit the process.

Any help in further debugging and resolving the issue would be appreciated.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. We hit the issue most often while running the test "https://github.com/gluster/glusterfs/blob/master/tests/bugs/snapshot/bug-1140162-file-snapshot-features-encrypt-opts-validation.t"

Actual results:
The process sometimes exits with a core dump (segmentation fault in libgcc).

Expected results:
The process should exit cleanly, without any panic or core dump.

Additional info:
Comment 1 Soumya Koduri 2015-12-22 05:19 EST
Created attachment 1108606
Comment 2 Soumya Koduri 2015-12-22 05:21:29 EST
I have attached the core and the installed libraries. To view the core, run the following command:

gdb -ex 'set sysroot ./' -ex 'core-file ./build/install/cores/core.10962' ./build/install/sbin/glusterfs

Comment 4 Marek Polacek 2015-12-22 06:14:46 EST
Not related to gcc-libraries.  I think installing missing debuginfos to see a more detailed backtrace would be a start.
Comment 5 Soumya Koduri 2015-12-22 12:41:44 EST
Thanks Marek. We ran into the issue while running the tests on slave machines using Jenkins. Unfortunately the machine is no longer in that state. I shall try to reproduce it with debuginfos installed and get back.
Comment 6 Soumya Koduri 2015-12-23 05:01:19 EST
I could reproduce the issue. Please find the backtrace below -

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'.
Program terminated with signal 11, Segmentation fault.
#0  x86_64_fallback_frame_state (context=0x7f89be09db90, fs=0x7f89be09da10)
    at ../../../gcc/config/i386/linux-unwind.h:47
47	  if (*(unsigned char *)(pc+0) == 0x48
Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-3.7.1-11.el6rhs.x86_64
(gdb) bt
#0  x86_64_fallback_frame_state (context=0x7f89be09db90, fs=0x7f89be09da10)
    at ../../../gcc/config/i386/linux-unwind.h:47
#1  uw_frame_state_for (context=0x7f89be09db90, fs=0x7f89be09da10) at ../../../gcc/unwind-dw2.c:1210
#2  0x00007f89c58c4119 in _Unwind_Backtrace (trace=0x7f89d1a2b7d0 <backtrace_helper>, trace_argument=0x7f89be09dcd0)
    at ../../../gcc/unwind.inc:290
#3  0x00007f89d1a2b966 in backtrace () from /lib64/libc.so.6
#4  0x00007f89d2fc08e6 in _gf_msg_backtrace_nomem () from /usr/lib64/libglusterfs.so.0
#5  0x00007f89d2fe04af in gf_print_trace () from /usr/lib64/libglusterfs.so.0
#6  <signal handler called>
#7  0x00007f89c4852aa0 in ?? ()
#8  0x00007f89d20aba51 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f89d1a1596d in clone () from /lib64/libc.so.6
(gdb) f 1
#1  uw_frame_state_for (context=0x7f89be09db90, fs=0x7f89be09da10) at ../../../gcc/unwind-dw2.c:1210
1210	      return MD_FALLBACK_FRAME_STATE_FOR (context, fs);
(gdb) f 0
#0  x86_64_fallback_frame_state (context=0x7f89be09db90, fs=0x7f89be09da10)
    at ../../../gcc/config/i386/linux-unwind.h:47
47	  if (*(unsigned char *)(pc+0) == 0x48
(gdb) l
42	  unsigned char *pc = context->ra;
43	  struct sigcontext *sc;
44	  long new_cfa;
46	  /* movq __NR_rt_sigreturn, %rax ; syscall  */
47	  if (*(unsigned char *)(pc+0) == 0x48
48	      && *(unsigned long *)(pc+1) == 0x050f0000000fc0c7)
49	    {
50	      struct ucontext *uc_ = context->cfa;
51	      /* The void * cast is necessary to avoid an aliasing warning.
(gdb) l
52	         The aliasing warning is correct, but should not be a problem
53	         because it does not alias anything.  */
54	      sc = (struct sigcontext *) (void *) &uc_->uc_mcontext;
55	    }
56	  else
57	    return _URC_END_OF_STACK;
59	  new_cfa = sc->rsp;
60	  fs->regs.cfa_how = CFA_REG_OFFSET;
61	  /* Register 7 is rsp  */
Comment 7 Jeff Law 2016-06-20 17:03:50 EDT
It looks like the contents of the *context structure are bogus.  

(gdb) p/xx *context
$4 = {reg = {0x7f7ffea29ad0, 0x7f7ffea29ac8, 0x7f7ffea29ad8, 0x7f7ffea29ac0, 0x7f7ffea29ab0, 0x7f7ffea29aa8, 0x7f7ffea29ab8, 0x7f7ffea29ae0, 0x7f7ffea29a68, 0x7f7ffea29a70, 0x7f7ffea29a78, 0x7f7ffea29a80, 0x7f7ffea29a88, 
    0x7f7ffea29a90, 0x7f7ffea29a98, 0x7f7ffea29aa0, 0x7f7ffea29ae8, 0x0}, cfa = 0x7f7ffea29eb8, ra = 0x7f7fff1aa561, lsda = 0x0, bases = {tbase = 0x0, dbase = 0x0, func = 0x7f800fa7a69f}, flags = 0xc000000000000000, version = 0x0, 
  args_size = 0x0, by_value = {0x0 <repeats 18 times>}}
(gdb) p/x context->ra
$5 = 0x7f7fff1aa561

But there's nothing mapped at that address:

(gdb) x/x $5
0x7f7fff1aa561: Cannot access memory at address 0x7f7fff1aa561

In the caller we have:

1202      fde = _Unwind_Find_FDE (context->ra + _Unwind_IsSignalFrame (context) - 1,
1203                              &context->bases);
1204      if (fde == NULL)
1205        {
1206    #ifdef MD_FALLBACK_FRAME_STATE_FOR
1207          /* Couldn't find frame unwind info for this function.  Try a
1208             target-specific fallback mechanism.  This will necessarily
1209             not provide a personality routine or LSDA.  */
1210          return MD_FALLBACK_FRAME_STATE_FOR (context, fs);
1211    #else
1212          return _URC_END_OF_STACK;
1213    #endif
1214        }

The FDE is NULL, which essentially says we couldn't find frame unwind information for the given context->ra address.  So that address is already suspect.  Then x86_64_fallback_frame_state does:

42        unsigned char *pc = context->ra;
43        struct sigcontext *sc;
44        long new_cfa;
46        /* movq __NR_rt_sigreturn, %rax ; syscall  */ 
47        if (*(unsigned char *)(pc+0) == 0x48
48            && *(unsigned long *)(pc+1) == 0x050f0000000fc0c7)

Which is just dumb.  We have no idea why we didn't find the FDE in the caller and no guarantee that *pc is a valid memory location.


Touches on this issue.

At some level I suspect we've got something bogus in the frame chains.  But x86_64_fallback_frame_state simply can't do what it's trying to do without being more careful.
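
For reference, the two comparisons on lines 47-48 test for a nine-byte instruction sequence. The sketch below spells those bytes out. It is only an illustration: the names rt_sigreturn_code and looks_like_rt_sigreturn are hypothetical and are not libgcc code, and like the original check it assumes the nine bytes at pc are readable.

```c
#include <string.h>

/* x86-64 signal return trampoline (what glibc calls __restore_rt):
 *   48 c7 c0 0f 00 00 00   movq $15, %rax   (15 is __NR_rt_sigreturn)
 *   0f 05                  syscall
 * Read as a little-endian 64-bit word, bytes 1..8 are 0x050f0000000fc0c7,
 * which is the constant compared on line 48 of linux-unwind.h. */
static const unsigned char rt_sigreturn_code[9] = {
    0x48, 0xc7, 0xc0, 0x0f, 0x00, 0x00, 0x00, 0x0f, 0x05
};

/* Hypothetical helper: returns 1 if the nine bytes at pc are the trampoline.
 * The caller must already know that pc[0..8] is readable; nothing here makes
 * the dereference safe. */
static int looks_like_rt_sigreturn(const unsigned char *pc)
{
    return memcmp(pc, rt_sigreturn_code, sizeof rt_sigreturn_code) == 0;
}
```

With context->ra pointing at unmapped memory, as in this core, any such comparison faults before it can return a result.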
Comment 8 Jakub Jelinek 2016-10-03 11:22:29 EDT
Testing whether the memory is readable, with mincore or some other syscall that fails with EFAULT, could avoid the crashes in some cases.  But in general, if other threads are doing bogus things like unmapping memory, there will always be a window in which the memory can be unmapped between the test and the read.  I think it is more important to find out which code has the wrong unwind info that led to this, or whether the program is just unmapping memory that is still in use.
Comment 9 Jakub Jelinek 2016-10-18 04:01:08 EDT
Perhaps a better approach would be a configure check (or a configure option?) that tells libgcc that it just shouldn't define MD_FALLBACK_FRAME_STATE_FOR.
E.g. on x86_64-linux, I think one needs glibc >= 2006-11-29, and on i386-linux a similar glibc and a kernel >= 2006-03-31.  In particular, to drop MD_FALLBACK_FRAME_STATE_FOR we'd need to be sure that glibc and/or the kernel, whenever they define __restore_rt or similar sequences in libc or the vDSO, provide unwind info for them; to additionally drop MD_FROB_UPDATE_CONTEXT, that unwind info must use the "zRS" CIE flags.
Comment 13 Marek Polacek 2018-04-30 11:26:42 EDT
Is there a way I could try to reproduce this in RHEL 7?  How do I run the bug-1140162-file-snapshot-features-encrypt-opts-validation.t test?
Comment 15 Florian Weimer 2018-06-28 08:27:13 EDT
(In reply to Soumya Koduri from comment #0)
> #5  0x00000000004098d6 in glusterfsd_print_trace (signum=11)
>     at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/glusterfsd/src/glusterfsd.c:2033
> #6  <signal handler called>
> #7  0x00007f7fff1aa561 in ?? ()
> #8  0x00007f80101c6a51 in start_thread () from ./lib64/libpthread.so.0
> #9  0x00007f800fb3093d in clone () from ./lib64/libc.so.6
> (gdb)

I think we should consider the root cause to be the crash *before* the crash handler is run.  I assume that the signal delivered here is SIGSEGV as well.  The address 0x00007f7fff1aa561 looks very much like a shared object address (without randomization), so the stack should be completely valid and parsed correctly by GDB.  It is likely that the code segment has been unmapped by a concurrent dlclose, while some thread was still running that very code.  Even if we could magically fix the backtracer, there would still be a crash here.  You need to find that concurrent dlclose and fix that.

To verify this theory, you should run “info files” after initialization, but before termination, and keep a note of the shared objects listed there.  The address that subsequently faults should be in one of the DSOs that is subject to dlclose.
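
The same address ranges can also be read from inside the process on Linux via /proc/self/maps. A rough sketch, assuming the usual maps line format; mapping_for_address is a hypothetical helper name and is not part of glusterfs or GDB.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: look up addr in /proc/self/maps.  Returns 1 if a
 * mapping covers addr and copies its backing path into path (an empty
 * string for anonymous mappings); returns 0 if nothing is mapped there. */
static int mapping_for_address(unsigned long addr, char *path, size_t path_len)
{
    char line[4096];
    int found = 0;
    FILE *maps = fopen("/proc/self/maps", "r");

    if (maps == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, maps) != NULL) {
        unsigned long start, end;
        int name_off = 0;

        /* Each line reads: start-end perms offset dev inode [path] */
        if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %n",
                   &start, &end, &name_off) < 2)
            continue;
        if (addr < start || addr >= end)
            continue;
        if (path != NULL && path_len > 0) {
            snprintf(path, path_len, "%s", name_off > 0 ? line + name_off : "");
            path[strcspn(path, "\n")] = '\0';   /* drop the trailing newline */
        }
        found = 1;
    }
    fclose(maps);
    return found;
}
```

Recording these ranges while the shared objects are still loaded gives something to compare the faulting address against after a dlclose has removed the mapping.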

Regarding the crash handler itself: Nowadays, it is generally best to avoid custom crash handlers and let ABRT/systemd-coredump do the job of capturing debugging information.  Writing good crash handlers is very hard, and these crash handlers tend to destroy useful information.  (I realize that this bug was filed several years ago.)
Comment 16 Jeff Law 2018-06-28 10:26:04 EDT
It seems to me like this really needs to be reassigned back to the gluster team for further analysis on their end.
