Bug 1372806

Summary: bash stack smash due to longjmp back to wait_builtin after it returned
Product: Red Hat Enterprise Linux 7 Reporter: Paulo Andrade <pandrade>
Component: bashAssignee: Siteshwar Vashisht <svashisht>
Status: CLOSED ERRATA QA Contact: Martin Kyral <mkyral>
Severity: high Docs Contact:
Priority: urgent    
Version: 7.2CC: fkrska, hartsjc, isenfeld, jherrman, mkolaja, mkyral, ovasik, pandrade, svashisht
Target Milestone: rcKeywords: Patch, ZStream
Target Release: ---Flags: fkrska: needinfo+
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: bash-4.2.46-26.el7 Doc Type: Bug Fix
Doc Text:
Due to a bug in trap signal handling, bash in some cases terminated unexpectedly after performing the longjmp() function from the wait_sigint_handler() function to the wait_builtin() function. This update fixes the bug and thus prevents the described crashes from occurring.
Story Points: ---
Clone Of:
: 1384520 1384521 1391150 (view as bug list) Environment:
Last Closed: 2017-08-01 20:32:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1384520, 1384521, 1391150    
Attachments:
Description Flags
bash-4.2-sfdc01683190.patch
none
test.sh
none
kill.sh
none
bash-4.2-sfdc01683190.patch-complete
none
bash-4.2-sfdc01683190.patch
none
bash-4.3-trapped-signals.patch none

Description Paulo Andrade 2016-09-02 18:26:02 UTC
Created attachment 1197263 [details]
bash-4.2-sfdc01683190.patch

On special conditions it is possible that bash will
jump back to wait_builtin after having returned from it.

  An user can easily reproduce it on their environment,
but it is very difficult to catch it for example in
gdb, as attaching gdb will modify timings and no longer
trigger the issue.

Stack looks like this:
#0  0x00007ffff76215f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007ffff76215f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff7622e28 in __GI_abort () at abort.c:119
#2  0x00007ffff7661317 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7ffff77696f4 "*** %s ***: %s terminated\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007ffff76f9b37 in __GI___fortify_fail (msg=msg@entry=0x7ffff77696dc "stack smashing detected") at fortify_fail.c:31
#4  0x00007ffff76f9b00 in __stack_chk_fail () at stack_chk_fail.c:28
#5  0x00000000004796d0 in wait_builtin (list=<optimized out>) at ./wait.def:186
#6  0x0000000000000000 in ?? ()

And stack protector message looks like:
*** stack smashing detected ***: /bin/bash terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7ffff76f9b37]
/lib64/libc.so.6(__fortify_fail+0x0)[0x7ffff76f9b00]
/bin/bash[0x4796d0]
======= Memory map: ========
00400000-004dd000 r-xp 00000000 fd:00 313150                             /usr/bin/bash
006dc000-006dd000 r--p 000dc000 fd:00 313150                             /usr/bin/bash
006dd000-006e6000 rw-p 000dd000 fd:00 313150                             /usr/bin/bash
006e6000-0080d000 rw-p 00000000 00:00 0                                  [heap]
7ffff0eaf000-7ffff0ec4000 r-xp 00000000 fd:00 354651                     /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7ffff0ec4000-7ffff10c3000 ---p 00015000 fd:00 354651                     /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7ffff10c3000-7ffff10c4000 r--p 00014000 fd:00 354651                     /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7ffff10c4000-7ffff10c5000 rw-p 00015000 fd:00 354651                     /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7ffff10c5000-7ffff75ec000 r--p 00000000 fd:00 442477                     /usr/lib/locale/locale-archive
7ffff75ec000-7ffff77a2000 r-xp 00000000 fd:00 354740                     /usr/lib64/libc-2.17.so
7ffff77a2000-7ffff79a2000 ---p 001b6000 fd:00 354740                     /usr/lib64/libc-2.17.so
7ffff79a2000-7ffff79a6000 r--p 001b6000 fd:00 354740                     /usr/lib64/libc-2.17.so
7ffff79a6000-7ffff79a8000 rw-p 001ba000 fd:00 354740                     /usr/lib64/libc-2.17.so
7ffff79a8000-7ffff79ad000 rw-p 00000000 00:00 0
7ffff79ad000-7ffff79b0000 r-xp 00000000 fd:00 354184                     /usr/lib64/libdl-2.17.so
7ffff79b0000-7ffff7baf000 ---p 00003000 fd:00 354184                     /usr/lib64/libdl-2.17.so
7ffff7baf000-7ffff7bb0000 r--p 00002000 fd:00 354184                     /usr/lib64/libdl-2.17.so
7ffff7bb0000-7ffff7bb1000 rw-p 00003000 fd:00 354184                     /usr/lib64/libdl-2.17.so
7ffff7bb1000-7ffff7bd6000 r-xp 00000000 fd:00 353667                     /usr/lib64/libtinfo.so.5.9
7ffff7bd6000-7ffff7dd6000 ---p 00025000 fd:00 353667                     /usr/lib64/libtinfo.so.5.9
7ffff7dd6000-7ffff7dda000 r--p 00025000 fd:00 353667                     /usr/lib64/libtinfo.so.5.9
7ffff7dda000-7ffff7ddb000 rw-p 00029000 fd:00 353667                     /usr/lib64/libtinfo.so.5.9
7ffff7ddb000-7ffff7dfc000 r-xp 00000000 fd:00 354581                     /usr/lib64/ld-2.17.so
7ffff7fa9000-7ffff7fde000 r--s 00000000 fd:00 345136                     /var/db/nscd/passwd
7ffff7fde000-7ffff7fe1000 rw-p 00000000 00:00 0
7ffff7ff1000-7ffff7ff2000 rw-p 00000000 00:00 0
7ffff7ff2000-7ffff7ff9000 r--s 00000000 fd:00 353531                     /usr/lib64/gconv/gconv-modules.cache
7ffff7ff9000-7ffff7ffa000 rw-p 00000000 00:00 0
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 00:00 0                          [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00021000 fd:00 354581                     /usr/lib64/ld-2.17.so
7ffff7ffd000-7ffff7ffe000 rw-p 00022000 fd:00 354581                     /usr/lib64/ld-2.17.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

  The problem is caused because the global variable
this_shell_builtin may be set to wait_builtin for some
time that there may be concurrence, and the code in
jobs.c does not properly check the  value of the other
global, interrupt_immediately, that will be non zero
if the setjmp is still valid.

  Bash-4.3 should handle this condition, but the patch
is too large, so, the smallest possible patch is
proposed to correct the problem, and is confirmed to
correct it.
  Upstream change is in commit
ac50fbac377e32b98d2de396f016ea81e8ee9961
but it is too large and not exactly for the problem
described here.
  The proposed patch just adds a guard to not longjmp
back in wait_builtin if it has already returned.

Comment 6 Siteshwar Vashisht 2016-09-13 21:02:09 UTC
I am able to reliably reproduce this issue by following below steps (Scripts test.sh and kill.sh are attached with the bug):

1. Execute './test.sh'. It will print the process id of bash process that execute the script.
2. Execute './kill.sh <process id of bash process executing test.sh>'
3. kill.sh will continuously send SIGINT and SIGUSR1 signals to test.sh until it crashes.

I was not able to reproduce this issue on a single cpu RHEL 7.2 docker instance, however it reproduces very reliably on a system with 4 cpus.

I tried the test rpms from sfdc case 01683190, but bash still crashes. However it shows a different stacktrace :

(gdb) bt
#0  0x00007febd59f25f7 in ?? ()
#1  0x00007febd59f3ce8 in ?? ()
#2  0x0000000000000020 in ?? ()
#3  0x0000000000000000 in ?? ()

So it looks like the patch only partially works around the problem.

Comment 7 Siteshwar Vashisht 2016-09-13 21:03:21 UTC
Created attachment 1200619 [details]
test.sh

Comment 8 Siteshwar Vashisht 2016-09-13 21:03:52 UTC
Created attachment 1200620 [details]
kill.sh

Comment 9 Paulo Andrade 2016-09-14 12:23:36 UTC
Created attachment 1200814 [details]
bash-4.2-sfdc01683190.patch-complete

  Very good job Siteshwar :)
  My first patch, now attached, works, with your reproducer.
The patch in bash-4.2.46-20.el7.sfdc01683190.3

  The second patch was the one I added to the bugzilla,
because without a reproducer the second one would be more
likely to be accepted, the one in
bash-4.2.46-20.el7.sfdc01683190.4

  The earlier test packages were a libasan enabled bash
and another built without optimizations and with debug
information, to make it easier to debug the problem, but
neither did reproduce the issue.

  Note that the "new" patch is basically a backport of
the diff from bash 4.3, and to be really complete/correct
need to add the definition and calls to CHECK_WAIT_INTR,
but it starts to become too large of a patch :(

Comment 10 Siteshwar Vashisht 2016-09-16 06:34:13 UTC
As per my understanding changes in nojobs.c are not required as this file is not used. Is there any specific reason you have made those changes ?

Comment 11 Paulo Andrade 2016-09-16 12:32:08 UTC
The patch to nojobs.c is not required by Red Hat packages.

The change was intended to be a backport of the bash-4.3
diff, but as I said, the patch is not complete, so I
stopped when the diff would became too large without
a reproducer...

I tried to create a reproducer with a C program that
would spawn a bash that would execute a script and
keep sending signals to it, but tried to create a
"single shot" reproducer; a brute force one would have
been better as you found :)

I can remake the patch if you prefer, basically add
definition of CHECK_WAIT_INTR and calls to it, but the
patch will likely become twice as large.

Comment 15 Paulo Andrade 2016-09-19 15:12:01 UTC
Created attachment 1202537 [details]
bash-4.2-sfdc01683190.patch

For the provided test case, this patch is very close
to latest bash-4.3 and shows the same behavior.
Does not crash, but, because the test case runs in
a loop sending signals, it does eventually stack
overflow due to not handling the traps fast enough.

Comment 16 Paulo Andrade 2016-09-26 15:14:59 UTC
A variant of the minimal patch that does not crash is:

diff -up bash-4.2/jobs.c.orig bash-4.2/jobs.c
--- bash-4.2/jobs.c~	2016-09-26 10:07:28.117401802 -0300
+++ bash-4.2/jobs.c	2016-09-26 12:09:07.803979913 -0300
@@ -2244,10 +2244,14 @@ wait_sigint_handler (sig)
 	  signal_is_trapped (SIGINT) &&
 	  ((sigint_handler = trap_to_sighandler (SIGINT)) == trap_handler))
 	{
-	  interrupt_immediately = 0;
 	  trap_handler (SIGINT);	/* set pending_traps[SIGINT] */
 	  wait_signal_received = SIGINT;
-	  longjmp (wait_intr_buf, 1);
+	  if (interrupt_immediately)
+	    {
+	      interrupt_immediately = 0;
+	      longjmp (wait_intr_buf, 1);
+	    }
+	  SIGRETURN (0);
 	}
       
       ADDINTERRUPT;

Comment 20 Siteshwar Vashisht 2016-10-01 08:52:38 UTC
Created attachment 1206497 [details]
bash-4.3-trapped-signals.patch

Comment 37 errata-xmlrpc 2017-08-01 20:32:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1931