Bug 1112306
| Summary: | ksh crashes because job locking mechanism does not survive compiler optimization | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Marcel Kolaja <mkolaja> |
| Component: | ksh | Assignee: | Michal Hlavinka <mhlavink> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Kyral <mkyral> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.5 | CC: | asanders, fkrska, jherrman, jstephen, martijn, mkyral, ooprala, ovasik, rmarti |
| Target Milestone: | rc | Keywords: | Patch, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ksh-20120801-18.el6 | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, compiler optimization dropped parts of the ksh job locking mechanism from the binary code. As a consequence, ksh could terminate unexpectedly with a segmentation fault after it received the SIGCHLD signal. This update implements a fix that ensures the compiler does not drop parts of this mechanism, and the crash no longer occurs. | Story Points: | --- |
| Clone Of: | | | |
| : | 1116506 1123467 1160749 (view as bug list) | Environment: | |
| Last Closed: | 2014-10-14 07:07:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 994246, 1075802, 1116506, 1123467, 1160749 | | |
Description (Marcel Kolaja, 2014-06-23 14:23:45 UTC)
When SIGCHLD was received, we were in job_subsave():
(gdb) f 3
#3 job_subsave () at /usr/src/debug/ksh-20120801/src/cmd/ksh93/sh/jobs.c:1990
1990 bp->prev = bck.prev;
(gdb)
Before SIGCHLD was received, job_lock() had been called. This is a simple macro:
#define job_lock() (job.in_critical++)
It is, therefore, weird that job.in_critical was 0 at that point:
(gdb) p job.in_critical
$1 = 0
(gdb)
This led to the signal handler job_waitsafe() calling job_reap() inside the critical section instead of deferring it:
static void job_waitsafe(int sig)
{
if(job.in_critical || vmbusy())
{
job.savesig = sig;
job.waitsafe++;
}
else
job_reap(sig);
}
We need to find out why job.in_critical was 0 between the job_lock() and job_unlock() calls in job_subsave().
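For orientation, the locking here is a classic defer-until-unlock guard around a signal handler: the handler reaps immediately only when the guard counter is zero, otherwise it records the signal and the reap runs when the guard drops back to zero. A minimal, self-contained sketch of that protocol (not the ksh source; the names guard, pending, reap_children and critical_update are made up here, and the counters are declared volatile sig_atomic_t for async-signal safety):

#include <signal.h>

/* Hypothetical stand-ins for job.in_critical, job.savesig and job_reap(). */
static volatile sig_atomic_t guard;     /* > 0 while inside a critical section */
static volatile sig_atomic_t pending;   /* signal remembered for later */

static void reap_children(int sig)
{
	/* ... waitpid() loop updating the job bookkeeping ... */
	(void)sig;
}

static void on_sigchld(int sig)         /* plays the role of job_waitsafe() */
{
	if (guard)
		pending = sig;          /* defer until the critical section ends */
	else
		reap_children(sig);     /* safe to reap right away */
}

static void critical_update(void)       /* plays the role of job_subsave() */
{
	guard++;                        /* like job_lock() */
	/* ... mutate the shared bck/job state ... */
	if (--guard == 0 && pending)    /* like job_unlock() */
	{
		int sig = pending;
		pending = 0;
		reap_children(sig);     /* run the deferred reap */
	}
}

If the increment never makes it into the generated code, the handler sees the guard at zero and reaps in the middle of the update, which is exactly the crash observed above.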
It seems that gcc optimizes out the locking mechanism:
│1985 void *job_subsave(void)
│1986 {
│1987 struct back_save *bp = new_of(struct back_save,0);
│1988 job_lock();
│1989 *bp = bck;
│1990 bp->prev = bck.prev;
│1991 bck.count = 0;
│1992 bck.list = 0;
│1993 bck.prev = bp;
│1994 job_unlock();
│1995 return((void*)bp);
│1996 }
where
#define job_lock() (job.in_critical++)
#define job_unlock() \
do { \
int sig; \
if (!--job.in_critical && (sig = job.savesig)) \
{ \
if (!job.in_critical++ && !vmbusy()) \
job_reap(sig); \
job.in_critical--; \
} \
} while(0)
results in:
│0x429801 <job_subsave+1> mov $0x18,%edi
│0x429806 <job_subsave+6> callq 0x405de0 <malloc@plt>
│0x42980b <job_subsave+11> mov 0x343e9e(%rip),%rdx # 0x76d6b0 <bck>
│0x429812 <job_subsave+18> mov %rax,%rbx
│0x429815 <job_subsave+21> mov 0x343bed(%rip),%eax # 0x76d408 <job+40>
│0x42981b <job_subsave+27> movl $0x0,0x343e8b(%rip) # 0x76d6b0 <bck>
│0x429825 <job_subsave+37> mov %rdx,(%rbx)
│0x429828 <job_subsave+40> mov 0x343e89(%rip),%rdx # 0x76d6b8 <bck+8>
│0x42982f <job_subsave+47> test %eax,%eax
│0x429831 <job_subsave+49> movq $0x0,0x343e7c(%rip) # 0x76d6b8 <bck+8>
│0x42983c <job_subsave+60> mov %rdx,0x8(%rbx)
│0x429840 <job_subsave+64> mov 0x343e79(%rip),%rdx # 0x76d6c0 <bck+16>
│0x429847 <job_subsave+71> mov %rbx,0x343e72(%rip) # 0x76d6c0 <bck+16>
│0x42984e <job_subsave+78> mov %rdx,0x10(%rbx)
│0x429852 <job_subsave+82> jne 0x429887 <job_subsave+135>
│0x429854 <job_subsave+84> mov 0x343bb2(%rip),%edi # 0x76d40c <job+44>
│0x42985a <job_subsave+90> test %edi,%edi
│0x42985c <job_subsave+92> je 0x429887 <job_subsave+135>
│0x42985e <job_subsave+94> mov 0x341ca3(%rip),%rax # 0x76b508 <Vmregion>
│0x429865 <job_subsave+101> movl $0x1,0x343b99(%rip) # 0x76d408 <job+40>
│0x42986f <job_subsave+111> mov 0x60(%rax),%rdx
│0x429873 <job_subsave+115> mov $0x1,%eax
│0x429878 <job_subsave+120> mov (%rdx),%esi
│0x42987a <job_subsave+122> test %esi,%esi
│0x42987c <job_subsave+124> je 0x429890 <job_subsave+144>
│0x42987e <job_subsave+126> sub $0x1,%eax
│0x429881 <job_subsave+129> mov %eax,0x343b81(%rip) # 0x76d408 <job+40>
│0x429887 <job_subsave+135> mov %rbx,%rax
│0x42988a <job_subsave+138> pop %rbx
│0x42988b <job_subsave+139> retq
│0x42988c <job_subsave+140> nopl 0x0(%rax)
│0x429890 <job_subsave+144> callq 0x428d60 <job_reap>
│0x429895 <job_subsave+149> mov 0x343b6d(%rip),%eax # 0x76d408 <job+40>
│0x42989b <job_subsave+155> jmp 0x42987e <job_subsave+126>
So it neither increments nor decrements the in_critical lock.
Can't reproduce this, but it is a bug anyway.
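The fix shipped downstream for this bug (ksh-20120801-locking.patch, referenced in the comments and commit message below) made the guard updates atomic with GCC's legacy __sync builtins so they cannot be optimized away. The exact downstream patch text is not quoted in this report, so the following is only a sketch of that approach:

/* Sketch only; the real ksh-20120801-locking.patch is not shown here.
 * Atomic builtins force GCC to emit "lock addl"/"lock xadd"/"lock subl"
 * for the guard instead of dropping the increment/decrement pair. */
#define job_lock()	(__sync_fetch_and_add(&job.in_critical,1))
#define job_unlock() \
	do { \
		int	sig; \
		if(!__sync_sub_and_fetch(&job.in_critical,1) && (sig = job.savesig)) \
		{ \
			if(!__sync_fetch_and_add(&job.in_critical,1) && !vmbusy()) \
				job_reap(sig); \
			__sync_fetch_and_sub(&job.in_critical,1); \
		} \
	} while(0)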
*** Bug 1125326 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1381.html

This is due to an optimiser bug in GCC that persists to this day. As part of ksh 93u+m development, Johnothan King found a more portable workaround. Undo ksh-20120801-locking.patch, then apply this:
https://github.com/ksh93/ksh/commit/c258a04f

commit c258a04f7abebc97a9dad22b4ab7846264137cd6
Author: Johnothan King
Date:   Tue Jun 16 14:44:02 2020 -0700

    Implement a portable fix for SIGCHLD crashes (#18)

    As previously reported in rhbz#1112306
    (https://bugzilla.redhat.com/show_bug.cgi?id=1112306), ksh may crash
    when receiving SIGCHLD because GCC's optimizer will fail to generate
    `addl` and `sub` instructions to increment and decrement
    `job.in_critical` in the `job_subsave` function. This bug *does* occur
    in GCC 10 with `-O2`, but not `-O1`; it doesn't appear this bug has
    been fixed.

    As a reference, here is the relevant debug assembly output of
    `job_subsave` when KSH is compiled with `CCFLAGS` set to `-g -O1`:

    0000000000034c97 <job_subsave>:
    void *job_subsave(void)
    {
       34c97:  53                      push   %rbx
        struct back_save *bp = new_of(struct back_save,0);
       34c98:  bf 18 00 00 00          mov    $0x18,%edi
       34c9d:  e8 34 4a 0a 00          callq  d96d6 <_ast_malloc>
       34ca2:  48 89 c3                mov    %rax,%rbx
        job_lock();
       34ca5:  83 05 3c 50 13 00 01    addl   $0x1,0x13503c(%rip)    # 169ce8 <job+0x28>
        *bp = bck;
       34cac:  66 0f 6f 05 4c 5a 13    movdqa 0x135a4c(%rip),%xmm0   # 16a700 <bck>
       34cb3:  00
       34cb4:  0f 11 00                movups %xmm0,(%rax)
       34cb7:  48 8b 05 52 5a 13 00    mov    0x135a52(%rip),%rax    # 16a710 <bck+0x10>
       34cbe:  48 89 43 10             mov    %rax,0x10(%rbx)
        bp->prev = bck.prev;
       34cc2:  48 8b 05 47 5a 13 00    mov    0x135a47(%rip),%rax    # 16a710 <bck+0x10>
       34cc9:  48 89 43 10             mov    %rax,0x10(%rbx)
        bck.count = 0;
       34ccd:  c7 05 29 5a 13 00 00    movl   $0x0,0x135a29(%rip)    # 16a700 <bck>
       34cd4:  00 00 00
        bck.list = 0;
       34cd7:  48 c7 05 26 5a 13 00    movq   $0x0,0x135a26(%rip)    # 16a708 <bck+0x8>
       34cde:  00 00 00 00
        bck.prev = bp;
       34ce2:  48 89 1d 27 5a 13 00    mov    %rbx,0x135a27(%rip)    # 16a710 <bck+0x10>
        job_unlock();
       34ce9:  8b 05 f9 4f 13 00       mov    0x134ff9(%rip),%eax    # 169ce8 <job+0x28>
       34cef:  83 e8 01                sub    $0x1,%eax
       34cf2:  89 05 f0 4f 13 00       mov    %eax,0x134ff0(%rip)    # 169ce8 <job+0x28>
       34cf8:  75 2b                   jne    34d25 <job_subsave+0x8e>
       34cfa:  8b 3d ec 4f 13 00       mov    0x134fec(%rip),%edi    # 169cec <job+0x2c>
       34d00:  85 ff                   test   %edi,%edi
       34d02:  74 21                   je     34d25 <job_subsave+0x8e>
       34d04:  c7 05 da 4f 13 00 01    movl   $0x1,0x134fda(%rip)    # 169ce8 <job+0x28>

    When `-O2` is used instead of `-O1`, the `addl` and `sub` instructions
    for incrementing and decrementing the lock are removed. GCC instead
    generates a broken `mov` instruction for `job_lock` and removes the
    initial `sub` instruction in job_unlock (this is also seen in Red
    Hat's bug report):

        job_lock();
        *bp = bck;
       37d7c:  66 0f 6f 05 7c 79 14    movdqa 0x14797c(%rip),%xmm0   # 17f700 <bck>
       37d83:  00
        struct back_save *bp = new_of(struct back_save,0);
       37d84:  49 89 c4                mov    %rax,%r12
        job_lock();
       37d87:  8b 05 5b 6f 14 00       mov    0x146f5b(%rip),%eax    # 17ece8 <job+0x28>
    ...
        job_unlock();
       37dc6:  89 05 1c 6f 14 00       mov    %eax,0x146f1c(%rip)    # 17ece8 <job+0x28>
       37dcc:  85 c0                   test   %eax,%eax
       37dce:  75 2b                   jne    37dfb <job_subsave+0x8b>

    The original patch works around this bug by using the legacy
    `__sync_fetch_and_add/sub` GCC builtins. This forces GCC to generate
    instructions that change the lock with `lock addl`, `lock xadd` and
    `lock subl`:

        job_lock();
       37d9f:  f0 83 05 41 6f 14 00    lock addl $0x1,0x146f41(%rip) # 17ece8 <job+0x28>
       37da6:  01
    ...
        job_unlock();
       37deb:  f0 0f c1 05 f5 6e 14    lock xadd %eax,0x146ef5(%rip) # 17ece8 <job+0x28>
       37df2:  00
       37df3:  83 f8 01                cmp    $0x1,%eax
       37df6:  74 08                   je     37e00 <job_subsave+0x70>
    ...
       37e25:  74 11                   je     37e38 <job_subsave+0xa8>
       37e27:  f0 83 2d b9 6e 14 00    lock subl $0x1,0x146eb9(%rip) # 17ece8 <job+0x28>

    While this does work, it isn't portable. This patch implements a
    different workaround for this compiler bug. If `job_lock` is put at
    the beginning of `job_subsave`, GCC will generate the required `addl`
    and `sub` instructions:

        job_lock();
       37d67:  83 05 7a 5f 14 00 01    addl   $0x1,0x145f7a(%rip)    # 17dce8 <job+0x28>
    ...
        job_unlock();
       37dbb:  83 e8 01                sub    $0x1,%eax
       37dbe:  89 05 24 5f 14 00       mov    %eax,0x145f24(%rip)    # 17dce8 <job+0x28>

    It is odd that moving a single line of code fixes this problem,
    although GCC _should_ have generated these instructions from the
    original code anyway. I'll note that this isn't the only way to get
    these instructions to generate. The problem also seems to go away when
    inserting almost anything else inside of the code for `job_subsave`.
    This is just a simple workaround for a strange compiler bug.

diff --git a/src/cmd/ksh93/sh/jobs.c b/src/cmd/ksh93/sh/jobs.c
index 9b852b0..d9fcf1f 100644
--- a/src/cmd/ksh93/sh/jobs.c
+++ b/src/cmd/ksh93/sh/jobs.c
@@ -1973,8 +1973,14 @@ again:
 void *job_subsave(void)
 {
-	struct back_save *bp = new_of(struct back_save,0);
+	/*
+	 * We must make a lock first before doing anything else,
+	 * otherwise GCC will remove the job locking mechanism
+	 * as a result of compiler optimization.
+	 */
 	job_lock();
+
+	struct back_save *bp = new_of(struct back_save,0);
 	*bp = bck;
 	bp->prev = bck.prev;
 	bck.count = 0;
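For reference, this is how job_subsave() reads once that commit is applied, reconstructed from the function quoted earlier in this report plus the diff above (not copied verbatim from the 93u+m tree):

void *job_subsave(void)
{
	/*
	 * We must make a lock first before doing anything else,
	 * otherwise GCC will remove the job locking mechanism
	 * as a result of compiler optimization.
	 */
	job_lock();

	struct back_save *bp = new_of(struct back_save,0);
	*bp = bck;
	bp->prev = bck.prev;
	bck.count = 0;
	bck.list = 0;
	bck.prev = bp;
	job_unlock();
	return((void*)bp);
}

The only change is the order of the first two statements; with job_lock() first, the compiled function keeps the addl and sub instructions on job.in_critical, so job_waitsafe() defers job_reap() as intended.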