Bug 1112306
| Summary: | ksh crashes because job locking mechanism does not survive compiler optimization | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Marcel Kolaja <mkolaja> |
| Component: | ksh | Assignee: | Michal Hlavinka <mhlavink> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Kyral <mkyral> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.5 | CC: | asanders, fkrska, jherrman, jstephen, martijn, mkyral, ooprala, ovasik, rmarti |
| Target Milestone: | rc | Keywords: | Patch, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ksh-20120801-18.el6 | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, compiler optimization dropped parts of the ksh job locking mechanism from the binary code. As a consequence, ksh could terminate unexpectedly with a segmentation fault after it received the SIGCHLD signal. This update implements a fix that ensures the compiler does not drop parts of this mechanism, and the crash no longer occurs. | Story Points: | --- |
| Clone Of: | | | |
| : | 1116506 1123467 1160749 (view as bug list) | Environment: | |
| Last Closed: | 2014-10-14 07:07:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 994246, 1075802, 1116506, 1123467, 1160749 | | |
Description (Marcel Kolaja, 2014-06-23 14:23:45 UTC)
When SIGCHLD was received, we were in job_subsave():
(gdb) f 3
#3 job_subsave () at /usr/src/debug/ksh-20120801/src/cmd/ksh93/sh/jobs.c:1990
1990 bp->prev = bck.prev;
(gdb)
Before SIGCHLD was received, job_lock() had been called. This is a simple macro:
#define job_lock() (job.in_critical++)
It is, therefore, weird that job.in_critical was 0 at that point:
(gdb) p job.in_critical
$1 = 0
(gdb)
This led to the signal handler job_waitsafe() calling job_reap() inside the critical section instead of deferring it:
static void job_waitsafe(int sig)
{
if(job.in_critical || vmbusy())
{
job.savesig = sig;
job.waitsafe++;
}
else
job_reap(sig);
}
We need to find out why job.in_critical was 0 between the job_lock() and job_unlock() calls in job_subsave().
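For orientation, the locking here is a classic defer-until-unlock guard around a signal handler: the handler reaps immediately only when the guard counter is zero, otherwise it records the signal and the reap runs when the guard drops back to zero. A minimal, self-contained sketch of that protocol (not the ksh source; the names guard, pending, reap_children and critical_update are made up here, and the counters are declared volatile sig_atomic_t for async-signal safety):

#include <signal.h>

/* Hypothetical stand-ins for job.in_critical, job.savesig and job_reap(). */
static volatile sig_atomic_t guard;     /* > 0 while inside a critical section */
static volatile sig_atomic_t pending;   /* signal remembered for later */

static void reap_children(int sig)
{
	/* ... waitpid() loop updating the job bookkeeping ... */
	(void)sig;
}

static void on_sigchld(int sig)         /* plays the role of job_waitsafe() */
{
	if (guard)
		pending = sig;          /* defer until the critical section ends */
	else
		reap_children(sig);     /* safe to reap right away */
}

static void critical_update(void)       /* plays the role of job_subsave() */
{
	guard++;                        /* like job_lock() */
	/* ... mutate the shared bck/job state ... */
	if (--guard == 0 && pending)    /* like job_unlock() */
	{
		int sig = pending;
		pending = 0;
		reap_children(sig);     /* run the deferred reap */
	}
}

If the increment never makes it into the generated code, the handler sees the guard at zero and reaps in the middle of the update, which is exactly the crash observed above.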
It seems that gcc optimizes out the locking mechanism:
│1985 void *job_subsave(void)
│1986 {
│1987 struct back_save *bp = new_of(struct back_save,0);
│1988 job_lock();
│1989 *bp = bck;
│1990 bp->prev = bck.prev;
│1991 bck.count = 0;
│1992 bck.list = 0;
│1993 bck.prev = bp;
│1994 job_unlock();
│1995 return((void*)bp);
│1996 }
where
#define job_lock() (job.in_critical++)
#define job_unlock() \
do { \
int sig; \
if (!--job.in_critical && (sig = job.savesig)) \
{ \
if (!job.in_critical++ && !vmbusy()) \
job_reap(sig); \
job.in_critical--; \
} \
} while(0)
results in:
│0x429801 <job_subsave+1> mov $0x18,%edi
│0x429806 <job_subsave+6> callq 0x405de0 <malloc@plt>
│0x42980b <job_subsave+11> mov 0x343e9e(%rip),%rdx # 0x76d6b0 <bck>
│0x429812 <job_subsave+18> mov %rax,%rbx
│0x429815 <job_subsave+21> mov 0x343bed(%rip),%eax # 0x76d408 <job+40>
│0x42981b <job_subsave+27> movl $0x0,0x343e8b(%rip) # 0x76d6b0 <bck>
│0x429825 <job_subsave+37> mov %rdx,(%rbx)
│0x429828 <job_subsave+40> mov 0x343e89(%rip),%rdx # 0x76d6b8 <bck+8>
│0x42982f <job_subsave+47> test %eax,%eax
│0x429831 <job_subsave+49> movq $0x0,0x343e7c(%rip) # 0x76d6b8 <bck+8>
│0x42983c <job_subsave+60> mov %rdx,0x8(%rbx)
│0x429840 <job_subsave+64> mov 0x343e79(%rip),%rdx # 0x76d6c0 <bck+16>
│0x429847 <job_subsave+71> mov %rbx,0x343e72(%rip) # 0x76d6c0 <bck+16>
│0x42984e <job_subsave+78> mov %rdx,0x10(%rbx)
│0x429852 <job_subsave+82> jne 0x429887 <job_subsave+135>
│0x429854 <job_subsave+84> mov 0x343bb2(%rip),%edi # 0x76d40c <job+44>
│0x42985a <job_subsave+90> test %edi,%edi
│0x42985c <job_subsave+92> je 0x429887 <job_subsave+135>
│0x42985e <job_subsave+94> mov 0x341ca3(%rip),%rax # 0x76b508 <Vmregion>
│0x429865 <job_subsave+101> movl $0x1,0x343b99(%rip) # 0x76d408 <job+40>
│0x42986f <job_subsave+111> mov 0x60(%rax),%rdx
│0x429873 <job_subsave+115> mov $0x1,%eax
│0x429878 <job_subsave+120> mov (%rdx),%esi
│0x42987a <job_subsave+122> test %esi,%esi
│0x42987c <job_subsave+124> je 0x429890 <job_subsave+144>
│0x42987e <job_subsave+126> sub $0x1,%eax
│0x429881 <job_subsave+129> mov %eax,0x343b81(%rip) # 0x76d408 <job+40>
│0x429887 <job_subsave+135> mov %rbx,%rax
│0x42988a <job_subsave+138> pop %rbx
│0x42988b <job_subsave+139> retq
│0x42988c <job_subsave+140> nopl 0x0(%rax)
│0x429890 <job_subsave+144> callq 0x428d60 <job_reap>
│0x429895 <job_subsave+149> mov 0x343b6d(%rip),%eax # 0x76d408 <job+40>
│0x42989b <job_subsave+155> jmp 0x42987e <job_subsave+126>
So it neither increments nor decrements the in_critical lock.
Can't reproduce this, but it is a bug anyway.
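The fix shipped downstream for this bug (ksh-20120801-locking.patch, referenced in the comments and commit message below) made the guard updates atomic with GCC's legacy __sync builtins so they cannot be optimized away. The exact downstream patch text is not quoted in this report, so the following is only a sketch of that approach:

/* Sketch only; the real ksh-20120801-locking.patch is not shown here.
 * Atomic builtins force GCC to emit "lock addl"/"lock xadd"/"lock subl"
 * for the guard instead of dropping the increment/decrement pair. */
#define job_lock()	(__sync_fetch_and_add(&job.in_critical,1))
#define job_unlock() \
	do { \
		int	sig; \
		if(!__sync_sub_and_fetch(&job.in_critical,1) && (sig = job.savesig)) \
		{ \
			if(!__sync_fetch_and_add(&job.in_critical,1) && !vmbusy()) \
				job_reap(sig); \
			__sync_fetch_and_sub(&job.in_critical,1); \
		} \
	} while(0)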
*** Bug 1125326 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1381.html

This is due to an optimiser bug in GCC that persists to this day. As part of ksh 93u+m development, Johnothan King found a more portable workaround. Undo ksh-20120801-locking.patch, then apply this:
https://github.com/ksh93/ksh/commit/c258a04f

commit c258a04f7abebc97a9dad22b4ab7846264137cd6
Author: Johnothan King
Date:   Tue Jun 16 14:44:02 2020 -0700

    Implement a portable fix for SIGCHLD crashes (#18)

    As previously reported in rhbz#1112306
    (https://bugzilla.redhat.com/show_bug.cgi?id=1112306), ksh may crash
    when receiving SIGCHLD because GCC's optimizer will fail to generate
    `addl` and `sub` instructions to increment and decrement
    `job.in_critical` in the `job_subsave` function. This bug *does* occur
    in GCC 10 with `-O2`, but not `-O1`; it doesn't appear this bug has
    been fixed.

    As a reference, here is the relevant debug assembly output of
    `job_subsave` when KSH is compiled with `CCFLAGS` set to `-g -O1`:

    0000000000034c97 <job_subsave>:
    void *job_subsave(void)
    {
       34c97:  53                      push   %rbx
        struct back_save *bp = new_of(struct back_save,0);
       34c98:  bf 18 00 00 00          mov    $0x18,%edi
       34c9d:  e8 34 4a 0a 00          callq  d96d6 <_ast_malloc>
       34ca2:  48 89 c3                mov    %rax,%rbx
        job_lock();
       34ca5:  83 05 3c 50 13 00 01    addl   $0x1,0x13503c(%rip)    # 169ce8 <job+0x28>
        *bp = bck;
       34cac:  66 0f 6f 05 4c 5a 13    movdqa 0x135a4c(%rip),%xmm0   # 16a700 <bck>
       34cb3:  00
       34cb4:  0f 11 00                movups %xmm0,(%rax)
       34cb7:  48 8b 05 52 5a 13 00    mov    0x135a52(%rip),%rax    # 16a710 <bck+0x10>
       34cbe:  48 89 43 10             mov    %rax,0x10(%rbx)
        bp->prev = bck.prev;
       34cc2:  48 8b 05 47 5a 13 00    mov    0x135a47(%rip),%rax    # 16a710 <bck+0x10>
       34cc9:  48 89 43 10             mov    %rax,0x10(%rbx)
        bck.count = 0;
       34ccd:  c7 05 29 5a 13 00 00    movl   $0x0,0x135a29(%rip)    # 16a700 <bck>
       34cd4:  00 00 00
        bck.list = 0;
       34cd7:  48 c7 05 26 5a 13 00    movq   $0x0,0x135a26(%rip)    # 16a708 <bck+0x8>
       34cde:  00 00 00 00
        bck.prev = bp;
       34ce2:  48 89 1d 27 5a 13 00    mov    %rbx,0x135a27(%rip)    # 16a710 <bck+0x10>
        job_unlock();
       34ce9:  8b 05 f9 4f 13 00       mov    0x134ff9(%rip),%eax    # 169ce8 <job+0x28>
       34cef:  83 e8 01                sub    $0x1,%eax
       34cf2:  89 05 f0 4f 13 00       mov    %eax,0x134ff0(%rip)    # 169ce8 <job+0x28>
       34cf8:  75 2b                   jne    34d25 <job_subsave+0x8e>
       34cfa:  8b 3d ec 4f 13 00       mov    0x134fec(%rip),%edi    # 169cec <job+0x2c>
       34d00:  85 ff                   test   %edi,%edi
       34d02:  74 21                   je     34d25 <job_subsave+0x8e>
       34d04:  c7 05 da 4f 13 00 01    movl   $0x1,0x134fda(%rip)    # 169ce8 <job+0x28>

    When `-O2` is used instead of `-O1`, the `addl` and `sub` instructions
    for incrementing and decrementing the lock are removed. GCC instead
    generates a broken `mov` instruction for `job_lock` and removes the
    initial `sub` instruction in job_unlock (this is also seen in Red
    Hat's bug report):

        job_lock();
        *bp = bck;
       37d7c:  66 0f 6f 05 7c 79 14    movdqa 0x14797c(%rip),%xmm0   # 17f700 <bck>
       37d83:  00
        struct back_save *bp = new_of(struct back_save,0);
       37d84:  49 89 c4                mov    %rax,%r12
        job_lock();
       37d87:  8b 05 5b 6f 14 00       mov    0x146f5b(%rip),%eax    # 17ece8 <job+0x28>
    ...
        job_unlock();
       37dc6:  89 05 1c 6f 14 00       mov    %eax,0x146f1c(%rip)    # 17ece8 <job+0x28>
       37dcc:  85 c0                   test   %eax,%eax
       37dce:  75 2b                   jne    37dfb <job_subsave+0x8b>

    The original patch works around this bug by using the legacy
    `__sync_fetch_and_add/sub` GCC builtins. This forces GCC to generate
    instructions that change the lock with `lock addl`, `lock xadd` and
    `lock subl`:

        job_lock();
       37d9f:  f0 83 05 41 6f 14 00    lock addl $0x1,0x146f41(%rip) # 17ece8 <job+0x28>
       37da6:  01
    ...
        job_unlock();
       37deb:  f0 0f c1 05 f5 6e 14    lock xadd %eax,0x146ef5(%rip) # 17ece8 <job+0x28>
       37df2:  00
       37df3:  83 f8 01                cmp    $0x1,%eax
       37df6:  74 08                   je     37e00 <job_subsave+0x70>
    ...
       37e25:  74 11                   je     37e38 <job_subsave+0xa8>
       37e27:  f0 83 2d b9 6e 14 00    lock subl $0x1,0x146eb9(%rip) # 17ece8 <job+0x28>

    While this does work, it isn't portable. This patch implements a
    different workaround for this compiler bug. If `job_lock` is put at
    the beginning of `job_subsave`, GCC will generate the required `addl`
    and `sub` instructions:

        job_lock();
       37d67:  83 05 7a 5f 14 00 01    addl   $0x1,0x145f7a(%rip)    # 17dce8 <job+0x28>
    ...
        job_unlock();
       37dbb:  83 e8 01                sub    $0x1,%eax
       37dbe:  89 05 24 5f 14 00       mov    %eax,0x145f24(%rip)    # 17dce8 <job+0x28>

    It is odd that moving a single line of code fixes this problem,
    although GCC _should_ have generated these instructions from the
    original code anyway. I'll note that this isn't the only way to get
    these instructions to generate. The problem also seems to go away when
    inserting almost anything else inside of the code for `job_subsave`.
    This is just a simple workaround for a strange compiler bug.

diff --git a/src/cmd/ksh93/sh/jobs.c b/src/cmd/ksh93/sh/jobs.c
index 9b852b0..d9fcf1f 100644
--- a/src/cmd/ksh93/sh/jobs.c
+++ b/src/cmd/ksh93/sh/jobs.c
@@ -1973,8 +1973,14 @@ again:
 void *job_subsave(void)
 {
-	struct back_save *bp = new_of(struct back_save,0);
+	/*
+	 * We must make a lock first before doing anything else,
+	 * otherwise GCC will remove the job locking mechanism
+	 * as a result of compiler optimization.
+	 */
 	job_lock();
+
+	struct back_save *bp = new_of(struct back_save,0);
 	*bp = bck;
 	bp->prev = bck.prev;
 	bck.count = 0;
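For reference, this is how job_subsave() reads once that commit is applied, reconstructed from the function quoted earlier in this report plus the diff above (not copied verbatim from the 93u+m tree):

void *job_subsave(void)
{
	/*
	 * We must make a lock first before doing anything else,
	 * otherwise GCC will remove the job locking mechanism
	 * as a result of compiler optimization.
	 */
	job_lock();

	struct back_save *bp = new_of(struct back_save,0);
	*bp = bck;
	bp->prev = bck.prev;
	bck.count = 0;
	bck.list = 0;
	bck.prev = bp;
	job_unlock();
	return((void*)bp);
}

The only change is the order of the first two statements; with job_lock() first, the compiled function keeps the addl and sub instructions on job.in_critical, so job_waitsafe() defers job_reap() as intended.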