Bug 1265512

Summary: gcc: gcc miscompiled kernel and caused xfs corruption in xfstests
Product: Red Hat Enterprise Linux 7 Reporter: Eryu Guan <eguan>
Component: kernel-aarch64Assignee: arm-mgr
Status: CLOSED INSUFFICIENT_DATA QA Contact: Filesystem QE <fs-qe>
Severity: high Docs Contact:
Priority: high    
Version: 7.2CC: dchinner, eguan, esandeen, fs-qe, fweimer, jakub, jbastian, jfeeney, mfranc, mpolacek, zlang
Target Milestone: rc   
Target Release: 7.3   
Hardware: aarch64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-26 13:02:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
167.full
none
metadump of scratch dev after test none

Description Eryu Guan 2015-09-23 06:22:44 UTC
Description of problem:
xfstests xfs/167 corrupts v5 xfs on aarch64 host (and only on aarch64)

[root@amd-seattle-19 xfstests]# xfs_repair -n /dev/mapper/rhel_amd--seattle--19-testlv2
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
bad CRC for inode 16777436
bad CRC for inode 16777436, would rewrite
would have cleared inode 16777436
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
bad CRC for inode 16777436, would rewrite
would have cleared inode 16777436
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

Version-Release number of selected component (if applicable):
kernel-4.2.0-0.19.el7.aarch64
xfsprogs-3.2.2-2.el7.aarch64

How reproducible:
not every time, but can be reliably reprodued by running xfs/167 in loop, usually within 20 loops

Steps to Reproduce:
1. MKFS_OPTIONS="-m crc=1" ./check xfs/167
2.
3.

Actual results:
xfs corruption

Expected results:
no corruption

Additional info:
This can be reproduced with upstream 4.3-rc2, 4.2, 4.1, 4.0 kernels, I haven't tried kernels prior to 4.0. But still, can be reproduced only on aarch64 hosts, ppc64[le] and x86_64 hosts passed the test.

Similar corruption can be seen in xfs/297 too, aarch64 only too.

Comment 1 Eryu Guan 2015-09-23 06:23:20 UTC
Created attachment 1076088 [details]
167.full

Comment 2 Eryu Guan 2015-09-23 06:24:26 UTC
Created attachment 1076089 [details]
metadump of scratch dev after test

Comment 4 Eryu Guan 2015-09-23 06:29:15 UTC
Also seen similar corruption in shared/006

Comment 6 Dave Chinner 2015-09-25 02:20:55 UTC
Please upgrade xfsprogs to 4.2.0 to rule out a repair bug we've already fixed.

Also, the ARM compiler chain is suspect - we've got evidence that gcc from 4.6 through to 4.9 miscompile various different bits of XFS code (kernel and userspace) due to a combination of bad optimisations and broken (and unfixed) kernel asm code (e.g. the do_div() implementation).

IOWs, it's not a good investment in time to be chasing ARM specific XFS corruptions until the known problems with the toolchain and platform support are fixed up first.

-Dave.

Comment 7 Eryu Guan 2015-09-25 03:41:46 UTC
(In reply to Dave Chinner from comment #6)
> Please upgrade xfsprogs to 4.2.0 to rule out a repair bug we've already
> fixed.

4.2.0 version of xfsprogs also reports the corruption. (tested with the fs image on x86_64 host)

> 
> Also, the ARM compiler chain is suspect - we've got evidence that gcc from
> 4.6 through to 4.9 miscompile various different bits of XFS code (kernel and
> userspace) due to a combination of bad optimisations and broken (and
> unfixed) kernel asm code (e.g. the do_div() implementation).

Current arm distro (7.2 snapshot 3) ships 4.8.5 gcc, so..

> 
> IOWs, it's not a good investment in time to be chasing ARM specific XFS
> corruptions until the known problems with the toolchain and platform support
> are fixed up first.

sounds reasonable. Thanks!

Eryu

Comment 8 Dave Chinner 2015-09-25 05:22:27 UTC
(In reply to Eryu Guan from comment #7)
> (In reply to Dave Chinner from comment #6)
> > Please upgrade xfsprogs to 4.2.0 to rule out a repair bug we've already
> > fixed.
> 
> 4.2.0 version of xfsprogs also reports the corruption. (tested with the fs
> image on x86_64 host)

Ok.

> > Also, the ARM compiler chain is suspect - we've got evidence that gcc from
> > 4.6 through to 4.9 miscompile various different bits of XFS code (kernel and
> > userspace) due to a combination of bad optimisations and broken (and
> > unfixed) kernel asm code (e.g. the do_div() implementation).
> 
> Current arm distro (7.2 snapshot 3) ships 4.8.5 gcc, so..

Suspect compiler.

FWIW, here's some of the background:

http://oss.sgi.com/archives/xfs/2013-02/msg00563.html

gcc 4.6 and 4.7 are suspect.

For one recent issue:

http://oss.sgi.com/archives/xfs/2015-06/msg00218.html

4.6, 4.7 are suspect, 4.9.2 is OK. 4.8 is unknown at this point, but later on on that thread:

http://oss.sgi.com/archives/xfs/2015-08/msg00236.html

using -Os instead of -O2 seems to avoid certain manifestatons of the issue, so it's not a clear delineation of where the problems actually lie. i.e. it may be related to the do_div problem with ARM which doesn't appear to be fixed upstream yet:

http://www.spinics.net/lists/arm-kernel/msg426684.html

So, gcc 4.8 is the next question. Well, it's suspect across the board (not just on ARM) with XFS because of an optimisation added in 4.8.3 is based on an unreliable structure definition heuristic:

http://oss.sgi.com/archives/xfs/2015-08/msg00267.html

This is avoided by XFS changes in 4.3-rc1 and xfsprogs 4.2.0 to stop the compiler making incorrect assumptions about the code, but older code may still be miscompiled by 4.8.3+

So, really, I've got very little confidence that the ARM/aarch64 gcc toolchain and platform support is particularly robust, and it seems to me that gcc 4.8.x where x >= 3 are suspect no matter the platform...

can you see if the problems go away with a more recent 4.9.x compiler?

-dave.

Comment 9 Eryu Guan 2015-09-25 07:32:09 UTC
(In reply to Dave Chinner from comment #8)
> 
> FWIW, here's some of the background:

Thanks for the background!

> 
> can you see if the problems go away with a more recent 4.9.x compiler?

Sure, I'm reserving a host now, will do the testing and report back.

Comment 10 Eryu Guan 2015-09-27 15:12:14 UTC
I installed upstream gcc 4.9.3 and compiled 4.3-rc2 kernel, xfs/167 and shared/006 passed 100 loops. So seems gcc-4.9.3 is good.

Now I'm compiling kernel-aarch64-4.2.0-20.el7 (shipped with RHEL7.2 Snapshot 3) with the new gcc, and will try that kernel too.

Comment 11 Eryu Guan 2015-09-29 02:14:04 UTC
(In reply to Eryu Guan from comment #10)
> 
> Now I'm compiling kernel-aarch64-4.2.0-20.el7 (shipped with RHEL7.2 Snapshot
> 3) with the new gcc, and will try that kernel too.

This re-compiled kernel passed 100 loops (shared/006 and xfs/167) too, while the stock kernel-aarch64-4.2.0-20.el7 failed the test within 10 loops.

So this seems really to be a toolchain issue on arm platform, and gcc 4.8.5 (shipped with RHEL7.2) generates bad code and upstream gcc 4.9.3 is good.

Comment 13 Marek Polacek 2015-09-29 11:33:04 UTC
There is a dearth of information for us to fix or even analyze this problem.

Does the problem go away when using -fno-aggressive-loop-optimizations -fwrapv -fno-strict-aliasing?  If so, then you're likely invoking undefined behavior.  Or you can also try compiling/linking with -fsanitize=undefined and then running the test (but you need newer gcc: 4.9 or 5).

If the test still fails even with those options, we'd need to bisect this at the *.o level between (ABI compatible) objects built in a way that it works (I guess -Os here) and objects built in a way that fails (-O2?) to narrow it down to single .o.  Then we would need to narrow the problem down to a single function, then eyeball the assembly and try to find out where it is wrong and try to create a self-contained executable testcase from the preprocessed source.  But I'd start with trying out the options I mentioned before.

Comment 14 Jakub Jelinek 2015-09-29 11:41:34 UTC
Yeah, even if some compiler "works" and another compiler "doesn't work", it can be and usually is, a bug in the code in question, rather than the compiler.
If the problem is on the kernel side, -fsanitize=undefined won't work, but supposedly -fsanitize=undefined -fsanitize-undefined-trap-on-error might.
In any case, knowing which exact object file is problematic, preprocessed source for it and gcc options used for it is essential.  So, reassigning back to kernel until that is provided.

Comment 15 Dave Chinner 2015-09-30 11:48:12 UTC
To all the compiler and ARM people:

Please read comment #8 and all the threads linked into that comment before pointing fingers about who is to blame. There have been lots of problems, spread across different areas, from old style code that incorrectly triggers compiler optimisation hueristics, to broken ASM code, to compiler bugs. All the information on the problems, how to reproduce them, their cause and solutoins/workarounds are in those threads.

-Dave.

Comment 16 Jakub Jelinek 2015-09-30 12:12:08 UTC
I've skimmed the threads and didn't find anything that looked like a compiler bug (though found two patches that fixed kernel bugs).  So, if you are aware of something in there that is or at least looks like a compiler bug, rather than just folks trying different compiler versions and seeing what works and what doesn't, please point me at it.  As always, for a compiler bugreport we need preprocessed source, compiler version, options passed to the compiler and why do you think it is a compiler bug (and, at least narrowed to a source file with a hint where to look for problem, or better narrowed to a particular function; best just in a form of self-contained testcase).

Comment 17 Eric Sandeen 2015-10-01 03:25:16 UTC
> If the problem is on the kernel side, -fsanitize=undefined won't work, but supposedly -fsanitize=undefined -fsanitize-undefined-trap-on-error might.

how would it behave if it trips?

I compiled xfs with those flags in rawhide (gcc version 5.1.1 20150618 (Red Hat 5.1.1-4) (GCC)) and on xfs unmount I get an oops, but I don't know what to make of it,

[   77.618773] XFS (dm-2): Unmounting Filesystem
[   77.629242] ------------[ cut here ]------------
[   77.630084] Kernel BUG at ffffffffa0675ea5 [verbose debug info unavailable]

Comment 18 Jakub Jelinek 2015-10-01 06:11:46 UTC
With -fsanitize=undefined -fsanitize-undefined-trap-on-error it is
if (this_would_be_undefined_behavior) __builtin_trap (); operation_that_could_invoke_undefined_behavior;
So, one has to inspect the code.
Alternatively, you could try to apply
https://github.com/torvalds/linux/compare/master...aryabinin:ubsan/v2?diff=unified&name=ubsan%2Fv2
and through that compile just with -fsanitize=undefined without that -fsanitize-undefined-trap-on-error, at which point it should be verbose on what exact undefined behavior has been triggered where.
Most likely all you need is just to link in the ubsan.o into the kernel or into the xfs module, and just use -fsanitize=undefined on the xfs module.

Comment 19 Eric Sandeen 2015-10-01 13:17:33 UTC
Just for what it's worth, I asked out of curiosity; I don't want to distract too much from the original problem.  As Eryu said:

> So this seems really to be a toolchain issue on arm platform, and gcc 4.8.5 (shipped with RHEL7.2) generates bad code and upstream gcc 4.9.3 is good.

... that's at least a possibility (vs. "it must be bad xfs code"), right?

Comment 20 Jakub Jelinek 2015-10-01 13:31:39 UTC
Until analyzed, of course, everything is possible.  If it has been analyzed by somebody and there is e.g. a link to a GCC bugzilla, it is enough if you point me to it.  Otherwise, it would be nice if as the first step it is checked if the code isn't obviously bad (and -fsanitize=undefined can help with that, while of course not being 100% at determining that, but handling many common cases), and if that doesn't show anything, somebody familiar with that part of the kernel at least bisecting it to something manageable.  E.g. if you have a "good compiler" and "bad compiler" where good and bad is that it generates/doesn't generate what you expect, and might be different compiler versions and/or different compiler options (e.g. -Os vs. -O2, -fno-inline vs. normal, -fno-aggressive-loop-optimizations, etc.), then trying to bisect between object files compiled with "bad" and "good" compiler to narrow it down to (ideally) a single problematic file is best done by somebody who knows about the app in question and has access to the reproducer.  Even within a single file (which is then best preprocessed and handled in that form further), one can use e.g. noinline attribute, or optimize attribute to narrow it down even further.

Comment 21 Eric Sandeen 2015-10-02 13:21:00 UTC
I'm trying with -fsanitize=undefined, but unfortunately that's tedious.  Any unaligned access will oops, for example, or maybe it's something else - who knows, we just get a line of code and then go try to work out what caused it to abort.  And that may or may not be the issue which caused the problem on this architecture in the first place.

xfsprogs shares a lot of code w/ the kernel, so I did go down the path of building userspace with -fsanitize=undefined - and it's been useful, but most of what it has found is outside of the shared libxfs code.

(I don't think we can really just link libubsan into the kernel, can we?  It would be fantastic, though, to get a run-time printk ala userspace runtime messages, but I don't know how to achieve that).

Comment 22 Jakub Jelinek 2015-10-02 13:32:26 UTC
The https://github.com/torvalds/linux/compare/master...aryabinin:ubsan/v2?diff=unified&name=ubsan%2Fv2 patch I've mentioned provides a libubsan compatible entrypoints for the kernel.
Also, -fsanitize=undefined has many sub-options, it stands for
shift
integer-divide-by-zero
unreachable
vla-bound
return
null
signed-integer-overflow
bool
enum
bounds
alignment
nonnull-attribute
returns-nonnull-attribute
vptr
(exact details depending on exact compiler version).  If the kernel violates alignment requirement everywhere, you can enable only some of the checks, or e.g.
-fsanitize=undefined -fno-sanitize=alignment
etc.

Comment 23 Eric Sandeen 2015-10-02 13:52:39 UTC
Oh, I'm sorry, I had missed that kernel patch.  Thanks, I'll take a look.

Comment 24 Eric Sandeen 2015-10-02 17:01:56 UTC
Ok,got that patch going.

FWIW, xfstests xfs/167 trips no errors on x86_64, but I'll attach the patch and perhaps QE can give it a shot on aarch64 (I can never get access to a machine...)

Comment 28 Eric Sandeen 2015-10-26 14:20:39 UTC
FWIW, I did do a little w/ libubsan & xfs kernel code; the only issue I came across was a zero-sized memcpy from NULL; I sent a patch for that upstream.