Bug 1265512
Summary: | gcc: gcc miscompiled kernel and caused xfs corruption in xfstests | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Eryu Guan <eguan> | ||||||
Component: | kernel-aarch64 | Assignee: | arm-mgr | ||||||
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Filesystem QE <fs-qe> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 7.2 | CC: | dchinner, eguan, esandeen, fs-qe, fweimer, jakub, jbastian, jfeeney, mfranc, mpolacek, zlang | ||||||
Target Milestone: | rc | ||||||||
Target Release: | 7.3 | ||||||||
Hardware: | aarch64 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2015-10-26 13:02:28 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Description
Eryu Guan
2015-09-23 06:22:44 UTC
Created attachment 1076088 [details]
167.full
Created attachment 1076089 [details]
metadump of scratch dev after test
Also seen similar corruption in shared/006 Please upgrade xfsprogs to 4.2.0 to rule out a repair bug we've already fixed. Also, the ARM compiler chain is suspect - we've got evidence that gcc from 4.6 through to 4.9 miscompile various different bits of XFS code (kernel and userspace) due to a combination of bad optimisations and broken (and unfixed) kernel asm code (e.g. the do_div() implementation). IOWs, it's not a good investment in time to be chasing ARM specific XFS corruptions until the known problems with the toolchain and platform support are fixed up first. -Dave. (In reply to Dave Chinner from comment #6) > Please upgrade xfsprogs to 4.2.0 to rule out a repair bug we've already > fixed. 4.2.0 version of xfsprogs also reports the corruption. (tested with the fs image on x86_64 host) > > Also, the ARM compiler chain is suspect - we've got evidence that gcc from > 4.6 through to 4.9 miscompile various different bits of XFS code (kernel and > userspace) due to a combination of bad optimisations and broken (and > unfixed) kernel asm code (e.g. the do_div() implementation). Current arm distro (7.2 snapshot 3) ships 4.8.5 gcc, so.. > > IOWs, it's not a good investment in time to be chasing ARM specific XFS > corruptions until the known problems with the toolchain and platform support > are fixed up first. sounds reasonable. Thanks! Eryu (In reply to Eryu Guan from comment #7) > (In reply to Dave Chinner from comment #6) > > Please upgrade xfsprogs to 4.2.0 to rule out a repair bug we've already > > fixed. > > 4.2.0 version of xfsprogs also reports the corruption. (tested with the fs > image on x86_64 host) Ok. > > Also, the ARM compiler chain is suspect - we've got evidence that gcc from > > 4.6 through to 4.9 miscompile various different bits of XFS code (kernel and > > userspace) due to a combination of bad optimisations and broken (and > > unfixed) kernel asm code (e.g. the do_div() implementation). > > Current arm distro (7.2 snapshot 3) ships 4.8.5 gcc, so.. Suspect compiler. FWIW, here's some of the background: http://oss.sgi.com/archives/xfs/2013-02/msg00563.html gcc 4.6 and 4.7 are suspect. For one recent issue: http://oss.sgi.com/archives/xfs/2015-06/msg00218.html 4.6, 4.7 are suspect, 4.9.2 is OK. 4.8 is unknown at this point, but later on on that thread: http://oss.sgi.com/archives/xfs/2015-08/msg00236.html using -Os instead of -O2 seems to avoid certain manifestatons of the issue, so it's not a clear delineation of where the problems actually lie. i.e. it may be related to the do_div problem with ARM which doesn't appear to be fixed upstream yet: http://www.spinics.net/lists/arm-kernel/msg426684.html So, gcc 4.8 is the next question. Well, it's suspect across the board (not just on ARM) with XFS because of an optimisation added in 4.8.3 is based on an unreliable structure definition heuristic: http://oss.sgi.com/archives/xfs/2015-08/msg00267.html This is avoided by XFS changes in 4.3-rc1 and xfsprogs 4.2.0 to stop the compiler making incorrect assumptions about the code, but older code may still be miscompiled by 4.8.3+ So, really, I've got very little confidence that the ARM/aarch64 gcc toolchain and platform support is particularly robust, and it seems to me that gcc 4.8.x where x >= 3 are suspect no matter the platform... can you see if the problems go away with a more recent 4.9.x compiler? -dave. (In reply to Dave Chinner from comment #8) > > FWIW, here's some of the background: Thanks for the background! > > can you see if the problems go away with a more recent 4.9.x compiler? Sure, I'm reserving a host now, will do the testing and report back. I installed upstream gcc 4.9.3 and compiled 4.3-rc2 kernel, xfs/167 and shared/006 passed 100 loops. So seems gcc-4.9.3 is good. Now I'm compiling kernel-aarch64-4.2.0-20.el7 (shipped with RHEL7.2 Snapshot 3) with the new gcc, and will try that kernel too. (In reply to Eryu Guan from comment #10) > > Now I'm compiling kernel-aarch64-4.2.0-20.el7 (shipped with RHEL7.2 Snapshot > 3) with the new gcc, and will try that kernel too. This re-compiled kernel passed 100 loops (shared/006 and xfs/167) too, while the stock kernel-aarch64-4.2.0-20.el7 failed the test within 10 loops. So this seems really to be a toolchain issue on arm platform, and gcc 4.8.5 (shipped with RHEL7.2) generates bad code and upstream gcc 4.9.3 is good. There is a dearth of information for us to fix or even analyze this problem. Does the problem go away when using -fno-aggressive-loop-optimizations -fwrapv -fno-strict-aliasing? If so, then you're likely invoking undefined behavior. Or you can also try compiling/linking with -fsanitize=undefined and then running the test (but you need newer gcc: 4.9 or 5). If the test still fails even with those options, we'd need to bisect this at the *.o level between (ABI compatible) objects built in a way that it works (I guess -Os here) and objects built in a way that fails (-O2?) to narrow it down to single .o. Then we would need to narrow the problem down to a single function, then eyeball the assembly and try to find out where it is wrong and try to create a self-contained executable testcase from the preprocessed source. But I'd start with trying out the options I mentioned before. Yeah, even if some compiler "works" and another compiler "doesn't work", it can be and usually is, a bug in the code in question, rather than the compiler. If the problem is on the kernel side, -fsanitize=undefined won't work, but supposedly -fsanitize=undefined -fsanitize-undefined-trap-on-error might. In any case, knowing which exact object file is problematic, preprocessed source for it and gcc options used for it is essential. So, reassigning back to kernel until that is provided. To all the compiler and ARM people: Please read comment #8 and all the threads linked into that comment before pointing fingers about who is to blame. There have been lots of problems, spread across different areas, from old style code that incorrectly triggers compiler optimisation hueristics, to broken ASM code, to compiler bugs. All the information on the problems, how to reproduce them, their cause and solutoins/workarounds are in those threads. -Dave. I've skimmed the threads and didn't find anything that looked like a compiler bug (though found two patches that fixed kernel bugs). So, if you are aware of something in there that is or at least looks like a compiler bug, rather than just folks trying different compiler versions and seeing what works and what doesn't, please point me at it. As always, for a compiler bugreport we need preprocessed source, compiler version, options passed to the compiler and why do you think it is a compiler bug (and, at least narrowed to a source file with a hint where to look for problem, or better narrowed to a particular function; best just in a form of self-contained testcase). > If the problem is on the kernel side, -fsanitize=undefined won't work, but supposedly -fsanitize=undefined -fsanitize-undefined-trap-on-error might.
how would it behave if it trips?
I compiled xfs with those flags in rawhide (gcc version 5.1.1 20150618 (Red Hat 5.1.1-4) (GCC)) and on xfs unmount I get an oops, but I don't know what to make of it,
[ 77.618773] XFS (dm-2): Unmounting Filesystem
[ 77.629242] ------------[ cut here ]------------
[ 77.630084] Kernel BUG at ffffffffa0675ea5 [verbose debug info unavailable]
With -fsanitize=undefined -fsanitize-undefined-trap-on-error it is if (this_would_be_undefined_behavior) __builtin_trap (); operation_that_could_invoke_undefined_behavior; So, one has to inspect the code. Alternatively, you could try to apply https://github.com/torvalds/linux/compare/master...aryabinin:ubsan/v2?diff=unified&name=ubsan%2Fv2 and through that compile just with -fsanitize=undefined without that -fsanitize-undefined-trap-on-error, at which point it should be verbose on what exact undefined behavior has been triggered where. Most likely all you need is just to link in the ubsan.o into the kernel or into the xfs module, and just use -fsanitize=undefined on the xfs module. Just for what it's worth, I asked out of curiosity; I don't want to distract too much from the original problem. As Eryu said:
> So this seems really to be a toolchain issue on arm platform, and gcc 4.8.5 (shipped with RHEL7.2) generates bad code and upstream gcc 4.9.3 is good.
... that's at least a possibility (vs. "it must be bad xfs code"), right?
Until analyzed, of course, everything is possible. If it has been analyzed by somebody and there is e.g. a link to a GCC bugzilla, it is enough if you point me to it. Otherwise, it would be nice if as the first step it is checked if the code isn't obviously bad (and -fsanitize=undefined can help with that, while of course not being 100% at determining that, but handling many common cases), and if that doesn't show anything, somebody familiar with that part of the kernel at least bisecting it to something manageable. E.g. if you have a "good compiler" and "bad compiler" where good and bad is that it generates/doesn't generate what you expect, and might be different compiler versions and/or different compiler options (e.g. -Os vs. -O2, -fno-inline vs. normal, -fno-aggressive-loop-optimizations, etc.), then trying to bisect between object files compiled with "bad" and "good" compiler to narrow it down to (ideally) a single problematic file is best done by somebody who knows about the app in question and has access to the reproducer. Even within a single file (which is then best preprocessed and handled in that form further), one can use e.g. noinline attribute, or optimize attribute to narrow it down even further. I'm trying with -fsanitize=undefined, but unfortunately that's tedious. Any unaligned access will oops, for example, or maybe it's something else - who knows, we just get a line of code and then go try to work out what caused it to abort. And that may or may not be the issue which caused the problem on this architecture in the first place. xfsprogs shares a lot of code w/ the kernel, so I did go down the path of building userspace with -fsanitize=undefined - and it's been useful, but most of what it has found is outside of the shared libxfs code. (I don't think we can really just link libubsan into the kernel, can we? It would be fantastic, though, to get a run-time printk ala userspace runtime messages, but I don't know how to achieve that). The https://github.com/torvalds/linux/compare/master...aryabinin:ubsan/v2?diff=unified&name=ubsan%2Fv2 patch I've mentioned provides a libubsan compatible entrypoints for the kernel. Also, -fsanitize=undefined has many sub-options, it stands for shift integer-divide-by-zero unreachable vla-bound return null signed-integer-overflow bool enum bounds alignment nonnull-attribute returns-nonnull-attribute vptr (exact details depending on exact compiler version). If the kernel violates alignment requirement everywhere, you can enable only some of the checks, or e.g. -fsanitize=undefined -fno-sanitize=alignment etc. Oh, I'm sorry, I had missed that kernel patch. Thanks, I'll take a look. Ok,got that patch going. FWIW, xfstests xfs/167 trips no errors on x86_64, but I'll attach the patch and perhaps QE can give it a shot on aarch64 (I can never get access to a machine...) FWIW, I did do a little w/ libubsan & xfs kernel code; the only issue I came across was a zero-sized memcpy from NULL; I sent a patch for that upstream. |