Bug 1262498

Summary: r_ext4_small_bg test failure with gcc-4.8.5-4, valgrind errors
Product: Red Hat Enterprise Linux 7 Reporter: Martin Sebor <msebor>
Component: e2fsprogs Assignee: Eric Sandeen <esandeen>
Status: CLOSED ERRATA QA Contact: Boyang Xue <bxue>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.2 CC: eguan, msebor, xzhou
Target Milestone: rc   
Target Release: ---   
Hardware: ppc64le   
OS: Linux   
Whiteboard:
Fixed In Version: e2fsprogs-1.42.9-9.el7 Doc Type: Bug Fix
Last Closed: 2016-11-04 06:41:40 UTC Type: Bug
Attachments:
Description Flags
Valgrind output for e2fsck. none

Description Martin Sebor 2015-09-11 23:35:31 UTC
Created attachment 1072655
Valgrind output for e2fsck.

After upgrading GCC to gcc version 4.8.5 20150623 (Red Hat 4.8.5-4), the r_ext4_small_bg test fails on ppc64le (I haven't tested other targets) with the following output:

Running e2fsprogs test suite...
 
dumpe2fs 1.42.9 (28-Dec-2013)
r_ext4_small_bg: ext4 1024 blocksize with small block groups: failed
141 tests succeeded	1 tests failed
Tests failed: r_ext4_small_bg 

All other tests pass.

While debugging the failure I narrowed it down to the e2fsck/util.c file, where I was able to make it go away by changing compiler options or making small code changes.  For instance, disabling optimization helped, as did compiling the file with _FORTIFY_SOURCE undefined (and optimization enabled) and, surprisingly, even removing -g or stripping the e2fsck program.  Since neither the presence nor the absence of debugging symbols (or other symbols) has any effect on the generated code, the problem must be in the program data.  Running the e2fsck program under valgrind revealed a large number of errors pointing out uses of uninitialized data (see the attachment).  I believe these are the cause of the test failure.

Comment 2 Martin Sebor 2015-09-11 23:48:24 UTC
I should clarify that the e2fsck/util.c file isn't the only one where even small changes can cause the test failure to disappear.  The failure can also be eliminated by making what should otherwise be inconsequential changes in other source files that e2fsck links with.

Comment 3 Eric Sandeen 2015-09-14 14:11:40 UTC
==85688== Conditional jump or move depends on uninitialised value(s)
==85688==    at 0x42EC394: ??? (in /usr/lib64/power8/libc-2.17.so)
==85688==    by 0x434C2BF: ??? (in /usr/lib64/power8/libc-2.17.so)
==85688==    by 0x40DC287: check_mntent_file (ismounted.c:112)
==85688==    by 0x40DC803: check_mntent (ismounted.c:227)
==85688==    by 0x40DC803: ext2fs_check_mount_point (ismounted.c:360)
==85688==    by 0x40DC91F: ext2fs_check_if_mounted (ismounted.c:400)
==85688==    by 0x100098E3: check_mount (unix.c:228)
==85688==    by 0x100098E3: main (unix.c:1234)

Could you please run this again with glibc-debuginfo installed?

Thanks,
-Eric

Comment 4 Eric Sandeen 2015-09-14 19:19:41 UTC
Rebuilding with

gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 

on x86_64 succeeded; could this be arch-specific?

Or - did you run this by rebuilding the RHEL7 RPM, or some other e2fsprogs version?

Comment 5 Martin Sebor 2015-09-14 19:52:33 UTC
This came up during a mass RHEL 7.2 rebuild on powerpc64le.  I haven't tried any other targets.  The exact e2fsprogs version is 1.42.9-8.el7.  For some reason mock/yum isn't finding glibc-devel here so I don't have a more complete stack trace at the moment.

Comment 6 Eric Sandeen 2015-09-14 20:54:58 UTC
Ok, on ppc64le I'm getting the failure too, but so far valgrind isn't showing anything...

Comment 7 Martin Sebor 2015-09-15 14:59:36 UTC
I can reproduce the same or similar valgrind errors even with GCC 4.8.3-9 by modifying tests/scripts/resize_test like so and rerunning the test via 'make check TESTS=r_ext4_small_bg':

--- tests/scripts/resize_test.~0~	2015-09-15 10:58:04.866164753 -0400
+++ tests/scripts/resize_test	2015-09-15 10:58:08.466211056 -0400
@@ -53,7 +53,7 @@
 fi
 
 echo $FSCK -fp $TMPFILE >> $LOG 2>&1 
-if ! $FSCK -fp $TMPFILE >> $LOG 2>&1
+if ! valgrind $FSCK -fp $TMPFILE >> $LOG 2>&1
 then
 	dumpe2fs $TMPFILE >> $LOG
 	return 1

Comment 8 Eric Sandeen 2015-09-15 17:27:08 UTC
Weird, I did the same, and get nothing from valgrind.

But if you get the same errors, then the valgrind output is probably not related to the new-gcc-specific failure, I suppose.

Comment 9 Eric Sandeen 2015-09-17 00:37:35 UTC
I'm not sure this is a gcc issue.  If I take the original mkfs'd filesystem, move it to another (old RHEL 6) machine, and run resize on it there, I get minor corruption.  It seems to be a bitmap marking problem during resize.

I can't explain the gcc impact; it looks like just a straightforward bug.

If gcc affects the layout of the original un-resized filesystem, somehow, maybe that's it?  Very strange.

Comment 10 Eric Sandeen 2015-09-17 23:04:53 UTC
This sure looks like a plain bug in resize2fs.  Patch sent upstream:

http://marc.info/?l=linux-ext4&m=144252982403894&w=2

I can't explain how gcc versions might tickle this, unless it's affecting something else which changes how allocations behave during the test...

-Eric

Comment 11 Eric Sandeen 2015-09-18 03:20:10 UTC
I think the only reason the different gcc tweaked the bug is that the
test copies the e2fsck/e2fsck binary into the filesystem under test,
and the size changes depending on the compiler.

This leads to a different allocation pattern, and tickles the bug.

I don't think it's a gcc problem, or even a regression, though of course we'd like it to pass the self-checks on rebuild...

-Eric
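
As a rough illustration of the point in comment 11 (a toy model, not e2fsprogs code), the sketch below fills fixed-size block groups in order and reports which groups a file lands in. The function name, the first-fit policy, and the sizes are all assumptions chosen for the example; the real allocator is more involved. It only shows why a slightly larger binary can spill into a block group that a smaller one never touches.

```python
# Toy model: a first-fit allocator over fixed-size block groups.
# All names and sizes here are illustrative assumptions, not e2fsprogs values.

def groups_touched(file_blocks, blocks_per_group, used_in_group0=0):
    """Return the set of block-group indexes a file of `file_blocks` blocks
    lands in, filling groups in order starting from group 0."""
    touched = set()
    group = 0
    free = blocks_per_group - used_in_group0
    remaining = file_blocks
    while remaining > 0:
        if free == 0:
            # Current group is full: move on to the next (so far unused) group.
            group += 1
            free = blocks_per_group
        take = min(free, remaining)
        touched.add(group)
        free -= take
        remaining -= take
    return touched

small_binary = groups_touched(file_blocks=250, blocks_per_group=256)
large_binary = groups_touched(file_blocks=260, blocks_per_group=256)
print(small_binary)  # {0}
print(large_binary)  # {0, 1}
```

In this model a difference of ten blocks is enough to change which groups get used, which is consistent with the observation that unrelated compiler or source changes made the failure come and go.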

Comment 12 Martin Sebor 2015-09-18 16:15:57 UTC
I also don't believe it's a gcc bug (even though I had initially suspected it because of the effect of even subtle code changes, until I noticed they had no impact on the generated assembly). Thanks for the confirmation!

Comment 13 Eric Sandeen 2016-01-14 14:10:35 UTC
The patch was sent upstream but never merged; pinged again...

Comment 14 Eric Sandeen 2016-02-19 22:51:26 UTC
Moving to RHEL 7.3; still can't get it merged upstream despite 2 reviewers, but I do have a patch that fixes this.

-Eric

Comment 16 Eric Sandeen 2016-06-13 14:48:27 UTC
commit f3745728bc254892da4c569ba3fd8801895f3524
Author: Eric Sandeen <sandeen>
Date:   Sun Mar 6 21:51:23 2016 -0500

    resize2fs: clear uninit BG if allocating from new group
    
    If resize2fs_get_alloc_block() allocates from a BLOCK_UNINIT group, we
    need to make sure that the UNINIT flag is cleared on both file system
    structures which are maintained by resize2fs.  This causes the
    modified bitmaps to not get written out, which leads to post-resize2fs
    e2fsck errors; used blocks in UNINIT groups, not marked in the block
    bitmap.  This was seen on r_ext4_small_bg.
    
    This patch uses clear_block_uninit() to clear the flag,
    and my problem goes away.
    
    Signed-off-by: Eric Sandeen <sandeen>
    Reviewed-by: Darrick J. Wong <darrick.wong>
    Reviewed-by: Andreas Dilger <adilger>
    Signed-off-by: Theodore Ts'o <tytso>
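
The failure mode in the commit message above can be sketched with a small toy model. This is not the resize2fs code: the `Group` class, the helper names, and the rule that bitmaps of flagged groups are skipped on write-out are assumptions made for illustration, and the model tracks a single file system structure where resize2fs maintains two. It only shows why allocating from a group while leaving its UNINIT flag set can leave a used block unmarked on disk.

```python
# Toy model of the bug described in the commit message above.
# Class, field, and function names are hypothetical; this is not e2fsprogs code.

class Group:
    def __init__(self, nblocks):
        self.block_uninit = True          # stands in for the BLOCK_UNINIT flag
        self.bitmap = [False] * nblocks   # in-memory block bitmap

def alloc_block(groups, g, clear_uninit):
    """Mark the first free block in group g as used; optionally clear the flag."""
    grp = groups[g]
    idx = grp.bitmap.index(False)
    grp.bitmap[idx] = True
    if clear_uninit:
        grp.block_uninit = False
    return idx

def write_bitmaps(groups):
    """Return what reaches 'disk': bitmaps of UNINIT groups are skipped (None)."""
    return [None if grp.block_uninit else list(grp.bitmap) for grp in groups]

def fsck_sees_used_block(on_disk, g, idx):
    """A checker treats a group with no written bitmap as entirely free."""
    return on_disk[g] is not None and on_disk[g][idx]

for fixed in (False, True):
    groups = [Group(4), Group(4)]
    idx = alloc_block(groups, 1, clear_uninit=fixed)
    ok = fsck_sees_used_block(write_bitmaps(groups), 1, idx)
    print("fixed" if fixed else "buggy", "->",
          "consistent" if ok else "used block not marked")
```

With the flag left set, the model reports a used block that is not marked in the written bitmap, which matches the kind of e2fsck complaint the commit message describes; clearing the flag at allocation time makes the two agree.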

Comment 20 errata-xmlrpc 2016-11-04 06:41:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2454.html