When e2fsprogs are built with hardened build enabled in redhat-rpm-config, r_ext4_small_bg test fails on 32-bit s390 (http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=1776795). After downgrading to redhat-rpm-config-27-1.fc23 I was able to build e2fsprogs. Version-Release number of selected component (if applicable): e2fsprogs-1.42.12-3.fc23 redhat-rpm-config-28-1.fc23 gcc-5.0.0-0.17.fc22 gcc-5.0.1-0.1.fc22
Hm: ./scripts/resize_test: line 100: dumpe2fs: command not found r_ext4_small_bg: ext4 1024 blocksize with small block groups: failed but it seems to have installed & built fine ... do you have a task # for the build w/ the downgraded redhat-rpm-config?
http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=1776816 the diff between the 2 buildroots is [dan@eagle ~]$ diff /tmp/e-1 /tmp/e-2 11c11 < cpp-5.0.1-0.1.fc22.s390 --- > cpp-5.0.0-0.17.fc22.s390 32,34c32,34 < gcc-5.0.1-0.1.fc22.s390 < gcc-c++-5.0.1-0.1.fc22.s390 < gcc-gdb-plugin-5.0.1-0.1.fc22.s390 --- > gcc-5.0.0-0.17.fc22.s390 > gcc-c++-5.0.0-0.17.fc22.s390 > gcc-gdb-plugin-5.0.0-0.17.fc22.s390 63,64c63,64 < libblkid-2.26-1.fc22.s390 < libblkid-devel-2.26-1.fc22.s390 --- > libblkid-2.26-0.4.fc22.s390 > libblkid-devel-2.26-0.4.fc22.s390 72c72 < libfdisk-2.26-1.fc22.s390 --- > libfdisk-2.26-0.4.fc22.s390 74c74 < libgcc-5.0.1-0.1.fc22.s390 --- > libgcc-5.0.0-0.17.fc22.s390 76c76 < libgomp-5.0.1-0.1.fc22.s390 --- > libgomp-5.0.0-0.17.fc22.s390 81c81 < libmount-2.26-1.fc22.s390 --- > libmount-2.26-0.4.fc22.s390 89c89 < libsmartcols-2.26-1.fc22.s390 --- > libsmartcols-2.26-0.4.fc22.s390 91,92c91,92 < libstdc++-5.0.1-0.1.fc22.s390 < libstdc++-devel-5.0.1-0.1.fc22.s390 --- > libstdc++-5.0.0-0.17.fc22.s390 > libstdc++-devel-5.0.0-0.17.fc22.s390 98,99c98,99 < libuuid-2.26-1.fc22.s390 < libuuid-devel-2.26-1.fc22.s390 --- > libuuid-2.26-0.4.fc22.s390 > libuuid-devel-2.26-0.4.fc22.s390 167c167 < redhat-rpm-config-28-1.fc23.noarch --- > redhat-rpm-config-27-1.fc22.noarch 184c184 < util-linux-2.26-1.fc22.s390 --- > util-linux-2.26-0.4.fc22.s390
Is e2fsprogs installed in the former build root and not the latter? That would explain the missing dumpe2fs, and the script is trying to run it from the root, not from the build tree. Does this fix it? diff --git a/tests/scripts/resize_test b/tests/scripts/resize_test index 0633e0c..dfd45ac 100755 --- a/tests/scripts/resize_test +++ b/tests/scripts/resize_test @@ -67,7 +67,7 @@ fi echo $FSCK -fy $TMPFILE >> $LOG 2>&1 if ! $FSCK -fy $TMPFILE >> $LOG 2>&1 then - dumpe2fs $TMPFILE >> $LOG + $DUMPE2FS $TMPFILE >> $LOG return 1 fi @@ -97,7 +97,7 @@ fi echo $FSCK -fy $TMPFILE >> $LOG 2>&1 if ! $FSCK -fy $TMPFILE >> $LOG 2>&1 then - dumpe2fs $TMPFILE >> $LOG + $DUMPE2FS $TMPFILE >> $LOG return 1 fi @@ -122,7 +122,7 @@ fi echo $FSCK -fy $TMPFILE >> $LOG 2>&1 if ! $FSCK -fy $TMPFILE >> $LOG 2>&1 then - dumpe2fs $TMPFILE >> $LOG + $DUMPE2FS $TMPFILE >> $LOG return 1 fi @@ -147,7 +147,7 @@ fi echo $FSCK -fy $TMPFILE >> $LOG 2>&1 if ! $FSCK -fy $TMPFILE >> $LOG 2>&1 then - dumpe2fs $TMPFILE >> $LOG + $DUMPE2FS $TMPFILE >> $LOG return 1 fi
(In reply to Eric Sandeen from comment #3) > Is e2fsprogs installed in the former build root and not the latter? That > would explain the missing dumpe2fs, and the script is trying to run it from > the root, not from the build tree. Does this fix it? unfortunately not -> http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=1777003 If I see correctly, then in neither case e2fsprogs are installed in buildroot.
Hm, thanks. So that changes it from: ./scripts/resize_test: line 100: dumpe2fs: command not found r_ext4_small_bg: ext4 1024 blocksize with small block groups: failed to: dumpe2fs 1.42.12 (29-Aug-2014) r_ext4_small_bg: ext4 1024 blocksize with small block groups: failed but it still fails. Odd. I need to fix that test infra to say *how* it failed. :(
with added %undefine _hardened_build on top of the spec it passes - http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=1777017
Created attachment 1014423 [details] test log file
few more info - rechecked with e2fsprogs-1.42.12-5.fc23 - reproduced with gcc-5.1.1-1.fc22.s390 and hardened build enabled - going from -O2 to -O1 and keeping hardened build makes the tests pass So very likely a compiler issue. Jakub, I have a F-22 s390 chroot ready and accessible via ssh, ping me for details.
So, it seems what matters for this test is whether e2fsck program has been linked with just -Wl,-z,relro -pie or -Wl,-z,relro -pie -Wl,-z,now (hardening adds both, which doesn't work). But, there is no change in the generated code between that, the only changes are in the layout of the sections (the first one lays sections in the RW segment out so that the .got section is 4KB aligned, the latter has 16 byte longer .dynamic section and lays them out so that the .data section is 4KB aligned). Replacing -Wl,-z,relro with -Wl,-z,norelro also fixes the problem (again, the section layout of the writable sections is very much different), both with -Wl,-z,now and without. Thus, I'd say this is likely some uninitialized variable issue or something similar where things go wrong depending on the exact data section layout. I'd suggest if somebody familiar with e2fsck sources just did a side-by-side debugging session between e2fsck linked with the default flags (-Wl,-z,relro -pie -Wl,-z,now) vs. any of the working combinations (-pie -Wl,-z,norelro,-z,now or -pie -Wl,z,relro) to find out where and why things start to look different. Not ruling gcc issues, but but in e2fsck is much more likely in this case.
Wish this didn't require an s390 box! I wonder if running that test through valgrind would offer any clues, if Jakub thinks it might be an uninitialized variable?
And there is no valgrind for s390 (32 bit). But Jakub gave me few hints, so I'll ask guys is our team to look on them first. From calling sequence in the logs (mke2fs, e2fsck, resize, e2fsck, resize, e2fsck) it looks like the last e2fsck fails.
Ok, try to get it down to a filesystem image + e2fsck call which fails; if it is an uninit var it might be discoverable via valgrind on another platform. I'm not sure how to arrive at your test platform so I'm not sure how to dig into this one. If you have a box/chroot I could log into and poke around in, I'd be willing. Maybe find me on IRC if you want me to log in and we can hash out details. :) Thanks, -Eric
This seems to reproduce it, but not with a stock ./configure && make #!/bin/bash MKFS="misc/mke2fs -t ext4" RESIZE2FS=resize/resize2fs E2FSCK=e2fsck/e2fsck DEBUGFS=debugfs/debugfs.static rm -f fsfile.img truncate -s 2G fsfile.img $MKFS -O '^resize_inode' -b 1024 -g 512 -qF fsfile.img 64M OUT_TMP=tmpfile date > $OUT_TMP cat $E2FSCK >> $OUT_TMP # slightly populate the fs $DEBUGFS -w fsfile.img << EOF mkdir test cd test write $OUT_TMP e2fsck ls /test stat /test/e2fsck quit EOF rm -f $OUT_TMP $E2FSCK -fn fsfile.img || echo "initial fsck failed" $RESIZE2FS fsfile.img 2G $E2FSCK -fn fsfile.img || echo "post-resize fsck failed" $RESIZE2FS -M fsfile.img 2G $E2FSCK -fn fsfile.img || echo "post-resize -M fsck failed"
On x86_64, both resize2fs invocations yield this under valgrind: ==20334== Invalid read of size 4 ==20334== at 0x406E70: main (main.c:472) ==20334== Address 0x4c26e08 is 40 bytes inside a block of size 296 free'd ==20334== at 0x4A063F0: free (vg_replace_malloc.c:446) ==20334== by 0x406679: resize_fs (resize2fs.c:211) ==20334== by 0x406FFF: main (main.c:457)
Ok, that's not the root cause, that's just a use after free when printing the new size: commit deae5e809b524a3cca3ecf66be28058134575a02 Author: Theodore Ts'o <tytso> Date: Wed Oct 8 12:09:35 2014 -0400 resize2fs: fix fs->blocksize dereference after fs has been freed Commit 77255cf36944b introduced a use after free bug. Signed-off-by: Theodore Ts'o <tytso>
For posterity, the way this resize test fails is: ... Pass 5: Checking group summary information Group 12 block(s) in use but group is marked BLOCK_UNINIT Fix? no Block bitmap differences: +6145 Fix? no fsfile.img: ********** WARNING: Filesystem still has errors ********** fsfile.img: 13/2816 files (0.0% non-contiguous), 5491/10962 blocks post-resize -M fsck failed
I'm stumped. :/
Do we know if the fs image after the "resize -M" is correct and it is the e2fsck what's buggy? Or is the fs image already corrupted?
The image checks cleanly prior to the resize2fs -M, and badly after it. (I suppose it's possible that the first e2fsck missed it but that seems unlikely) I've tried racing through where we should get to the point where we clear the UNINIT flag, and ... things get wonky. I'm not sure if that's my failing, or gcc's :)
ok, time for the hard way of "bisecting" through the sources then :-) Thanks for your time, Eric.
little progress - switching e2fsck/rehash.c to -O1 makes the problem go away, trying to find the function now ...
so there are 2 things that help - switching 3 functions in rehash.c to O1 (see below) or rebuilding e2fsck with -fno-strict-aliasing, so it still might be a coding issue --- rehash.c 2015-05-12 07:28:36.000000000 -0400 +++ /home/sharkcz/e2fsprogs-sandeen/e2fsprogs-1.42.12/e2fsck/rehash.c 2014-08-02 16:26:22.000000000 -0400 @@ -276,7 +274,7 @@ * expand the length of the filename beyond the padding available in * the directory entry. */ -__attribute__((optimize("O1"))) static void mutate_name(char *str, __u16 *len) +static void mutate_name(char *str, __u16 *len) { int i; __u16 l = *len & 0xFF, h = *len & 0xff00; @@ -549,7 +547,7 @@ * This function takes the leaf nodes which have been written in * outdir, and populates the root node and any necessary interior nodes. */ -__attribute__((optimize("O1"))) static errcode_t calculate_tree(ext2_filsys fs, +static errcode_t calculate_tree(ext2_filsys fs, struct out_dir *outdir, ext2_ino_t ino, ext2_ino_t parent) @@ -634,7 +632,7 @@ /* * Helper function which writes out a directory block. */ -__attribute__((optimize("O1"))) static int write_dir_block(ext2_filsys fs, +static int write_dir_block(ext2_filsys fs, blk64_t *block_nr, e2_blkcnt_t blockcnt, blk64_t ref_block EXT2FS_ATTR((unused)),
The code doesn't look to be aliasing safe. Haven't spent much time on it, but e.g. struct ext2_dx_entry { __u32 hash; __u32 block; }; struct ext2_dx_countlimit { __u16 limit; __u16 count; }; ... dx_ent = set_int_node(fs, block_start); limit = (struct ext2_dx_countlimit *) dx_ent; c2 = limit->limit; root_offset += sizeof(struct ext2_dx_entry); c1--; } dx_ent->block = ext2fs_cpu_to_le32(i); if (c2 != limit->limit) dx_ent->hash = ext2fs_cpu_to_le32(outdir->hashes[i]); dx_ent++; c2--; } limit->count = ext2fs_cpu_to_le16(limit->limit - c2); limit->limit = ext2fs_cpu_to_le16(limit->limit); So, set_int_node writes some chunk of memory using ext2_dx_countlimit struct, and then mixes reads and writes from it using both struct. Not saying this is exactly the problem spot, the code seems to be full of issues like that.
Should we just add a -fno-strict-aliasing to the top-level Makefile's BUILD_CFLAGS? (Given that the code seems to be full of issues like that) :) -Eric
Supposedly yes, that is the quickest solution. If you'd want to know why exactly it misbehaved and have proper analysis, it would require somebody to spend time on the concurrent debugging sessions of both -fno-strict-aliasing and -fstrict-aliasing built compiled e2fsck and see when the two start to diverge in their decisions and with those details let me have a look at that. We don't (yet) have an strict aliasing sanitizer, and even if we would, I bet we'd certainly flag the above as violation, because the accesses aren't through a union and the aliasing incompatible structs overlap each other.
Sure, I'm not trying to argue that these things shouldn't be fixed, but on the other hand, it'd be good to get a non-broken build sooner rather than later. :) I'll certainly take a look at your example above, but like you said, I think it's one of many.
Also a question is if this kind of programming style (reusing the same memory for different structs without using unions) is in all files, or just in e2fsck, or just in a subset of e2fsck.
(In reply to Eric Sandeen from comment #26) > Sure, I'm not trying to argue that these things shouldn't be fixed, but on > the other hand, it'd be good to get a non-broken build sooner rather than > later. :) I'll certainly take a look at your example above, but like you > said, I think it's one of many. I would just add -fno-strict-aliasing to the spec file with a reference to this bz and left the real solution for later. And switch this bug to e2fsprogs and leave it open, so it's tracked.
Well, e2fsck has to cope with every possible bit of metadata; things like debugfs more or less do, as well, usually via libext2. So this style is probably fairly common. I'd have to go look. ISTR that the directory indexing / htree code is the most frequent culprit. There have been other fixes in the past, i.e.. c816ecb e2fsprogs: fix type-punning warnings 2694f31 Fix type punning bugs in ext2fs_get_mem() and ext2fs_free_mem() -Eric
Erix, shall we do a new build with -fno-strict-aliasing set globally? I think it is the only safe option now.
Sorry, this fell off my TODO list somehow. Yes, I can set that and rebuild. Doing it for rawhide is enough, right? -Eric
(In reply to Eric Sandeen from comment #31) > Sorry, this fell off my TODO list somehow. Yes, I can set that and rebuild. > Doing it for rawhide is enough, right? > > -Eric no problem :-) Yes, rawhide for sure, not so sure about f22, because I think we are only lucky there, same gcc, potentially the same problem.
Ok built in rawhide now. Sorry for the delay! e2fsprogs-1.42.13-2.fc23