Created attachment 472171: Patch to remove hard-coding optimizer flags to -O2/-Os in multiple places

Description of problem:
Build from the source package fails with a segfault on the ARM platform (armv5tel). There are several problems, some of which I am attaching patches for.

1) CFLAGS passing is a complete mess. CFLAGS are being passed from:
a) rpmrc (good)
b) spec (bad)
c) the following files in the source distribution (very bad):
   - dietlibc-0.32/findcflags.sh
   - dietlibc-0.32/contrib/Makefile.dyn
   - dietlibc-0.32/libpthread/Makefile

Version-Release number of selected component (if applicable):
Tested on F12 and F13 ARM rootfs distros (gcc 4.4.2 and 4.4.4 respectively), with dietlibc packages from F12, F13 and F14; all behave in exactly the same way.

dietlibc-0.32-0.fc12.src.rpm (F12/F13)
dietlibc-0.32-1400.fc14.src.rpm (F14)

Patches provided are against the F14 package (dietlibc-0.32-1400.fc14.src.rpm) because it is the most recent one.

This matters because building the "diet" binary fails when -Ox, x={s,2,3}, is used. The source distribution provides some mechanisms for handling this in the Makefile, but they are largely clobbered by the extra -O flags being passed in.

How reproducible:
Every time.

Steps to Reproduce:
rpmbuild --rebuild dietlibc-0.32-1400.fc14.src.rpm

Actual results:
Build fails because the binary "bin-arm/diet" built during the build segfaults. Here are the last few lines:

----snip----
bin-arm/diet gcc -D__dietlibc__ -O2 -g -march=armv5te -fomit-frame-pointer -fno-exceptions -fno-asynchronous-unwind-tables -fno-stack-protector -Os -g3 -Werror-implicit-function-declaration -o bin-arm/elftrunc contrib/elftrunc.c
make: *** [bin-arm/elftrunc] Segmentation fault
error: Bad exit status from /var/tmp/rpm-tmp.Y1qRwn (%build)
----snap----

Even when the attached patches are applied, the regular build still fails with segfaults during the %check stage.
I looked into the first failing test briefly, and gdb reveals the following:

----snip----
# gdb atexit
(gdb) run
Starting program: /usr/src/redhat/BUILD/dietlibc-0.32/test/atexit

Program received signal SIGSEGV, Segmentation fault.
0x00008230 in ?? ()
(gdb) backtrace
Cannot access memory at address 0x0
#0  0x00008230 in ?? ()
#1  0x000081dc in __libc_exit (code=0) at lib/atexit.c:25
#2  0x00008104 in _start () at arm/start.S:34
----snap----

So the problem seems to be in arm/start.S on line 34.

----snip----
_start:
        mov     fp, #0                  @ clear the frame pointer
        ldr     a1, [sp], #4            @ argc
        mov     a2, sp                  @ argv
        ldr     ip, .L3
        add     a3, a2, a1, lsl #2      @ &argv[argc]
        add     a3, a3, #4              @ envp
        str     a3, [ip, #0]            @ environ = envp
        bl      main

@
@ The exit status from main() is already in r0.
@ We need to branch to 'exit' in case we have linked with 'atexit'.
@
        bl      exit
----snap----

Line 34 is the one with "bl exit" on it.

Expected results:
Binary RPM package should be generated.

Additional info:
Attached patches (package patch and a spec file patch) make the package build (just about), but even so, debuginfo.list doesn't get generated, and the %check stage fails in a number of places, so the only way to actually get this to build the binary RPMs is to use:

rpmbuild --define='%check exit 0' -bb dietlibc.spec --define='%debug_package %{nil}'

which skips the %check stage self-test and the building of the debug packages.

This package also fails to build cleanly even on x86. On ARM the problem is just much worse (many more self-tests fail with segfaults).
Created attachment 472172: Patch to remove hard-coded -Os -g3 CFLAGS flags in the spec file

-g3 doesn't appear to be documented in gcc.

-O flags should come from the defaults in rpmrc or, in exceptional cases, be overridden in the build source where necessary. In this instance most compiler invocations ended up with both -O2 and -Os (sometimes twice), which was breaking some code.
Can you please try the rawhide version? You might need to apply the last two patches from github:

https://github.com/ensc/dietlibc/commit/749ea37e7793f58be8f0131b82d1affd249de244.patch
https://github.com/ensc/dietlibc/commit/0fb8d66c33252c784d3e0a5d16d1b78095c92d92.patch

If this version segfaults too, please attach one of the crashing binaries and the complete build log.
Created attachment 472294: Rawhide build log with the two git patches applied.
Created attachment 472295: segfaulting bin-arm/diet
It would appear that at least part of the problem that dietlibc has on the ARM architecture comes from alignment violations. Running a build + test suite (with the patches I have attached here, which is the only way to make it build successfully on ARM) results in over a million alignment violations being logged. Each one will result in corrupted data being retrieved. That's a lot of corrupted data.

Alignment is an ARM-specific issue. It can be partially worked around by enabling auto-fixing of alignment in the kernel, but this comes with a significant performance penalty, so it isn't really acceptable.

To try this, check /proc/cpu/alignment on an ARM machine before and after building the packages and building the test suite. Some ARM CPUs have automatic fixing for this in hardware, so /proc/cpu/alignment will never show any violations, but it still slows things down even when done in hardware. A SheevaPlug is a good example of a machine with no hardware alignment fixer, where these errors will clearly show up.

On a separate note, the rawhide version with the two git patches seems to pass a lot more of the self-tests than before, but still fails on a lot of them, and two of those are segfaults, even with the auto-fixing of alignment enabled in the kernel.

The alignment traps catch alignment violations in the following test files:
test-canon
tst-limits
tst-printf
tst-rand48
tst-sscanf
tst-strtod
tst-strtol
tst-strtoll

The test suite files that segfault are:
atexit
tst-printf

I will attach those segfaulting binaries and the build log with the two previously attached patches applied.
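The before/after check of /proc/cpu/alignment described above can be scripted. What follows is only a minimal Python sketch, assuming the file consists of "Name: value" lines (for example "User:", "Half:", "Word:"); the function names are made up for illustration, and the exact set of counters may differ between kernels.

```python
import re


def parse_alignment_counters(text):
    """Parse 'Name: value' lines into a dict of integer counters.

    Assumption: each line looks like 'Half:   123'. Trailing text after
    the number (such as a mode description) is ignored.
    """
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(\d+)", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def alignment_delta(before, after):
    """Return only the counters whose value changed between two snapshots."""
    old = parse_alignment_counters(before)
    new = parse_alignment_counters(after)
    return {
        name: value - old.get(name, 0)
        for name, value in new.items()
        if value != old.get(name, 0)
    }


def read_alignment(path="/proc/cpu/alignment"):
    """Read the current snapshot (only meaningful on an ARM Linux machine)."""
    with open(path) as handle:
        return handle.read()
```

Typical use would be to call read_alignment() once before the build, once after the test suite has run, and pass both snapshots to alignment_delta(); an empty result suggests no alignment fix-ups were counted during the run.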
Created attachment 472323: Build log with the optimizer flag patches applied
Created attachment 472324: segfaulting atexit
Created attachment 472325: segfaulting tst-printf
please try https://github.com/ensc/dietlibc/commit/f0e2369eb745de768e0fc4195b0e65c64392dcbc.diff
That last patch fixes the diet executable segfault. It also seems to have fixed the tst-printf segfault. However, mmap_test now segfaults and throws misaligned access errors on ARM. This doesn't happen if my optimizer flag patch is applied. I will attach the build log and the mmap_test binary.
Created attachment 472371: build log with patch in comment 9 applied
Created attachment 472372: Segfaulting mmap_test
Also, with the optimizer flag patches applied, the mmap_test segfault goes away and mmap_test stops generating a misaligned access. The atexit segfault remains. Interestingly, however, applying them causes tst-strtod to start segfaulting. Build log and tst-strtod binary are attached.
Created attachment 472375: Build log with the 3 git patches and the 3 optimizer flag patches applied
Created attachment 472376: segfaulting tst_strtod
I detected and fixed two problems in mmap_test: the mmap() function was completely broken, and exit() executes random code (which caused the segfault). https://github.com/ensc/dietlibc/commit/542652118de1889d18c4608f1a31a0e4ee640f5d.diff https://github.com/ensc/dietlibc/commit/6747a03d7683e970c35ac147a7dfc16217b024ac.diff
Created attachment 472395: Build log with only the git patches so far (no -O fixup)

Attached is the build log with only the provided git patches applied. My optimizer flag patches weren't applied in this build. tst-strtod segfaults.
Created attachment 472396: segfaulting tst_strtod
tst-strtod segfaults on other architectures too; the cause is an endless recursion in __dtostr(). That must be fixed another time...

Otherwise, can you test rawhide please? I fixed time(2) + getrlimit(2) issues there, and it would be nice if these changes (especially getrlimit()) could be tested on a real platform.
OK, thank you for clarifying the tst-strtod failure. What about the other 6-7 unknown test failures as per the build logs above? If these are also expected, is it also expected that the package has to be built with --define='%check exit 0'? If so, then perhaps it should be set in the spec file until the relevant fix-ups are in place.

Also, what about the optimizer flags? Passing multiple -O and -g parameters is confusing at the least. In a number of cases gcc is invoked with -O2 -g -Os -g3. Also, in findcflags.sh, -march is set purely according to the gcc version number, which doesn't seem right. Surely -march should only come from rpmrc, should it not?
Re: Rawhide, I have been testing with the rawhide package 0.33-1500 since you first mentioned it in comment 2. Is there a newer rawhide version now? I can't seem to see it on my local mirror at the moment.
http://koji.fedoraproject.org/koji/buildinfo?buildID=213379

See http://pkgs.fedoraproject.org/gitweb/?p=dietlibc.git;a=blob;f=runtests-X.sh;h=a0dbfc46d17d32f320a410b9c7a00e298a67d5d1;hb=HEAD for explanations of failed tests. Expected failures are ignored, so %check should succeed.

Multiple -O or -g flags are ok (last one wins). I will remove the -g3 in one of the next releases (it was added to debug something, afair).

Results from findcflags.sh are ignored (they are overridden by the CFLAGS make option).
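To see which flags actually take effect under the "last one wins" rule, the gcc command lines in a build log can be reduced to their final -O and -g options. The following is a rough Python sketch; the function name effective_flags is hypothetical, and it only recognises plain -O<level> and -g / -g<digit> options, ignoring other debug variants such as -ggdb.

```python
import shlex


def effective_flags(command_line):
    """Return (last -O option, last -g option) from a gcc command line.

    Either element is None when no such option is present. This is a
    simplification: only '-O...' and '-g' / '-g<digits>' are considered.
    """
    optimisation = None
    debug = None
    for token in shlex.split(command_line):
        if token.startswith("-O"):
            optimisation = token
        elif token == "-g" or (token.startswith("-g") and token[2:].isdigit()):
            debug = token
    return optimisation, debug
```

Applied to the elftrunc compile line quoted in the original report, this yields -Os and -g3, which is the pair gcc would honour there if the rule holds.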
I understand that some tests are known to fail, but if you look at the build logs I attached, you'll see a few tests failed that WEREN'T expected to fail, and the package build still fails.

Also, I notice you mention you fixed something to do with time(2). Interestingly, when I try to build util-vserver, which builds against dietlibc by default, it fails to link with this error:

diet -Os gcc -O2 -g -march=armv5te -std=c99 -Wall -pedantic -W -funit-at-a-time -o src/filetime src/filetime.o lib/libvserver.a
src/filetime.o: In function `main':
/root/rpmbuild/BUILD/util-vserver-0.30.216-pre2926/src/filetime.c:74: undefined reference to `time'
collect2: ld returned 1 exit status
make[2]: *** [src/filetime] Error 1
make[2]: Leaving directory `/usr/src/redhat/BUILD/util-vserver-0.30.216-pre2926'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/src/redhat/BUILD/util-vserver-0.30.216-pre2926'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.qfEQdh (%build)

It builds fine against glibc. Can you hazard a guess as to what might be going wrong? I thought it might be related, since it complains about an undefined reference to "time" when linking against dietlibc but not when linking against glibc.
All the unexpectedly failing tests seem to be due to "No such file or directory". This should be fixed in the comment 22 build, which adds previously missing time(2) + getrlimit(2) implementations.
Oh, my bad. Are the other three patches from git also required against that package? Or are they already rolled in there?
I just tried the 0.33-1502 build. Builds cleanly on ARM, and doesn't trigger any alignment issues at all - awesome. :)

Still can't build util-vserver against it, though:

diet -Os gcc -O2 -g -march=armv5te -std=c99 -Wall -pedantic -W -funit-at-a-time -o src/lockfile src/lockfile.o
src/lockfile.o: In function `main':
/root/rpmbuild/BUILD/util-vserver-0.30.216-pre2926/src/lockfile.c:124: undefined reference to `alarm'
collect2: ld returned 1 exit status
make[2]: *** [src/lockfile] Error 1
make[2]: Leaving directory `/usr/src/redhat/BUILD/util-vserver-0.30.216-pre2926'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/src/redhat/BUILD/util-vserver-0.30.216-pre2926'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.xSKMxN (%build)

I take it alarm() isn't implemented yet on ARM?
Please try recent rawhide git (http://koji.fedoraproject.org/koji/taskinfo?taskID=2722215)
I cannot find the 0.33-1503 source rpms in koji. Can you provide a direct link?
Any chance of a src.rpm for this?
OK, I've finally found 0.33-1504 in rawhide. It builds OK with alignment fix-up enabled, but something still causes alignment errors. Thousands of these flood the logs. From a single build run I see 32904 of these, all spewed in a 5 second window:

Alignment trap: lt-regression-t (30975) PC=0x4014a014 Instr=0xe0d310b2 Address=0x000e23d7 FSR 0x001
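With tens of thousands of such lines, it helps to group them by process name and faulting PC rather than read them one by one. Below is a small Python sketch that does this; the regular expression is modelled only on the single trap line quoted above, so other kernels may format the message differently, and summarize_traps is a made-up name.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted in this report (an assumption
# about the general format, not a kernel-documented layout).
TRAP_RE = re.compile(
    r"Alignment trap: (?P<comm>\S+) \((?P<pid>\d+)\) "
    r"PC=0x(?P<pc>[0-9a-fA-F]+) Instr=0x(?P<instr>[0-9a-fA-F]+) "
    r"Address=0x(?P<addr>[0-9a-fA-F]+)"
)


def summarize_traps(log_text):
    """Count alignment traps per process name and per faulting PC."""
    by_process = Counter()
    by_pc = Counter()
    for match in TRAP_RE.finditer(log_text):
        by_process[match.group("comm")] += 1
        by_pc[int(match.group("pc"), 16)] += 1
    return by_process, by_pc
```

If nearly all traps share one PC, as a 5 second burst suggests, a single instruction in a loop is the likely culprit, and that address can be looked up in gdb.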
The alignment fault appears to happen in test tst-strtod, which segfaults anyway, so the build actually still succeeds even with alignment fix-up disabled. This may, however, indicate an additional fault somewhere, on top of the known cause of the segfault.
Here is what gdb says:

# gdb --quiet tst-strtod core.20299
Reading symbols from /usr/src/redhat/BUILD/dietlibc-0.33.20101223/test/stdlib/tst-strtod...done.
[New Thread 20299]
Core was generated by `./tst-strtod'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000c448 in __aeabi_dcmple ()

Backtrace attached in a separate file.
Created attachment 479735: backtrace of alignment-faulting tst-strtod

Lines like this continue pretty much indefinitely, so the file is truncated. I killed GDB when the output got to 10MB.
Also, in the context of the util-vserver build mentioned earlier, which requires dietlibc, it now gets further, but still fails:

diet -Os gcc -O2 -g -march=armv5te -std=c99 -Wall -pedantic -W -funit-at-a-time -o src/vunify src/vunify.o lib_internal/libinternal-diet.a lib/libvserver.a
/usr/lib/dietlibc/lib-arm/libc.a(utime.o): In function `utime':
(.text+0x18): undefined reference to `__NR_utime'
collect2: ld returned 1 exit status

Does that mean that particular function isn't implemented in dietlibc on ARM yet?
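Since the missing symbols have so far surfaced one per build attempt (time, alarm, and now __NR_utime), it may be quicker to collect every "undefined reference" from a full log in one pass, for example from a build run with make -k. A minimal Python sketch, with a hypothetical helper name and a pattern based only on the GNU ld messages quoted in this report:

```python
import re

# Matches messages of the form: undefined reference to `symbol'
UNDEF_RE = re.compile(r"undefined reference to [`'](?P<sym>[^`']+)'")


def undefined_symbols(log_text):
    """Return the sorted, de-duplicated symbols named in linker errors."""
    return sorted({m.group("sym") for m in UNDEF_RE.finditer(log_text)})
```

The resulting list could then be checked against the syscalls dietlibc provides for ARM, rather than discovering them one rebuild at a time.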
Try again; I hope that all missing syscalls are now available...
0.33-1505 doesn't seem to build successfully on ARM:

gcc -D__dietlibc__ -I. -isystem include -O2 -g -march=armv5te -fomit-frame-pointer -fno-exceptions -fno-asynchronous-unwind-tables -fno-stack-protector -Os -g3 -Werror-implicit-function-declaration -c lib/__utime.c -o bin-arm/__utime.o -D__dietlibc__
lib/__utime.c: In function 'utime':
lib/__utime.c:8: error: implicit declaration of function 'utimes'
make: *** [bin-arm/__utime.o] Error 1
0.33-1600 seems to resolve all of the build problems on ARM I mentioned so far. :) The unaligned access is gone and the build errors I was seeing due to missing functions are gone, too. There is one other problem that cropped up (a bus error) that I need to investigate further as I am not certain whether it is an issue in dietlibc.
Doing a little more digging into that bus error, it's possible that there may be something funny happening in this specific case on ARM in mmap or madvise. See the thread I posted here:

http://archives.linux-vserver.org/201102/0058.html

It also seems reminiscent of this bug (symptoms are identical, but there isn't much attached in the bug report that would indicate whether it is in fact a similar issue):

https://bugzilla.redhat.com/show_bug.cgi?id=442346

The package I am building against dietlibc is this one (rpmbuild -tb):

http://people.linux-vserver.org/~dhozac/t/uv-testing/util-vserver-0.30.216-pre2935.tar.bz2
I found a bug in dietlibc's sigjmp() code which might be responsible for the issue you saw. Please try recent rawhide.
The latest rawhide (dietlibc-0.33-0.1600.20110311.fc16.src.rpm) builds cleanly now on F13/armv5tel, and util-vserver builds cleanly against it. Thank you for fixing this, please close the bug.
Any chance that this latest rawhide version could be pushed to F13, F14 and F15? Otherwise we won't have a working dietlibc in the ARM Fedora distro until F16.
This message is a reminder that Fedora 13 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 13 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.