Description of problem: When I'm trying to build rawhide coreutils package with koji, it fails when make build done immediately after commit and make tag (output in http://koji.fedoraproject.org/koji/taskinfo?taskID=565149). When I downloaded the srpm from koji and tried with koji build --scratch dist-f9 <srpm> , it passed without troubles (http://koji.fedoraproject.org/koji/taskinfo?taskID=565190). It is not the first time - therefore I created the bugzilla as it seems to be not the random failure with rawhide (I never had this failure with F8/F7), could be some problem with SELinux on some build machines(but I'm not sure). Any other idea? Version-Release number of selected component (if applicable): Note: I'm not sure which component should I choose about that issue, therefore I have selected koji. Feel free to reattach the component to correct one. How reproducible: Sometimes, not always Steps to Reproduce: 1. make tag && make build for coreutils package 2. 3. Actual results: Failure as in http://koji.fedoraproject.org/koji/taskinfo?taskID=565149 Expected results: Pass as in http://koji.fedoraproject.org/koji/taskinfo?taskID=565190 Additional info: I don't know about any possibility how to check files in buildroot(as there will be some output about the error in no-create-missing.log) - because afaik there is no account for accessing it. Strange fact is that after few hours build usually passes without failure without any change (the test is in fact just simple test with touch, nothing which can fail in common conditions).
Created attachment 304003 [details] strace of test case failure Few additional informations: - problem reproducible ONLY but ALWAYS by building coreutils srpm between 10AM UTC and ~3PM UTC - problem caused by wrong return code from utimensat() call (280 instead of -1 EBADF) - therefore could be RHEL-5 xen kernel issue (but unsure)
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
hrm, this is a tough one. I don't think it's anything in koji specific, especially given that it's a time issue. I'm going to punt to RHEL5 kernel.
Just want to mention coreutils-6.12 issues (#449910) Could be problem on koji build machine as well, as most of the failures with the current version I got on certain machine and architecture ( Host xenbuilder2.fedora.redhat.com Arch x86_64 ). Problem does occur not only for utimensat() call, but for utimens() as well. Return code is always 280. Current situation is to workaround that buggy return code(replace by zero to skip failure), although this could break timestamps (as is not sure if utimens() did his job. Build log with the new failure is available e.g. http://koji.fedoraproject.org/koji/taskinfo?taskID=649923 , with strace in one test case and additional debug info enabled.
Current workaround (covering return codes 280 from both - utimensat and futimens, attached in #449910) is available in coreutils-6.12-3.fc10, seems to be working ok so far - so it's likely that the priority of the problem is slightly lowered. Anyway it would be great to know what is causing that strange behaviour.
Ugh. The best I can get out of a mock build of coreutils-6.12-3.fc10 against Fedora 9 within a RHEL5.2 kvm guest is: =========================================== 4 of 335 tests failed (73 tests were not run) See tests/test-suite.log Please report it to bug-coreutils =========================================== What's the magic to get this building, period? Guess I should try with a physical machine and a 5.1 kernel next...
Yep mock builds of coreutils are quiet strange. On my FC-6 machine some tests (like runcon-no-reorder) fail always. Anyway-that issue with utimensat/futimens call I had only on koji build machines (and as I mentioned, usually only between 10AM - 3PM UTC). The most failing machines were xenbuilder2 and xenbuilder4 (I don't remember failure on different build machine).
Hi Jarod, Would you please attach tests/test-suite.log (or, slightly better for me/upstream, post it to bug-coreutils)? I'll be happy to look into it. Ondrej, this is the first I've heard about the runcon-no-reorder failure. Same deal there. I'd appreciate any details ;-)
Ondrej, thanks for the private strace, which showed runcon getting ENOENT like this: open("/selinux/context", O_RDWR|O_LARGEFILE) = -1 ENOENT (No such file or directory) which is probably because mock is running in an incomplete chroot, i.e., one in which /selinux/context does not exist. If that is indeed the problem, I wonder if that can be considered a system configuration bug.
Looks like I might be hitting the same failure as Ondrej on FC6. The four tests that failed in my case: FAIL: misc/runcon-no-reorder.log (exit: 1) FAIL: cp/cp-a-selinux.log (exit: 1) FAIL: cp/preserve-gid.log (exit: 1) FAIL: rm/fail-2eperm.log (exit: 1) This is using mock from EPEL on RHEL5. Apparently, the actual builders use something newer that handles the selinux stuff better... Somehow, I got an outdated version of mock installed. I'll hold off on sending anything more until I get things rolling with the correct mock version. :)
Drat. Got the same failure w/mock 0.9.7 (which matches the builders) on rhel5 and selinux in permissive mode trying to build coreutils-6.12-3.fc10. Log files forthcoming... Will attach them here and send them to the upstream list.
Created attachment 309269 [details] mock build.log from coreutils build failure
Created attachment 309270 [details] coreutils test-suite.log from failed mock build
Upstream discussion thread: http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/13803
Thanks again, Jarod, I've pushed patches for the remaining two problems: http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/13803/focus=13841
Excellent, with all patches in play, a mock build completes successfully. Now to back out the work-around mentioned in comment #5 and see what still breaks... :)
After some further poking, it seems the work-around we have in play in rawhide coreutils is probably the right answer for now. There simply is no utimensat() support in the RHEL5 kernel -- this syscall doesn't exist there, period. 280 (the funky error number we were seeing) is the syscall number for it upstream on x86_64, but the RHEL5 syscall table ends at 279. I'm thinking its not worth backporting if this is the only place its an issue for us. utimensat() debuted in 2.6.22, initial upstream commit here: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1c710c896eb461895d3c399e15bb5f20b39c9073
hi Jarod, I confess I haven't looked recently, but what you report makes me think of incomplete syscall support or missing glibc wrapper. coreutils tests for the existence of the function and finds it. In that case, it should either work as advertised or it return -1 and sets errno = ENOSYS. FYI, this affects futimens, too, which might end up being used more than utimensat.
From irc discussion: There's a link test done by coreutils to check for utimensat() support: ----8<---- #define utimensat innocuous_utimensat #ifdef __STDC__ # include <limits.h> #else # include <assert.h> #endif #undef utimensat #ifdef __cplusplus extern "C" #endif char utimensat (); #if defined __stub_utimensat || defined __stub___utimensat choke me #endif int main () { return utimensat (); return 0; } ----8<---- If gcc exits with a return code of 0, in theory, we have the needed support. Inside the mock chroot: mock-chroot> gcc -o utimensat-test utimensat-test.c mock-chroot> echo $? 0 Outside the mock chroot, directly on a bare-metal RHEL5.2 kernel: $ gcc -o utimensat-test utimensat-test.c /tmp/cc88x9qi.o: In function `main': utimensat-test.c:(.text+0xa): undefined reference to `utimensat' collect2: ld returned 1 exit status $ echo $? 1 So it appears we're not actually testing directly against the capabilities the kernel claims to have, or we should get the same failure inside the chroot.
The fun is that we're building against 2.6.26-ish kernel headers, but actually running on a 2.6.18 kernel. The 2.6.26 kernel headers link fine, because they properly say that a 2.6.26 kernel has the needed support. Meanwhile, we try to run tests using a binary build for a 2.6.26 kernel on a 2.6.18 kernel. Kaboom. So its a somewhat expected corner case failure. There's still some question as to why we're getting the 280 return code, need to investigate exactly where that's coming from.
Actually executing the utimensat-test within the mock chroot seems to get the appropriate exit code: execve("./utimensat-test", ["./utimensat-test"], [/* 21 vars */]) = 0 brk(0) = 0x1a3ec000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b4 293bb7000 uname({sys="Linux", node="xw4400.usersys.redhat.com", ...}) = 0 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0644, st_size=10691, ...}) = 0 mmap(NULL, 10691, PROT_READ, MAP_PRIVATE, 5, 0) = 0x2b4293bb8000 close(5) = 0 open("/lib64/libc.so.6", O_RDONLY) = 5 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \345\1\0\0\0\0\0"..., 83 2) = 832 fstat(5, {st_mode=S_IFREG|0755, st_size=1510768, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b4 293bbb000 mmap(NULL, 3617400, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x2b 4293db8000 mprotect(0x2b4293f23000, 2093056, PROT_NONE) = 0 mmap(0x2b4294122000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENY WRITE, 5, 0x16a000) = 0x2b4294122000 mmap(0x2b4294127000, 17016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANON YMOUS, -1, 0) = 0x2b4294127000 close(5) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b4 29412c000 arch_prctl(ARCH_SET_FS, 0x2b429412c6f0) = 0 mprotect(0x2b4294122000, 16384, PROT_READ) = 0 mprotect(0x2b4293db6000, 4096, PROT_READ) = 0 munmap(0x2b4293bb8000, 10691) = 0 utimensat(1, "%^\<F1>^V<FF>^?", {{140733578288182, 140733578288217}, {1407335782 88228, 140733578288244}}, 0) = -1 ENOSYS (Function not implemented) exit_group(-1) = ? Now to look into what the failing tests are doing differently...
Hi Jarod, note comment #1 where it says the failure depends on the time of day at which the test is run.
Yeah, I just realized that a bit ago, and noticed I was outside that window. I've reset the clock on the host to something in that window and am re-testing now.
Hrm, clock change doesn't seem to have made a lick of difference to the return code. This part could well be xen-specific, will try yet again on a xen kernel...
Okay, so I'm convinced this is a xen bug. Inside a xen guest, I get: mock-chroot> strace ./utimensat-test execve("./utimensat-test", ["./utimensat-test"], [/* 21 vars */]) = 0 brk(0) = 0x16049000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b04f2a96000 uname({sys="Linux", node="localhost.localdomain", ...}) = 0 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0644, st_size=10750, ...}) = 0 mmap(NULL, 10750, PROT_READ, MAP_PRIVATE, 5, 0) = 0x2b04f2a97000 close(5) = 0 open("/lib64/libc.so.6", O_RDONLY) = 5 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \345\1\0\0\0\0\0"..., 832) = 832 fstat(5, {st_mode=S_IFREG|0755, st_size=1510768, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b04f2a9a000 mmap(NULL, 3617400, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x2b04f2c97000 mprotect(0x2b04f2e02000, 2093056, PROT_NONE) = 0 mmap(0x2b04f3001000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x16a000) = 0x2b04f3001000 mmap(0x2b04f3006000, 17016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b04f3006000 close(5) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b04f300b000 arch_prctl(ARCH_SET_FS, 0x2b04f300b6f0) = 0 mprotect(0x2b04f3001000, 16384, PROT_READ) = 0 mprotect(0x2b04f2c95000, 4096, PROT_READ) = 0 munmap(0x2b04f2a97000, 10750) = 0 utimensat(1, "�+��", {{140736280603661, 140736280603692}, {140736280603703, 140736280603719}}, 0) = 280 exit_group(24) = ?
Does this sound at all familiar to any xen folks?
I seem to have somewhat of an explanation why this hasn't been seen on i686... In the i686 case, syscalls are passed unaltered to the kernel. On x86_64, they all have to pass through the hypervisor, which does some mucking with things[*]... Now, based on xen/arch/x86/x86_64/entry.S, if I'm reading this right (I don't know assembly or the xen hv for squat), it looks like its *supposed* to ultimately return an -ENOSYS here when we leave syscall_enter, but for whatever reason, its returning the attempted syscall number instead. Of course, like I said, I know neither assembly nor the xen hv very well, so I could be barking up the wrong tree... [*] http://lists.xensource.com/archives/html/xen-devel/2006-11/msg01246.html
Jarod, Yep, your explanation about how syscalls work on i686 vs. x86_64 is (almost) spot-on. For i686, kernel is in ring1 and userland is in ring3; when the guest boots, the hypervisor and guest kernel set it up so that syscalls do a direct ring transfer from ring3 to ring1; the hypervisor is not involved at all. On x86_64, you can't do this because there is no ring1 (not technically true, but true enough for this explanation). So both the kernel and userland live in ring3, protected from each other by page tables. When userland syscalls, it traps to the hypervisor, which swaps out the page tables for the kernel page tables, and resumes inside the guest kernel in ring3 again. Kind of odd that this new utimenstat() is returning such a bogus number though. You would think we would have a problem all of the time, not just between 10pm and 3am, no? In any case, this should be easy enough to find, if it is indeed the HV mucking up the system call somehow. Is there a test binary we can use to show this? Chris Lalancette
Hey Chris! Yep, I've been able to reproduce the 280 return code (between 8am and 3pm, in my own experience) using the following code snippet: ----8<---- #define utimensat innocuous_utimensat #ifdef __STDC__ # include <limits.h> #else # include <assert.h> #endif #undef utimensat #ifdef __cplusplus extern "C" #endif char utimensat (); #if defined __stub_utimensat || defined __stub___utimensat choke me #endif int main () { return utimensat (); return 0; } ----8<---- This is more or less the exact same test code that coreutils uses to test linking functionality. The code has to be compiled and run inside a rawhide (or f9) chroot to trigger, but its relatively easy to get going with mock. On an x86_64 RHEL5 xen box: 1) rpm -ivh http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-3.noarch.rpm 2) yum install mock.noarch 3) wget http://kojipkgs.fedoraproject.org/packages/coreutils/6.12/4.fc10/src/coreutils-6.12-4.fc10.src.rpm 4) mock -r fedora-devel-x86_64 coreutils-6.12-4.fc10.src.rpm Once that fails, you should have enough bits in the chroot to build and run the sample code: 5) cp utimensat-test.c /var/lib/mock/fedora-development-x86_64/root/ 6) mock -r fedora-devel-x86_64 shell 7) gcc -o utimensat-test utimensat-test.c 8) ./utimensat-test (may also be worth installing strace into the chroot, via 'mock -r fedora-devel-x86_64 --install strace').
Nb: I've not actually verified that I *don't* get the 280 outside of 8am and 3pm, but definitely did see it with the clock set to roughly 8am utc the other day.
Yeah, I tried it out here, and it is failing at any time. I also followed the code paths some; what happens is: Guest userland: utimensat() Trap -> HV HV -> switch out page tables, trapbounce into guest (xen/arch/x86/x86_64/entry.S) Guest -> Process system call (linux/arch/x86_64/kernel/entry.S:system_call) - note that there is a check here to check if the syscall in question is above NR_syscall_max (which is 279), in which case it should return -ENOSYS. Guest -> ret from syscall (sysretq), which should trap back to the HV and thus back to userland. So, from a quick glance, I don't see it; I'll have to add some debugging code next week to see where this is breaking down. Chris lalancette
Crap. Well, it looks like Jarod did find a bug in the Xen kernel, however, I don't think that bug is what is causing the overall problems. Basically the bug in the Xen kernel is that if, and *only* if, you are ptrace'ing (i.e. strace) an application, a syscall above the NR_syscalls macro will return the number of the syscall, not -ENOSYS. However, in the normal case (no ptrace), it is doing the right thing (at least, as far as I can tell from code inspection). I'll open a BZ about the bad ptrace syscall thing, but I think there is still something else going on with this bug. Chris Lalancette
Arg. Open mouth, insert foot. I should really do more thorough research before posting. Let me try from the beginning. Here's my test program: #include <stdio.h> #include <stdlib.h> #include <sys/syscall.h> #include <sys/types.h> int main() { int foo; foo = syscall(289); fprintf(stderr, "Error is %d\n",foo); return 0; } and I'm compiling with: # gcc -g -Wall test_syscall.c 2.6.18-92.el5xen: [root@localhost ~]# ./a.out Error is 289 [root@localhost ~]# [root@localhost ~]# strace ./a.out <snip> syscall_289(0x7fff61de4a48, 0x7fff61de4a58, 0x400580, 0x38bc34e2b0, 0x38bbc0d630, 0x7fff61de4a40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) = 0x121 write(2, "Error is 289\n", 13Error is 289 ) = 13 exit_group(0) = ? [root@localhost ~]# 2.6.18-92.el5xen + arch/x86_64/kernel/entry-xen.S patch for this issue: [root@localhost ~]# ./a.out Error is -1 [root@localhost ~]# [root@localhost ~]# strace ./a.out <snip> syscall_289(0x7fff4450a148, 0x7fff4450a158, 0x400580, 0x38bc34e2b0, 0x38bbc0d630, 0x7fff4450a140, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) = -1 (errno 38) write(2, "Error is -1\n", 12Error is -1 ) = 12 exit_group(0) = ? [root@localhost ~]# So, the patch on the tracesys path also affects when you aren't systracing. I'm guessing the reason for that is the audit stuff, but I will have to confirm that (just to make sure I completely understand the issue). Chris Lalancette
Nb: this last bit was indeed a bug in the x86_64 xen kernel's syscall handling -- see bug 453394. Chris' patch has gone upstream, and should go into the RHEL5.3 kernel. Not sure if we're going to see this on in z-stream or not, will have to check...
So we've got a work-around patch in place right now, but it probably could stand to be tweaked a bit to better explain the issue, and note that its a xen kernel issue that should be fixed in RHEL5.3... Do we want to throw some of the failure check fixups from Jim into the rawhide package? (I have them here locally in a build, I can add them to rawhide coreutils if so desired). If so desired, I can rework the work-around patch a bit as well.
What kind of changes you want to do in the Jim's workaround? Some diagnostics? I'm ok with changes in rawhide if the package will be still buildable in koji (no failures in test-suite). Thanks for working on that.
Sorry for the delay, got distracted by other stuff... What I was intending to add were the patches from Jim to fix up some of the tests, then a clarification of the comments in the source regarding what's going on when a bogus exit code of 280 is received.
Jarod, Hm, actually, I forgot about this since it went into 5.3 already. Did we want to nominate this for the z-stream instead of applying a userspace workaround? Or is it too late for that? Chris Lalancette
(In reply to comment #38) > Sorry for the delay, got distracted by other stuff... What I was intending to > add were the patches from Jim to fix up some of the tests, then a clarification > of the comments in the source regarding what's going on when a bogus exit code > of 280 is received. No problem, if you still want to do that in coreutils, feel free to request commit ACL's for coreutils devel at https://admin.fedoraproject.org/pkgdb/packages/name/coreutils . Or you could just send me patches and I will apply it to devel branch.
Re comment #39: given that we've apparently got other syscalls triggering similar bugs in other builds, I think infra would really appreciate it if we can get a patched kernel onto the build systems, and it being an official z-stream kernel would be highly preferred. Re comment #40: I've requested commit privs for coreutils, will go ahead and commit that stuff when I get a chance.
For cross-referencing purposes... bug 459442 is the 'other syscalls triggering' bug referred to in comment #41.
Jarod, OK, thanks for the info. I'll request z-stream on the other bug (BZ 453394). Chris Lalancette
Ugh, I take forever sometimes. Finally just committed the testsuite workarounds from upstream, as well as the minor modification of the comments regarding why we were getting that 280 return code. Tagged and building now. Hm. Maybe this bug should be flipped back to rawhide and component coreutils... I think all the RHEL stuff has been taken care of now.