Bug 442352

Summary: Strange koji build failures of coreutils rawhide builds
Product: [Fedora] Fedora Reporter: Ondrej Vasik <ovasik>
Component: coreutilsAssignee: Ondrej Vasik <ovasik>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low Docs Contact:
Priority: low    
Version: rawhideCC: clalance, dennis, kdudka, lwang, meyering, mikem, ovasik, twaugh, xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-11-07 17:37:30 EST Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Description Flags
strace of test case failure
mock build.log from coreutils build failure
coreutils test-suite.log from failed mock build none

Description Ondrej Vasik 2008-04-14 11:03:46 EDT
Description of problem:
When I'm trying to build rawhide coreutils package with koji, it fails when make
build done immediately after commit and make tag (output in
http://koji.fedoraproject.org/koji/taskinfo?taskID=565149). When I downloaded
the srpm from koji and tried with koji build --scratch dist-f9 <srpm> , it
passed without troubles
(http://koji.fedoraproject.org/koji/taskinfo?taskID=565190). It is not the first
time - therefore I created the bugzilla as it seems to be not the random failure
with rawhide (I never had this failure with F8/F7), could be some problem with
SELinux on some build machines(but I'm not sure). Any other idea?

Version-Release number of selected component (if applicable): 
Note: I'm not sure which component should I choose about that issue, therefore I
have selected koji. Feel free to reattach the component to correct one.

How reproducible: Sometimes, not always

Steps to Reproduce:
1. make tag && make build for coreutils package
Actual results:
Failure as in

Expected results:
Pass as in

Additional info:
I don't know about any possibility how to check files in buildroot(as there will
be some output about the error in no-create-missing.log) - because afaik there
is no account for accessing it. Strange fact is that after few hours build
usually passes without failure without any change (the test is in fact just
simple test with touch, nothing which can fail in common conditions).
Comment 1 Ondrej Vasik 2008-04-28 12:28:53 EDT
Created attachment 304003 [details]
strace of test case failure

Few additional informations:
- problem reproducible ONLY but ALWAYS by building coreutils srpm between 10AM
UTC and ~3PM UTC
- problem caused by wrong return code from utimensat() call (280 instead of -1
- therefore could be RHEL-5 xen kernel issue (but unsure)
Comment 2 Bug Zapper 2008-05-14 05:24:11 EDT
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
Comment 3 Jesse Keating 2008-05-16 15:48:55 EDT
hrm, this is a tough one.  I don't think it's anything in koji specific,
especially given that it's a time issue.  I'm going to punt to RHEL5 kernel.
Comment 4 Ondrej Vasik 2008-06-06 06:45:47 EDT
Just want to mention coreutils-6.12 issues (#449910)
Could be problem on koji build machine as well, as most of the failures with the
current version I got on certain machine and architecture ( Host  	
xenbuilder2.fedora.redhat.com Arch x86_64 ). Problem does occur not only for
utimensat() call, but for utimens() as well. Return code is always 280. Current
situation is to workaround that buggy return code(replace by zero to skip
failure), although this could break timestamps (as is not sure if utimens() did
his job. Build log with the new failure is available e.g.
http://koji.fedoraproject.org/koji/taskinfo?taskID=649923 , with strace in one
test case and additional debug info enabled.
Comment 5 Ondrej Vasik 2008-06-06 15:23:52 EDT
Current workaround (covering return codes 280 from both - utimensat and
futimens, attached in #449910) is available in coreutils-6.12-3.fc10, seems to
be working ok so far - so it's likely that the priority of the problem is
slightly lowered. Anyway it would be great to know what is causing that strange
Comment 6 Jarod Wilson 2008-06-11 15:22:13 EDT
Ugh. The best I can get out of a mock build of coreutils-6.12-3.fc10 against
Fedora 9 within a RHEL5.2 kvm guest is:

 4 of 335 tests failed
 (73 tests were not run)
 See tests/test-suite.log
 Please report it to bug-coreutils@gnu.org

What's the magic to get this building, period? Guess I should try with a
physical machine and a 5.1 kernel next...
Comment 7 Ondrej Vasik 2008-06-12 04:39:44 EDT
Yep mock builds of coreutils are quiet strange. On my FC-6 machine some tests
(like runcon-no-reorder) fail always. Anyway-that issue with utimensat/futimens
call I had only on koji build machines (and as I mentioned, usually only between
10AM - 3PM UTC). The most failing machines were xenbuilder2 and xenbuilder4 (I
don't remember failure on different build machine). 
Comment 8 Jim Meyering 2008-06-12 05:08:42 EDT
Hi Jarod,
Would you please attach tests/test-suite.log (or, slightly better for
me/upstream, post it to bug-coreutils@gnu.org)?  I'll be happy to look into it.

Ondrej, this is the first I've heard about the runcon-no-reorder failure.
Same deal there.  I'd appreciate any details ;-)
Comment 9 Jim Meyering 2008-06-12 07:27:54 EDT
Ondrej, thanks for the private strace, which showed runcon getting ENOENT like this:

open("/selinux/context", O_RDWR|O_LARGEFILE) = -1 ENOENT (No such file or directory)

which is probably because mock is running in an incomplete chroot, i.e., one in
which /selinux/context does not exist.  If that is indeed the problem, I wonder
if that can be considered a system configuration bug.
Comment 10 Jarod Wilson 2008-06-13 12:14:02 EDT
Looks like I might be hitting the same failure as Ondrej on FC6. The four tests
that failed in my case:

FAIL: misc/runcon-no-reorder.log (exit: 1)
FAIL: cp/cp-a-selinux.log (exit: 1)
FAIL: cp/preserve-gid.log (exit: 1) 
FAIL: rm/fail-2eperm.log (exit: 1)

This is using mock from EPEL on RHEL5. Apparently, the actual builders use
something newer that handles the selinux stuff better... Somehow, I got an
outdated version of mock installed. I'll hold off on sending anything more until
I get things rolling with the correct mock version. :)
Comment 11 Jarod Wilson 2008-06-13 14:39:18 EDT
Drat. Got the same failure w/mock 0.9.7 (which matches the builders) on rhel5
and selinux in permissive mode trying to build coreutils-6.12-3.fc10. Log files
forthcoming... Will attach them here and send them to the upstream list.
Comment 12 Jarod Wilson 2008-06-13 14:40:01 EDT
Created attachment 309269 [details]
mock build.log from coreutils build failure
Comment 13 Jarod Wilson 2008-06-13 14:40:38 EDT
Created attachment 309270 [details]
coreutils test-suite.log from failed mock build
Comment 14 Jarod Wilson 2008-06-18 14:43:10 EDT
Upstream discussion thread:

Comment 15 Jim Meyering 2008-06-19 03:47:38 EDT
Thanks again, Jarod,

I've pushed patches for the remaining two problems:
Comment 16 Jarod Wilson 2008-06-19 10:06:35 EDT
Excellent, with all patches in play, a mock build completes successfully. Now to
back out the work-around mentioned in comment #5 and see what still breaks... :)
Comment 17 Jarod Wilson 2008-06-23 09:44:39 EDT
After some further poking, it seems the work-around we have in play in rawhide
coreutils is probably the right answer for now. There simply is no utimensat()
support in the RHEL5 kernel -- this syscall doesn't exist there, period. 280
(the funky error number we were seeing) is the syscall number for it upstream on
x86_64, but the RHEL5 syscall table ends at 279. I'm thinking its not worth
backporting if this is the only place its an issue for us.

utimensat() debuted in 2.6.22, initial upstream commit here:
Comment 19 Jim Meyering 2008-06-23 10:57:41 EDT
hi Jarod,

I confess I haven't looked recently, but what you report makes me think of
incomplete syscall support or missing glibc wrapper.  coreutils tests for the
existence of the function and finds it.  In that case, it should either work as
advertised or it return -1 and sets errno = ENOSYS.

FYI, this affects futimens, too, which might end up being used more than utimensat.
Comment 20 Jarod Wilson 2008-06-23 14:59:25 EDT
From irc discussion:

There's a link test done by coreutils to check for utimensat() support:

#define utimensat innocuous_utimensat
#ifdef __STDC__
# include <limits.h>
# include <assert.h>
#undef utimensat
#ifdef __cplusplus
extern "C"
char utimensat ();
#if defined __stub_utimensat || defined __stub___utimensat
choke me

int main () { return utimensat (); return 0; }

If gcc exits with a return code of 0, in theory, we have the needed support.

Inside the mock chroot:
mock-chroot> gcc -o utimensat-test utimensat-test.c 
mock-chroot> echo $?

Outside the mock chroot, directly on a bare-metal RHEL5.2 kernel:
$ gcc -o utimensat-test utimensat-test.c 
/tmp/cc88x9qi.o: In function `main':
utimensat-test.c:(.text+0xa): undefined reference to `utimensat'
collect2: ld returned 1 exit status
$ echo $?

So it appears we're not actually testing directly against the capabilities the
kernel claims to have, or we should get the same failure inside the chroot.
Comment 21 Jarod Wilson 2008-06-23 15:31:25 EDT
The fun is that we're building against 2.6.26-ish kernel headers, but actually
running on a 2.6.18 kernel. The 2.6.26 kernel headers link fine, because they
properly say that a 2.6.26 kernel has the needed support. Meanwhile, we try to
run tests using a binary build for a 2.6.26 kernel on a 2.6.18 kernel. Kaboom.
So its a somewhat expected corner case failure.

There's still some question as to why we're getting the 280 return code, need to
investigate exactly where that's coming from.
Comment 22 Jarod Wilson 2008-06-24 09:55:26 EDT
Actually executing the utimensat-test within the mock chroot seems to get the
appropriate exit code:

execve("./utimensat-test", ["./utimensat-test"], [/* 21 vars */]) = 0
brk(0)                                  = 0x1a3ec000
uname({sys="Linux", node="xw4400.usersys.redhat.com", ...}) = 0
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=10691, ...}) = 0
mmap(NULL, 10691, PROT_READ, MAP_PRIVATE, 5, 0) = 0x2b4293bb8000
close(5)                                = 0
open("/lib64/libc.so.6", O_RDONLY)      = 5
read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \345\1\0\0\0\0\0"..., 83
2) = 832
fstat(5, {st_mode=S_IFREG|0755, st_size=1510768, ...}) = 0
mprotect(0x2b4293f23000, 2093056, PROT_NONE) = 0
WRITE, 5, 0x16a000) = 0x2b4294122000
YMOUS, -1, 0) = 0x2b4294127000
close(5)                                = 0
arch_prctl(ARCH_SET_FS, 0x2b429412c6f0) = 0
mprotect(0x2b4294122000, 16384, PROT_READ) = 0
mprotect(0x2b4293db6000, 4096, PROT_READ) = 0
munmap(0x2b4293bb8000, 10691)           = 0
utimensat(1, "%^\<F1>^V<FF>^?", {{140733578288182, 140733578288217}, {1407335782
88228, 140733578288244}}, 0) = -1 ENOSYS (Function not implemented)
exit_group(-1)                          = ?

Now to look into what the failing tests are doing differently...
Comment 23 Jim Meyering 2008-06-24 11:07:21 EDT
Hi Jarod, note comment #1 where it says the failure depends on the time of day
at which the test is run.
Comment 24 Jarod Wilson 2008-06-24 11:19:18 EDT
Yeah, I just realized that a bit ago, and noticed I was outside that window.
I've reset the clock on the host to something in that window and am re-testing now.
Comment 25 Jarod Wilson 2008-06-24 12:07:21 EDT
Hrm, clock change doesn't seem to have made a lick of difference to the return
code. This part could well be xen-specific, will try yet again on a xen kernel...
Comment 26 Jarod Wilson 2008-06-24 15:46:27 EDT
Okay, so I'm convinced this is a xen bug. Inside a xen guest, I get:

mock-chroot> strace ./utimensat-test
execve("./utimensat-test", ["./utimensat-test"], [/* 21 vars */]) = 0
brk(0)                                  = 0x16049000
uname({sys="Linux", node="localhost.localdomain", ...}) = 0
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=10750, ...}) = 0
mmap(NULL, 10750, PROT_READ, MAP_PRIVATE, 5, 0) = 0x2b04f2a97000
close(5)                                = 0
open("/lib64/libc.so.6", O_RDONLY)      = 5
read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \345\1\0\0\0\0\0"...,
832) = 832
fstat(5, {st_mode=S_IFREG|0755, st_size=1510768, ...}) = 0
mprotect(0x2b04f2e02000, 2093056, PROT_NONE) = 0
mmap(0x2b04f3001000, 20480, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x16a000) = 0x2b04f3001000
mmap(0x2b04f3006000, 17016, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b04f3006000
close(5)                                = 0
arch_prctl(ARCH_SET_FS, 0x2b04f300b6f0) = 0
mprotect(0x2b04f3001000, 16384, PROT_READ) = 0
mprotect(0x2b04f2c95000, 4096, PROT_READ) = 0
munmap(0x2b04f2a97000, 10750)           = 0
utimensat(1, "�+��", {{140736280603661, 140736280603692}, {140736280603703,
140736280603719}}, 0) = 280
exit_group(24)                          = ?
Comment 27 Jarod Wilson 2008-06-24 15:48:26 EDT
Does this sound at all familiar to any xen folks?
Comment 28 Jarod Wilson 2008-06-26 16:22:08 EDT
I seem to have somewhat of an explanation why this hasn't been seen on i686...
In the i686 case, syscalls are passed unaltered to the kernel. On x86_64, they
all have to pass through the hypervisor, which does some mucking with
things[*]... Now, based on xen/arch/x86/x86_64/entry.S, if I'm reading this
right (I don't know assembly or the xen hv for squat), it looks like its
*supposed* to ultimately return an -ENOSYS here when we leave syscall_enter, but
for whatever reason, its returning the attempted syscall number instead. Of
course, like I said, I know neither assembly nor the xen hv very well, so I
could be barking up the wrong tree...

[*] http://lists.xensource.com/archives/html/xen-devel/2006-11/msg01246.html
Comment 29 Chris Lalancette 2008-06-27 03:00:01 EDT
    Yep, your explanation about how syscalls work on i686 vs. x86_64 is (almost)
spot-on.  For i686, kernel is in ring1 and userland is in ring3; when the guest
boots, the hypervisor and guest kernel set it up so that syscalls do a direct
ring transfer from ring3 to ring1; the hypervisor is not involved at all.
    On x86_64, you can't do this because there is no ring1 (not technically
true, but true enough for this explanation).  So both the kernel and userland
live in ring3, protected from each other by page tables.  When userland
syscalls, it traps to the hypervisor, which swaps out the page tables for the
kernel page tables, and resumes inside the guest kernel in ring3 again.
     Kind of odd that this new utimenstat() is returning such a bogus number
though.  You would think we would have a problem all of the time, not just
between 10pm and 3am, no?  In any case, this should be easy enough to find, if
it is indeed the HV mucking up the system call somehow.  Is there a test binary
we can use to show this?

Chris Lalancette
Comment 30 Jarod Wilson 2008-06-27 10:21:13 EDT
Hey Chris!

Yep, I've been able to reproduce the 280 return code (between 8am and 3pm, in my
own experience) using the following code snippet:

#define utimensat innocuous_utimensat
#ifdef __STDC__
# include <limits.h>
# include <assert.h>
#undef utimensat
#ifdef __cplusplus
extern "C"
char utimensat ();
#if defined __stub_utimensat || defined __stub___utimensat
choke me

int main () { return utimensat (); return 0; }

This is more or less the exact same test code that coreutils uses to test
linking functionality. The code has to be compiled and run inside a rawhide (or
f9) chroot to trigger, but its relatively easy to get going with mock.

On an x86_64 RHEL5 xen box:

1) rpm -ivh
2) yum install mock.noarch
3) wget
4) mock -r fedora-devel-x86_64 coreutils-6.12-4.fc10.src.rpm

Once that fails, you should have enough bits in the chroot to build and run the
sample code:

5) cp utimensat-test.c /var/lib/mock/fedora-development-x86_64/root/
6) mock -r fedora-devel-x86_64 shell
7) gcc -o utimensat-test utimensat-test.c
8) ./utimensat-test

(may also be worth installing strace into the chroot, via 'mock -r
fedora-devel-x86_64 --install strace').
Comment 31 Jarod Wilson 2008-06-27 10:23:58 EDT
Nb: I've not actually verified that I *don't* get the 280 outside of 8am and
3pm, but definitely did see it with the clock set to roughly 8am utc the other day.
Comment 32 Chris Lalancette 2008-06-27 12:41:41 EDT
Yeah, I tried it out here, and it is failing at any time.  I also followed the
code paths some; what happens is:

Guest userland: utimensat()
Trap -> HV
HV -> switch out page tables, trapbounce into guest (xen/arch/x86/x86_64/entry.S)
Guest -> Process system call (linux/arch/x86_64/kernel/entry.S:system_call) -
note that there is a check here to check if the syscall in question is above
NR_syscall_max (which is 279), in which case it should return -ENOSYS.
Guest -> ret from syscall (sysretq), which should trap back to the HV and thus
back to userland.

So, from a quick glance, I don't see it; I'll have to add some debugging code
next week to see where this is breaking down.

Chris lalancette
Comment 33 Chris Lalancette 2008-06-30 07:10:52 EDT
Crap.  Well, it looks like Jarod did find a bug in the Xen kernel, however, I
don't think that bug is what is causing the overall problems.  Basically the bug
in the Xen kernel is that if, and *only* if, you are ptrace'ing (i.e. strace) an
application, a syscall above the NR_syscalls macro will return the number of the
syscall, not -ENOSYS.  However, in the normal case (no ptrace), it is doing the
right thing (at least, as far as I can tell from code inspection).  I'll open a
BZ about the bad ptrace syscall thing, but I think there is still something else
going on with this bug.

Chris Lalancette
Comment 34 Chris Lalancette 2008-06-30 08:58:35 EDT
Arg.  Open mouth, insert foot.

I should really do more thorough research before posting.  Let me try from the
beginning.  Here's my test program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>

int main()
        int foo;
        foo = syscall(289);
        fprintf(stderr, "Error is %d\n",foo);
        return 0;

and I'm compiling with:

# gcc -g -Wall test_syscall.c


[root@localhost ~]# ./a.out
Error is 289
[root@localhost ~]# 

[root@localhost ~]# strace ./a.out
syscall_289(0x7fff61de4a48, 0x7fff61de4a58, 0x400580, 0x38bc34e2b0,
0x38bbc0d630, 0x7fff61de4a40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0) = 0x121
write(2, "Error is 289\n", 13Error is 289
)          = 13
exit_group(0)                           = ?
[root@localhost ~]# 

2.6.18-92.el5xen + arch/x86_64/kernel/entry-xen.S patch for this issue:

[root@localhost ~]# ./a.out
Error is -1
[root@localhost ~]# 

[root@localhost ~]# strace ./a.out
syscall_289(0x7fff4450a148, 0x7fff4450a158, 0x400580, 0x38bc34e2b0,
0x38bbc0d630, 0x7fff4450a140, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0) = -1 (errno 38)
write(2, "Error is -1\n", 12Error is -1
)           = 12
exit_group(0)                           = ?
[root@localhost ~]# 

So, the patch on the tracesys path also affects when you aren't systracing.  I'm
guessing the reason for that is the audit stuff, but I will have to confirm that
(just to make sure I completely understand the issue).

Chris Lalancette
Comment 35 Jarod Wilson 2008-07-05 23:01:59 EDT
Nb: this last bit was indeed a bug in the x86_64 xen kernel's syscall handling
-- see bug 453394. Chris' patch has gone upstream, and should go into the
RHEL5.3 kernel. Not sure if we're going to see this on in z-stream or not, will
have to check...
Comment 36 Jarod Wilson 2008-07-09 21:14:41 EDT
So we've got a work-around patch in place right now, but it probably could stand
to be tweaked a bit to better explain the issue, and note that its a xen kernel
issue that should be fixed in RHEL5.3... Do we want to throw some of the failure
check fixups from Jim into the rawhide package? (I have them here locally in a
build, I can add them to rawhide coreutils if so desired). If so desired, I can
rework the work-around patch a bit as well.
Comment 37 Ondrej Vasik 2008-07-10 04:01:31 EDT
What kind of changes you want to do in the Jim's workaround? Some diagnostics?
I'm ok with changes in rawhide if the package will be still buildable in koji
(no failures in test-suite). Thanks for working on that.
Comment 38 Jarod Wilson 2008-08-21 09:46:04 EDT
Sorry for the delay, got distracted by other stuff... What I was intending to add were the patches from Jim to fix up some of the tests, then a clarification of the comments in the source regarding what's going on when a bogus exit code of 280 is received.
Comment 39 Chris Lalancette 2008-08-21 10:01:34 EDT
     Hm, actually, I forgot about this since it went into 5.3 already.  Did we want to nominate this for the z-stream instead of applying a userspace workaround?  Or is it too late for that?

Chris Lalancette
Comment 40 Ondrej Vasik 2008-08-21 10:06:47 EDT
(In reply to comment #38)
> Sorry for the delay, got distracted by other stuff... What I was intending to
> add were the patches from Jim to fix up some of the tests, then a clarification
> of the comments in the source regarding what's going on when a bogus exit code
> of 280 is received.

No problem, if you still want to do that in coreutils, feel free to request commit ACL's for coreutils devel at https://admin.fedoraproject.org/pkgdb/packages/name/coreutils . Or you could just send me patches and I will apply it to devel branch.
Comment 41 Jarod Wilson 2008-08-28 17:49:02 EDT
Re comment #39: given that we've apparently got other syscalls triggering similar bugs in other builds, I think infra would really appreciate it if we can get a patched kernel onto the build systems, and it being an official z-stream kernel would be highly preferred.

Re comment #40: I've requested commit privs for coreutils, will go ahead and commit that stuff when I get a chance.
Comment 42 Jarod Wilson 2008-08-28 17:50:12 EDT
For cross-referencing purposes... bug 459442 is the 'other syscalls triggering' bug referred to in comment #41.
Comment 43 Chris Lalancette 2008-08-29 03:17:57 EDT
     OK, thanks for the info.  I'll request z-stream on the other bug (BZ 453394).

Chris Lalancette
Comment 44 Jarod Wilson 2008-10-06 16:31:11 EDT
Ugh, I take forever sometimes. Finally just committed the testsuite workarounds from upstream, as well as the minor modification of the comments regarding why we were getting that 280 return code. Tagged and building now.

Hm. Maybe this bug should be flipped back to rawhide and component coreutils... I think all the RHEL stuff has been taken care of now.