Red Hat Bugzilla – Bug 659382
compat x86 ERESTARTNOINTR leaking to userspace
Last modified: 2013-06-13 02:51:14 EDT
Description of problem: vfork() can fail with errno==513 which strace's linux/errnoent.h says is ERESTARTNOINTR. 'make' does not recover; 'make' should just try again.
Version-Release number of selected component (if applicable):
How reproducible: about 90%
Steps to Reproduce:
1. strace -f -o strace.out setarch i686 rpmbuild -ba --target i686 glibc.spec >rpmbuild.out 2>&1 # on x86_64 machine using glibc-2.12.90-19.src.rpm
make: vfork: Invalid argument
and 'make' aborts.
Expected results: 'make' recovers from ERESTARTNOINTR by trying vfork() again.
This errno really isn't supposed to make it to userspace. I wonder if this is an artifact of 32-on-64...
Reassigning to kernel...
Just for clarity, have you ever seen this in a native environment, or only when using setarch i686 on x86_64?
Oh, hang on, is this only happening under strace?
It happens both with and without strace. I'm testing for with/without "setarch i686". It takes a while.
OK, good. I suspect it won't happen without setarch, I think the problem is in the ia32 compat syscall return path.
I'm poking at it as well.
Hrrmph; inconclusive so far.
Without strace, then both with and without "setarch i686" the first two
[re-]runs failed for a different reason (the same reason in both cases.) Using
'strace' is an attempt to figure out why the build fails in bug #658236 which
is really a Request For Enhancement.
The ERESTARTNOINTR does not happen under strace when $HOME/.rpmmacros has "%_smp_flags -j1" [7GB output from strace.] So some parallelism apparently is needed, but "-j2" is enough to cause problems even on a 2-core CPU.
in glibc.spec as the first attempt at getting "--target i686" to work when
building on x86_64.
I'm not sure I can devote a lot more time to this now.
Now it looks like the key is strace with -j2, and independent of setarch.
Sorry, comments are unclear, are you saying:
strace + -j2, both x86_64 and i686, leak of ERESTARTNOINTR
anything else, and there's no leak?
(Strace leaking this is expected, due to the way syscall restarts are handled...)
Yes. strace + -j2 leaks ERESTARTNOINTR, on all configurations: 100% native x86_64, 100% native i686, i686 on x86_64. This makes strace not usable, at least for trying to get more information to investigate bug #658236.
To reduce the number of variables, I went to Fedora 14 on i686 (kernel-PAE-18.104.22.168-48.fc14.i686, strace-4.5.20-1.fc14.i686) with no cross-arch involved. Running "strace -f -o \"|gzip - >strace.out.gz\" rpmbuild -ba --target i686 glibc.spec >rpmbuild.out 2>&1" and applying "zmore strace.out.gz" I see:
18703 vfork() = ? ERESTARTNOINTR (To be restarted)
18703 --- SIGCHLD (Child exited) @ 0 (0) ---
18703 close(5) = 0
18703 sigreturn() = ? (mask now [HUP INT QUIT TERM XCPU XFSZ])
18703 clone(child_stack=0x8055763, flags=CLONE_FILES|CLONE_IDLETASK|CLONE_PTRACE|CLONE_
8000000|SIGFPE, parent_tidptr=0, child_tidptr=0x1) = -1 EINVAL (Invalid argument)
18703 rt_sigprocmask(SIG_SETMASK, , NULL, 8) = 0
18703 write(2, "make: ", 9) = 9
18703 write(2, "vfork: Invalid argument", 23) = 23
18703 write(2, "\n", 1) = 1
18703 write(4, "+", 1) = 1
and this causes 'make' to terminate abnormally. Furthermore, that "clone(" is strange. It's the only 'clone' that 'make' does (it does hundreds of vfork.) Is this strace trying to recover from ERESTARTNOINTR? The child_stack is not aligned, asking for CLONE_PARENT_SETTID with parent_tidptr=0 is bad news, asking for CLONE_CHILD_SETTID with child_tidptr=0x1 also is bad news. It's a total mess.
This is not probably relevant for Fedora, however it can be relevant for this bug: I observe similar problem on 2.6.36 Linux compiled for mips64 with n32 user space on Gentoo. When running make -j2, there is small but non-zero probability to get ENOSYS on vfork() call make(1) uses. Unfortunately the probability is so low it occurs only on large jobs like glibc compilation. I've never been able to create short reproducer. I was unable to trigger the bug by simple vfork() loop.
(In reply to comment #9)
> Yes. strace + -j2 leaks ERESTARTNOINTR,
I don't think it does, please see below.
> This makes strace not usable,
_perhaps_ strace is buggy, at first glance. IIRC, it always
did something strange with vfork(), but I am not sure.
> 18703 vfork() = ? ERESTARTNOINTR (To be restarted)
Note that user-space does not see this ERESTARTNOINTR,
it is only visible to ptrace.
The caller received SIGCHLD, vfork() should be restarted
transparently after return from the signal handler.
> 18703 --- SIGCHLD (Child exited) @ 0 (0) ---
> 18703 close(5) = 0
> 18703 sigreturn() = ? (mask now [HUP INT QUIT TERM XCPU
Return from the signal handler. vfork() should be restarted.
Once again, ERESTARTNOINTR was not reported to the user-space.
> 18703 clone(child_stack=0x8055763,
> 8000000|SIGFPE, parent_tidptr=0, child_tidptr=0x1) = -1 EINVAL
Surely, this doesn't look like sys_vfork().
> It's a total mess.
I'll try to dig more later.
(In reply to comment #11)
> _perhaps_ strace is buggy, at first glance. IIRC, it always
> did something strange with vfork(), but I am not sure.
Yes. strace is buggy. This nearly killed me, I forgot everything
I knew about strace's sources. But I am very sure my analysis is
correct even if I misunderstood some details in setbpt/clearbpt.
1. A tracee calls vfork(). This syscall has no arguments, it can
enter the kernel with the random value in rdi register. Let's
denote rdi = RDI_RANDOM.
2. Before the tracee actually calls vfork(), it reports this
syscall to strace.
3. strace does setbpt(), "case SYS_vfork" transforms this syscall
into clone(CLONE_PTRACE|SIGCHLD). Note: it changes ->orig_ax from
__NR_vfork to __NR_clone, and it changes arg0 (rdi) from RDI_RANDOM
to clone_flags = CLONE_PTRACE|SIGCHLD.
But! it also records the old value of the unused arg0 (rdi) in
tcp->inst. Again, this is RDI_RANDOM.
4. The tracee receives a signal (SIGCHLD in this case).
5. strace resumes the tracee. The tracee calls copy_process()
which fails with ERESTARTNOINTR. The tracee reports syscall_exit
6. strace does clearbpt() which restores rdi = RDI_RANDOM.
But! it does _not_ change ->orig_ax/scno! it is still equal
to __NR_clone, not __NR_vfork.
7. Eventually the tracee restarts the syscall, but this time it
does clone(RDI_RANDOM), and this is what we see in the logs.
It calls clone() because orig_ax/scno was not restored by
clearbpt(), it passes the "random" clone_flags because
suddenly rdi becomes the first argument of this syscall.
I am changing the component.
Thanks, that analysis looks sound to me.
Note that upstream strace is on the verge of switching to using PTRACE_O_TRACEVFORK et al instead of the syscall-changing hackery. So it could see this failure mode disappear without the actual bug having been fixed. But since it will continue to support the old method for ancient kernels, it still deserves a fix upstream, though that is not really a Fedora concern.
Fixed upstream by commit 55980f5b72000406e3fd843b098b5c1328a21e45.
This code is not triggered on Fedora anyway because we use PTRACE_O_TRACEVFORK.
strace-4.8-1.fc19 has been submitted as an update for Fedora 19.
strace-4.8-1.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.