Bug 659382

Summary:	compat x86 ERESTARTNOINTR leaking to userspace
Product:	[Fedora] Fedora	Reporter:	John Reiser <jreiser>
Component:	strace	Assignee:	Jeff Law <law>
Status:	CLOSED ERRATA	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	low
Version:	rawhide	CC:	aquini, dougsland, dvlasenk, gansalmon, itamar, jonathan, kernel-maint, kmcmartin, madhu.chinakonda, mnewsome, onestero, pahan, pmachata, ppisar, roland, scottt.tw
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	strace-4.8-1.fc19	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-05-14 14:43:51 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description John Reiser 2010-12-02 17:04:02 UTC

Description of problem: vfork() can fail with errno==513 which strace's linux/errnoent.h says is ERESTARTNOINTR.  'make' does not recover; 'make' should just try again.


Version-Release number of selected component (if applicable):
make-3.82-4.fc15.x86_64
kernel-2.6.36.1-10.fc15.x86_64


How reproducible: about 90%


Steps to Reproduce:
1. strace -f -o strace.out setarch i686 rpmbuild -ba --target i686 glibc.spec >rpmbuild.out 2>&1    # on x86_64 machine using glibc-2.12.90-19.src.rpm

2.
3.
  
Actual results:
make[2]: vfork: Invalid argument
and 'make' aborts.


Expected results: 'make' recovers from ERESTARTNOINTR by trying vfork() again.


Additional info:

Comment 1 Kyle McMartin 2010-12-02 17:35:02 UTC

This errno really isn't supposed to make it to userspace. I wonder if this is an artifact of 32-on-64...

--Kyle

Comment 2 Kyle McMartin 2010-12-02 18:07:03 UTC

Reassigning to kernel...

Comment 3 Kyle McMartin 2010-12-02 18:36:46 UTC

Just for clarity, have you ever seen this in a native environment, or only when using setarch i686 on x86_64?

Oh, hang on, is this only happening under strace?

Comment 4 John Reiser 2010-12-02 19:15:19 UTC

It happens both with and without strace.  I'm testing for with/without "setarch i686".  It takes a while.

Comment 5 Kyle McMartin 2010-12-02 21:11:54 UTC

OK, good. I suspect it won't happen without setarch, I think the problem is in the ia32 compat syscall return path.
I'm poking at it as well.

Comment 6 John Reiser 2010-12-02 21:16:07 UTC

Hrrmph; inconclusive so far.

Without strace, then both with and without "setarch i686" the first two
[re-]runs failed for a different reason (the same reason in both cases.)  Using
'strace' is an attempt to figure out why the build fails in bug #658236 which
is really a Request For Enhancement.

The ERESTARTNOINTR does not happen under strace when $HOME/.rpmmacros has "%_smp_flags -j1"  [7GB output from strace.]  So some parallelism apparently is needed, but "-j2" is enough to cause problems even on a 2-core CPU.

I'm using
-----
%ifarch %{ix86}
GCC="gcc -m32"
%else
GCC=gcc
%endif
-----
in glibc.spec as the first attempt at getting "--target i686" to work when
building on x86_64.

I'm not sure I can devote a lot more time to this now.

Comment 7 John Reiser 2010-12-02 21:51:15 UTC

Now it looks like the key is strace with -j2, and independent of setarch.
strace-4.5.20-1.fc14.x86_64

Comment 8 Kyle McMartin 2010-12-02 22:20:26 UTC

Sorry, comments are unclear, are you saying:

strace + -j2, both x86_64 and i686, leak of ERESTARTNOINTR

anything else, and there's no leak?

(Strace leaking this is expected, due to the way syscall restarts are handled...)

Comment 9 John Reiser 2010-12-03 14:25:22 UTC

Yes.  strace + -j2 leaks ERESTARTNOINTR,  on all configurations: 100% native x86_64, 100% native i686, i686 on x86_64.  This makes strace not usable, at least for trying to get more information to investigate bug #658236.

To reduce the number of variables, I went to Fedora 14 on i686 (kernel-PAE-2.6.35.6-48.fc14.i686, strace-4.5.20-1.fc14.i686) with no cross-arch involved.  Running "strace -f -o \"|gzip - >strace.out.gz\" rpmbuild -ba --target i686 glibc.spec >rpmbuild.out 2>&1" and applying "zmore strace.out.gz" I see:
-----
18703 vfork()                           = ? ERESTARTNOINTR (To be restarted)
18703 --- SIGCHLD (Child exited) @ 0 (0) ---
18703 close(5)                          = 0
18703 sigreturn()                       = ? (mask now [HUP INT QUIT TERM XCPU XFSZ])
18703 clone(child_stack=0x8055763, flags=CLONE_FILES|CLONE_IDLETASK|CLONE_PTRACE|CLONE_
VFORK|CLONE_THREAD|CLONE_NEWNS|CLONE_PARENT_SETTID|CLONE_UNTRACED|CLONE_CHILD_SETTID|0x
8000000|SIGFPE, parent_tidptr=0, child_tidptr=0x1) = -1 EINVAL (Invalid argument)
18703 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
18703 write(2, "make[2]: ", 9)          = 9
18703 write(2, "vfork: Invalid argument", 23) = 23
18703 write(2, "\n", 1)                 = 1
18703 write(4, "+", 1)                  = 1
-----
and this causes 'make' to terminate abnormally.  Furthermore, that "clone(" is strange.  It's the only 'clone' that 'make' does (it does hundreds of vfork.)  Is this strace trying to recover from ERESTARTNOINTR?  The child_stack is not aligned, asking for CLONE_PARENT_SETTID with parent_tidptr=0 is bad news, asking for CLONE_CHILD_SETTID with child_tidptr=0x1 also is bad news.  It's a total mess.

Comment 10 Petr Pisar 2010-12-03 14:42:39 UTC

This is not probably relevant for Fedora, however it can be relevant for this bug: I observe similar problem on 2.6.36 Linux compiled for mips64 with n32 user space on Gentoo. When running make -j2, there is small but non-zero probability to get ENOSYS on vfork() call make(1) uses. Unfortunately the probability is so low it occurs only on large jobs like glibc compilation. I've never been able to create short reproducer. I was unable to trigger the bug by simple vfork() loop.

Comment 11 Oleg Nesterov 2010-12-05 15:21:42 UTC

(In reply to comment #9)
>
> Yes.  strace + -j2 leaks ERESTARTNOINTR,

I don't think it does, please see below.

> This makes strace not usable,

_perhaps_ strace is buggy, at first glance. IIRC, it always
did something strange with vfork(), but I am not sure.

> 18703 vfork()                           = ? ERESTARTNOINTR (To be restarted)

Note that user-space does not see this ERESTARTNOINTR,
it is only visible to ptrace.

The caller received SIGCHLD, vfork() should be restarted
transparently after return from the signal handler.

> 18703 --- SIGCHLD (Child exited) @ 0 (0) ---
> 18703 close(5)                          = 0
> 18703 sigreturn()                       = ? (mask now [HUP INT QUIT TERM XCPU
> XFSZ])

Return from the signal handler. vfork() should be restarted.
Once again, ERESTARTNOINTR was not reported to the user-space.

> 18703 clone(child_stack=0x8055763,
> flags=CLONE_FILES|CLONE_IDLETASK|CLONE_PTRACE|CLONE_VFORK|CLONE_THREAD|CLONE_NEWNS|CLONE_PARENT_SETTID|CLONE_UNTRACED|CLONE_CHILD_SETTID|0x
> 8000000|SIGFPE, parent_tidptr=0, child_tidptr=0x1) = -1 EINVAL

Surely, this doesn't look like sys_vfork().

> It's a total mess.

Agreed ;)

I'll try to dig more later.

Comment 12 Oleg Nesterov 2010-12-05 18:55:43 UTC

(In reply to comment #11)
>
> _perhaps_ strace is buggy, at first glance. IIRC, it always
> did something strange with vfork(), but I am not sure.

Yes. strace is buggy. This nearly killed me, I forgot everything
I knew about strace's sources. But I am very sure my analysis is
correct even if I misunderstood some details in setbpt/clearbpt.


1. A tracee calls vfork(). This syscall has no arguments, it can
enter the kernel with the random value in rdi register. Let's
denote rdi = RDI_RANDOM.

2. Before the tracee actually calls vfork(), it reports this
syscall to strace.

3. strace does setbpt(), "case SYS_vfork" transforms this syscall
into clone(CLONE_PTRACE|SIGCHLD). Note: it changes ->orig_ax from
__NR_vfork to __NR_clone, and it changes arg0 (rdi) from RDI_RANDOM
to clone_flags = CLONE_PTRACE|SIGCHLD.

But! it also records the old value of the unused arg0 (rdi) in
tcp->inst[0]. Again, this is RDI_RANDOM.

4. The tracee receives a signal (SIGCHLD in this case).

5. strace resumes the tracee. The tracee calls copy_process()
which fails with ERESTARTNOINTR. The tracee reports syscall_exit
to strace.

6. strace does clearbpt() which restores rdi = RDI_RANDOM.
But! it does _not_ change ->orig_ax/scno! it is still equal
to __NR_clone, not __NR_vfork.

7. Eventually the tracee restarts the syscall, but this time it
does clone(RDI_RANDOM), and this is what we see in the logs.

It calls clone() because orig_ax/scno was not restored by
clearbpt(), it passes the "random" clone_flags because
suddenly rdi becomes the first argument of this syscall.


I am changing the component.

Comment 13 Roland McGrath 2010-12-06 23:26:03 UTC

Thanks, that analysis looks sound to me.

Note that upstream strace is on the verge of switching to using PTRACE_O_TRACEVFORK et al instead of the syscall-changing hackery.  So it could see this failure mode disappear without the actual bug having been fixed.  But since it will continue to support the old method for ancient kernels, it still deserves a fix upstream, though that is not really a Fedora concern.

Comment 15 Denys Vlasenko 2012-05-14 14:43:51 UTC

Fixed upstream by commit 55980f5b72000406e3fd843b098b5c1328a21e45.

This code is not triggered on Fedora anyway because we use PTRACE_O_TRACEVFORK.

Comment 16 Fedora Update System 2013-06-05 23:33:45 UTC

strace-4.8-1.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/strace-4.8-1.fc19

Comment 17 Fedora Update System 2013-06-13 06:51:14 UTC

strace-4.8-1.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.