Bug 1113143
| Summary: | failing test exec-9.7 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Karel Srot <ksrot> |
| Component: | kernel | Assignee: | Steve Whitehouse <swhiteho> |
| kernel sub component: | File Systems - Other | QA Contact: | Filesystem QE <fs-qe> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | high | CC: | akarlsso, aviro, dhowells, esandeen, jskarvad, kernel-mgr, xzhou |
| Version: | 7.0 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-03 08:29:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1116720, 876602 | | |
| Attachments: | | | |
(In reply to Karel Srot from comment #0)
> How reproducible:
> you can execute the upstream test from exec.test
>
I have been running the upstream test suite on ppc64 machines from beaker for a second day and I wasn't able to trigger the problem.
> Or
> I distilled the test into various (smaller) variants that are also able to
> reproduce the problem. There are exec.tst exec.tst2 exec.tst3 exec.tst4
> included in the attached archive.
>
Could you provide the archive?
To my surprise I am also unable to reproduce the problem now, even when trying the same compose. I will give it a few more tries... Regarding the archive, it seems that I forgot to attach it. :-( I guess it is lost now. Anyway, if I succeed in reproducing the problem again I will try to create it again.
Created attachment 926075 [details]
Stripped down reproducer
I probably got the problem. It's explicitly seeking on duped FDs, so in specific cases there can be a race condition. It's probably not a kernel bug, because the behavior is documented. From the lseek man page:
> Note that file descriptors created by dup(2) or fork(2) share the current file position pointer, so seeking on such files may be subject to race conditions.
The workaround can be explicit fsync:
--- ./test.c.orig 2014-08-12 21:36:45.000000000 +0800
+++ ./test.c 2014-08-12 21:44:47.459804077 +0800
@@ -49,6 +49,7 @@
     pid2 = clone(b, stacktop2, CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID | SIGCHLD, NULL, NULL, NULL, &ctid1);
     waitpid(pid1, NULL, 0);
     waitpid(pid2, NULL, 0);
+    fsync(fd);
     lseek(fd, 0, SEEK_SET);
     loop = read(fd, buf, 22) == 22;
     close(fd);
Currently I am unsure whether it is a tcl or a shell bug; continuing the investigation.
> The workaround can be explicit fsync:
> --- ./test.c.orig 2014-08-12 21:36:45.000000000 +0800
> +++ ./test.c 2014-08-12 21:44:47.459804077 +0800
> @@ -49,6 +49,7 @@
> pid2 = clone(b, stacktop2, CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID |
> SIGCHLD, NULL, NULL, NULL, &ctid1);
> waitpid(pid1, NULL, 0);
> waitpid(pid2, NULL, 0);
> + fsync(fd);
> lseek(fd, 0, SEEK_SET);
> loop = read(fd, buf, 22) == 22;
> close(fd);
>
I triggered the problem even with this workaround, after several hours of running. So the problem isn't an unsynced buffer before the seek; the race occurs earlier, in the write itself, i.e.:
20512 write(1, "error msg2\n", 11 <unfinished ...>
20509 write(1, "error msg1\n", 11 <unfinished ...>
20512 <... write resumed> ) = 11
20509 <... write resumed> ) = 11
Here the file pointer gets mangled by the race condition and one string overwrites the other. The problem occurs more often on slow or loaded machines. I had thought write was a safe operation; investigating.
Created attachment 928482 [details]
Stripped down reproducer
More simplified reproducer.
Definitely a kernel bug. Upstream kernel fix: http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=9c225f2655e36a470c4f58dbbc99244c5fc7f2d4
Created attachment 928486 [details]
Backported fix
I am unable to reproduce the problem any more on the ibm-p720-02-lp9.rhts.eng.nay.redhat.com with the patched kernel.
Hello, any chance of fixing it in 7.1?
Description of problem:
I am observing the failure of upstream test exec-9.7 on PPC64 with RHEL-7.0. I suspect that the failure is actually caused by a race condition that appeared due to kernel changes, because I encountered the issue in the latest RHEL-7 testing composes while tcl didn't change, but I am not able to provide further details (yet).

Version-Release number of selected component (if applicable):
kernel-3.10.0-123.el7.ppc64
tcl-8.5.13-4.el7.ppc64
tested on compose 20140507.0

How reproducible:
you can execute the upstream test from exec.test
Or
I distilled the test into various (smaller) variants that are also able to reproduce the problem. There are exec.tst exec.tst2 exec.tst3 exec.tst4 included in the attached archive.

Steps to Reproduce:
exec.tst is reproducing the problem reliably on ppc64.
# tclsh exec.tst
==== exec-9.7 commands returning errors FAILED
==== Contents of test case:
list [catch {exec [interpreter] "$path(sh)" -c "\"$path(echo)\" error msg 1>&2 ; \"$path(sleep)\" 1" | [interpreter] "$path(sh)" -c "\"$path(echo)\" error msg2 1>&2 ; \"$path(sleep)\" 1"} msg] $msg
---- Result was:
1 {error msg2}
---- Result should have been (exact matching):
1 {error msg error msg2}
==== exec-9.7 FAILED
exec.tst: Total 1 Passed 0 Skipped 0 Failed 1
------------
simplified version
$ tclsh exec.tst2
is also able to reproduce the problem but not that frequently (e.g. in 1 out of 20 runs)
------------
same applies for exec.tst3 which doesn't contain sleep command.
------------
# for I in `seq 10`; do tclsh exec.tst4; echo; done
error msg error msg2
error msg error msg2
error msg error msg2
error msg error msg2
error msg error msg2
error msg error msg2
error msg2 <----- This is the problem
error msg error msg2
error msg error msg2
error msg error msg2

Actual results:
just one message is received (either first or second)

Expected results:
both messages are received

Additional info:
There is an unresolved upstream bug mentioning the problem http://sourceforge.net/p/tcl/bugs/3974/