
Bug 1113143

Summary: failing test exec-9.7
Product: Red Hat Enterprise Linux 7
Component: kernel
Kernel sub component: File Systems - Other
Version: 7.0
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Karel Srot <ksrot>
Assignee: Steve Whitehouse <swhiteho>
QA Contact: Filesystem QE <fs-qe>
Docs Contact:
CC: akarlsso, aviro, dhowells, esandeen, jskarvad, kernel-mgr, xzhou
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-03 08:29:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1116720, 876602
Attachments:
Description                 Flags
Stripped down reproducer    none
Stripped down reproducer    none
Backported fix              none

Description Karel Srot 2014-06-25 13:50:12 UTC
Description of problem:

I am observing the failure of upstream test exec-9.7 on PPC64 with RHEL-7.0.
I suspect that the failure is actually caused by a race condition introduced by kernel changes, because I encountered the issue in the latest RHEL-7 testing composes while tcl itself did not change. I am not able to provide further details yet.


Version-Release number of selected component (if applicable):
kernel-3.10.0-123.el7.ppc64
tcl-8.5.13-4.el7.ppc64
tested on compose 20140507.0

How reproducible:
you can execute the upstream test from exec.test
Or
I distilled the test into various (smaller) variants that are also able to reproduce the problem. There are exec.tst exec.tst2 exec.tst3 exec.tst4 included in the attached archive.


Steps to Reproduce:

exec.tst is reproducing the problem reliably on ppc64.

# tclsh exec.tst
==== exec-9.7 commands returning errors FAILED
==== Contents of test case:

    list [catch {exec [interpreter] "$path(sh)" -c "\"$path(echo)\" error msg 1>&2 ; \"$path(sleep)\" 1"  | [interpreter] "$path(sh)" -c "\"$path(echo)\" error msg2 1>&2 ; \"$path(sleep)\" 1"} msg] $msg

---- Result was:
1 {error msg2}
---- Result should have been (exact matching):
1 {error msg
error msg2}
==== exec-9.7 FAILED

exec.tst:	Total	1	Passed	0	Skipped	0	Failed	1

------------

simplified version
$ tclsh exec.tst2
is also able to reproduce the problem but not that frequently (e.g. in 1 out of 20 runs)

------------

The same applies to exec.tst3, which doesn't contain the sleep command.

------------

# for I in `seq 10`; do tclsh exec.tst4; echo; done
error msg
error msg2

error msg
error msg2

error msg
error msg2

error msg
error msg2

error msg
error msg2

error msg
error msg2

error msg2     <----- This is the problem

error msg
error msg2

error msg
error msg2

error msg
error msg2
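
Illustrative sketch (not part of the original report): since the failure shows up only in some runs, a small harness that repeats a command and tallies each distinct output makes the failure rate easier to quantify than reading the loop output by eye. The function name `tally_outputs` is hypothetical, and the attached exec.tst4 script is not reproduced here.

```python
import collections
import subprocess


def tally_outputs(cmd, runs=20):
    """Run cmd `runs` times and count each distinct combined stdout+stderr output."""
    counts = collections.Counter()
    for _ in range(runs):
        # Merge stderr into stdout so messages written to stderr are captured too.
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        counts[proc.stdout] += 1
    return counts
```

Assuming the reproducer is invoked as in the loop above, `tally_outputs(["tclsh", "exec.tst4"], runs=100)` would return a counter in which any key other than the two-line output marks a failing run.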


Actual results:
just one message is received (either first or second)

Expected results:
both messages are received

Additional info:
There is an unresolved upstream bug mentioning the problem
http://sourceforge.net/p/tcl/bugs/3974/

Comment 5 Jaroslav Škarvada 2014-08-01 16:26:04 UTC
(In reply to Karel Srot from comment #0)
> How reproducible:
> you can execute the upstream test from exec.test
>
I have been running the upstream test suite on ppc64 machines from Beaker for a second day and have not been able to trigger the problem.

> Or
> I distilled the test into various (smaller) variants that are also able to
> reproduce the problem. There are exec.tst exec.tst2 exec.tst3 exec.tst4
> included in the attached archive.
> 
Could you provide the archive?

Comment 6 Karel Srot 2014-08-05 11:48:30 UTC
To my surprise, I am also unable to reproduce the problem now, even when trying the same compose. I will give it a few more tries...

Regarding the archive, it seems that I forgot to attach it. :-( I guess it is lost now. Anyway, if I succeed in reproducing the problem again I will try to create it again.

Comment 11 Jaroslav Škarvada 2014-08-12 13:47:42 UTC
Created attachment 926075 [details]
Stripped down reproducer

I have probably found the problem: the code explicitly seeks on dup'ed FDs, so in specific cases there can be a race condition. It's probably not a kernel bug, because the behavior is documented. From the lseek man page:

> Note  that  file  descriptors created by dup(2) or fork(2) share the current file position pointer, so seeking on such files may be subject to race conditions.

A possible workaround is an explicit fsync:
--- ./test.c.orig	2014-08-12 21:36:45.000000000 +0800
+++ ./test.c	2014-08-12 21:44:47.459804077 +0800
@@ -49,6 +49,7 @@
     pid2 = clone(b, stacktop2, CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID | SIGCHLD, NULL, NULL, NULL, &ctid1);
     waitpid(pid1, NULL, 0);
     waitpid(pid2, NULL, 0);
+    fsync(fd);
     lseek(fd, 0, SEEK_SET);
     loop = read(fd, buf, 22) == 22;
     close(fd);

Currently I am unsure whether it is a tcl or a shell bug; continuing the investigation.
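
Illustrative sketch (not part of the original comment): the lseek man page note quoted above can be checked directly. The snippet below assumes nothing beyond the standard `os` and `tempfile` modules; the function name `dup_shares_offset` is made up for the example. It writes through one descriptor and then reads the file position through its dup.

```python
import os
import tempfile


def dup_shares_offset():
    """Return (offset seen via the dup after a write, offset seen via fd after rewinding the dup)."""
    fd, path = tempfile.mkstemp()
    try:
        dup_fd = os.dup(fd)            # second descriptor, same open file description
        os.write(fd, b"error msg1\n")  # advances the shared offset by 11 bytes
        offset_via_dup = os.lseek(dup_fd, 0, os.SEEK_CUR)  # query position, no move
        os.lseek(dup_fd, 0, os.SEEK_SET)                   # rewind through the dup
        offset_via_fd = os.lseek(fd, 0, os.SEEK_CUR)       # original fd is rewound too
        os.close(dup_fd)
        return offset_via_dup, offset_via_fd
    finally:
        os.close(fd)
        os.unlink(path)
```

Because both descriptors refer to one file position, a seek or write through either one is visible through the other, which is why unsynchronized users of such descriptors can interfere with each other.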

Comment 12 Jaroslav Škarvada 2014-08-13 08:02:37 UTC
> The workaround can be explicit fsync:
> --- ./test.c.orig	2014-08-12 21:36:45.000000000 +0800
> +++ ./test.c	2014-08-12 21:44:47.459804077 +0800
> @@ -49,6 +49,7 @@
>      pid2 = clone(b, stacktop2, CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID |
> SIGCHLD, NULL, NULL, NULL, &ctid1);
>      waitpid(pid1, NULL, 0);
>      waitpid(pid2, NULL, 0);
> +    fsync(fd);
>      lseek(fd, 0, SEEK_SET);
>      loop = read(fd, buf, 22) == 22;
>      close(fd);
> 
I also triggered the problem with this workaround in place, after several hours of running. So the problem isn't an unsynced buffer before the seek; the race occurs earlier, in the write itself, i.e.:

20512 write(1, "error msg2\n", 11 <unfinished ...>
20509 write(1, "error msg1\n", 11 <unfinished ...>
20512 <... write resumed> )             = 11
20509 <... write resumed> )             = 11

Here the file pointer gets mangled by the race condition and one string overwrites the other. The problem occurs more often on slow or loaded machines. I had thought that write was a safe operation; investigating.
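
Illustrative sketch (not the attached reproducer, which uses clone() in C): the scenario in the strace excerpt can be approximated by forking two children that each issue a single write on a descriptor inherited from the parent, then checking whether both 11-byte messages ended up in the file. The names `run_trial` and `count_lost_writes` are hypothetical. On a kernel that serializes file position updates the lost count is expected to be zero; on an affected kernel it may occasionally be non-zero, and many trials may be needed.

```python
import os
import tempfile

MSG1 = b"error msg1\n"
MSG2 = b"error msg2\n"


def run_trial():
    """Fork two writers that share one descriptor; return the resulting file contents."""
    fd, path = tempfile.mkstemp()
    try:
        pids = []
        for msg in (MSG1, MSG2):
            pid = os.fork()
            if pid == 0:
                # Child: the inherited fd shares the file position with its sibling.
                try:
                    os.write(fd, msg)
                finally:
                    os._exit(0)  # leave immediately, skipping the parent's cleanup
            pids.append(pid)
        for pid in pids:
            os.waitpid(pid, 0)
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 64)
    finally:
        os.close(fd)
        os.unlink(path)


def count_lost_writes(trials=200):
    """Count trials in which the file does not contain both messages."""
    expected = len(MSG1) + len(MSG2)
    return sum(1 for _ in range(trials) if len(run_trial()) != expected)
```

A trial that returns only one message corresponds to the "one string overwrites the other" case described above: both writes started from the same offset.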

Comment 13 Jaroslav Škarvada 2014-08-19 18:27:43 UTC
Created attachment 928482 [details]
Stripped down reproducer

A further simplified reproducer.

Comment 14 Jaroslav Škarvada 2014-08-19 18:57:43 UTC
Definitely a kernel bug. The upstream kernel fix:
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=9c225f2655e36a470c4f58dbbc99244c5fc7f2d4

Comment 15 Jaroslav Škarvada 2014-08-19 18:58:19 UTC
Created attachment 928486 [details]
Backported fix

Comment 16 Jaroslav Škarvada 2014-08-19 19:00:58 UTC
I am no longer able to reproduce the problem on ibm-p720-02-lp9.rhts.eng.nay.redhat.com with the patched kernel.

Comment 18 Karel Srot 2014-09-29 12:56:02 UTC
Hello,
any chance of fixing it in 7.1?