1361246 – rlRun -s doesn't save data immediately after ends - race condition

Bug 1361246 - rlRun -s doesn't save data immediately after ends - race condition

Summary: rlRun -s doesn't save data immediately after ends - race condition

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	beakerlib
Sub Component:
Version:	26
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	---
Assignee:	Dalibor Pospíšil
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1541923 (view as bug list)
Depends On:
Blocks:	1416796
TreeView+	depends on / blocked

Reported:	2016-07-28 15:22 UTC by Robin Hack
Modified:	2018-03-18 09:20 UTC (History)
CC List:	9 users (show)
Fixed In Version:	beakerlib-1.17-13
Clone Of:
Environment:
Last Closed:	2018-03-18 09:20:11 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Proposed patch (1.67 KB, patch) 2018-02-06 12:25 UTC, Michal Sekletar	no flags	Details \| Diff
Proposed patch #2 (909 bytes, patch) 2018-02-06 13:12 UTC, Frantisek Sumsal	no flags	Details \| Diff
Proposed patch #3 (1.76 KB, patch) 2018-02-07 17:29 UTC, Michal Sekletar	no flags	Details \| Diff
Proposed patch #4 (not working) (4.06 KB, patch) 2018-02-07 19:00 UTC, Jiri Jaburek	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1541923	1	None	None	None	2021-01-20 06:05:38 UTC

Internal Links: 1541923

Description Robin Hack 2016-07-28 15:22:44 UTC

Description of problem:
When program generates lot of output (for example journalctl from systemd), then data aren't written to file after rlRun finishes. Which is not nice and leads to race conditions.

Version-Release number of selected component (if applicable):


How reproducible:
always/it depends on how output is big

Steps to Reproduce:
0. Fill journal with many lines.
1. rlRun -s "journalctl"
2. grep blah ${rlRun_LOG}


Actual results:
No blah found.

After some sleep, blah is here

Expected results:
It would be fine to know that all data are written when rlRun finishes.
Maybe you should wait to all proceses created by that nasty pipe below.

Additional info:
Why is this broken?
        eval "$UNBUFFER$command" 2> >(sed -u -e "s/^/$TAG_ERR/g" |
                tee -a $LOG_FILE) 1> >(sed -u -e "s/^/$TAG_OUT/g" | tee -a $LOG_FILE)

Because it's magic :).

You can imagine this line simple as:

ls > >(sleep 10; cat -)

$ ls > >(sleep 10; cat -); echo $?
0
... after 10 seconds ...
$ analyze.sh
beakerlib.sh
dictionary.vim
infrastructure.sh
journal.sh
libraries.sh
logging.sh
performance.sh
rpms.sh
storage.sh
synchronisation.sh
testing.sh
virtualX.sh

Comment 1 Dalibor Pospíšil 2016-07-29 08:16:43 UTC

Nice, I use rlRun -s a lot but never faced something like this.
Are you able to propose a patch to speed up fixing?

Comment 2 Jiri Jaburek 2016-09-29 12:52:17 UTC

Using the 'coproc' bash builtin would be one way, but that one is not available on RHEL5.

Another way would be using process substitution (like the example in comment #0) - the problem with the example is that the 'sed' is created as a child of the 'ls', being moved under pid 1 as soon as the 'ls' ends. Spawning multiple such subshells also prevents catching all with $!.

The solution, without using 'coproc', is to spawn the subshells separately:

# create sed processes in background
exec 10> >(sed 's/^/BLAH /')
proc1=$!
exec 11> >(sed 's/^/BLEH /')
proc2=$!

# write data to them
echo abc >&10
echo abc >&10
echo def >&11
echo def >&11

# close their inputs, so they flush buffers & exit cleanly,
# guaranteeing that all data has been written
exec 10>&-
exec 11>&-

# wait for them to exit (using spinlock)
# note: the 'wait' builtin works only on jobs
while kill -0 $proc1 2>/dev/null; do :; done;
while kill -0 $proc2 2>/dev/null; do :; done;

This can be further enhanced by using named file descriptors to not have arbitrary numbers hardcoded, but it doesn't work on RHEL5:

# create
exec {BEAKERLIB_WHATEVER}> >(sed 's/^/WHEE /')
# use
echo abc >&$BEAKERLIB_WHATEVER
# close
exec {BEAKERLIB_WHATEVER}>&-


(From: The horrors of Bash, vol. 1.)

Comment 3 Dalibor Pospíšil 2016-09-29 16:03:33 UTC

fixed by https://git.fedorahosted.org/cgit/beakerlib.git/commit/?id=688273145292c75fb3e5ea013621746bd713086d

Please try to apply the patch and provide feedback.

Comment 4 Dalibor Pospíšil 2017-01-26 12:59:24 UTC

It looks like I will need to revert this patch as it may cause more problems then it solves, see bz1416796.

Comment 5 Robin Hack 2017-02-02 08:23:13 UTC

Can we just remove -s in rlRun? It doesn't work What I remember.

Comment 6 Frantisek Sumsal 2017-02-02 09:18:32 UTC

We can't do that as many tests use and depend on that function.

Comment 7 Robin Hack 2017-02-02 09:59:55 UTC

Well. Then we should recomment not to do... :).

Comment 8 Dalibor Pospíšil 2017-02-02 21:45:28 UTC

Personally I did not noticed the issue yet and I prefer to use -s option rather than '| tee' or so. I believe this is really only a corner case.
We can put into the documentation that it may not save all the output in certain situations, though.

Comment 9 Robin Hack 2017-02-03 13:54:02 UTC

Good. Instead of solving issues we will document them :).

Then.. why I should use rlRun -s at all when I don't know if this courner case involves my case? Maybe I should grab a crystal ball and try to guess :(.

Basically. rlRun -s SHOULD save output of command... great progress :).
Maybe we save your data.. or maybe not. We will see! :D.

Comment 10 Dalibor Pospíšil 2017-02-05 20:37:22 UTC

Until we have reliable fix, it is good to document it.

Comment 11 Fedora End Of Life 2017-02-28 10:01:22 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 26 development cycle.
Changing version to '26'.

Comment 12 Michal Sekletar 2018-02-06 12:25:23 UTC

Created attachment 1392045 [details]
Proposed patch

Comment 13 Frantisek Sumsal 2018-02-06 13:12:37 UTC

Created attachment 1392049 [details]
Proposed patch #2

I came up with the attached patch, which should solve the issue as well. Unfortunately, both proposed patches fail with the reproducer from BZ#1416796 - the pipeline/spinlock won't continue, until the corresponding scope is stopped.

Comment 14 Jiri Jaburek 2018-02-06 14:08:01 UTC

What exactly is the problem cause here?

I'm still puzzled why 688273145292c75fb3e5ea013621746bd713086d did have the behavior described - as far as I can tell, systemd-run exits cleanly without any stray fds and whatever it spawns is not relevant to the current bash process.

A hang like this can happen if there's an open fd somewhere near the beakerlib's bash process that's never closed, but I'm not seeing any. Did the system run out of PIDs or something?

Patch #2 seems like a sidestep and if it doesn't work for BZ#1416796, then maybe that BZ is not related to this BZ, or, the logic from 688273145 is not to blame and the bug has to be in the code around it if a completely different approach (patch #2) has the same behavior.

Comment 15 Dalibor Pospíšil 2018-02-06 21:38:41 UTC

Unfortunately both patches, in comment 12 and comment 13 have one important drawback - it breaks setting a variable inside the rlRun.

Normally rlRun 'X=text' works also with option -s, the content of X is available outside the rlRun. After applying either of the patches it stops to work with -s option.

Fix https://github.com/beakerlib/beakerlib/commit/688273145292c75fb3e5ea013621746bd713086d did not suffer by this. Unfortunately I do not remember the exact reason why I reverted it. I can put it to some separate branch for testing purposes.

Comment 16 Dalibor Pospíšil 2018-02-06 22:12:06 UTC

Ok, so it is now available here https://github.com/beakerlib/beakerlib/tree/bz1361246-rlRun-output-flush

Comment 18 Frantisek Sumsal 2018-02-07 09:05:28 UTC

The mentioned patch[0] was reverted due to same reason I mentioned above - under certain circumstances rlRun ends in an endless loop (see BZ#1416796).

[0] https://github.com/beakerlib/beakerlib/commit/688273145292c75fb3e5ea013621746bd713086d

Comment 19 Jiri Jaburek 2018-02-07 11:56:55 UTC

Okay, some debugging later and I now know what the exact issue is. TL;DR - it's technically a feature / user error, not a beakerlib bug.

----

The core problem is file descriptor (fd) inheritance during fork(), the reproducer in BZ#1416796 doesn't close stdout/stderr before forking the subshell with `&', so the subshell process has (still) both open, which holds sed/tee open (listening for any possible output from the subshell), which makes beakerlib wait for them via `kill -0', which causes the endless loop.

There is no unix-like way, from bash/C/python/etc., to "solve" this. This is an inherent feature of unix-like systems.

Imagine if you want to run a badly coded (doesn't close I/O) daemon via `rlRun -s' and want to log its messages before it goes into background and then grep those messages. Currently, there is no good way to do that with sed/tee - if you don't wait, you might not get the pre-daemonization messages, but if you want to wait, you cannot tell when the messages are actually written if both the daemon's original parent and child processes have the same output fd open (because it is never fully closed).

Additionally, this is "probably" not what you want, because it leaves stray daemon processes on the system that you have no control over (PID or parent-child relationship). There are good testing use cases for this, but not in combination with the `-s' flag as you need functionality-specific logic to determine how to find the daemon and when to kill it.

So my proposal would be to
1) make the waiting finite (~10 minutes, should be much faster in
theory, will depend on disk speed and the amount of writes)
2) if the timeout times out, SIGTERM processes holding the pipe
open - this includes the `sed', its parent subshell and the
stray daemon
3) tell the user the PIDs and process names of those processes and
that they have been killed, pointing to a beakerlib manpage
describing this issue in detail
4) optionally do something like rlFail so the user notices

This won't actually change the current user-perceived behavior. If the user test now PASSes, it is because the race between `tee' writing output and rlAssertGrep (or whatever) checking for it is won by `tee' and, even if the process daemonizes, the important output has been observed way before the 10 minute mark (in milliseconds). Killing the stray daemon and the logging processes is then done way after the important output (that makes the test PASS) is written.

OTOH, doing rlFail might be better as the stray daemon would otherwise continue to run forever on the system, potentially filling the disk (which is what happens with current beakerlib if you start a verbose daemon with `rlRun -s').

As an alternate solution - don't sleep. Check (once) if there are processes other than sed/tee (and their subshell parents) using the fds and kill them as they're not supposed to be running at this point (for the reasons listed above). Then you're /guaranteed/ to be waiting only for sed/tee, which are fairly likely to quit.

----

Finally, there is another way.

The entire reason we wait for sed/tee to exit is to guarantee that they have flushed their buffers, so the output from the user-specified process is written to the file in its entirety. If we can guarantee this in another way, we can get our hands clean off the messy daemon the user might have left on the system and let it destroy or flood the system because it's not our concern ***as long as whatever it wrote before it daemonized is flushed to the log file when rlRun returns***.

This would need sed/tee to have some side channel through which we could watch their I/O requests and buffers, to see when the data are flushed out of userspace. While this is not possible via pure sed/tee, it could be done in C/python and with a bit of creativity in bash as well.

The problem is that it is not bulletproof. When the user process fork()s, both the parent and the child are running with the output fds open. The child (daemon) can then start writing *before* the parent exists and we have no way of knowing where the data came from. This could be countered with our sed/tee wrapper not accepting any more input after we tell it that the daemon's parent has exited, but it's a fairly messy workaround that will not work in some corner cases (child writing a LOT of data in one syscall before parent quits).

----

Given all the drawbacks and custom solutions that algorithm would need, I would vote for just killing the stray daemon(s).

... and DOCUMENTING EVERYTHING !

Comment 20 Jiri Jaburek 2018-02-07 13:30:43 UTC

To avoid breaking any compatibility with killing stray daemons, such as when the user *wants* to start a background process and then repeatedly check its log, like

  rlRun -s "start_my_daemon &"
  rlWaitForCmd "grep \"Daemon Started\" $rlRun_LOG"
  rlRun "send_msg_to_daemon"
  rlWaitForCmd "grep \"Got Message\" $rlRun_LOG"
  rlRun "send_shutdown_to_daemon"

we could introduce a new flag that would have the waiting behavior instead of modifying -s, ie.

  rlRun -S "some simple blocking cmd which stays in foreground"
  rlAssertGrep "65535-bit keys generated" "$rlRun_LOG"

with the documentation clearly stating that any processes left running and using stdout/stderr after the rlRun command finishes will be killed to prevent infinite looping.

Comment 21 Michal Sekletar 2018-02-07 16:37:08 UTC

(In reply to Dalibor Pospíšil from comment #15)

> Normally rlRun 'X=text' works also with option -s, the content of X is
> available outside the rlRun. After applying either of the patches it stops
> to work with -s option.

Problem in patch that I proposed is that bash runs all but the last command from the pipeline in subshells. FWIW, zsh seems to do the same, but documentation for zsh mentions that there are shells which do it another way around, i.e. first command runs in parent shell (given it is a built-in) and rest is run in subshells. Too bad bash is not one of those "other" shells.

Comment 22 Michal Sekletar 2018-02-07 17:29:54 UTC

Created attachment 1392787 [details]
Proposed patch #3

Throwing another approach into the mix. This time I am using named pipes in the filesystem and I am running sed + tee pipeline in the background.

Comment 23 Jiri Jaburek 2018-02-07 17:48:54 UTC

(In reply to Michal Sekletar from comment #22)
> Created attachment 1392787 [details]
> Proposed patch #3
> 
> Throwing another approach into the mix. This time I am using named pipes in
> the filesystem and I am running sed + tee pipeline in the background.

There's no need for named pipes, the approach from 688273145 works well and it doesn't use a subshell for the eval. The only "problem" is the waiting logic, which should be further conditioned by ie. a check for a different option (`-S' in my example).

It's true that 688273145 could use some touch-ups to not pass 111 and 112 to the subshells, but that's a minor tweak.

Comment 24 Jiri Jaburek 2018-02-07 19:00:47 UTC

Created attachment 1392833 [details]
Proposed patch #4 (not working)

Attaching a specific implementation based on 688273145 that I had in mind. It's non-functional right now (no time to debug), but should give you the basic idea of what I meant.

I can add documentation and make it work if this specific solution is desirable.

Comment 25 Frantisek Sumsal 2018-02-12 09:48:33 UTC

After a follow up discussion with Michal, we came to the conclusion that the behavior described in BZ#1416796 is actually expected, thus it would be unwise to implement any kind of timeout workaround for it. To be honest, I can't even remember where I stumbled upon this issue, because it's not in the systemd testsuite.

Thus, I'm proposing to close BZ#1416796 as WONTFIX (once again) and select one of the proposed patches (without any timeout workarounds) to make rlRun -s reliable to use.

Comment 26 Dalibor Pospíšil 2018-02-14 21:00:18 UTC

I tend to limit the loop. I've done a commit [1] to the dedicated branch. I think that solution could solve most of the issues.

[1] https://github.com/beakerlib/beakerlib/commit/a40ddf7b90be09a94ceab0d3340d352e6d330335

Comment 27 Dalibor Pospíšil 2018-02-20 07:54:21 UTC

*** Bug 1541923 has been marked as a duplicate of this bug. ***

Comment 28 Dalibor Pospíšil 2018-02-20 08:00:49 UTC

fixed by https://github.com/beakerlib/beakerlib/commit/9885cc64394d3e307d4a17690f4233215af96575

Note You need to log in before you can comment on or make changes to this bug.