Description of problem:

When a program generates a lot of output (for example journalctl from systemd), the data aren't fully written to the log file by the time rlRun finishes. That is not nice and leads to race conditions.

Version-Release number of selected component (if applicable):

How reproducible:
always / it depends on how big the output is

Steps to Reproduce:
0. Fill the journal with many lines.
1. rlRun -s "journalctl"
2. grep blah ${rlRun_LOG}

Actual results:
No blah found. After some sleep, blah is there.

Expected results:
It would be nice to have a guarantee that all data have been written when rlRun finishes. Maybe you should wait for all processes created by that nasty pipe below.

Additional info:

Why is this broken?

eval "$UNBUFFER$command" 2> >(sed -u -e "s/^/$TAG_ERR/g" | tee -a $LOG_FILE) 1> >(sed -u -e "s/^/$TAG_OUT/g" | tee -a $LOG_FILE)

Because it's magic :). You can imagine this line simply as:

ls > >(sleep 10; cat -)

$ ls > >(sleep 10; cat -); echo $?
0
... after 10 seconds ...
$ analyze.sh beakerlib.sh dictionary.vim infrastructure.sh journal.sh libraries.sh logging.sh performance.sh rpms.sh storage.sh synchronisation.sh testing.sh virtualX.sh
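For illustration, here is a minimal standalone sketch of the race (this is only a demonstration, not beakerlib code; the log path and line count are arbitrary):

#!/bin/bash
LOG=/tmp/rlrun-race.log
rm -f "$LOG"

# emulate what rlRun -s does: redirect output into a process substitution
seq 1 200000 > >(sed -u 's/^/STDOUT: /' | tee -a "$LOG" > /dev/null)

# the command above has already returned, but sed/tee may still be draining the pipe
grep -q 'STDOUT: 200000' "$LOG" 2>/dev/null || echo "last line not written yet"
sleep 5
grep -q 'STDOUT: 200000' "$LOG" && echo "last line appeared only later"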
Nice, I use rlRun -s a lot but have never faced something like this. Are you able to propose a patch to speed up the fix?
Using the 'coproc' bash builtin would be one way, but that one is not available on RHEL5. Another way would be using process substitution (like the example in comment #0) - the problem with the example is that the 'sed' is created as a child of the 'ls', being moved under pid 1 as soon as the 'ls' ends. Spawning multiple such subshells also prevents catching all of them with $!.

The solution, without using 'coproc', is to spawn the subshells separately:

# create sed processes in background
exec 10> >(sed 's/^/BLAH /')
proc1=$!
exec 11> >(sed 's/^/BLEH /')
proc2=$!

# write data to them
echo abc >&10
echo abc >&10
echo def >&11
echo def >&11

# close their inputs, so they flush buffers & exit cleanly,
# guaranteeing that all data has been written
exec 10>&-
exec 11>&-

# wait for them to exit (using spinlock)
# note: the 'wait' builtin works only on jobs
while kill -0 $proc1 2>/dev/null; do :; done
while kill -0 $proc2 2>/dev/null; do :; done

This can be further enhanced by using named file descriptors to not have arbitrary numbers hardcoded, but it doesn't work on RHEL5:

# create
exec {BEAKERLIB_WHATEVER}> >(sed 's/^/WHEE /')
# use
echo abc >&$BEAKERLIB_WHATEVER
# close
exec {BEAKERLIB_WHATEVER}>&-

(From: The horrors of Bash, vol. 1.)
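A rough sketch of how the pattern above could be wired into an rlRun-like wrapper (the helper name run_and_log is hypothetical and this is not actual beakerlib code; it just follows the fd/spinlock idea from the snippet above, including relying on $! being set after the process substitution):

run_and_log() {
    local command=$1 LOG_FILE=$2 TAG_OUT="STDOUT: " TAG_ERR="STDERR: "

    # spawn the logging pipelines first, remembering their pids
    exec 10> >(sed -u "s/^/$TAG_OUT/" | tee -a "$LOG_FILE")
    local out_pid=$!
    exec 11> >(sed -u "s/^/$TAG_ERR/" | tee -a "$LOG_FILE")
    local err_pid=$!

    # run the command with stdout/stderr redirected into the pipelines
    eval "$command" 1>&10 2>&11
    local ret=$?

    # close the write ends so sed/tee see EOF, flush and exit
    exec 10>&- 11>&-

    # spin until both logging pipelines are really gone
    while kill -0 "$out_pid" 2>/dev/null; do :; done
    while kill -0 "$err_pid" 2>/dev/null; do :; done

    return $ret
}

# usage: run_and_log "journalctl" /tmp/out.log; grep blah /tmp/out.log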
Fixed by https://git.fedorahosted.org/cgit/beakerlib.git/commit/?id=688273145292c75fb3e5ea013621746bd713086d

Please try to apply the patch and provide feedback.
It looks like I will need to revert this patch as it may cause more problems than it solves, see bz1416796.
Can we just remove -s from rlRun? From what I remember, it doesn't work.
We can't do that as many tests use and depend on that function.
Well. Then we should recommend not using it... :).
Personally I have not noticed the issue yet and I prefer to use the -s option rather than '| tee' or similar. I believe this is really only a corner case. We can note in the documentation that it may not save all the output in certain situations, though.
Good. Instead of solving issues we will document them :). Then... why should I use rlRun -s at all when I don't know whether this corner case applies to my case? Maybe I should grab a crystal ball and try to guess :(. Basically, rlRun -s SHOULD save the output of the command... great progress :). Maybe we save your data... or maybe not. We will see! :D.
Until we have a reliable fix, it is good to document it.
This bug appears to have been reported against 'rawhide' during the Fedora 26 development cycle. Changing version to '26'.
Created attachment 1392045 [details]
Proposed patch
Created attachment 1392049 [details]
Proposed patch #2

I came up with the attached patch, which should solve the issue as well. Unfortunately, both proposed patches fail with the reproducer from BZ#1416796 - the pipeline/spinlock won't continue until the corresponding scope is stopped.
What exactly is the root cause here? I'm still puzzled why 688273145292c75fb3e5ea013621746bd713086d showed the behavior described - as far as I can tell, systemd-run exits cleanly without any stray fds, and whatever it spawns is not relevant to the current bash process. A hang like this can happen if there's an open fd somewhere near beakerlib's bash process that is never closed, but I'm not seeing any. Did the system run out of PIDs or something?

Patch #2 seems like a sidestep, and if it doesn't work for BZ#1416796 either, then maybe that BZ is not related to this one - or the logic from 688273145 is not to blame and the bug has to be in the code around it, given that a completely different approach (patch #2) shows the same behavior.
Unfortunately both patches, in comment 12 and comment 13, have one important drawback - they break setting a variable inside rlRun. Normally rlRun 'X=text' works also with the -s option, i.e. the content of X is available outside the rlRun. After applying either of the patches this stops working with the -s option. Fix https://github.com/beakerlib/beakerlib/commit/688273145292c75fb3e5ea013621746bd713086d did not suffer from this. Unfortunately I do not remember the exact reason why I reverted it. I can put it into a separate branch for testing purposes.
Ok, so it is now available here https://github.com/beakerlib/beakerlib/tree/bz1361246-rlRun-output-flush
The mentioned patch [0] was reverted for the same reason I mentioned above - under certain circumstances rlRun ends in an endless loop (see BZ#1416796).

[0] https://github.com/beakerlib/beakerlib/commit/688273145292c75fb3e5ea013621746bd713086d
Okay, some debugging later and I now know what the exact issue is. TL;DR - it's technically a feature / user error, not a beakerlib bug.

----

The core problem is file descriptor (fd) inheritance during fork(): the reproducer in BZ#1416796 doesn't close stdout/stderr before forking the subshell with `&', so the subshell process still has both open, which holds sed/tee open (listening for any possible output from the subshell), which makes beakerlib wait for them via `kill -0', which causes the endless loop.

There is no unix-like way, from bash/C/python/etc., to "solve" this. It is an inherent feature of unix-like systems.

Imagine you want to run a badly coded daemon (one that doesn't close its I/O) via `rlRun -s', log its messages before it goes into the background, and then grep those messages. Currently there is no good way to do that with sed/tee - if you don't wait, you might not get the pre-daemonization messages, but if you do wait, you cannot tell when the messages have actually been written, because both the daemon's original parent and its child keep the same output fd open (so it is never fully closed).

Additionally, this is "probably" not what you want, because it leaves stray daemon processes on the system that you have no control over (no PID, no parent-child relationship). There are good testing use cases for this, but not in combination with the `-s' flag, as you need functionality-specific logic to determine how to find the daemon and when to kill it.

So my proposal would be to

1) make the waiting finite (~10 minutes; it should be much faster in practice, depending on disk speed and the amount of writes)
2) if the timeout expires, SIGTERM the processes holding the pipe open - this includes the `sed', its parent subshell and the stray daemon
3) tell the user the PIDs and process names of those processes and that they have been killed, pointing to a beakerlib manpage describing this issue in detail
4) optionally do something like rlFail so the user notices

This won't actually change the current user-perceived behavior. If the user's test PASSes now, it is because the race between `tee' writing output and rlAssertGrep (or whatever) checking for it is won by `tee', and even if the process daemonizes, the important output has been observed way before the 10 minute mark (within milliseconds). Killing the stray daemon and the logging processes then happens long after the important output (that makes the test PASS) has been written. OTOH, doing rlFail might be better, as the stray daemon would otherwise continue to run forever on the system, potentially filling the disk (which is what happens with current beakerlib if you start a verbose daemon with `rlRun -s').

As an alternative solution - don't sleep. Check (once) whether there are processes other than sed/tee (and their subshell parents) using the fds and kill them, as they're not supposed to be running at this point (for the reasons listed above). Then you're /guaranteed/ to be waiting only for sed/tee, which are fairly likely to quit.

----

Finally, there is another way. The entire reason we wait for sed/tee to exit is to guarantee that they have flushed their buffers, so that the output from the user-specified process is written to the file in its entirety. If we can guarantee this in another way, we can keep our hands clean of the messy daemon the user might have left on the system and let it destroy or flood the system, because it's not our concern ***as long as whatever it wrote before it daemonized is flushed to the log file when rlRun returns***.
This would require sed/tee to have some side channel through which we could watch their I/O requests and buffers, to see when the data are flushed out of userspace. While this is not possible via pure sed/tee, it could be done in C/python, and with a bit of creativity in bash as well.

The problem is that it is not bulletproof. When the user process fork()s, both the parent and the child are running with the output fds open. The child (daemon) can then start writing *before* the parent exits, and we have no way of knowing where the data came from. This could be countered by having our sed/tee wrapper stop accepting input once we know the daemon's parent has exited, but that is a fairly messy workaround which will not work in some corner cases (a child writing a LOT of data in one syscall before the parent quits).

----

Given all the drawbacks and custom solutions that algorithm would need, I would vote for just killing the stray daemon(s).

... and DOCUMENTING EVERYTHING !
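A small self-contained sketch of the inheritance problem described above (illustration only, not beakerlib code; the sleep stands in for the stray daemon and the log path is arbitrary):

# the logging pipeline, writing to a file
exec 9> >(sed -u 's/^/LOG: /' >> /tmp/fd-demo.log)
logger_pid=$!

# "user command": writes a message, then leaves a background child running
# without closing its inherited copy of fd 9
( echo "message before daemonizing" >&9; sleep 300 & )

exec 9>&-   # our copy is closed, but the sleep still holds one open
kill -0 "$logger_pid" 2>/dev/null \
    && echo "sed is still running - a spinlock here would block until the child exits"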
To avoid breaking compatibility by killing stray daemons - such as when the user *wants* to start a background process and then repeatedly check its log, like

rlRun -s "start_my_daemon &"
rlWaitForCmd "grep \"Daemon Started\" $rlRun_LOG"
rlRun "send_msg_to_daemon"
rlWaitForCmd "grep \"Got Message\" $rlRun_LOG"
rlRun "send_shutdown_to_daemon"

we could introduce a new flag that would have the waiting behavior instead of modifying -s, i.e.

rlRun -S "some simple blocking cmd which stays in foreground"
rlAssertGrep "65535-bit keys generated" "$rlRun_LOG"

with the documentation clearly stating that any processes left running and using stdout/stderr after the rlRun command finishes will be killed to prevent infinite looping.
(In reply to Dalibor Pospíšil from comment #15)
> Normally rlRun 'X=text' works also with option -s, the content of X is
> available outside the rlRun. After applying either of the patches it stops
> to work with -s option.

The problem with the patch I proposed is that bash runs all but the last command of the pipeline in subshells. FWIW, zsh seems to do the same, but the zsh documentation mentions that there are shells which do it the other way around, i.e. the first command runs in the parent shell (given it is a built-in) and the rest runs in subshells. Too bad bash is not one of those "other" shells.
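A tiny illustration in plain bash (not the patch itself) of why piping the eval'd command breaks rlRun -s 'X=text': the left-hand side of a pipeline runs in a subshell, so the assignment never reaches the parent shell.

X=""
eval 'X=text' | cat    # the eval runs in a subshell here
echo "X='${X}'"        # prints X='' - the assignment was lost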
Created attachment 1392787 [details]
Proposed patch #3

Throwing another approach into the mix. This time I am using named pipes in the filesystem and I am running the sed + tee pipeline in the background.
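The named-pipe idea in a rough, hedged sketch (this is not the attached patch; $command and $LOG_FILE are assumed to be set by the caller): create a FIFO, run the sed | tee pipeline in the background reading from it, run the command in the current shell writing into it, then wait for the background job - `wait' works here because the pipeline is a real job of the current shell.

fifo=$(mktemp -u)
mkfifo "$fifo"

sed -u 's/^/STDOUT: /' < "$fifo" | tee -a "$LOG_FILE" > /dev/null &
logger_job=$!

# the command runs in the current shell, so 'X=text' style assignments survive
eval "$command" > "$fifo" 2>&1
ret=$?

# once all writers have closed the FIFO, sed sees EOF, flushes and the job exits
wait "$logger_job"
rm -f "$fifo"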
(In reply to Michal Sekletar from comment #22)
> Created attachment 1392787 [details]
> Proposed patch #3
> 
> Throwing another approach into the mix. This time I am using named pipes in
> the filesystem and I am running sed + tee pipeline in the background.

There's no need for named pipes; the approach from 688273145 works well and it doesn't use a subshell for the eval. The only "problem" is the waiting logic, which should be further conditioned by e.g. a check for a different option (`-S' in my example). It's true that 688273145 could use some touch-ups to not pass 111 and 112 to the subshells, but that's a minor tweak.
Created attachment 1392833 [details]
Proposed patch #4 (not working)

Attaching a specific implementation based on 688273145 that I had in mind. It's non-functional right now (no time to debug), but it should give you the basic idea of what I meant. I can add documentation and make it work if this specific solution is desirable.
After a follow-up discussion with Michal, we came to the conclusion that the behavior described in BZ#1416796 is actually expected, thus it would be unwise to implement any kind of timeout workaround for it. To be honest, I can't even remember where I stumbled upon this issue, because it's not in the systemd test suite. Thus, I'm proposing to close BZ#1416796 as WONTFIX (once again) and to select one of the proposed patches (without any timeout workarounds) to make rlRun -s reliable to use.
I'm inclined to limit the loop. I've made a commit [1] on the dedicated branch. I think that solution could solve most of the issues.

[1] https://github.com/beakerlib/beakerlib/commit/a40ddf7b90be09a94ceab0d3340d352e6d330335
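Something along these lines, purely as a sketch (the actual commit may differ; the helper name wait_for_logger and the 60-second default are made up for illustration):

wait_for_logger() {
    local pid=$1 timeout=${2:-60} waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "WARNING: logging process $pid still alive after ${timeout}s," \
                 "something probably still holds its input open; giving up" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}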
*** Bug 1541923 has been marked as a duplicate of this bug. ***
Fixed by https://github.com/beakerlib/beakerlib/commit/9885cc64394d3e307d4a17690f4233215af96575