Bug 1592376 - Restraint suppresses broken pipe handling in shell pipelines invoked by a task
Summary: Restraint suppresses broken pipe handling in shell pipelines invoked by a task
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Restraint
Classification: Retired
Component: general
Version: 0.1.32
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: 0.1.36
Assignee: Matt Tyson 🤬
QA Contact: Jacob McKenzie
URL:
Whiteboard:
: 1623391 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-06-18 13:18 UTC by Ondrej Mejzlik
Modified: 2018-09-17 05:50 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-17 05:50:11 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Beaker Project Gerrit 6166 0 None MERGED Reset SIGPIPE action to default in child process. 2020-08-12 07:05:15 UTC

Comment 1 Dan Callaghan 2018-06-19 03:35:25 UTC
By "doesn't work", you mean that specifically the task gets stuck and eventually hits the local watchdog when it's run understand restraint, whereas under beah the task completes in a matter of seconds.

It's an interesting example. The task appears to just run this pipeline a bunch of times:

: [ 08:57:25 ] :: [  BEGIN   ] :: Running 'tr -cd '[:lower:]' < /dev/urandom | fold -w33 | head -n1'
yuqthjhzpqapfnpsxnxvjqjqemlkqqoei
:: [ 08:57:25 ] :: [   PASS   ] :: Command 'tr -cd '[:lower:]' < /dev/urandom | fold -w33 | head -n1' (Expected 0, got 0)

But when restraint runs it, it gets stuck on the very first invocation for 2 hours 4 minutes:

:: [ 08:47:53 ] :: [  BEGIN   ] :: Running 'tr -cd '[:lower:]' < /dev/urandom | fold -w33 | head -n1'
zoiuzafevzskpgapbtdkyjltswqwuyaig

The ps-lwd.log file from restraint gives some clue why it's getting stuck:

[...]
0 S root      1620     1  0  80   0 -  8877 -      08:47 ?        00:00:00   make run
0 S root      1632  1620  0  80   0 -  6681 -      08:47 ?        00:00:00     /bin/bash ./runtest.sh
0 R root      1732  1632 98  80   0 -  1480 -      08:47 ?        02:37:58       tr -cd [:lower:]
0 R root      1733  1632 78  80   0 -  1475 -      08:47 ?        02:04:50       fold -w33

From the TIME column you can see that the tr process has been burning 100% CPU for 2 hours 40 minutes. Presumably reading nonstop from /dev/urandom and feeding it to fold. But importantly, the final head process from the pipeline is missing from the ps output.

It means head did exit straight away after reading one line, as it should. Then the earlier processes in the pipeline (tr and fold) *should* have received SIGPIPE or EPIPE and also exited. Clearly that is what happens under beah, and is what happens if you run the same pipeline on your workstation.

But under restraint, the tr and fold processes don't realise the end of the pipe has gone away and so they sit there endlessly writing random bytes to nowhere.

There must be something unusual in the way restraint is executing 'make run' that is causing the usual SIGPIPE/EPIPE mechanism not to work properly.

Comment 2 Matt Tyson 🤬 2018-06-24 23:01:12 UTC
This is caused by glib2.  When a GSocket is created, the SIGPIPE signal is ignored.

https://developer.gnome.org/gio/stable/GSocket.html

> Note that creating a GSocket causes the signal SIGPIPE to be ignored for the remainder of the program

The fix here is to set SIGPIPE back to its default action in the forked child process.

Comment 3 Dan Callaghan 2018-06-24 23:43:39 UTC
Nice catch. I'm surprised that the GLib stuff we are using to execute subprocesses is not smart enough to reset signal handlers for us, since it is GLib messing with the signal handlers in the first place.

Are there any other process characteristics we need to worry about resetting in the subprocesses? Do we have CLOEXEC set on all file descriptors? Is there anything else inherited that might be a problem? (I didn't even know that the signal handler table is inherited so I would never have even guessed this.)

Comment 4 Matt Tyson 🤬 2018-06-24 23:55:27 UTC
(In reply to Dan Callaghan from comment #3)
> Nice catch. I'm surprised that the GLib stuff we are using to execute
> subprocesses is not smart enough to reset signal handlers for us.

Restraint doesn't actually use any of glib's process spawning routines.  Restraint does an old fashioned fork() and exec().

> Are there any other process characteristics we need to worry about resetting
> in the subprocesses? 

It looks like there is.  Restraint sets signal handlers for SIGINT and SIGTERM.  We should set those back to default.

> Do we have CLOEXEC set on all file descriptors? Is there anything else inherited that might be a problem?

CLOEXEC is set in one place it seems.  I'd have to check in a little more detail to make sure that covers everything.

Comment 5 Roman Joost 2018-06-25 00:07:11 UTC
(In reply to Matt Tyson from comment #4)
> (In reply to Dan Callaghan from comment #3)
> > Nice catch. I'm surprised that the GLib stuff we are using to execute
> > subprocesses is not smart enough to reset signal handlers for us.
> 
> Restraint doesn't actually use any of glib's process spawning routines. 
> Restraint does an old fashioned fork() and exec().

Should we perhaps?

Comment 6 Matt Tyson 🤬 2018-06-25 00:39:52 UTC
(In reply to Roman Joost from comment #5)
> > Restraint doesn't actually use any of glib's process spawning routines. 
> > Restraint does an old fashioned fork() and exec().
> 
> Should we perhaps?

https://developer.gnome.org/glib/stable/glib-Spawning-Processes.html

glib's process spawning routines seem to have some limitations on what can be done in the child process config routine.  We do some setup between the fork and exec.  I'd have to check to see if we can use the glib routines or not, and if they even solve the problem of SIGPIPE being changed.

Comment 7 Dan Callaghan 2018-06-25 00:51:42 UTC
Ah right, I think I remember now... we needed some custom code to handle setting up a pty for the task subprocess.

Comment 8 Matt Tyson 🤬 2018-07-02 05:53:58 UTC
I've added a routine that sets every signal handler back to its default value.  This should prevent tests exhibiting strange behaviour if they trip signals that restraint has handlers for.

Comment 9 Jacob McKenzie 2018-07-10 00:30:22 UTC
This patch is available in builds > restraint-0.1.35-1.git.4.*, which are now available in beaker-devel.

Comment 10 Jacob McKenzie 2018-07-10 03:37:52 UTC
This patch seems to have fixed the aforementioned bug. I re-ran the task mentioned in the #c1 and it worked as expected. 

Test can be seen here:
https://beaker-devel.app.eng.bos.redhat.com/jobs/14740

The test was also performed with an older version of restraint, and it failed as it did for omejzlik.

Test can be seen here:
https://beaker-devel.app.eng.bos.redhat.com/jobs/14729

Comment 11 Jacob McKenzie 2018-07-10 03:38:57 UTC
(In reply to Jacob McKenzie from comment #10)
> This patch seems to have fixed the aforementioned bug.

The bug seems to be fixed by mtyson's patch...

Comment 12 Jiri Hladky 2018-08-09 07:51:38 UTC
I have run into the same issue with this command:

(while :; do echo ""; done ) | sensors-detect

Further debugging has shown that Bash with restraint harness ignores SIGPIPE signal:

trap -- '' SIGPIPE

I have tried to work around the issue by canceling the trap

trap - SIGPIPE; (while :; do echo ""; done ) | sensors-detect

but it didn't help. I will try https://beaker-devel.app.eng.bos.redhat.com to verify the fix. 

Thanks
Jirka

Comment 13 Jiri Hladky 2018-08-09 11:14:15 UTC
The fix works for me:-) 

https://beaker-devel.app.eng.bos.redhat.com/recipes/25253#task163205

Comment 14 Dan Callaghan 2018-08-30 00:21:45 UTC
*** Bug 1623391 has been marked as a duplicate of this bug. ***

Comment 17 Dan Callaghan 2018-09-17 05:50:11 UTC
Restraint 0.1.36 was released on 24 August.


Note You need to log in before you can comment on or make changes to this bug.