Bug 1298186 - Please consider adding a -k timeout as well to commands run by sosreport
Status: CLOSED DUPLICATE of bug 1183244
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sos
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assigned To: Pavel Moravec
QA Contact: BaseOS QE - Apps
Blocks: 1420851
Reported: 2016-01-13 07:24 EST by Paulo Andrade
Modified: 2017-03-24 12:55 EDT
Last Closed: 2017-03-24 12:55:28 EDT
Type: Bug
Attachments: None

Description Paulo Andrade 2016-01-13 07:24:41 EST
A user has reported an issue with large journal logs that also happened to be corrupted. The only way to collect a sosreport was to clean the journal files and run sosreport without the logs.

The suggestion would be something like adding a 600-second kill timeout, as in this pseudo patch:

sos/utilities.py:
    if timeout and is_executable("timeout"):
-        command = "timeout %ds %s" % (timeout, command)
+        command = "timeout --kill-after=%ds %ds %s" % (timeout * 2, timeout, command)
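
For illustration only, a minimal, self-contained sketch of how such wrapping could look. This is not sos's actual code; the helper name and the 600-second default are assumptions. Note that GNU timeout expects its options before the duration, and --kill-after is the extra grace period after the first signal, not an absolute deadline:

# Illustrative sketch only, not sos's implementation.
from shutil import which


def wrap_with_timeout(command, timeout=600):
    """Prefix `command` with GNU timeout (hypothetical helper):
    send the default TERM signal after `timeout` seconds, then
    KILL after a further 2*timeout seconds if it is still alive."""
    if timeout and which("timeout"):
        # Options must precede the duration argument.
        command = "timeout --kill-after=%ds %ds %s" % (timeout * 2, timeout, command)
    return command


print(wrap_with_timeout("journalctl --all --this-boot --no-pager"))
# timeout --kill-after=1200s 600s journalctl --all --this-boot --no-pager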
Comment 2 Bryn M. Reeves 2016-01-13 08:07:38 EST
https://github.com/sosreport/sos/issues/337
Comment 4 Bryn M. Reeves 2016-04-26 12:30:23 EDT
I'm not sure I see what the patch is supposed to accomplish?

The timeout handling is intended to deal with cases where the command we are calling is blocked (e.g. due to unavailable block devices/network file systems, or kernel failures).

That is not the case for the journalctl commands: they are actually running, but dealing with an unreasonable quantity of data. If the command completes before the timeout expires, we have seen an out-of-memory condition in sos's postprocessing handlers when they try to read the entire set of data into memory.

This is a known problem but addressing it requires very significant change and we do not as yet have an estimate when that would appear upstream.

Sending a kill to the subordinate process from /usr/bin/timeout does not affect this - to be honest, I am not even sure from a quick test that the -k option does anything unless the -s switch has been used (to deliver a signal after the timeout):

$ time timeout -k 15 10 /usr/bin/sleep 100
real	0m10.002s
user	0m0.001s
sys	0m0.002s

It would be easier to understand what was being requested if this bug included the standard bugzilla template sections - I don't really understand from the limited description given what it is trying to achieve or how it is supposed to work.
Comment 9 Bryn M. Reeves 2016-04-27 06:39:34 EDT
> We seem to get timeouts for several plugins during sos report collection on 
> controller nodes:

The requested change will have no effect on this behaviour.
Comment 10 Bryn M. Reeves 2016-04-27 06:47:14 EDT
> If the journalctl command was still running and ignored the signal received, 
> the proposed change would wait for another 300s and send KILL signal.

Which still won't do anything useful if the task is blocked in uninterruptible sleep (the most common cause of apparent command hangs in sos).

I don't see any evidence that this change would improve behaviour for this or any other case. In the case that a process has gone off into the weeds it is something of a gamble whether sending a signal to it will achieve anything, and even so, signaling the process (even if it did respond) has nothing to do with the actual problem reported - that the report does not contain expected content due to command timeouts.

We can try to make the collection of journal data more robust (which is what this bug really seems to be asking for). The mentioned command was directly requested by the systemd folk:

  https://bugzilla.redhat.com/show_bug.cgi?id=879621

Unfortunately, while this is probably fine for most desktop systems (which are rebooted reasonably frequently and use on-disk journals subject to a rotation schedule), it appears problematic on systems with very long uptimes that are using either in-memory (tmpfs) or on-disk (but possibly not rotated) journal files.

I'm not certain of the best answer right now but fiddling with timeout signal parameters is not going to give us any improvement in this area.
Comment 11 GE Scott Knauss 2016-05-09 01:38:08 EDT
Bryan,
    Thanks. Do you think this should be owned by someone else?  If not a sosreport issue, who do you think should be looking at this?


Thanks,
Scott
Comment 12 Pavel Moravec 2016-05-09 02:32:50 EDT
(In reply to Bryn M. Reeves from comment #10)
> I'm not certain of the best answer right now but fiddling with timeout
> signal parameters is not going to give us any improvement in this area.

Collect

journalctl --all --this-boot --no-pager -o verbose

only when the all_logs option is set, and additionally collect:

journalctl --since="-7d" --all --no-pager -o verbose

?
Comment 13 Bryn M. Reeves 2016-05-09 09:53:24 EDT
> journalctl --all --this-boot --no-pager -o verbose
> 
> with all_logs option only, plus newly collect:

Sure but if someone runs it with a journal that's going to dump 2GiB of data to stdout we'll still fail the same way.

It's not really a workaround - just restricting the problem to a smaller set of users.
Comment 14 Bryn M. Reeves 2016-05-09 09:56:40 EDT
> Thanks. Do you think this should be owned by someone else?  If not a sosreport 
> issue, who do you think should be looking at this?

The proposed patch does not help, but this is very definitely an sos problem; even with changes to journalctl that would make managing these large archives easier/more efficient, we would still have to deal properly with the volume of data produced.

It is not that it is someone else's problem (although some of the behaviours mentioned here definitely indicate problems in other components), rather that there is no small/simple change that will fix it: it needs a wholesale re-write of the plugin IO mechanism, and that is not really something that we can get done in time for 7.3.

I did have an idea at the weekend that would specifically help in the journalctl case; however, I've not had time to work on any patches to implement it.
Comment 16 Bryn M. Reeves 2016-05-09 14:36:55 EDT
There is no file: we are reading from journalctl's stdout into memory (that's why you see out-of-memory errors in some cases).

The timeouts are also handled by an external program, /usr/bin/timeout (from coreutils), so we do not have the ability to insert our own handling in this way.

Rewriting the timeout handling to allow that would be as complex and invasive as fixing the bug properly by moving to streamed IO for plugins.

There is no quick/easy way to fix this, sorry.
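
Purely as an illustrative sketch, not sos code (the function names below are hypothetical), the difference between buffering a command's whole output in memory and streaming it to disk in chunks looks roughly like this:

# Illustrative only: buffered vs. streamed collection of command output.
import subprocess


def collect_buffered(cmd):
    # Reads the entire output into memory; a multi-GiB journal can OOM here.
    return subprocess.run(cmd, stdout=subprocess.PIPE, check=False).stdout


def collect_streamed(cmd, dest_path, chunk_size=1024 * 1024):
    # Copies stdout to a file in fixed-size chunks; memory use stays bounded.
    with open(dest_path, "wb") as dest:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            dest.write(chunk)
        proc.wait()


# Example (hypothetical path):
# collect_streamed(["journalctl", "--all", "--this-boot", "--no-pager"],
#                  "/tmp/journalctl.out")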
Comment 17 GE Scott Knauss 2016-05-19 10:49:13 EDT
Bryan,
    Thanks. Understand there is no quick easy fix. Are we working on the long and hard one?

-Scott
Comment 19 GE Scott Knauss 2016-06-14 04:03:46 EDT
Bryan,
       I was looking through BZs today and came across this one: 

https://bugzilla.redhat.com/show_bug.cgi?id=1183244 "sosreport tracebacks with memory error with huge system journall"  

The goal of this BZ seems to be re-writing the command IO paths. Would that also clear up this issue? If so... should we close this as a dupe and do you think there will be any likelihood of getting this done in RHEL 7.3?  

Thanks,
Scott
Comment 20 Bryn M. Reeves 2016-06-14 06:08:50 EDT
> Bryan,

It's still Bryn.

> The goal of this BZ seems to be re-writing the command IO paths.

Yes, but this is still a large (likely several months) piece of work. There is no time to do this before the next release deadline.

I've started pushing some pieces upstream that will help us to avoid these problems specifically for journal commands.

If you are interested in this work the best thing to do is to get a GitHub account and follow along there.

In particular Issue #825 (now closed) sets out the foundation of this work:

  https://github.com/sosreport/sos/issues/825
Comment 25 Pavel Moravec 2017-03-24 12:55:28 EDT
(In reply to Pavel Moravec from comment #24)
> Technically, this BZ shall be dup of 1183244 that we would like to have in
> 7.4. Work is pending in upstream, though, no definitive promise we will fix
> it in 7.4 so far :(

I am closing this BZ as a duplicate of 1183244. Regrettably, no fix will be available in the 7.4 time frame :(

*** This bug has been marked as a duplicate of bug 1183244 ***
