Red Hat Bugzilla – Bug 1298186
Please consider adding a -k timeout as well to commands run by sosreport
Last modified: 2017-03-24 12:55:28 EDT
A user has reported an issue with large journal logs
that also happened to be corrupted. The only way to
collect a sosreport was to clean the journal files and
run sosreport without the logs.
The suggestion would be to add something like a 600
second kill timeout, as in this pseudo patch:
if timeout and is_executable("timeout"):
- command = "timeout %ds %s" % (timeout, command)
+ command = "timeout %ds --kill-after=%ds %s" % (timeout, timeout * 2, command)
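For clarity, the effect of the proposed change can be sketched in a few lines of Python (the function name is illustrative, not sos's actual internals):

```python
# Minimal sketch of the proposed wrapping: run the plugin command under
# coreutils timeout(1), adding a --kill-after escalation window.
def wrap_with_timeout(command, timeout):
    # After `timeout` seconds, timeout(1) sends SIGTERM; if the child is
    # still alive `timeout * 2` seconds after that, it is sent SIGKILL.
    return "timeout %ds --kill-after=%ds %s" % (timeout, timeout * 2, command)

print(wrap_with_timeout("journalctl --all --this-boot --no-pager", 300))
# -> timeout 300s --kill-after=600s journalctl --all --this-boot --no-pager
```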
I'm not sure I see what the patch is supposed to accomplish?
The timeout handling is intended to deal with cases where the command we are calling is blocked (e.g. due to unavailable block devices/network file systems, or kernel failures).
That is not the case for the journalctl commands: they are actually running, just dealing with an unreasonable quantity of data. Even when the command completes before the timeout expires, we have seen an out-of-memory condition in sos's postprocessing handlers when they try to read the entire set of data into memory.
This is a known problem but addressing it requires very significant change and we do not as yet have an estimate when that would appear upstream.
Sending a kill to the subordinate process from /usr/bin/timeout does not affect this - to be honest, I am not even sure from a quick test that the -k option does anything unless the -s switch has been used (to deliver a signal after the timeout):
$ time timeout -k 15 10 /usr/bin/sleep 100
It would be easier to understand what was being requested if this bug included the standard bugzilla template sections - I don't really understand from the limited description given what it is trying to achieve or how it is supposed to work.
> We seem to get timeouts for several plugins during sos report collection on
> controller nodes:
The requested change will have no effect on this behaviour.
> If the journalctl command was still running and ignored the signal received,
> the proposed change would wait for another 300s and send KILL signal.
Which still won't do anything useful if the task is blocked in uninterruptible sleep (the most common cause of apparent command hangs in sos).
I don't see any evidence that this change would improve behaviour for this or any other case. When a process has gone off into the weeds it is something of a gamble whether sending a signal to it will achieve anything, and even if it did respond, signalling the process has nothing to do with the actual problem reported - that the report does not contain the expected content due to command timeouts.
We can try to make the collection of journal data more robust (which is what this bug really seems to be asking for). The mentioned command was directly requested by the systemd folk:
Unfortunately while this is probably fine for most desktop systems (that are rebooted reasonably frequently and that use on-disk journals subject to a rotation schedule), it appears problematic on systems with very long uptimes that are using either in-memory (tmpfs), or on-disk (but possibly not rotated) journal files.
I'm not certain of the best answer right now but fiddling with timeout signal parameters is not going to give us any improvement in this area.
Thanks. Do you think this should be owned by someone else? If not a sosreport issue, who do you think should be looking at this?
(In reply to Bryn M. Reeves from comment #10)
> I'm not certain of the best answer right now but fiddling with timeout
> signal parameters is not going to give us any improvement in this area.
journalctl --all --this-boot --no-pager -o verbose
with all_logs option only, plus newly collect:
journalctl --since="-7d" --all --no-pager -o verbose
> journalctl --all --this-boot --no-pager -o verbose
> with all_logs option only, plus newly collect:
Sure but if someone runs it with a journal that's going to dump 2GiB of data to stdout we'll still fail the same way.
It's not really a workaround - just restricting the problem to a smaller set of users.
> Thanks. Do you think this should be owned by someone else? If not a sosreport
> issue, who do you think should be looking at this?
The proposed patch does not help but this is very definitely an sos problem; even with changes to journalctl that would make managing these large archives easier/more efficient we would still have to properly deal with the volume of data produced.
It is not that it is someone else's problem (although some of the behaviours mentioned here definitely indicate problems in other components); rather, there is no small/simple change that will fix it: it needs a wholesale re-write of the plugin IO mechanism, and that is not really something that we can get done in time for 7.3.
I did have an idea at the weekend that would specifically help in the journalctl case; however, I've not had time to work on any patches to implement it.
There is no file: we are reading from journalctl's stdout into memory (that's why you see out-of-memory errors in some cases).
The timeouts are also handled by an external program, /usr/bin/timeout (from coreutils), so we do not have the ability to insert our own handling in this way.
Rewriting the timeout handling to allow that would be as complex and invasive as fixing the bug properly by moving to streamed IO for plugins.
There is no quick/easy way to fix this, sorry.
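The "streamed IO" direction mentioned above can be sketched roughly as follows (a toy illustration only, not sos's actual plugin API): read the child's stdout in fixed-size chunks and spill it straight to a file, so memory use stays bounded no matter how much journalctl emits.

```python
import subprocess

def collect_command_output(cmd, dest_path, chunk_size=64 * 1024):
    """Run cmd and stream its stdout to dest_path in bounded chunks.

    Illustrative sketch: the collection path described in this bug reads
    the whole output into memory, which is what fails on multi-GiB journals.
    """
    with open(dest_path, "wb") as dest:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        while True:
            # At most chunk_size bytes are ever held in memory at once.
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            dest.write(chunk)
        return proc.wait()

# e.g. collect_command_output(
#     ["journalctl", "--all", "--this-boot", "--no-pager", "-o", "verbose"],
#     "journalctl_output.txt")
```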
Thanks. I understand there is no quick/easy fix. Are we working on the long and hard one?
I was looking through BZs today and came across this one:
https://bugzilla.redhat.com/show_bug.cgi?id=1183244 "sosreport tracebacks with memory error with huge system journall"
The goal of this BZ seems to be re-writing the command IO paths. Would that also clear up this issue? If so... should we close this as a dupe and do you think there will be any likelihood of getting this done in RHEL 7.3?
It's still Bryn.
> The goal of this BZ seems to be re-writing the command IO paths.
Yes, but this is still a large (likely several months) piece of work. There is no time to do this before the next release deadline.
I've started pushing some pieces upstream that will help us to avoid these problems specifically for journal commands.
If you are interested in this work the best thing to do is to get a GitHub account and follow along there.
In particular Issue #825 (now closed) sets out the foundation of this work:
(In reply to Pavel Moravec from comment #24)
> Technically, this BZ shall be dup of 1183244 that we would like to have in
> 7.4. Work is pending in upstream, though, no definitive promise we will fix
> it in 7.4 so far :(
I am closing this BZ as dup of 1183244. Regrettably, no fix will be available in 7.4 time frame :(
*** This bug has been marked as a duplicate of bug 1183244 ***