Bug 1183244

Summary: sosreport tracebacks with MemoryError with huge system journal
Product: Red Hat Enterprise Linux 7
Component: sos
Version: 7.1
Reporter: David Kutálek <dkutalek>
Assignee: Pavel Moravec <pmoravec>
QA Contact: Miroslav Hradílek <mhradile>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: rc
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
URL: https://github.com/sosreport/sos/issues/1029
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2018-10-30 10:31:19 UTC
Clones: 1584547, 1584548
Bug Blocks: 1295396, 1369123, 1464262, 1473612, 1477664, 1546815, 1552138, 1584547, 1584548
CC: agk, apmukher, asadawar, bandrade, bmr, bubrown, cshereme, cww, dbodnarc, dwojewod, ederevea, erich, goneri, isenfeld, jeharris, knakayam, mhradile, mmielke, msekleta, pandrade, paygupta, pchavan, pdwyer, pkhedeka, pkshiras, pmoravec, pportant, ptalbert, sakulkar, snejoshi, stanislav.moravec, toneata, vwalek
Attachments:
  potential journalctl sizelimit patch for 7.5 (flags: none)
  potential journalctl sizelimit patch for 7.5 (fixed) (flags: none)

Description David Kutálek 2015-01-17 15:35:58 UTC
Description of problem:

With a huge journal present on the system, the sosreport logs plugin fails with a MemoryError traceback:

...
 Setting up archive ...
[archive:TarFileArchive] initialised empty FileCacheArchive at '/tmp/tmp.wyaguXLNq8/sosreport-auto-x86-64-001.ss.eng.bos.redhat.com-20150117095533'
[sos.sosreport:setup] executing 'sosreport -v -v --batch --tmp-dir=/tmp/tmp.wyaguXLNq8 -o logs --all-logs --log-size=15'
 Setting up plugins ...
[plugin:logs] added copyspec '/etc/syslog.conf'
[plugin:logs] added copyspec '/etc/rsyslog.conf'
[plugin:logs] added copyspec '/var/log/boot.log'
[plugin:logs] packed command tuple: ('journalctl --all --this-boot --no-pager', 'None', 'None', 300, 'None')
[plugin:logs] added cmd output 'journalctl --all --this-boot --no-pager'
[plugin:logs] packed command tuple: ('journalctl --all --this-boot --no-pager -o verbose', 'None', 'None', 300, 'None')
[plugin:logs] added cmd output 'journalctl --all --this-boot --no-pager -o verbose'
fileobj: /etc/syslog.conf could not be opened
[plugin:logs] added string '...id="18922" x-info="http://www.rsyslog.com"] exiting on signal 15.' as 'var.log.messages.tailed'
[plugin:logs] added copyspec '/var/log/secure'
[plugin:logs] added copyspec '/var/log/boot.log'
[plugin:logs] added string '...id="18922" x-info="http://www.rsyslog.com"] exiting on signal 15.' as 'var.log.everything.tailed'
[plugin:logs] added string '...id="18922" x-info="http://www.rsyslog.com"] exiting on signal 15.' as 'var.log.everything.tailed'
 Running plugins. Please wait ...

  Running 1/1: logs...        
[plugin:logs] collecting path '/var/log/secure'
[plugin:logs] copying path '/var/log/secure' to archive:'/var/log/secure'
[plugin:logs] collecting path '/etc/rsyslog.conf'
[plugin:logs] copying path '/etc/rsyslog.conf' to archive:'/etc/rsyslog.conf'
[plugin:logs] collecting path '/var/log/boot.log'
[plugin:logs] copying path '/var/log/boot.log' to archive:'/var/log/boot.log'
[plugin:logs] unpacked command tuple: ('journalctl --all --this-boot --no-pager', 'None', 'None', 300, 'None')
[plugin:logs] collecting output of 'journalctl --all --this-boot --no-pager'
[plugin:logs] command 'journalctl --all --this-boot --no-pager' timed out after 300s
[plugin:logs] collected output of 'journalctl' in 301.554158926
[plugin:logs] unpacked command tuple: ('journalctl --all --this-boot --no-pager -o verbose', 'None', 'None', 300, 'None')
[plugin:logs] collecting output of 'journalctl --all --this-boot --no-pager -o verbose'
[plugin:logs] command 'journalctl --all --this-boot --no-pager -o verbose' timed out after 300s
[plugin:logs] collected output of 'journalctl' in 536.92585206
logs
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1164, in collect
    plug.collect()
  File "/usr/lib/python2.7/site-packages/sos/plugins/__init__.py", line 630, in collect
    self._collect_cmd_output()
  File "/usr/lib/python2.7/site-packages/sos/plugins/__init__.py", line 609, in _collect_cmd_output
    timeout=timeout, runat=runat)
  File "/usr/lib/python2.7/site-packages/sos/plugins/__init__.py", line 567, in get_cmd_output_now
    self.archive.add_string(result['output'], outfn)
  File "/usr/lib/python2.7/site-packages/sos/archive.py", line 178, in add_string
    f.write(content)
  File "/usr/lib64/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib64/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
MemoryError


Creating compressed archive...
[archive:TarFileArchive] finalizing archive '/tmp/tmp.wyaguXLNq8/sosreport-auto-x86-64-001.ss.eng.bos.redhat.com-20150117095533' using method 'auto'
[archive:TarFileArchive] built archive at '/tmp/tmp.wyaguXLNq8/sosreport-auto-x86-64-001.ss.eng.bos.redhat.com-20150117095533.tar' (size=149985280)

Your sosreport has been generated and saved in:
  /tmp/tmp.wyaguXLNq8/sosreport-auto-x86-64-001.ss.eng.bos.redhat.com-20150117095533.tar.xz

The checksum is: 7eb98d831bf48c22690e8e8ca96c5a9a

Please send this file to your support representative.

---

du -h /var/log/journal/6dd863494c2a45f3977173c55cef4a9d/
4.0G    /var/log/journal/6dd863494c2a45f3977173c55cef4a9d/


Version-Release number of selected component (if applicable):

sos-3.2-11.el7.noarch

How reproducible:

Always, with a journal this big.

Steps to Reproduce:
1. Prepare huge journal files (one possible way to do this is sketched below)
2. Run sosreport with a log size limit:
3. sosreport -v -v --batch -o logs --all-logs --log-size=15
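
For step 1, one quick way to inflate the journal is to flood it with padded messages. A minimal Python 2.7 sketch, assuming the systemd-python bindings are installed and that journald rate limiting and SystemMaxUse do not cap the growth (the message count and padding size are arbitrary):

from systemd import journal   # assumption: systemd-python package is installed

# Write several million padded records; adjust the count until
# "du -h /var/log/journal/<machine-id>/" shows the desired size.
payload = "x" * 512
for i in xrange(5000000):
    journal.send("journal flood test message %d %s" % (i, payload))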

Actual results:

Traceback

Expected results:

No traceback

Additional info:

Comment 1 Bryn M. Reeves 2015-01-20 11:36:54 UTC
It's not so much the size of the journal files that matters as the size of the data produced for the selected log window (--this-boot in this case). I've got a few systems with ~4GiB journals that do not show this behaviour (although journalctl commands are tediously slow).

Could you redirect the output of 'journalctl --all --this-boot --no-pager -o verbose' on this system to a file so we can see how big it's getting?

Comment 2 David Kutálek 2015-01-20 12:01:15 UTC
[1]+  Done                    journalctl --all --this-boot --no-pager -o verbose > ./journall.out

# uptime
 07:00:20 up 47 days, 16:36,  1 user,  load average: 1.45, 1.84, 1.56

# ls -l ./journall.out 
-rw-r--r--. 1 root root 1702041597 Jan 20 06:58 ./journall.out

Comment 3 Bryn M. Reeves 2015-01-21 11:03:28 UTC
OK, so a little over 1600MiB. With sos's current IO strategy that is going to put considerable pressure on the host's virtual memory.

Unfortunately there's no direct way to tell journalctl to limit the volume of data it produces.

Rewriting the command IO paths to use less memory has been on the TODO for a long time but hasn't been a priority (as we've not had common cases where it was really necessary).
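
For reference, the usual shape of such a lower-memory IO path is to stream a command's stdout to a file in fixed-size chunks instead of holding the entire output in one string. A standalone Python 2.7 sketch (not sos code; the chunk size and paths are illustrative):

import subprocess

def stream_command_to_file(cmd, path, chunk_size=1024 * 1024):
    # Run cmd (a list of arguments) and copy its stdout to path in
    # fixed-size chunks, so peak memory stays near chunk_size rather
    # than the full size of the output.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    with open(path, "wb") as out:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    return proc.wait()

# e.g.:
# stream_command_to_file(
#     ["journalctl", "--all", "--this-boot", "--no-pager", "-o", "verbose"],
#     "/tmp/journalctl_verbose.out")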

Comment 4 Pavel Moravec 2015-11-04 09:18:58 UTC
Scope of RHEL7.2 is closed. Re-scheduling to RHEL7.3.

Comment 6 Pavel Moravec 2016-05-08 12:20:53 UTC
*** Bug 1326422 has been marked as a duplicate of this bug. ***

Comment 7 Lee Yarwood 2016-09-16 10:03:44 UTC
*** Bug 1320182 has been marked as a duplicate of this bug. ***

Comment 8 Pavel Moravec 2016-12-14 21:15:48 UTC
*** Bug 1369122 has been marked as a duplicate of this bug. ***

Comment 10 Lee Yarwood 2017-01-19 09:49:39 UTC
*** Bug 1369123 has been marked as a duplicate of this bug. ***

Comment 13 Pavel Moravec 2017-03-24 16:54:10 UTC
Regrettably, no fix has been provided for this yet, and with the development freeze approaching soon, it is not expected that this will be fixed in 7.4 :(

Re-scheduling to 7.5

Comment 14 Pavel Moravec 2017-03-24 16:55:28 UTC
*** Bug 1298186 has been marked as a duplicate of this bug. ***

Comment 15 Pavel Moravec 2017-09-01 15:44:50 UTC
No devel resources for this in 7.5, re-scheduling for potential inclusion in 7.6.

Comment 21 Peter Portante 2018-02-27 13:16:40 UTC
Effectively sosreport is DoS'ing a system when it is run.  This should be considered a high priority to fix, if not urgent.  We can't ask our customers to gather data on a problem only to have the tool that gathers that data cause further harm, can we?

Comment 27 Scott Dodson 2018-03-12 15:32:44 UTC
*** Bug 1537490 has been marked as a duplicate of this bug. ***

Comment 28 Pavel Moravec 2018-04-17 07:26:30 UTC
devel_ack+ for 7.6, expecting https://github.com/sosreport/sos/pull/1120 will be merged before sos 3.6

Comment 32 Pavel Moravec 2018-05-31 07:40:02 UTC
7.6 will contain two fixes:

(A) don't collect "journalctl -o verbose" output (huge data to crunch in memory) by default; collect it only when --all-logs is used

(B) process any command output in batches (see https://github.com/sosreport/sos/pull/1120/files)

The 7.5.z update will contain just the first fix, which 1) is really safe even for a z-stream, and 2) shrinks memory requirements significantly.


Verification steps:

(A) 
- ensure that sosreport by default does not collect the sos_commands/logs/journalctl_--no-pager_--all_--boot_--output_verbose file (i.e. does not run the corresponding command)
- with --all-logs, that command output is still collected

(B)
run sosreport before and after this BZ is fixed (especially on systems with a big journal or any other huge command output collected by sos) and compare the maximal memory consumed; a rough measurement sketch follows
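
A rough way to script both checks, sketched in Python 2.7 (the tmp dir, option values and string matching are illustrative, not part of the fix):

import glob
import os
import resource
import subprocess

tmp_dir = "/tmp/sos-test"
if not os.path.isdir(tmp_dir):
    os.makedirs(tmp_dir)

# Run sosreport (default options, i.e. without --all-logs) as a child process.
subprocess.call(["sosreport", "--batch", "-o", "logs", "--tmp-dir", tmp_dir])

# (B) peak RSS of the largest child process, in kilobytes on Linux;
# compare this figure between the old and the fixed sos packages.
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print("peak child RSS: %d kB" % peak_kb)

# (A) the verbose journalctl output should be absent from a default run.
archive = glob.glob(os.path.join(tmp_dir, "sosreport-*.tar.xz"))[0]
names = subprocess.check_output(["tar", "-tf", archive]).splitlines()
verbose = [n for n in names if "journalctl" in n and "verbose" in n]
print("verbose journalctl entries (expected none): %s" % verbose)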

Comment 44 Pavel Moravec 2018-07-23 11:31:42 UTC
Created attachment 1469923 [details]
potential journalctl sizelimit patch for 7.5

> Adding to it, as RHEL 7.6 is not going to be released soon, the customer is insisting that the fix be included in RHEL 7.5. But per the flags it seems the fix will not be included in RHEL 7.5. If so, can you please help with a justification that will help set customer expectations?

7.6 (to be released in autumn/winter) will contain two fixes:

(A) don't collect "journalctl -o verbose" output (huge data to crunch in memory) by default; collect it only when --all-logs is used

(B) process any command output in batches (see https://github.com/sosreport/sos/pull/1120/files)

The 7.5.z (that the customer already has) contains just the first fix, which 1) is really safe even for a z-stream, and 2) shrinks memory requirements significantly.

(B) introduces splitting command output and parsing it in batches, which might have unwanted consequences that we would like to test downstream prior to releasing it in the _supported_ sosreport tool.

Reviewing what (B) changes, there is still a safe improvement that I could propose for the 7.5 sos: limiting the size of data collected by "journalctl --unit [unit]" calls (where "journalctl --unit docker" is the problematic one for this customer).
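
The idea behind that size limit, reduced to a standalone Python 2.7 sketch (not the actual attached patch; the limit, chunk size and file names are illustrative):

import subprocess

def collect_limited(cmd, path, sizelimit=100 * 1024 * 1024):
    # Run cmd and copy at most sizelimit bytes of its stdout to path,
    # then stop reading and terminate the command instead of letting
    # it keep producing output.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    written = 0
    with open(path, "wb") as out:
        while written < sizelimit:
            chunk = proc.stdout.read(min(1024 * 1024, sizelimit - written))
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    if proc.poll() is None:
        proc.terminate()
    proc.wait()
    return written

# e.g.:
# collect_limited(["journalctl", "--unit", "docker", "--no-pager"],
#                 "/tmp/journalctl_unit_docker.out")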

So there are 3 options (until we release the supported sos-3.6 in RHEL 7.6):
1) stay on fully supported sos-3.5-9.el7_5
2) use the unsupported sos-3.6-2
3) use a (supported) hotfixed version on top of 3.5-9; whether it helps can be checked via:

- installing sos-3.5-9.el7_5
- uploading 1120-sizelimit.patch to /tmp directory
- running:
cd /usr/lib/python2.7/site-packages
cat /tmp/1120-sizelimit.patch | patch -p1

Let's try the 3rd option to see if it prevents the OOM killer, and let me know if a hotfix is desired for it.

Comment 45 Pushpendra Chavan 2018-08-17 10:36:07 UTC
(In reply to Pavel Moravec from comment #44)

> 
> - installing sos-3.5-9.el7_5
> - uploading 1120-sizelimit.patch to /tmp directory
> - running:
> cd /usr/lib/python2.7/site-packages
> cat /tmp/1120-sizelimit.patch | patch -p1
> 
> Let try the 3rd option if it prevents OOM killer and let me know if a hotfix
> is desired for it.

This worked!

Comment 50 Pavel Moravec 2018-08-22 14:30:59 UTC
Created attachment 1477901 [details]
potential journalctl sizelimit patch for 7.5 (fixed)

(In reply to Pushpendra Chavan from comment #48)
> Created attachment 1477508 [details]
> sosreport generated after applying patch
> 
> sosreport generated after applying patch. Need to check if this is complete
> or not

I apologize, the patch had a typo causing almost nothing to work. Use this one instead and apply it:

yum reinstall -y sos-3.5-9.el7_5
cd /usr/lib/python2.7/site-packages
cat /tmp/1120.patch | patch -p1

Comment 60 errata-xmlrpc 2018-10-30 10:31:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3144