Bug 1183244
| Summary: | sosreport tracebacks with memory error with huge system journal | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | David Kutálek <dkutalek> |
| Component: | sos | Assignee: | Pavel Moravec <pmoravec> |
| Status: | CLOSED ERRATA | QA Contact: | Miroslav Hradílek <mhradile> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.1 | CC: | agk, apmukher, asadawar, bandrade, bmr, bubrown, cshereme, cww, dbodnarc, dwojewod, ederevea, erich, goneri, isenfeld, jeharris, knakayam, mhradile, mmielke, msekleta, pandrade, paygupta, pchavan, pdwyer, pkhedeka, pkshiras, pmoravec, pportant, ptalbert, sakulkar, snejoshi, stanislav.moravec, toneata, vwalek |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| URL: | https://github.com/sosreport/sos/issues/1029 | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1584547 1584548 (view as bug list) | Environment: | |
| Last Closed: | 2018-10-30 10:31:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1295396, 1369123, 1464262, 1473612, 1477664, 1546815, 1552138, 1584547, 1584548 | | |
| Attachments: | | | |
It's not so much the size of the journal files that matters as the size of the data produced for the selected log window (--this-boot in this case). I've got a few systems with ~4GiB journals that do not show this behaviour (although journalctl commands are tediously slow). Could you redirect the output of 'journalctl --all --this-boot --no-pager -o verbose' on this system to a file so we can see how big it's getting?

```
[1]+  Done    journalctl --all --this-boot --no-pager -o verbose > ./journall.out
# uptime
 07:00:20 up 47 days, 16:36, 1 user, load average: 1.45, 1.84, 1.56
# ls -l ./journall.out
-rw-r--r--. 1 root root 1702041597 Jan 20 06:58 ./journall.out
```

OK, so a little over 1600MiB. With sos's current IO strategy that is going to put considerable pressure on the host VM. Unfortunately there's no direct way to tell journalctl to limit the volume of data it produces. Rewriting the command IO paths to use less memory has been on the TODO list for a long time but hasn't been a priority (as we've not had common cases where it was really necessary).

Scope of RHEL 7.2 is closed. Re-scheduling to RHEL 7.3.

*** Bug 1326422 has been marked as a duplicate of this bug. ***

*** Bug 1320182 has been marked as a duplicate of this bug. ***

*** Bug 1369122 has been marked as a duplicate of this bug. ***

*** Bug 1369123 has been marked as a duplicate of this bug. ***

Regrettably, no fix has been provided for this, and with the devel freeze approaching it is not expected to be fixed in 7.4 :( Re-scheduling to 7.5.

*** Bug 1298186 has been marked as a duplicate of this bug. ***

No devel resources for this in 7.5; re-scheduling for potential inclusion in 7.6.

Effectively, sosreport is DoS'ing a system when it is run. This should be considered a high priority to fix, if not urgent. We can't ask our customers to gather data on a problem only to have the tool that gathers that data cause further harm, can we?

*** Bug 1537490 has been marked as a duplicate of this bug. ***
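The measurement asked for above can also be done without a shell redirect. The following is a minimal Python sketch (not sos code; the function name and chunk size are illustrative) that counts a command's output by reading it in fixed-size chunks, so the data is never held in memory all at once:

```python
import subprocess

def stream_size(cmd, chunk=1024 * 1024):
    """Count the bytes a command writes to stdout, reading in
    fixed-size chunks so the full output never sits in memory."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    total = 0
    while True:
        buf = proc.stdout.read(chunk)
        if not buf:
            break
        total += len(buf)
    proc.wait()
    return total

# Hypothetical usage on a systemd host (run as root):
#   size = stream_size(["journalctl", "--all", "--this-boot",
#                       "--no-pager", "-o", "verbose"])
#   print("%.1f MiB" % (size / 1048576.0))
```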
devel_ack+ for 7.6, expecting https://github.com/sosreport/sos/pull/1120 will be merged before sos 3.6.

7.6 will have a twofold fix:

(A) don't collect "journalctl --verbose" output (huge data to crunch in memory) by default, only when using --all-logs
(B) process any command output in batches (see https://github.com/sosreport/sos/pull/1120/files)

The 7.5.z stream will contain just the first fix, which 1) is safe even for z-stream, and 2) shrinks memory requirements significantly.

Verification steps:

(A)
- ensure that sosreport by default does not collect the sos_commands/logs/journalctl_--no-pager_--all_--boot_--output_verbose file (i.e. does not call the corresponding command)
- ensure that --all-logs still collects that command's output

(B)
- run sosreport before and after this BZ is fixed (especially on systems with a big journal, or any other huge command output collected by sos) and compare the maximum memory consumed

Created attachment 1469923 [details]
potential journalctl sizelimit patch for 7.5

> Adding to it, as RHEL 7.6 is not going to be released soon, the customer is insisting on including the fix in RHEL 7.5. But as per the flag it seems the fix will not be included in RHEL 7.5. If so, can you please help with a justification which will help to set customer expectations?

7.6 (to be released in autumn/winter) will have a twofold fix:

(A) don't collect "journalctl --verbose" output (huge data to crunch in memory) by default, only when using --all-logs
(B) process any command output in batches (see https://github.com/sosreport/sos/pull/1120/files)

The 7.5.z (which the customer already has) contains just the first safe fix, which 1) is safe even for z-stream, and 2) shrinks memory requirements significantly.

(B) introduces splitting command output and parsing it in batches, which might have unwanted consequences that we would like to test downstream before releasing it in the _supported_ sosreport tool.
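The batching in fix (B) can be sketched roughly as follows. This is an illustrative Python sketch, not the actual PR #1120 code (sos implements this inside its plugin/archive layer; the helper name and batch size here are invented): a command's stdout is streamed straight into a destination file one batch at a time, so peak memory stays near the batch size instead of the full, possibly multi-GiB, output size:

```python
import subprocess

def collect_cmd_output(cmd, dest_path, batch=1024 * 1024):
    """Stream cmd's stdout into dest_path one batch at a time,
    bounding memory use by `batch` rather than the total output."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    with open(dest_path, "wb") as dest:
        while True:
            buf = proc.stdout.read(batch)
            if not buf:
                break
            dest.write(buf)
    # Return the command's exit status, as sos records it too.
    return proc.wait()
```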
Reviewing what (B) changes, there is still a safe improvement that I could propose for 7.5 sos: limiting the size of the data collected by "journalctl --unit [unit]" calls (where "journalctl --unit docker" is the problematic one for this customer).

So there are 3 options (until we release the supported sos-3.6 in RHEL 7.6):

1) stay on the fully supported sos-3.5-9.el7_5
2) use the unsupported sos-3.6-2
3) use a (supported) hotfixed version on top of 3.5-9; whether it helps can be checked by:
- installing sos-3.5-9.el7_5
- uploading 1120-sizelimit.patch to the /tmp directory
- running:

```
cd /usr/lib/python2.7/site-packages
cat /tmp/1120-sizelimit.patch | patch -p1
```

Let's try the 3rd option to see if it prevents the OOM killer, and let me know if a hotfix is desired for it.

(In reply to Pavel Moravec from comment #44)
> - installing sos-3.5-9.el7_5
> - uploading 1120-sizelimit.patch to /tmp directory
> - running:
>   cd /usr/lib/python2.7/site-packages
>   cat /tmp/1120-sizelimit.patch | patch -p1
>
> Let's try the 3rd option to see if it prevents the OOM killer, and let me
> know if a hotfix is desired for it.

This worked!

Created attachment 1477901 [details]
potential journalctl sizelimit patch for 7.5 (fixed)

(In reply to Pushpendra Chavan from comment #48)
> Created attachment 1477508 [details]
> sosreport generated after applying patch
>
> sosreport generated after applying patch. Need to check if this is complete
> or not.

I apologize, the patch had a typo causing almost nothing to work. Use this one and apply it:

```
yum reinstall -y sos-3.5-9.el7_5
cd /usr/lib/python2.7/site-packages
cat /tmp/1120.patch | patch -p1
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3144
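The idea behind the sizelimit patch can be illustrated like this. It is a hedged sketch, not the attached patch itself (the helper name, return values, and limit are invented for illustration): stop reading, and kill the child process, once a byte limit is reached, so a runaway `journalctl --unit docker` can no longer exhaust memory or disk:

```python
import subprocess

def collect_with_sizelimit(cmd, dest_path, sizelimit, chunk=1024 * 1024):
    """Stream cmd's stdout into dest_path, but stop (and kill the
    child) once `sizelimit` bytes have been written.  Returns
    (bytes_written, truncated); note truncated is reported True
    even if the output happened to be exactly `sizelimit` bytes."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    written = 0
    truncated = False
    with open(dest_path, "wb") as dest:
        while written < sizelimit:
            # Never request more than the remaining budget.
            buf = proc.stdout.read(min(chunk, sizelimit - written))
            if not buf:
                break
            dest.write(buf)
            written += len(buf)
        else:
            # Budget exhausted: stop the producer instead of draining it.
            truncated = True
            proc.kill()
    proc.wait()
    return written, truncated
```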
Description of problem:

With a huge journal present on the system, the sosreport logs plugin tracebacks with a MemoryError:

```
...
Setting up archive ...
[archive:TarFileArchive] initialised empty FileCacheArchive at '/tmp/tmp.wyaguXLNq8/sosreport-auto-x86-64-001.ss.eng.bos.redhat.com-20150117095533'
[sos.sosreport:setup] executing 'sosreport -v -v --batch --tmp-dir=/tmp/tmp.wyaguXLNq8 -o logs --all-logs --log-size=15'
Setting up plugins ...
[plugin:logs] added copyspec '/etc/syslog.conf'
[plugin:logs] added copyspec '/etc/rsyslog.conf'
[plugin:logs] added copyspec '/var/log/boot.log'
[plugin:logs] packed command tuple: ('journalctl --all --this-boot --no-pager', 'None', 'None', 300, 'None')
[plugin:logs] added cmd output 'journalctl --all --this-boot --no-pager'
[plugin:logs] packed command tuple: ('journalctl --all --this-boot --no-pager -o verbose', 'None', 'None', 300, 'None')
[plugin:logs] added cmd output 'journalctl --all --this-boot --no-pager -o verbose'
fileobj: /etc/syslog.conf could not be opened
[plugin:logs] added string '...id="18922" x-info="http://www.rsyslog.com"] exiting on signal 15.' as 'var.log.messages.tailed'
[plugin:logs] added copyspec '/var/log/secure'
[plugin:logs] added copyspec '/var/log/boot.log'
[plugin:logs] added string '...id="18922" x-info="http://www.rsyslog.com"] exiting on signal 15.' as 'var.log.everything.tailed'
[plugin:logs] added string '...id="18922" x-info="http://www.rsyslog.com"] exiting on signal 15.' as 'var.log.everything.tailed'
Running plugins. Please wait ...
Running 1/1: logs...
```
```
[plugin:logs] collecting path '/var/log/secure'
[plugin:logs] copying path '/var/log/secure' to archive:'/var/log/secure'
[plugin:logs] collecting path '/etc/rsyslog.conf'
[plugin:logs] copying path '/etc/rsyslog.conf' to archive:'/etc/rsyslog.conf'
[plugin:logs] collecting path '/var/log/boot.log'
[plugin:logs] copying path '/var/log/boot.log' to archive:'/var/log/boot.log'
[plugin:logs] unpacked command tuple: ('journalctl --all --this-boot --no-pager', 'None', 'None', 300, 'None')
[plugin:logs] collecting output of 'journalctl --all --this-boot --no-pager'
[plugin:logs] command 'journalctl --all --this-boot --no-pager' timed out after 300s
[plugin:logs] collected output of 'journalctl' in 301.554158926
[plugin:logs] unpacked command tuple: ('journalctl --all --this-boot --no-pager -o verbose', 'None', 'None', 300, 'None')
[plugin:logs] collecting output of 'journalctl --all --this-boot --no-pager -o verbose'
[plugin:logs] command 'journalctl --all --this-boot --no-pager -o verbose' timed out after 300s
[plugin:logs] collected output of 'journalctl' in 536.92585206
logs
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1164, in collect
    plug.collect()
  File "/usr/lib/python2.7/site-packages/sos/plugins/__init__.py", line 630, in collect
    self._collect_cmd_output()
  File "/usr/lib/python2.7/site-packages/sos/plugins/__init__.py", line 609, in _collect_cmd_output
    timeout=timeout, runat=runat)
  File "/usr/lib/python2.7/site-packages/sos/plugins/__init__.py", line 567, in get_cmd_output_now
    self.archive.add_string(result['output'], outfn)
  File "/usr/lib/python2.7/site-packages/sos/archive.py", line 178, in add_string
    f.write(content)
  File "/usr/lib64/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib64/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
MemoryError
Creating compressed archive...
```
```
[archive:TarFileArchive] finalizing archive '/tmp/tmp.wyaguXLNq8/sosreport-auto-x86-64-001.ss.eng.bos.redhat.com-20150117095533' using method 'auto'
[archive:TarFileArchive] built archive at '/tmp/tmp.wyaguXLNq8/sosreport-auto-x86-64-001.ss.eng.bos.redhat.com-20150117095533.tar' (size=149985280)

Your sosreport has been generated and saved in:
  /tmp/tmp.wyaguXLNq8/sosreport-auto-x86-64-001.ss.eng.bos.redhat.com-20150117095533.tar.xz

The checksum is: 7eb98d831bf48c22690e8e8ca96c5a9a

Please send this file to your support representative.
```

```
# du -h /var/log/journal/6dd863494c2a45f3977173c55cef4a9d/
4.0G    /var/log/journal/6dd863494c2a45f3977173c55cef4a9d/
```

Version-Release number of selected component (if applicable):
sos-3.2-11.el7.noarch

How reproducible:
Always, with a journal this big.

Steps to Reproduce:
1. Prepare huge journal files
2. Run sosreport with a log size limit:
3. sosreport -v -v --batch -o logs --all-logs --log-size=15

Actual results:
Traceback

Expected results:
No traceback

Additional info: