Bug 1183244
Summary: | sosreport tracebacks with memory error with huge system journall | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | David Kutálek <dkutalek> | ||||||
Component: | sos | Assignee: | Pavel Moravec <pmoravec> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Miroslav Hradílek <mhradile> | ||||||
Severity: | urgent | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 7.1 | CC: | agk, apmukher, asadawar, bandrade, bmr, bubrown, cshereme, cww, dbodnarc, dwojewod, ederevea, erich, goneri, isenfeld, jeharris, knakayam, mhradile, mmielke, msekleta, pandrade, paygupta, pchavan, pdwyer, pkhedeka, pkshiras, pmoravec, pportant, ptalbert, sakulkar, snejoshi, stanislav.moravec, toneata, vwalek | ||||||
Target Milestone: | rc | Keywords: | ZStream | ||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
URL: | https://github.com/sosreport/sos/issues/1029 | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | |||||||||
: | 1584547 1584548 (view as bug list) | Environment: | |||||||
Last Closed: | 2018-10-30 10:31:19 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1295396, 1369123, 1464262, 1473612, 1477664, 1546815, 1552138, 1584547, 1584548 | ||||||||
Attachments: |
|
Description
David Kutálek
2015-01-17 15:35:58 UTC
It's not so much the size of the journal files that matters as the size of the data produced for the selected log window (--this-boot in this case). I've got a few systems with ~4GiB journals that do not show this behaviour (although journalctl commands are tediously slow). Could you redirect the output of 'journalctl --all --this-boot --no-pager -o verbose' on this system to a file so we can see how big it's getting? [1]+ Done journalctl --all --this-boot --no-pager -o verbose > ./journall.out # uptime 07:00:20 up 47 days, 16:36, 1 user, load average: 1.45, 1.84, 1.56 # ls -l ./journall.out -rw-r--r--. 1 root root 1702041597 Jan 20 06:58 ./journall.out OK, so a little over 1600MiB. With sos's current IO strategy that is going to put considerable pressure on the host VM. Unfortunately there's no direct way to tell journalctl to limit the volume of data it produces. Rewriting the command IO paths to use less memory has been on the TODO for a long time but hasn't been a priority (as we've not had common cases where it was really necessary). Scope of RHEL7.2 is closed. Re-scheduling to RHEL7.3. *** Bug 1326422 has been marked as a duplicate of this bug. *** *** Bug 1320182 has been marked as a duplicate of this bug. *** *** Bug 1369122 has been marked as a duplicate of this bug. *** *** Bug 1369123 has been marked as a duplicate of this bug. *** Regrettably, no fix has been provided for this and due to soon devel freeze, it is not supposed this will be fixed in 7.4 :( Re-scheduling to 7.5 *** Bug 1298186 has been marked as a duplicate of this bug. *** No devel resources for this in 7.5, re-scheduling for potential inclusion in 7.6. Effectively sosreport is DoS'ing a system when it is run. This should be considered a high priority to fix, if not urgent. We can't ask our customers to gather data on a problem only to have the tool that gathers that data cause further harm, can we? *** Bug 1537490 has been marked as a duplicate of this bug. *** devel_ack+ for 7.6, expecting https://github.com/sosreport/sos/pull/1120 will be merged before sos 3.6 7.6 will have twofold fixes: (A) don't collect "journalctl --verbose" (huge data to crunch in memory) by default (only when using --all-logs) (B) process any command output in batches (see https://github.com/sosreport/sos/pull/1120/files) The 7.5.z will contain just the first safe fix that 1) is really safe even for z-stream, and 2) it shrinks memory requirements significantly. Verification steps: (A) - ensure that sosreport by default does not collect sos_commands/logs/journalctl_--no-pager_--all_--boot_--output_verbose file (by not calling the adequate command) - --all-logs will still collect that command (B) run sosreport prior and after this BZ fixed (esp. on systems with big journalctl or any other huge cmd output collected by sos), and compare maximal memory consumed Created attachment 1469923 [details] potential journalctl sizelimit patch for 7.5 > Adding to it, as RHEL 7.6 is not going to be released soon customer is insisting to include fix in RHEL 7.5. But as per flag it seems fix will not be included in RHEl 7.5. If yes can you please help with justification which will help to set customer expectation. 7.6 (to be released in autumn/winter) will have twofold fixes: (A) don't collect "journalctl --verbose" (huge data to crunch in memory) by default (only when using --all-logs) (B) process any command output in batches (see https://github.com/sosreport/sos/pull/1120/files) The 7.5.z (that the customer already has) contains just the first safe fix that 1) is really safe even for z-stream, and 2) it shrinks memory requirements significantly. (B) introduces splitting cmd output and parsing it in batches, that might have unwanted consequences we would like to test in downstream prior releasing it in _supported_ sosreport tool. Reviewing what (B) changes, there is still some safe improvement that I could propose into 7.5 sos - limiting the size of data collected by "journalctl --unit [unit]" calls (where the "journalctl --unit docker" is the problematic for this customer). So there are 3 options (until we release supported version of sos-3.6 in RHEL7.6): 1) stay on fully supported sos-3.5-9.el7_5 2) use the unsupported sos-3.6-2 3) use (supported) hotfixed version over 3.5-9 that can be checked if it helps via: - installing sos-3.5-9.el7_5 - uploading 1120-sizelimit.patch to /tmp directory - running: cd /usr/lib/python2.7/site-packages cat /tmp/1120-sizelimit.patch | patch -p1 Let try the 3rd option if it prevents OOM killer and let me know if a hotfix is desired for it. (In reply to Pavel Moravec from comment #44) > > - installing sos-3.5-9.el7_5 > - uploading 1120-sizelimit.patch to /tmp directory > - running: > cd /usr/lib/python2.7/site-packages > cat /tmp/1120-sizelimit.patch | patch -p1 > > Let try the 3rd option if it prevents OOM killer and let me know if a hotfix > is desired for it. This worked! Created attachment 1477901 [details] potential journalctl sizelimit patch for 7.5 (fixed) (In reply to Pushpendra Chavan from comment #48) > Created attachment 1477508 [details] > sosreport generated after applying patch > > sosreport generated after applying patch. Need to check if this is complete > or not I apologize, the patch has a typo causing almost nothing works. Use this one and apply it: yum reinstall -y sos-3.5-9.el7_5 cd /usr/lib/python2.7/site-packages cat /tmp/1120.patch | patch -p1 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:3144 |