Description of problem:
"sosreport" cannot be run safely in batch mode, because it continues even if a plugin reports an error. Adding a "--check" option would let scripts run a pre-flight check of the plugins automatically. Making batch mode exit with status 1 when the plugin check fails would make it safer.

Version-Release number of selected component (if applicable):
# rpm -q sos
sos-1.7-9.62.el5

How reproducible:
Run "sosreport" in batch mode when a process is in a D state

Steps to Reproduce:
1. Create a daily cron job in order to have periodic sosreports
2. Wait until a process in the machine enters a state D
3. See the collision of sosreport against the process

Actual results:
sosreport hang or service crash

Expected results:
sosreport exits with errorlevel 0

Additional info:
# diff sosreport-check /usr/sbin/sosreport
160,162d159
< __cmdParser__.add_option("--check", action="store_true", \
< dest="check", default=False, \
< help="perform plugin check only")
572,578d568
< else:
< print _("Exiting")
< sys.exit(1)
< else:
< print _("Plugin Test OK")
< if __cmdLineOpts__.check:
< sys.exit(0)
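For readers who want the intent of the patch without the surrounding sosreport source, here is a minimal standalone sketch of the proposed exit-code logic. It is not the sosreport code: `build_parser` and `preflight_exit_code` are hypothetical names, the indentation of the original diff is not preserved in this report, and the mapping below is one reading of it (batch mode exits 1 when the plugin check fails, "--check" exits 0 when it passes).

```python
from optparse import OptionParser


def build_parser():
    # "--check" mirrors the add_option call in the diff above;
    # "--batch" is declared here only so the sketch is self-contained.
    parser = OptionParser()
    parser.add_option("--batch", action="store_true",
                      dest="batch", default=False,
                      help="do not ask any questions")
    parser.add_option("--check", action="store_true",
                      dest="check", default=False,
                      help="perform plugin check only")
    return parser


def preflight_exit_code(opts, plugins_ok):
    """Return an exit status, or None if the caller should carry on.

    Assumed reading of the patch:
      - plugin check failed and --batch given -> exit 1
      - plugin check failed, interactive      -> None (caller may prompt)
      - plugin check passed and --check given -> exit 0
      - plugin check passed otherwise         -> None (collect the report)
    """
    if not plugins_ok:
        return 1 if opts.batch else None
    if opts.check:
        return 0
    return None
```

Under this reading, a cron wrapper would pass both "--batch" and "--check" first, and only start the real collection when the status is 0.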
Running sos with a process in D state does not cause plugins to fail. It causes a warning to be printed that is mostly misleading and unhelpful to customers (and that was removed from later versions many years ago).

These steps:

> Steps to Reproduce:
> 1. Create a daily cron job in order to have periodic sosreports
> 2. Wait until a process in the machine enters a state D
> 3. See the collision of sosreport against the process
>
> Actual results:
> sosreport hang or service crash

Do not result in any problems for me.

Please be more specific about what you are trying to solve here; e.g. what processes you observe causing such problems.
In RHEL 5, with sosreport 1.7 (no higher version available), and when running in batch mode, no warning is printed, and sosreport still runs.

My customer (running SAP and Oracle 11g) reports that when they launch sosreport while investigating an issue and the warning about a process in state D appears, the program hangs after they accept to continue, and sometimes crashes the machine.

They do not want to have a cron job that runs "/usr/sbin/sosreport -a -v --no-progressbar --no-multithread --batch --name=XXXXX --tmp-dir=/var/log/sosreport" because it may cause problems with the current behavior.

In my humble opinion the problem to be solved here is to have a "batch" mode that behaves in a safe way, which means that if there is a problem with one plugin during checks, the program exits and the report does not get generated.

> Do not result in any problems for me.

It is clear that we are not running the program under the same circumstances. I'll try to gather more information and add it to this RFE, even though what I want to resolve with this bug is not the system crash, but the behavior of sosreport in batch mode.

> Please be more specific about what you are trying to solve here

As I wrote before, I want to solve the behavior of sosreport when running in batch mode.
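As an aside, the condition the warning refers to (a process in uninterruptible sleep, state D) can be checked from a wrapper script without touching sosreport. The following is only a sketch, assuming a Linux /proc filesystem where the state letter is the field that follows the parenthesised command name in /proc/<pid>/stat; `parse_state` and `d_state_pids` are hypothetical helper names, not part of sos.

```python
import glob


def parse_state(stat_line):
    """Return the state letter from a /proc/<pid>/stat line, or None."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' rather than on whitespace.
    _, sep, rest = stat_line.rpartition(")")
    if not sep:
        return None
    fields = rest.split()
    return fields[0] if fields else None


def d_state_pids(proc_root="/proc"):
    """Return a sorted list of PIDs currently in state D."""
    pids = []
    for path in glob.glob(proc_root + "/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as handle:
                line = handle.read()
        except OSError:
            continue  # the process exited between glob() and open()
        if parse_state(line) == "D":
            pids.append(int(path.split("/")[-2]))
    return sorted(pids)
```

A cron wrapper could log the result of `d_state_pids()` before starting sosreport, which would also give the kind of detail requested when a hang is reported.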
Please include logs (ps ax --forest when the problem is happening, sosreport -vvv output and any panic/oops/warn/bug messages generated during a "system crash") and steps to reproduce (the steps in comment #0 are not effective so some important detail has been omitted). You haven't yet demonstrated that there is a problem with the behaviour of sosreport when run in batch mode (as evidenced by the fact that the steps do not reproduce the problem when run on a typical RHEL installation when one or more processes is in un-interruptible sleep). If there is a problem with some process when sos runs then we should fix it and not paper over it with hacks.
Warnings about D state processes are just that - warnings, not errors. They should never prevent the tool from running (and have been removed upstream and in RHEL6 because of the level of confusion they have caused).

So this bug actually appears to be a very specific case: LVM2 tools hanging when run under sos due to cluster locking problems when clvmd is in use.

We can address that by changing the manner in which sos invokes the LVM2 tools. We never modify metadata, so there is no need for the tools to request any locks at all (and, as your customer has seen, requesting them could cause problems for sos and potentially for other users of the clustered volume manager). This is a change we probably should have made some time ago.

I will implement this upstream and clone the bug for RHEL6. If the customer is able to reproduce I'd be happy to provide packages for testing.
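To illustrate the idea of running read-only LVM2 commands without requesting locks, here is a small sketch that only builds the command line and does not execute anything. It assumes the lvm.conf setting `locking_type = 0` in the `global` section (documented for LVM2 releases of that era as turning locking off) passed through the tools' `--config` option. Whether the upstream change uses exactly this override is not stated in this report, and `lvm_command` is a hypothetical helper.

```python
# Assumption: a per-invocation lvm.conf override that disables locking.
NO_LOCKING = "global{locking_type=0}"


def lvm_command(subcommand, *args):
    """Build an argv list for a read-only LVM2 command with locking off."""
    # Passing argv as a list avoids any shell quoting of the braces.
    return ["lvm", subcommand, "--config", NO_LOCKING] + list(args)
```

For example, `lvm_command("vgdisplay", "-vv")` produces an argument list that could be handed to `subprocess.run` on a test system.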
> Warnings about D state processes are just that - warnings, not errors.

OK. Understood. This makes complete sense. Thanks a lot. May I propose a "--safe-batch" option that exits in case of warnings? :-)

> I will implement this upstream and clone the bug for RHEL6.

Great, thanks again! I'll keep the customer informed.
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
Upstream: https://github.com/sosreport/sos/commit/dd478c2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1200.html