from RFE template ...
2. What is the nature and description of the request?
The customer would like sosreport to detect when it has hung, kill the hung process, and report where the hang occurred. This would allow sosreport to complete and provide information about what is causing the problem.
3. Why does the customer need this? (List the business requirements here)
The customer has many systems on which sosreport failed to run, which stalled progress on multiple cases while they tried to find out why sosreport would not complete. The cause turned out to be multiple issues (depending on the case) where sosreport hung and did not produce the information required to force creation of the sosreport tar.
4. How would the customer like to achieve this? (List the functional requirements here)
The customer would like sosreport to kill the particular process or script once it has been hung for some period of time, and to print a message indicating where it hung. This would ease troubleshooting and let sos complete, so that a fix or workaround can be issued.
5. For each functional requirement listed in question 4, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
This may be tested by adding a bad script to sosreport (in startup, for example) that hangs the process. Once the process has been hung for a period of time, sos should cancel the script, report where the problem lies, and still provide the rest of the details it was able to capture.
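As an illustration only, the requested behaviour and the test above could be sketched roughly as follows. This is not sos code: run_with_timeout and the "startup" label are hypothetical names chosen for the example, and the timeout values are arbitrary. It relies on the standard library's subprocess.run(), which kills the child when the timeout expires.

```python
import subprocess
import sys


def run_with_timeout(label, argv, timeout_s):
    """Run argv and return (status, text).

    Hypothetical helper, not part of sos. If the command runs longer
    than timeout_s seconds, subprocess.run() kills it and we return a
    message naming the step that hung instead of blocking forever.
    """
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        msg = "%s: '%s' exceeded %ss and was killed" % (
            label, " ".join(argv), timeout_s)
        return ("timeout", msg)
    return ("ok", done.stdout.decode(errors="replace"))


if __name__ == "__main__":
    # Stand-in for a "bad script" that never returns in reasonable time.
    hang = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout("startup", hang, 2))
    # Later steps still run after the hung one has been reported.
    good = [sys.executable, "-c", "print('collected')"]
    print(run_with_timeout("other", good, 10))
```

Note that a process stuck in uninterruptible sleep cannot be killed this way, so a sketch like this only covers commands that respond to signals.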
6. Is there already an existing RFE upstream or in Red Hat bugzilla?
I was able to find a bugzilla describing a similar issue that this request may resolve. Whereas a dry run would cover what will be run, this customer is asking for sos to be more useful in identifying where sos is hanging the system (in case you are not aware that a problem exists). This would help the customer work around the hang and assist in getting the problem fixed more quickly:
7. How quickly does this need resolved? (desired target release)
The customer would like this added soon; however, the problem that led to this request is currently resolved. The customer had a problem in the startup scripts of sosreport that was preventing sos from completing. As a result, it slowed the resolution of several problems that required gathering information piece by piece, which could have been provided more quickly with sos. The customer was able to use the "-n startup" option with sos once we found the problem was caused by a startup script. The customer would like this enhancement added so that solutions and workarounds such as this are available more quickly should this occur again.
8. Does this request meet the RHEL Inclusion criteria? (please review)
Yes. This fits into minor revision for updates within the inclusion criteria.
This is the cousin of bug 368261
Designing a resilient sosreport is something bmr@ and I are currently looking into and definitely consider a high priority for RHEL5.7 and RHEL6.1.
The current approach would also handle uninterruptible sleeps when accessing the fs, by implementing timeout management in the current sos process and letting a worker (or pool of workers) perform all the "unsafe" operations (essentially everything but application logic).
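A rough sketch of that idea, not taken from the design draft: the parent keeps the timeout logic and never blocks on the worker, so even a worker stuck in uninterruptible sleep cannot stall it. The names call_in_worker and _worker are hypothetical, and the example assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import time

# Assumption: POSIX only. "fork" lets the worker run any callable
# without having to pickle it.
_ctx = mp.get_context("fork")


def _worker(conn, func, args):
    """Run the unsafe operation in the child and send back the outcome."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def call_in_worker(func, args=(), timeout_s=5.0):
    """Hypothetical helper: run func(*args) in a worker process.

    Returns ("ok", result), ("error", text) or ("timeout", text).
    On timeout the worker is sent SIGKILL but the parent does not wait
    for it, because a worker in uninterruptible sleep may not exit.
    """
    recv_end, send_end = _ctx.Pipe(duplex=False)
    proc = _ctx.Process(target=_worker, args=(send_end, func, args),
                        daemon=True)
    proc.start()
    send_end.close()  # the parent keeps only the receiving end
    try:
        if recv_end.poll(timeout_s):
            try:
                return recv_end.recv()
            except EOFError:
                return ("error", "worker exited without a result")
        proc.kill()
        return ("timeout", "no result after %ss" % timeout_s)
    finally:
        recv_end.close()


if __name__ == "__main__":
    print(call_in_worker(len, ("abc",)))
    print(call_in_worker(time.sleep, (30,), timeout_s=1.0))
```

The design choice here is that all waiting happens on the pipe with a deadline, so the parent's progress never depends on the worker being killable.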
The current design draft for this worker is visible on:
*** Bug 368261 has been marked as a duplicate of this bug. ***
Product Management has reviewed and declined this request. You may appeal this
decision by reopening this request.