Bug 728127
Summary: | [abrt] sos-2.2-8.el6: sosreport.py:730:sosreport:IOError: [Errno 2] No such file or directory: '/var/spool/abrt/ccpp-2011-08-04-08:59:28-5400/dhcp-25-35-2011080408591312441169/sos_logs/sosreport-plugin-errors.txt' | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Michal Nowak <mnowak> |
Component: | sos | Assignee: | Bryn M. Reeves <bmr> |
Status: | CLOSED CANTFIX | QA Contact: | BaseOS QE - Apps <qe-baseos-apps> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 6.1 | CC: | agk, bmr, djasa, dkutalek, gavin, jmoskovc, ohudlick, plambri, prc, rmarko, sos-team, tlavigne |
Target Milestone: | rc | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Unspecified | ||
Whiteboard: | abrt_hash:87ea4927202cee0eb772b09989d4d1ee4547ce54 | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-04-02 16:06:59 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1056252 |
Description
Michal Nowak
2011-08-04 07:07:40 UTC
Package: sos-2.2-8.el6
Architecture: x86_64
OS Release: Red Hat Enterprise Linux Workstation release 6.1 (Santiago)

Comment
-----
(Not sure this is a problem at all.) sosreport was gathering info about a system because abrt asked for it. I then sent a TERM signal to sosreport (I needed it to crash as I was testing something else) and found in the logs that abrt caught sosreport's crash.

Did the directory listed actually exist at this point? It would be good to handle this better if not but I'm not clear on the circumstances of this problem as we're running under abrt.

(In reply to comment #3)
> Did the directory listed actually exist at this point?
No idea.
> It would be good to handle this better if not but I'm not clear on the
> circumstances of this problem as we're running under abrt
Nothing obvious. It's just a corner case.

Best guess is that something removed our dstroot while sos was running - otherwise e.g. we'd have hit exceptions on the calls to os.mkdir (right after we set up logdir in sosreport.py). I'm not sure this is a case we really need to be too concerned about - if some{one,thing} is using an alternate dstroot and running sos, it should take care of stopping sos before removing any paths from the file system.

It seems I came across this issue once again. `sosreport' is being executed by an abrt event script with the --tmp-dir option set to e.g. /var/spool/abrt/ccpp-2011-09-02-11:49:45-19415. `sosreport's run takes some time, and it may happen that the user executes `abrt rm /var/spool/abrt/ccpp-2011-09-02-11:49:45-19415' before `sosreport' finishes. Can this scenario lead to the mentioned error?

Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
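For reference, the scenario asked about in comment #6 can be sketched in isolation. This is not sos code: the function name and the temporary directory are hypothetical stand-ins, and the only assumption is that the log file is opened in append mode under a directory that has since been removed. Under that assumption the open call fails with errno 2 (ENOENT), matching the IOError in the summary.

```python
import errno
import os
import shutil
import tempfile


def open_log_after_removal():
    """Return the errno raised when opening a log file under a removed directory."""
    # Hypothetical stand-in for the directory passed via --tmp-dir.
    dstroot = tempfile.mkdtemp()
    logdir = os.path.join(dstroot, "sos_logs")
    os.mkdir(logdir)

    # Simulate another process (e.g. "abrt rm") removing the directory mid-run.
    shutil.rmtree(dstroot)

    try:
        # Append mode creates the file if needed, but not its parent directories.
        with open(os.path.join(logdir, "sosreport-plugin-errors.txt"), "a"):
            pass
    except IOError as exc:  # on Python 3, IOError is an alias of OSError
        return exc.errno
    return None


if __name__ == "__main__":
    print(open_log_after_removal() == errno.ENOENT)
```

The sketch only shows that the removal alone is enough to produce the error at the open call; it does not model the TERM signal or the signal handler.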
In reply to comment #6: this scenario is outside sos's control and you would see the same result e.g. if you pass a path in /tmp with --tmp-dir and then remove it while sos is running. The "abrt rm" command needs to synchronise with the abrt action that's running sosreport and either wait for it to finish or kill it off.

There might be a problem in ABRT but I think there also might be one in sosreport. Would it be possible to catch the exception in sosreport and exit w/o backtrace?

Sorry but I need to reopen it. I still see the problem in 6.3, and essentially every abrt test fails because it finishes before sosreport does. Since tests tend to clean up after themselves, I have many times found sosreport attempting to write to a non-existing dir or completely hung.

Why can't abrt kill processes that it started before attempting to clean them up?

Jiri, could you answer comment #12, please?

What if abrt-cli is modified not to remove $crash_dir when some process is operating on top of it unless '-f/--force' is used? This is what lsof shows when sosreport runs:

sosreport 10872 root cwd DIR 253,0 4096 1573722 /var/spool/abrt/ccpp-2012-03-06-09:13:48-10857

There seems to be a race condition when ABRT runs sos first and then decides to remove the crash directory because the crash is a duplicate (or somehow broken).

What if the user decides to remove the $crash_dir *before* sosreport ends? That happens a lot in our tests.

(In reply to comment #15)
> What if user decides to remove the $crash_dir *before* sosreport ends? That
> happens a lot in our tests.
- that's the same scenario, it doesn't matter who removes the dir..

OK, I can reproduce this now (it seems the TERM is really needed to trigger the exception that then causes the exception in the exception handling code.. I'd missed this when testing previously):

1. sosreport --batch
2. <CTRL>-Z
3. rm -rf $DSTROOT; kill -TERM %1; fg

Actual results:
All processes ended, cleaning up.
sosreport --batch
Traceback (most recent call last):
  File "/usr/sbin/sosreport", line 23, in <module>
    sosreport(sys.argv[1:])
  File "/usr/lib/python2.6/site-packages/sos/sosreport.py", line 734, in sosreport
    error_log = open(logdir + "/sosreport-plugin-errors.txt", "a")
IOError: [Errno 2] No such file or directory: '/tmp/rhel6-vm1-2012101814331350567190/sos_logs/sosreport-plugin-errors.txt'

Expected results:
Give up when logging fails and print the real exception to the terminal.

I tried to reproduce the error with these commands:

# sosreport --tmp-dir=/root

I removed the directory where sosreport was writing and sent a TERM signal to the sosreport process. Apart from having a broken sosreport tar file, I got this backtrace:

Completed [39/55] ...
All processes ended, cleaning up.
error copying file /var/log/cups/error_log-20121014
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/sos/plugintools.py", line 200, in doCopyFileOrDir
    tdstpath, abspath = self.__copyFile(srcpath)
  File "/usr/lib/python2.6/site-packages/sos/plugintools.py", line 237, in __copyFile
    shutil.copyfileobj(fsrc, fdst, -1)
  File "/usr/lib64/python2.6/shutil.py", line 31, in copyfileobj
    fdst.write(buf)
  File "/usr/lib/python2.6/site-packages/sos/sosreport.py", line 65, in exittermhandler
    doExitCode()
  File "/usr/lib/python2.6/site-packages/sos/sosreport.py", line 75, in doExitCode
    doExit(1)
  File "/usr/lib/python2.6/site-packages/sos/sosreport.py", line 86, in doExit
    sys.exit(error)
SystemExit: 1
Completed [55/55] ...

Thanks Pier - this is the kind of thing I was worried about. The patches seem to make the problem a bit less likely, but when you hit it the result is worse. I'll look at this again, but I'm not sure we can really solve it without a major rewrite of the signal and exception handling that I don't think would be appropriate for sos-2.2 at this stage.

*** Bug 971016 has been marked as a duplicate of this bug.
***

This bug is being closed CANTFIX as it describes a rare corner case which requires the user to remove the temporary directory that sos is using while it is running. This is not recommended and may lead to incomplete data collection or unexpected exceptions within sos python code. Altering the exception handling behaviour at this stage in sos-2.2 carries an unacceptable risk of regressions and due to the low impact and likelihood of this problem occurring in the wild it will not be addressed in an update to Red Hat Enterprise Linux 6.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
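The behaviour requested under "Expected results" (give up when logging fails and print the real exception to the terminal) could look roughly like the sketch below. It is an illustration only, not a patch against sos-2.2: the helper name log_plugin_error and its stream parameter are hypothetical, and only the log file name and the append-mode open are taken from the traceback above.

```python
import sys
import traceback


def log_plugin_error(logdir, stream=sys.stderr):
    """Append the exception being handled to the plugin error log.

    If the log cannot be opened (for example because logdir was removed),
    write the original traceback to `stream` instead of raising again.
    Returns True when the log file was written, False on fallback.
    """
    # Capture the original exception before anything else can fail.
    details = traceback.format_exc()
    path = logdir + "/sosreport-plugin-errors.txt"
    try:
        with open(path, "a") as error_log:
            error_log.write(details)
        return True
    except (IOError, OSError) as log_exc:
        # Logging failed: report why, then show the real exception.
        stream.write("could not write %s: %s\n" % (path, log_exc))
        stream.write(details)
        return False
```

The helper is meant to be called from inside an except block, so that traceback.format_exc() sees the exception being handled. It deliberately does nothing about the signal handling discussed above, which is the part described in this bug as needing a larger rewrite.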