Bug 1362667
Summary: | python fails when removing a directory tree located on nfs | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Terry Bowling <tbowling> | |
Component: | sos | Assignee: | Pavel Moravec <pmoravec> | |
Status: | CLOSED ERRATA | QA Contact: | Miroslav Hradílek <mhradile> | |
Severity: | medium | Docs Contact: | ||
Priority: | high | |||
Version: | 7.4 | CC: | agk, bmr, ddouwsma, fsorenso, gavin, hhorak, isenfeld, mhradile, plambri, pmoravec, sbradley, torsava | |
Target Milestone: | rc | Keywords: | Regression, Reproducer | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Linux | |||
URL: | https://github.com/sosreport/sos/pull/909 | |||
Whiteboard: | ||||
Fixed In Version: | sos-3.4-1.el7 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1481910 (view as bug list) | Environment: | ||
Last Closed: | 2017-08-01 23:08:12 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1298243, 1346768, 1481910 |
Description
Terry Bowling
2016-08-02 19:48:19 UTC
Further details of the NFS/python errors. The problem occurs when python tries to remove a directory tree located on nfs, and happens as the result of how nfs deals with the deletion of in-use files. If a file or directory is removed while it is still held open, the nfs client keeps a placeholder for the deleted file, as long as the file is still open. This file is referred to as a 'sillyrenamed' file, has a name that begins with '.nfs', followed by a string of numbers, for example: .nfs000000000851330800000069 Once the file is closed, it disappears automatically. When python recursively removes a directory tree, it gets a directory listing, sees these sillyrenamed files, and tries to delete them. Because they're not regular files to be deleted, the nfs client returns an error when it's attempted, causing python to see a failure. In the case of 'sosreport', the error handling restarts the cleanup process, some of which has already completed. Because some of the directories have already been removed, and because some of the open files are now closed, python sees a _different_ error this second time through the cleanup. In this case, it (correctly) complains because it was unable to find the directory to remove, since the directory is in fact not there. Here are the operations just prior to completion when the temp directory is on nfs (the relevant ones, anyway): 17908 17:57:39.651326 rmdir("/tmp/sosreport/sos.uD3xcU/sosreport-asdf-20160628175736") = 0 <0.000661> 17925 17:57:41.660863 close(3</tmp/sosreport/sos.uD3xcU/tmp8D6vS6>) = 0 <0.043728> 17925 17:57:41.704689 close(4</tmp/sosreport/sos.uD3xcU/tmp8D6vS6>) = 0 <0.000037> 17925 17:57:41.704822 close(5</tmp/sosreport/sos.uD3xcU/tmp7vtkI8>) = 0 <0.000040> 17925 17:57:41.705208 close(8</tmp/sosreport/sos.uD3xcU/tmpFFEtUh>) = 0 <0.000053> 17925 17:57:41.705331 close(9</tmp/sosreport/sos.uD3xcU/tmpFFEtUh>) = 0 <0.000041> 17925 17:57:41.705439 close(10</tmp/sosreport/sos.uD3xcU/tmpWA9HP3>) = 0 <0.000045> 17925 17:57:41.705548 close(11</tmp/sosreport/sos.uD3xcU/tmpWA9HP3>) = 0 <0.000044> 17908 17:57:42.850021 unlink("/tmp/sosreport/sos.uD3xcU/tmp8D6vS6") = 0 <0.001250> 17908 17:57:42.851443 unlink("/tmp/sosreport/sos.uD3xcU/tmp7vtkI8") = 0 <0.000837> 17908 17:57:42.852442 close(9</tmp/sosreport/sos.uD3xcU/tmpFFEtUh>) = 0 <0.000018> 17908 17:57:42.852563 unlink("/tmp/sosreport/sos.uD3xcU/tmpFFEtUh") = 0 <0.000841> 17908 17:57:42.853576 close(11</tmp/sosreport/sos.uD3xcU/tmpWA9HP3>) = 0 <0.000020> 17908 17:57:42.853670 unlink("/tmp/sosreport/sos.uD3xcU/tmpWA9HP3") = 0 <0.000994> 17908 17:57:43.856727 openat(AT_FDCWD, "/tmp/sosreport/sos.uD3xcU", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 7 <0.000132> 17908 17:57:43.856967 getdents(7</tmp/sosreport/sos.uD3xcU>, {{d_ino=139539204, d_off=1, d_reclen=24, d_name=".", d_type=DT_UNKNOWN} {d_ino=778186, d_off=2, d_reclen=24, d_name="..", d_type=DT_UNKNOWN} {d_ino=139539207, d_off=3, d_reclen=48, d_name=".nfs000000000851330700000068", d_type=DT_REG} {d_ino=139539208, d_off=4, d_reclen=48, d_name=".nfs000000000851330800000069", d_type=DT_REG} {d_ino=139539215, d_off=5, d_reclen=48, d_name=".nfs000000000851330f0000006a", d_type=DT_REG} {d_ino=139539217, d_off=6, d_reclen=48, d_name=".nfs00000000085133110000006b", d_type=DT_REG}}, 32768) = 240 <0.000198> ^^^^ note the sillyrename files 17908 17:57:43.857228 getdents(7</tmp/sosreport/sos.uD3xcU>, {}, 32768) = 0 <0.000008> 17908 17:57:43.857270 close(7</tmp/sosreport/sos.uD3xcU>) = 0 <0.000009> 17908 17:57:43.857413 unlink("/tmp/sosreport/sos.uD3xcU/.nfs000000000851330700000068") = -1 EBUSY (Device or resource busy) <0.000012> ^^^^ these can't be removed 17908 17:57:43.857846 lstat("/tmp/sosreport/sos.uD3xcU/sosreport-asdf-20160628175736", 0x7ffc44ff6860) = -1 ENOENT (No such file or directory) <0.000106> 17908 17:57:43.858029 openat(AT_FDCWD, "/tmp/sosreport/sos.uD3xcU/sosreport-asdf-20160628175736", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000243> ^^^^ the directory was removed previously 17908 17:57:43.864222 close(4</tmp/sosreport/sos.uD3xcU/.nfs000000000851330700000068>) = 0 <0.000016> ^^^^ this now gets closed, so the system will clean up the fake directory entry, and the '/tmp/sosreport/sos.uD3xcU' directory will exist, but will be empty when listed. Hi Terry, Thank you for a very thorough report! I believe this is not a Python issue. NFS disks don't support deleting directories that contain currently opened files, even if the files themselves have been deleted and thus remain only in the form of the sillyrenamed .nfsXXXXX… files. Therefore the behaviour of Python's `shutil.rmtree` function is correct. What might be possible is to handle this eventuality in `sosreport` itself, thus I'm reassigning this bug to the `sos` component. However, it remains to be seen if using NFS disks for the temp directory is a supported feature, as it is often not the case. So the sequence is: - sosreport calls shutil.rmtree on its working directory just after creating tar.xz archive (see /usr/lib/python2.7/site-packages/sos/sosreport.py:1453 and methods finalize & cleanup in /usr/lib/python2.7/site-packages/sos/archive.py) - some file from that directory is still held open, thus the directory contains the .nfs* file(s) (why this can happen that a file is still open now?) - shutil.rmtree fails to remove them and raises exception - in error handling, a cleanup attempt (sosreport.py:1579) is made to delete the same directory - that is gone, however - but how it can be removed if the shutil.rmtree failed on some files within this dir? Can somebody explain the two "why?" Or did I misunderstand something? > - some file from that directory is still held open, thus the directory > contains the .nfs* file(s) (why this can happen that a file is still open now?) It's the temporary log files (pre copying to sos_logs): 17908 17:57:43.857270 close(7</tmp/sosreport/sos.uD3xcU>) = 0 <0.000009> 17908 17:57:43.857413 unlink("/tmp/sosreport/sos.uD3xcU/.nfs000000000851330700000068") = -1 EBUSY (Device or resource busy) <0.000012> This is currently expected behaviour since the log files remain open at the point the archive is created and the temporary directory is cleaned up; this is fine on all POSIX-like local file systems, which we recommend for sos temporary storage. There are some clear minor bugs here, from the backtrace in comment #0, and those should be fixed but we do not currently have any clear commitment to formally support NFS mounts as temporary storage for sosreport. The traceback from #c0 should be fixed - we will do so within the scope of this BZ, can be in RHEL 7.4. For the support of NFS filesystem on --tmp-dir, please raise separate RFE. This is a regression in the behaviour of sos on both RHEL6 and RHEL7. [root@rhel7 ~]# echo |/usr/sbin/sosreport --name $(hostname) --tmp-dir /nfs_sos/sosreport/ sosreport (version 3.2) ... Running 78/83: vhostmd... Running 79/83: x11... Running 80/83: xen... Running 81/83: xfs... Running 82/83: xinetd... Running 83/83: yum... Creating compressed archive... Your sosreport has been generated and saved in: /nfs_sos/sosreport/sosreport-rhel7.localdomain.localdomain-20161221012707.tar.xz The checksum is: 72fcd6105c8509c73d09f79dfa265a28 Please send this file to your support representative. [root@rhel7 ~]# rpm -q sos sos-3.2-35.el7.noarch RHEL7 sos-3.2-35.el7.noarch OK sos-3.2-35.el7_2.3.noarch FAIL sos-3.3-4.el7.noarch FAIL sos-3.3-5.el7_3.noarch FAIL RHEL6 sos-2.2-68.el6.noarch OK sos-3.2-28.el6.noarch OK sos-3.2-28.el6_7.2.noarch FAIL Note comments in https://rhn.redhat.com/errata/RHSA-2016-0152.html I think https://github.com/sosreport/sos/commit/4a9b919a7f1b9542a23982e49cc9035e84551e13#diff-599b5408ed192a10242eeaa5663803f0R1604 causes the regression. Could you please (in private comment) share with me an access to some RHEL[6-7] machine with a NFS mount, where I can play with this BZ? It's because the temporary log files are now inside the tmpnam directory (that was added in commit 4a9b919): previously these were in /tmp, and outside of any path that sos would call rmtree() for. And here is a fix :) https://github.com/sosreport/sos/pull/909 Closed #909 via 3f79120. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2203 |