Bug 1165632
Summary: | VDSM should recover from stale NFS storage domain | |||
---|---|---|---|---|
Product: | [oVirt] vdsm | Reporter: | Lee Yarwood <lyarwood> | |
Component: | General | Assignee: | Nir Soffer <nsoffer> | |
Status: | CLOSED WONTFIX | QA Contact: | Elad <ebenahar> | |
Severity: | high | Docs Contact: | ||
Priority: | medium | |||
Version: | --- | CC: | bugs, ebenahar, frolland, lpeer, lsurette, lyarwood, mkalinin, nicolas, pmatyas, srevivo, yoann.laissus | |
Target Milestone: | --- | Keywords: | ZStream | |
Target Release: | --- | Flags: | sbonazzo: ovirt-4.3- | |
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Enhancement | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1520008 | Environment: | ||
Last Closed: | 2019-02-18 09:33:44 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1520008 |
Description
Lee Yarwood
2014-11-19 12:00:39 UTC
Tentatively targeting for 3.5.1 until we have an RCA. Once that's achieved, we can retarget.

Removing for 3.6.0 since doesn't seem urgent.
Allon, what is the plan to fix this issue? Seems logical failure.

(In reply to Yaniv Dary from comment #5)
> Removing for 3.6.0 since doesn't seem urgent.
> Allon, what is the plan to fix this issue? Seems logical failure.

No RCA, no plan. Once we have one, we'll have the other too.

Seems like a bug more than an RFE. Changing to reflect that.

*** Bug 1411795 has been marked as a duplicate of this bug. ***

Vdsm has this code to recover from a stale NFS mount:

    try:
        self.oop.os.statvfs(self.domaindir)
    except OSError as e:
        if e.errno == errno.ESTALE:
            # In case it is "Stale NFS handle" we are taking preventive
            # measures and unmounting this NFS resource. Chances are
            # that is the most intelligent thing we can do in this
            # situation anyway.
            self.log.debug("Unmounting stale file system %s",
                           self.mountpoint)
            mount.getMountFromTarget(self.mountpoint).umount()
            raise se.FileStorageDomainStaleNFSHandle()
        raise

But it is probably not effective. Possible issues:

- statvfs(self.domaindir) times out before we get a result. If storage is stuck, all ioprocess threads may be blocked on the non-responsive storage. The statvfs call can time out waiting in the ioprocess queue, or be rejected immediately if the queues are full.
- statvfs(self.domaindir) does not pass the error code correctly.

Maybe we need a better way to detect a stale NFS mount, one that cannot block or depend on blocked threads.

Checking the statvfs manual, ESTALE is not among the errors it documents:

    RETURN VALUE
        On success, zero is returned. On error, -1 is returned, and errno is set appropriately.

    ERRORS
        EACCES (statvfs()) Search permission is denied for a component of the path prefix of path. (See also path_resolution(7).)
        EBADF (fstatvfs()) fd is not a valid open file descriptor.
        EFAULT Buf or path points to an invalid address.
        EINTR This call was interrupted by a signal; see signal(7).
        EIO An I/O error occurred while reading from the filesystem.
        ELOOP (statvfs()) Too many symbolic links were encountered in translating path.
        ENAMETOOLONG (statvfs()) path is too long.
        ENOENT (statvfs()) The file referred to by path does not exist.
        ENOMEM Insufficient kernel memory was available.
        ENOSYS The filesystem does not support this call.
        ENOTDIR (statvfs()) A component of the path prefix of path is not a directory.
        EOVERFLOW Some values were too large to be represented in the returned struct.

Interesting info I found in the Linux NFS FAQ: http://nfs.sourceforge.net/

    A client can recover when it encounters an ESTALE error during a pathname resolution, but not during a READ or WRITE operation. An NFS client prevents data corruption by notifying applications immediately when a file has been replaced during a read or write request. After all, it is usually catastrophic if an application writes to or reads from the wrong file. Thus in general, to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.

Based on this, we may detect an ESTALE error by doing pathname resolution (not sure what this means). Then we can try to unmount/remount.

This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Solving the stale file should be done by the admin, checking what the source of the issue is, and not automatically by Vdsm.
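
For reference, the detection idea raised earlier in this thread (an ESTALE error surfacing during pathname resolution) could be sketched roughly as below. This is not Vdsm code: `is_stale` and its `stat` parameter are hypothetical names introduced here, and the sketch assumes that a plain `os.stat()` on the path is enough to make the NFS client resolve it again. It does not address the blocking concern; the call can still hang on unresponsive storage.

```python
import errno
import os


def is_stale(path, stat=os.stat):
    """Return True if looking up `path` fails with ESTALE.

    `stat` is injectable only so the check can be exercised without
    a real stale NFS mount; by default it is os.stat.
    """
    try:
        # os.stat() forces the kernel to resolve the pathname.
        stat(path)
    except OSError as e:
        if e.errno == errno.ESTALE:
            return True
        # Any other error is not a stale handle; let the caller see it.
        raise
    return False
```

A caller would presumably run the check on the mount point and only then consider an unmount/remount, which matches the admin-driven recovery suggested in the closing comment.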