Description of problem:

Activation of an NFS export domain fails due to a stale file handle:

Thread-4374511::DEBUG::2014-11-19 06:46:10,161::BindingXMLRPC::177::vds::(wrapper) client [10.33.20.2] flowID [2e9c1f6]
Thread-4374511::DEBUG::2014-11-19 06:46:10,161::task::579::TaskManager.Task::(_updateState) Task=`f343a7bf-35a4-4a9b-b99c-028a37910b69`::moving from state init -> state preparing
Thread-4374511::INFO::2014-11-19 06:46:10,162::logUtils::44::dispatcher::(wrapper) Run and protect: activateStorageDomain(sdUUID='e5d713a1-1c28-46ea-b859-27db25929b1a', spUUID='dab6c34c-51a9-4e02-92de-4489a307ce17', options=None)
[..]
Thread-4374511::ERROR::2014-11-19 06:46:10,172::sdc::143::Storage.StorageDomainCache::(_findDomain) domain e5d713a1-1c28-46ea-b859-27db25929b1a not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 132, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('e5d713a1-1c28-46ea-b859-27db25929b1a',)
Thread-4374511::ERROR::2014-11-19 06:46:10,172::task::850::TaskManager.Task::(_setError) Task=`f343a7bf-35a4-4a9b-b99c-028a37910b69`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1242, in activateStorageDomain
    pool.activateSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1108, in activateSD
    dom = sdCache.produce(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 132, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('e5d713a1-1c28-46ea-b859-27db25929b1a',)

root@spm # ll /rhev/data-center/mnt/10.33.20.152:_mnt_export
ls: cannot access /rhev/data-center/mnt/10.33.20.152:_mnt_export: Stale file handle

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.9.el6ev.x86_64
vdsm-python-4.13.2-0.9.el6ev.x86_64
vdsm-xmlrpc-4.13.2-0.9.el6ev.noarch
vdsm-cli-4.13.2-0.9.el6ev.noarch

How reproducible:
Always.

Steps to Reproduce:
1. Cause the NFS mount to become stale for a deactivated NFS domain.
2. Attempt to reactivate the domain.

Actual results:
Domain activation fails due to the stale file handle.

Expected results:
VDSM attempts to remount the domain, allowing activation to continue.

Additional info:
Obviously, manually remounting the domain works around the issue:

# umount /rhev/data-center/mnt/10.33.20.152:_mnt_export
# mount 10.33.20.152:/mnt/export /rhev/data-center/mnt/10.33.20.152:_mnt_export
Tentatively targeting 3.5.1 until we have an RCA. Once that's achieved, we can retarget.
Removing from 3.6.0 since this doesn't seem urgent. Allon, what is the plan to fix this issue? Seems like a logical failure.
(In reply to Yaniv Dary from comment #5)
> Removing for 3.6.0 since doesn't seem urgent.
> Allon, what is the plan to fix this issue? Seems logical failure.

No RCA, no plan. Once we have one, we'll have the other too.
Seems like a bug more than an RFE. Changing to reflect that.
*** Bug 1411795 has been marked as a duplicate of this bug. ***
Vdsm has this code to recover from a stale NFS mount:

    try:
        self.oop.os.statvfs(self.domaindir)
    except OSError as e:
        if e.errno == errno.ESTALE:
            # In case it is "Stale NFS handle" we are taking preventive
            # measures and unmounting this NFS resource. Chances are
            # that is the most intelligent thing we can do in this
            # situation anyway.
            self.log.debug("Unmounting stale file system %s",
                           self.mountpoint)
            mount.getMountFromTarget(self.mountpoint).umount()
            raise se.FileStorageDomainStaleNFSHandle()
        raise

But it is probably not effective. Possible issues:

- statvfs(self.domaindir) times out before we get a result. If storage is stuck, all ioprocess threads may be blocked on the non-responsive storage. The statvfs call can time out waiting in the ioprocess queue, or be rejected immediately if the queues are full.
- statvfs(self.domaindir) does not pass the error code correctly.

Maybe we need a better way to detect a stale NFS mount that cannot block or depend on blocked threads.
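For illustration, the detection itself can be done without ioprocess by calling stat() directly and checking for ESTALE. This is only a minimal sketch, not Vdsm code: `is_stale_mount` is a hypothetical helper, and a real implementation would run it in a subprocess or worker with a timeout so a hung NFS server cannot block the caller.

```python
import errno
import os

def is_stale_mount(path):
    """Return True if stat() on path fails with ESTALE.

    Note: this runs in the caller's process and can hang on a
    non-responsive NFS server; wrap it with a timeout in real use.
    """
    try:
        os.stat(path)
    except OSError as e:
        if e.errno == errno.ESTALE:
            return True
        raise
    return False
```

Any other OSError (e.g. ENOENT) is re-raised, since it indicates a different problem than a stale handle.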
Checking the statvfs manual, it never returns ESTALE:

RETURN VALUE
    On success, zero is returned. On error, -1 is returned, and errno is
    set appropriately.

ERRORS
    EACCES        (statvfs()) Search permission is denied for a component
                  of the path prefix of path. (See also path_resolution(7).)
    EBADF         (fstatvfs()) fd is not a valid open file descriptor.
    EFAULT        Buf or path points to an invalid address.
    EINTR         This call was interrupted by a signal; see signal(7).
    EIO           An I/O error occurred while reading from the filesystem.
    ELOOP         (statvfs()) Too many symbolic links were encountered in
                  translating path.
    ENAMETOOLONG  (statvfs()) path is too long.
    ENOENT        (statvfs()) The file referred to by path does not exist.
    ENOMEM        Insufficient kernel memory was available.
    ENOSYS        The filesystem does not support this call.
    ENOTDIR       (statvfs()) A component of the path prefix of path is
                  not a directory.
    EOVERFLOW     Some values were too large to be represented in the
                  returned struct.
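As a quick sanity check of how these errors surface in Python (a standalone demonstration, not Vdsm's code path): os.statvfs raises OSError carrying the kernel errno, so a missing path yields ENOENT, matching the manual above.

```python
import errno
import os

# Probe a path that does not exist; os.statvfs should fail with
# OSError and errno set to ENOENT, as documented in the manual.
try:
    os.statvfs("/no/such/path")
    err = None
except OSError as e:
    err = e.errno
```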
Interesting info I found in the Linux NFS FAQ: http://nfs.sourceforge.net/

> A client can recover when it encounters an ESTALE error during a
> pathname resolution, but not during a READ or WRITE operation. An NFS
> client prevents data corruption by notifying applications immediately
> when a file has been replaced during a read or write request. After
> all, it is usually catastrophic if an application writes to or reads
> from the wrong file. Thus in general, to recover from an ESTALE error,
> an application must close the file or directory where the error
> occurred, and reopen it so the NFS client can resolve the pathname
> again and retrieve the new file handle.

Based on this, we may detect an ESTALE error by doing pathname resolution (not sure what this means). Then we can try to unmount/remount.
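One reading of "pathname resolution" is walking the path component by component, forcing the client to look up each directory entry rather than reusing a cached file handle. A minimal sketch of that idea, assuming a hypothetical helper `probe_path` (not Vdsm code):

```python
import errno
import os

def probe_path(path):
    """lstat() each component of an absolute path, forcing the kernel
    to resolve the pathname step by step.

    Returns the errno of the first failing component (e.g. ESTALE for
    a stale NFS handle, ENOENT for a missing entry), or None if every
    component resolves successfully.
    """
    current = "/"
    for part in path.rstrip("/").split("/"):
        if not part:
            continue
        current = os.path.join(current, part)
        try:
            os.lstat(current)
        except OSError as e:
            return e.errno
    return None
```

If probe_path() returns errno.ESTALE for a mountpoint, the caller could then attempt the unmount/remount cycle described above.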
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Recovering from the stale file handle should be done by the admin, who can check what the source of the issue is; it should not be done automatically by Vdsm.