Bug 1165632

Summary:	VDSM should recover from stale NFS storage domain
Product:	[oVirt] vdsm	Reporter:	Lee Yarwood <lyarwood>
Component:	General	Assignee:	Nir Soffer <nsoffer>
Status:	CLOSED WONTFIX	QA Contact:	Elad <ebenahar>
Severity:	high	Docs Contact:
Priority:	medium
Version:	---	CC:	bugs, ebenahar, frolland, lpeer, lsurette, lyarwood, mkalinin, nicolas, pmatyas, srevivo, yoann.laissus
Target Milestone:	---	Keywords:	ZStream
Target Release:	---	Flags:	sbonazzo: ovirt-4.3-
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:		Story Points:	---
Clone Of:
Clones:	1520008		Environment:
Last Closed:	2019-02-18 09:33:44 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Storage	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1520008

Description Lee Yarwood 2014-11-19 12:00:39 UTC

Description of problem:

Activation of an NFS export domain fails due to a stale file handle:

Thread-4374511::DEBUG::2014-11-19 06:46:10,161::BindingXMLRPC::177::vds::(wrapper) client [10.33.20.2] flowID [2e9c1f6]
Thread-4374511::DEBUG::2014-11-19 06:46:10,161::task::579::TaskManager.Task::(_updateState) Task=`f343a7bf-35a4-4a9b-b99c-028a37910b69`::moving from state init -> state preparing
Thread-4374511::INFO::2014-11-19 06:46:10,162::logUtils::44::dispatcher::(wrapper) Run and protect: activateStorageDomain(sdUUID='e5d713a1-1c28-46ea-b859-27db25929b1a', spUUID='dab6c34c-51a9-4e02-92de-4489a307ce17', options=None)
[..]
Thread-4374511::ERROR::2014-11-19 06:46:10,172::sdc::143::Storage.StorageDomainCache::(_findDomain) domain e5d713a1-1c28-46ea-b859-27db25929b1a not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 132, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('e5d713a1-1c28-46ea-b859-27db25929b1a',)
Thread-4374511::ERROR::2014-11-19 06:46:10,172::task::850::TaskManager.Task::(_setError) Task=`f343a7bf-35a4-4a9b-b99c-028a37910b69`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1242, in activateStorageDomain
    pool.activateSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1108, in activateSD
    dom = sdCache.produce(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 132, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('e5d713a1-1c28-46ea-b859-27db25929b1a',)

root@spm # ll /rhev/data-center/mnt/10.33.20.152:_mnt_export
ls: cannot access /rhev/data-center/mnt/10.33.20.152:_mnt_export: Stale file handle

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.9.el6ev.x86_64
vdsm-python-4.13.2-0.9.el6ev.x86_64
vdsm-xmlrpc-4.13.2-0.9.el6ev.noarch
vdsm-cli-4.13.2-0.9.el6ev.noarch

How reproducible:
Always.

Steps to Reproduce:
1. Cause the NFS mount to become stale for a deactivated NFS domain.
2. Attempt to reactivate the domain.

Actual results:
Domain activation fails due to the stale file handle.

Expected results:
VDSM attempts to remount the domain, allowing activation to continue.

Additional info:
Manually remounting the domain works around the issue:

# umount /rhev/data-center/mnt/10.33.20.152:_mnt_export
# mount 10.33.20.152:/mnt/export /rhev/data-center/mnt/10.33.20.152:_mnt_export
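
A minimal Python sketch of the same workaround, for illustration only. The helper names (local_mountpoint, remount_commands) are hypothetical, and the mount-directory naming rule is inferred solely from the paths shown in this report (slashes in the export path replaced by underscores). It is not taken from the Vdsm source, which may apply additional escaping.

```python
import os

MNT_ROOT = "/rhev/data-center/mnt"  # mount root as seen in this report


def local_mountpoint(remote):
    """Map 'server:/export/path' to the local mount directory.

    Assumption: '/' in the export path becomes '_', as the paths in this
    report suggest. Real Vdsm may handle other characters differently.
    """
    server, export = remote.split(":", 1)
    return os.path.join(MNT_ROOT, "%s:%s" % (server, export.replace("/", "_")))


def remount_commands(remote):
    """Return the umount/mount command lines; running them requires root."""
    mountpoint = local_mountpoint(remote)
    return [
        ["umount", mountpoint],
        ["mount", remote, mountpoint],
    ]


if __name__ == "__main__":
    for cmd in remount_commands("10.33.20.152:/mnt/export"):
        print(" ".join(cmd))
```

The sketch only builds the command lines. Whether to run them automatically is the open question of this bug.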

Comment 4 Allon Mureinik 2014-11-20 10:54:06 UTC

Tentatively targeting for 3.5.1 until we have an RCA. Once that's achieved, we can retarget.

Comment 5 Yaniv Lavi 2015-01-20 14:47:20 UTC

Removing for 3.6.0 since doesn't seem urgent.
Allon, what is the plan to fix this issue? Seems logical failure.

Comment 6 Allon Mureinik 2015-01-20 16:28:06 UTC

(In reply to Yaniv Dary from comment #5)
> Removing for 3.6.0 since doesn't seem urgent.
> Allon, what is the plan to fix this issue? Seems logical failure.
No RCA, no plan.
Once we have one, we'll have the other too.

Comment 9 Yaniv Lavi 2016-12-06 08:28:27 UTC

Seems more like a bug than an RFE. Changing to reflect that.

Comment 10 Yaniv Kaul 2017-01-23 13:08:13 UTC

*** Bug 1411795 has been marked as a duplicate of this bug. ***

Comment 14 Nir Soffer 2018-02-06 19:16:24 UTC

Vdsm has this code to recover from a stale NFS mount:

    try:
        self.oop.os.statvfs(self.domaindir)
    except OSError as e:
        if e.errno == errno.ESTALE:
            # In case it is "Stale NFS handle" we are taking preventive
            # measures and unmounting this NFS resource. Chances are
            # that is the most intelligent thing we can do in this
            # situation anyway.
            self.log.debug("Unmounting stale file system %s",
                           self.mountpoint)
            mount.getMountFromTarget(self.mountpoint).umount()
            raise se.FileStorageDomainStaleNFSHandle()
        raise

But it is probably not effective.

Possible issues:

- statvfs(self.domaindir) times out before we get a result.

  If storage is stuck, all ioprocess threads may be blocked on the non-responsive
  storage. The statvfs call can time out waiting in the ioprocess queue, or be
  rejected immediately if the queues are full.

- statvfs(self.domaindir) does not pass the error code correctly.

Maybe we need a better way to detect a stale NFS mount, one that cannot block
or depend on blocked threads.
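
One way such a check might look, as a rough Python 3 sketch and not how Vdsm does it: run the stat in a short-lived child process and wait for it with a timeout, so the caller never blocks on the storage and does not depend on the ioprocess queues. The function name probe_path, the result strings, and the default timeout are all hypothetical.

```python
import errno
import subprocess
import sys

# Child script: stat the path and report errno as the exit status
# (0 means success). ESTALE is 116 on Linux, so it fits in an exit code.
_CHILD = (
    "import os, sys\n"
    "try:\n"
    "    os.stat(sys.argv[1])\n"
    "except OSError as e:\n"
    "    sys.exit(e.errno or 1)\n"
)


def probe_path(path, timeout=5.0):
    """Return 'ok', 'stale', 'timeout' or 'error' for path."""
    proc = subprocess.Popen([sys.executable, "-c", _CHILD, path])
    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Do not wait for the child after killing it: a process stuck in
        # uninterruptible I/O may not exit until the storage responds.
        proc.kill()
        return "timeout"
    if rc == 0:
        return "ok"
    if rc == errno.ESTALE:
        return "stale"
    return "error"
```

The trade-off is one extra process per check, and a child stuck on dead storage may linger until the mount recovers or is force-unmounted.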

Comment 15 Nir Soffer 2018-02-06 19:37:24 UTC

Checking the statvfs manual, ESTALE is not listed as a possible error:

RETURN VALUE
       On success, zero is returned.  On error, -1 is returned, and errno is set
       appropriately.

ERRORS
       EACCES (statvfs()) Search permission is denied for a component of the path
       prefix of path.  (See also path_resolution(7).)

       EBADF  (fstatvfs()) fd is not a valid open file descriptor.

       EFAULT Buf or path points to an invalid address.

       EINTR  This call was interrupted by a signal; see signal(7).

       EIO    An I/O error occurred while reading from the filesystem.

       ELOOP  (statvfs()) Too many symbolic links were encountered in translating
       path.

       ENAMETOOLONG
              (statvfs()) path is too long.

       ENOENT (statvfs()) The file referred to by path does not exist.

       ENOMEM Insufficient kernel memory was available.

       ENOSYS The filesystem does not support this call.

       ENOTDIR
              (statvfs()) A component of the path prefix of path is not a directory.

       EOVERFLOW
              Some values were too large to be represented in the returned struct.

Comment 16 Nir Soffer 2018-02-26 20:57:55 UTC

Interesting info I found in the Linux NFS FAQ:
http://nfs.sourceforge.net/

A client can recover when it encounters an ESTALE error during a pathname resolution, but not during a READ or WRITE operation. An NFS client prevents data corruption by notifying applications immediately when a file has been replaced during a read or write request. After all, it is usually catastrophic if an application writes to or reads from the wrong file.

Thus in general, to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.

Based on this, we may be able to detect an ESTALE error by doing pathname
resolution (not sure what this means). Then we can try to unmount/remount.
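
As a hedged illustration of what "pathname resolution" could mean here: a plain stat of the mountpoint makes the kernel resolve the path, and the `ls` output in the description shows that such a lookup reports "Stale file handle". The sketch below assumes Python 3. The helper name is_stale and its injectable stat parameter are hypothetical and exist only so the logic can be exercised without a real stale mount.

```python
import errno
import os


def is_stale(path, stat=os.stat):
    """Return True if resolving path fails with ESTALE.

    Other OSErrors are re-raised, since they indicate a different problem.
    Note that os.stat can block on unresponsive storage.
    """
    try:
        stat(path)
    except OSError as e:
        if e.errno == errno.ESTALE:
            return True
        raise
    return False
```

A caller could use the result to decide whether to attempt the unmount/remount mentioned above.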

Comment 17 Sandro Bonazzola 2019-01-28 09:40:36 UTC

This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 18 Fred Rolland 2019-02-18 09:33:44 UTC

Resolving the stale file handle should be done by the admin, after checking the source of the issue, and not automatically by Vdsm.