Bug 1411795 - [vdsm] NFS domain autorecovery fails: "Internal file read failure...dom_md/metadata\': Stale file handle"
Keywords:
Status: CLOSED DUPLICATE of bug 1165632
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.19.1
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.1.0-beta
Target Release: ---
Assignee: Tal Nisan
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-10 13:55 UTC by Elad
Modified: 2017-01-23 13:08 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-23 13:08:13 UTC
oVirt Team: Storage
rule-engine: ovirt-4.1+


Attachments
logs from engine and hypervisors (2.40 MB, application/x-gzip)
2017-01-10 13:55 UTC, Elad
4.0-vdsm.log (581.01 KB, application/x-gzip)
2017-01-10 14:04 UTC, Elad
gluster (1.27 MB, application/x-gzip)
2017-01-11 13:44 UTC, Elad

Description Elad 2017-01-10 13:55:48 UTC
Created attachment 1239115 [details]
logs from engine and hypervisors

Description of problem:
VDSM fails to read NFS domain metadata after NFS server recovers. Storage domain remains inactive.

Version-Release number of selected component (if applicable):
vdsm-4.19.1-18.git79e5ea5.el7.centos.x86_64
sanlock-3.4.0-1.el7.x86_64
nfs-utils-1.3.0-0.33.el7.x86_64
ovirt-engine-4.1.0-0.4.master.20170104181027.gitab0e3f4.el7.centos.noarch

How reproducible:
Always

Steps to Reproduce:
In a DC with one or more active hosts and a non-master NFS domain:
1. Bring down (reboot) the storage server where the non-master NFS domain resides
2. Wait for the NFS server to recover

Actual results:
Storage domain auto recovery fails; the domain remains inactive and VDSM fails to read its metadata. Manual intervention (host maintenance and re-activation) is required for the domain to become active.


2017-01-10 12:17:39,592 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__1/889bd4e3-99ed-475c-a164-3ccdb0b11c81/dom_md/metadata (monitor:485)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 483, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 368, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__1/889bd4e3-99ed-475c-a164-3ccdb0b11c81/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__1/889bd4e3-99ed-475c-a164-3ccdb0b11c81/dom_md/metadata\': Stale file handle\n"))
2017-01-10 12:17:39,646 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__2/3f7ad49d-592e-489e-b205-095ef58ffab1/dom_md/metadata (monitor:485)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 483, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 368, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__2/3f7ad49d-592e-489e-b205-095ef58ffab1/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__2/3f7ad49d-592e-489e-b205-095ef58ffab1/dom_md/metadata\': Stale file handle\n"))

Expected results:
Domain auto recovery should succeed.

Additional info:
logs from engine and hypervisors

[root@storage-jenkins-ge5-vdsm1 ~]# mount
.
.
10.35.110.11:/Storage_NFS/storage_jenkins_ge5_nfs_1 on /rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__1 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.35.110.11,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.35.110.11)
10.35.110.11:/Storage_NFS/storage_jenkins_ge5_nfs_2 on /rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__2 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.35.110.11,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.35.110.11)


[root@storage-jenkins-ge5-vdsm1 ~]# df -h
Filesystem                                                   Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup01-root                                   41G  1.9G   39G   5% /
devtmpfs                                                     1.9G     0  1.9G   0% /dev
tmpfs                                                        1.9G     0  1.9G   0% /dev/shm
tmpfs                                                        1.9G   41M  1.9G   3% /run
tmpfs                                                        1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/vda1                                                    673M  140M  498M  22% /boot
vserver-production.qa.lab.tlv.redhat.com:/iso_domain         100G   27G   74G  27% /rhev/data-center/mnt/vserver-production.qa.lab.tlv.redhat.com:_iso__domain
10.35.110.11:/Storage_NFS/                                   2.0T  192G  1.9T  10% /mnt
10.35.110.11:/Storage_NFS/storage_jenkins_ge5_nfs_1          2.0T  192G  1.9T  10% /rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__1
10.35.110.11:/Storage_NFS/storage_jenkins_ge5_nfs_2          2.0T  192G  1.9T  10% /rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__2
10.35.110.11:/Storage_NFS/export_domain_storage_jenkins_ge5  2.0T  192G  1.9T  10% /rhev/data-center/mnt/10.35.110.11:_Storage__NFS_export__domain__storage__jenkins__ge5
10.35.65.25:/storage_jenkins_ge5_volume_0                    500G  400G  101G  80% /rhev/data-center/mnt/glusterSD/10.35.65.25:_storage__jenkins__ge5__volume__0
10.35.65.25:/storage_jenkins_ge5_volume_1                    500G  400G  101G  80% /rhev/data-center/mnt/glusterSD/10.35.65.25:_storage__jenkins__ge5__volume__1
10.35.65.25:/storage_jenkins_ge5_volume_2                    500G  400G  101G  80% /rhev/data-center/mnt/glusterSD/10.35.65.25:_storage__jenkins__ge5__volume__2
10.35.110.11:/Storage_NFS/storage_jenkins_ge5_nfs_0          2.0T  192G  1.9T  10% /rhev/data-center/mnt/10.35.110.11:_Storage__NFS_storage__jenkins__ge5__nfs__0
/dev/mapper/1c9f12f9--390d--41c8--b55f--39489b9dcfbb-master  976M  1.3M  924M   1% /rhev/data-center/mnt/blockSD/1c9f12f9-390d-41c8-b55f-39489b9dcfbb/master
tmpfs                                                        379M     0  379M   0% /run/user/0

Comment 1 Elad 2017-01-10 14:04:42 UTC
Created attachment 1239117 [details]
4.0-vdsm.log

We see the same happening in 4.0.6.3. Attaching vdsm.log.


Thread-12::ERROR::2017-01-10 16:03:11,570::monitor::484::Storage.Monitor::(_pathChecked) Error checking path /rhev/data-center/mnt/10.35.110.11:_Storage__NFS_export__domain__storage__jenkins__ge4/6c3ef831-8a87-4753-8f04-b77789b57f36/dom_md/metadata
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 482, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 367, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/10.35.110.11:_Storage__NFS_export__domain__storage__jenkins__ge4/6c3ef831-8a87-4753-8f04-b77789b57f36/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/10.35.110.11:_Storage__NFS_export__domain__storage__jenkins__ge4/6c3ef831-8a87-4753-8f04-b77789b57f36/dom_md/metadata\': Stale file handle\n"))

Comment 2 Yaniv Kaul 2017-01-11 08:53:51 UTC
Elad - are you sure the NFS connection has recovered?

Comment 3 Elad 2017-01-11 09:45:23 UTC
Yes. After moving the hosts in the DC to maintenance and re-activating them, the domains become active:

2017-01-10 12:17:45,084+02 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-13) [1a3a15e7-3654-443b-ba0b-de0b7a8fcbc4] START, SetVdsStatusVDSCommand(HostName = host_mixed_1, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='86c91af9-76bc-4265-b788-7c6c8ac20712', status='PreparingForMaintenance', nonOperationalReason='NONE', stopSpmFailureLogged='true', maintenanceReason='null'}), log id: 2c6111b2

2017-01-10 12:20:12,507+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-22) [63aa388e-7415-44bc-a351-633643c2d17b] START, ActivateStorageDomainVDSCommand( ActivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='ff0e1c58-27bc-44ce-96dc-58daa6903955', ignoreFailoverLimit='false', storageDomainId='cd352bc1-8ec4-4fd4-acb6-753091042a69'}), log id: 385a8383


 ID: 63aa388e-7415-44bc-a351-633643c2d17b, Job ID: 6664c4eb-e105-4714-8549-ac57299b95b7, Call Stack: null, Custom Event ID: -1, Message: Storage Domain nfs_0 (Data Center golden_env_mixed) was activated by admin@internal-authz

Comment 4 Yaniv Kaul 2017-01-11 09:50:19 UTC
(In reply to Elad from comment #3)
> Yes, after maintenance and re-activate to the hosts in the DC the domains
> become active

Of course, because that triggered an umount and remount, I reckon.
I'm wondering whether we can auto-recover from stale handles, and how long that would take. It can certainly take a while, or indeed require manual intervention.
See https://access.redhat.com/solutions/2674
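The manual recovery that KB article describes boils down to detecting the dead mount and lazily unmounting it so a fresh mount can be attempted once the server is back. A rough sketch (the helper name and paths are placeholders, not taken from this bug; `umount` requires root):

```shell
# check_mount: stat the filesystem behind a mountpoint. A stale NFS
# handle makes stat fail, in which case we lazy-unmount so the path
# can be remounted once the server has recovered.
check_mount() {
    if stat -f -- "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "stat failed for $1 (possibly ESTALE); lazy-unmounting" >&2
    umount -l -- "$1"     # lazy unmount: detaches even while busy
    return 1
}

# Usage (hypothetical mountpoint):
# check_mount /rhev/data-center/mnt/example && echo "mount ok"
```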

> 
> 2017-01-10 12:17:45,084+02 INFO 
> [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-13)
> [1a3a15e7-3654-443b-ba0b-de0b7a8fcbc4] START,
> SetVdsStatusVDSCommand(HostName = host_mixed_1,
> SetVdsStatusVDSCommandParameters:{runAsync='true',
> hostId='86c91af9-76bc-4265-b788-7c6c8ac20712',
> status='PreparingForMaintenance', nonOperationalReason='NONE',
> stopSpmFailureLogged='true', maintenanceReason='null'}), log id: 2c6111b2
> 
> 2017-01-10 12:20:12,507+02 INFO 
> [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand]
> (org.ovirt.thread.pool-6-thread-22) [63aa388e-7415-44bc-a351-633643c2d17b]
> START, ActivateStorageDomainVDSCommand(
> ActivateStorageDomainVDSCommandParameters:{runAsync='true',
> storagePoolId='ff0e1c58-27bc-44ce-96dc-58daa6903955',
> ignoreFailoverLimit='false',
> storageDomainId='cd352bc1-8ec4-4fd4-acb6-753091042a69'}), log id: 385a8383
> 
> 
>  ID: 63aa388e-7415-44bc-a351-633643c2d17b, Job ID:
> 6664c4eb-e105-4714-8549-ac57299b95b7, Call Stack: null, Custom Event ID: -1,
> Message: Storage Domain nfs_0 (Data Center golden_env_mixed) was activated
> by admin@internal-authz

Comment 5 Tal Nisan 2017-01-11 10:13:46 UTC
Nir, Liron, can you please have a look and share your insights?

Comment 6 Nir Soffer 2017-01-11 10:45:09 UTC
The metadata file monitor uses dd, so we don't get a useful error code and cannot
detect the problem reliably.
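For illustration, the dd-based check amounts to something like the following simplified sketch (the helper name is hypothetical, not vdsm's actual code): all the monitor gets back is dd's exit code plus free-form stderr text, so the underlying errno (e.g. ESTALE) is lost.

```python
import subprocess

def dd_read_check(path):
    # Read one block through dd, the way the path checker does.
    # On failure we only see an exit code and a human-readable
    # stderr message, not the errno of the underlying open().
    proc = subprocess.Popen(
        ["dd", "if=" + path, "of=/dev/null", "bs=4096", "count=1"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _, err = proc.communicate()
    return proc.returncode, err
```

On a stale mount this yields only text like "dd: failed to open '...': Stale file handle", which is why the monitor cannot tell a stale handle apart from any other read error.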

We are checking for stale file handles every 5 minutes and trying to unmount the
domain.

vdsm/storage/fileSD.py +675:

    def selftest(self):
        """ 
        Run internal self test
        """
        try:
            self.oop.os.statvfs(self.domaindir)
        except OSError as e:
            if e.errno == errno.ESTALE:
                # In case it is "Stale NFS handle" we are taking preventive
                # measures and unmounting this NFS resource. Chances are
                # that is the most intelligent thing we can do in this
                # situation anyway.
                self.log.debug("Unmounting stale file system %s",
                               self.mountpoint)
                mount.getMountFromTarget(self.mountpoint).umount()
                raise se.FileStorageDomainStaleNFSHandle()
            raise

Elad, do you see FileStorageDomainStaleNFSHandle exception in these logs?

I think we have another bug on handling stale nfs mounts, so this is probably
a duplicate.
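By contrast, a check that reads the path directly would surface the errno and could special-case ESTALE, similar to what selftest does with statvfs. A minimal sketch (function name hypothetical, not vdsm code):

```python
import errno
import os

def metadata_readable(path):
    # Direct read attempt: unlike dd, the OSError carries the errno,
    # so a stale NFS handle is distinguishable from other failures.
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.read(fd, 4096)
        finally:
            os.close(fd)
        return "ok"
    except OSError as e:
        return "stale" if e.errno == errno.ESTALE else "error"
```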

Comment 7 Elad 2017-01-11 10:48:47 UTC
Nir, you have the logs attached

Comment 8 Elad 2017-01-11 13:44:02 UTC
Created attachment 1239438 [details]
gluster

The same with Gluster:

2017-01-11 15:40:53,199 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:515)
2017-01-11 15:40:53,557 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/10.35.65.25:_storage__local__ge2__volume__2/39008495-d3fa-42c6-bffc-a22eb8bbaf6f/dom_md/metadata (monitor:485)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 483, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 368, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/10.35.65.25:_storage__local__ge2__volume__2/39008495-d3fa-42c6-bffc-a22eb8bbaf6f/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/glusterSD/10.35.65.25:_storage__local__ge2__volume__2/39008495-d3fa-42c6-bffc-a22eb8bbaf6f/dom_md/metadata\': Transport endpoint is not connected\n"))

Comment 9 Nir Soffer 2017-01-11 14:26:52 UTC
Elad, why do you think this is a regression?

If you want to mark it as regression, please prove that this works in a previous
version by testing that version on the same environment.

Comment 10 Elad 2017-01-11 14:33:21 UTC
Nir, I will test it on an older 4.0 build (as it reproduces in the latest), even though this is a pretty common scenario (file domain auto recovery).

Comment 11 Nir Soffer 2017-01-11 14:41:02 UTC
You should test with the latest 3.6. If this is broken in 3.6, 4.0, and 4.1,
there is no regression; it simply never worked.

This is not the regular auto recovery where you block access to the NFS server, but
auto recovery after actually stopping the NFS server.

Make sure to also test an older engine. If this actually worked in the past, we need
to understand what change broke it, and whether it was in engine or vdsm.

Comment 12 Nir Soffer 2017-01-11 14:45:20 UTC
Removed the regression keyword until we have proof that this is a regression.

Comment 13 Yaniv Kaul 2017-01-22 09:23:09 UTC
Duplicate of bug 1165632 ?

Comment 14 Elad 2017-01-23 12:30:31 UTC
(In reply to Yaniv Kaul from comment #13)
> Duplicate of bug 1165632 ?

Seems the same

Comment 15 Yaniv Kaul 2017-01-23 13:08:13 UTC

*** This bug has been marked as a duplicate of bug 1165632 ***

