Bug 788559

Summary: ISO domain is not umounted if network is dropped
Product: [Retired] oVirt
Component: vdsm
Reporter: Jakub Libosvar <jlibosva>
Assignee: Dan Kenigsberg <danken>
QA Contact: yeylon <yeylon>
CC: abaron, acathrow, bazulay, iheim, srevivo, ykaul
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Version: unspecified
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-05-17 07:46:26 UTC

Attachments: logs (flags: none)

Description Jakub Libosvar 2012-02-08 13:34:07 UTC
Created attachment 560258 [details]
logs

Description of problem:
I have one host, two iSCSI data domains (each on a different storage server), and one ISO domain that shares its storage server with the master storage domain. Using iptables, I dropped incoming packets to the server exporting the ISO and master domains (this also produces bug 787673). After the connection is unblocked, the ISO domain status changes to Active, but the backend cannot obtain statistics about it (the same happens with the former master domain, but that is fixed as soon as some operation is performed on it).

MainThread::DEBUG::2012-02-08 13:18:22,602::mount::111::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/umount /rhev/data-center/mnt/blockSD/a777bffb-acce-4c42-a379-1a88a840ca14/master' (cwd None)
Thread-417::ERROR::2012-02-08 13:18:29,326::domainMonitor::120::Storage.DomainMonitor::(_monitorDomain) Error while collecting domain `f1621233-13f4-4c6b-99e3-86fbd4dbdad2` monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 103, in _monitorDomain
    domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 131, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 295, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/processPool.py", line 53, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 64, in runExternally
    return self._procPool.runExternally(*args, **kwargs)
  File "/usr/share/vdsm/storage/processPool.py", line 145, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck
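
For context, the traceback shows the chain used by the domain monitor: selftest() calls statvfs() on the domain directory through vdsm's process pool, which enforces a timeout so that a mount hanging on an unreachable server cannot block the monitoring thread; when the timeout fires, it surfaces as the "Operation Stuck" error above. A rough sketch of that pattern (illustrative only, with made-up names; the real processPool.runExternally differs):

import multiprocessing
import os

def _statvfs_worker(conn, path):
    # Runs in a child process, so a statvfs() hanging on a dead NFS
    # server only blocks the child, not the caller.
    try:
        conn.send(tuple(os.statvfs(path)))
    except OSError as exc:
        conn.send(exc)

def statvfs_with_timeout(path, timeout=10.0):
    # os.statvfs() with a hard timeout -- sketch of the idea only.
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=_statvfs_worker, args=(child_conn, path))
    proc.start()
    try:
        if not parent_conn.poll(timeout):
            raise TimeoutError("Operation Stuck")  # same symptom as in the log
        result = parent_conn.recv()
    finally:
        proc.terminate()
        proc.join()
    if isinstance(result, OSError):
        raise result
    return result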

...

Thread-1494::DEBUG::2012-02-08 13:47:45,570::clientIF::261::Storage.Dispatcher.Protect::(wrapper) [10.34.63.26]
Thread-1494::DEBUG::2012-02-08 13:47:45,570::task::588::TaskManager.Task::(_updateState) Task=`32285a86-3126-4527-802f-719597d25824`::moving from state init -> state preparing
Thread-1494::INFO::2012-02-08 13:47:45,571::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': '10.34.63.202:/home/iso/shared', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '117a86c3-be84-451a-b9a6-5680d3b617b6', 'port': ''}], options=None)
Thread-1494::INFO::2012-02-08 13:47:45,571::storage_connection::167::Storage.ServerConnection::(disconnect) Request to disconnect NFS storage server
Thread-1494::WARNING::2012-02-08 13:47:45,573::storage_connection::382::Storage.ServerConnection::(__disconnectNFSServer) Cannot remove mountpoint after umount()
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_connection.py", line 378, in __disconnectNFSServer
    os.rmdir(mnt.fs_file)
OSError: [Errno 16] Device or resource busy: '/rhev/data-center/mnt/10.34.63.202:_home_iso_shared'
Thread-1494::DEBUG::2012-02-08 13:47:45,574::misc::1017::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-1494::DEBUG::2012-02-08 13:47:45,574::misc::1019::SamplingMethod::(__call__) Got in to sampling method
Thread-1494::DEBUG::2012-02-08 13:47:45,574::misc::1017::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-1494::DEBUG::2012-02-08 13:47:45,575::misc::1019::SamplingMethod::(__call__) Got in to sampling method
Thread-1494::DEBUG::2012-02-08 13:47:45,575::iscsi::389::Storage.Misc.excCmd::(rescan) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-1494::DEBUG::2012-02-08 13:47:45,600::iscsi::389::Storage.Misc.excCmd::(rescan) SUCCESS: <err> = ''; <rc> = 0
Thread-1494::DEBUG::2012-02-08 13:47:45,601::misc::1027::SamplingMethod::(__call__) Returning last result
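
The "Cannot remove mountpoint after umount()" warning above is the key symptom: the umount never actually completed while the server was unreachable, so the directory is still a live mountpoint and os.rmdir() fails with errno 16 (EBUSY). A minimal illustration of that failure mode with a hypothetical guard (not vdsm code):

import errno
import os

def remove_mountpoint(path):
    # Hypothetical helper: refuse to rmdir a directory that is still a
    # live mountpoint, which is exactly what os.rmdir() runs into in
    # __disconnectNFSServer above.
    if os.path.ismount(path):
        raise OSError(errno.EBUSY, os.strerror(errno.EBUSY), path)
    os.rmdir(path)

Note that even os.path.ismount() can block on an unresponsive NFS server, which is presumably why vdsm runs similar calls through its process pool (see the statvfs traceback above).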



After a while the ISO domain goes to Inactive but is still mounted on the host. As soon as the ISO domain is activated by the user, activation fails because the export is already mounted:

Thread-1567::DEBUG::2012-02-08 13:49:17,569::mount::111::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6 10.34.63.202:/home/iso/shared /tmp/tmp8sLigH' (cwd None)
Thread-1567::DEBUG::2012-02-08 13:49:17,638::mount::111::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/umount /tmp/tmp8sLigH' (cwd None)
Thread-1653::DEBUG::2012-02-08 13:50:56,232::clientIF::261::Storage.Dispatcher.Protect::(wrapper) [10.34.63.26]
Thread-1653::DEBUG::2012-02-08 13:50:56,232::task::588::TaskManager.Task::(_updateState) Task=`cacaae64-48ae-4c78-b831-f34521dcdbbf`::moving from state init -> state preparing
Thread-1653::INFO::2012-02-08 13:50:56,233::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': '10.34.63.202:/home/iso/shared', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '117a86c3-be84-451a-b9a6-5680d3b617b6', 'port': ''}], options=None)
Thread-1653::INFO::2012-02-08 13:50:56,233::storage_connection::146::Storage.ServerConnection::(connect) Request to connect NFS storage server
Thread-1653::DEBUG::2012-02-08 13:50:56,237::mount::111::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6 10.34.63.202:/home/iso/shared /rhev/data-center/mnt/10.34.63.202:_home_iso_shared' (cwd None)
Thread-1653::ERROR::2012-02-08 13:50:56,347::storage_connection::255::Storage.ServerConnection::(__connectNFSServer) Error during storage connection
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_connection.py", line 253, in __connectNFSServer
    mnt.mount(getNfsOptions(con), mount.VFS_NFS, timeout=CON_TIMEOUT)
  File "/usr/share/vdsm/storage/mount.py", line 107, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 122, in _runcmd
    raise MountError(rc, ";".join(out + err))
MountError: (32, 'm;o;u;n;t;.;n;f;s;:; ;m;o;u;n;t;i;n;g; ;1;0;.;3;4;.;6;3;.;2;0;2;:;/;h;o;m;e;/;i;s;o;/;s;h;a;r;e;d; ;f;a;i;l;e;d;,; ;r;e;a;s;o;n; ;g;i;v;e;n; ;b;y; ;s;e;r;v;e;r;:;\n; ; ;N;o; ;s;u;c;h; ;f;i;l;e; ;o;r; ;d;i;r;e;c;t;o;r;y;\n')
Thread-1653::DEBUG::2012-02-08 13:50:56,353::lvm::460::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,353::lvm::462::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,354::lvm::472::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,354::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,355::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,355::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
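
As an aside, the semicolon-riddled MountError text is an artifact of how the message is built. The traceback shows raise MountError(rc, ";".join(out + err)); joining lists of output lines gives readable text, but if out and err happen to be plain strings, their concatenation is a single string and str.join() iterates over its characters, which is what produces the "m;o;u;n;t;.;n;f;s..." text above. A two-case illustration (hypothetical values):

out = "mount.nfs: mounting failed"    # a plain string, not a list of lines
err = ""
print(";".join(out + err))            # -> m;o;u;n;t;.;n;f;s;:; ... (every character separated)

out = ["mount.nfs: mounting failed"]  # lists of lines join cleanly instead
err = ["No such file or directory"]
print(";".join(out + err))            # -> mount.nfs: mounting failed;No such file or directory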


Version-Release number of selected component (if applicable):
vdsm-4.9.3.2-0.fc16.x86_64
ovirt-engine-3.0.0_0001-1.4.fc16.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Block connection to ISO domain
2. Unblock connection
3. Activate ISO
  
Actual results:
ISO domain activation fails; the stale NFS mount is still present on the host and the new mount attempt raises MountError.

Expected results:
ISO domain activation succeeds.

Additional info:

The ISO domain is still mounted on the host:
10.34.63.202:/home/iso/shared on /rhev/data-center/mnt/10.34.63.202:_home_iso_shared type nfs4 (rw,relatime,vers=4,rsize=524288,wsize=524288,namlen=255,soft,nosharecache,proto=tcp,port=0,timeo=600,retrans=6,sec=sys,clientaddr=10.34.63.208,minorversion=0,local_lock=none,addr=10.34.63.202)


and is still accessible at the filesystem level:
[root@srh-03 ~]# ll /rhev/data-center/mnt/10.34.63.202\:_home_iso_shared/
total 4
drwxr-xr-x. 4 vdsm kvm 4096 Jul 27  2011 f1621233-13f4-4c6b-99e3-86fbd4dbdad2


but lsof returns a warning:
[root@srh-03 ~]# lsof | grep shared
lsof: WARNING: can't stat() nfs4 file system /rhev/data-center/mnt/10.34.63.202:_home_iso_shared (deleted)
      Output information may be incomplete.
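
For completeness, a quick programmatic way to confirm that the export is still mounted is to scan /proc/mounts; a small illustrative sketch (vdsm has its own mount handling in mount.py, this is not it):

def is_mounted(mountpoint, mounts_file="/proc/mounts"):
    # /proc/mounts fields: device mountpoint fstype options dump pass
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[1] == mountpoint:
                return True
    return False

print(is_mounted("/rhev/data-center/mnt/10.34.63.202:_home_iso_shared"))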

Comment 1 Jakub Libosvar 2012-02-08 14:12:15 UTC
This is probably not a vdsm bug: umount hangs and NFS doesn't time out after the specified 60 seconds. Will change the component soon.

Comment 2 Jakub Libosvar 2012-02-08 15:46:30 UTC
Please ignore the first piece of logs in the description - it is not relevant to this issue. The 'Operation Stuck' error occurs on spmStart:
Thread-255::INFO::2012-02-08 13:28:58,385::logUtils::37::dispatcher::(wrapper) Run and protect: spmStart(spUUID='bb97353c-fb19-4f46-bda4-46dc37f5c5d6', prevID=-1, prevLVER='-1', recoveryMode=0, scsiFencing='false', maxHostID=250, domVersion='2', options=None)

Comment 3 Dan Kenigsberg 2012-04-18 20:24:18 UTC
That's tough, but umount cannot do its thing (incl. flushing fs data to the server) with no networking; NFS considers this a feature. Is there really a Vdsm bug here?

Comment 4 Jakub Libosvar 2012-05-17 07:46:26 UTC
I cannot reproduce this anymore - closing as NOTABUG.