Created attachment 560258 [details]
logs

Description of problem:
I have one host, two iSCSI data domains (each on a different server), and one ISO domain that shares a storage server with the master storage domain. Using iptables, I dropped incoming packets from the server exporting the ISO and master domains (this produces bug 787673). After the connection is unblocked, the ISO domain status changes to Active, but the backend cannot obtain statistics about it (the same happens with the former master domain, but that is fixed as soon as some operation is performed on it).

MainThread::DEBUG::2012-02-08 13:18:22,602::mount::111::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/umount /rhev/data-center/mnt/blockSD/a777bffb-acce-4c42-a379-1a88a840ca14/master' (cwd None)
Thread-417::ERROR::2012-02-08 13:18:29,326::domainMonitor::120::Storage.DomainMonitor::(_monitorDomain) Error while collecting domain `f1621233-13f4-4c6b-99e3-86fbd4dbdad2` monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 103, in _monitorDomain
    domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 131, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 295, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/processPool.py", line 53, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 64, in runExternally
    return self._procPool.runExternally(*args, **kwargs)
  File "/usr/share/vdsm/storage/processPool.py", line 145, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck
...
Thread-1494::DEBUG::2012-02-08 13:47:45,570::clientIF::261::Storage.Dispatcher.Protect::(wrapper) [10.34.63.26]
Thread-1494::DEBUG::2012-02-08 13:47:45,570::task::588::TaskManager.Task::(_updateState) Task=`32285a86-3126-4527-802f-719597d25824`::moving from state init -> state preparing
Thread-1494::INFO::2012-02-08 13:47:45,571::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': '10.34.63.202:/home/iso/shared', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '117a86c3-be84-451a-b9a6-5680d3b617b6', 'port': ''}], options=None)
Thread-1494::INFO::2012-02-08 13:47:45,571::storage_connection::167::Storage.ServerConnection::(disconnect) Request to disconnect NFS storage server
Thread-1494::WARNING::2012-02-08 13:47:45,573::storage_connection::382::Storage.ServerConnection::(__disconnectNFSServer) Cannot remove mountpoint after umount()
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_connection.py", line 378, in __disconnectNFSServer
    os.rmdir(mnt.fs_file)
OSError: [Errno 16] Device or resource busy: '/rhev/data-center/mnt/10.34.63.202:_home_iso_shared'
Thread-1494::DEBUG::2012-02-08 13:47:45,574::misc::1017::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-1494::DEBUG::2012-02-08 13:47:45,574::misc::1019::SamplingMethod::(__call__) Got in to sampling method
Thread-1494::DEBUG::2012-02-08 13:47:45,574::misc::1017::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-1494::DEBUG::2012-02-08 13:47:45,575::misc::1019::SamplingMethod::(__call__) Got in to sampling method
Thread-1494::DEBUG::2012-02-08 13:47:45,575::iscsi::389::Storage.Misc.excCmd::(rescan) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-1494::DEBUG::2012-02-08 13:47:45,600::iscsi::389::Storage.Misc.excCmd::(rescan) SUCCESS: <err> = ''; <rc> = 0
Thread-1494::DEBUG::2012-02-08 13:47:45,601::misc::1027::SamplingMethod::(__call__) Returning last result

After a while the ISO domain goes Inactive but is still mounted on the host. As soon as the ISO domain is activated by the user, the activation fails because it is already mounted:

Thread-1567::DEBUG::2012-02-08 13:49:17,569::mount::111::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6 10.34.63.202:/home/iso/shared /tmp/tmp8sLigH' (cwd None)
Thread-1567::DEBUG::2012-02-08 13:49:17,638::mount::111::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/umount /tmp/tmp8sLigH' (cwd None)
Thread-1653::DEBUG::2012-02-08 13:50:56,232::clientIF::261::Storage.Dispatcher.Protect::(wrapper) [10.34.63.26]
Thread-1653::DEBUG::2012-02-08 13:50:56,232::task::588::TaskManager.Task::(_updateState) Task=`cacaae64-48ae-4c78-b831-f34521dcdbbf`::moving from state init -> state preparing
Thread-1653::INFO::2012-02-08 13:50:56,233::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': '10.34.63.202:/home/iso/shared', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '117a86c3-be84-451a-b9a6-5680d3b617b6', 'port': ''}], options=None)
Thread-1653::INFO::2012-02-08 13:50:56,233::storage_connection::146::Storage.ServerConnection::(connect) Request to connect NFS storage server
Thread-1653::DEBUG::2012-02-08 13:50:56,237::mount::111::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6 10.34.63.202:/home/iso/shared /rhev/data-center/mnt/10.34.63.202:_home_iso_shared' (cwd None)
Thread-1653::ERROR::2012-02-08 13:50:56,347::storage_connection::255::Storage.ServerConnection::(__connectNFSServer) Error during storage connection
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_connection.py", line 253, in __connectNFSServer
    mnt.mount(getNfsOptions(con), mount.VFS_NFS, timeout=CON_TIMEOUT)
  File "/usr/share/vdsm/storage/mount.py", line 107, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 122, in _runcmd
    raise MountError(rc, ";".join(out + err))
MountError: (32, 'm;o;u;n;t;.;n;f;s;:; ;m;o;u;n;t;i;n;g; ;1;0;.;3;4;.;6;3;.;2;0;2;:;/;h;o;m;e;/;i;s;o;/;s;h;a;r;e;d; ;f;a;i;l;e;d;,; ;r;e;a;s;o;n; ;g;i;v;e;n; ;b;y; ;s;e;r;v;e;r;:;\n; ; ;N;o; ;s;u;c;h; ;f;i;l;e; ;o;r; ;d;i;r;e;c;t;o;r;y;\n')
Thread-1653::DEBUG::2012-02-08 13:50:56,353::lvm::460::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,353::lvm::462::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,354::lvm::472::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,354::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,355::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-1653::DEBUG::2012-02-08 13:50:56,355::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex

Version-Release number of selected component (if applicable):
vdsm-4.9.3.2-0.fc16.x86_64
ovirt-engine-3.0.0_0001-1.4.fc16.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Block the connection to the ISO domain
2. Unblock the connection
3. Activate the ISO domain

Actual results:
Fails

Expected results:
Succeeds

Additional info:
The ISO domain is still mounted on the host:

10.34.63.202:/home/iso/shared on /rhev/data-center/mnt/10.34.63.202:_home_iso_shared type nfs4 (rw,relatime,vers=4,rsize=524288,wsize=524288,namlen=255,soft,nosharecache,proto=tcp,port=0,timeo=600,retrans=6,sec=sys,clientaddr=10.34.63.208,minorversion=0,local_lock=none,addr=10.34.63.202)

It is accessible at the filesystem level:

[root@srh-03 ~]# ll /rhev/data-center/mnt/10.34.63.202\:_home_iso_shared/
total 4
drwxr-xr-x. 4 vdsm kvm 4096 Jul 27 2011 f1621233-13f4-4c6b-99e3-86fbd4dbdad2

but lsof returns a warning:

[root@srh-03 ~]# lsof | grep shared
lsof: WARNING: can't stat() nfs4 file system /rhev/data-center/mnt/10.34.63.202:_home_iso_shared (deleted)
      Output information may be incomplete.
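
For reference, a minimal way to simulate the outage from the host, assuming the storage server address 10.34.63.202 seen in the logs above and a plain iptables DROP rule (the exact rule used for the original report may have differed):

  # block incoming packets from the server exporting the ISO/master domains
  iptables -I INPUT -s 10.34.63.202 -j DROP
  # ...wait until the engine reports the domains as problematic, then restore connectivity
  iptables -D INPUT -s 10.34.63.202 -j DROP
  # the stale NFS mount is still listed on the host afterwards
  mount | grep '10.34.63.202:/home/iso/shared'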
This is probably not a vdsm bug: umount hangs and NFS does not time out within the specified 60 seconds. Will change the component soon.
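
For context, the mount options vdsm passes in the logs above already request soft behaviour; per nfs(5), timeo is given in tenths of a second and retrans is the number of retries before a soft mount reports an error to the caller, so regular I/O should eventually fail rather than hang, while umount itself can still block. A sketch of the equivalent manual mount, using the address and path from the logs:

  # same options vdsm uses: soft mount, 60-second timeout per attempt, 6 retries
  mount -t nfs -o soft,nosharecache,timeo=600,retrans=6 \
      10.34.63.202:/home/iso/shared /rhev/data-center/mnt/10.34.63.202:_home_iso_shared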
Please ignore the first piece of logs in the description - it is not relevant to this issue. The 'Operation Stuck' error occurs on spmStart:

Thread-255::INFO::2012-02-08 13:28:58,385::logUtils::37::dispatcher::(wrapper) Run and protect: spmStart(spUUID='bb97353c-fb19-4f46-bda4-46dc37f5c5d6', prevID=-1, prevLVER='-1', recoveryMode=0, scsiFencing='false', maxHostID=250, domVersion='2', options=None)
That's tough, but umount cannot do its thing (including flushing filesystem data to the server) with no networking. NFS considers this a feature. Is there really a Vdsm bug here?
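
For anyone double-checking such a state before deciding where the bug lies, a quick probe of whether the kernel still considers the mount usable (a sketch, assuming coreutils timeout(1) is available on the host and using the mountpoint from the logs):

  # if this hangs or times out, the NFS mount is wedged rather than merely stale
  timeout 10 stat -f /rhev/data-center/mnt/10.34.63.202:_home_iso_shared \
      || echo "mount unresponsive"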
I cannot reproduce this anymore - closing as notabug.