Created attachment 885644 [details]
VDSM Log

Description of problem:
Cluster of 4 nodes never stops contending for SPM. Not able to run any VMs now.

Version-Release number of selected component (if applicable):
3.4.0

How reproducible:
Unknown

Steps to Reproduce:
1. Manually shut down all VMs
2. Put all nodes in maintenance mode
3. Change network configuration on all nodes
4. Reboot nodes and attempt to activate

Actual results:
Nodes never complete SPM contention

Expected results:
Nodes and Datacenter come back online

Additional info:
ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-11) [3aed1419] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed
Created attachment 885645 [details] sanlock
Created attachment 885646 [details] engine.log file
The issue is that the host doesn't have access to all the storage domains, which causes the SPM start process to fail. There's a bug open for that issue: https://bugzilla.redhat.com/show_bug.cgi?id=1072900.

From looking in the logs, it seems that the host has problems accessing two storage domains:
3406665e-4adc-4fd4-aa1e-037547b29adb
f3b51811-4a7f-43af-8633-322b3db23c48

Can you verify that the host can access those domains? From the log it seems that the NFS paths for those are:
shtistg01.suprtekstic.com:/storage/infrastructure
shtistg01.suprtekstic.com:/storage/exports

Log snippet:

1.
Thread-14::DEBUG::2014-04-11 22:54:44,331::mount::226::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 ashtistg01.suprtekstic.com:/storage/exports /rhev/data-center/mnt/ashtistg01.suprtekstic.com:_storage_exports' (cwd None)
Thread-14::ERROR::2014-04-11 22:55:36,659::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 222, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Thread-14::ERROR::2014-04-11 22:55:36,705::hsm::2379::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 320, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
    raise e
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')

2.
Thread-14::ERROR::2014-04-11 22:56:29,307::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 222, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Thread-14::ERROR::2014-04-11 22:56:29,309::hsm::2379::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 320, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
    raise e
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')

Regardless of that, there are sanlock errors throughout the log when trying to acquire the host ID. Fede, can you take a look at those sanlock errors to verify that we don't have further issues here?
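The name-resolution failures in the log snippet above can be re-checked directly on the affected host. The following is a minimal Python sketch, not part of VDSM: the helper names (`unresolved_servers`, `resolves`, `check_log`) are made up for illustration, and `/var/log/vdsm/vdsm.log` is assumed to be the log location, so adjust the path if your setup differs.

```python
import re
import socket

# Pattern for the mount.nfs error text quoted in the log snippet.
UNRESOLVED_RE = re.compile(r"Failed to resolve server ([^\s:]+):")


def unresolved_servers(log_text):
    """Return the sorted, distinct server names mount.nfs could not resolve."""
    return sorted(set(UNRESOLVED_RE.findall(log_text)))


def resolves(hostname):
    """Return True if this host can currently resolve `hostname`."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def check_log(path="/var/log/vdsm/vdsm.log"):
    """Print, for each server named in a resolve failure, whether it resolves now.

    The default path is an assumption about where VDSM writes its log.
    """
    with open(path, errors="replace") as log:
        servers = unresolved_servers(log.read())
    for server in servers:
        status = "resolves now" if resolves(server) else "still does not resolve"
        print(f"{server}: {status}")
```

Calling `check_log()` on a host after the network change shows whether the names that failed at mount time are resolvable again. It only tests DNS lookup, not whether the NFS export itself can be mounted.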
bug 1072900 is in MODIFIED - moving this one too.
This is an automated message

oVirt 3.4.1 has been released:
 * should fix your issue
 * should be available at your local mirror within two days.

If problems still persist, please make note of it in this bug report.
This issue also started to affect our environment after our ISO domain became unavailable. It is really unsettling when something like this starts to happen in a production environment with 20 virtual hosts.

Annoyingly enough, even after the ISO domain was fixed and was available to all hosts, together with the main domain, the issue did not go away.

Also, oVirt 4.2.1 is not yet available in repositories, which is kinda sad.
(In reply to Kristaps Tigeris from comment #6)
> This issue started affect also our environment after our ISO domain became
> unavailable. It is really unsettling when something like this starts to
> happen in production environment with 20 virtual hosts.
>
> Annoyingly enough, even after ISO domain was fixed and it available to all
> hosts, together with main domain, issue did not go away.
>
> Also, oVirt 4.2.1 is not yet available in repositories, which is kinda sad.

Kristaps, do you mean 3.4.1? Is locating this version still an issue?
Yes, I mean 3.4.1. I resolved my issue by downgrading vdsm on hosts. But yea, I still don't see 3.4.1 in oVirt repository.
(In reply to Kristaps Tigeris from comment #8)
> Yes, I mean 3.4.1.
>
> I resolved my issue by downgrading vdsm on hosts.
>
> But yea, I still don't see 3.4.1 in oVirt repository.

Sandro, I too can't find 3.4.1 in oVirt's repos. Can you take a look please?
oVirt 3.4.1 has been released in http://resources.ovirt.org/pub/ovirt-3.4.
Release notes are available here: http://www.ovirt.org/OVirt_3.4.1_release_notes

In order to install it on a clean system, you need to install

 # yum localinstall http://resources.ovirt.org/pub/yum-repo/ovirt-release34.rpm

Let me know if you need any other info.