Bug 1086951 - SPM never stops contending
Summary: SPM never stops contending
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.4.1
Assignee: Liron Aravot
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2014-04-12 03:12 UTC by Maurice James
Modified: 2016-02-10 17:08 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-05-08 13:35:51 UTC
oVirt Team: Storage
Embargoed:


Attachments
VDSM Log (2.11 MB, text/plain), 2014-04-12 03:12 UTC, Maurice James
sanlock (1.19 MB, text/plain), 2014-04-12 03:13 UTC, Maurice James
engine.log file (83.35 KB, text/plain), 2014-04-12 03:19 UTC, Maurice James

Description Maurice James 2014-04-12 03:12:08 UTC
Created attachment 885644 [details]
VDSM Log

Description of problem:
Cluster of 4 nodes never stops contending for SPM. Not able to run any VMs now

Version-Release number of selected component (if applicable):
3.4.0

How reproducible:
Unknown

Steps to Reproduce:
1. Manually shut down all VMs
2. Put all nodes in maintenance mode
3. Change network configuration on all nodes
4. Reboot nodes and attempt to activate (a scripted sketch of steps 2 and 4 follows below)
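
For completeness, steps 2 and 4 can also be driven from a script. Below is a minimal sketch using the oVirt 3.x Python SDK (ovirtsdk); the engine URL, credentials and host names are placeholders, not values from this report.

# Hedged sketch: scripted version of steps 2 and 4 with the oVirt 3.x Python SDK.
# Engine URL, credentials and host names below are placeholders.
from ovirtsdk.api import API

api = API(url='https://engine.example.com/api',
          username='admin@internal',
          password='secret',
          insecure=True)  # skip CA validation for this sketch only

hosts = ['node1', 'node2', 'node3', 'node4']  # hypothetical host names

# Step 2: put all nodes into maintenance mode
for name in hosts:
    host = api.hosts.get(name=name)
    if host.status.state == 'up':
        host.deactivate()

# (network reconfiguration and reboot happen outside the SDK)

# Step 4: attempt to activate the nodes again
for name in hosts:
    api.hosts.get(name=name).activate()

api.disconnect()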

Actual results:
Nodes never complete SPM contention

Expected results:
Nodes and Datacenter come back online

Additional info:
ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-11) [3aed1419] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed

Comment 1 Maurice James 2014-04-12 03:13:18 UTC
Created attachment 885645 [details]
sanlock

Comment 2 Maurice James 2014-04-12 03:19:20 UTC
Created attachment 885646 [details]
engine.log file

Comment 3 Liron Aravot 2014-04-20 22:24:53 UTC
The issue is that the host doesn't have access to all the storage domains, which causes the SPM start process to fail.
There's a bug open for that issue - https://bugzilla.redhat.com/show_bug.cgi?id=1072900.

From looking at the logs, it seems that the host has problems accessing two storage domains -
3406665e-4adc-4fd4-aa1e-037547b29adb
f3b51811-4a7f-43af-8633-322b3db23c48

Can you verify that the host can access those domains? From the log it seems like the NFS paths for those are:
shtistg01.suprtekstic.com:/storage/infrastructure
shtistg01.suprtekstic.com:/storage/exports
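
A minimal first-pass check from the host could look like the sketch below; it assumes only DNS resolution and TCP reachability of the standard NFS port, so an actual mount test would still be needed to fully confirm access.

# Hedged sketch: quick reachability check for the two NFS exports named above.
import socket

EXPORTS = [
    'shtistg01.suprtekstic.com:/storage/infrastructure',
    'shtistg01.suprtekstic.com:/storage/exports',
]

for export in EXPORTS:
    server, path = export.split(':', 1)
    try:
        addr = socket.gethostbyname(server)  # same lookup that mount.nfs fails on
    except socket.gaierror as err:
        print('%s: cannot resolve %s (%s)' % (export, server, err))
        continue
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(5)
    try:
        sock.connect((addr, 2049))  # standard NFS TCP port
        print('%s: %s resolves to %s, port 2049 reachable' % (export, server, addr))
    except socket.error as err:
        print('%s: %s resolves to %s, port 2049 NOT reachable (%s)' % (export, server, addr, err))
    finally:
        sock.close()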


log snippet:
1.
Thread-14::DEBUG::2014-04-11 22:54:44,331::mount::226::Storage.Misc.excCmd:_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 ashtistg01.suprtekstic.com:/storage/exports /rhev/data-center/mnt/ashtistg01.suprtekstic.com:_storage_exports' (cwd None)
Thread-14::ERROR::2014-04-11 22:55:36,659::storageServer::209::StorageServer.MountConnection:connect) Mount failed: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 222, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Thread-14::ERROR::2014-04-11 22:55:36,705::hsm::2379::Storage.HSM:connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 320, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
    raise e
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')





2.
Thread-14::ERROR::2014-04-11 22:56:29,307::storageServer::209::StorageServer.MountConnection:connect) Mount failed: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 222, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Thread-14::ERROR::2014-04-11 22:56:29,309::hsm::2379::Storage.HSM:connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 320, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
    raise e
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')


Regardless of that, there are sanlock errors throughout the log when trying to acquire the host id.
Fede, can you take a look at those sanlock errors to verify that we don't have further issues here?
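
For reference, a quick way to pull those errors out is sketched below; the log path and keywords are assumptions based on a default sanlock setup, not taken from the attached logs.

# Hedged sketch: scan the default sanlock log for add_lockspace / acquire failures.
import re

LOG = '/var/log/sanlock.log'  # default sanlock log location (assumption)
PATTERN = re.compile(r'(add_lockspace|acquire).*(fail|error)', re.IGNORECASE)

with open(LOG) as fh:
    for line in fh:
        if PATTERN.search(line):
            print(line.rstrip())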

Comment 4 Allon Mureinik 2014-04-30 08:27:58 UTC
bug 1072900 is in MODIFIED - moving this one too.

Comment 5 Sandro Bonazzola 2014-05-08 13:35:51 UTC
This is an automated message

oVirt 3.4.1 has been released:
 * should fix your issue
 * should be available at your local mirror within two days.

If problems still persist, please make note of it in this bug report.

Comment 6 Kristaps Tigeris 2014-05-15 18:50:19 UTC
This issue started affecting our environment as well after our ISO domain became unavailable. It is really unsettling when something like this starts to happen in a production environment with 20 virtual hosts.

Annoyingly enough, even after the ISO domain was fixed and was available to all hosts, together with the main domain, the issue did not go away.

Also, oVirt 4.2.1 is not yet available in the repositories, which is kinda sad.

Comment 7 Itamar Heim 2014-05-18 20:36:23 UTC
(In reply to Kristaps Tigeris from comment #6)
> This issue started affecting our environment as well after our ISO domain
> became unavailable. It is really unsettling when something like this starts
> to happen in a production environment with 20 virtual hosts.
> 
> Annoyingly enough, even after the ISO domain was fixed and was available to
> all hosts, together with the main domain, the issue did not go away.
> 
> Also, oVirt 4.2.1 is not yet available in the repositories, which is kinda sad.

Kristaps - do you mean 3.4.1? Still an issue locating this version?

Comment 8 Kristaps Tigeris 2014-05-19 07:14:01 UTC
Yes, I mean 3.4.1.

I resolved my issue by downgrading vdsm on hosts.

But yeah, I still don't see 3.4.1 in the oVirt repository.

Comment 9 Allon Mureinik 2014-05-19 07:28:57 UTC
(In reply to Kristaps Tigeris from comment #8)
> Yes, I mean 3.4.1.
> 
> I resolved my issue by downgrading vdsm on hosts.
> 
> But yea, I still don't see 3.4.1 in oVirt repository.
Sandro, I too can't find 3.4.1 in oVirt's repos. Can you take a look please?

Comment 10 Sandro Bonazzola 2014-05-19 08:27:03 UTC
oVirt 3.4.1 has been released in http://resources.ovirt.org/pub/ovirt-3.4.

Release notes are available here: http://www.ovirt.org/OVirt_3.4.1_release_notes

In order to install it on a clean system, you need to install the release package:

 # yum localinstall http://resources.ovirt.org/pub/yum-repo/ovirt-release34.rpm

Let me know if you need any other info.

