Bug 1086951

Summary: SPM never stops contending
Product: [Retired] oVirt
Component: ovirt-engine-core
Version: 3.4
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Target Release: 3.4.1
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
oVirt Team: Storage
Reporter: Maurice James <midnightsteel>
Assignee: Liron Aravot <laravot>
QA Contact: Aharon Canan <acanan>
CC: acathrow, amureini, fsimonce, gklein, iheim, kristapstigeris, michal.skrivanek, sbonazzo, yeylon
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-05-08 13:35:51 UTC
Attachments:
  VDSM Log (flags: none)
  sanlock (flags: none)
  engine.log file (flags: none)

Description Maurice James 2014-04-12 03:12:08 UTC
Created attachment 885644 [details]
VDSM Log

Description of problem:
Cluster of 4 nodes never stops contending for SPM. Not able to run any VMs now

Version-Release number of selected component (if applicable):
3.4.0

How reproducible:
Unknown

Steps to Reproduce:
1. Manually shut down all VMs
2. Put all nodes in maintenance mode
3. Change network configuration on all nodes
4. Reboot nodes and attempt to activate

Actual results:
Nodes never complete SPM contention

Expected results:
Nodes and Datacenter come back online

Additional info:
ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-11) [3aed1419] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed

Comment 1 Maurice James 2014-04-12 03:13:18 UTC
Created attachment 885645 [details]
sanlock

Comment 2 Maurice James 2014-04-12 03:19:20 UTC
Created attachment 885646 [details]
engine.log file

Comment 3 Liron Aravot 2014-04-20 22:24:53 UTC
The issue is that the host doesn't have access to all the storage domains, which causes the SPM start process to fail.
There's a bug open for that issue - https://bugzilla.redhat.com/show_bug.cgi?id=1072900.

From looking at the logs, it seems that the host has problems accessing two storage domains:
3406665e-4adc-4fd4-aa1e-037547b29adb
f3b51811-4a7f-43af-8633-322b3db23c48

Can you verify that the host can access those domains? From the log, it seems the NFS paths for those are:
shtistg01.suprtekstic.com:/storage/infrastructure
shtistg01.suprtekstic.com:/storage/exports
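
A quick way to check this from the host would be something like the following (assuming standard nfs-utils; the mount options are copied from the log snippet below, and /mnt/nfstest is just a hypothetical scratch directory for the test):

 # showmount -e shtistg01.suprtekstic.com
 # mkdir -p /mnt/nfstest
 # mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 shtistg01.suprtekstic.com:/storage/exports /mnt/nfstest
 # umount /mnt/nfstest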


log snippet:
1.
Thread-14::DEBUG::2014-04-11 22:54:44,331::mount::226::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 ashtistg01.suprtekstic.com:/storage/exports /rhev/data-center/mnt/ashtistg01.suprtekstic.com:_storage_exports' (cwd None)
Thread-14::ERROR::2014-04-11 22:55:36,659::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 222, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Thread-14::ERROR::2014-04-11 22:55:36,705::hsm::2379::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 320, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
    raise e
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')

2.
Thread-14::ERROR::2014-04-11 22:56:29,307::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 222, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
Thread-14::ERROR::2014-04-11 22:56:29,309::hsm::2379::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 320, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
    raise e
MountError: (32, ';mount.nfs: Failed to resolve server ashtistg01.suprtekstic.com: Name or service not known\n')
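
Both tracebacks fail at the same point: mount.nfs cannot resolve ashtistg01.suprtekstic.com. Given that step 3 of the reproduction changed the network configuration on all nodes, checking name resolution on the host would be the obvious first step (standard tools, nothing oVirt-specific):

 # getent hosts ashtistg01.suprtekstic.com
 # grep suprtekstic /etc/hosts
 # cat /etc/resolv.conf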


Regardless of that, there are sanlock errors throughout the log when trying to acquire the host-id.
Fede, can you take a look at those sanlock errors to verify that we don't have further issues here?
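
For reference, sanlock's current view of the lockspaces can be dumped on the host with the standard sanlock CLI (exact output varies by version), alongside its log:

 # sanlock client status
 # tail -n 50 /var/log/sanlock.log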

Comment 4 Allon Mureinik 2014-04-30 08:27:58 UTC
bug 1072900 is in MODIFIED - moving this one too.

Comment 5 Sandro Bonazzola 2014-05-08 13:35:51 UTC
This is an automated message

oVirt 3.4.1 has been released:
 * it should fix your issue
 * it should be available at your local mirror within two days.

If problems still persist, please make note of it in this bug report.

Comment 6 Kristaps Tigeris 2014-05-15 18:50:19 UTC
This issue started affecting our environment as well after our ISO domain became unavailable. It is really unsettling when something like this starts to happen in a production environment with 20 virtual hosts.

Annoyingly enough, even after the ISO domain was fixed and available to all hosts, together with the main domain, the issue did not go away.

Also, oVirt 4.2.1 is not yet available in the repositories, which is kinda sad.

Comment 7 Itamar Heim 2014-05-18 20:36:23 UTC
(In reply to Kristaps Tigeris from comment #6)
> This issue started affecting our environment as well after our ISO domain
> became unavailable. It is really unsettling when something like this starts
> to happen in a production environment with 20 virtual hosts.
> 
> Annoyingly enough, even after the ISO domain was fixed and available to all
> hosts, together with the main domain, the issue did not go away.
> 
> Also, oVirt 4.2.1 is not yet available in the repositories, which is kinda
> sad.

Kristaps - do you mean 3.4.1? Is it still an issue locating this version?

Comment 8 Kristaps Tigeris 2014-05-19 07:14:01 UTC
Yes, I mean 3.4.1.

I resolved my issue by downgrading vdsm on hosts.

But yeah, I still don't see 3.4.1 in the oVirt repository.

Comment 9 Allon Mureinik 2014-05-19 07:28:57 UTC
(In reply to Kristaps Tigeris from comment #8)
> Yes, I mean 3.4.1.
> 
> I resolved my issue by downgrading vdsm on hosts.
> 
> But yeah, I still don't see 3.4.1 in the oVirt repository.
Sandro, I too can't find 3.4.1 in oVirt's repos. Can you take a look please?

Comment 10 Sandro Bonazzola 2014-05-19 08:27:03 UTC
oVirt 3.4.1 has been released at http://resources.ovirt.org/pub/ovirt-3.4.

Release notes are available here: http://www.ovirt.org/OVirt_3.4.1_release_notes

In order to install it on a clean system, you need to run:

 # yum localinstall http://resources.ovirt.org/pub/yum-repo/ovirt-release34.rpm
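
For an already-installed engine, the usual 3.4 upgrade flow (an assumption based on the standard oVirt upgrade procedure, not something specific to this release) would be:

 # yum update "ovirt-engine-setup*"
 # engine-setup

and a plain "yum update vdsm" on the hosts once the new packages reach the mirrors.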

Let me know if you need any other info.