Bug 1072900 - spm cannot be started if domain can't be produced
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.4.1
Assignee: Liron Aravot
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Duplicates: 1092667
Depends On:
Blocks: 968977
 
Reported: 2014-03-05 11:01 UTC by G. Bersano
Modified: 2016-02-10 16:58 UTC
CC List: 13 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-05-08 13:36:48 UTC
oVirt Team: Storage


Attachments
engine and vdsm logs (1.69 MB, application/x-gzip)
2014-03-06 16:14 UTC, Elad


Links
oVirt gerrit 25424 [MERGED] sp: fix spm start when failing to produce domain
oVirt gerrit 27118 [ABANDONED] sp: fix spm start when failing to produce domain
oVirt gerrit 27194 [MERGED] sp: fix spm start when failing to produce domain

Description G. Bersano 2014-03-05 11:01:39 UTC
Description of problem:
If you need to put every host in a DataCenter into maintenance mode, and the ISO Storage Domain and/or the Export Storage Domain isn't reachable when you want to start the DC again, the DC will not return to function.

Version-Release number of selected component (if applicable):
oVirt 3.4beta3

How reproducible:
Always

Steps to Reproduce:
1. Start with a fully configured DataCenter, including an ISO Storage Domain and/or an Export Storage Domain.
2. For planned maintenance, shut down every VM and then put every host in the corresponding Cluster(s) into Maintenance mode.
3. Finish the planned activity; almost everything is OK, but for some reason one of the Export or ISO SDs doesn't function.
4. Activate one of the hosts.

Actual results:
The DataCenter isn't able to start and toggles between the "Non Responsive" and "Contending" statuses.

Expected results:
The only blocking Storage Domain should be the Master Data SD.
A malfunction of any other Storage Domain shouldn't prevent the DC from coming up.

Additional info:
This is especially serious if one of those SDs is exposed by a VM configured to run in that DC. That NFS server can't be started before the DC comes up, but the DC doesn't come up without that NFS share.
At that point the simplest remedy is to evict that SD using the "Destroy" action in webadmin.

Comment 1 G. Bersano 2014-03-05 11:08:37 UTC
I forgot to say that if you can restore the functionality of the NFS server, everything comes up as expected.
But this isn't possible if that server is a VM of that DC.

Comment 2 Nicolas Ecarnot 2014-03-05 13:35:16 UTC
I've witnessed a similar behaviour in 3.3.
When the NFS server went back up and running (it was an external one), everything came back to life.

Comment 3 Sven Kieske 2014-03-06 10:09:59 UTC
any chance of getting this fixed in 3.4.1?

Comment 4 Elad 2014-03-06 16:14:38 UTC
Created attachment 871549 [details]
engine and vdsm logs

Encountered a similar problem.
After blocking the storage server on which the master domain is located, reconstruct is reported as completed, but the DC is stuck toggling between two states: Non Responsive and Contending.

2014-03-06 16:20:10,150 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-4-thread-46) [467e56f2] Correlation ID: 467e56f2, Job ID: c18d89f2-6318-4d0c-a353-7e269ce63c5b, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center elad1 completed.



Attaching logs, setting severity to Urgent.

Comment 5 Eduardo Warszawski 2014-03-20 04:05:44 UTC
(In reply to G. Bersano from comment #0)

> Expected results:

> The only blocking Storage Domain should be the Master Data SD. 
By design of the oVirt system, the SPM host should be able to reach any SD of the pool at any time. Failing this condition can lead, among other things, to extend requests not being served and to paused VMs.


> A malfunction of any other Storage Domain shouldn't prevent the DC from coming up.
The engine should take care to choose a new host capable of reaching all the SDs of the pool, or update (reconstruct) the pool definition accordingly.

The error info returned by the failed startSpm includes the unreachable domain, and this info should be used to manage starting the DC.
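
To make the suggested flow concrete, here is a hedged Python sketch of an engine-side election loop that uses the unreachable domain reported by a failed startSpm. It is illustrative only (the real engine is Java); SpmStartError, spm_start and reconstruct_without are hypothetical names standing in for the actual engine/VDSM plumbing.

class SpmStartError(Exception):
    """Illustrative error carrying the domain that couldn't be produced."""
    def __init__(self, unreachable_domain):
        super().__init__("cannot produce domain %s" % unreachable_domain)
        self.unreachable_domain = unreachable_domain


def elect_spm(hosts, pool, spm_start, reconstruct_without):
    """Try each host; if every host fails on an unreachable domain,
    reconstruct the pool without that domain and retry the election once."""
    unreachable = None
    for host in hosts:
        try:
            spm_start(host, pool)      # engine -> VDSM spmStart verb
            return host                # SPM elected, the DC can come up
        except SpmStartError as e:
            unreachable = e.unreachable_domain   # try the next host first
    if unreachable is None:
        raise RuntimeError("no host could start the SPM")
    # No host reaches the domain: update (reconstruct) the pool definition
    # so a non-master domain cannot block the DC, then retry once.
    reconstruct_without(pool, unreachable)
    for host in hosts:
        try:
            spm_start(host, pool)
            return host
        except SpmStartError:
            continue
    raise RuntimeError("SPM election failed even after reconstruct")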

Comment 6 Allon Mureinik 2014-03-20 12:34:30 UTC
(In reply to Eduardo Warszawski from comment #5)
> (In reply to G. Bersano from comment #0)
> 
> > Expected results:
> 
> > The only blocking Storage Domain should be the Master Data SD. 
> By design of the oVirt system, the SPM host should be able to reach any SD
> of the pool at any time. Failing this condition can lead, among other
> things, to extend requests not being served and to paused VMs.
> 
> > A malfunction of any other Storage Domain shouldn't prevent the DC from coming up.
> The engine should take care to choose a new host capable of reaching all
> the SDs of the pool, or update (reconstruct) the pool definition accordingly.
> 
> The error info returned by the failed startSpm includes the unreachable
> domain, and this info should be used to manage starting the DC.
I tend to agree - AFAIK, VDSM should not make any decisions about the contents of the pool, just execute what the engine decides.

Comment 7 Federico Simoncelli 2014-03-21 18:14:15 UTC
(In reply to Eduardo Warszawski from comment #5)
> (In reply to G. Bersano from comment #0)
> 
> > Expected results:
> 
> > The only blocking Storage Domain should be the Master Data SD. 
> By design of the oVirt system, the SPM host should be able to reach any SD
> of the pool at any time. Failing this condition can lead, among other
> things, to extend requests not being served and to paused VMs.

As long as a host can reach the master it can hold the SPM, and I agree it's the engine's duty to elect a new SPM when needed/possible.

Anyway, the extension requests are issued only through the master domain, so whether one of the other domains is unreachable is irrelevant.

> > A malfunction of any other Storage Domain shouldn't prevent the DC from coming up.
> The engine should take care to choose a new host capable of reaching all
> the SDs of the pool, or update (reconstruct) the pool definition accordingly.

I agree that the engine should choose a new host capable of reaching all the SDs, but not being able to start the SPM when one of the domains is unreachable is a bug.

If you force the engine to reconstruct before starting the SPM, you're evicting a domain from the pool before it reaches the usual timeout; normally a domain is deemed unreachable only after a timeout of 5 minutes.

We cannot wait 5 minutes before reconstructing/starting the SPM, and we cannot evict a domain before its 5 minutes of grace time.

That is why we should allow the SPM to start even if one of the regular domains is unreachable; then, when the timeout is reached, the engine will send deactivateStorageDomain.
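
A minimal Python sketch of this idea (a sketch only, not the actual patch from gerrit 25424): while producing the pool's domains during SPM start, only a failure on the master domain aborts. Here produce_domain and log are illustrative stand-ins for VDSM's sdCache.produce() and its logger.

def start_spm_produce_domains(pool_domains, master_uuid, produce_domain, log):
    """Produce every domain in the pool, but let only a master-domain
    failure abort the SPM start; unreachable regular domains are logged
    and skipped until the engine deactivates them after the grace timeout."""
    produced = {}
    for sd_uuid in pool_domains:
        try:
            produced[sd_uuid] = produce_domain(sd_uuid)
        except Exception:
            if sd_uuid == master_uuid:
                raise          # the master is the only mandatory domain
            log.error("Failed to produce domain %s, skipping it",
                      sd_uuid, exc_info=True)
    return produced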

Comment 8 Liron Aravot 2014-03-23 08:55:01 UTC
(In reply to Eduardo Warszawski from comment #5)
> (In reply to G. Bersano from comment #0)
> 
> > Expected results:
> 
> > The only blocking Storage Domain should be the Master Data SD. 
> By design of the oVirt system, the SPM host should be able to reach any SD
> of the pool at any time. Failing this condition can lead, among other
> things, to extend requests not being served and to paused VMs.
>
 
Eduardo,
Currently (prior to the introduced bug) SPM start didn't require access to all the domains. Changing this will cause a regression when working with all previous engine versions and will break backwards compatibility - how do you suggest we handle it?

Comment 9 Eduardo Warszawski 2014-03-23 11:15:50 UTC
(In reply to Federico Simoncelli from comment #7)
> (In reply to Eduardo Warszawski from comment #5)
> > (In reply to G. Bersano from comment #0)
> > 
> > > Expected results:
> > 
> > > The only blocking Storage Domain should be the Master Data SD. 
> > By design of the oVirt system, the SPM host should be able to reach any
> > SD of the pool at any time. Failing this condition can lead, among other
> > things, to extend requests not being served and to paused VMs.
> 
> As long as a host can reach the master it can hold the SPM, and I agree
> it's the engine's duty to elect a new SPM when needed/possible.
> 
> Anyway, the extension requests are issued only through the master domain,
> so whether one of the other domains is unreachable is irrelevant.
> 
If the domain is unreachable by the SPM host, LVs on it can't be extended, even though the requests go through the MSD.

> > > A malfunction of any other Storage Domain shouldn't prevent the DC from coming up.
> > The engine should take care to choose a new host capable of reaching all
> > the SDs of the pool, or update (reconstruct) the pool definition accordingly.
> 
> I agree that the engine should choose a new host capable of reaching all
> the SDs, but not being able to start the SPM when one of the domains is
> unreachable is a bug.
> 
> If you force the engine to reconstruct before starting the SPM, you're
> evicting a domain from the pool before it reaches the usual timeout;
> normally a domain is deemed unreachable only after a timeout of 5 minutes.
> 
> We cannot wait 5 minutes before reconstructing/starting the SPM, and we
> cannot evict a domain before its 5 minutes of grace time.
> 
> That is why we should allow the SPM to start even if one of the regular
> domains is unreachable; then, when the timeout is reached, the engine will
> send deactivateStorageDomain.

You should deactivate or remove the unreachable domain in the new pool definition and monitor this domain until it comes back. If you don't want to monitor it, you can issue getStorageDomainsList, which returns the list of reachable domains.

The only mission of the SPM is performing actions on an SD. There is no point in holding the SPM for an unreachable domain, not even for 5 minutes. "Evict" it and add it again later.
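
For comparison, a hedged Python sketch of the monitoring alternative described above: get_storage_domains_list stands in for VDSM's getStorageDomainsList verb (which returns the domains the host can currently reach), and re_attach is a hypothetical engine-side action that adds the domain back.

import time

def watch_evicted_domain(sd_uuid, get_storage_domains_list, re_attach,
                         poll_interval=60):
    """Poll the reachable-domains list until the evicted domain shows up
    again, then add it back to the pool."""
    while sd_uuid not in get_storage_domains_list():
        time.sleep(poll_interval)
    re_attach(sd_uuid)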

Comment 10 Eduardo Warszawski 2014-03-23 11:34:27 UTC
(In reply to Liron Aravot from comment #8)
> (In reply to Eduardo Warszawski from comment #5)
> > (In reply to G. Bersano from comment #0)
> > 
> > > Expected results:
> > 
> > > The only blocking Storage Domain should be the Master Data SD. 
> > By design of the ovirt system spm host should be able at any time to reach
> > any SD of the pool. Failing this condition can lead, among others, to extend
> > requests being not served and to paused VMs.
> >
>  
> Eduardo,
> Currently (prior to the introduced bug) SPM start didn't require access to
> all the domains. Changing this will cause a regression when working with
> all previous engine versions and will break backwards compatibility - how
> do you suggest we handle it?

oVirt 3.0 included the regularization of outdated MSDs. No regression here.
(I593060e354c0bdc9b19f4e11a376094d83e567ce)

This does not break BC. If any engine version expects the SPM to be started, it should be fixed.
If any SD is not reachable, the host loses the SPM.
Even though the code is not so consistent, connectStoragePool may fail in case of unreachable domains.

You have conflicting behaviour; fix the engine.

Comment 11 Elad 2014-04-01 11:09:51 UTC
*** Bug 1078907 has been marked as a duplicate of this bug. ***

Comment 12 Liron Aravot 2014-05-01 15:05:17 UTC
*** Bug 1092667 has been marked as a duplicate of this bug. ***

Comment 13 Sandro Bonazzola 2014-05-08 13:36:48 UTC
This is an automated message.

oVirt 3.4.1 has been released:
 * it should fix your issue
 * it should be available at your local mirror within two days.

If problems still persist, please make note of it in this bug report.

