Bug 1183977 - [engine-backend] When reconstruct master is marked as finished, the problematic domain is reported as active, while the new master is inactive
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Liron Aravot
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: 1202399
 
Reported: 2015-01-20 10:58 UTC by Elad
Modified: 2016-03-10 12:04 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1202399 (view as bug list)
Environment:
Last Closed:
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
logs, screenshots and db_dump (1.15 MB, application/x-gzip)
2015-01-20 10:58 UTC, Elad
logs-2.2 (1.75 MB, application/x-gzip)
2015-02-02 08:17 UTC, Elad
logs (netapp) (426.11 KB, application/x-gzip)
2015-02-02 10:02 UTC, Elad
3.4 host on 3.4 DC (202.52 KB, application/x-gzip)
2015-02-02 12:29 UTC, Elad


Links
oVirt gerrit 37952 (master, MERGED): monitor: return whether the reported status is actual (last updated 2020-06-30 13:16:35 UTC)
oVirt gerrit 38039 (master, MERGED): core: determine whether domain monitoring result is actual (last updated 2020-06-30 13:16:35 UTC)

Description Elad 2015-01-20 10:58:28 UTC
Created attachment 981794 [details]
logs, screenshots and db_dump

Description of problem:
When the reconstruct operation is marked as completed, the storage domains in the DC are left in an illegal combination of states for a period of several minutes:
- The new master is reported as 'Inactive' because it hasn't been activated yet.
- The old master, which is no longer the master and is not accessible to the hosts in the DC, is reported as 'Active'.

This combination is illegal: there should never be an active storage domain in the DC while the master domain is inactive.
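To make the violated invariant explicit, here is a minimal illustrative check (the domain/status representation is made up for the example; this is not engine code):

def dc_states_are_legal(domains):
    # Illustrative only. 'domains' is a list of dicts such as
    # {"name": "vnx7", "master": True, "status": "Active"}.
    master_active = any(d["master"] and d["status"] == "Active" for d in domains)
    any_non_master_active = any(not d["master"] and d["status"] == "Active" for d in domains)
    # A non-master domain may only be Active while the master domain is Active.
    return master_active or not any_non_master_active

# The combination reported after reconstruct violates the invariant:
# dc_states_are_legal([
#     {"name": "new-master", "master": True,  "status": "Inactive"},
#     {"name": "old-master", "master": False, "status": "Active"},
# ]) returns False.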

Version-Release number of selected component (if applicable):
rhev 3.5 vt13.8

How reproducible:
Always

Steps to Reproduce:
1. Have more than 1 storage domain in the DC
2. Block the master domain from all the hosts in the DC

Actual results:
Reconstruct takes place and succeeds. There is a new master domain in the DC.

2015-01-20 12:41:25,589 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-7-thread-1) [798e2dff] FINISH, ReconstructMasterVDSCommand, log id: 612a9167

The engine reports the domain states as described in the bug description
(see screenshots).

Expected results:
There should never be a situation in which there is no active master domain in the DC while non-master domains are in state Active.

Additional info: logs, screenshots and db_dump

Comment 1 Elad 2015-01-20 12:53:41 UTC
Reducing severity since the issue doesn't cause any failure in the reconstruct flow.

Comment 2 Tal Nisan 2015-01-20 14:24:43 UTC
Based on comment #1, it seems that the issue exists only in the UI display for several minutes and is then resolved.

Comment 3 Elad 2015-01-20 17:11:11 UTC
(In reply to Tal Nisan from comment #2)
> Based on comment #1, it seems that the issue exists only in the UI
> display for several minutes and is then resolved.

Liron investigated it; it doesn't seem to be a display issue after all.

Comment 4 Yaniv Lavi 2015-01-20 18:09:17 UTC
(In reply to Elad from comment #3)
> (In reply to Tal Nisan from comment #2)
> > Based on comment #1, it seems that the issue exists only in the UI
> > display for several minutes and is then resolved.
> 
> Liron investigated it; it doesn't seem to be a display issue after all.

So what is the issue?

Comment 5 Elad 2015-01-21 08:32:21 UTC
(In reply to Yaniv Dary from comment #4)

> So what is the issue?

As described above, while the new master domain is inactive, the old master is reported as active. This is a false positive (it allows operations to be initiated on the problematic domain) and also an illegal combination of domain states (a non-master domain cannot be Active when there is no active master domain in the same DC).

Comment 6 Elad 2015-01-25 11:42:09 UTC
While the problematic domain is wrongly reported as 'Active', it is possible to perform reconstruct so that it is elected as master. This operation fails, and afterwards the following occurs:

" Sync Error on Master Domain between Host purple-vds3 and oVirt Engine. Domain: vnx7 is marked as Master in oVirt Engine database but not on the Storage side. Please consult with Support on how to fix this issue. "

Comment 7 Liron Aravot 2015-01-29 12:58:42 UTC
Elad, is there an actual impact other than waiting again for the domains to move to Inactive?
Returning the severity to medium, as this happens only on reconstruct when there are Inactive domains.

Comment 8 Liron Aravot 2015-02-01 11:23:07 UTC
Elad,
Can you try to reproduce it?
I've tried multiple times and couldn't reproduce it. I do understand the issue from the root cause analysis, but I want to see how often this occurs.
Thanks.

Comment 9 Elad 2015-02-02 08:17:04 UTC
Created attachment 987009 [details]
logs-2.2

Reproduced, attaching the logs

(the blocked domain is reported as active)

2015-02-02 08:01:49,498 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-61) Storage Domain fa6c7974-d693-4f6a-8314-4b2fa23f0c2b:vnx1 was reported by Host green-vdsb as Active in Pool 00000002-0002-0002-0002-000000000002, moving to active status

Comment 10 Liron Aravot 2015-02-02 08:42:08 UTC
Elad, thanks.
I do understand the cause of this issue - I'd like some further info.

1. Please try to reproduce while disabling the StoragePoolDiskBackend config value for your data center version (with the 'original' storage used).

2. Can you please try to reproduce using a different storage machine? This issue has been reproduced only on that storage machine, AFAIK.

Comment 11 Liron Aravot 2015-02-02 09:25:09 UTC
3. Please also try to reproduce using 3.4 vdsm (same storage).

Comment 12 Elad 2015-02-02 10:02:04 UTC
Created attachment 987048 [details]
logs (netapp)

The same occurs with other storage servers (tried with a NetApp NFS server).

2015-02-02 09:53:00,496 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-84) Storage Domain b5ff7e46-dff3-4ce9-a9fc-e8c330b5b455:netapp1 was reported by Host green-vdsb as Active in Pool 00000002-0002-0002-0002-000000000002, moving to active status

Comment 13 Elad 2015-02-02 12:29:01 UTC
Created attachment 987098 [details]
3.4 host on 3.4 DC

Tested the following matrix:

- 3.5 DC with 3.5 host - managed to reproduce
- 3.4 DC with 3.5 host - managed to reproduce
- 3.4 DC with 3.4 host - didn't manage to reproduce

Comment 14 Liron Aravot 2015-02-03 09:54:40 UTC
It can be seen that the bug couldn't be reproduced (or is harder to reproduce) with vdsm 3.4, but was reproduced with vdsm 3.5 using both the StoragePoolDiskBackend and the StoragePoolMemoryBackend. I'm adding needinfo? on Elad to reproduce it with a 3.4 engine, but that is a separate matter.

The root cause of the problem is that the domain monitoring thread starts with a mixed set of defaults, some of them positive. When the initial domain monitoring result is reported to the engine, the engine may move the domain to Active (as reported from vdsm), while immediately afterwards it will be found that the domain is problematic and its status will eventually change to Inactive.

The solution here should be to start the monitoring thread with a negative result and only report back to the engine once we know the domain is OK. That will solve this issue and will also unify the domain monitoring result (currently some fields have a negative initial value and some a positive one). Regardless of that, we can consider in 3.6 which other fields are reported.

Adding needinfo? on amureini/nsoffer to get their opinion on this.
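To illustrate the proposed direction, here is a minimal sketch (not the actual vdsm code; the class and field names are hypothetical) in which the monitor state starts from an all-negative default and flips only after a real check:

# Sketch only -- not the actual vdsm implementation; names are hypothetical.
class DomainMonitorStatus(object):
    def __init__(self):
        # All fields start from a negative/unknown default, so the very first
        # report to the engine cannot look healthy before the first real check.
        self.error = Exception("domain not checked yet")
        self.valid = False
        self.lastCheck = 0.0
        self.checked = False

    def update(self, error, last_check):
        # Called by the monitoring thread after every real check of the domain.
        self.error = error
        self.valid = error is None
        self.lastCheck = last_check
        self.checked = True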

Comment 15 Nir Soffer 2015-02-03 16:31:55 UTC
(In reply to Liron Aravot from comment #14)
> The root cause of the problem is that the domain monitoring thread
> starts with a mixed set of defaults, some of them positive. When the
> initial domain monitoring result is reported to the engine, the engine
> may move the domain to Active (as reported from vdsm), while immediately
> afterwards it will be found that the domain is problematic and its
> status will eventually change to Inactive.
> 
> The solution here should be to start the monitoring thread with a
> negative result and only report back to the engine once we know the
> domain is OK. That will solve this issue and will also unify the domain
> monitoring result (currently some fields have a negative initial value
> and some a positive one). Regardless of that, we can consider in 3.6
> which other fields are reported.

I agree that returning a valid status when we don't know the status yet
is the root cause.

In the past we tried to return a special lastCheck value meaning
"not checked yet" (see http://gerrit.ovirt.org/21649). This change
broke the host activation flow, so we reverted back to a "fake"
lastCheck value (see http://gerrit.ovirt.org/21878). We could return
an error until the domain is checked, but this may also break the
activation flow and older engines.

So it seems that the right solution is to return a new "checked" key,
which will be False until the domain monitor has actually performed a
check and True after the first check has completed. A newer engine can
use this information to avoid the issue in this bug, and probably also
in the activation flow.
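A minimal sketch of how such a per-domain report could carry the new key (illustrative only; apart from "checked" itself, the dict layout is an assumption and real vdsm error codes differ):

# Sketch only -- not the actual vdsm patch.
def domain_report(valid, last_check, checked):
    # Build the per-domain status dict the engine polls. "checked" is False
    # until the monitor has completed its first check, so the engine can
    # tell "no data yet" apart from "domain is healthy".
    return {
        "code": 0 if valid else 1,          # non-zero on error; real codes differ
        "valid": valid,
        "lastCheck": "%.1f" % last_check,
        "checked": checked,
    }

Before the first check completes, vdsm would report something like domain_report(valid=True, last_check=0.0, checked=False): an old engine still sees valid=True (behaviour unchanged), while a new engine can tell the result is not actual yet and ignore it.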

Comment 16 Allon Mureinik 2015-02-04 09:27:14 UTC
(In reply to Nir Soffer from comment #15)
> So it seems that the right solution is to return a new "checked" key,
> which will be False until the domain monitor has actually performed a
> check and True after the first check has completed. A newer engine can
> use this information to avoid the issue in this bug, and probably also
> in the activation flow.
+1 on that.
The general idea should be that even if we can't fix older engines, we shouldn't break them any more than they already are, while newer engines can have better functionality.
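The engine-side counterpart (the linked gerrit patch "core: determine whether domain monitoring result is actual") would be to act on a report only once it is marked as actual. The sketch below expresses that logic in Python purely for illustration; the real change is in the Java engine and the helper name is made up:

# Pseudologic sketch only -- the real fix lives in ovirt-engine (Java).
def apply_domain_report(current_status, report):
    # A report that is not yet actual (checked=False) must not move the
    # domain to Active; keep whatever status the engine already has.
    # Older vdsm versions omit the key, so a missing key is treated as actual.
    if not report.get("checked", True):
        return current_status
    return "Active" if report["valid"] else "Inactive"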

Comment 18 Elad 2015-04-20 09:21:37 UTC
When reconstruct is finished, the problematic domain is reported as inactive and the master gets activated successfully.

Verified using 3.6.0-1
ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764.el6.noarch
vdsm-4.17.0-632.git19a83a2.el7.x86_64

Comment 19 Christopher Pereira 2015-04-29 02:16:20 UTC
In the latest master build, I experienced the following issue:

A storage domain was unable to become the new "master storage domain" (MSD) after another (the previous MSD) was "isolated" (the connection to the old MSD's gluster-server was interrupted).

Expected result: The remaining storage domain should become the new MSD and not rely on the old MSD's availability.

What I actually see in the engine:

1) Both SDs are down (red), since the data center is also down.
2) The surviving SD is labeled "Data (Master)" and tries to become the new MSD; it apparently fails and reverts to just "Data", and then the previous MSD tries the same.

Related VDSM logs:

Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::592::Storage.TaskManager.Task::(_updateState) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::moving from state init -> state preparing
Thread-32620::INFO::2015-04-28 16:34:02,508::logUtils::48::dispatcher::(wrapper) Run and protect: getAllTasksStatuses(spUUID=None, options=None)
Thread-32620::ERROR::2015-04-28 16:34:02,508::task::863::Storage.TaskManager.Task::(_setError) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 870, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2202, in getAllTasksStatuses
    raise se.SpmStatusError()
SpmStatusError: Not SPM: ()
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::882::Storage.TaskManager.Task::(_run) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Task._run: bf487090-8d62-4b42-bfde-93574a8e1486 () {} failed - stopping task
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::1214::Storage.TaskManager.Task::(stop) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::stopping in state preparing (force False)
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::990::Storage.TaskManager.Task::(_decref) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::ref 1 aborting True
Thread-32620::INFO::2015-04-28 16:34:02,508::task::1168::Storage.TaskManager.Task::(prepare) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::aborting: Task is aborted: 'Not SPM' - code 654
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::1173::Storage.TaskManager.Task::(prepare) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Prepare: aborted: Not SPM
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::990::Storage.TaskManager.Task::(_decref) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::ref 0 aborting True
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::925::Storage.TaskManager.Task::(_doAbort) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Task._doAbort: force False
Thread-32620::DEBUG::2015-04-28 16:34:02,508::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::592::Storage.TaskManager.Task::(_updateState) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::moving from state preparing -> state aborting
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::547::Storage.TaskManager.Task::(__state_aborting) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::_aborting: recover policy none
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::592::Storage.TaskManager.Task::(_updateState) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::moving from state aborting -> state failed
Thread-32620::DEBUG::2015-04-28 16:34:02,508::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-32620::DEBUG::2015-04-28 16:34:02,508::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-32620::ERROR::2015-04-28 16:34:02,509::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': 'Not SPM: ()', 'code': 654}}
Thread-32620::DEBUG::2015-04-28 16:34:02,509::stompReactor::158::yajsonrpc.StompServer::(send) Sending response

Comment 20 Allon Mureinik 2015-04-29 08:21:40 UTC
Thanks for your report, Christopher!

Can you please open a new BZ and attach the ovirt-engine and VDSM logs to it so this issue can be properly tracked?

Comment 21 Christopher Pereira 2015-05-04 00:29:55 UTC
Hi Allon,

We found out that the volume that was unable to become the new Master Storage Domain was just a replica-2 volume, which is not supported by oVirt.
We will set up the environment again and repeat the tests first.

Best regards,
Christopher

Comment 22 Allon Mureinik 2016-03-10 10:43:36 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

Comment 23 Allon Mureinik 2016-03-10 10:47:46 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

Comment 24 Allon Mureinik 2016-03-10 12:04:18 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE

