Created attachment 981794 [details]
logs, screenshots and db_dump

Description of problem:
When the reconstruct operation is marked as completed, for a period of several minutes there is an illegal set of storage domain states in the DC:
- The new master is reported as 'Inactive' because it hasn't been activated yet.
- The old master, which is no longer the master and is not accessible to the hosts in the DC, is reported as 'Active'.

This set of storage domain states is illegal since there should never be an active storage domain in the DC while the master domain is inactive.

Version-Release number of selected component (if applicable):
rhev 3.5 vt13.8

How reproducible:
Always

Steps to Reproduce:
1. Have more than 1 storage domain in the DC
2. Block the master domain from all the hosts in the DC
3.

Actual results:
Reconstruct takes place and succeeds. There is a new master domain in the DC.

2015-01-20 12:41:25,589 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-7-thread-1) [798e2dff] FINISH, ReconstructMasterVDSCommand, log id: 612a9167

The engine reports the domain states as described above (see screenshots).

Expected results:
There should never be a situation in which there is no active master domain in the DC while there are non-master domains in state Active.

Additional info:
logs, screenshots and db_dump
Reducing severity since the issue doesn't cause any failure in the reconstruct flow.
Based on comment #1, it seems that the issue exists only in the UI display for several minutes and is then fixed.
(In reply to Tal Nisan from comment #2)
> Based on comment #1, it seems that the issue exists only in the UI display
> for several minutes and is then fixed.

Investigated by Liron, it doesn't seem to be a display issue after all.
(In reply to Elad from comment #3)
> (In reply to Tal Nisan from comment #2)
> > Based on comment #1, it seems that the issue exists only in the UI display
> > for several minutes and is then fixed.
>
> Investigated by Liron, it doesn't seem to be a display issue after all.

So what is the issue?
(In reply to Yaniv Dary from comment #4)
> So what is the issue?

As described above, while the new master domain is inactive, the old master is reported as active. This is a false positive report (it is possible to initiate operations on the problematic domain) and also an illegal set of domain states (a non-master domain cannot be active if there is no active master domain in the same DC).
While the problematic domain is wrongly reported as 'Active', it is possible to perform a reconstruct so that it will be elected as master. This operation fails, and afterwards the following event appears:

"Sync Error on Master Domain between Host purple-vds3 and oVirt Engine. Domain: vnx7 is marked as Master in oVirt Engine database but not on the Storage side. Please consult with Support on how to fix this issue."
Elad, is there an actual impact other than having to wait again for the domains to move to Inactive?

Returning the severity to medium, as this happens only on reconstruct when having Inactive domains.
Elad, can you try to reproduce it? I've tried multiple times and couldn't reproduce it - I do understand the issue from performing RCA, but I want to see how often this occurs. Thanks.
Created attachment 987009 [details]
logs-2.2

Reproduced, attaching the logs (the blocked domain is reported as active):

2015-02-02 08:01:49,498 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-61) Storage Domain fa6c7974-d693-4f6a-8314-4b2fa23f0c2b:vnx 1 was reported by Host green-vdsb as Active in Pool 00000002-0002-0002-0002-000000000002, moving to active status
Elad, thanks. I do understand the cause of this issue - I'd like to have some further info.
1. Please try to reproduce while disabling the StoragePoolDiskBackend config value for your datacenter version (with the 'original' storage used).
2. Can you please try to reproduce using a different storage machine? This issue has been reproduced only on that storage machine, AFAIK.
3. Please also try to reproduce using a 3.4 VDSM (same storage).
Created attachment 987048 [details]
logs (netapp)

The same happens with other storage servers (tried with a NetApp NFS server):

2015-02-02 09:53:00,496 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-84) Storage Domain b5ff7e46-dff3-4ce9-a9fc-e8c330b5b455:netapp1 was reported by Host green-vdsb as Active in Pool 00000002-0002-0002-0002-000000000002, moving to active status
Created attachment 987098 [details]
3.4 host on 3.4 DC

Tested the following matrix:
- 3.5 DC with 3.5 host - managed to reproduce
- 3.4 DC with 3.5 host - managed to reproduce
- 3.4 DC with 3.4 host - didn't manage to reproduce
It can be seen that the bug couldn't be reproduced (or is harder to reproduce) with VDSM 3.4, but was reproduced on VDSM 3.5 with both the StoragePoolDiskBackend and the StoragePoolMemoryBackend. I'm adding needinfo? on Elad to reproduce it with a 3.4 engine, but that's a separate matter.

The root cause of the problem is that the domain monitoring thread starts with mixed defaults, some of them positive. When the domain monitoring result is reported to the engine, the engine can receive this default result and move the domain to Active (as reported by VDSM), while immediately afterwards it is found that the domain is in trouble and its status eventually changes to Inactive.

The solution here should be to start the monitoring thread with a negative result and report back to the engine only when we know that the domain is OK. That will solve this issue and will also unify the domain monitoring result (currently some fields have a negative initial value and some have a positive one). Regardless of that, we can consider other reported fields in 3.6.

Adding needinfo? on amureini/nsoffer to hear their opinion on that.
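To make the race concrete, here is a minimal, hypothetical sketch (this is not the actual VDSM or engine code; all names are made up for illustration): the monitoring result starts out positive, so the engine's first poll can move the domain to Active before the first real check has run.

import time

class DomainMonitorStatus:
    def __init__(self):
        # Buggy default: the domain looks healthy before it was ever checked.
        self.valid = True
        self.last_check = 0.0

def engine_poll(status):
    # The engine treats a valid monitoring result as "domain is Active".
    return "Active" if status.valid else "Inactive"

def run_check(status, storage_reachable):
    # The first real check only happens after the monitoring thread starts.
    status.valid = storage_reachable
    status.last_check = time.time()

status = DomainMonitorStatus()
print(engine_poll(status))                  # "Active" - reported before any check ran
run_check(status, storage_reachable=False)
print(engine_poll(status))                  # "Inactive" - corrected after the first check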
(In reply to Liron Aravot from comment #14) > The root cause of the problem is that the domain monitoring thread is > starting with a mixed default of positive, when the domain monitoring result > is reported to the engine it could get this result and move the domains to > Active (as reported from vdsm), while immediately after it'll be found out > that the domain is in problem and it'll eventually change status to Inactive. > > The solution here should be to start the monitoring thread with negative > result and only when we know that the domain is OK report it back to the > engine. That'll solve that issue and will also unify the domain monitoring > result (currently some fields has negative initial result and some has > initial positive result). regardless of that, we can consider in 3.6 other > fields that are reported. I agree that returning valid status when we don't know the status yet is the root cause. In the past we tried to return special lastCheck value meaning "not checked yet" (see http://gerrit.ovirt.org/21649). This change broke host activation flow, so we reverted this back to "fake" lastCheck value (see http://gerrit.ovirt.org/21878). We can return an error until the domain is checked, but this may also break activation flow and older engines. So it seems that the right solution is to return a new "checked" key which will be False until the domain monitor is actually checked, and True after the first check was completed. Newer engine can use this information to avoid the issue in this bug and probably also in the activation flow.
(In reply to Nir Soffer from comment #15)
> (In reply to Liron Aravot from comment #14)
> > The root cause of the problem is that the domain monitoring thread starts
> > with mixed defaults, some of them positive. When the domain monitoring
> > result is reported to the engine, the engine can receive this default
> > result and move the domain to Active (as reported by VDSM), while
> > immediately afterwards it is found that the domain is in trouble and its
> > status eventually changes to Inactive.
> >
> > The solution here should be to start the monitoring thread with a
> > negative result and report back to the engine only when we know that the
> > domain is OK. That will solve this issue and will also unify the domain
> > monitoring result (currently some fields have a negative initial value
> > and some have a positive one). Regardless of that, we can consider other
> > reported fields in 3.6.
>
> I agree that returning a valid status when we don't know the status yet is
> the root cause.
>
> In the past we tried to return a special lastCheck value meaning "not
> checked yet" (see http://gerrit.ovirt.org/21649). This change broke the
> host activation flow, so we reverted back to a "fake" lastCheck value (see
> http://gerrit.ovirt.org/21878). We can return an error until the domain is
> checked, but this may also break the activation flow and older engines.
>
> So it seems that the right solution is to return a new "checked" key which
> will be False until the domain monitor has actually performed a check, and
> True after the first check has completed. A newer engine can use this
> information to avoid the issue in this bug and probably also in the
> activation flow.

+1 on that. The general idea should be that even if we can't fix older engines, we shouldn't break them any more than they already are, while newer engines can have better functionality.
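A rough sketch of the proposed direction (hypothetical field and function names, not the actual VDSM API): the monitor report carries a "checked" flag that stays False until the first real check completes; a newer engine ignores the result until checked is True, while older engines that don't know the key keep their current behaviour because the existing fields are unchanged.

def monitor_report(checked, valid):
    # Hypothetical report structure; the new "checked" key is the only point here.
    return {"valid": valid, "checked": checked}

def engine_domain_status(report):
    # Newer engine: never move a domain to Active based on an unchecked result.
    if not report.get("checked", True):
        return "Unknown"
    return "Active" if report["valid"] else "Inactive"

print(engine_domain_status(monitor_report(checked=False, valid=True)))   # Unknown
print(engine_domain_status(monitor_report(checked=True, valid=False)))   # Inactive
print(engine_domain_status(monitor_report(checked=True, valid=True)))    # Active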
When the reconstruct is finished, the problematic domain is reported as Inactive and the master gets activated successfully.

Verified using 3.6.0-1:
ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764.el6.noarch
vdsm-4.17.0-632.git19a83a2.el7.x86_64
In the latest master build I experienced the following issue: a storage domain was unable to become the new "master storage domain" (MSD) after another one (the previous MSD) was "isolated" (the connection to the old MSD's gluster server was interrupted).

Expected result:
The remaining storage domain should become the new MSD and not rely on the old MSD's availability.

Actually, I can see in the engine:
1) Both SDs are down (red), since the data center is also down.
2) The alive SD is labeled as "Data (Master)" and tries to become the new MSD, then it probably fails and turns back to just "Data", and the previous MSD tries the same.

Related VDSM logs:

Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::592::Storage.TaskManager.Task::(_updateState) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::moving from state init -> state preparing
Thread-32620::INFO::2015-04-28 16:34:02,508::logUtils::48::dispatcher::(wrapper) Run and protect: getAllTasksStatuses(spUUID=None, options=None)
Thread-32620::ERROR::2015-04-28 16:34:02,508::task::863::Storage.TaskManager.Task::(_setError) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 870, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2202, in getAllTasksStatuses
    raise se.SpmStatusError()
SpmStatusError: Not SPM: ()
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::882::Storage.TaskManager.Task::(_run) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Task._run: bf487090-8d62-4b42-bfde-93574a8e1486 () {} failed - stopping task
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::1214::Storage.TaskManager.Task::(stop) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::stopping in state preparing (force False)
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::990::Storage.TaskManager.Task::(_decref) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::ref 1 aborting True
Thread-32620::INFO::2015-04-28 16:34:02,508::task::1168::Storage.TaskManager.Task::(prepare) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::aborting: Task is aborted: 'Not SPM' - code 654
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::1173::Storage.TaskManager.Task::(prepare) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Prepare: aborted: Not SPM
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::990::Storage.TaskManager.Task::(_decref) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::ref 0 aborting True
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::925::Storage.TaskManager.Task::(_doAbort) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Task._doAbort: force False
Thread-32620::DEBUG::2015-04-28 16:34:02,508::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::592::Storage.TaskManager.Task::(_updateState) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::moving from state preparing -> state aborting
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::547::Storage.TaskManager.Task::(__state_aborting) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::_aborting: recover policy none
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::592::Storage.TaskManager.Task::(_updateState) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::moving from state aborting -> state failed
Thread-32620::DEBUG::2015-04-28 16:34:02,508::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-32620::DEBUG::2015-04-28 16:34:02,508::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-32620::ERROR::2015-04-28 16:34:02,509::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': 'Not SPM: ()', 'code': 654}}
Thread-32620::DEBUG::2015-04-28 16:34:02,509::stompReactor::158::yajsonrpc.StompServer::(send) Sending response
Thanks for your report, Christopher! Can you please open a new BZ and attach the logs from ovirt-engine and VDSM to it so this issue can be properly tracked?
Hi Allon,

We found out that the volume that was not able to become the new Master Storage Domain was just a replica-2 volume, which is not supported by oVirt. We will redo the setup and repeat the tests first.

Best regards,
Christopher
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE