
Bug 840838

Summary: [rhevm] [engine-core] reconstruct should be called in case fenceSpm() failed
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.1.0
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Target Release: 3.1.5
Hardware: x86_64
OS: Linux
Whiteboard: storage
oVirt Team: Storage
Type: Bug
Doc Type: Bug Fix
Reporter: Haim <hateya>
Assignee: Liron Aravot <laravot>
QA Contact: Haim <hateya>
CC: abaron, amureini, dyasny, ewarszaw, fsimonce, hateya, iheim, lpeer, Rhev-m-bugs, sgrinber, yeylon, ykaul
Last Closed: 2012-11-07 10:00:29 UTC
Attachments: engine and vdsm logs (flags: none)

Description Haim 2012-07-17 11:03:05 UTC
Description of problem:

My case:

- 2 hosts: host A (SPM) and host B (HSM), with host B not connected (in maintenance)
- host A went non-responsive
- activated host B; connectStoragePool fails because it cannot find the master domain
- using web-admin, right-click on host A >> 'confirm host has been rebooted'

The action fails on the vdsm side with the following error:

Thread-1909::INFO::2012-07-17 16:49:37,742::logUtils::37::dispatcher::(wrapper) Run and protect: fenceSpmStorage(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', lastOwner=None, lastLver=None, options=None)
Thread-1909::ERROR::2012-07-17 16:49:37,742::task::853::TaskManager.Task::(_setError) Task=`bd036ea7-e322-4fe2-b51a-b014af6763cc`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2976, in fenceSpmStorage
    pool = self.getPool(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 263, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('cacbdf16-d006-11e1-b98a-001a4a16970e',)

The only option left in this scenario is for reconstruct to be called on the available domains; fenceSpm() will succeed afterwards.
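
A rough sketch of the requested flow, to make the proposal concrete. This is not engine or vdsm code: `fence_spm` and `reconstruct_master` are hypothetical callables standing in for however the engine issues those verbs, `StoragePoolUnknown` is a local stand-in for the exception in the traceback above, and the fake host simply assumes (as this report does) that reconstruct leaves the pool connected.

```python
class StoragePoolUnknown(Exception):
    """Local stand-in for the StoragePoolUnknown error seen in the traceback."""


def fence_spm_with_reconstruct(host, sp_uuid, fence_spm, reconstruct_master):
    """Try fenceSpm; if the pool is not connected, reconstruct and retry once.

    fence_spm and reconstruct_master are hypothetical callables taking
    (host, sp_uuid); they are assumptions for illustration only.
    """
    try:
        return fence_spm(host, sp_uuid)
    except StoragePoolUnknown:
        # Pool is unknown to the host: fall back to reconstruct, then retry.
        reconstruct_master(host, sp_uuid)
        return fence_spm(host, sp_uuid)


class FakeHost:
    """Toy host that only tracks connected pools and the calls it received."""

    def __init__(self):
        self.connected_pools = set()
        self.calls = []


def fake_fence(host, sp_uuid):
    host.calls.append("fenceSpm")
    if sp_uuid not in host.connected_pools:
        raise StoragePoolUnknown(sp_uuid)
    return "fenced"


def fake_reconstruct(host, sp_uuid):
    # Assumption: a successful reconstruct leaves the pool connected.
    host.calls.append("reconstruct")
    host.connected_pools.add(sp_uuid)


host_b = FakeHost()
result = fence_spm_with_reconstruct(
    host_b, "cacbdf16-d006-11e1-b98a-001a4a16970e", fake_fence, fake_reconstruct
)
print(result, host_b.calls)
```

The retry happens only once, so a host that still does not know the pool after reconstruct surfaces the original error instead of looping.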

Comment 1 Haim 2012-07-17 11:09:14 UTC
Created attachment 598605 [details]
engine and vdsm logs

Comment 3 Ayal Baron 2012-08-06 09:54:53 UTC
Indeed, in case only 1 host is available (being activated) and connect failed, engine can try to reconstruct.

Comment 4 Federico Simoncelli 2012-09-27 09:52:54 UTC
I'm not sure what flow/use-case fenceSpm was meant to cover, but now I can't see its point. If all the active hosts can't reach the master domain we should elect one to use reconstruct (as suggested in comment 3).
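
A minimal sketch of that election rule, under stated assumptions: hosts are plain names, `can_reach_master` is a hypothetical predicate, and picking the first active host is an arbitrary policy chosen only for illustration.

```python
def pick_host_for_reconstruct(active_hosts, can_reach_master):
    """Return the host that should run reconstruct, or None.

    active_hosts: list of host names (hypothetical representation).
    can_reach_master: hypothetical predicate, host -> bool.
    """
    if not active_hosts:
        return None  # no active host to elect
    if any(can_reach_master(h) for h in active_hosts):
        return None  # some host still sees the master domain; no reconstruct
    # Election policy is an assumption: simply take the first active host.
    return active_hosts[0]


# Scenario from this bug: only host B is active and it cannot find the master.
print(pick_host_for_reconstruct(["hostB"], lambda h: False))
```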

Comment 5 Ayal Baron 2012-09-27 10:11:37 UTC
From checking the code + trying to reproduce it appears that the system does fix itself after a couple of minutes.
Considering the above and the fact that changing the behaviour requires really complex and risky changes I will close the bug as wontfix unless you can show that the system does indeed reach a state it cannot get out of.

Comment 6 Haim 2012-10-11 11:35:23 UTC
(In reply to comment #5)
> From checking the code + trying to reproduce it appears that the system does
> fix itself after a couple of minutes.
> Considering the above and the fact that changing the behaviour requires
> really complex and risky changes I will close the bug as wontfix unless you
> can show that the system does indeed reach a state it cannot get out of.

Well, it can't. The only thing I can do is the recovery procedure, which will destroy my other domains.