Description of problem:
My case:
- 2 hosts: host A (SPM) and host B (HSM); host B is not connected (in maintenance)
- host A went non-responsive
- activate host B; connectStoragePool fails because it cannot find the master domain
- using web-admin, right-click on host A >> 'confirm host has been rebooted'
The action fails on the vdsm side with the following error:
Thread-1909::INFO::2012-07-17 16:49:37,742::logUtils::37::dispatcher::(wrapper) Run and protect: fenceSpmStorage(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', lastOwner=None, lastLver=None, options=None)
Thread-1909::ERROR::2012-07-17 16:49:37,742::task::853::TaskManager.Task::(_setError) Task=`bd036ea7-e322-4fe2-b51a-b014af6763cc`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2976, in fenceSpmStorage
    pool = self.getPool(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 263, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('cacbdf16-d006-11e1-b98a-001a4a16970e',)
The only option left in this scenario is for reconstruct to be called on the available domains; fenceSpm() will succeed afterwards.
Indeed, when only one host is available (being activated) and the connect failed, the engine can try to reconstruct.
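To make the suggested flow concrete, here is a rough sketch of the fallback: try to fence, and if the host reports the pool as unknown, reconstruct first and then retry the fence. This is not engine or vdsm code. The exception class is a local stand-in for the se.StoragePoolUnknown seen in the traceback, and fence_spm_storage, reconstruct_master and FakeHost are hypothetical names used only for illustration.

```python
class StoragePoolUnknown(Exception):
    """Local stand-in for the StoragePoolUnknown error in the traceback."""


def fence_with_reconstruct_fallback(host, sp_uuid):
    """Fence the SPM; if the pool is not connected, reconstruct and retry.

    `host` is a hypothetical proxy object, not a real vdsm or engine API.
    """
    try:
        return host.fence_spm_storage(sp_uuid)
    except StoragePoolUnknown:
        # The pool is not connected on this host, so fencing cannot work yet.
        # Reconstruct on the available domains, then retry the fence once.
        host.reconstruct_master(sp_uuid)
        return host.fence_spm_storage(sp_uuid)


class FakeHost:
    """Toy host: fencing fails until a reconstruct has been performed."""

    def __init__(self):
        self.connected = False
        self.calls = []

    def fence_spm_storage(self, sp_uuid):
        self.calls.append("fence")
        if not self.connected:
            raise StoragePoolUnknown(sp_uuid)
        return "fenced"

    def reconstruct_master(self, sp_uuid):
        self.calls.append("reconstruct")
        self.connected = True


if __name__ == "__main__":
    host = FakeHost()
    print(fence_with_reconstruct_fallback(host, "cacbdf16-d006-11e1-b98a-001a4a16970e"))
    print(host.calls)
```

The retry is attempted only once on purpose: if the fence still fails after a reconstruct, the error propagates to the caller instead of looping.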
Comment 4 Federico Simoncelli
2012-09-27 09:52:54 UTC
I'm not sure what flow/use case fenceSpm was meant to cover, but now I can't see its point. If none of the active hosts can reach the master domain, we should elect one to run reconstruct (as suggested in comment 3).
From checking the code and trying to reproduce, it appears that the system does fix itself after a couple of minutes.
Considering the above, and the fact that changing the behaviour requires really complex and risky changes, I will close the bug as wontfix unless you can show that the system does indeed reach a state it cannot get out of.
(In reply to comment #5)
> From checking the code + trying to reproduce it appears that the system does
> fix itself after a couple of minutes.
> Considering the above and the fact that changing the behaviour requires
> really complex and risky changes I will close the bug as wontfix unless you
> can show that the system does indeed reach a state it cannot get out of.
Well, it can't. The only thing I can do is run the recovery procedure, which will destroy my other domains.