Bug 853984
| Field | Value |
|---|---|
| Summary: | [engine-core] engine gets confused and tries to send disconnect storage pool while host is SPM (Operation not allowed while SPM is active while host is SPM) |
| Product: | Red Hat Enterprise Virtualization Manager |
| Reporter: | Dafna Ron <dron> |
| Component: | ovirt-engine |
| Assignee: | Barak <bazulay> |
| Status: | CLOSED WONTFIX |
| QA Contact: | Dafna Ron <dron> |
| Severity: | high |
| Priority: | high |
| Version: | 3.1.0 |
| CC: | abaron, acathrow, amureini, bazulay, hateya, iheim, jkt, laravot, lpeer, mkenneth, Rhev-m-bugs, yeylon |
| Target Milestone: | --- |
| Target Release: | 3.1.0 |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | storage |
| Doc Type: | Bug Fix |
| Last Closed: | 2013-12-29 10:50:38 UTC |
| Type: | Bug |
| oVirt Team: | Storage |
vdsClient output on the host selected as SPM:

```
[root@gold-vdsd tmp]# vdsClient -s 0 getSpmStatus f570527f-004a-4cab-8bee-129fa589bec5
	spmId = 2
	spmStatus = SPM
	spmLver = 9
```

Second host:

```
[root@localhost tmp]# vdsClient -s 0 getSpmStatus f570527f-004a-4cab-8bee-129fa589bec5
	spmId = 2
	spmStatus = Free
	spmLver = 9
```

Moving to manager, since the DisconnectStoragePool should not be sent by the manager.

This is an async task manager bug: the finished tasks on the SPM are not cleared, although they should be (it seems the async task manager does not poll those tasks for some reason).

I don't understand why VDSM did not self-fence. Per the description above (comment #7), it looks like VDSM did not detect that it cannot see the entire pool. In that case VDSM should restart itself.

It did not fence itself because it still had access to the disk and was able to update its lease. The disk that went missing did not contain the lease, so there was no problem keeping the cluster lock. In this situation the engine can decide whether it wants to fence the node or, better yet, just stop the SPM and start it on a different node.

Dafna, just to make sure: does the engine detect that the storage is faulty? Is that the reason for the disconnect attempt?

(In reply to comment #14)
> Dafna, just to make sure: does the engine detect that the storage is faulty?
> Is that the reason for the disconnect attempt?

Yes.

Closing old bugs. If this issue is still relevant/important in the current version, please re-open the bug.
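The fencing behavior discussed above (the host keeps the SPM role as long as it can renew its lease, even if another LUN in the pool disappears) can be sketched roughly as follows. This is a hypothetical illustration, not VDSM code; `should_self_fence` and the LUN names are invented for the example:

```python
# Hedged sketch of the comment thread's reasoning: a host self-fences only
# when it can no longer renew the cluster lease, so losing a non-lease LUN
# leaves the SPM lease intact and the host stays up.

def should_self_fence(accessible_luns, lease_lun):
    """True only if the LUN holding the lease is no longer reachable."""
    return lease_lun not in accessible_luns

# The pool lost "lun2", but the lease lives on "lun1": no self-fence occurs,
# matching the situation described in the comments.
print(should_self_fence({"lun1"}, "lun1"))  # False: lease still renewable
print(should_self_fence({"lun1"}, "lun2"))  # True: lease lost, host would fence
```

This matches the suggestion in the thread that, since the host itself sees no reason to fence, the engine is the component that should decide whether to fence the node or move the SPM elsewhere.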
Created attachment 609391 [details]
logs

Description of problem:
In a setup where the domains are made of LUNs from different storage servers (extended), I removed my hosts from one of the storage servers' access lists. The backend tried sending DisconnectStoragePool to VDSM and got a reply from VDSM that the operation is not allowed because the host is SPM. Looking with vdsClient, I can see that the host is SPM. As a result, the host remains in Up state and the backend keeps marking it as SPM/None in a loop.

Version-Release number of selected component (if applicable):
si16
vdsm-4.9.6-31.0.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create domains which have LUNs from different storage servers.
2. Remove the hosts from one of the storage servers' access lists.

Actual results:
The backend sends DisconnectStoragePool to the VDS marked as SPM and gets an error: "Operation not allowed while SPM is active". This runs in a loop.

Expected results:
We should release SPM.

Additional info: logs

```
Thread-11244::INFO::2012-09-03 14:49:04,482::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='f570527f-004a-4cab-8bee-129fa589bec5', hostID=2, scsiKey='f570527f-004a-4cab-8bee-129fa589bec5', remove=False, options=None)
Thread-11244::ERROR::2012-09-03 14:49:04,483::task::853::TaskManager.Task::(_setError) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 889, in disconnectStoragePool
    self.validateNotSPM(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 253, in validateNotSPM
    raise se.IsSpm(spUUID)
IsSpm: Operation not allowed while SPM is active: ('f570527f-004a-4cab-8bee-129fa589bec5',)
Thread-11244::DEBUG::2012-09-03 14:49:04,484::task::872::TaskManager.Task::(_run) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::Task._run: 8b3620a4-bf12-46e5-8f00-80797acb3fa3 ('f570527f-004a-4cab-8bee-129fa589bec5', 2, 'f570527f-004a-4cab-8bee-129fa589bec5', False) {} failed - stopping task
Thread-11244::DEBUG::2012-09-03 14:49:04,485::task::1199::TaskManager.Task::(stop) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::stopping in state preparing (force False)
Thread-11244::DEBUG::2012-09-03 14:49:04,486::task::978::TaskManager.Task::(_decref) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::ref 1 aborting True
Thread-11244::INFO::2012-09-03 14:49:04,486::task::1157::TaskManager.Task::(prepare) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::aborting: Task is aborted: 'Operation not allowed while SPM is active' - code 656
Thread-11244::DEBUG::2012-09-03 14:49:04,487::task::1162::TaskManager.Task::(prepare) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::Prepare: aborted: Operation not allowed while SPM is active
```
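The traceback above shows `disconnectStoragePool` failing inside `validateNotSPM`, which raises `IsSpm` (code 656) while the host still holds the SPM role. A minimal sketch of that guard, using stand-in classes rather than the real VDSM `hsm` module (the class bodies and the `_spm_pools` attribute are assumptions made for illustration), could look like this:

```python
# Hedged mock of the guard seen in the traceback: disconnectStoragePool
# refuses to run while the host is SPM for the given pool.

class IsSpm(Exception):
    code = 656  # error code shown in the VDSM log

class MockHsm:
    def __init__(self, spm_pools):
        # Pools for which this host currently holds the SPM role.
        self._spm_pools = set(spm_pools)

    def validateNotSPM(self, spUUID):
        if spUUID in self._spm_pools:
            raise IsSpm("Operation not allowed while SPM is active: %r" % spUUID)

    def disconnectStoragePool(self, spUUID):
        self.validateNotSPM(spUUID)  # the engine's request aborts here
        self._spm_pools.discard(spUUID)
        return True

hsm = MockHsm(["f570527f-004a-4cab-8bee-129fa589bec5"])
try:
    hsm.disconnectStoragePool("f570527f-004a-4cab-8bee-129fa589bec5")
except IsSpm as e:
    print("aborted with code", e.code)  # aborted with code 656
```

Because the engine retries the same call without first stopping the SPM (or without the SPM's finished tasks being cleared, per the comments), the guard fires on every attempt, which is the loop described in the report.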