Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 853984

Summary: [engine-core] engine gets confused and tries to send disconnect storage pool while host is SPM (Operation not allowed while SPM is active while host is SPM)
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: ovirt-engine    Assignee: Barak <bazulay>
Status: CLOSED WONTFIX QA Contact: Dafna Ron <dron>
Severity: high Docs Contact:
Priority: high    
Version: 3.1.0    CC: abaron, acathrow, amureini, bazulay, hateya, iheim, jkt, laravot, lpeer, mkenneth, Rhev-m-bugs, yeylon
Target Milestone: ---   
Target Release: 3.1.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-12-29 10:50:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs none

Description Dafna Ron 2012-09-03 13:43:36 UTC
Created attachment 609391 [details]
logs

Description of problem:

In a setup where the domains are made of LUNs from different storage servers (extended domains), I removed my hosts from one of the storage servers' access lists.
The backend tried sending DisconnectStoragePool to vdsm and got back an error that the operation is not allowed while SPM is active.
Looking with vdsClient, I can see that the host is SPM.
As a result, the host remains in the Up state and the backend keeps marking it as SPM/None in a loop.

Version-Release number of selected component (if applicable):

si16
vdsm-4.9.6-31.0.el6_3.x86_64

How reproducible:

100%

Steps to Reproduce:
1. create domains that have LUNs from different storage servers
2. remove the hosts from one of the storage servers' access lists
3.
  
Actual results:

The backend is sending DisconnectStoragePool to the vds marked as SPM and gets an error:
Operation not allowed while SPM is active

This runs in a loop.

Expected results:

We should release the SPM role.

Additional info: logs


Thread-11244::INFO::2012-09-03 14:49:04,482::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='f570527f-004a-4cab-8bee-129fa589bec5', hostID=2, scsiKey='f570527f-004a-4cab-8bee-129fa589bec5', remove=False, options=None)
Thread-11244::ERROR::2012-09-03 14:49:04,483::task::853::TaskManager.Task::(_setError) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 889, in disconnectStoragePool
    self.validateNotSPM(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 253, in validateNotSPM
    raise se.IsSpm(spUUID)
IsSpm: Operation not allowed while SPM is active: ('f570527f-004a-4cab-8bee-129fa589bec5',)
Thread-11244::DEBUG::2012-09-03 14:49:04,484::task::872::TaskManager.Task::(_run) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::Task._run: 8b3620a4-bf12-46e5-8f00-80797acb3fa3 ('f570527f-004a-4cab-8bee-129fa589bec5', 2, 'f570527f-004a-4cab-8bee-129fa589bec5', False) {} failed - stopping task
Thread-11244::DEBUG::2012-09-03 14:49:04,485::task::1199::TaskManager.Task::(stop) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::stopping in state preparing (force False)
Thread-11244::DEBUG::2012-09-03 14:49:04,486::task::978::TaskManager.Task::(_decref) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::ref 1 aborting True
Thread-11244::INFO::2012-09-03 14:49:04,486::task::1157::TaskManager.Task::(prepare) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::aborting: Task is aborted: 'Operation not allowed while SPM is active' - code 656
Thread-11244::DEBUG::2012-09-03 14:49:04,487::task::1162::TaskManager.Task::(prepare) Task=`8b3620a4-bf12-46e5-8f00-80797acb3fa3`::Prepare: aborted: Operation not allowed while SPM is active
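
The traceback above comes from a guard in vdsm's HSM: disconnectStoragePool calls validateNotSPM, which refuses to proceed while the host holds the SPM role for that pool. A minimal sketch of that guard (simplified and illustrative only, not vdsm's actual code; the class and attribute names here are assumptions):

```python
class IsSpm(Exception):
    """Raised when an operation is not allowed while SPM is active."""
    def __init__(self, spUUID):
        super().__init__(
            "Operation not allowed while SPM is active: (%r,)" % spUUID)


class Hsm:
    """Simplified stand-in for vdsm's HSM object."""

    def __init__(self):
        # UUID of the pool this host is currently SPM for, or None
        self._spm_pool = None

    def validateNotSPM(self, spUUID):
        # HSM-only operations must not run while this host is SPM
        if self._spm_pool == spUUID:
            raise IsSpm(spUUID)

    def disconnectStoragePool(self, spUUID, hostID, scsiKey, remove=False):
        self.validateNotSPM(spUUID)
        # ... the actual disconnect logic would follow here ...
        return True
```

This is why the engine's repeated DisconnectStoragePool attempts can never succeed: as long as the host believes it is SPM, every call aborts with the same error, producing the loop described above.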


vdsClient output on the host selected as SPM:

Comment 1 Dafna Ron 2012-09-03 13:44:52 UTC
host selected as SPM:

[root@gold-vdsd tmp]# vdsClient -s 0 getSpmStatus f570527f-004a-4cab-8bee-129fa589bec5
	spmId = 2
	spmStatus = SPM
	spmLver = 9

second Host:

[root@localhost tmp]# vdsClient -s 0 getSpmStatus f570527f-004a-4cab-8bee-129fa589bec5
	spmId = 2
	spmStatus = Free
	spmLver = 9

Comment 2 Dafna Ron 2012-09-03 15:00:37 UTC
Moving to the manager, since DisconnectStoragePool should not have been sent by the manager.

Comment 3 Liron Aravot 2012-10-21 15:19:06 UTC
http://gerrit.ovirt.org/#/c/8691/

Comment 4 Liron Aravot 2012-10-23 11:31:15 UTC
This is an async task manager bug: finished tasks on the SPM aren't cleared although they should be. (It seems the async task manager doesn't poll those tasks for some reason.)
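
The failure mode comment 4 describes can be sketched as a simple polling loop: the task manager must periodically poll each tracked SPM task and clear the ones that finished; a task that is never polled is never cleared. This is an illustrative sketch only, with hypothetical names, not the actual ovirt-engine code:

```python
FINISHED = "finished"


class AsyncTaskManager:
    """Toy model of an engine-side async task manager."""

    def __init__(self):
        self._tasks = {}  # task_id -> last known status

    def add_task(self, task_id, status="running"):
        self._tasks[task_id] = status

    def poll_and_clear(self, get_status):
        """Poll every tracked task and clear the finished ones.

        If this method is never called for a task (the bug described in
        comment 4), the finished task stays tracked forever.
        """
        for task_id in list(self._tasks):
            status = get_status(task_id)
            self._tasks[task_id] = status
            if status == FINISHED:
                # a real implementation would clearTask on the SPM here
                del self._tasks[task_id]

    def tracked(self):
        return set(self._tasks)
```

Under this model, a finished task is removed only when polled; skipping the poll leaves stale tasks behind, matching the behavior reported here.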

Comment 9 Barak 2012-11-18 15:47:32 UTC
I don't understand why VDSM did not self-fence.

Comment 10 Barak 2012-11-28 10:48:48 UTC
Per the description above (comment #7), it looks like VDSM did not detect that it could not see the entire pool.
In that case, it looks like VDSM should restart itself.

Comment 12 Ayal Baron 2012-12-17 07:06:20 UTC
It did not fence itself because it had access to the disk and was able to update its lease. The disk that was missing did not contain the lease, hence no problem keeping the cluster lock.
In this situation the engine can decide whether it would like to fence the node or, better yet, just stop the SPM and start it on a different node.
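
Comment 12's reasoning reduces to a small decision: fencing is only warranted when the host can no longer update its lease; otherwise the engine should just move the SPM role. A hedged sketch of that decision (purely illustrative; the function and return values are assumptions, not engine code):

```python
def handle_faulty_spm(can_update_lease, missing_disk_holds_lease):
    """Illustrative decision logic for a faulty SPM host.

    If the host can still update its lease (the missing disk did not
    hold it), the cluster lock is safe and fencing is unnecessary:
    stop the SPM role and start it on another host instead.
    """
    if can_update_lease and not missing_disk_holds_lease:
        return "spmStop, then spmStart on another host"
    return "fence the node"
```

In the scenario reported here, the missing LUN did not hold the lease, so the first branch applies: no self-fence, and the cleaner recovery is to relocate the SPM.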

Comment 14 Simon Grinberg 2012-12-23 14:00:01 UTC
Dafna, just to make sure here. 

Does the engine detect that the storage is faulty? Is this the reason for the disconnect attempt?

Comment 15 Dafna Ron 2012-12-23 14:31:51 UTC
(In reply to comment #14)
> Dafna, just to make sure here. 
> 
> Does the engine detect that the storage is faulty? Is this the reason for
> the disconnect attempt?

yes.

Comment 16 Itamar Heim 2013-12-29 10:50:38 UTC
Closing old bugs. If this issue is still relevant/important in current version, please re-open the bug.