Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 848431

Summary: ovirt-engine-backend: there are tasks on spm will cause SpmStop to fail in loop (logical deadlock)
Product: Red Hat Enterprise Virtualization Manager
Reporter: Dafna Ron <dron>
Component: ovirt-engine
Assignee: mkublin <mkublin>
Status: CLOSED WORKSFORME
QA Contact: Dafna Ron <dron>
Severity: high
Docs Contact:
Priority: medium
Version: unspecified
CC: bazulay, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, sgrinber, yeylon, ykaul, yzaslavs
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-28 10:59:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
logs (flags: none)

Description Dafna Ron 2012-08-15 15:05:14 UTC
Created attachment 604626 [details]
logs

Description of problem:

During VdsNotRespondingTreatmentCommand we send a fence command.
If the fence fails, the irsproxy is reset.
Once the irsproxy has been reset, the async task manager can no longer clear tasks on the SPM.
Since tasks remain on the SPM (in my case already finished), SpmStop is never sent to the SPM.
As a result, I have a host that is NonOperational but still the SPM, and the SpmStop command keeps failing in a loop.

- Even putting the host into maintenance will not clear the tasks; only a reboot or a manual stopTask/clearTask will end the loop.
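The failure chain above can be sketched as a small simulation. This is not oVirt code; all names here are hypothetical illustrations of the flow: a reset proxy makes task cleanup fail, leftover tasks make SpmStop fail, and nothing in the loop ever changes either condition.

```java
// Minimal sketch (not ovirt-engine source) of the logical deadlock:
// fence fails -> irsproxy reset -> tasks cannot be cleared -> SpmStop
// refuses to run -> retry forever. All identifiers are illustrative.
public class SpmStopLoopSketch {
    static boolean irsProxyReset = false;
    static int tasksOnSpm = 2; // finished tasks still reported by the SPM

    static boolean clearTasks() {
        // The async task manager cannot reach a reset proxy,
        // so the tasks on the SPM are never cleared.
        if (irsProxyReset) {
            return false;
        }
        tasksOnSpm = 0;
        return true;
    }

    static boolean spmStop() {
        // SpmStop only proceeds when the SPM reports no tasks.
        return tasksOnSpm == 0;
    }

    public static void main(String[] args) {
        irsProxyReset = true; // fence failed, proxy was reset
        for (int attempt = 1; attempt <= 3; attempt++) {
            clearTasks(); // always fails while the proxy is reset
            System.out.println("attempt " + attempt
                    + ": SpmStop succeeded = " + spmStop());
        }
        // Every attempt prints false: the loop cannot terminate on its
        // own, matching the endless SpmStop failures in the logs below.
    }
}
```

Only an external action (reboot, manual stopTask/clearTask) changes the state, which is why the reporter calls it a logical deadlock.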

Version-Release number of selected component (if applicable):

si13.3

How reproducible:


Steps to Reproduce:

Create the following setup:

iSCSI storage with two hosts - do not configure power management, or make sure it will not work.
Several domains from two storage servers + extend the master domain with LUNs from both servers.
Attach an ISO domain.
Make sure one of the storage servers works with multipath.
Create and run several VMs.
Create a preallocated 20GB template.
Before you start the test, create a new server (not desktop) VM from the template.

1. Block connectivity to the storage domain that works with multipath.
2. Wait 15 minutes (the storage will get latency errors and go up and down, but will not release the SPM).
3. Using iscsiadm, disconnect the blocked storage.

  
Actual results:

We try to fence the host and fail -> as a result, the irsproxy is reset.
The async task manager fails to clear the tasks because the irsproxy was reset.
SpmStop fails because there are running tasks on the host, and it runs in an endless loop.

Expected results:

We should be able to clear the tasks from the SPM so that a new SPM can be elected.

Additional info: full engine and vdsm logs


loop: 

2012-08-15 17:42:01,241 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-29) FINISH, SpmStopVDSCommand, log id: 2f998a98
2012-08-15 17:42:01,241 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-29) spm stop on spm failed, stopping spm selection!
2012-08-15 17:42:11,267 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) hostFromVds::selectedVds - gold-vdsc, spmStatus Free, storage pool iSCSI
2012-08-15 17:42:11,294 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) SpmStatus on vds 9666ae2a-e61f-11e1-97a6-001a4a169741: SPM
2012-08-15 17:42:11,294 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) Host reports to be SPM but is not up. 9666ae2a-e61f-11e1-97a6-001a4a169741
2012-08-15 17:42:11,296 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) SPM selection - vds seems as spm gold-vdsd
2012-08-15 17:42:11,297 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-98) START, SpmStopVDSCommand(vdsId = 9666ae2a-e61f-11e1-97a6-001a4a169741, storagePoolId = 66b35d07-24eb-4bdc-952d-1cf7144f71ab), log id: 5915a019

Comment 6 mkublin 2012-08-26 13:15:45 UTC
I made a mistake in my first analysis of the problem.
The described problem is not related to fencing (it can possibly also happen because of fencing, but that is not the case here).
The problem is the following:
1. We failed at SpmStatus; this process logged:
2012-08-15 16:20:54,461 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-40) Command SpmStatusVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)

Such an error resets the irs proxy, see IrsBrokerCommand.ProceedStoragePoolStats.
AsyncTaskManager tried to clear some tasks on the SPM and failed, so the status of the tasks became ClearedFailed.
After some time, the task table was cleaned:
2012-08-15 16:25:56,215 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-52) Setting new tasks map. The map contains now 0 tasks
2012-08-15 16:25:56,215 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-52) Cleared all tasks of pool 66b35d07-24eb-4bdc-952d-1cf7144f71ab.
2012-08-15 16:25:56,217 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-52) Could not find vds that is spm and non-operational.

The tasks were left on the SPM even though they had finished, and in that case SpmStop will never succeed. Look at the code in SpmStopVDSCommand.

By the way, another interesting error occurred and can be seen in the log:
2012-08-15 16:25:56,032 ERROR [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-81) Getting existing tasks on Storage Pool iSCSI failed.: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Cannot allocate IRS server

It means we got a storage pool up event but did not succeed in adding any tasks,
so if there are tasks, in some cases we will see the same loop behaviour.
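The guard comment 6 points at can be sketched as follows. This is a hypothetical illustration, not the actual SpmStopVDSCommand source: the engine's own task table may already be empty (as in the "0 tasks" log line above), but the stop decision is driven by the tasks the host itself reports, so finished-but-uncleared tasks block the stop forever.

```java
import java.util.Map;

// Hypothetical sketch of the SpmStop guard described in comment 6.
// All names are illustrative, not ovirt-engine identifiers.
public class SpmStopGuardSketch {
    enum TaskState { RUNNING, FINISHED }

    // Stopping the SPM is allowed only if the host reports no tasks.
    // Note: FINISHED tasks block the stop just like RUNNING ones -
    // the guard checks presence, not state.
    static boolean canStopSpm(Map<String, TaskState> tasksReportedByHost) {
        return tasksReportedByHost.isEmpty();
    }

    public static void main(String[] args) {
        // Finished tasks that were never cleared (ClearedFailed case):
        Map<String, TaskState> hostTasks = Map.of(
                "task-1", SpmStopGuardSketch.TaskState.FINISHED,
                "task-2", SpmStopGuardSketch.TaskState.FINISHED);
        System.out.println("SpmStop allowed: " + canStopSpm(hostTasks));
        System.out.println("SpmStop allowed: " + canStopSpm(Map.of()));
    }
}
```

With the engine's table cleaned but the host still reporting tasks, the first check stays false on every retry, which reproduces the loop in the log excerpt above.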

Comment 7 Barak 2012-09-24 08:48:08 UTC
Dafna,

How often does this happen?

We find the scenario a bit complicated to reproduce.

Considering rhevm-future.

Comment 8 Dafna Ron 2012-09-24 13:21:17 UTC
I have encountered this bug only in this scenario.

Comment 10 mkublin 2012-12-04 15:21:04 UTC
This should be fixed. I solved some possible races during cleanup of IrsBrokerCommand. There were a couple of patches; one of them is http://gerrit.ovirt.org/#/c/9116/ - it fixed a connection leak and nullifies all proxies together.
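A minimal sketch of the "nullify all proxies together" idea, assuming the fix works roughly as comment 10 describes: teardown happens under one lock, so no caller can observe a half-reset proxy map. The class and method names are illustrative, not the actual ovirt-engine code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not ovirt-engine source): resetting every proxy
// atomically under a single lock, so concurrent readers see either the
// full map or an empty one, never a partially cleared state.
public class ProxyResetSketch {
    private final Map<String, Object> proxies = new HashMap<>();
    private final Object lock = new Object();

    void register(String poolId, Object proxy) {
        synchronized (lock) {
            proxies.put(poolId, proxy);
        }
    }

    // Drop all proxies in one critical section instead of one by one,
    // closing the race window between individual resets.
    void resetAll() {
        synchronized (lock) {
            proxies.clear();
        }
    }

    Object get(String poolId) {
        synchronized (lock) {
            return proxies.get(poolId);
        }
    }
}
```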

Comment 11 Barak 2013-02-28 10:59:52 UTC
It looks like this can't be reproduced.
In addition, many patches related to the above issue were accepted and changed the flow described above (e.g. comment #10).

Changing status to CLOSED WORKSFORME.