Bug 848431
| Summary: | ovirt-engine-backend: there are tasks on spm will cause SpmStop to fail in loop (logical deadlock) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Dafna Ron <dron> |
| Component: | ovirt-engine | Assignee: | mkublin <mkublin> |
| Status: | CLOSED WORKSFORME | QA Contact: | Dafna Ron <dron> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | unspecified | CC: | bazulay, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, sgrinber, yeylon, ykaul, yzaslavs |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | infra | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-02-28 10:59:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
I made a mistake during my first analysis of the problem.
The described problem is not related to fencing (it may also be triggered by a fence failure, but that is not the case here).
The problem is the following:
1. The SPM status check failed:
2012-08-15 16:20:54,461 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-40) Command SpmStatusVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
Such an error resets the irs proxy; see IrsBrokerCommand.ProceedStoragePoolStats.
AsyncTaskManager tried to clean some tasks on the SPM and failed, so the status of those tasks became ClearedFailed.
After some time the task table was cleaned:
2012-08-15 16:25:56,215 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-52) Setting new tasks map. The map contains now 0 tasks
2012-08-15 16:25:56,215 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-52) Cleared all tasks of pool 66b35d07-24eb-4bdc-952d-1cf7144f71ab.
2012-08-15 16:25:56,217 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-52) Could not find vds that is spm and non-operational.
The tasks were left on the SPM even though they had finished; in that case SpmStop will never succeed. Look at the code in SpmStopVDSCommand.
By the way, another interesting error occurred and can be seen in the log:
2012-08-15 16:25:56,032 ERROR [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-81) Getting existing tasks on Storage Pool iSCSI failed.: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Cannot allocate IRS server
It means we got a storage pool up event but did not succeed in adding any tasks,
so if tasks exist in such cases we will get the same loop behaviour.
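The loop described in this comment can be modelled with a small sketch. Everything here is hypothetical: `SpmHost`, `Engine` and `attempts_until_spm_stop` are illustrative names, not ovirt-engine or VDSM classes. The model also compresses the timing: the engine forgets a task in the same pass in which the clear fails, whereas in the log the task map was emptied some minutes later. It assumes, as the comment states, that SpmStop is refused while any task (even a finished one) remains on the host.

```python
from dataclasses import dataclass, field


@dataclass
class SpmHost:
    """Hypothetical stand-in for the SPM host (not a VDSM API)."""
    tasks: set = field(default_factory=set)
    is_spm: bool = True

    def clear_task(self, task_id: str) -> None:
        self.tasks.discard(task_id)

    def spm_stop(self) -> bool:
        # Assumption from the comment: the stop is refused while any
        # task, even a finished one, is still present on the host.
        if self.tasks:
            return False
        self.is_spm = False
        return True


@dataclass
class Engine:
    """Hypothetical stand-in for the engine-side task bookkeeping."""
    known_tasks: set = field(default_factory=set)
    irs_proxy_available: bool = True

    def clean_tasks(self, host: SpmHost) -> None:
        for task_id in list(self.known_tasks):
            if self.irs_proxy_available:
                host.clear_task(task_id)
            # With the proxy reset the clear never reaches the host
            # (ClearedFailed), yet the engine still drops the task
            # from its own map, so it will not retry later.
            self.known_tasks.discard(task_id)


def attempts_until_spm_stop(engine: Engine, host: SpmHost, max_attempts: int = 5):
    """Return the attempt number on which SpmStop succeeds, or None."""
    for attempt in range(1, max_attempts + 1):
        engine.clean_tasks(host)
        if host.spm_stop():
            return attempt
    return None


# Proxy available: the task is cleared and the first stop succeeds.
healthy = attempts_until_spm_stop(Engine({"t1"}, True), SpmHost({"t1"}))

# Proxy reset: the engine forgets the task, the host keeps it, and
# every later SpmStop attempt is refused.
stuck_host = SpmHost({"t1"})
stuck = attempts_until_spm_stop(Engine({"t1"}, False), stuck_host)
```

In the second case the only way out in this model is to remove the task from the host directly, which matches the report that only a reboot or a manual stopTask/clearTask ends the loop.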
Dafna, how often does this happen? We find the scenario a bit complicated to reproduce. Considering rhevm-future.
I have encountered this bug only in this scenario.
Should be fixed. I solved some possible races during the cleanup of IrsBrokerCommand. There were a couple of patches; one of them is http://gerrit.ovirt.org/#/c/9116/, which fixed a connection leak and nullifies all proxies together.
It looks like this can't be reproduced. Since many patches related to the above issue were accepted and changed the flow described above (e.g. comment #10), changing status to CLOSED WORKSFORME.
Created attachment 604626 [details]
logs

Description of problem:
During VdsNotRespondingTreatmentCommand we send a fence command. If the fence fails, we reset the irsproxy. Once the irsproxy has been reset, the async task manager is not able to clear tasks on the SPM. Since there are running tasks on the SPM (in my case finished), SpmStop will not be sent to the SPM. As a result I have a host which is NonOperational but still SPM, and we keep failing the SpmStop command in a loop.
- Even putting the host in maintenance will not clear the task; only a reboot or a manual stopTask/clearTask will end the loop.

Version-Release number of selected component (if applicable):
si13.3

How reproducible:

Steps to Reproduce:
Create the following setup:
iSCSI storage with two hosts; do not configure power management, or make sure it will not work.
Have several domains from two storage servers, and extend the master domain with LUNs from both servers.
Attach an ISO domain.
Make sure one of the storage servers works with multipath.
Create and run several VMs.
Create a preallocated 20GB template.
Before you start the test, create a new server (not desktop) from the template.
1. Block connectivity to the storage domain that works with multipath.
2. Give it 15 minutes (storage will get latency errors and go up and down but will not release the SPM).
3. Using iscsiadm, disconnect the blocked storage.

Actual results:
We try to fence the host and fail; as a result the irsproxy is reset.
The async task manager fails to clear the tasks because the irsproxy was reset.
SpmStop fails because there are running tasks on the host, and it runs in an endless loop.

Expected results:
We should be able to clean the tasks from the SPM so that we can elect a new SPM.

Additional info:
full engine and vdsm logs

loop:
2012-08-15 17:42:01,241 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-29) FINISH, SpmStopVDSCommand, log id: 2f998a98
2012-08-15 17:42:01,241 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-29) spm stop on spm failed, stopping spm selection!
2012-08-15 17:42:11,267 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) hostFromVds::selectedVds - gold-vdsc, spmStatus Free, storage pool iSCSI
2012-08-15 17:42:11,294 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) SpmStatus on vds 9666ae2a-e61f-11e1-97a6-001a4a169741: SPM
2012-08-15 17:42:11,294 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) Host reports to be SPM but is not up. 9666ae2a-e61f-11e1-97a6-001a4a169741
2012-08-15 17:42:11,296 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) SPM selection - vds seems as spm gold-vdsd
2012-08-15 17:42:11,297 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-98) START, SpmStopVDSCommand(vdsId = 9666ae2a-e61f-11e1-97a6-001a4a169741, storagePoolId = 66b35d07-24eb-4bdc-952d-1cf7144f71ab), log id: 5915a019
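To check how often the loop repeats in a full engine log, the warning from the excerpt above can be counted with a short script. This is a minimal sketch: the line layout (timestamp, level, logger in brackets, thread in parentheses, message) is inferred only from the lines quoted in this report, and `LINE_RE`, `STOP_FAILED` and `spm_stop_failures` are illustrative names, not part of any oVirt tooling.

```python
import re

# Layout inferred from the excerpt: "<timestamp> <LEVEL> [<logger>] (<thread>) <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"(?P<msg>.*)$"
)

# Message text copied from the WARN line in the report.
STOP_FAILED = "spm stop on spm failed"


def spm_stop_failures(lines):
    """Return the timestamps of every 'spm stop on spm failed' warning."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and STOP_FAILED in match.group("msg"):
            hits.append(match.group("ts"))
    return hits


# Two lines taken verbatim from the excerpt in this report.
SAMPLE = [
    "2012-08-15 17:42:01,241 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] "
    "(QuartzScheduler_Worker-29) FINISH, SpmStopVDSCommand, log id: 2f998a98",
    "2012-08-15 17:42:01,241 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] "
    "(QuartzScheduler_Worker-29) spm stop on spm failed, stopping spm selection!",
]

failures = spm_stop_failures(SAMPLE)
```

Passing an open log file instead of `SAMPLE` works the same way, since the function only iterates over lines; a long, evenly spaced list of timestamps would indicate the loop described here.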