Bug 1145099

Summary: Engine never completes task VdsNotRespondingTreatmentCommand (Handling non responsive Host <hostName>) in case of SPM host reboot
Product: Red Hat Enterprise Virtualization Manager
Reporter: Gilad Lazarovich <glazarov>
Component: ovirt-engine
Assignee: Ori Liel <oliel>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: medium
Priority: low
Version: 3.5.0
CC: acanan, gklein, lsurette, michal.skrivanek, oourfali, pmatyas, pstehlik, rbalakri, Rhev-m-bugs, srevivo, ykaul
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 3.6.0-9
Doc Type: Bug Fix
Last Closed: 2016-04-20 01:11:49 UTC
Type: Bug
oVirt Team: Infra
Bug Depends On: 1257610    
Attachments:
  Engine and VDSM logs
  Tasks not completing

Description Gilad Lazarovich 2014-09-22 11:37:27 UTC
Created attachment 939974 [details]
Engine and VDSM logs

Description of problem:
The "Handling non responsive Host" task does not complete when the SPM host's network goes down or the host is rebooted

Version-Release number of selected component (if applicable):
3.5 vt3.1

How reproducible:
100%

Steps to Reproduce:
1. On a Data Center with one or more hosts and at least one Storage domain defined, reboot the SPM host or bring down its rhevm network
2. Check the Tasks pane for the "Handling non responsive Host" task

Actual results:
The Handling non responsive Host <hostName> task never completes

Expected results:
The task should complete (in this case, presumably with a failure status)

Additional info:
The DB shows these tasks never complete:
engine=# SELECT action_type,description, status,start_time,end_time from job;
        action_type        |                            description                             | status  |         start_time         |          end_time          
---------------------------+--------------------------------------------------------------------+---------+----------------------------+----------------------------
 VdsNotRespondingTreatment | Handling non responsive Host gold-vdsd.qa.lab.tlv.redhat.com       | STARTED | 2014-09-22 11:02:30.889+03 | 
 VdsNotRespondingTreatment | Handling non responsive Host gold-vdsd.qa.lab.tlv.redhat.com       | STARTED | 2014-09-22 10:24:57.99+03  | 
 VdsNotRespondingTreatment | Handling non responsive Host gold-vdsd.qa.lab.tlv.redhat.com       | FAILED  | 2014-09-22 10:43:27.282+03 | 2014-09-22 10:43:27.317+03
 SshSoftFencing            | Executing SSH Soft Fencing on host gold-vdsc.qa.lab.tlv.redhat.com | FAILED  | 2014-09-22 10:51:22.713+03 | 2014-09-22 10:52:25.807+03
 SshSoftFencing            | Executing SSH Soft Fencing on host gold-vdsd.qa.lab.tlv.redhat.com | FAILED  | 2014-09-22 11:01:27.786+03 | 2014-09-22 11:02:30.877+03
 VdsNotRespondingTreatment | Handling non responsive Host gold-vdsd.qa.lab.tlv.redhat.com       | STARTED | 2014-09-22 10:06:44.817+03 | 
 SshSoftFencing            | Executing SSH Soft Fencing on host gold-vdsd.qa.lab.tlv.redhat.com | FAILED  | 2014-09-22 10:42:24.157+03 | 2014-09-22 10:43:27.269+03
 VdsNotRespondingTreatment | Handling non responsive Host gold-vdsc.qa.lab.tlv.redhat.com       | STARTED | 2014-09-22 10:52:25.82+03  | 
(8 rows)
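For reference, a minimal JDBC sketch (an illustration, not part of the original report) that lists the jobs stuck in this state; the connection URL and credentials are placeholders and must match the actual engine database setup, with the PostgreSQL JDBC driver on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class StuckJobs {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; adjust to the real engine DB setup.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/engine", "engine", "password");
             Statement stmt = conn.createStatement();
             // A job left in STARTED with no end_time is the symptom
             // described in this bug.
             ResultSet rs = stmt.executeQuery(
                 "SELECT action_type, description, start_time FROM job " +
                 "WHERE status = 'STARTED' AND end_time IS NULL")) {
            while (rs.next()) {
                System.out.printf("%s | %s | %s%n",
                    rs.getString("action_type"),
                    rs.getString("description"),
                    rs.getTimestamp("start_time"));
            }
        }
    }
}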

Comment 1 Gilad Lazarovich 2014-09-22 11:38:22 UTC
Created attachment 939975 [details]
Tasks not completing

Comment 2 Michal Skrivanek 2015-06-02 09:39:38 UTC
host flow is infra

Comment 3 Ori Liel 2015-07-30 08:06:25 UTC
Patch posted: 

  https://gerrit.ovirt.org/#/c/44136/3

The problem occurs when the VdsNotRespondingTreatment command invokes the SetSpmStatus command. The execution context gets mixed up between the two, which results in SetSpmStatus being marked as completed twice and VdsNotRespondingTreatment never being marked as completed.

This was fixed locally, but the problem probably affects all monitored commands that are invoked by another command. A general fix is required, but it needs extensive verification, so it will be done in the future. The fix is to make CommandContext.clone() clone the ExecutionContext too.
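A simplified, hypothetical sketch of that idea (the class internals and method names here are illustrative, not the actual ovirt-engine sources): cloning the command context for a child command must also copy the execution context, so the child's completion is not recorded against the parent's context.

// Simplified, hypothetical sketch of the fix; not the actual ovirt-engine code.
class ExecutionContext {
    private boolean completed;

    void markCompleted() {
        completed = true;
    }

    // Produce an independent copy so parent and child track completion separately.
    ExecutionContext copy() {
        ExecutionContext c = new ExecutionContext();
        c.completed = this.completed;
        return c;
    }
}

class CommandContext {
    private ExecutionContext executionContext = new ExecutionContext();

    ExecutionContext getExecutionContext() {
        return executionContext;
    }

    @Override
    public CommandContext clone() {
        CommandContext c = new CommandContext();
        // Before the fix: c.executionContext = this.executionContext;
        // The shared context let the child command (SetSpmStatus) complete the
        // parent's (VdsNotRespondingTreatment) execution context twice, leaving
        // the parent's job stuck in STARTED. Copying it gives each command its own:
        c.executionContext = this.executionContext.copy();
        return c;
    }
}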

Comment 5 Petr Matyáš 2016-01-21 12:56:24 UTC
Verified on 3.6.2-10