Bug 1218548 - VdsNotRespondingTreatment Job remains in status STARTED even after Manual fencing the host
Summary: VdsNotRespondingTreatment Job remains in status STARTED even after Manual fen...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Eli Mesika
QA Contact: sefi litmanovich
URL:
Whiteboard:
Duplicates: 1203143 1228992
Depends On:
Blocks:
 
Reported: 2015-05-05 08:57 UTC by sefi litmanovich
Modified: 2016-04-20 01:10 UTC (History)
13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-20 01:10:12 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
engine log (982.18 KB, application/x-gzip)
2015-05-05 08:57 UTC, sefi litmanovich
screenshot (106.47 KB, image/png)
2015-07-01 06:09 UTC, Michael Burman
engine logs (419.02 KB, application/x-gzip)
2015-07-14 06:32 UTC, Michael Burman


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 42606 0 master MERGED clone context of internal cmd in non-responding 2020-04-22 17:48:54 UTC

Description sefi litmanovich 2015-05-05 08:57:14 UTC
Created attachment 1022112 [details]
engine log

Description of problem:

My environment had a single host in a cluster with an NFS storage domain. The host was SPM and connected to storage.
The host runs RHEL 7 with vdsm-4.16.13-1.el7ev and has no PM agent configured.
At some point the host lost connectivity and became Non Responsive, and the fence flow was initiated.
After SSH Soft Fencing failed (due to the host's lack of connectivity), 'VdsNotRespondingTreatment' was initiated with the message:
Handling non responsive Host {hostname}.

At this point the host remained unreachable, and after several hours I chose to confirm manual reboot in order to release the host from the SPM role and then put it in Maintenance.

But the 'VdsNotRespondingTreatment' job is still shown as STARTED in the DB, and its message persists in the task list in the UI.
After an engine restart, the job's status becomes UNKNOWN.
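For reference, the stuck job can be inspected directly in the engine database. The following is only a sketch: it assumes the standard ovirt-engine schema, where a `job` table carries `action_type`, `status`, `start_time`, and `end_time` columns.

```sql
-- Hedged sketch: list non-responding-treatment jobs in the engine DB.
-- Table and column names assume the standard ovirt-engine 'job' table.
SELECT job_id, action_type, status, start_time, end_time
FROM job
WHERE action_type = 'VdsNotRespondingTreatment'
ORDER BY start_time DESC;
```

In the reported state, the top row stays at status STARTED with a NULL end_time long after the fence flow is over.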

Version-Release number of selected component (if applicable):

rhevm-3.5.1-0.4.el6ev.noarch


Actual results:

The job gets stuck in STARTED status and its message remains in the task list.

Expected results:

The job should either change to FAILED after fencing fails, or to FINISHED after the manual reboot of the host is confirmed.

Comment 1 Eli Mesika 2015-06-17 08:04:54 UTC
*** Bug 1203143 has been marked as a duplicate of this bug. ***

Comment 2 Eli Mesika 2015-06-21 11:12:37 UTC
*** Bug 1228992 has been marked as a duplicate of this bug. ***

Comment 3 Max Kovgan 2015-06-28 14:13:33 UTC
ovirt-3.6.0-3 release

Comment 4 Michael Burman 2015-07-01 06:07:27 UTC
Please note, I still see this issue in the new 3.6.0-3 engine -> 3.6.0-0.0.master.20150627185750.git6f063c1.el6

Tasks remain in Adding status, for example from yesterday.
Attaching a screenshot.

Comment 5 Michael Burman 2015-07-01 06:09:00 UTC
Created attachment 1044902 [details]
screenshot

Comment 6 Oved Ourfali 2015-07-01 06:43:19 UTC
Liran - is this related to the job/step issue?
Eli - can you verify it works with the latest patches?

Comment 7 Eli Mesika 2015-07-01 09:11:09 UTC
(In reply to Oved Ourfali from comment #6)
> Liran - is this related to the job/step issue?
> Eli - can you verify it works with the latest patches?

Rebased on master and tested again, works fine

Comment 8 sefi litmanovich 2015-07-13 16:03:07 UTC
Verified with ovirt-engine-3.6.0-0.0.master.20150627185750.git6f063c1.el6.noarch.

steps:

1. Have a host with no PM configured.
2. block connection between host and engine.

result: the host goes to Connecting state and after that to Non Responsive state.
In the DB: the VdsNotRespondingTreatment job is STARTED -> after several failed attempts to connect to the host, VdsNotRespondingTreatment becomes FAILED, as expected.

3. confirm host has been rebooted.
4. put host to maintenance.
5. restore connectivity between host and engine.
6. activate host.

result: host is up.
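As a sanity check after the flow ends, the end state can be confirmed in the engine database. Again a sketch only, assuming the standard `job` table with `action_type` and `status` columns:

```sql
-- Hedged sketch: after fencing fails (or the manual reboot is confirmed),
-- no VdsNotRespondingTreatment job should remain in STARTED status.
SELECT job_id, status, start_time, end_time
FROM job
WHERE action_type = 'VdsNotRespondingTreatment'
  AND status = 'STARTED';
-- An empty result here matches the expected behavior.
```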

Michael - Can you try to reproduce the same flow? It seems we got different results on the same version; maybe you can specify your steps?

Comment 9 Michael Burman 2015-07-14 05:12:29 UTC
Hi Sefi, Eli, Oved

Not sure about the exact steps, but I have tasks stuck in the tasks log UI.
For example, 'Adding new host' from July 09, on 3.6.0-0.0.master.20150627185750.git6f063c1.el6

2015-Jul-09, 11:07 Adding new Host puma22.scl.lab.tlv.redhat.com to Cluster mburman_1

The task just stays there and looks like it is still trying to resolve,
when actually the puma22 server was installed successfully.

Feel free to contact me if you would like access to my setup.

Comment 10 Oved Ourfali 2015-07-14 05:34:51 UTC
Liran - please take a look and make sure this is covered with recent master and your recent additions.

Comment 11 Liran Zelkha 2015-07-14 05:50:51 UTC
Michael - can you send server logs (server.log and engine.log)? 
I'm adding hosts and it works fine.

Comment 12 Michael Burman 2015-07-14 06:32:23 UTC
Created attachment 1051628 [details]
engine logs

