Bug 1218548 - VdsNotRespondingTreatment Job remains in status STARTED even after Manual fencing the host
Summary: VdsNotRespondingTreatment Job remains in status STARTED even after Manual fen...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Eli Mesika
QA Contact: sefi litmanovich
URL:
Whiteboard:
Duplicates: 1203143 1228992
Depends On:
Blocks:
 
Reported: 2015-05-05 08:57 UTC by sefi litmanovich
Modified: 2016-04-20 01:10 UTC (History)
13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-20 01:10:12 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
engine log (982.18 KB, application/x-gzip)
2015-05-05 08:57 UTC, sefi litmanovich
screenshot (106.47 KB, image/png)
2015-07-01 06:09 UTC, Michael Burman
engine logs (419.02 KB, application/x-gzip)
2015-07-14 06:32 UTC, Michael Burman


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 42606 0 master MERGED clone context of internal cmd in non-responding 2020-04-22 17:48:54 UTC

Description sefi litmanovich 2015-05-05 08:57:14 UTC
Created attachment 1022112 [details]
engine log

Description of problem:

My environment had a single host in a cluster with an NFS storage domain. The host was SPM and connected to storage.
The host runs RHEL 7 with vdsm-4.16.13-1.el7ev and has no PM agent configured.
At some point the host lost connectivity and became Non Responsive, and the fence flow was initiated.
After SSH Soft Fencing failed (due to the host's lack of connectivity), 'VdsNotRespondingTreatment' was initiated with the message:
Handling non responsive Host {hostname}.

At this point the host remained unreachable, and after several hours I chose to confirm manual reboot in order to release the host from the SPM role and then put it in Maintenance.

But the 'VdsNotRespondingTreatment' job is still shown as STARTED in the DB, and its message persists in the task list in the UI.
After an engine restart, the job's status becomes UNKNOWN.
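For reference, the stuck job can be inspected directly in the engine database. The following is only a sketch: it assumes the standard ovirt-engine schema, where a `job` table carries `action_type`, `status`, `start_time`, and `end_time` columns.

```sql
-- Hedged sketch: list non-responding-treatment jobs in the engine DB.
-- Table and column names assume the standard ovirt-engine 'job' table.
SELECT job_id, action_type, status, start_time, end_time
FROM job
WHERE action_type = 'VdsNotRespondingTreatment'
ORDER BY start_time DESC;
```

In the reported state, the top row stays at status STARTED with a NULL end_time long after the fence flow is over.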

Version-Release number of selected component (if applicable):

rhevm-3.5.1-0.4.el6ev.noarch


Actual results:

The job gets stuck in STARTED status and its message remains in the task list.

Expected results:

The job should either change to FAILED after fencing fails, or to FINISHED after the manual reboot of the host is confirmed.

Comment 1 Eli Mesika 2015-06-17 08:04:54 UTC
*** Bug 1203143 has been marked as a duplicate of this bug. ***

Comment 2 Eli Mesika 2015-06-21 11:12:37 UTC
*** Bug 1228992 has been marked as a duplicate of this bug. ***

Comment 3 Max Kovgan 2015-06-28 14:13:33 UTC
ovirt-3.6.0-3 release

Comment 4 Michael Burman 2015-07-01 06:07:27 UTC
Please note, I still see this issue in the new 3.6.0-3 engine -> 3.6.0-0.0.master.20150627185750.git6f063c1.el6

Tasks remain in Adding status, for example from yesterday.
Attaching a screenshot.

Comment 5 Michael Burman 2015-07-01 06:09:00 UTC
Created attachment 1044902 [details]
screenshot

Comment 6 Oved Ourfali 2015-07-01 06:43:19 UTC
Liran - is this related to the job/step issue?
Eli - can you verify it works with the latest patches?

Comment 7 Eli Mesika 2015-07-01 09:11:09 UTC
(In reply to Oved Ourfali from comment #6)
> Liran - is this related to the job/step issue?
> Eli - can you verify it works with the latest patches?

Rebased on master and tested again, works fine

Comment 8 sefi litmanovich 2015-07-13 16:03:07 UTC
Verified with ovirt-engine-3.6.0-0.0.master.20150627185750.git6f063c1.el6.noarch.

steps:

1. Have a host with no PM configured.
2. block connection between host and engine.

result: the host goes to Connecting state and after that to Non Responsive state.
In the DB: the VdsNotRespondingTreatment job is STARTED -> after several failed attempts to connect to the host, VdsNotRespondingTreatment becomes FAILED, as expected.

3. confirm host has been rebooted.
4. put host to maintenance.
5. restore connectivity between host and engine.
6. activate host.

result: host is up.
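As a sanity check after the flow ends, the end state can be confirmed in the engine database. Again a sketch only, assuming the standard `job` table with `action_type` and `status` columns:

```sql
-- Hedged sketch: after fencing fails (or the manual reboot is confirmed),
-- no VdsNotRespondingTreatment job should remain in STARTED status.
SELECT job_id, status, start_time, end_time
FROM job
WHERE action_type = 'VdsNotRespondingTreatment'
  AND status = 'STARTED';
-- An empty result here matches the expected behavior.
```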

Michael - Can you try to reproduce the same flow? It seems we got different results on the same version; maybe you can specify your steps?

Comment 9 Michael Burman 2015-07-14 05:12:29 UTC
Hi Sefi, Eli, Oved

Not sure about the exact steps, but I have tasks stuck in the tasks log UI.
For example, 'Adding new host' from July 09, on 3.6.0-0.0.master.20150627185750.git6f063c1.el6

2015-Jul-09, 11:07 Adding new Host puma22.scl.lab.tlv.redhat.com to Cluster mburman_1

The task just stays there and looks like it is still trying to resolve,
when actually the puma22 server was installed successfully.

Feel free to contact me if you would like access to my setup.

Comment 10 Oved Ourfali 2015-07-14 05:34:51 UTC
Liran - please take a look and make sure this is covered with recent master and your recent additions.

Comment 11 Liran Zelkha 2015-07-14 05:50:51 UTC
Michael - can you send server logs (server.log and engine.log)? 
I'm adding hosts and it works fine.

Comment 12 Michael Burman 2015-07-14 06:32:23 UTC
Created attachment 1051628 [details]
engine logs

