Bug 1066693 - Every thirty minutes OnVdsDuringFailureTimer is shown in engine log
Summary: Every thirty minutes OnVdsDuringFailureTimer is shown in engine log
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.4.0
Assignee: Mooli Tayer
QA Contact: Tareq Alayan
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2014-02-18 22:35 UTC by Dave Sullivan
Modified: 2019-10-10 09:14 UTC (History)
14 users

Fixed In Version: org.ovirt.engine-root-3.4.0-14
Doc Type: Bug Fix
Doc Text:
Previously, OnVdsDuringFailureTimer appeared in the engine log every 30 minutes after a host failed to run virtual machines. The error continued to appear even if the virtual machines were able to be run on another host, and even if no further problem was detected in the host. Now, the OnVdsDuringFailureTimer message appears only for three failed attempts, after which the number of attempts is reset to zero.
Clone Of:
Environment:
Last Closed: 2014-06-09 15:04:30 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 764093 0 None None None Never
Red Hat Product Errata RHSA-2014:0506 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Virtualization Manager 3.4.0 update 2014-06-09 18:55:38 UTC
oVirt gerrit 26138 0 None None None Never
oVirt gerrit 26139 0 None None None Never
oVirt gerrit 26140 0 None None None Never
oVirt gerrit 26141 0 None None None Never
oVirt gerrit 26459 0 None None None Never
oVirt gerrit 26460 0 None None None Never
oVirt gerrit 26461 0 None None None Never
oVirt gerrit 26462 0 None None None Never

Description Dave Sullivan 2014-02-18 22:35:38 UTC
Description of problem:

See attached engine log

[dsulliva@localhost 01004615]$ grep -R OnVdsDuringFailureTimer lb0156-2014021214561392238613/
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 16:44:46,565 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-78) [39d3c9] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -699
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 17:14:46,595 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-38) [6accfc45] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -700
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 17:44:46,641 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-19) OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -701
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 18:14:46,683 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-58) OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -702
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 18:44:46,712 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-59) [60beae2c] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -703
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 19:14:46,740 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-67) OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -704
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 19:44:46,770 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-31) [291564a2] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -705
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 20:14:46,809 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-93) [4c7bcc69] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -706


Version-Release number of selected component (if applicable):

3.3.0 GA


Actual results:

The message "OnVdsDuringFailureTimer of vds vzlb1014 entered" appears in engine.log every thirty minutes.

Expected results:

We do not want to see this message, assuming there are no ramifications to it.

Additional info:

https://bugzilla.redhat.com/show_bug.cgi?id=1047567#c5

Comment 3 Barak 2014-03-25 11:47:12 UTC
Eli,

- Please elaborate: in which flows does this message appear?
- What is the role of this Quartz timer?

There are two separate issues to be handled:
- Why do we need this log? What is its purpose if we can't do anything about it?
- Why the negative values?

Comment 4 Mooli Tayer 2014-03-25 12:49:56 UTC
Barak,

1.) After failing to run a VM (3 times by default) the host's status is set to Error, and a scheduled thread starts moving it back to Up (every 30 minutes by default).

2.) This log line is important for understanding issues like this one (the message was improved by Eli in bug 1047567). We could log it at debug instead of info, but then we would probably not get the lines in a bug like this one. If not for this bug, the message would appear much less often (3 times), so it would probably not bother users.

3.) Negative values: I believe this is caused by a misuse of the Quartz pause and resume calls, and I am attempting to reproduce now. If that is true, the fix should be simple.
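
The flow described above can be sketched as follows. This is a hypothetical Python model, not the actual engine code (which is Java, in org.ovirt.engine.core.vdsbroker.VdsManager); the class and method names here are invented, while NumberOfFailedRunsOnVds and TimeToReduceFailedRunOnVdsInMinutes are the engine config options mentioned later in this bug:

```python
# Hypothetical sketch of the failure/recovery flow: after
# NumberOfFailedRunsOnVds consecutive failures to run a VM (where the VMs
# did run on another host), the host moves to Error; a timer firing every
# TimeToReduceFailedRunOnVdsInMinutes later moves it back to Up.
NUMBER_OF_FAILED_RUNS_ON_VDS = 3  # engine default

class HostStatus:
    UP = "Up"
    ERROR = "Error"

class VdsManagerSketch:
    def __init__(self):
        self.status = HostStatus.UP
        self.failed_attempts = 0

    def on_vm_run_failed(self):
        # Called when a VM fails to start on this host but starts elsewhere.
        self.failed_attempts += 1
        if self.failed_attempts >= NUMBER_OF_FAILED_RUNS_ON_VDS:
            self.status = HostStatus.ERROR

    def on_failure_timer(self):
        # Fires every TimeToReduceFailedRunOnVdsInMinutes (default 30).
        # Fixed behavior: if the host is still in Error, set it Up and
        # zero the counter, so the timer does not fire forever.
        if self.status == HostStatus.ERROR:
            self.status = HostStatus.UP
            self.failed_attempts = 0
```
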

Comment 5 Barak 2014-03-25 18:51:45 UTC
(In reply to Mooli Tayer from comment #4)
> Barak,
> 
> 1.) After failing to run a vm (3 times default) the host's status is set to
> error and a scheduled thread starts moving it to up (every 30 min default)
> 
> 2.) This log line is important for understanding issues like this one.
> (the message was improved by Eli in bug 1047567)
> we can think about logging debug and not info but then we would probably not
> get them in a bug like this one. If not for this bug the message would
> appear much less (3 times )so it would probably not bother users.
> 


But what is its purpose? To show that the timer goes into action?
Is it necessary?
Shouldn't we log at info level only when it acts on a host status change, with all the rest at debug?


> 3.) Negative values: I believe this is caused by a misuse of the Quartz
> pause and resume calls, and I am attempting to reproduce now. If that is
> true, the fix should be simple.

Comment 6 Mooli Tayer 2014-03-26 11:56:37 UTC
(In reply to Barak from comment #5)
> (In reply to Mooli Tayer from comment #4)
> > Barak,
> > 
> > 1.) After failing to run a vm (3 times default) the host's status is set to
> > error and a scheduled thread starts moving it to up (every 30 min default)
> > 
> > 2.) This log line is important for understanding issues like this one.
> > (the message was improved by Eli in bug 1047567)
> > we can think about logging debug and not info but then we would probably not
> > get them in a bug like this one. If not for this bug the message would
> > appear much less (3 times )so it would probably not bother users.
> > 
> 
> 
> But what is its purpose? To show that the timer goes into action?
> Is it necessary?
> Shouldn't we log at info level only when it acts on a host status change,
> with all the rest at debug?

Maybe if we change the message to something more informative, users can benefit: "Attempting to recover erroneous host $hostName". This message
tells users that a background thread set the host's status from Error
(the status for hosts with failing VMs) to Up. Once this bug is resolved,
the message should not appear as often as described here.

> 
> 
> > 3.) Negative values: I believe this is caused by a misuse of the Quartz
> > pause and resume calls, and I am attempting to reproduce now. If that is
> > true, the fix should be simple.

Comment 7 Barak 2014-04-02 09:27:55 UTC
Mooli, please add a description for QE:
- previous behavior
- generally what was fixed and how
- how to test it (including how to get a host to be in ERROR status)

Comment 8 Mooli Tayer 2014-04-02 10:49:54 UTC
A host receives the Error status if it failed to run VMs NumberOfFailedRunsOnVds times (default 3) AND those VMs WERE ABLE to run on a different host.
To prevent the host from running VMs, I added a line throwing
an exception in /usr/share/vdsm/clientIf.py:createVm()

At that point you should see in the log:
"Vds {0} moved to Error mode after {1} attempts. Time: {2}"

Now the new behavior is simple: after TimeToReduceFailedRunOnVdsInMinutes
(default 30), IF THE HOST STATUS IS STILL ERROR, it is set to Up and the number of failed attempts is zeroed. The log should contain:
"Settings host {0} to up after {1} failed attempts to run a VM"
At that point we return to the beginning state.

The old behavior is somewhat more complicated to explain,
and is described in the message of this commit:
http://gerrit.ovirt.org/#/c/26139/
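
As a rough illustration of how the attached log could reach values like "Attempts after -699", here is a hypothetical Python sketch (the actual engine code is Java, and the exact old logic is in the commit linked above; the assumption here is that the old timer simply decremented the counter on every tick with no floor and no stop condition):

```python
# Hypothetical sketch: the presumed old timer decremented the failure
# counter on every 30-minute tick, even after the host had recovered,
# so the counter drifted into large negative values.
def old_timer_tick(failed_attempts):
    # presumed old behavior: decrement unconditionally on every tick
    return failed_attempts - 1

def new_timer_tick(status, failed_attempts):
    # fixed behavior: recover the host once and zero the counter
    if status == "Error":
        return "Up", 0
    return status, failed_attempts

# Starting from the default of 3 failed attempts, 702 ticks (about two
# weeks of 30-minute intervals) would drive the counter to -699, the
# first value seen in the attached log.
attempts = 3
for _ in range(702):
    attempts = old_timer_tick(attempts)
```
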

Comment 9 Tareq Alayan 2014-05-13 11:45:07 UTC
Couldn't reproduce on rhevm-3.4.0-0.18.el6ev.noarch.

Comment 10 errata-xmlrpc 2014-06-09 15:04:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0506.html

