Bug 1066693

Summary: Every thirty minutes OnVdsDuringFailureTimer is shown in engine log
Product: Red Hat Enterprise Virtualization Manager
Reporter: Dave Sullivan <dsulliva>
Component: ovirt-engine
Assignee: Mooli Tayer <mtayer>
Status: CLOSED ERRATA
QA Contact: Tareq Alayan <talayan>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.3.0
CC: aberezin, acathrow, bazulay, emesika, iheim, jentrena, lbopf, lpeer, mkalinin, mtayer, oourfali, pstehlik, Rhev-m-bugs, yeylon
Target Milestone: ---
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version: org.ovirt.engine-root-3.4.0-14
Doc Type: Bug Fix
Doc Text:
Previously, OnVdsDuringFailureTimer appeared in the engine log every 30 minutes after a host failed to run virtual machines. The error continued to appear even if the virtual machines were able to be run on another host, and even if no further problem was detected in the host. Now, the OnVdsDuringFailureTimer message appears only for three failed attempts, after which the number of attempts is reset to zero.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-09 15:04:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Dave Sullivan 2014-02-18 22:35:38 UTC
Description of problem:

See attached engine log

[dsulliva@localhost 01004615]$ grep -R OnVdsDuringFailureTimer lb0156-2014021214561392238613/
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 16:44:46,565 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-78) [39d3c9] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -699
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 17:14:46,595 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-38) [6accfc45] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -700
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 17:44:46,641 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-19) OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -701
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 18:14:46,683 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-58) OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -702
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 18:44:46,712 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-59) [60beae2c] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -703
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 19:14:46,740 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-67) OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -704
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 19:44:46,770 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-31) [291564a2] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -705
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 20:14:46,809 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-93) [4c7bcc69] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -706


Version-Release number of selected component (if applicable):

3.3.0 GA


Actual results:

The message "OnVdsDuringFailureTimer of vds vzlb1014 entered" appears in engine.log every 30 minutes.

Expected results:

We do not want to see this message, assuming it has no ramifications.

Additional info:

https://bugzilla.redhat.com/show_bug.cgi?id=1047567#c5

Comment 3 Barak 2014-03-25 11:47:12 UTC
Eli,

- Please elaborate: in which flows does this message appear?
- What is the role of this Timer/Quartz job?

There are 2 separate issues to be handled:
- Why do we need this log? What is its purpose if we can't do anything about it?
- Why are the values negative?

Comment 4 Mooli Tayer 2014-03-25 12:49:56 UTC
Barak,

1.) After failing to run a VM (3 times by default), the host's status is set to Error, and a scheduled thread starts moving it back to Up (every 30 minutes by default).

2.) This log line is important for understanding issues like this one
(the message was improved by Eli in bug 1047567).
We could log it at debug instead of info, but then we would probably not get it in a bug like this one. Once this bug is fixed, the message will appear much less often (3 times), so it should not bother users.

3.) Negative values: I believe this is caused by a misuse of quartz pause and resume, and I am attempting to reproduce it now. If that is true, the fix should be simple.
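The flow in point 1 and the suspected bug in point 3 can be sketched as follows. All class, method, and field names here are illustrative guesses, not the actual org.ovirt.engine.core.vdsbroker.VdsManager code, and the decrement-forever behavior is an assumption that is consistent with the negative "Attempts after" values in the attached log:

```java
// Sketch of the suspected OLD behavior: the recovery timer decrements the
// attempt counter on every 30-minute tick but is never cancelled, so once
// the counter passes zero it keeps going negative (cf. "Attempts after -699").
public class OldRecoveryTimerSketch {
    enum Status { UP, ERROR }

    static final int MAX_FAILED_RUNS = 3;   // NumberOfFailedRunsOnVds default

    Status status = Status.UP;
    int failedAttempts = 0;

    // Called each time a VM fails to start on this host.
    void onVmRunFailed() {
        failedAttempts++;
        if (failedAttempts >= MAX_FAILED_RUNS) {
            status = Status.ERROR;          // host stops receiving VMs
        }
    }

    // Fires every TimeToReduceFailedRunOnVdsInMinutes (default 30).
    // Bug (assumed): nothing ever unschedules the job, so the decrement
    // and the log line repeat forever.
    void onVdsDuringFailureTimer() {
        failedAttempts--;
        if (failedAttempts <= 0 && status == Status.ERROR) {
            status = Status.UP;             // host recovered long ago...
        }
        System.out.println("OnVdsDuringFailureTimer entered. Attempts after "
                + failedAttempts);
    }

    public static void main(String[] args) {
        OldRecoveryTimerSketch vds = new OldRecoveryTimerSketch();
        for (int i = 0; i < 3; i++) vds.onVmRunFailed();
        // Timer keeps firing long after recovery, driving the counter negative.
        for (int i = 0; i < 6; i++) vds.onVdsDuringFailureTimer();
    }
}
```

Under this model the host is already back to Up, yet the timer keeps firing and logging every half hour, which matches what the reporter sees.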

Comment 5 Barak 2014-03-25 18:51:45 UTC
(In reply to Mooli Tayer from comment #4)
> Barak,
> 
> 1.) After failing to run a vm (3 times default) the host's status is set to
> error and a scheduled thread starts moving it to up (every 30 min default)
> 
> 2.) This log line is important for understanding issues like this one.
> (the message was improved by Eli in bug 1047567)
> we can think about logging debug and not info but then we would probably not
> get them in a bug like this one. If not for this bug the message would
> appear much less (3 times), so it would probably not bother users.
> 


But what is its purpose? To show that the timer goes into action?
Is it necessary?
Shouldn't we log at info level only when it acts on a host status change, and log all the rest at debug?


> 3.) Negative values: I believe this is caused by a misuse of quartz
> pause and resume and am attempting to reproduce now. If that is true the fix
> should be simple.

Comment 6 Mooli Tayer 2014-03-26 11:56:37 UTC
(In reply to Barak from comment #5)
> (In reply to Mooli Tayer from comment #4)
> > Barak,
> > 
> > 1.) After failing to run a vm (3 times default) the host's status is set to
> > error and a scheduled thread starts moving it to up (every 30 min default)
> > 
> > 2.) This log line is important for understanding issues like this one.
> > (the message was improved by Eli in bug 1047567)
> > we can think about logging debug and not info but then we would probably not
> > get them in a bug like this one. If not for this bug the message would
> > appear much less (3 times), so it would probably not bother users.
> > 
> 
> 
> But what is its purpose? To show that the timer goes into action?
> Is it necessary?
> Shouldn't we log at info level only when it acts on a host status change, and
> log all the rest at debug?

Maybe if we change the message to something more informative, users can benefit: "Attempting to recover erroneous host $hostName". This message
tells users that a background thread set the host's status from ERROR
(the status for hosts with failing VMs) to UP. Once this bug is
resolved, the message should not appear as often as described here.

> 
> 
> > 3.) Negative values: I believe this is caused by a misuse of quartz
> > pause and resume and am attempting to reproduce now. If that is true the fix
> > should be simple.

Comment 7 Barak 2014-04-02 09:27:55 UTC
Mooli, please add a description for QE:
- previous behavior
- what was fixed, in general terms, and how
- how to test it (including how to get a host into ERROR status)

Comment 8 Mooli Tayer 2014-04-02 10:49:54 UTC
A host receives the ERROR status if it failed to run VMs NumberOfFailedRunsOnVds times (default 3) AND those VMs WERE ABLE to run on a different host.
To prevent the host from running VMs, I added a line throwing an
exception in /usr/share/vdsm/clientIf.py:createVm().

At that point you should see in the log:
"Vds {0} moved to Error mode after {1} attempts. Time: {2}"

The new behavior is simple: after TimeToReduceFailedRunOnVdsInMinutes
(default 30), IF THE HOST STATUS IS STILL ERROR, it is set to UP and the number of failed attempts is zeroed. The log should contain:
"Settings host {0} to up after {1} failed attempts to run a VM"
At that point we return to the initial state.

The old behavior is somewhat more complicated to explain;
it is described in the message of this commit:
http://gerrit.ovirt.org/#/c/26139/
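The fixed flow described in this comment can be sketched roughly as below. Names are hypothetical, not the actual ovirt-engine code; the real change is in the gerrit commit linked above:

```java
// Sketch of the FIXED behavior: one check after the configured interval
// that sets the host back to UP and zeroes the counter, so repeated timer
// ticks can never drive the counter negative.
public class FixedRecoverySketch {
    enum Status { UP, ERROR }

    // Starting state: host failed NumberOfFailedRunsOnVds (default 3) times
    // and was moved to ERROR.
    Status status = Status.ERROR;
    int failedAttempts = 3;

    // Fired once TimeToReduceFailedRunOnVdsInMinutes (default 30) elapses.
    void recoverFromFailure() {
        if (status == Status.ERROR) {
            // Engine logs here, e.g.:
            // "Settings host {0} to up after {1} failed attempts to run a VM"
            System.out.println("Setting host to up after " + failedAttempts
                    + " failed attempts to run a VM");
            status = Status.UP;
            failedAttempts = 0;   // back to the initial state
        }
    }

    public static void main(String[] args) {
        FixedRecoverySketch vds = new FixedRecoverySketch();
        vds.recoverFromFailure();   // ERROR -> UP, counter reset to 0
        vds.recoverFromFailure();   // no-op: host is already UP
    }
}
```

The key design point is that recovery is idempotent: once the host is UP the timer action does nothing, instead of continuing to decrement a counter.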

Comment 9 Tareq Alayan 2014-05-13 11:45:07 UTC
Could not reproduce on rhevm-3.4.0-0.18.el6ev.noarch.

Comment 10 errata-xmlrpc 2014-06-09 15:04:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0506.html