Description of problem:
See attached engine log

[dsulliva@localhost 01004615]$ grep -R OnVdsDuringFailureTimer lb0156-2014021214561392238613/
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 16:44:46,565 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-78) [39d3c9] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -699
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 17:14:46,595 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-38) [6accfc45] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -700
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 17:44:46,641 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-19) OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -701
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 18:14:46,683 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-58) OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -702
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 18:44:46,712 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-59) [60beae2c] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -703
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 19:14:46,740 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-67) OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -704
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 19:44:46,770 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-31) [291564a2] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -705
lb0156-2014021214561392238613/var/log/ovirt-engine/engine.log:2014-02-11 20:14:46,809 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-93) [4c7bcc69] OnVdsDuringFailureTimer of vds vzlb1014 entered. Attempts after -706

Version-Release number of selected component (if applicable):
3.3.0 GA

Actual results:
The following message is shown in engine.log:
OnVdsDuringFailureTimer of vds vzlb1014 entered

Expected results:
Do not want to see this message, assuming there are no ramifications of the message.

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1047567#c5
Eli,
- Please elaborate: in which flows does this message appear?
- What is the role of this Timer/Quartz?

There are 2 separate issues to be handled:
- Why do we need this log? What is its purpose if we can't do anything about it?
- Why are there negative values?
Barak,

1.) After failing to run a VM (3 times by default) the host's status is set to Error, and a scheduled thread starts moving it back to Up (every 30 minutes by default).

2.) This log line is important for understanding issues like this one (the message was improved by Eli in bug 1047567). We could consider logging it at debug level instead of info, but then we would probably not get it in a bug like this one. If not for this bug, the message would appear much less often (3 times), so it would probably not bother users.

3.) Negative values: I believe this is caused by a misuse of quartz pause and resume, and I am attempting to reproduce it now. If that is true, the fix should be simple.
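The flow described in 1.) can be modeled with a short sketch. This is illustrative Python only, not engine code (the engine is Java); the class and attribute names are assumptions, but the constants match the defaults named above (NumberOfFailedRunsOnVds = 3, TimeToReduceFailedRunOnVdsInMinutes = 30):

```python
# Illustrative model of the host ERROR/recovery flow described above.
# Not actual ovirt-engine code; names besides the two config options are
# assumptions for the sketch.

NUMBER_OF_FAILED_RUNS_ON_VDS = 3          # NumberOfFailedRunsOnVds default


class HostState:
    def __init__(self):
        self.status = "UP"
        self.failed_attempts = 0

    def record_failed_vm_run(self):
        # Each failed VM run increments the counter; at the threshold the
        # host is moved to ERROR.
        self.failed_attempts += 1
        if self.failed_attempts >= NUMBER_OF_FAILED_RUNS_ON_VDS:
            self.status = "ERROR"

    def on_vds_during_failure_timer(self):
        # Fired by the scheduled thread every
        # TimeToReduceFailedRunOnVdsInMinutes (default 30). It only acts
        # if the host is still in ERROR; zeroing the counter here is what
        # keeps the attempt count from drifting negative.
        if self.status == "ERROR":
            self.status = "UP"
            self.failed_attempts = 0
```

In this model the timer is idempotent once the host is back Up, so repeated firings cannot drive the counter below zero, which is the symptom reported in this bug.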
(In reply to Mooli Tayer from comment #4)
> Barak,
>
> 1.) After failing to run a vm (3 times default) the host's status is set to
> error and a scheduled thread starts moving it to up (every 30 min default)
>
> 2.) This log line is important for understanding issues like this one.
> (the message was improved by Eli in bug 1047567)
> we can think about logging debug and not info but then we would probably not
> get them in a bug like this one. If not for this bug the message would
> appear much less (3 times )so it would probably not bother users.

But what is its purpose? To show the timer goes into action? Is it necessary? Shouldn't we log (info level) only when it acts on a host status change, and log all the rest at debug?

> 3.) Negative values: I believe this is caused by a missus in using quartz
> pause and resume and am attempting to reproduce now. If that is true the fix
> should be simple.
(In reply to Barak from comment #5)
> (In reply to Mooli Tayer from comment #4)
> > Barak,
> >
> > 1.) After failing to run a vm (3 times default) the host's status is set to
> > error and a scheduled thread starts moving it to up (every 30 min default)
> >
> > 2.) This log line is important for understanding issues like this one.
> > (the message was improved by Eli in bug 1047567)
> > we can think about logging debug and not info but then we would probably not
> > get them in a bug like this one. If not for this bug the message would
> > appear much less (3 times )so it would probably not bother users.
>
> But what is it's purpose? to show the timer goes into action ?
> Is it necessary ?
> Shouldn't we log (info level) only when it acts on host status change ? and
> all the rest should be debug ?

Maybe if we change the message to something more informative, users can benefit:

"Attempting to recover erroneous host $hostName"

This message tells users that a background thread set the host's status from ERROR (the status for hosts with failing VMs) to UP. The message should not appear as often as described in this bug once the bug is resolved.

> > 3.) Negative values: I believe this is caused by a missus in using quartz
> > pause and resume and am attempting to reproduce now. If that is true the fix
> > should be simple.
Mooli, please add a description for QE:
- previous behavior
- generally what was fixed and how
- how to test it (including how to get a host to be in ERROR status)
A host receives ERROR status if it failed to run VMs NumberOfFailedRunsOnVds times (default 3) AND those VMs WERE ABLE to run on a different host.

To prevent the host from running VMs, I added a line throwing an exception in /usr/share/vdsm/clientIf.py:createVm(). At that point you should see in the log:

"Vds {0} moved to Error mode after {1} attempts. Time: {2}"

The new behavior is simple: after TimeToReduceFailedRunOnVdsInMinutes (default 30), IF THE HOST STATUS IS STILL ERROR it will be set to UP and the number of failed attempts is zeroed. The log should contain:

"Settings host {0} to up after {1} failed attempts to run a VM"

At that point we return to the initial state. The old behavior is somewhat more complicated to explain, and is described in the message of this commit: http://gerrit.ovirt.org/#/c/26139/
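The fault injection described above can be sketched as follows. This is a hypothetical stand-in, not real vdsm code: only the file path and the createVm() method name come from the comment, and the surrounding class and parameter names are assumptions. The idea is simply to raise an exception at the top of createVm() so every VM start attempt fails and the engine counts failed runs against the host:

```python
# Hypothetical sketch of the fault injection from the comment above.
# Stand-in for the real class in /usr/share/vdsm/clientIf.py; class and
# parameter names are assumptions, only createVm() is from the comment.

class ClientIf:
    def createVm(self, vmParams):
        # Injected fault: fail unconditionally so the engine records a
        # failed VM run. After NumberOfFailedRunsOnVds such failures
        # (default 3), the engine moves this host to ERROR.
        raise RuntimeError("injected failure to force host into ERROR status")
```

With this in place, every attempt to start a VM on the doctored host fails, the VMs get rescheduled to another host, and after the third failure the engine should log the "moved to Error mode" message quoted above.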
Couldn't reproduce with rhevm-3.4.0-0.18.el6ev.noarch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0506.html