Bug 1066693
Summary: | Every thirty minutes OnVdsDuringFailureTimer is shown in engine log | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Dave Sullivan <dsulliva> |
Component: | ovirt-engine | Assignee: | Mooli Tayer <mtayer> |
Status: | CLOSED ERRATA | QA Contact: | Tareq Alayan <talayan> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.3.0 | CC: | aberezin, acathrow, bazulay, emesika, iheim, jentrena, lbopf, lpeer, mkalinin, mtayer, oourfali, pstehlik, Rhev-m-bugs, yeylon |
Target Milestone: | --- | ||
Target Release: | 3.4.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | infra | ||
Fixed In Version: | org.ovirt.engine-root-3.4.0-14 | Doc Type: | Bug Fix |
Doc Text: |
Previously, OnVdsDuringFailureTimer appeared in the engine log every 30 minutes after a host failed to run virtual machines. The error continued to appear even if the virtual machines were able to be run on another host, and even if no further problem was detected in the host. Now, the OnVdsDuringFailureTimer message appears only for three failed attempts, after which the number of attempts is reset to zero.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2014-06-09 15:04:30 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Dave Sullivan
2014-02-18 22:35:38 UTC
Eli, - Please elaborate on which flows this message appears ? - What is the role of this Timer/Quartz ? There are 2 separate issues to be handled: - Why do we need this log? what is it's purpose if we can't do anything about it ? - why are the negative values Barak, 1.) After failing to run a vm (3 times default) the host's status is set to error and a scheduled thread starts moving it to up (every 30 min default) 2.) This log line is important for understanding issues like this one. (the message was improved by Eli in bug 1047567) we can think about logging debug and not info but then we would probably not get them in a bug like this one. If not for this bug the message would appear much less (3 times )so it would probably not bother users. 3.) Negative values: I believe this is caused by a missus in using quartz pause and resume and am attempting to reproduce now. If that is true the fix should be simple. (In reply to Mooli Tayer from comment #4) > Barak, > > 1.) After failing to run a vm (3 times default) the host's status is set to > error and a scheduled thread starts moving it to up (every 30 min default) > > 2.) This log line is important for understanding issues like this one. > (the message was improved by Eli in bug 1047567) > we can think about logging debug and not info but then we would probably not > get them in a bug like this one. If not for this bug the message would > appear much less (3 times )so it would probably not bother users. > But what is it's purpose? to show the timer goes into action ? Is it necessary ? Shouldn't we log (info level) only when it acts on host status change ? and all the rest should be debug ? > 3.) Negative values: I believe this is caused by a missus in using quartz > pause and resume and am attempting to reproduce now. If that is true the fix > should be simple. (In reply to Barak from comment #5) > (In reply to Mooli Tayer from comment #4) > > Barak, > > > > 1.) After failing to run a vm (3 times default) the host's status is set to > > error and a scheduled thread starts moving it to up (every 30 min default) > > > > 2.) This log line is important for understanding issues like this one. > > (the message was improved by Eli in bug 1047567) > > we can think about logging debug and not info but then we would probably not > > get them in a bug like this one. If not for this bug the message would > > appear much less (3 times )so it would probably not bother users. > > > > > But what is it's purpose? to show the timer goes into action ? > Is it necessary ? > Shouldn't we log (info level) only when it acts on host status change ? and > all the rest should be debug ? Maybe if we change the message to something more informative then users can benefit. "Attempting to recover erroneous host $hostName" this message tells the users that a background thread set it's status from ERROR (status for hosts with failling vms) to UP. this message should not appear as much as described in this bug after the bug is resolved. > > > > 3.) Negative values: I believe this is caused by a missus in using quartz > > pause and resume and am attempting to reproduce now. If that is true the fix > > should be simple. Molli please ad description for QE: - previous behavior - generally what was fixed and how - how to test it (including how to get a host to be in ERROR status) A host receives ERROR status if it failed to run vms NumberOfFailedRunsOnVds times (default 3) AND those vms WERE ABLE to run on a different host. What I did to prevent the host from running vms was I added a line throwing an exception in /usr/share/vdsm/clientIf.py:createVm() At that point you should see in the log: "Vds {0} moved to Error mode after {1} attempts. Time: {2}" Now the new behavior is simple: after TimeToReduceFailedRunOnVdsInMinutes (default 30) IF THE HOST STATUS IS STILL ERROR it will be set to UP and number of failed attempts is zeroed. Log should have: "Settings host {0} to up after {1} failed attempts to run a VM" at that point we return to beginning state. As for the old behavior it is somewhat more complicated to explain, and is described in the message of this commit: http://gerrit.ovirt.org/#/c/26139/ couldn't reproduce rhevm-3.4.0-0.18.el6ev.noarch Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0506.html |