Description of problem:
When a high-availability VM is down due to an environmental issue (such as storage maintenance), there is only one attempt to start the VM (which will probably fail). I suggest that the backend try to start the VM periodically.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:
2013-Feb-19, VM ... was restarted on Host ...
2013-Feb-19, Failed to run VM ... on Host ... .
2013-Feb-19, VM ... is down. Exit message: Bad volume specification {'index': '0', 'domainID': 'd5d0dfe6-c2b9-429e-96c2-c1cd247ac828', 'reqsize': '0', 'format': 'raw', 'boot': 'true', 'volumeID': '2ca7c670-13b0-4a9b-8a9b-e37d0ba7e1ad', 'apparentsize': '32212254720', 'imageID': '6e94cfbd-4510-4403-896e-029bf97816bd', 'readonly': False, 'iface': 'virtio', 'truesize': '32212254720', 'poolID': '132859ec-ef83-4c01-b411-0cea6d3e1ed6', 'device': 'disk', 'shared': False, 'propagateErrors': 'off', 'type': 'disk', 'if': 'virtio'}.
2013-Feb-19, VM ... was started by ... (Host: ...).

Expected results:

Additional info:
Barak, we need some more info here; i.e., how do we know when this is a temporary issue and when it isn't? For example, if you have 30 HA VMs and storage goes down, you will have a lot of noise, and initiating restart operations may end up marking hosts as problematic (even when the issue is with storage or the network). So this needs to be refined to the cases we'd like to handle.
Doron, I see what you mean, but a single attempt defeats the point of high availability. If the VMs are down for a long period, the administrator will be able to tell whether it is a host issue or something else, and will need to restart the VMs anyway. Today the admin needs to restart the VM manually in any case; my suggestion reduces how often human intervention is needed. This is a general idea; I'm sure you guys can work out the details. Simon, can you add your thoughts about the subject?
(In reply to comment #2) > Simon, can you add your thoughts about the subject ? I agree we should improve the mechanism to something other than just one retry and then stop. On the other hand, we know Doron is right: we had too many issues with the host recovery mechanism when it did periodic retries without checking that the issue had been resolved. I'm moving this to future in order to have a discussion placeholder for this issue, but we need to think carefully about any mechanism we select.
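The periodic-retry idea discussed above, constrained to avoid the runaway-retry problem described in the earlier comments, could look roughly like this. This is an illustrative Python sketch only, not actual oVirt engine code (the engine is written in Java); all names and parameters are assumptions.

```python
# Hypothetical sketch: retry starting an HA VM with a capped number of
# attempts and exponential backoff, so a broken environment (e.g. storage
# down) does not generate unbounded restart attempts or host flapping.
import time


def try_restart_ha_vm(start_vm, max_attempts=5, base_delay=30, sleep=time.sleep):
    """Attempt to start an HA VM, backing off between failed attempts.

    start_vm: callable returning True when the VM started successfully.
    Returns True on success, False after max_attempts failures.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if start_vm():
            return True          # VM is up
        if attempt < max_attempts:
            sleep(delay)         # wait before the next attempt
            delay *= 2           # exponential backoff
    return False                 # give up; admin intervention needed
```

Capping attempts and backing off exponentially limits the noise when the root cause affects many HA VMs at once, while still removing the need for manual restarts after transient failures.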
Target release should be set once a package build is known to fix an issue. Since this bug was not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Milan - how's the new policy feature affecting this request?
The new "kill" autoresume policy for HA VMs with leases should help make VM restarts safer. But since I'm involved only in the Vdsm part, I lack proper insight; Engine developers would know better about the exact motivations and usage.
(In reply to Milan Zamazal from comment #15) > The new "kill" autoresume policy for HA VMs with leases should help make VM > restarts safer. But being involved just in the Vdsm part, I lack proper > insight, Engine developers should know better about the exact motivations > and utilization. What's the status on the engine side?
(In reply to Yaniv Kaul from comment #16) > (In reply to Milan Zamazal from comment #15) > > The new "kill" autoresume policy for HA VMs with leases should help make VM > > restarts safer. But being involved just in the Vdsm part, I lack proper > > insight, Engine developers should know better about the exact motivations > > and utilization. > > What's the status on the engine side? For HA VMs with a lease, the only allowed resume policy is "Kill", which means the engine will try to restart the VM, so it will help in this case. However, for an HA VM without a lease you can also set the "Resume" policy (which is the default), and in that case this policy will not help; the original behavior is preserved.
verified on ovirt-engine-4.4.0-0.17.master.el7.noarch according to https://polarion.engineering.redhat.com/polarion/redirect/project/RHEVM3/workitem?id=RHEVM-26910
*** Bug 1653389 has been marked as a duplicate of this bug. ***
This bugzilla is included in oVirt 4.4.1 release, published on July 8th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.