We have an embedded Linux appliance, based on Debian, that does not support the installation of the guest agent (not even manually). The appliance is Kerio Control Firewall. The VM is hosted in an HA cluster but, in case of host failure, it is not restarted automatically. We tried all available migration policies: legacy, minimal downtime, post-copy, suspend workload if needed. Other VMs, with the guest agent installed, are restarted successfully. When one of the hosts in the cluster fails, this event is logged for the VM: "VM xxxxxx is down. Exit message: User shut down from within the guest". It seems that oVirt mistakes the host failure for a user shutdown. Is it possible to have HA VMs without the guest agent?
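For reference, this is roughly how HA was enabled on the VM. A minimal sketch using the Python ovirt-engine-sdk4; the engine URL, credentials, and VM name below are placeholders, not our real values:

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='secret',                                  # placeholder
        insecure=True,
    )

    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=ahead-kctrl01')[0]   # placeholder VM name

    # Flag the VM as highly available so the engine tries to restart it
    # on another host after a failure.
    vms_service.vm_service(vm.id).update(
        types.Vm(high_availability=types.HighAvailability(enabled=True))
    )

    connection.close()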
I suppose you mean in case of host failure. What is the host OS and how does it fail?
(In reply to Michal Skrivanek from comment #1)
> I suppose you mean in case of host failure. What is the host OS and how does
> it fail?

Yes, exactly. The hosts in the cluster are CentOS 7.3. We simulated the failure by unplugging the power cable of one host.
If you unplugged the cable, then I guess it depends on how power management in oVirt is configured. But that rather contradicts the message about a shutdown. The host should go to the Not Responding state and then, after fencing, to Down. At that point the HA VM should be restarted. Can you please attach engine.log so we can check the exact sequence of actions?
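As a side note, the fence path can also be exercised directly. A minimal sketch, assuming the Python ovirt-engine-sdk4 (the engine URL and credentials are placeholders), asking the engine for a host's power status through its fence agent:

    import ovirtsdk4 as sdk

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='secret',                                  # placeholder
        insecure=True,
    )

    hosts_service = connection.system_service().hosts_service()
    host = hosts_service.list(search='name=ahead-hs01hp')[0]

    # Ask the engine to contact the fence agent (iLO in this setup) and
    # report the host's power status; this is the same path used for fencing.
    power_management = hosts_service.host_service(host.id).fence(fence_type='status')
    print(power_management.status)

    connection.close()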
Created attachment 1319541 [details] Screenshot VM without guest agent
Created attachment 1319543 [details] Configuration of VM with HA
Created attachment 1319544 [details] Host power management (ILO)
(In reply to Michal Skrivanek from comment #3)
> If you unplugged the cable, then I guess it depends on how power management
> in oVirt is configured. But that rather contradicts the message about a
> shutdown. The host should go to the Not Responding state and then, after
> fencing, to Down. At that point the HA VM should be restarted. Can you
> please attach engine.log so we can check the exact sequence of actions?

Each host has power management configured through iLO. Each VM has HA enabled, but only the VM with the guest agent installed is restarted.

"engine.log" of the VM with the guest agent (OK):

2017-08-09 12:25:11,678+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] VM '1ff4caad-6239-4acc-98c0-9bb4e43bbc22' was reported as Down on VDS '532843ac-a073-4ebf-90e7-f8dd92d538cc'(ahead-hs01hp)
2017-08-09 12:25:11,678+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-8) [] START, DestroyVDSCommand(HostName = ahead-hs01hp, DestroyVmVDSCommandParameters:{runAsync='true', hostId='532843ac-a073-4ebf-90e7-f8dd92d538cc', vmId='1ff4caad-6239-4acc-98c0-9bb4e43bbc22', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 5f6abf0
2017-08-09 12:25:11,683+02 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-34) [5d9a1ba9] Running command: ProcessDownVmCommand internal: true.
2017-08-09 12:25:12,682+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-8) [] FINISH, DestroyVDSCommand, log id: 5f6abf0
2017-08-09 12:25:12,682+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] VM '1ff4caad-6239-4acc-98c0-9bb4e43bbc22'(linuxtest) moved from 'Up' --> 'Down'
2017-08-09 12:25:12,706+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_DOWN_ERROR(119), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM linuxtest is down with error. Exit message: VM has been terminated on the host.
2017-08-09 12:25:12,706+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] add VM '1ff4caad-6239-4acc-98c0-9bb4e43bbc22'(linuxtest) to HA rerun treatment
2017-08-09 12:25:12,714+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM linuxtest failed. It will be restarted automatically.
2017-08-09 12:25:12,714+02 INFO [org.ovirt.engine.core.bll.VdsEventListener] (ForkJoinPool-1-worker-8) [] Highly Available VM went down. Attempting to restart. VM Name 'linuxtest', VM Id '1ff4caad-6239-4acc-98c0-9bb4e43bbc22'
2017-08-09 12:25:12,721+02 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-38) [7c7c3245] Running command: ProcessDownVmCommand internal: true.
"engine.log" of VM without guest agent (KO): 2017-08-28 14:42:17,794+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM 'b6a02b62-8f6e-412f-9427-a3b6c26627d4' was reported as Down on VDS '532843ac-a073-4ebf-90e7-f8dd92d538cc'(ahead-hs01hp) 2017-08-28 14:42:17,794+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-12) [] START, DestroyVDSCommand(HostName = ahead-hs01hp, DestroyVmVDSCommandParameters:{runAsync='true', hostId='532843ac-a073-4ebf-90e7-f8dd92d538cc', vmId='b6a02b62-8f6e-412f-9427-a3b6c26627d4', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 7d4dd9d9 2017-08-28 14:42:17,801+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-12) [] FINISH, DestroyVDSCommand, log id: 7d4dd9d9 2017-08-28 14:42:17,801+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM 'b6a02b62-8f6e-412f-9427-a3b6c26627d4'(ahead-kctrl01.cloud.ahead.local) moved from 'Up' --> 'Down' 2017-08-28 14:42:17,873+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_DOWN(61), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ahead-kctrl01.cloud.ahead.local is down. Exit message: User shut down from within the guest 2017-08-28 14:42:17,885+02 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-15) [55bea7dd] Running command: ProcessDownVmCommand internal: true. As you can see, for the first the reason of shutdown is "Message: VM linuxtest is down with error. Exit message: VM has been terminated on the host." but for the second is "Message: VM ahead-kctrl01.cloud.ahead.local is down. Exit message: User shut down from within the guest" and in this case HA migration is not started.
Right, that makes sense now. I was only confused by the statement about pulling out the power cable, as that would mean no iLO interaction and no orderly host shutdown.

Indeed there is a problem with the detection of guest terminations during a host shutdown (the termination signal is indistinguishable from a guest-initiated shutdown). There is recent work in libvirt to differentiate the two, but for now oVirt still relies on ovirt-guest-agent to figure that out.

Note that this should not happen in the "regular" case of a power outage, where the host doesn't get a chance to initiate its own shutdown.

This may be addressed by bug 1334982, but it would need to be retested for this specific fencing case.
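To illustrate the libvirt-level differentiation mentioned above (a sketch only, not oVirt's actual code path; the domain name is a placeholder), newer libvirt reports a shutoff reason alongside the domain state:

    import libvirt

    SHUTOFF_REASONS = {
        libvirt.VIR_DOMAIN_SHUTOFF_SHUTDOWN: 'guest-initiated shutdown',
        libvirt.VIR_DOMAIN_SHUTOFF_DESTROYED: 'destroyed from the host side',
        libvirt.VIR_DOMAIN_SHUTOFF_CRASHED: 'guest crashed',
        libvirt.VIR_DOMAIN_SHUTOFF_FAILED: 'failed to start',
    }

    conn = libvirt.openReadOnly('qemu:///system')
    dom = conn.lookupByName('ahead-kctrl01')  # placeholder domain name

    state, reason = dom.state()
    if state == libvirt.VIR_DOMAIN_SHUTOFF:
        # The reason code is what would let a management layer tell a real
        # guest shutdown apart from a termination on the host side.
        print(SHUTOFF_REASONS.get(reason, f'other reason code {reason}'))

    conn.close()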
(In reply to Michal Skrivanek from comment #8)
> Right, that makes sense now. I was only confused by the statement about
> pulling out the power cable, as that would mean no iLO interaction and no
> orderly host shutdown.
>
> Indeed there is a problem with the detection of guest terminations during a
> host shutdown (the termination signal is indistinguishable from a
> guest-initiated shutdown). There is recent work in libvirt to differentiate
> the two, but for now oVirt still relies on ovirt-guest-agent to figure that
> out.
>
> Note that this should not happen in the "regular" case of a power outage,
> where the host doesn't get a chance to initiate its own shutdown.
>
> This may be addressed by bug 1334982, but it would need to be retested for
> this specific fencing case.

So there is currently no way to ensure an HA restart without the guest agent? We also tried a watchdog device, but this specific appliance has no support for it. Thank you.
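For completeness, this is how we attached the watchdog we tried. A sketch with placeholder names, assuming the Python ovirt-engine-sdk4 (it did not help here because the appliance's guest OS lacks watchdog driver support):

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='secret',                                  # placeholder
        insecure=True,
    )

    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=ahead-kctrl01')[0]   # placeholder VM name

    # Add an emulated i6300esb watchdog that resets the VM when the guest
    # stops servicing it -- this only works if the guest OS has the driver.
    vms_service.vm_service(vm.id).watchdogs_service().add(
        types.Watchdog(
            model=types.WatchdogModel.I6300ESB,
            action=types.WatchdogAction.RESET,
        )
    )

    connection.close()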
Correct. It currently doesn't work when power management shuts the host down cleanly. It would work in case of a real power failure with an immediate power off.
To simplify tracking, I am marking this bug as a duplicate of bug 1334982 and have added a comment there noting that this case also has to be tested. The patches solving the issue are the same, though.

*** This bug has been marked as a duplicate of bug 1334982 ***