We have an embedded Linux appliance, based on Debian, that does not support the installation of the guest agent (not even manually). The appliance is Kerio Control Firewall. The VM is hosted in an HA cluster but, in case of host failure, it is not restarted automatically. We tried all available migration policies: legacy, minimal downtime, post-copy, suspend workload if needed. Other VMs, with the guest agent installed, are restarted successfully. When one of the hosts in the cluster fails, this event is logged for the VM: "VM xxxxxx is down. Exit message: User shut down from within the guest". It seems that oVirt mistakes the host failure for a user shutdown. Is it possible to have HA VMs without the guest agent?
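For reference, this is roughly how HA was enabled on the VM. A minimal sketch using the Python ovirt-engine-sdk4; the engine URL, credentials, and VM name below are placeholders, not our real values:

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='secret',                                  # placeholder
        insecure=True,
    )

    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=ahead-kctrl01')[0]   # placeholder VM name

    # Flag the VM as highly available so the engine tries to restart it
    # on another host after a failure.
    vms_service.vm_service(vm.id).update(
        types.Vm(high_availability=types.HighAvailability(enabled=True))
    )

    connection.close()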
I suppose you mean in case of host failure. What is the host OS and how does it fail?
(In reply to Michal Skrivanek from comment #1)
> I suppose you mean in case of host failure. What is the host OS and how does
> it fail?

Yes, exactly. The hosts in the cluster are CentOS 7.3. We simulated the failure by unplugging the power cable of one host.
If you unplugged the cable, then I guess it depends on how power management in oVirt is configured. But that rather contradicts the message about a shutdown. The host should go to the Not Responding state and then, after fencing, to Down. At that point the HA VM should be restarted. Can you please attach engine.log so we can check the exact sequence of actions?
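As a side note, the fence path can also be exercised directly. A minimal sketch, assuming the Python ovirt-engine-sdk4 (the engine URL and credentials are placeholders), asking the engine for a host's power status through its fence agent:

    import ovirtsdk4 as sdk

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='secret',                                  # placeholder
        insecure=True,
    )

    hosts_service = connection.system_service().hosts_service()
    host = hosts_service.list(search='name=ahead-hs01hp')[0]

    # Ask the engine to contact the fence agent (iLO in this setup) and
    # report the host's power status; this is the same path used for fencing.
    power_management = hosts_service.host_service(host.id).fence(fence_type='status')
    print(power_management.status)

    connection.close()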
Created attachment 1319541 [details] Screenshot VM without guest agent
Created attachment 1319543 [details] Configuration of VM with HA
Created attachment 1319544 [details] Host power management (ILO)
(In reply to Michal Skrivanek from comment #3)
> If you unplugged the cable, then I guess it depends on how power management
> in oVirt is configured. But that rather contradicts the message about a
> shutdown. The host should go to the Not Responding state and then, after
> fencing, to Down. At that point the HA VM should be restarted. Can you
> please attach engine.log so we can check the exact sequence of actions?

Each host has power management configured through iLO. Each VM has HA enabled, but only the VM with the guest agent installed is restarted.

"engine.log" of the VM with the guest agent (OK):

2017-08-09 12:25:11,678+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] VM '1ff4caad-6239-4acc-98c0-9bb4e43bbc22' was reported as Down on VDS '532843ac-a073-4ebf-90e7-f8dd92d538cc'(ahead-hs01hp)
2017-08-09 12:25:11,678+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-8) [] START, DestroyVDSCommand(HostName = ahead-hs01hp, DestroyVmVDSCommandParameters:{runAsync='true', hostId='532843ac-a073-4ebf-90e7-f8dd92d538cc', vmId='1ff4caad-6239-4acc-98c0-9bb4e43bbc22', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 5f6abf0
2017-08-09 12:25:11,683+02 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-34) [5d9a1ba9] Running command: ProcessDownVmCommand internal: true.
2017-08-09 12:25:12,682+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-8) [] FINISH, DestroyVDSCommand, log id: 5f6abf0
2017-08-09 12:25:12,682+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] VM '1ff4caad-6239-4acc-98c0-9bb4e43bbc22'(linuxtest) moved from 'Up' --> 'Down'
2017-08-09 12:25:12,706+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_DOWN_ERROR(119), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM linuxtest is down with error. Exit message: VM has been terminated on the host.
2017-08-09 12:25:12,706+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] add VM '1ff4caad-6239-4acc-98c0-9bb4e43bbc22'(linuxtest) to HA rerun treatment
2017-08-09 12:25:12,714+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM linuxtest failed. It will be restarted automatically.
2017-08-09 12:25:12,714+02 INFO [org.ovirt.engine.core.bll.VdsEventListener] (ForkJoinPool-1-worker-8) [] Highly Available VM went down. Attempting to restart. VM Name 'linuxtest', VM Id '1ff4caad-6239-4acc-98c0-9bb4e43bbc22'
2017-08-09 12:25:12,721+02 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-38) [7c7c3245] Running command: ProcessDownVmCommand internal: true.
"engine.log" of VM without guest agent (KO): 2017-08-28 14:42:17,794+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM 'b6a02b62-8f6e-412f-9427-a3b6c26627d4' was reported as Down on VDS '532843ac-a073-4ebf-90e7-f8dd92d538cc'(ahead-hs01hp) 2017-08-28 14:42:17,794+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-12) [] START, DestroyVDSCommand(HostName = ahead-hs01hp, DestroyVmVDSCommandParameters:{runAsync='true', hostId='532843ac-a073-4ebf-90e7-f8dd92d538cc', vmId='b6a02b62-8f6e-412f-9427-a3b6c26627d4', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 7d4dd9d9 2017-08-28 14:42:17,801+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-12) [] FINISH, DestroyVDSCommand, log id: 7d4dd9d9 2017-08-28 14:42:17,801+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM 'b6a02b62-8f6e-412f-9427-a3b6c26627d4'(ahead-kctrl01.cloud.ahead.local) moved from 'Up' --> 'Down' 2017-08-28 14:42:17,873+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_DOWN(61), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ahead-kctrl01.cloud.ahead.local is down. Exit message: User shut down from within the guest 2017-08-28 14:42:17,885+02 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-15) [55bea7dd] Running command: ProcessDownVmCommand internal: true. As you can see, for the first the reason of shutdown is "Message: VM linuxtest is down with error. Exit message: VM has been terminated on the host." but for the second is "Message: VM ahead-kctrl01.cloud.ahead.local is down. Exit message: User shut down from within the guest" and in this case HA migration is not started.
Right, that makes sense now. I was only confused by the statement about pulling out the power cable, as that would mean no iLO interaction and no orderly host shutdown.

Indeed there is a problem with the detection of guest terminations during a host shutdown (the termination signal is indistinguishable from a guest-initiated shutdown). There is recent work in libvirt to differentiate the two, but for now oVirt still relies on ovirt-guest-agent to figure that out.

Note that this should not happen in the "regular" case of a power outage, where the host doesn't get a chance to initiate its own shutdown.

This may be addressed by bug 1334982, but it would need to be retested for this specific fencing case.
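To illustrate the libvirt-level differentiation mentioned above (a sketch only, not oVirt's actual code path; the domain name is a placeholder), newer libvirt reports a shutoff reason alongside the domain state:

    import libvirt

    SHUTOFF_REASONS = {
        libvirt.VIR_DOMAIN_SHUTOFF_SHUTDOWN: 'guest-initiated shutdown',
        libvirt.VIR_DOMAIN_SHUTOFF_DESTROYED: 'destroyed from the host side',
        libvirt.VIR_DOMAIN_SHUTOFF_CRASHED: 'guest crashed',
        libvirt.VIR_DOMAIN_SHUTOFF_FAILED: 'failed to start',
    }

    conn = libvirt.openReadOnly('qemu:///system')
    dom = conn.lookupByName('ahead-kctrl01')  # placeholder domain name

    state, reason = dom.state()
    if state == libvirt.VIR_DOMAIN_SHUTOFF:
        # The reason code is what would let a management layer tell a real
        # guest shutdown apart from a termination on the host side.
        print(SHUTOFF_REASONS.get(reason, f'other reason code {reason}'))

    conn.close()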
(In reply to Michal Skrivanek from comment #8)
> Right, that makes sense now. I was only confused by the statement about
> pulling out the power cable, as that would mean no iLO interaction and no
> orderly host shutdown.
>
> Indeed there is a problem with the detection of guest terminations during a
> host shutdown (the termination signal is indistinguishable from a
> guest-initiated shutdown). There is recent work in libvirt to differentiate
> the two, but for now oVirt still relies on ovirt-guest-agent to figure that
> out.
>
> Note that this should not happen in the "regular" case of a power outage,
> where the host doesn't get a chance to initiate its own shutdown.
>
> This may be addressed by bug 1334982, but it would need to be retested for
> this specific fencing case.

So there is currently no way to ensure an HA restart without the guest agent? We also tried a watchdog device, but this specific appliance has no support for it. Thank you.
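For completeness, this is how we attached the watchdog we tried. A sketch with placeholder names, assuming the Python ovirt-engine-sdk4 (it did not help here because the appliance's guest OS lacks watchdog driver support):

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='secret',                                  # placeholder
        insecure=True,
    )

    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=ahead-kctrl01')[0]   # placeholder VM name

    # Add an emulated i6300esb watchdog that resets the VM when the guest
    # stops servicing it -- this only works if the guest OS has the driver.
    vms_service.vm_service(vm.id).watchdogs_service().add(
        types.Watchdog(
            model=types.WatchdogModel.I6300ESB,
            action=types.WatchdogAction.RESET,
        )
    )

    connection.close()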
Correct. It currently doesn't work when power management shuts the host down cleanly. It would work in case of a real power failure with an immediate power off.
To simplify tracking, I am marking this bug as a duplicate of bug 1334982 and have added a comment there noting that this case also has to be tested. The patches solving the issue are the same, though.

*** This bug has been marked as a duplicate of bug 1334982 ***