1563639 – I/O paused HA VM is not restarted on other active host while its host is non-operational

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Virt
Sub Component:
Version:	4.2.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Michal Skrivanek
QA Contact:	Polina
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-04-04 11:29 UTC by Polina
Modified:	2018-04-25 09:37 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-04-25 09:37:47 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:

Attachments
vdsm, engine, libvirt logs. (1002.20 KB, application/x-gzip) 2018-04-04 11:29 UTC, Polina
vdsm log with full debug (489.04 KB, application/x-gzip) 2018-04-12 12:42 UTC, Polina
engine & vdsm logs (2.86 MB, application/x-gzip) 2018-04-20 21:09 UTC, Polina
logs_dumpxml.tar.gz (961.97 KB, application/x-gzip) 2018-04-24 07:00 UTC, Polina

Description Polina 2018-04-04 11:29:33 UTC

Created attachment 1417242
vdsm, engine, libvirt logs.

Description of problem: An HA VM that stays paused with an I/O error for more than 80 seconds must be restarted on an active host. In practice, the VM remains paused.

Version-Release number of selected component (if applicable): rhv-release-4.2.2-9-001.noarch

How reproducible: 100%

Steps to Reproduce:
1. Start an HA VM without a lease (bootable disk on iSCSI; the iSCSI domain is the master SD). The environment has 3 hosts.
2. Block the iSCSI storage. Wait 5+ minutes.

Actual results: The VM is I/O paused. The host stays active for a while, then becomes Non Operational. The VM remains paused on the same host.


Expected results: The VM is I/O paused. The host stays active for a while, then becomes Non Operational. The VM is destroyed and started on another active host.


Additional info: Tried the scenario on both SPM and non-SPM hosts; the behavior is the same.
Attached vdsm, engine, libvirt logs.

Comment 1 Polina 2018-04-04 11:41:20 UTC

For an HA VM with a lease, the VM is restarted automatically as expected. The problem is with an HA VM without a lease.

Comment 2 Michal Skrivanek 2018-04-05 05:08:44 UTC

HA VMs without a lease cannot be safely killed, hence the VM is not restarted, as you've seen. This is expected.
Please check the docs to see whether this is clearly explained.

Comment 3 Polina 2018-04-07 14:36:02 UTC

According to the docs, this feature applies to both kinds of HA VMs, with and without a lease. The docs only emphasize that it is best to use VMs with a lease:

"To prevent split brain it's best to use VM leases on highly available VMs. The same also helps restarting highly available VMs on other hosts in case of prolonged storage problems on some of the hosts."

I also asked Milan on IRC to make sure that the automatic restart feature applies to both kinds of HA VMs, with and without a lease.

Could you please confirm whether that is wrong, and whether automatic restart of HA VMs during a prolonged storage I/O error is only relevant for HA VMs with a lease?

Comment 6 Milan Zamazal 2018-04-10 14:13:32 UTC

Polina, could you please provide vdsm.log demonstrating the problem with DEBUG level enabled for all components, especially virt?

Comment 7 Polina 2018-04-12 12:42:52 UTC

Created attachment 1420821
vdsm log with full debug

Comment 8 Polina 2018-04-12 12:44:45 UTC

The vdsm.log with full debug has been added.

The scenario starts in engine.log at this line:

2018-04-12 15:26:58,480+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
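
For anyone locating the same event in their own engine.log, here is a minimal Python sketch that extracts the timestamp, event ID, and VM name from a VM_PAUSED_EIO audit line. The regular expression is modeled only on the line quoted above, and the function name `parse_paused_eio` is hypothetical; other log layouts are not covered.

```python
import re
from datetime import datetime

# Pattern modeled on the single engine.log line quoted in this comment.
# It is an assumption that other VM_PAUSED_EIO lines share this layout.
PAUSED_EIO = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}[+-]\d{2}\s+"
    r".*EVENT_ID: VM_PAUSED_EIO\((?P<event_id>\d+)\), "
    r"VM (?P<vm>\S+) has been paused due to storage I/O problem"
)


def parse_paused_eio(line):
    """Return a dict with time, event_id and vm, or None if the line does not match."""
    match = PAUSED_EIO.search(line)
    if match is None:
        return None
    return {
        # Local time as written in the log; the UTC offset is not applied.
        "time": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "event_id": int(match.group("event_id")),
        "vm": match.group("vm"),
    }


sample = (
    "2018-04-12 15:26:58,480+03 ERROR "
    "[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] "
    "(ForkJoinPool-1-worker-2) [] EVENT_ID: VM_PAUSED_EIO(145), "
    "VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem."
)
event = parse_paused_eio(sample)
print(event["vm"], event["time"])
```

Running the function over every line of engine.log and keeping the non-None results gives the list of pause events to compare against later VM status changes.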

Comment 9 Milan Zamazal 2018-04-12 13:49:38 UTC

Polina, are you sure the attached vdsm.log is from the right host? I can't see any VM started in the log. According to engine.log the VM was started on host_mixed_3, is it the same as cougar03 where the vdsm.log is from?

Comment 10 Polina 2018-04-20 21:01:08 UTC

Yes, but I am attaching new logs (logs_engine_vdsm.tar.gz) after reproducing the problem again.

Please look for the scenario starting from this line:
15523 2018-04-20 23:38:07,581+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.

Comment 11 Polina 2018-04-20 21:09:15 UTC

Created attachment 1424741
engine & vdsm logs

Comment 12 Milan Zamazal 2018-04-23 07:57:00 UTC

Thank you Polina for the logs. I can see the VM was started with resume behavior "leave paused" as seen in its domain XML:

  <resumeBehavior>leave_paused</resumeBehavior>

This is the reason why it hasn't been killed and hasn't been started on another host.

Polina, could you please check that "kill" resume behavior is selected in the VM HA settings before the VM is started?
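
A minimal Python sketch of this check, for reading the resume behavior out of a dumped domain XML. Only the `resumeBehavior` element and its value come from the comment above; the surrounding XML in `sample`, the position of the element inside `<metadata>`, and the function name `resume_behavior` are assumptions made for illustration. The lookup ignores any XML namespace prefix, since the exact namespace is not shown in this thread.

```python
import xml.etree.ElementTree as ET


def resume_behavior(domain_xml):
    """Return the text of the first element whose local name is resumeBehavior."""
    root = ET.fromstring(domain_xml.strip())
    for elem in root.iter():
        # Drop an optional "{namespace}" prefix so namespaced metadata
        # elements are matched as well.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "resumeBehavior":
            return (elem.text or "").strip()
    return None


# Hypothetical, heavily trimmed domain XML; only the resumeBehavior
# element is taken from the comment above.
sample = """
<domain type="kvm">
  <name>golden_env_mixed_virtio_2_0</name>
  <metadata>
    <resumeBehavior>leave_paused</resumeBehavior>
  </metadata>
</domain>
"""
print(resume_behavior(sample))
```

The input could come from the output of `virsh dumpxml` saved to a file, as in the logs_dumpxml.tar.gz attachment.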

Comment 13 Polina 2018-04-24 06:59:38 UTC

Hi Milan. Please see the new logs and XML dump attached (logs_dumpxml.tar.gz).
The VM is in auto_resume mode:
<storage_error_resume_behaviour>auto_resume</storage_error_resume_behaviour>

In engine.log, the scenario starts at:
2018-04-24 09:03:01,158+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-66) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.

At 09:51:53 the VM is still paused.

Comment 14 Polina 2018-04-24 07:00:19 UTC

Created attachment 1425827
logs_dumpxml.tar.gz

Comment 15 Polina 2018-04-24 07:04:10 UTC

Please let me know on IRC if you want me to reproduce the scenario and leave the environment for your investigation. The bug is 100% reproducible.

Comment 16 Milan Zamazal 2018-04-24 08:47:28 UTC

A VM with the auto_resume policy remains paused until the I/O error is remedied. Judging by the logs, the I/O error persists, so the VM is left paused, which is correct. What exactly is the bug here?

(Please don't forget about the libvirt bug discussed in https://bugzilla.redhat.com/1526025.)

Comment 17 Polina 2018-04-25 08:39:17 UTC

Hi Milan, this was my misunderstanding of the feature (1540548). I understood that an HA VM must be automatically restarted on another host when the storage connection is blocked for too long, no matter what is configured for Resume Behavior.

Please re-check that this statement is correct:
Automatic restart of a VM while storage is blocked applies only to HA VMs (with or without a lease) that are configured with the 'KILL' Resume Behavior.

If this statement is correct, the bug could be closed

Comment 18 Milan Zamazal 2018-04-25 09:37:47 UTC

(In reply to Polina from comment #17)
> Please re-check that this statement is correct:
> Automatic restart of a VM while storage is blocked applies only to HA VMs
> (with or without a lease) that are configured with the 'KILL' Resume Behavior.

Yes. Note that this doesn't mean that VMs without leases and with KILL resume behavior are always restarted. For example, if the host running the VM is unreachable, Engine can't restart them since their status is unknown (they may or may not have been killed). But that doesn't matter for this bug.

> If this statement is correct, the bug could be closed

OK, closing.
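
The rules confirmed in this thread can be summarized as a minimal Python sketch. It is not Engine code: the enum, the function name `restart_candidate`, and its parameters are hypothetical, and the function only encodes what comments 12, 16, 17 and 18 state. Timing thresholds and any other conditions Engine may apply are left out.

```python
from enum import Enum


class ResumeBehavior(Enum):
    # Values as they appear in this thread.
    AUTO_RESUME = "auto_resume"
    LEAVE_PAUSED = "leave_paused"
    KILL = "kill"


def restart_candidate(is_ha, behavior, has_lease, host_reachable):
    """Return True if, per this thread, an I/O paused VM may be restarted on another host."""
    if not is_ha:
        # The automatic restart discussed here applies to HA VMs only.
        return False
    if behavior is not ResumeBehavior.KILL:
        # auto_resume: the VM waits for the I/O error to be remedied.
        # leave_paused: the VM is neither killed nor restarted.
        return False
    if not has_lease and not host_reachable:
        # Without a lease and with an unreachable host, the VM status is
        # unknown, so Engine cannot restart it safely.
        return False
    return True


# The case reported in comment 13: HA VM, auto_resume, no lease.
print(restart_candidate(True, ResumeBehavior.AUTO_RESUME, False, True))
```

For the configuration reported in comment 13 the sketch returns False, which matches the observed behavior and the NOTABUG resolution.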
