Created attachment 1417242 [details]
vdsm, engine, libvirt logs.
Description of problem: An HA VM paused with an I/O error for more than 80 seconds should be restarted on an active host. Actually, the VM remains paused.
Version-Release number of selected component (if applicable): rhv-release-4.2.2-9-001.noarch
Steps to Reproduce:
1. Start an HA VM without a lease (iSCSI bootable disk; iSCSI is the master SD). The environment has 3 hosts.
2. Block the iSCSI storage and wait 5+ minutes.
Actual results: The VM is paused due to an I/O error. The host stays active for a while, then becomes Non Operational. The VM remains paused on the same host.
Expected results: The VM is paused due to an I/O error. The host stays active for a while, then becomes Non Operational. The VM is destroyed and started on another active host.
Additional info: Tried the scenario on both SPM and non-SPM hosts; the behavior is the same.
Attached vdsm, engine, libvirt logs.
If it is an HA VM with a lease, the VM is restarted automatically as expected. The problem is with HA VMs without a lease.
HA VMs without a lease cannot be safely killed, hence they are not restarted, as you've seen. This is expected.
Perhaps just check the docs to see whether this is clearly explained.
From the docs it follows that this feature relates to both kinds of HA VMs - with and without a lease. It is only emphasized that it is best to use VMs with a lease:
"To prevent split brain it's best to use VM leases on highly available VMs. The same also helps restarting highly available VMs on other hosts in case of prolonged storage problems on some of the hosts."
I also asked Milan via IRC to be sure that the automatic restart feature relates to both kinds of HA VMs - with and without a lease.
Could you please confirm whether this is wrong, and whether automatic restart of HA VMs during a prolonged storage I/O error is relevant only for HA VMs with a lease?
Polina, could you please provide vdsm.log demonstrating the problem with DEBUG level enabled for all components, especially virt?
Created attachment 1420821 [details]
vdsm log with full debug
The full-debug vdsm.log has been added.
The scenario starts in engine.log from this line:
2018-04-12 15:26:58,480+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2)  EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
Polina, are you sure the attached vdsm.log is from the right host? I can't see any VM started in the log. According to engine.log the VM was started on host_mixed_3, is it the same as cougar03 where the vdsm.log is from?
Yes, but I am attaching new logs (logs_engine_vdsm.tar.gz) after reproducing the issue again.
Please look for the scenario starting from the line:
15523 2018-04-20 23:38:07,581+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-10)  EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
Created attachment 1424741 [details]
engine & vdsm logs
Thank you, Polina, for the logs. I can see the VM was started with the resume behavior "leave paused", as seen in its domain XML:
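For reference, in oVirt 4.2 the resume behavior is recorded in the vdsm metadata section of the domain XML. A hedged illustration, not the actual snippet from the attached logs (element names per the oVirt 4.2 vdsm metadata format; possible values are auto_resume, leave_paused, and kill):

```xml
<!-- Illustrative only: vdsm metadata fragment from a domain XML dump. -->
<metadata>
  <ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:resumeBehavior>leave_paused</ovirt-vm:resumeBehavior>
  </ovirt-vm:vm>
</metadata>
```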
This is the reason why it hasn't been killed and hasn't been started on another host.
Polina, could you please check that the "kill" resume behavior is selected in the VM's HA settings before the VM is started?
Hi Milan. Please see the new logs and XML dump attached (logs_dumpxml.tar.gz).
The VM is in auto_resume mode.
In engine.log please see the start of the scenario at
2018-04-24 09:03:01,158+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-66)  EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
At 09:51:53 the VM remains paused
Created attachment 1425827 [details]
Please let me know via IRC if you want me to reproduce the scenario and leave the environment for your investigation. The bug is 100% reproducible.
A VM with the auto_resume policy remains paused until the I/O error is remedied. Looking into the logs, the I/O error seems to persist, so the VM is left paused, which is correct. What exactly is the bug here?
(Please don't forget about the libvirt bug discussed in https://bugzilla.redhat.com/1526025.)
Hi Milan, this was my misunderstanding of the feature (bug 1540548). I understood that an HA VM must be automatically restarted on another host when a blocked storage connection lasts too long, no matter what is configured for Resume Behavior.
Please re-check that this statement is correct:
Automatic restarting of a VM during blocked storage is relevant only for an HA VM (with or without a lease) that is configured with the 'KILL' Resume Behavior.
If this statement is correct, the bug could be closed.
(In reply to Polina from comment #17)
> Please re-check that this statement is correct:
> Automatic restarting of a VM during blocked storage is relevant only for
> an HA VM (with or without a lease) that is configured with the 'KILL'
> Resume Behavior.
Yes. Note that this doesn't mean that VMs without leases and with the KILL resume behavior are always restarted; e.g., if the host running the VM is unreachable, Engine can't restart them since their status is unknown (they may or may not have been killed). But that doesn't matter regarding this bug.
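The restart condition described above can be sketched as follows. This is NOT the actual ovirt-engine code; the class and field names are assumptions made purely for illustration:

```python
from dataclasses import dataclass

# Illustrative sketch of the restart decision described in this thread.
# Names are hypothetical; not taken from ovirt-engine sources.

@dataclass
class Vm:
    highly_available: bool
    resume_behavior: str  # "auto_resume", "leave_paused", or "kill"
    has_lease: bool

@dataclass
class Host:
    reachable: bool

def may_restart_elsewhere(vm: Vm, host: Host) -> bool:
    """Return True if Engine may destroy an EIO-paused VM and restart
    it on another host, per the behavior described in this thread."""
    if not vm.highly_available:
        return False
    # Only the KILL resume behavior lets Engine destroy a paused VM;
    # auto_resume and leave_paused keep it paused on the same host.
    if vm.resume_behavior != "kill":
        return False
    # A VM lease makes the restart safe even if the original host is
    # unreachable; without a lease Engine must be able to confirm the
    # VM was actually killed on its current host first.
    if vm.has_lease:
        return True
    return host.reachable
```

This matches the thread: an HA VM with auto_resume (as in this bug) is never restarted elsewhere, and a KILL VM without a lease is restarted only while its host is still reachable.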
> If this statement is correct, the bug could be closed