Created attachment 1430556 [details]
engine, vdsm, qemu logs
Description of problem: An HA VM configured with the KILL resume behavior must be automatically restarted on another healthy host when it enters an I/O error pause.
In most cases it is. Sometimes (~30% of runs) VMs on an NFS storage domain remain I/O-paused and are not restarted.
Version-Release number of selected component (if applicable): rhv-release-4.2.3-4-001.noarch
Steps to Reproduce:
1. Configure the 'KILL' resume behavior for a highly available VM on NFS storage.
2. Block the storage on the host where the VM is running (iptables -I INPUT -s yellow-vdsb.qa.lab.tlv.redhat.com -j DROP).
3. Wait for some time (> 1 hour).
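The reproduction steps above can be sketched as a small script. Only the iptables rule and the NFS server hostname come from this report; the `run` helper and the `DRY_RUN` guard are illustrative additions so the sketch can be previewed without root privileges:

```shell
#!/bin/sh
# Sketch of the reproduction flow from the steps above.
# With DRY_RUN=1 (the default) the commands are only printed;
# set DRY_RUN=0 and run as root on the host to actually block storage.
NFS_SERVER=yellow-vdsb.qa.lab.tlv.redhat.com
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 2: drop inbound traffic from the NFS server so VM I/O stalls.
run iptables -I INPUT -s "$NFS_SERVER" -j DROP
# Step 3: wait more than an hour for the engine to act on the paused VM.
run sleep 3700
# Cleanup: restore connectivity after checking the VM state.
run iptables -D INPUT -s "$NFS_SERVER" -j DROP
```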
Actual results: Sometimes the HA VM with the 'KILL' resume behavior is not restarted and remains paused on the same host indefinitely.
Expected results: The HA VM with the 'KILL' resume behavior must be killed and restarted on another host once it enters the I/O error pause.
Additional info: in the attached engine.log the scenario starts at:
2671 2018-05-03 10:42:14,821+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-92)  EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_0 has been paused due to storage I/O problem.
This bug is related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1540548 (comments 31-35).
Libvirt analysis: https://bugzilla.redhat.com/show_bug.cgi?id=1540548#c41
We've noticed there is another NFS mount with different settings which would cause a hang for ~20 minutes; it is possible libvirt was stuck accessing that one. Can you please reproduce on a clean setup?
The scenario was run on a clean setup. The environment has three different NFS mounts; only one of them, nfs_0, is configured with Retransmissions=2, Timeout=1. The other NFS mounts have the default settings. Is there a problem with such a setup?
(In reply to Polina from comment #3)
> The scenario was run on the clean setup. The environment has three different
> nfs mounts. only one of them nfs_0 is configured with the Retransmissions=2,
> Timeout=1. Other nfs mounts have the default settings. Is there a problem
> with such a setup?
According to the host logs, it seems to time out the same way as before the mount options were changed, at least at some point. It would be best if you really had only one mount, or used the same settings on all of them.
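To act on the suggestion above, a quick way to spot mismatched NFS timeout settings on a host is to list the timeo/retrans options of every NFS mount. The `list_nfs_timeouts` helper below is an illustrative sketch that parses the standard /proc/mounts layout (device, mountpoint, fstype, options):

```shell
# List the timeo/retrans options of every NFS mount so that mismatched
# settings (like the single retrans=2/timeo=1 mount mentioned above)
# stand out at a glance. Takes an optional mounts file for testing.
list_nfs_timeouts() {
    awk '$3 ~ /^nfs/ {
        n = split($4, o, ",")
        line = $2
        for (i = 1; i <= n; i++)
            if (o[i] ~ /^(timeo|retrans)=/) line = line " " o[i]
        print line
    }' "${1:-/proc/mounts}"
}

list_nfs_timeouts
```

Running this on each host should print one line per NFS mount; any mount whose timeo/retrans values differ from the rest is a candidate for the libvirt hang discussed earlier.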
Created attachment 1434230 [details]
logs and screenshot
Hi, the bug was reproduced again in an environment with all NFS SDs (including the export domain) configured with a small timeout.
Please see the new attachment, which includes engine.log, vdsm.log, qemu logs, the Get VM response, and a VM screenshot.
Polina, I guess this is no longer relevant and covered by other bugs in the meantime, right?
I don't see the problem happening in upstream ovirt-release42-snapshot-4.2.6-0.3.rc3.20180826015005.git2aa33d5.el7.noarch or in downstream rhv-release-4.2.5-6-001.noarch.