Bug 1574402

Summary: I/O paused HA VM(nfs) with 'KILL' resume behavior is not restarted on another host
Product: [oVirt] ovirt-engine Reporter: Polina <pagranat>
Component: BLL.Virt Assignee: Michal Skrivanek <michal.skrivanek>
Status: CLOSED CURRENTRELEASE QA Contact: Polina <pagranat>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.2.2 CC: bugs, michal.skrivanek, pagranat
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-03 11:37:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                  Flags
engine, vdsm, qemu logs      none
logs and screenshot          none

Description Polina 2018-05-03 08:26:21 UTC
Created attachment 1430556 [details]
engine, vdsm, qemu logs

Description of problem: An HA VM configured with the KILL resume behavior must be automatically restarted on another healthy host when it is paused due to an I/O error.
In most cases it is. Sometimes (~30%), VMs on an NFS SD remain I/O paused and are not restarted.


Version-Release number of selected component (if applicable): rhv-release-4.2.3-4-001.noarch

How reproducible: ~30%


Steps to Reproduce:
1. Configure the 'KILL' resume behavior for a Highly Available NFS VM.
2. Block the storage on the host where the VM is running (iptables -I INPUT -s yellow-vdsb.qa.lab.tlv.redhat.com -j DROP).
3. Wait for some time (> 1 hr).

Actual results: Sometimes the HA VM with the 'KILL' resume behavior is not restarted. It remains paused on the same host forever.

Expected results: The HA VM with the 'KILL' resume behavior must be restarted on another host when it is paused due to an I/O error.

Additional info: in the attached engine.log the scenario starts at:
2671 2018-05-03 10:42:14,821+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_0 has been paused due to storage I/O problem.
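
To locate such events in a larger engine.log, a minimal sketch like the following may help. It assumes the audit-log line layout of the single sample line above (timestamp, then later `EVENT_ID: VM_PAUSED_EIO(<code>), VM <name> has been paused`). The names `EIO_RE` and `find_paused_eio` are illustrative and not part of ovirt-engine.

```python
import re

# Pattern derived from the one sample line quoted in this report; the layout
# is an assumption, not taken from any ovirt-engine specification.
EIO_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{2})\s+"
    r".*EVENT_ID: VM_PAUSED_EIO\(\d+\), VM (?P<vm>\S+) has been paused"
)


def find_paused_eio(lines):
    """Return (timestamp, vm_name) for every VM_PAUSED_EIO audit event."""
    events = []
    for line in lines:
        match = EIO_RE.search(line)
        if match:
            events.append((match.group("ts"), match.group("vm")))
    return events


sample = (
    "2018-05-03 10:42:14,821+03 ERROR "
    "[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] "
    "(EE-ManagedThreadFactory-engineScheduled-Thread-92) [] "
    "EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_0 "
    "has been paused due to storage I/O problem."
)
print(find_paused_eio([sample]))
```

On a real log, the same function can be fed an open file object, for example `find_paused_eio(open("engine.log"))`.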

Comment 1 Polina 2018-05-03 08:34:50 UTC
This bug is related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1540548 (comments 31-35).

Comment 2 Michal Skrivanek 2018-05-04 13:28:06 UTC
Libvirt analysis: https://bugzilla.redhat.com/show_bug.cgi?id=1540548#c41

We've noticed there is another NFS mount with different settings, which would cause a hang for ~20 minutes; it is possible libvirt was stuck accessing that one. Can you please reproduce on a clean setup?

Comment 3 Polina 2018-05-06 14:00:51 UTC
The scenario was run on a clean setup. The environment has three different NFS mounts. Only one of them, nfs_0, is configured with Retransmissions=2, Timeout=1. The other NFS mounts have the default settings. Is there a problem with such a setup?

Comment 4 Michal Skrivanek 2018-05-07 13:07:02 UTC
(In reply to Polina from comment #3)
> The scenario was run on the clean setup. The environment has three different
> nfs mounts. only one of them nfs_0 is configured with the Retransmissions=2,
> Timeout=1. Other nfs mounts have the default settings. Is there a problem
> with such a setup?

According to the host logs, it seems to time out the same way as before the mount options were changed, at least at some point. It would be best if you really only had one mount, or used the same settings on all of them.
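
One way to check whether all NFS mounts on a host share the same timeout settings is sketched below. It assumes that the Timeout and Retransmissions settings mentioned above correspond to the NFS `timeo` and `retrans` mount options, and that the input is in `/proc/mounts` format. The function names, server name, and mount paths in the sample are hypothetical.

```python
def nfs_timeout_options(mounts_text):
    """Map each NFS mount point to its (timeo, retrans) option strings."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for opt in fields[3].split(","):
            key, _, value = opt.partition("=")
            opts[key] = value
        result[fields[1]] = (opts.get("timeo"), opts.get("retrans"))
    return result


def settings_differ(mounts_text):
    """True if the NFS mounts do not all use the same (timeo, retrans) pair."""
    return len(set(nfs_timeout_options(mounts_text).values())) > 1


# Hypothetical sample in /proc/mounts format (not taken from the attached logs).
sample = """\
nfs.example.com:/export/nfs_0 /mnt/nfs_0 nfs rw,relatime,vers=3,timeo=1,retrans=2 0 0
nfs.example.com:/export/nfs_1 /mnt/nfs_1 nfs rw,relatime,vers=3,timeo=600,retrans=6 0 0
/dev/sda1 / ext4 rw,relatime 0 0
"""
print(nfs_timeout_options(sample))
print(settings_differ(sample))
```

On a host, the text could come from `open("/proc/mounts").read()`; a `True` result would point at the mixed-settings situation discussed in this comment.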

Comment 5 Polina 2018-05-10 06:49:37 UTC
Created attachment 1434230 [details]
logs and screenshot

Hi, the bug reproduced again in an environment where all NFS SDs (including export_domain) are set with a small timeout.

Please see the new attachment, which includes engine.log, vdsm.log, the qemu log, the get VM response, and a VM screenshot.

Comment 6 Michal Skrivanek 2018-08-30 07:43:15 UTC
Polina, I guess this is no longer relevant and covered by other bugs in the meantime, right?

Comment 7 Polina 2018-09-02 07:16:09 UTC
Hi Michal,
I no longer see the problem in upstream ovirt-release42-snapshot-4.2.6-0.3.rc3.20180826015005.git2aa33d5.el7.noarch or downstream rhv-release-4.2.5-6-001.noarch.