Bug 1574402 - I/O paused HA VM(nfs) with 'KILL' resume behavior is not restarted on another host
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Michal Skrivanek
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-03 08:26 UTC by Polina
Modified: 2018-09-03 11:37 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-03 11:37:12 UTC
oVirt Team: Virt


Attachments (Terms of Use)
engine, vdsm, qemu logs (2.31 MB, application/x-gzip)
2018-05-03 08:26 UTC, Polina
logs and screenshot (4.00 MB, application/x-gzip)
2018-05-10 06:49 UTC, Polina

Description Polina 2018-05-03 08:26:21 UTC
Created attachment 1430556 [details]
engine, vdsm, qemu logs

Description of problem: An HA VM configured with the 'KILL' resume behavior must be automatically restarted on another healthy host when it is paused due to a storage I/O error.
In most cases it is. Sometimes (~30% of attempts) VMs on NFS storage domains remain I/O-paused and are not restarted.


Version-Release number of selected component (if applicable): rhv-release-4.2.3-4-001.noarch

How reproducible: ~30%


Steps to Reproduce:
1. Configure the 'KILL' resume behavior for a highly available VM on NFS storage.
2. Block the storage on the host where the VM is running (iptables -I INPUT -s yellow-vdsb.qa.lab.tlv.redhat.com -j DROP).
3. Wait for some time (> 1 hr).
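
The storage-blocking step above can be wrapped in a pair of small shell helpers (an illustrative sketch only, not part of the reported scenario; the server name is the one used in step 2, and the function names are hypothetical):

```shell
# Sketch of the reproduction's storage blocking/unblocking step.
# NFS_SERVER is the storage server from step 2; run as root on the host
# where the VM is running.
NFS_SERVER="yellow-vdsb.qa.lab.tlv.redhat.com"

block_storage() {
    # Drop all inbound packets from the NFS server, simulating an outage.
    iptables -I INPUT -s "$NFS_SERVER" -j DROP
}

unblock_storage() {
    # Delete the same rule to restore connectivity after the test.
    iptables -D INPUT -s "$NFS_SERVER" -j DROP
}
```

Remember to call unblock_storage (or flush the rule) once the test finishes, otherwise the host stays cut off from its storage.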

Actual results: Sometimes the HA VM with the 'KILL' resume behavior is not restarted; it remains paused on the same host indefinitely.

Expected results: The HA VM with the 'KILL' resume behavior must be restarted on another host when it enters an I/O error pause.

Additional info: in the attached engine.log the scenario starts at:
2671 2018-05-03 10:42:14,821+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_0 has been paused due to storage I/O problem.

Comment 1 Polina 2018-05-03 08:34:50 UTC
This bug is related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1540548 (comments 31-35).

Comment 2 Michal Skrivanek 2018-05-04 13:28:06 UTC
Libvirt analysis: https://bugzilla.redhat.com/show_bug.cgi?id=1540548#c41

We've noticed there is another NFS mount with different settings which would cause a hang for ~20 minutes; it is possible libvirt was stuck accessing that one. Can you please reproduce on a clean setup?

Comment 3 Polina 2018-05-06 14:00:51 UTC
The scenario was run on a clean setup. The environment has three different NFS mounts; only one of them, nfs_0, is configured with Retransmissions=2, Timeout=1. The other NFS mounts have the default settings. Is there a problem with such a setup?

Comment 4 Michal Skrivanek 2018-05-07 13:07:02 UTC
(In reply to Polina from comment #3)
> The scenario was run on the clean setup. The environment has three different
> nfs mounts. only one of them nfs_0 is configured with the Retransmissions=2,
> Timeout=1. Other nfs mounts have the default settings. Is there a problem
> with such a setup?

According to the host logs it seems to time out the same way as before changing the mount options, at least at some point. It would be best if you really only had one mount, or used the same settings on all of them.
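
One quick way to compare the mount settings across storage domains is to list only the timeout-related options of each NFS mount. A minimal sketch (filter_nfs_opts is a hypothetical helper, and the mount paths below are illustrative; on a real host its input would come from `findmnt -t nfs,nfs4 -n -o TARGET,OPTIONS`):

```shell
# filter_nfs_opts: read "mountpoint options" lines on stdin and print each
# mountpoint with only its timeo=/retrans= options, or "defaults" if the
# mount does not set them explicitly.
filter_nfs_opts() {
    while read -r target opts; do
        # Split the comma-separated option string and keep only the
        # retransmission-related options.
        kept=$(printf '%s\n' "$opts" | tr ',' '\n' |
               grep -E '^(timeo|retrans)=' | paste -sd, -)
        printf '%s %s\n' "$target" "${kept:-defaults}"
    done
}

# Example: two mounts, one with explicit timeout options, one without.
printf '%s\n' \
    '/rhev/data-center/mnt/nfs_0 rw,vers=4.1,timeo=600,retrans=2' \
    '/rhev/data-center/mnt/nfs_1 rw,vers=4.1' | filter_nfs_opts
```

Any mount that prints different timeo/retrans values than the others (or "defaults") is a candidate for the inconsistency discussed in this comment.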

Comment 5 Polina 2018-05-10 06:49:37 UTC
Created attachment 1434230 [details]
logs and screenshot

Hi, the bug reproduced again in an environment with all NFS SDs (including the export domain) set with a small timeout.

Please see the new attachment, which includes engine.log, vdsm.log, qemu logs, the get-VM response, and a VM screenshot.

Comment 6 Michal Skrivanek 2018-08-30 07:43:15 UTC
Polina, I guess this is no longer relevant and has been covered by other bugs in the meantime, right?

Comment 7 Polina 2018-09-02 07:16:09 UTC
Hi Michal, 
I no longer see the problem in upstream ovirt-release42-snapshot-4.2.6-0.3.rc3.20180826015005.git2aa33d5.el7.noarch or downstream rhv-release-4.2.5-6-001.noarch.

