Bug 1574402

Summary:

I/O paused HA VM(nfs) with 'KILL' resume behavior is not restarted on another host

Product:

[oVirt] ovirt-engine

Reporter:

Polina <pagranat>

Component:

BLL.Virt

Assignee:

Michal Skrivanek <michal.skrivanek>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Polina <pagranat>

Severity:

medium

Priority:

unspecified

Version:

4.2.2

CC:

bugs, michal.skrivanek, pagranat

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Story Points:

---

Last Closed:

2018-09-03 11:37:12 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

Category:

---

oVirt Team:

Virt

Cloudforms Team:

---

Attachments:

Description	Flags
engine, vdsm, qemu logs	none
logs and screenshot	none

Description Polina 2018-05-03 08:26:21 UTC

Created attachment 1430556 [details]
engine, vdsm, qemu logs

Description of problem: An HA VM configured with the 'KILL' resume behavior must be automatically restarted on another healthy host when it is paused due to an I/O error.
In most cases it is. Sometimes (~30%), however, VMs on an NFS SD remain I/O paused and are not restarted.


Version-Release number of selected component (if applicable): rhv-release-4.2.3-4-001.noarch

How reproducible: ~30%


Steps to Reproduce:
1. Configure the 'KILL' resume behavior for a highly available VM on an NFS SD.
2. Block the storage on the host where the VM is running (iptables -I INPUT -s yellow-vdsb.qa.lab.tlv.redhat.com -j DROP).
3. Wait for some time (> 1 hr).

Actual results: Sometimes the HA VM with 'KILL' resume behavior is not restarted; it remains paused on the same host indefinitely.

Expected results: The HA VM with 'KILL' resume behavior must be restarted on another host while it is paused due to the I/O error.

Additional info: in the attached engine.log the scenario starts at:
2671 2018-05-03 10:42:14,821+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_0 has been paused due to storage I/O problem.
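
When going through the attached engine.log, the VM_PAUSED_EIO events can be located with a small script. The following is only a sketch: the regular expression is modelled on the single log line quoted above, and the function name find_eio_pauses is hypothetical, not part of any oVirt tool. Other engine.log layouts are not handled.

```python
import re
from datetime import datetime

# Assumption: the pattern is derived from the one engine.log line quoted
# in this report; it is not an official description of the log format.
EIO_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})[+-]\d{2}\s+"
    r".*EVENT_ID: VM_PAUSED_EIO\(145\), VM (?P<vm>\S+) has been paused"
)


def find_eio_pauses(lines):
    """Return a list of (timestamp, vm_name) for each VM_PAUSED_EIO event."""
    events = []
    for line in lines:
        match = EIO_RE.search(line)
        if match:
            # The timestamp is kept naive; the UTC offset is ignored here.
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, match.group("vm")))
    return events


# The log line quoted in the description, used as sample input.
sample = (
    "2671 2018-05-03 10:42:14,821+03 ERROR "
    "[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] "
    "(EE-ManagedThreadFactory-engineScheduled-Thread-92) [] "
    "EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_0 "
    "has been paused due to storage I/O problem."
)
events = find_eio_pauses([sample])
for ts, vm in events:
    print(ts.isoformat(), vm)
```

To scan a real log, the open file object can be passed instead of the sample list, for example find_eio_pauses(open("engine.log")). The resulting timestamps mark where to start looking for a subsequent restart of the same VM on another host.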

Comment 1 Polina 2018-05-03 08:34:50 UTC

This bug is related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1540548 (comments 31-35).

Comment 2 Michal Skrivanek 2018-05-04 13:28:06 UTC

Libvirt analysis: https://bugzilla.redhat.com/show_bug.cgi?id=1540548#c41

We've noticed there is another NFS mount with different settings, which would cause a hang for ~20 minutes; it is possible libvirt was stuck accessing that one. Can you please reproduce on a clean setup?

Comment 3 Polina 2018-05-06 14:00:51 UTC

The scenario was run on a clean setup. The environment has three different NFS mounts. Only one of them, nfs_0, is configured with Retransmissions=2, Timeout=1; the other NFS mounts have the default settings. Is there a problem with such a setup?

Comment 4 Michal Skrivanek 2018-05-07 13:07:02 UTC

(In reply to Polina from comment #3)
> The scenario was run on the clean setup. The environment has three different
> nfs mounts. only one of them nfs_0 is configured with the Retransmissions=2,
> Timeout=1. Other nfs mounts have the default settings. Is there a problem
> with such a setup?

According to the host logs, it seems to time out the same way as before the mount options were changed, at least at some point. It would be best if you really have only one mount, or use the same settings on all of them.

Comment 5 Polina 2018-05-10 06:49:37 UTC

Created attachment 1434230 [details]
logs and screenshot

Hi, the bug was reproduced again in an environment where all NFS SDs (including export_domain) are set with a small timeout.

Please see the new attachment, which includes engine.log, vdsm.log, the qemu log, the get vm response, and a VM screenshot.

Comment 6 Michal Skrivanek 2018-08-30 07:43:15 UTC

Polina, I guess this is no longer relevant and covered by other bugs in the meantime, right?

Comment 7 Polina 2018-09-02 07:16:09 UTC

Hi Michal, 
I no longer see the problem in upstream ovirt-release42-snapshot-4.2.6-0.3.rc3.20180826015005.git2aa33d5.el7.noarch or in downstream rhv-release-4.2.5-6-001.noarch.