Bug 1574402 - I/O paused HA VM(nfs) with 'KILL' resume behavior is not restarted on another host
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.2.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Michal Skrivanek
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-03 08:26 UTC by Polina
Modified: 2018-09-03 11:37 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-03 11:37:12 UTC
oVirt Team: Virt


Attachments (Terms of Use)
engine, vdsm, qemu logs (2.31 MB, application/x-gzip)
2018-05-03 08:26 UTC, Polina
logs and screenshot (4.00 MB, application/x-gzip)
2018-05-10 06:49 UTC, Polina

Description Polina 2018-05-03 08:26:21 UTC
Created attachment 1430556 [details]
engine, vdsm, qemu logs

Description of problem: An HA VM configured with the 'KILL' resume behavior must be automatically restarted on another healthy host when it is paused due to a storage I/O error.
In most cases it is. Sometimes (~30% of attempts) VMs on NFS storage domains remain I/O-paused and are not restarted.


Version-Release number of selected component (if applicable): rhv-release-4.2.3-4-001.noarch

How reproducible: ~30%


Steps to Reproduce:
1. Configure the 'KILL' resume behavior for a highly available VM on NFS storage.
2. Block the storage on the host where the VM is running (iptables -I INPUT -s yellow-vdsb.qa.lab.tlv.redhat.com -j DROP).
3. Wait for some time (> 1 hr).
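
The storage-blocking step above can be wrapped in a pair of small shell helpers (an illustrative sketch only, not part of the reported scenario; the server name is the one used in step 2, and the function names are hypothetical):

```shell
# Sketch of the reproduction's storage blocking/unblocking step.
# NFS_SERVER is the storage server from step 2; run as root on the host
# where the VM is running.
NFS_SERVER="yellow-vdsb.qa.lab.tlv.redhat.com"

block_storage() {
    # Drop all inbound packets from the NFS server, simulating an outage.
    iptables -I INPUT -s "$NFS_SERVER" -j DROP
}

unblock_storage() {
    # Delete the same rule to restore connectivity after the test.
    iptables -D INPUT -s "$NFS_SERVER" -j DROP
}
```

Remember to call unblock_storage (or flush the rule) once the test finishes, otherwise the host stays cut off from its storage.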

Actual results: Sometimes the HA VM with the 'KILL' resume behavior is not restarted; it remains paused on the same host indefinitely.

Expected results: The HA VM with the 'KILL' resume behavior must be restarted on another host when it enters an I/O error pause.

Additional info: in the attached engine.log the scenario starts at:
2671 2018-05-03 10:42:14,821+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_0 has been paused due to storage I/O problem.

Comment 1 Polina 2018-05-03 08:34:50 UTC
This bug is related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1540548 (comments 31-35).

Comment 2 Michal Skrivanek 2018-05-04 13:28:06 UTC
Libvirt analysis: https://bugzilla.redhat.com/show_bug.cgi?id=1540548#c41

We've noticed there is another NFS mount with different settings which would cause a hang for ~20 minutes; it is possible libvirt was stuck accessing that one. Can you please reproduce on a clean setup?

Comment 3 Polina 2018-05-06 14:00:51 UTC
The scenario was run on a clean setup. The environment has three different NFS mounts; only one of them, nfs_0, is configured with Retransmissions=2, Timeout=1. The other NFS mounts have the default settings. Is there a problem with such a setup?

Comment 4 Michal Skrivanek 2018-05-07 13:07:02 UTC
(In reply to Polina from comment #3)
> The scenario was run on the clean setup. The environment has three different
> nfs mounts. only one of them nfs_0 is configured with the Retransmissions=2,
> Timeout=1. Other nfs mounts have the default settings. Is there a problem
> with such a setup?

According to the host logs it seems to time out the same way as before changing the mount options, at least at some point. It would be best if you really only had one mount, or used the same settings on all of them.
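
One quick way to compare the mount settings across storage domains is to list only the timeout-related options of each NFS mount. A minimal sketch (filter_nfs_opts is a hypothetical helper, and the mount paths below are illustrative; on a real host its input would come from `findmnt -t nfs,nfs4 -n -o TARGET,OPTIONS`):

```shell
# filter_nfs_opts: read "mountpoint options" lines on stdin and print each
# mountpoint with only its timeo=/retrans= options, or "defaults" if the
# mount does not set them explicitly.
filter_nfs_opts() {
    while read -r target opts; do
        # Split the comma-separated option string and keep only the
        # retransmission-related options.
        kept=$(printf '%s\n' "$opts" | tr ',' '\n' |
               grep -E '^(timeo|retrans)=' | paste -sd, -)
        printf '%s %s\n' "$target" "${kept:-defaults}"
    done
}

# Example: two mounts, one with explicit timeout options, one without.
printf '%s\n' \
    '/rhev/data-center/mnt/nfs_0 rw,vers=4.1,timeo=600,retrans=2' \
    '/rhev/data-center/mnt/nfs_1 rw,vers=4.1' | filter_nfs_opts
```

Any mount that prints different timeo/retrans values than the others (or "defaults") is a candidate for the inconsistency discussed in this comment.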

Comment 5 Polina 2018-05-10 06:49:37 UTC
Created attachment 1434230 [details]
logs and screenshot

Hi, the bug reproduced again in an environment with all NFS SDs (including the export domain) set with a small timeout.

Please see the new attachment, which includes engine.log, vdsm.log, qemu logs, the get-VM response, and a VM screenshot.

Comment 6 Michal Skrivanek 2018-08-30 07:43:15 UTC
Polina, I guess this is no longer relevant and has been covered by other bugs in the meantime, right?

Comment 7 Polina 2018-09-02 07:16:09 UTC
Hi Michal, 
I no longer see the problem in upstream ovirt-release42-snapshot-4.2.6-0.3.rc3.20180826015005.git2aa33d5.el7.noarch or downstream rhv-release-4.2.5-6-001.noarch.

