Bug 1540548

Summary:

[RFE] Automatically restart HA VMs paused due to I/O Error

Product:

Red Hat Enterprise Virtualization Manager

Reporter:

Miguel Martin <mmartinv>

Component:

ovirt-engine

Assignee:

Milan Zamazal <mzamazal>

Status:

CLOSED ERRATA

QA Contact:

Polina <pagranat>

Severity:

high

Docs Contact:

Priority:

high

Version:

4.2.0

CC:

apinnick, bburmest, bugs, fgarciad, jdenemar, jsuchane, lsurette, mavital, michal.skrivanek, mkalinin, mtessun, rbalakri, Rhev-m-bugs, srevivo, ykaul

Target Milestone:

ovirt-4.2.2

Keywords:

FutureFeature

Target Release:

---

Flags:

lsvaty: testing_plan_complete-

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Enhancement

Doc Text:

Previously, virtual machines that were paused for too long due to I/O errors were only killed when the engine tried to restart them. The current release adds a default setting, as part of high availability, that kills virtual machines that are paused too long because of I/O error, regardless of whether or when they will be resumed. This allows paused highly available virtual machines to migrate and be restarted.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-05-15 17:48:28 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Virt

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1543103, 1546967

Bug Blocks:

1538360

Attachments:

Description	Flags
logs	none
vdsm engine qemu log for HA vm(KILL) that is not restarted while the storage error	none

Description Miguel Martin 2018-01-31 11:31:00 UTC

Description of problem:

In an RHV installation with block-based storage domains, an HA VM running 
on a host is paused forever when the host loses its access to the storage domain.

Version-Release number of selected component (if applicable): 4.x

How reproducible: Always

Steps to Reproduce:

1. Install RHV with at least one block-based storage domain and 2 hosts.
2. Install a VM with its OS disk within the block-based storage domain.
3. Configure the VM as HA.
4. Boot the HA VM in host 1
5. Make the block-based storage domain which contains the VM OS disk unreachable from host 1.

Actual results:

The VM is paused due to an I/O Error, the host is marked as 'Non-operational' and the engine periodically tries to activate the host and to migrate the paused VM (failing because migrating a paused VM is not supported)

Expected results:

The expected result would be the HA VM being killed on host 1 and started on host 2 after a reasonable amount of time if it is not possible to resume the VM on host 1.

Additional info:

In this scenario only SD is unreachable and the manager is still communicating with vdsm, so the state of the VM is known.

To reproduce the issue I used an iSCSI storage domain and to simulate the storage failure the following iptables rule to block outgoing traffic to iSCSI destination port:
~~~
iptables -A OUTPUT -p tcp --dport 3260 -j REJECT
~~~
The same behavior should apply to FC backed storage domains.

Comment 1 Yaniv Kaul 2018-01-31 15:17:10 UTC

Why 4.3? Sounds like a regression even?

Comment 3 Michal Skrivanek 2018-02-01 09:07:49 UTC

(In reply to Yaniv Kaul from comment #1)
> Why 4.3? Sounds like a regression even?

no, it's an RFE actually.
We had a pre-existing feature in this area which automatically resumes Paused VMs on SD reconnect.
The HA with leases functionality was specifically requested to handle loss of connectivity between the manager and host - that works, as soon as we go through the timeouts and set VM to "Unknown" because we're not able to reach the host we restart it elsewhere. This happens regardless if the VM was Paused or it was killed by sanlock. For the former case the VM is actually killed only once the Resume operation is attempted

Here it's different - there is no problem with communication to host. Miguel asks for "restart VM somewhere else if it gets Paused for x secs".

We can use some periodic action within VDSM (e.g. Host Monitoring) and define an interval similar to sanlock's and kill the VM when requested. This would trigger HA restart since the communication with engine supposedly works

Questions:
- should the timeout be configurable or just a vdsm.conf value similar to sanlock
- should it be the default behavior of HA VMs with leases or does it need frther configuration

Comment 4 Marina Kalinin 2018-02-01 19:25:37 UTC

BZ#1230788 -> (oVirt_turn_off_autoresume_of_paused_VMs) [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with defined time out time)  -> target to 4.2.0

And the d/s bug: bz#1230788.

Michal, I think we got it covered in 4.2, am I wrong?

Comment 19 Michal Skrivanek 2018-02-21 09:01:04 UTC

since it completes teh HA functionality quite nicely, and the patch looks good already, let's target this for 4.2.3

Comment 20 Yaniv Kaul 2018-03-04 11:47:09 UTC

Michal:
1. Do we need https://gerrit.ovirt.org/#/c/87200/ as well?
2. Is it on 4.2.2 or 4.2.3?

Comment 21 Michal Skrivanek 2018-03-04 12:07:03 UTC

(In reply to Yaniv Kaul from comment #20)
> Michal:
> 1. Do we need https://gerrit.ovirt.org/#/c/87200/ as well?

No

> 2. Is it on 4.2.2 or 4.2.3?

4.2.2

Comment 23 Polina 2018-03-24 18:26:52 UTC

The Part 2 in the plan is related to the enhanced feature.

Comment 24 Polina 2018-04-30 08:19:47 UTC

Hi Michal, 

the bug is reported for block-base storage domains.
Implemented as following: the HA VM (both - with/without lease) configured with 'KILL' Resume Behavior, staying in I/O Error Pause for more than 80 sec, must be restarted on another host.

The verification is done for rhv-release-4.2.3-4-001.noarch on four types of storages: 

1. ISCSI, FC, Gluster - fully verified since the VM I/O Pause caused by the following:  
    - ISCSI and Gluster: dropping the rule by with iptables command;
    - FC: making LUN path faulty (like echo "offline" > /sys/block/sdd/device/state).

2. For NFS the verification has been done like the following:
   The I/O Pause caused by changing /etc/exports file on nfs-server while there is a writing process on VM. In these scenarios, the NFS VM turns to I/O Paused state. Then tries to be restarted on another healthy host. The attempt to be started though fails because of the problematic storage domain.

Please confirm if this is enough for verification

Comment 25 Yaniv Kaul 2018-04-30 08:42:36 UTC

A better test for NFS would be to simply shut down (abruptly or non-abruptly) the NFS server, as well as disconnecting it from the network (which is kind of drop with iptables).

Comment 26 Polina 2018-04-30 10:11:22 UTC

stop of nfs-server doesn't cause I/o error pause (tried this scenario). Even if it would, it is similar the scenario in which the HA VM could only try to restart and fail since the storage is unavailable (as described verification step 2.).

Comment 27 Michal Skrivanek 2018-04-30 10:31:39 UTC

(In reply to Polina from comment #26)
> stop of nfs-server doesn't cause I/o error pause (tried this scenario). Even
> if it would, it is similar the scenario in which the HA VM could only try to
> restart and fail since the storage is unavailable (as described verification
> step 2.).

sure, this needs to be blocked from client side so other hosts can continue working. But the verification of that is going to run into the issues that you won't see IOError before the host is fenced/NonOperational, so as discussed in other bugs you'd need to test this with shorter timeouts than the defaults we use.

Comment 28 Polina 2018-04-30 11:22:07 UTC

Michal, please specify which timeouts do you mean? on vdsm side?

Comment 29 Michal Skrivanek 2018-04-30 12:09:55 UTC

the same as in https://bugzilla.redhat.com/show_bug.cgi?id=1481022#c57

Comment 30 Polina 2018-04-30 15:26:03 UTC

Created attachment 1428847 [details]
logs

Comment 31 Polina 2018-04-30 15:28:43 UTC

Hi, here is the summary - 

The feature is verified for ISCSI, Gluster, FC as described in https://bugzilla.redhat.com/show_bug.cgi?id=1540548#c24

For NFS I succeeded to get IO Error Pause changing the Retransmissions & Timeout parameters for SD.
here are steps:
   1. Put the SD in maintenance(by Data Center)
   2. Open Storage Domains/Manage Domain/Custom Connection Parameters 
   3. Change the following parameters:
	Retransmissions (#) = 2
	Timeout (deciseconds) = 1 (i.e.10 sec)
   4. Activate the SD. 
   5. Run the VM associated with this SD.

However, the nfs HA VM configured with 'KILL', could not be successfully restarted to another host.
Please see the attached logs demonstrating the situation.
The scenario starts at 
 2018-04-30 18:03:56,173+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-13) [] EVENT_ID: VM_PAUSED_EIO(145), VM test1 has been paused due to storage I/O problem.

Then the VM fails to migrate - 
2018-04-30 18:04:22,625+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-752) [40d41ef] Failed to migrate VM 'test1'

Then it gets the Not responding status
2018-04-30 18:04:26,237+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-36) [] VM 'd6fce261-e7d3-4965-a523-969bcdf9b1f2'(test1) moved from 'Paused' --> 'NotResponding'

And then it returns to I/O error again. It is never restarted.

you can see now the scenario on compute-ge-3.scl.lab.tlv.redhat.com.

Comment 32 Michal Skrivanek 2018-04-30 18:34:58 UTC

(In reply to Polina from comment #31)

> you can see now the scenario on compute-ge-3.scl.lab.tlv.redhat.com.

libvirt looks completely stuck, it also failed to destroy the VM so it keeps trying to do that over and over again. Libvirt team need to investigate this, but they likely need a clear reproduction scenario or setup where it's still happening. 
Jardo, can anyone take a look please? It's on NFS but it supposedly should return errors just fine (mounted with a short timeout)

Comment 33 Polina 2018-05-01 07:05:05 UTC

Created attachment 1429098 [details]
vdsm engine qemu log for HA vm(KILL) that is not restarted while the storage error

Hi,

I've restarted the vdsm and engine. serveral times after this the scenario worked as expected. but then the stuck vm problem was reproduced again.
In the attached logs (engine, vdsm, qemu) the problematic scenario starts at 2018-05-01 09:47:19:
2018-05-01 09:47:19,274+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] EVENT_ID: VM_PAUSED(1,025), VM test1 has been paused.
2018-05-01 09:47:19,401+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] EVENT_ID: VM_PAUSED_EIO(145), VM test1 has been paused due to storage I/O problem.
...

the logs attached
##################################

In the same log we can see that for I/O error that happened before , at 2018-05-01 09:38:47,941+03, the scenario works as expected  - the VM is restarted on host_mixed_3. thought there is a migration error before the restart, that looks suspected:

2018-05-01 09:38:47,941+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [] EVENT_ID: VM_PAUSED_EIO(145), VM test1 has been paused due to storage I/O problem.
2018-05-01 09:40:17,750+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-327) [5742b069] Failed to migrate VM 'test1'
2018-05-01 09:40:48,208+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-76) [] VM 'd6fce261-e7d3-4965-a523-969bcdf9b1f2'(test1) is running in db and not running on VDS '367a2980-6f7f-4e33-93e0-6b400055cd79'(host_mixed_2)
2018-05-01 09:40:48,209+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-76) [] add VM 'd6fce261-e7d3-4965-a523-969bcdf9b1f2'(test1) to HA rerun treatment
2018-05-01 09:40:48,229+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-76) [] EVENT_ID: HA_VM_FAILED(9,602), Highly Available VM test1 failed. It will be restarted automatically.
2018-05-01 09:40:48,230+03 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engineScheduled-Thread-76) [] Highly Available VM went down. Attempting to restart. VM Name 'test1', VM Id 'd6fce261-e7d3-4965-a523-969bcdf9b1f2'
2018-05-01 09:40:49,416+03 INFO  [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [55917cd4] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='a7b25efc-9024-4311-ab40-2c305805627d', vmId='d6fce261-e7d3-4965-a523-969bcdf9b1f2', vm='VM [test1]'}), log id: 3427456b
2018-05-01 09:40:49,425+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [55917cd4] START, CreateBrokerVDSCommand(HostName = host_mixed_3, CreateVDSCommandParameters:{hostId='a7b25efc-9024-4311-ab40-2c305805627d', vmId='d6fce261-e7d3-4965-a523-969bcdf9b1f2', vm='VM [test1]'}), log id: 534b7abd
  <name>test1</name>
2018-05-01 09:40:49,597+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [55917cd4] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM test1 on Host host_mixed_3

Comment 34 Polina 2018-05-01 08:50:59 UTC

Hi Michal,

need you advise.
The feature works in general for all kinds of storages, and only in nfs we see this specific problem, which not always happens, but happens a lot (60-70%).
So, I think I can verify this BZ and just open a new bug for this scenario.

Need your confirmation for this.

Comment 35 Michal Skrivanek 2018-05-02 11:18:30 UTC

sounds good to me

Comment 36 Michal Skrivanek 2018-05-02 11:19:28 UTC

adding back needinfo from comment #32

Comment 37 Michal Skrivanek 2018-05-03 07:41:38 UTC

(In reply to Polina from comment #33)
you must to start from the beginning.

I see the VM was started for the first time at 09:28:29,458 on host_mixed_2, paused with EIO at 09:30:37,904, killed at 09:32:31,020 (detected in engine 15s later), rerun on host_mixed_3 at 09:32:48,131.
The migration was initiated at 09:35:42,392 from host_mixed_3 to host_mixed_2 _manually_, and finished migrating at 09:36:13 (detected in engine 09:36:32).
09:38:45 another IOError was reported and the VM was killed at 09:40:39 (detected in engine 09:40:48) and VM was restarted on host_mixed_3 again, and again _manually_ migrated to host_mixed_2 at 09:45:44 till 09:45:52(09:46:03) and again paused EIO at 09:47:17(09:47:19)
At 09:48:49 it becomes NonResponsive and starts flipping between that and Paused. At 09:48:46 vdsm attempts to kill that VM but that fails at 09:49:01,182 and from that time on the VM behavior is unpredictable. The flipping is a consequence of failure in the libvirt destroy call and AFAICT it is accurate - not responding for the duration of that call.

host_mixed_2:
09:35:43 - NonOperational
09:40:00 - Up
09:40:17 - NonOperational
09:44:01 - Up
09:52:14 - NonOperational

Comment 38 Michal Skrivanek 2018-05-03 07:56:55 UTC

I see you've been running that VM before that time as well, I also see the VM ultimately crashed in QEMU at 10:05:36 (not covered by your logs)

Again, I do not see anything wrong, libvirt call seem to get stuck, needs to be investigated by libvirt team (adding jiri as well)

Comment 39 Polina 2018-05-03 08:49:05 UTC

I've reproduced the problem and inserted new bug for this issue https://bugzilla.redhat.com/show_bug.cgi?id=1574402 adding new logs and reference to this BZ comments 31-35

Comment 40 Polina 2018-05-03 10:39:42 UTC

Since the new bug is added for the specific issue described in comments 31-35, I verify the BZ on rhv-release-4.2.3-4-001.noarch. 
The verification summary - tested on four kinds of storages: ISCSI, FC, Gluster, NFS.

the I/O error was caused by the following:
    - ISCSI and Gluster: dropping the rule by with iptables command;
    - FC: making LUN path faulty (like echo "offline" > /sys/block/sdd/device/state).
    - NFS: by dropping the rule by with iptables command while SD is configured with small Retransmissions & Timeout (see comment 31 for details);

Comment 41 Jiri Denemark 2018-05-03 13:54:39 UTC

So in the NFS case described in comment 37, libvirt has unable to kill the
QEMU process for ~15 minutes and virProcessKillPainfully was reporting "Failed
to terminate process 12620 with SIGKILL: Device or resource busy". Then QEMU
died by itself. Looks like the process was stuck on I/O during that time.
There's nothing libvirt can do if kernel doesn't want to kill a process.

Comment 45 errata-xmlrpc 2018-05-15 17:48:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 46 Franta Kust 2019-05-16 13:04:02 UTC

BZ<2>Jira Resync