Bug 1540548
Summary: [RFE] Automatically restart HA VMs paused due to I/O Error

Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 4.2.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Miguel Martin <mmartinv>
Assignee: Milan Zamazal <mzamazal>
QA Contact: Polina <pagranat>
CC: apinnick, bburmest, bugs, fgarciad, jdenemar, jsuchane, lsurette, mavital, michal.skrivanek, mkalinin, mtessun, rbalakri, Rhev-m-bugs, srevivo, ykaul
Target Milestone: ovirt-4.2.2
Target Release: ---
Keywords: FutureFeature
Flags: lsvaty: testing_plan_complete-
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Doc Text: Previously, virtual machines that were paused for too long due to I/O errors were only killed when the engine tried to restart them. The current release adds a default setting, as part of high availability, that kills virtual machines that are paused too long because of an I/O error, regardless of whether or when they would be resumed. This allows paused highly available virtual machines to be migrated and restarted.
Story Points: ---
Last Closed: 2018-05-15 17:48:28 UTC
Type: Bug
oVirt Team: Virt
Bug Depends On: 1543103, 1546967
Bug Blocks: 1538360
Description (Miguel Martin, 2018-01-31 11:31:00 UTC)
Why 4.3? Sounds like a regression even?

(In reply to Yaniv Kaul from comment #1)
> Why 4.3? Sounds like a regression even?

No, it's actually an RFE. We had a pre-existing feature in this area that automatically resumes paused VMs on storage domain reconnect. The HA-with-leases functionality was specifically requested to handle loss of connectivity between the manager and the host. That works: as soon as we go through the timeouts and set the VM to "Unknown" because we are not able to reach the host, we restart it elsewhere. This happens regardless of whether the VM was paused or killed by sanlock; in the former case the VM is actually killed only once the resume operation is attempted.

Here it's different: there is no problem with communication to the host. Miguel asks for "restart the VM somewhere else if it stays Paused for x seconds". We can use some periodic action within VDSM (e.g. host monitoring), define an interval similar to sanlock's, and kill the VM when requested. This would trigger an HA restart, since communication with the engine supposedly works.

Questions:
- Should the timeout be configurable, or just a vdsm.conf value similar to sanlock's?
- Should it be the default behavior of HA VMs with leases, or does it need further configuration?

BZ#1230788 (oVirt_turn_off_autoresume_of_paused_VMs): [RFE] Have a policy for autoresume of VMs paused due to IO errors (stay paused, turn off, restart with a defined timeout), targeted to 4.2.0. And the d/s bug: bz#1230788.

Michal, I think we got it covered in 4.2, am I wrong?

Since it completes the HA functionality quite nicely, and the patch looks good already, let's target this for 4.2.3.

Michal:
1. Do we need https://gerrit.ovirt.org/#/c/87200/ as well?
2. Is it on 4.2.2 or 4.2.3?

(In reply to Yaniv Kaul from comment #20)
> 1. Do we need https://gerrit.ovirt.org/#/c/87200/ as well?
No
> 2. Is it on 4.2.2 or 4.2.3?
4.2.2. The Part 2 in the plan is related to the enhanced feature.
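The policy sketched above (a periodic check that kills a VM paused on I/O error once it has been paused longer than a configurable timeout, so the engine's HA logic restarts it elsewhere) can be illustrated roughly as follows. This is a hypothetical sketch, not VDSM's actual code; the names PausedVm, PAUSE_TIMEOUT, and check_paused_vms are all illustrative.

```python
import time

# Timeout after which a VM paused on EIO is killed; 80 s matches the value
# mentioned in the verification below. In VDSM this would be a config value.
PAUSE_TIMEOUT = 80  # seconds

class PausedVm:
    """Illustrative stand-in for a VM tracked as paused on I/O error."""
    def __init__(self, vm_id, paused_at):
        self.vm_id = vm_id
        self.paused_at = paused_at  # monotonic timestamp of the pause
        self.killed = False

    def kill(self):
        # In VDSM this would destroy the qemu process, which lets the
        # engine's HA logic restart the VM on another host.
        self.killed = True

def check_paused_vms(vms, now=None):
    """Periodic action: kill every VM paused longer than PAUSE_TIMEOUT."""
    now = now if now is not None else time.monotonic()
    killed = []
    for vm in vms:
        if not vm.killed and now - vm.paused_at >= PAUSE_TIMEOUT:
            vm.kill()
            killed.append(vm.vm_id)
    return killed
```

Such a check would run from an existing periodic loop (e.g. host monitoring, as suggested above), so no new thread is needed.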
Hi Michal, the bug is reported for block-based storage domains. It is implemented as follows: an HA VM (with or without a lease) configured with the 'KILL' resume behavior that stays in I/O error pause for more than 80 seconds must be restarted on another host.

The verification was done on rhv-release-4.2.3-4-001.noarch with four types of storage:

1. iSCSI, FC, Gluster: fully verified, since the VM I/O pause could be caused as follows:
   - iSCSI and Gluster: dropping traffic with an iptables rule;
   - FC: making the LUN path faulty (e.g. echo "offline" > /sys/block/sdd/device/state).
2. For NFS the verification was done as follows: the I/O pause was caused by changing /etc/exports on the NFS server while a process on the VM was writing. In this scenario, the NFS VM goes into the I/O error paused state and then tries to be restarted on another healthy host. The start attempt fails, though, because of the problematic storage domain.

Please confirm whether this is enough for verification.

A better test for NFS would be to simply shut down the NFS server (abruptly or non-abruptly), or to disconnect it from the network (which is similar to dropping traffic with iptables).

Stopping the nfs-server does not cause an I/O error pause (I tried this scenario). Even if it did, it would be similar to the scenario in which the HA VM can only try to restart and fail, since the storage is unavailable (as described in verification step 2).

(In reply to Polina from comment #26)
> stop of nfs-server doesn't cause I/O error pause (tried this scenario).

Sure, this needs to be blocked from the client side so other hosts can continue working.
But the verification of that is going to run into the issue that you won't see an IOError before the host is fenced/NonOperational, so as discussed in other bugs you'd need to test this with shorter timeouts than the defaults we use.

Michal, please specify which timeouts you mean. On the vdsm side?

The same as in https://bugzilla.redhat.com/show_bug.cgi?id=1481022#c57

Created attachment 1428847 [details]
logs
Hi, here is the summary. The feature is verified for iSCSI, Gluster, and FC as described in https://bugzilla.redhat.com/show_bug.cgi?id=1540548#c24. For NFS I managed to get an I/O error pause by changing the Retransmissions & Timeout parameters of the storage domain. The steps:

1. Put the SD in maintenance (via Data Center).
2. Open Storage Domains / Manage Domain / Custom Connection Parameters.
3. Change the following parameters: Retransmissions (#) = 2, Timeout (deciseconds) = 1.
4. Activate the SD.
5. Run the VM associated with this SD.

However, the NFS HA VM configured with 'KILL' could not be successfully restarted on another host. Please see the attached logs demonstrating the situation. The scenario starts at:

2018-04-30 18:03:56,173+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-13) [] EVENT_ID: VM_PAUSED_EIO(145), VM test1 has been paused due to storage I/O problem.

Then the VM fails to migrate:

2018-04-30 18:04:22,625+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-752) [40d41ef] Failed to migrate VM 'test1'

Then it gets the NotResponding status:

2018-04-30 18:04:26,237+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-36) [] VM 'd6fce261-e7d3-4965-a523-969bcdf9b1f2'(test1) moved from 'Paused' --> 'NotResponding'

And then it returns to I/O error again. It is never restarted.

You can see the scenario now on compute-ge-3.scl.lab.tlv.redhat.com.

(In reply to Polina from comment #31)
> you can see now the scenario on compute-ge-3.scl.lab.tlv.redhat.com.

libvirt looks completely stuck; it also failed to destroy the VM, so it keeps trying to do that over and over again. The libvirt team needs to investigate this, but they likely need a clear reproduction scenario or a setup where it is still happening. Jardo, can anyone take a look please?
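For reference, the Custom Connection Parameters fields used in the steps above correspond to the standard NFS client mount options retrans= and timeo= described in nfs(5) (timeo is expressed in deciseconds). A minimal sketch of an equivalent manual mount, with placeholder server, export, and mount point; whether the soft option applies exactly as vdsm mounts it is an assumption here:

```shell
# Sketch only: nfs-server:/export and /mnt/sd are placeholders.
# soft makes the client surface an I/O error after the retries are
# exhausted instead of retrying forever; timeo=1 is 1 decisecond.
OPTS="soft,retrans=2,timeo=1"
CMD="mount -t nfs -o ${OPTS} nfs-server:/export /mnt/sd"
echo "$CMD"
```

With retries this short, any brief interruption of the NFS server surfaces as an I/O error to the guest, which is what drives the VM into the EIO-paused state being tested.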
It's on NFS, but it supposedly should return errors just fine (mounted with a short timeout).

Created attachment 1429098 [details]
vdsm, engine, and qemu logs for an HA VM (KILL) that is not restarted during the storage error
Hi,
I've restarted vdsm and the engine. Several times after this the scenario worked as expected, but then the stuck-VM problem was reproduced again.
In the attached logs (engine, vdsm, qemu) the problematic scenario starts at 2018-05-01 09:47:19:
2018-05-01 09:47:19,274+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] EVENT_ID: VM_PAUSED(1,025), VM test1 has been paused.
2018-05-01 09:47:19,401+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] EVENT_ID: VM_PAUSED_EIO(145), VM test1 has been paused due to storage I/O problem.
...
The logs are attached.
##################################
In the same log we can see that for an I/O error that happened earlier, at 2018-05-01 09:38:47,941+03, the scenario works as expected: the VM is restarted on host_mixed_3, though there is a migration error before the restart that looks suspicious:
2018-05-01 09:38:47,941+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [] EVENT_ID: VM_PAUSED_EIO(145), VM test1 has been paused due to storage I/O problem.
2018-05-01 09:40:17,750+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-327) [5742b069] Failed to migrate VM 'test1'
2018-05-01 09:40:48,208+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-76) [] VM 'd6fce261-e7d3-4965-a523-969bcdf9b1f2'(test1) is running in db and not running on VDS '367a2980-6f7f-4e33-93e0-6b400055cd79'(host_mixed_2)
2018-05-01 09:40:48,209+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-76) [] add VM 'd6fce261-e7d3-4965-a523-969bcdf9b1f2'(test1) to HA rerun treatment
2018-05-01 09:40:48,229+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-76) [] EVENT_ID: HA_VM_FAILED(9,602), Highly Available VM test1 failed. It will be restarted automatically.
2018-05-01 09:40:48,230+03 INFO [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engineScheduled-Thread-76) [] Highly Available VM went down. Attempting to restart. VM Name 'test1', VM Id 'd6fce261-e7d3-4965-a523-969bcdf9b1f2'
2018-05-01 09:40:49,416+03 INFO [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [55917cd4] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='a7b25efc-9024-4311-ab40-2c305805627d', vmId='d6fce261-e7d3-4965-a523-969bcdf9b1f2', vm='VM [test1]'}), log id: 3427456b
2018-05-01 09:40:49,425+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [55917cd4] START, CreateBrokerVDSCommand(HostName = host_mixed_3, CreateVDSCommandParameters:{hostId='a7b25efc-9024-4311-ab40-2c305805627d', vmId='d6fce261-e7d3-4965-a523-969bcdf9b1f2', vm='VM [test1]'}), log id: 534b7abd
2018-05-01 09:40:49,597+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [55917cd4] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM test1 on Host host_mixed_3
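As an aside, the pause/kill/restart sequence in the excerpts above can be reconstructed mechanically from the engine log's EVENT_ID markers. The following helper is purely illustrative (it is not part of oVirt); it only assumes the "EVENT_ID: NAME(number)" format visible in the log lines quoted in this bug:

```python
import re

# Event IDs observed in the engine.log excerpts above, with short meanings.
EVENTS = {
    "VM_PAUSED_EIO": "paused on storage I/O error",
    "HA_VM_FAILED": "HA restart triggered",
    "VDS_INITIATED_RUN_VM": "restart attempted on a host",
}

# Matches e.g. "EVENT_ID: VM_PAUSED_EIO(145)," and captures the event name.
LINE_RE = re.compile(r"EVENT_ID: (\w+)\(")

def trace_ha_sequence(lines):
    """Return the ordered list of interesting HA events found in log lines."""
    seq = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group(1) in EVENTS:
            seq.append(m.group(1))
    return seq
```

Running it over the quoted excerpts yields VM_PAUSED_EIO followed by HA_VM_FAILED and VDS_INITIATED_RUN_VM, i.e. the expected pause-then-restart sequence; in the failing NFS scenario the restart events never appear.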
Hi Michal, I need your advice. The feature works in general for all kinds of storage; only on NFS do we see this specific problem, which does not always happen but happens a lot (60-70%). So I think I can verify this BZ and open a new bug for this scenario. I need your confirmation for this.

Sounds good to me.

Adding back needinfo from comment #32.

(In reply to Polina from comment #33)

You must start from the beginning. I see the VM was started for the first time at 09:28:29,458 on host_mixed_2, paused with EIO at 09:30:37,904, killed at 09:32:31,020 (detected in the engine 15 s later), and rerun on host_mixed_3 at 09:32:48,131. Migration was initiated at 09:35:42,392 from host_mixed_3 to host_mixed_2 _manually_, and finished at 09:36:13 (detected in the engine at 09:36:32). At 09:38:45 another IOError was reported; the VM was killed at 09:40:39 (detected in the engine at 09:40:48) and restarted on host_mixed_3 again, again _manually_ migrated to host_mixed_2 at 09:45:44 till 09:45:52 (09:46:03), and again paused with EIO at 09:47:17 (09:47:19). At 09:48:49 it becomes NonResponsive and starts flipping between that and Paused. At 09:48:46 vdsm attempts to kill that VM, but that fails at 09:49:01,182, and from that time on the VM behavior is unpredictable. The flipping is a consequence of the failing libvirt destroy call, and AFAICT it is accurate: the VM is not responding for the duration of that call.
host_mixed_2: 09:35:43 NonOperational, 09:40:00 Up, 09:40:17 NonOperational, 09:44:01 Up, 09:52:14 NonOperational. I see you have been running that VM before that time as well. I also see the VM ultimately crashed in QEMU at 10:05:36 (not covered by your logs). Again, I do not see anything wrong; the libvirt call seems to get stuck and needs to be investigated by the libvirt team (adding jiri as well).

I've reproduced the problem and filed a new bug for this issue, https://bugzilla.redhat.com/show_bug.cgi?id=1574402, adding new logs and a reference to comments 31-35 of this BZ.

Since the new bug covers the specific issue described in comments 31-35, I verify this BZ on rhv-release-4.2.3-4-001.noarch. Verification summary: tested on four kinds of storage (iSCSI, FC, Gluster, NFS). The I/O error was caused as follows:
- iSCSI and Gluster: dropping traffic with an iptables rule;
- FC: making the LUN path faulty (e.g. echo "offline" > /sys/block/sdd/device/state);
- NFS: dropping traffic with an iptables rule while the SD is configured with small Retransmissions & Timeout values (see comment 31 for details).

In the NFS case described in comment 37, libvirt was unable to kill the QEMU process for ~15 minutes, and virProcessKillPainfully kept reporting "Failed to terminate process 12620 with SIGKILL: Device or resource busy". Then QEMU died by itself. It looks like the process was stuck on I/O during that time. There is nothing libvirt can do if the kernel does not want to kill a process.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488