Bug 2010478 - After a storage error, HA VMs failed to auto-resume.
Summary: After a storage error, HA VMs failed to auto-resume.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.0
Target Release: 4.5.0
Assignee: Milan Zamazal
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-04 18:23 UTC by Frank DeLorey
Modified: 2022-08-21 11:17 UTC
CC List: 7 users

Fixed In Version: vdsm-4.50.0.5
Doc Type: Bug Fix
Doc Text:
Previously, if storage problems occurred and disappeared during a VM migration attempt, it sometimes led to the VM being paused and not resuming even if the VM had an auto-resume policy set. In this release, the VM is handled according to its resume behavior policy when the storage state changes during a VM migration attempt.
Clone Of:
Environment:
Last Closed: 2022-05-26 17:22:44 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker RHV-43758 (last updated 2021-10-04 18:29:35 UTC)
Red Hat Product Errata RHSA-2022:4764 (last updated 2022-05-26 17:23:06 UTC)
oVirt gerrit 117958 (master, MERGED): virt: Prevent migrations of VMs paused due to an I/O error (last updated 2022-01-07 21:25:07 UTC)
oVirt gerrit 117960 (master, MERGED): virt: Postpone VM resume if blocked due to the VM status (last updated 2022-01-07 21:25:09 UTC)
oVirt gerrit 117985 (master, MERGED): virt: Permit resuming migrating VMs stopped due to I/O errors (last updated 2022-01-07 21:27:15 UTC)
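Purely as an illustration of the behavior described in the doc text and in the three merged patches above, here is a minimal Python sketch of the intended logic. This is not vdsm's actual code; all names and classes here are hypothetical, and the mapping to the individual patches is only approximate: a VM paused on an I/O error is not migrated, a resume that arrives while a migration attempt is still in flight is postponed rather than dropped, and once the migration attempt ends the VM is handled according to its resume behavior policy.

from enum import Enum, auto


class ResumeBehavior(Enum):
    AUTO_RESUME = auto()
    LEAVE_PAUSED = auto()
    KILL = auto()


class Vm:
    """Hypothetical stand-in for a VM object; not vdsm's real class."""
    def __init__(self, resume_behavior):
        self.resume_behavior = resume_behavior
        self.paused_on_io_error = False
        self.migrating = False
        self.resume_postponed = False
        self.running = True


def may_migrate(vm):
    # Roughly corresponds to gerrit 117958: do not start a migration
    # for a VM that is paused because of an I/O error.
    return not vm.paused_on_io_error


def on_storage_recovered(vm):
    """Called when the I/O error condition clears."""
    if not vm.paused_on_io_error:
        return  # only VMs paused with an I/O error are auto-resumed
    if vm.migrating:
        # Roughly corresponds to gerrit 117960: a migration attempt is
        # still in flight, so remember the resume instead of dropping it.
        vm.resume_postponed = True
        return
    _apply_resume_behavior(vm)


def on_migration_finished(vm):
    # Roughly corresponds to gerrit 117985: once the migration attempt
    # is over, a postponed resume is applied according to the policy.
    vm.migrating = False
    if vm.resume_postponed:
        vm.resume_postponed = False
        _apply_resume_behavior(vm)


def _apply_resume_behavior(vm):
    if vm.resume_behavior is ResumeBehavior.AUTO_RESUME:
        vm.paused_on_io_error = False   # resume the VM
    elif vm.resume_behavior is ResumeBehavior.KILL:
        vm.running = False              # HA logic will restart the VM
    # LEAVE_PAUSED: do nothing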

Comment 7 Polina 2022-04-24 08:41:32 UTC
Verified on ovirt-engine-tools-4.5.0.2-0.7.el8ev.noarch.

The following scenarios were tested:

1. Run a highly available VM with a lease. Block the connection to storage from the host running the VM. The VM is paused with an I/O error storage error. No migration attempts appear in engine.log. When storage is back, the VM recovers to Up (see the verification sketch below).
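As a hedged sketch only (not part of the verification run), scenario 1 could be checked with the oVirt Python SDK (ovirtsdk4) by polling the VM status and confirming it returns to Up once storage is reachable again. The engine URL, credentials, and timeout here are placeholders; the VM name is the one used in this test.

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details -- adjust for the environment under test.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=golden_env_mixed_virtio_2')[0]
vm_service = vms_service.vm_service(vm.id)

# Poll until the HA VM auto-resumes (Paused -> Up) after storage recovery.
deadline = time.time() + 600
while time.time() < deadline:
    status = vm_service.get().status
    print('VM status:', status)
    if status == types.VmStatus.UP:
        print('VM recovered from pause back to Up')
        break
    time.sleep(10)

connection.close()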

2. Run a highly available VM configured with 16384 MB of memory, 4 CPUs, and the post-copy migration policy. On the destination host, create CPU load so that the migration is possible but slower because of this configuration. Start the migration; while it is in progress, block the connection to storage from the source host running the VM. The VM is paused:

2022-04-20 17:49:09,251+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-9) [2f7d31b1] Migration of VM 'golden_env_mixed_virtio_2' to host 'host_mixed_2' failed: VM destroyed during the startup.
2022-04-20 17:49:09,289+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-27) [2f7d31b1] VM '92e17fe6-fdea-4a85-8685-4abadea95dd9'(golden_env_mixed_virtio_2) moved from 'MigratingFrom' --> 'Paused'
2022-04-20 17:49:09,301+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-27) [2f7d31b1] EVENT_ID: VM_PAUSED(1,025), VM golden_env_mixed_virtio_2 has been paused.
2022-04-20 17:49:09,311+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-27) [2f7d31b1] EVENT_ID: VM_PAUSED_ERROR(139), VM golden_env_mixed_virtio_2 has been paused
...

2022-04-20 17:51:34,978+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-31) [] VM '92e17fe6-fdea-4a85-8685-4abadea95dd9'(golden_env_mixed_virtio_2) moved from 'Paused' --> 'Up'
2022-04-20 17:51:34,991+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-31) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM golden_env_mixed_virtio_2 has recovered from paused back to up.
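As an aside, the VmAnalyzer state transitions above can be pulled out of engine.log with a small, illustrative Python helper (not part of the product or of this verification), which makes the Paused --> Up recovery easier to spot for a given VM:

import re
import sys

# Matches engine.log lines such as:
# ... VM '<uuid>'(<name>) moved from 'MigratingFrom' --> 'Paused'
TRANSITION = re.compile(
    r"VM '(?P<uuid>[0-9a-f-]+)'\((?P<name>[^)]+)\) "
    r"moved from '(?P<src>\w+)' --> '(?P<dst>\w+)'"
)


def state_transitions(log_path, vm_name):
    """Yield (timestamp, source state, destination state) for one VM."""
    with open(log_path) as log:
        for line in log:
            match = TRANSITION.search(line)
            if match and match.group('name') == vm_name:
                timestamp = line.split(' INFO')[0].split(' ERROR')[0]
                yield timestamp, match.group('src'), match.group('dst')


if __name__ == '__main__':
    # Usage: python transitions.py engine.log golden_env_mixed_virtio_2
    for ts, src, dst in state_transitions(sys.argv[1], sys.argv[2]):
        print(ts, src, '-->', dst)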

3. The same as case 2, except that the storage connection was blocked on two hosts, source and destination.
The HA VM is paused and remains paused after the storage domain is fixed, because it was paused with an unknown error rather than an I/O error (only VMs paused with an I/O error are resumed automatically). Upon a user resume attempt, the VM is restarted.


2022-04-24 10:50:13,033+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-80879) [74a5150f-1c65-469c-a676-18fff3bcca28] EVENT_ID: USER_RESUME_VM(40), VM golden_env_mixed_virtio_2 was resumed by admin@internal-authz (Host: host_mixed_1).
2022-04-24 10:50:21,247+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-63) [] VM '92e17fe6-fdea-4a85-8685-4abadea95dd9'(golden_env_mixed_virtio_2) moved from 'Paused' --> 'NotResponding'
...
2022-04-24 10:51:54,529+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-17) [43f0c3f6] Failed to destroy VM '92e17fe6-fdea-4a85-8685-4abadea95dd9' because VM does not exist, ignoring
2022-04-24 10:51:54,529+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-17) [43f0c3f6] FINISH, DestroyVDSCommand, return: , log id: 226d4497
2022-04-24 10:51:54,529+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-17) [43f0c3f6] VM '92e17fe6-fdea-4a85-8685-4abadea95dd9'(golden_env_mixed_virtio_2) moved from 'Paused' --> 'Down'
2022-04-24 10:51:54,615+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-17) [43f0c3f6] EVENT_ID: VM_DOWN_ERROR(119), VM golden_env_mixed_virtio_2 is down with error. Exit message: Down because paused for too long.
2022-04-24 10:51:54,616+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-17) [43f0c3f6] add VM '92e17fe6-fdea-4a85-8685-4abadea95dd9'(golden_env_mixed_virtio_2) to HA rerun treatment
2022-04-24 10:51:54,633+03 INFO  [org.ovirt.engine.core.bll.ProcessDownVmCommand] (EE-ManagedThreadFactory-engine-Thread-80919) [704a0b1d] Running command: ProcessDownVmCommand internal: true.
2022-04-24 10:51:54,656+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-17) [43f0c3f6] EVENT_ID: HA_VM_FAILED(9,602), Highly Available VM golden_env_mixed_virtio_2 failed. It will be restarted automatically.
2022-04-24 10:51:55,299+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-72) [43df18ba] START, CreateBrokerVDSCommand(HostName = host_mixed_1, CreateVDSCommandParameters:{hostId='d9bae774-64a1-411c-a9ed-ddadbc2095f9', vmId='92e17fe6-fdea-4a85-8685-4abadea95dd9', vm='VM [golden_env_mixed_virtio_2]'}), log id: bf2ab5

2022-04-24 10:53:06,573+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-51) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM golden_env_mixed_virtio_2 on Host host_mixed_1

The VM was restarted.

Comment 14 errata-xmlrpc 2022-05-26 17:22:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: RHV RHEL Host (ovirt-host) [ovirt-4.5.0] security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4764

