Bug 1690475
Summary: When a live storage migration fails, the auto generated snapshot does not get removed

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | Steffen Froemer <sfroemer> |
| Component | ovirt-engine |
| Assignee | Benny Zlotnik <bzlotnik> |
| Status | CLOSED ERRATA |
| QA Contact | Ilan Zuckerman <izuckerm> |
| Severity | unspecified |
| Docs Contact | |
| Priority | unspecified |
| Version | 4.2.7 |
| CC | aefrat, bzlotnik, eshames, lsurette, mtessun, srevivo, tnisan, ycui |
| Target Milestone | ovirt-4.4.0 |
| Keywords | ZStream |
| Target Release | --- |
| Flags | lsvaty: testing_plan_complete- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | rhv-4.4.0-28 |
| Doc Type | No Doc Update |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| | 1702597 (view as bug list) |
| Environment | |
| Last Closed | 2020-08-04 13:16:55 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | Storage |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | 1785939, 1787378 |
| Bug Blocks | 1702597 |
Description (Steffen Froemer, 2019-03-19 14:32:07 UTC)
Benny, I recall we have an RFE for this issue.

Not sure, I think we have an RFE for removing it if the VM was shut down.

I am not entirely sure at which stage the cable is pulled. Live storage migration consists of:

1. Create a snapshot
2. Create image placeholder
3. Start replication
4. Sync
5. Finish replication
6. Live merge

After stage 2 and until the end of stage 5, the "snapshot" is present on both source and destination. If the destination is blocked, we can't really clean it up, and it will require manual intervention. Though we can add a best-effort attempt to remove the auto-generated snapshot after failures.

(In reply to Benny Zlotnik from comment #2)
> Not sure, I think we have an RFE for removing if the VM was shutdown
>
> I am not entirely sure at which stage the cable is pulled?

In my case, the "pulled cable" was caused by the issue described in [1]. The scenario was as follows: the vDisk of the VM should be moved from SD1 -> SD2. The snapshot was created on SD1 and the migration started. The migration failed due to [1], and the snapshot was not deleted.

The expectation is that in this case the automatically created snapshot is removed automatically. I don't know in which state the migration failed, but let me know how I can help with additional logs to clarify this.

[1]: https://access.redhat.com/solutions/3086271

I see, I will add a best-effort attempt to remove the auto-generated snapshot.

The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
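The failure window described above can be made concrete with a small sketch. This is illustrative Python, not ovirt-engine code: the stage names mirror the six steps listed in the comment, and `try_remove_snapshot` is a hypothetical stand-in for the engine's snapshot-removal command. The point it demonstrates is the proposed best-effort behavior: cleanup is attempted only once the auto-generated snapshot exists, and a failure before the snapshot is created needs no cleanup.

```python
# Illustrative sketch of the six LSM stages and a best-effort cleanup of the
# auto-generated snapshot when a stage fails. Names are hypothetical, not
# actual ovirt-engine APIs.
STAGES = [
    "create_snapshot",
    "create_image_placeholder",
    "start_replication",
    "sync",
    "finish_replication",
    "live_merge",
]

def try_remove_snapshot():
    # Stand-in for the engine's snapshot removal; in reality this can itself
    # fail (e.g. destination unreachable), which is why it is best-effort.
    return True

def run_lsm(fail_at=None):
    """Run the stages in order; on failure after the snapshot exists,
    attempt to remove it. Returns a summary of what happened."""
    snapshot_created = False
    completed = []
    for stage in STAGES:
        if stage == fail_at:
            removed = snapshot_created and try_remove_snapshot()
            return {"failed_at": stage, "snapshot_removed": removed,
                    "completed": completed}
        completed.append(stage)
        if stage == "create_snapshot":
            snapshot_created = True
    return {"failed_at": None, "snapshot_removed": False,
            "completed": completed}
```

Under this model, a failure during `sync` (the case reported here) leaves a snapshot behind unless the cleanup attempt runs and succeeds.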
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed: [Found non-acked flags: '{}'] For more info please contact: rhv-devops

Due to 1785939 I'm blocked and can't verify the bug yet.

Working version: 4.4.0-0.24.master.el8ev | Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
Bug reproduced: 100%

Steps:
1. Create an NFS / Gluster disk and attach it to a VM
2. Perform LSM to a different NFS / Gluster storage domain

Expected: LSM should work properly and the disk should be migrated.
Actual: An error is raised when trying LSM in any of the following cases:

- nfs -> nfs
- nfs -> gluster
- gluster -> gluster
- gluster -> nfs

Engine logs:

2020-03-17 12:38:59,268+02 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command CopyData id: '19240136-d097-4416-b981-6deaee59c48a': execution was completed, the command status is 'FAILED'
2020-03-17 12:38:59,544+02 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' (id: '4f37c28a-597f-4749-bc0a-a890358b9cf5') waiting on child command id: '0cedd7c9-06cd-417b-8b45-11146bb16134' type:'CopyImageGroupVolumesData' to complete
2020-03-17 12:39:00,557+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand' with failure.
2020-03-17 12:39:00,573+02 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'CopyImageGroupVolumesData' id: '0cedd7c9-06cd-417b-8b45-11146bb16134' child commands '[19240136-d097-4416-b981-6deaee59c48a]' executions were completed, status 'FAILED'
2020-03-17 12:39:01,587+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand' with failure.
2020-03-17 12:39:01,662+02 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' id: '4f37c28a-597f-4749-bc0a-a890358b9cf5' child commands '[a469ef90-ee0b-46f4-b263-cc6c6ddced92, 2239dbf0-5dff-4384-ae2d-f531b2c9ad72, 0cedd7c9-06cd-417b-8b45-11146bb16134]' executions were completed, status 'FAILED'
2020-03-17 12:39:02,751+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2020-03-17 12:39:02,752+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Failed during live storage migration of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238', attempting to end replication before deleting the target disk
2020-03-17 12:39:02,753+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_1, VmReplicateDiskParameters:{hostId='9804e9ec-b212-40ce-9c8b-934675f1e896', vmId='0deb41e6-deb6-41c5-9f42-bc3dda1a1238', storagePoolId='677451ec-4159-4fdc-8136-dec3fb9d861b', srcStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', targetStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', imageGroupId='1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c', imageId='af3437c4-9fd3-4ebd-89c3-e3a9014eb206'}), log id: 387ea7c5
2020-03-17 12:39:02,882+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 387ea7c5
2020-03-17 12:39:02,883+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Attempting to delete the target of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238'

Verified according to steps agreed with DEV:

ovirt-engine-4.4.0-0.33.master.el8ev.noarch
vdsm-4.40.13-1.el8ev.x86_64

1. Trigger LSM for a blank VM, iscsi -> iscsi
2. Grep for 'USER_CREATE_SNAPSHOT_FINISHED_SUCCESS' in the engine log
3. Then wait for "Running command: CloneImageGroupVolumesStructureCommand" (it will come very fast)
4. As soon as it starts, block the connection to storage on the appropriate vdsm host:
   [root@storage-ge5-vdsm1 ~]# iptables -A OUTPUT -d XXXXX -j DROP
5. After a short time, as soon as 'failed' / 'ERROR' appears, release the connection:
   [root@storage-ge5-vdsm1 ~]# iptables -D OUTPUT -d XXXXX -j DROP
6. See that the temporary snapshot was removed from the snapshots of that VM.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247
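The verification steps above hinge on timing: the storage connection must be blocked after the auto-generated snapshot is created but as soon as CloneImageGroupVolumesStructureCommand starts. A small sketch of that log-watching logic, assuming only that the engine log contains the two marker strings quoted in the steps (this is a hypothetical helper, not part of any test suite):

```python
# Illustrative helper: decide where we are in the LSM flow by scanning engine
# log lines for the two markers used in the verification steps. Returns
# 'block_now' once CloneImageGroupVolumesStructureCommand is running (time to
# apply the iptables DROP rule), 'snapshot_done' once the auto-generated
# snapshot exists, and 'wait' otherwise.
def lsm_phase(log_lines):
    phase = "wait"
    for line in log_lines:
        if "USER_CREATE_SNAPSHOT_FINISHED_SUCCESS" in line:
            phase = "snapshot_done"
        if "Running command: CloneImageGroupVolumesStructureCommand" in line:
            return "block_now"
    return phase
```

In practice a tester would tail the engine log and apply the block at the 'block_now' transition, then remove it once a 'failed' / 'ERROR' line appears, as in steps 4 and 5.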