Bug 1709303 - [downstream clone - 4.3.4] Failure in creating snapshots during "Live Storage Migration" can result in a nonexistent snapshot
Summary: [downstream clone - 4.3.4] Failure in creating snapshots during "Live Storage...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.1
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.4
Target Release: 4.3.1
Assignee: Benny Zlotnik
QA Contact: Yosi Ben Shimon
URL:
Whiteboard:
Depends On: 1695026
Blocks:
 
Reported: 2019-05-13 11:18 UTC by RHV bug bot
Modified: 2020-08-03 15:26 UTC (History)
5 users

Fixed In Version: ovirt-engine-4.3.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1695026
Environment:
Last Closed: 2019-06-20 14:48:33 UTC
oVirt Team: Storage
Target Upstream Version:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3545971 0 Troubleshoot None Snapshot deletion (live merge) fails with error. 2019-05-13 11:19:02 UTC
Red Hat Product Errata RHEA-2019:1566 0 None None None 2019-06-20 14:48:45 UTC
oVirt gerrit 99928 0 'None' MERGED core: validate source domain in LSM 2020-11-19 15:48:59 UTC
oVirt gerrit 99929 0 'None' MERGED core: make MoveOrCopyDiskCommandTest vars consistent 2020-11-19 15:49:21 UTC
oVirt gerrit 99936 0 'None' MERGED core: validate source domain in LSM 2020-11-19 15:49:21 UTC

Description RHV bug bot 2019-05-13 11:18:50 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1695026 +++
======================================================================

Description of problem:

The VM had the structure below before the LSM; the snapshot IDs were f9feefbe and 8619b674:

===

           image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            | imagestatus 
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+-------------
 b31e31e6-3f23-4b48-8f08-8f8c87524b14 | 83ca424e-408e-43a6-8371-5f3cc90677a5 | 0d4b6990-9d8c-417d-afbf-5cafb8207fc3 | f9feefbe-16b7-4343-bcd7-7d43213ee765 |           1
 0d4b6990-9d8c-417d-afbf-5cafb8207fc3 | 83ca424e-408e-43a6-8371-5f3cc90677a5 | 00000000-0000-0000-0000-000000000000 | 8619b674-c6d8-44d0-98d9-fee3dbd9d4ea |           1
 0d0a93c9-8a7a-4401-9b7c-cc251b23e7f4 | 73ac2a39-d52c-4aa3-bd90-2e42a9af81f6 | 00000000-0000-0000-0000-000000000000 | 8619b674-c6d8-44d0-98d9-fee3dbd9d4ea |           1
 8f0b95c1-d63c-4bd5-8f6d-28c1995796d7 | 73ac2a39-d52c-4aa3-bd90-2e42a9af81f6 | 0d0a93c9-8a7a-4401-9b7c-cc251b23e7f4 | f9feefbe-16b7-4343-bcd7-7d43213ee765 |           1
===

The source storage domain had 0 GB of free disk space. A live storage migration of disk 73ac2a39 was then executed, which created snapshot 3067f179.

(structure during the LSM operation)

===
              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            | imagestatus 
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+-------------
 b31e31e6-3f23-4b48-8f08-8f8c87524b14 | 83ca424e-408e-43a6-8371-5f3cc90677a5 | 0d4b6990-9d8c-417d-afbf-5cafb8207fc3 | 3067f179-8620-483c-bacd-6aea46b499be |           1
 0d4b6990-9d8c-417d-afbf-5cafb8207fc3 | 83ca424e-408e-43a6-8371-5f3cc90677a5 | 00000000-0000-0000-0000-000000000000 | 8619b674-c6d8-44d0-98d9-fee3dbd9d4ea |           1
 ed7c6d73-0d17-470b-8152-42fa45d2caea | 73ac2a39-d52c-4aa3-bd90-2e42a9af81f6 | 8f0b95c1-d63c-4bd5-8f6d-28c1995796d7 | 3067f179-8620-483c-bacd-6aea46b499be |           2
 8f0b95c1-d63c-4bd5-8f6d-28c1995796d7 | 73ac2a39-d52c-4aa3-bd90-2e42a9af81f6 | 0d0a93c9-8a7a-4401-9b7c-cc251b23e7f4 | f9feefbe-16b7-4343-bcd7-7d43213ee765 |           2
 0d0a93c9-8a7a-4401-9b7c-cc251b23e7f4 | 73ac2a39-d52c-4aa3-bd90-2e42a9af81f6 | 00000000-0000-0000-0000-000000000000 | 8619b674-c6d8-44d0-98d9-fee3dbd9d4ea |           2
===

The snapshot operation during the LSM failed because of low storage space in the source storage domain. As part of the rollback, the engine deleted snapshot 3067f179 from the snapshots table:

===
engine=# select * from snapshots where snapshot_id = '3067f179-8620-483c-bacd-6aea46b499be';
 snapshot_id | vm_id | snapshot_type | status | description | creation_date | app_list | vm_configuration | _create_date | _update_date | memory_metadata_disk_id | memory_du
mp_disk_id | vm_configuration_broken 
-------------+-------+---------------+--------+-------------+---------------+----------+------------------+--------------+--------------+-------------------------+----------
-----------+-------------------------
(0 rows)
===

However, the other disk, 83ca424e, still points to snapshot 3067f179, which no longer exists in the snapshots table.
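The dangling reference can be confirmed with a join between the two tables. A minimal sketch, using Python's sqlite3 as a stand-in for the engine's PostgreSQL database, with only the columns this report refers to and the GUIDs from the tables above:

```python
import sqlite3

# Illustrative stand-in schema; the real engine tables have many more columns.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snapshots (snapshot_id TEXT PRIMARY KEY);
    CREATE TABLE images (image_guid TEXT PRIMARY KEY, vm_snapshot_id TEXT);
""")
con.execute("INSERT INTO snapshots VALUES ('8619b674-c6d8-44d0-98d9-fee3dbd9d4ea')")
con.execute("INSERT INTO images VALUES "
            "('0d4b6990-9d8c-417d-afbf-5cafb8207fc3', '8619b674-c6d8-44d0-98d9-fee3dbd9d4ea')")
# 3067f179... was rolled back out of snapshots, but this image still points at it.
con.execute("INSERT INTO images VALUES "
            "('b31e31e6-3f23-4b48-8f08-8f8c87524b14', '3067f179-8620-483c-bacd-6aea46b499be')")

# Images whose vm_snapshot_id has no matching row in snapshots.
orphans = con.execute("""
    SELECT i.image_guid, i.vm_snapshot_id
    FROM images i
    LEFT JOIN snapshots s ON s.snapshot_id = i.vm_snapshot_id
    WHERE s.snapshot_id IS NULL
""").fetchall()
print(orphans)
# [('b31e31e6-3f23-4b48-8f08-8f8c87524b14', '3067f179-8620-483c-bacd-6aea46b499be')]
```

The same LEFT JOIN / IS NULL query shape works against the engine database directly in psql.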


(structure after failed snapshot operation)

===
              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            | imagestatus 
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+-------------
 b31e31e6-3f23-4b48-8f08-8f8c87524b14 | 83ca424e-408e-43a6-8371-5f3cc90677a5 | 0d4b6990-9d8c-417d-afbf-5cafb8207fc3 | 3067f179-8620-483c-bacd-6aea46b499be |           1
 0d4b6990-9d8c-417d-afbf-5cafb8207fc3 | 83ca424e-408e-43a6-8371-5f3cc90677a5 | 00000000-0000-0000-0000-000000000000 | 8619b674-c6d8-44d0-98d9-fee3dbd9d4ea |           1
 8f0b95c1-d63c-4bd5-8f6d-28c1995796d7 | 73ac2a39-d52c-4aa3-bd90-2e42a9af81f6 | 0d0a93c9-8a7a-4401-9b7c-cc251b23e7f4 | f9feefbe-16b7-4343-bcd7-7d43213ee765 |           1
 0d0a93c9-8a7a-4401-9b7c-cc251b23e7f4 | 73ac2a39-d52c-4aa3-bd90-2e42a9af81f6 | 00000000-0000-0000-0000-000000000000 | 8619b674-c6d8-44d0-98d9-fee3dbd9d4ea |           1
===

Now, if you try to delete the old snapshot of disk 83ca424e, the operation fails with a NullPointerException and the image is marked as illegal.

===
              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            | imagestatus 
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+-------------
 b31e31e6-3f23-4b48-8f08-8f8c87524b14 | 83ca424e-408e-43a6-8371-5f3cc90677a5 | 0d4b6990-9d8c-417d-afbf-5cafb8207fc3 | 3067f179-8620-483c-bacd-6aea46b499be |           1
 0d4b6990-9d8c-417d-afbf-5cafb8207fc3 | 83ca424e-408e-43a6-8371-5f3cc90677a5 | 00000000-0000-0000-0000-000000000000 | 8619b674-c6d8-44d0-98d9-fee3dbd9d4ea |           4
 0d0a93c9-8a7a-4401-9b7c-cc251b23e7f4 | 73ac2a39-d52c-4aa3-bd90-2e42a9af81f6 | 00000000-0000-0000-0000-000000000000 | f9feefbe-16b7-4343-bcd7-7d43213ee765 |           1



2019-04-01 00:00:55,761-04 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-50) [356357f5-391b-458a-9f03-fcbd96759637] null: java.lang.NullPointerException
	at org.ovirt.engine.core.common.action.MergeParameters.<init>(MergeParameters.java:30) [common.jar:]
	at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.buildMergeParameters(RemoveSnapshotSingleDiskLiveCommand.java:191) [bll.jar:]
	at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.performNextOperation(RemoveSnapshotSingleDiskLiveCommand.java:114) [bll.jar:]
	at org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32) [bll.jar:]
	at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:77) [bll.jar:]
===

Manual database changes are required to correct the structure.
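The kind of manual correction involved can be sketched as follows. This is an illustrative sqlite3 stand-in, not a supported procedure; the correct target snapshot depends on the actual image chain (here, the pre-LSM table shows b31e31e6 belonged to snapshot f9feefbe):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE images (image_guid TEXT PRIMARY KEY, vm_snapshot_id TEXT);
""")
# The leftover reference from the rolled-back LSM snapshot.
con.execute("INSERT INTO images VALUES "
            "('b31e31e6-3f23-4b48-8f08-8f8c87524b14', '3067f179-8620-483c-bacd-6aea46b499be')")

# Repoint the image at the snapshot it belonged to before the LSM
# (f9feefbe..., per the pre-LSM table in this report).
con.execute("""
    UPDATE images
    SET vm_snapshot_id = 'f9feefbe-16b7-4343-bcd7-7d43213ee765'
    WHERE vm_snapshot_id = '3067f179-8620-483c-bacd-6aea46b499be'
""")
fixed = con.execute("SELECT vm_snapshot_id FROM images").fetchone()[0]
print(fixed)  # f9feefbe-16b7-4343-bcd7-7d43213ee765
```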


Version-Release number of selected component (if applicable):

Tested on rhvm-4.3.1.1-0.1.el7.noarch.

The customer is on 4.2, so the issue exists in older releases as well.


How reproducible:

100%

Steps to Reproduce:

1. Create a VM with 2 disks.
2. Create a snapshot of this VM that includes both disks.
3. Fill the storage domain so that there is 0 GB of free space.
4. Try to migrate one of the disks. It will fail during the snapshot operation since there is no free space in the source storage domain.
5. Check the structure of the images in the database. The other disk will be pointing to a nonexistent snapshot.
6. Try to delete the snapshot created in step 2. It will fail with an NPE.

Actual results:

Low disk space in the source storage domain during "Live Storage Migration" can result in an image pointing to a nonexistent snapshot.

Expected results:

The engine should not initiate the LSM if free disk space on the source storage domain is low. Manual snapshot operations are already blocked when storage domain space is low; however, the snapshot created during the LSM does not check free space in the source storage domain before the operation is initiated.

Also, the engine should not leave an inconsistent structure behind if the snapshot operation fails during the LSM.
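The fix direction (see the linked gerrit patches "core: validate source domain in LSM") amounts to a pre-flight validation. A minimal sketch in Python; the function name and the 10 GB threshold are illustrative, not the engine's actual Java API or configured CriticalSpaceActionBlocker value:

```python
# Hypothetical pre-flight check mirroring the engine's
# ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN validation.
def validate_lsm_source_domain(free_space_gb, critical_threshold_gb=10):
    """Refuse to start a live storage migration when the source storage
    domain is at or below its critical free-space threshold."""
    if free_space_gb <= critical_threshold_gb:
        return (False, "ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN")
    return (True, "OK")

print(validate_lsm_source_domain(0, 10))
# (False, 'ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN')
```

Running the same check before the LSM-internal snapshot, as the patches do for the source domain, prevents the failed snapshot and the rollback that leaves the dangling reference.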

Additional info:

(Originally by Nijin Ashok)

Comment 4 RHV bug bot 2019-05-16 15:29:27 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

Comment 6 Yosi Ben Shimon 2019-06-02 07:05:52 UTC
Tested using:
ovirt-engine-4.3.4.1-0.1.el7.noarch


Tested using the above scenario:

1. Create a VM with 2 disks. 
2. Create a snapshot of this VM that includes both disks.

Got 2 snapshots: Active VM and snap_1.

From the database:

snapshots:

engine=# select snapshot_id, vm_id, status, description from snapshots where vm_id='22778cd4-25b3-4d00-86f4-0f3b009077c1';
             snapshot_id              |                vm_id                 | status | description 
--------------------------------------+--------------------------------------+--------+-------------
 db1b3c93-2466-443b-b8cc-ddbff9f21345 | 22778cd4-25b3-4d00-86f4-0f3b009077c1 | OK     | snap_1
 cb23b281-351d-4bc4-b391-f7da887460a3 | 22778cd4-25b3-4d00-86f4-0f3b009077c1 | OK     | Active VM
(2 rows)

images:

engine=# select image_guid, image_group_id, parentid, vm_snapshot_id from images;
              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------
 5c76875d-c18e-4f2c-9d98-50a212e85523 | b7a2feba-2b62-48e4-8749-e7a9ad7f9e88 | 00000000-0000-0000-0000-000000000000 | db1b3c93-2466-443b-b8cc-ddbff9f21345
 da45376d-0820-4b19-b510-29657e03f05c | c8ceb520-a7ee-4a9d-98a5-62804c36e098 | 1464b05a-509b-4f9a-892c-2a8a67b982f2 | cb23b281-351d-4bc4-b391-f7da887460a3
 ef55999e-5d48-4e04-86c6-45b4fbe52781 | b7a2feba-2b62-48e4-8749-e7a9ad7f9e88 | 5c76875d-c18e-4f2c-9d98-50a212e85523 | cb23b281-351d-4bc4-b391-f7da887460a3
 1464b05a-509b-4f9a-892c-2a8a67b982f2 | c8ceb520-a7ee-4a9d-98a5-62804c36e098 | 00000000-0000-0000-0000-000000000000 | db1b3c93-2466-443b-b8cc-ddbff9f21345


******* start the VM *********

3. Fill the storage domain so that there is 0 GB of free space:

Created a new preallocated disk using all available space in the storage domain (iscsi_0):

Engine log:

2019-06-02 08:55:21,476+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [] EVENT_ID: IRS_DISK_SPACE_LOW_ERROR(201), Critical, Low disk space. iscsi_0 domain has 0 GB of free space.
 
4. Try to migrate one of the disks. It will fail during the snapshot operation since there is no free space in the source storage domain.

The operation failed with this error in the engine log as expected:
2019-06-02 08:58:30,321+03 WARN  [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (default task-40) [d330c72f-1861-4e0d-8b60-4e1e74954209] Validation of action 'LiveMigrateDisk' failed for user admin@internal-authz. Reasons: VAR__ACTION__MOVE,VAR__TYPE__DISK,ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN,$storageName iscsi_0

5. Check the structure of the images in the database. The other disk will be pointing to a nonexistent snapshot.

At this point, the 2 snapshots from step 2 and the same images are still present:

engine=# select image_guid, image_group_id, parentid, vm_snapshot_id from images where image_group_id in ('b7a2feba-2b62-48e4-8749-e7a9ad7f9e88', 'c8ceb520-a7ee-4a9d-98a5-62804c36e098'); 
              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------
 5c76875d-c18e-4f2c-9d98-50a212e85523 | b7a2feba-2b62-48e4-8749-e7a9ad7f9e88 | 00000000-0000-0000-0000-000000000000 | db1b3c93-2466-443b-b8cc-ddbff9f21345
 da45376d-0820-4b19-b510-29657e03f05c | c8ceb520-a7ee-4a9d-98a5-62804c36e098 | 1464b05a-509b-4f9a-892c-2a8a67b982f2 | cb23b281-351d-4bc4-b391-f7da887460a3
 ef55999e-5d48-4e04-86c6-45b4fbe52781 | b7a2feba-2b62-48e4-8749-e7a9ad7f9e88 | 5c76875d-c18e-4f2c-9d98-50a212e85523 | cb23b281-351d-4bc4-b391-f7da887460a3
 1464b05a-509b-4f9a-892c-2a8a67b982f2 | c8ceb520-a7ee-4a9d-98a5-62804c36e098 | 00000000-0000-0000-0000-000000000000 | db1b3c93-2466-443b-b8cc-ddbff9f21345
(4 rows)


6. Try to delete the snapshot created in step 2. It will fail with an NPE.

Failed to remove the snapshot from step 2 (snap_1) due to low space on storage; no NPE was seen in the logs.

From the UI:
"Cannot remove Snapshot. Low disk space on Storage Domain iscsi_0."

From the engine log:

2019-06-02 09:43:15,436+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-48) [c28cbf70-fba7-4ab5-824b-7b31835e3926] Lock Acquired to object 'EngineLock:{exclusiveLocks='[22778cd4-25b3-4d00-86f4-0f3b009077c1=VM]', sharedLocks=''}'
2019-06-02 09:43:15,589+03 WARN  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-48) [c28cbf70-fba7-4ab5-824b-7b31835e3926] Validation of action 'RemoveSnapshot' failed for user admin@internal-authz. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN,$storageName iscsi_0
2019-06-02 09:43:15,590+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-48) [c28cbf70-fba7-4ab5-824b-7b31835e3926] Lock freed to object 'EngineLock:{exclusiveLocks='[22778cd4-25b3-4d00-86f4-0f3b009077c1=VM]', sharedLocks=''}'


Powered off the VM and freed some space on the storage domain.

Tried to delete the snapshot again:


2019-06-02 09:50:42,377+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-53) [21194e6f-775b-439b-9f50-25e96212c3bb] Lock Acquired to object 'EngineLock:{exclusiveLocks='[22778cd4-25b3-4d00-86f4-0f3b009077c1=VM]', sharedLocks=''}'
2019-06-02 09:50:42,487+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-53) [21194e6f-775b-439b-9f50-25e96212c3bb] Running command: RemoveSnapshotCommand internal: false. Entities affected :  ID: 22778cd4-25b3-4d00-86f4-0f3b009077c1 Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2019-06-02 09:50:42,499+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (default task-53) [21194e6f-775b-439b-9f50-25e96212c3bb] Lock freed to object 'EngineLock:{exclusiveLocks='[22778cd4-25b3-4d00-86f4-0f3b009077c1=VM]', sharedLocks=''}'

....

2019-06-02 09:51:47,575+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [21194e6f-775b-439b-9f50-25e96212c3bb] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_SUCCESS(356), Snapshot 'snap_1' deletion for VM 'test_vm' has been completed.



Snapshots after delete:
engine=# select snapshot_id, vm_id, status, description from snapshots where vm_id='22778cd4-25b3-4d00-86f4-0f3b009077c1';
             snapshot_id              |                vm_id                 | status | description 
--------------------------------------+--------------------------------------+--------+-------------
 cb23b281-351d-4bc4-b391-f7da887460a3 | 22778cd4-25b3-4d00-86f4-0f3b009077c1 | OK     | Active VM
(1 row)

images:

engine=# select image_guid, image_group_id, parentid, vm_snapshot_id from images where image_group_id in ('b7a2feba-2b62-48e4-8749-e7a9ad7f9e88', 'c8ceb520-a7ee-4a9d-98a5-62804c36e098');
              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------
 5c76875d-c18e-4f2c-9d98-50a212e85523 | b7a2feba-2b62-48e4-8749-e7a9ad7f9e88 | 00000000-0000-0000-0000-000000000000 | cb23b281-351d-4bc4-b391-f7da887460a3
 1464b05a-509b-4f9a-892c-2a8a67b982f2 | c8ceb520-a7ee-4a9d-98a5-62804c36e098 | 00000000-0000-0000-0000-000000000000 | cb23b281-351d-4bc4-b391-f7da887460a3
(2 rows)



Works as expected. Moving to VERIFIED.

Comment 8 errata-xmlrpc 2019-06-20 14:48:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1566

Comment 9 Daniel Gur 2019-08-28 13:13:05 UTC
sync2jira

Comment 10 Daniel Gur 2019-08-28 13:17:18 UTC
sync2jira

