Bug 1637066 - [downstream clone - 4.2.7] Live Snapshot creation on a "not responding" VM will fail during "GetQemuImageInfoVDS"
Summary: [downstream clone - 4.2.7] Live Snapshot creation on a "not responding" VM will fail during "GetQemuImageInfoVDS"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.5
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.2.7
Target Release: ---
Assignee: shani
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On: 1626907
Blocks:
 
Reported: 2018-10-08 14:22 UTC by RHV bug bot
Modified: 2021-12-10 18:03 UTC (History)
CC List: 11 users

Fixed In Version: ovirt-engine-4.2.7.3
Doc Type: Bug Fix
Doc Text:
The current release blocks the creation of a live snapshot of an unresponsive virtual machine.
Clone Of: 1626907
Environment:
Last Closed: 2018-11-05 15:03:18 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44311 0 None None None 2021-12-10 18:03:26 UTC
Red Hat Knowledge Base (Solution) 3623231 0 None None None 2018-10-08 14:23:14 UTC
Red Hat Product Errata RHBA-2018:3480 0 None None None 2018-11-05 15:03:50 UTC
oVirt gerrit 94649 0 master MERGED core: Prevent snapshotting a 'Not Responding' VM 2020-08-05 10:42:29 UTC
oVirt gerrit 94694 0 ovirt-engine-4.2 MERGED core: Prevent snapshotting a 'Not Responding' VM 2020-08-05 10:42:29 UTC

Description RHV bug bot 2018-10-08 14:22:25 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1626907 +++
======================================================================

Description of problem:

If a VM is not responding and we try to create a snapshot of it, the operation leaves an inconsistent image structure in the database.

The VM went into the "Not Responding" state:

===
2018-09-10 00:17:45,301-04 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [] VM '67d94860-76a1-42c3-b55e-34abccaa2253'(test_snap) moved from 'Up' --> 'NotResponding'
2018-09-10 00:17:45,318-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [] EVENT_ID: VM_NOT_RESPONDING(126), VM test_snap is not responding.
===

RHV allows creating a snapshot of this VM, but it does not initiate SnapshotVDSCommand. The "prepareImage" step is executed only as part of the "snapshot" verb, so the new LV is never activated on the host where the VM is running, and the subsequent "qemu-img info" therefore fails with "No such file or directory".

===
2018-09-10 00:21:02,467-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-18) [b00b0404-1d3e-4e59-8dfd-8caa9d9d9b96] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM 10.65.177.143 command GetQemuImageInfoVDS failed: Command ['/usr/bin/qemu-img', 'info', '--output', 'json', '-f', 'qcow2', '-U', u'/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/84c8b978-bcb1-4f17-9cb0-2429c2adebd9'] failed with rc=1 out='' err="qemu-img: Could not open '/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/84c8b978-bcb1-4f17-9cb0-2429c2adebd9': Could not open '/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/84c8b978-bcb1-4f17-9cb0-2429c2adebd9': No such file or directory\n"
===

The database now shows 84c8b978 as the active image, while the VM is actually using cf7aee7d:

===
              image_guid              |            image_group_id            |               parentid               | imagestatus | active 
--------------------------------------+--------------------------------------+--------------------------------------+-------------+--------
 cf7aee7d-4e24-4c81-9d54-ddd56459348d | 0a400feb-3b50-4d46-9205-6bbdebb2b73b | 00000000-0000-0000-0000-000000000000 |           1 | f
 84c8b978-bcb1-4f17-9cb0-2429c2adebd9 | 0a400feb-3b50-4d46-9205-6bbdebb2b73b | cf7aee7d-4e24-4c81-9d54-ddd56459348d |           1 | t

virsh -r domblklist test_snap
Target     Source
------------------------------------------------
sda        /rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/cf7aee7d-4e24-4c81-9d54-ddd56459348d
===


Version-Release number of selected component (if applicable):

rhvm-4.2.5.2-0.1.el7ev.noarch

How reproducible:

100%

Steps to Reproduce:

1. Put the VM into the "Not Responding" state; I set a low value for vm_command_timeout (see the configuration sketch after these steps).
2. Create a snapshot on this VM.
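
For reference, a minimal reproduction aid -- this assumes vm_command_timeout is read from the [vars] section of /etc/vdsm/vdsm.conf on the host running the VM; treat the section name and value as assumptions and adjust for your environment:

===
# /etc/vdsm/vdsm.conf (hypothetical snippet; verify the section name for your version)
[vars]
vm_command_timeout = 5    # seconds; low enough that monitor commands time out easily

# Restart VDSM afterwards so the change takes effect:
# systemctl restart vdsmd
===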

Actual results:

RHV allows creating a snapshot of a "not responding" VM, which results in inconsistent snapshots.

Expected results:

I think the snapshot operation should be prevented on a "not responding" VM.
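
For illustration only, here is a minimal sketch of the kind of validation the fix introduces (per the linked gerrit changes "core: Prevent snapshotting a 'Not Responding' VM"). This is not the actual ovirt-engine code; the class, enum and message names are assumptions:

===
// Illustrative sketch, not the real ovirt-engine patch: reject snapshot
// creation while the VM is reported as Not Responding.
public class SnapshotValidatorSketch {

    public enum VmStatus { UP, DOWN, NOT_RESPONDING }

    /** Returns null if the snapshot may proceed, otherwise a failure reason. */
    public static String validateCanCreateLiveSnapshot(VmStatus status) {
        if (status == VmStatus.NOT_RESPONDING) {
            // Fail validation up front instead of starting the snapshot flow
            // against a VM whose qemu/libvirt channel is not answering.
            return "ACTION_TYPE_FAILED_VM_NOT_RESPONDING";
        }
        return null;
    }

    public static void main(String[] args) {
        // Example: a Not Responding VM is rejected, an Up VM is allowed (null).
        System.out.println(validateCanCreateLiveSnapshot(VmStatus.NOT_RESPONDING));
        System.out.println(validateCanCreateLiveSnapshot(VmStatus.UP));
    }
}
===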

Additional info:

(Originally by Nijin Ashok)

Comment 4 RHV bug bot 2018-10-08 14:22:43 UTC
This issue can also cause ReduceVolume to be executed on the active image the VM is using, if multiple snapshot operations have failed this way and we then try to delete one of those snapshots.

For example, suppose two snapshot operations (snap_1 and snap_2) have failed because of this bug.

===
engine=# select image_guid,image_group_id,parentid,vm_snapshot_id,imagestatus,active from images where image_group_id in 
(select image_group_id from vm_images_view where disk_id in 
(select device_id from vm_device where vm_id = (select vm_guid from vm_static where vm_name = 'test_vm') and device = 'disk')) ;
              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            | imagestatus | active 
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+-------------+--------
 36228393-eed2-46d9-a889-d417b8bdd25f | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | 00000000-0000-0000-0000-000000000000 | 7ae324c4-5245-418d-b548-75c99317b5fd |           1 | f
 d8f210db-0efe-43b9-939b-fcf7b04620ad | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | 36228393-eed2-46d9-a889-d417b8bdd25f | e2f3b03c-1a5b-42bb-9257-ebb7e18bdefe |           1 | f
 e47bd3d0-c045-4c3b-ab49-bf051e0d03d3 | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | d8f210db-0efe-43b9-939b-fcf7b04620ad | 8d4518af-8ff7-4a0c-8cc2-1818aaa15705 |           1 | t


virsh -r domblklist test_vm
Target     Source
------------------------------------------------
hdc        -
hdd        /var/run/vdsm/payload/ec4aac3d-c710-45c8-b75d-39ce4b74cdf9.7e994787fed2f0cc041bf0799baea18a.img
sda        /rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/d7ea2159-dd3f-41d9-b76c-a4a39aac7832/36228393-eed2-46d9-a889-d417b8bdd25f

engine=# select description from snapshots where snapshot_id in ('7ae324c4-5245-418d-b548-75c99317b5fd','e2f3b03c-1a5b-42bb-9257-ebb7e18bdefe','8d4518af-8ff7-4a0c-8cc2-1818aaa15705');
 description 
-------------
 snap_1
 Active VM
 snap_2
(3 rows)
===

36228393 is the active image used by the VM. If we now try to delete snap_1, which is an "internal" snapshot, the flow also initiates ReduceImageVDSCommand on image 36228393.

===
engine log

2018-09-10 07:19:31,167-04 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-8) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13 (Failed with error imageErr and code 13)

2018-09-10 07:19:32,142-04 INFO  [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] Merge command (jobId = null) has completed for images '36228393-eed2-46d9-a889-d417b8bdd25f'..'d8f210db-0efe-43b9-939b-fcf7b04620ad'

2018-09-10 07:19:49,461-04 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ReduceImageVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-1) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] START, ReduceImageVDSCommand( ReduceImageVDSCommandParameters:{storagePoolId='0aa2db2e-9bf5-11e8-8497-001a4a17015d', ignoreFailoverLimit='false', storageDomainId='c21260fa-bc27-4bd3-9841-99da4e2d0402', imageGroupId='d7ea2159-dd3f-41d9-b76c-a4a39aac7832', imageId='36228393-eed2-46d9-a889-d417b8bdd25f', allowActive='true'}), log id: 2539e1af
2018-09-10 07:19:49,854-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-1) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] EVENT_ID: USER_REDUCE_DISK_FINISHED_SUCCESS(382), Disk 'GlanceDisk-9a3800a' has been successfully reduced.

vdsm log.

2018-09-10 11:19:49,654+0530 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Calling 'StoragePool.reduceVolume' in bridge with {u'allowActive': True, u'storagepoolID': u'0aa2db2e-9bf5-11e8-8497-001a4a17015d', u'imageID': u'd7ea2159-dd3f-41d9-b76c-a4a39aac7832', u'volumeID': u'36228393-eed2-46d9-a889-d417b8bdd25f', u'storagedomainID': u'c21260fa-bc27-4bd3-9841-99da4e2d0402'} (__init__:590)

2018-09-10 11:19:50,319+0530 INFO  (tasks/1) [storage.LVM] Reducing LV c21260fa-bc27-4bd3-9841-99da4e2d0402/36228393-eed2-46d9-a889-d417b8bdd25f to 3743 megabytes (force=True) (lvm:1242)
2018-09-10 11:19:50,319+0530 DEBUG (tasks/1) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /sbin/lvm lvreduce --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360014053f404fa44d844d9198cfee437|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --autobackup n --force --size 3743m c21260fa-bc27-4bd3-9841-99da4e2d0402/36228393-eed2-46d9-a889-d417b8bdd25f (cwd None) (commands:65)
====

This can lead to two major issues.

[1] Data corruption.

vdsm executes lvreduce on an image the VM is actively writing to, which can lead to data corruption.

[2] Incorrect block threshold value.

Since the active LV was reduced, the VM is still using the old threshold limit, which may now be higher than the current active LV size.

virsh -r domstats test_vm --block |grep -i threshold
  block.2.threshold=4429185024

lvs|grep 36228393
  36228393-eed2-46d9-a889-d417b8bdd25f c21260fa-bc27-4bd3-9841-99da4e2d0402 -wi-ao----   3.75g

So if the VM keeps writing data, the LV will never be extended, since the LV size is now below the threshold value, and the VM will eventually end up paused with an ENOSPC error.
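
With the numbers above, the registered write threshold is 4429185024 bytes, i.e. about 4.13 GiB, while the reduced LV is only 3.75 GiB, so the guest reaches the end of the LV (ENOSPC) before the threshold event can ever fire.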

I think other existing BZs can also lead to a condition like this, for example bug 1583424. Please let me know if we need to open a new bug specifically for this.

(Originally by Nijin Ashok)

Comment 5 RHV bug bot 2018-10-08 14:22:48 UTC
(In reply to nijin ashok from comment #2)
> I think the other existing BZ's also can lead condition like this. For
> example this bug 1583424. 

Typo here. I meant this bug 1554369.

(Originally by Nijin Ashok)

Comment 6 RHV bug bot 2018-10-08 14:22:52 UTC
Michal / Arik, your thoughts? Storage-wise, the snapshot creation seems to be working correctly, but changing the active snapshot of the VM fails since the VM is not responsive.

(Originally by Tal Nisan)

Comment 7 RHV bug bot 2018-10-08 14:22:56 UTC
We shouldn't allow any action on the VM that involves communication with libvirt/qemu/the guest/anything else when the VM is Not Responding. That state signals a problem in communication with *some* component, without differentiating which actions could potentially work and which could not.

(Originally by michal.skrivanek)

Comment 8 RHV bug bot 2018-10-08 14:23:00 UTC
Quite often the VM is not responding because the host is overloaded and communication is lagging and timing out. A snapshot would make things even worse, so I believe we should forbid it.

(Originally by michal.skrivanek)

Comment 11 Evelina Shames 2018-10-25 07:08:51 UTC
Verified.

Engine: 4.2.7.3-0.1.el7ev
VDSM: vdsm-4.20.43-1.el7ev.x86_64

Comment 14 errata-xmlrpc 2018-11-05 15:03:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3480

Comment 15 Daniel Gur 2019-08-28 13:13:57 UTC
sync2jira

Comment 16 Daniel Gur 2019-08-28 13:18:11 UTC
sync2jira

