Bug 1637066

Summary: [downstream clone - 4.2.7] Live Snapshot creation on a "not responding" VM will fail during "GetQemuImageInfoVDS"
Product: Red Hat Enterprise Virtualization Manager Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: ovirt-engine Assignee: shani <sleviim>
Status: CLOSED ERRATA QA Contact: Evelina Shames <eshames>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.2.5 CC: ahadas, apinnick, bscalio, ebenahar, gveitmic, lveyde, michal.skrivanek, ratamir, Rhev-m-bugs, sleviim, tnisan
Target Milestone: ovirt-4.2.7 Keywords: ZStream
Target Release: --- Flags: lsvaty: testing_plan_complete-
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.2.7.3 Doc Type: Bug Fix
Doc Text:
The current release blocks the creation of a live snapshot of an unresponsive virtual machine.
Story Points: ---
Clone Of: 1626907 Environment:
Last Closed: 2018-11-05 15:03:18 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1626907    
Bug Blocks:    

Description RHV bug bot 2018-10-08 14:22:25 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1626907 +++
======================================================================

Description of problem:

If a VM is not responding and we try to create a snapshot of it, an inconsistent image structure is created in the database.

The VM went into the "not responding" state:

===
2018-09-10 00:17:45,301-04 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [] VM '67d94860-76a1-42c3-b55e-34abccaa2253'(test_snap) moved from 'Up' --> 'NotResponding'
2018-09-10 00:17:45,318-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [] EVENT_ID: VM_NOT_RESPONDING(126), VM test_snap is not responding.
===

RHV allows creating a snapshot on this VM, but it never initiates SnapshotVDSCommand. "prepareImage" is performed only as part of the "snapshot" command, so the new LV is not activated on the host where the VM is running, and "qemu-img info" fails with "No such file or directory".
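The failure mode can be modeled with a small sketch. All names here are illustrative, not actual VDSM or engine APIs: on a responsive VM, SnapshotVDSCommand triggers prepareImage, which activates the new LV on the host; on a "not responding" VM that command is never sent, so the subsequent image-info call targets a device path that does not exist.

```python
class Host:
    """Toy model of a host: tracks which LV device paths are active."""

    def __init__(self):
        self.active_paths = set()

    def prepare_image(self, path):
        # Activating the LV makes its device path available on the host.
        self.active_paths.add(path)

    def qemu_img_info(self, path):
        # Mirrors the real failure: an inactive LV's path does not exist.
        if path not in self.active_paths:
            raise FileNotFoundError(
                f"qemu-img: Could not open '{path}': No such file or directory")
        return {"format": "qcow2", "path": path}


def create_live_snapshot(host, vm_responding, new_volume_path):
    # SnapshotVDSCommand (which runs prepareImage) is only executed when
    # the VM is responsive; on a "not responding" VM it is skipped.
    if vm_responding:
        host.prepare_image(new_volume_path)
    # GetQemuImageInfoVDS runs either way -- and fails in the skipped case.
    return host.qemu_img_info(new_volume_path)
```

Calling `create_live_snapshot(host, vm_responding=False, ...)` raises `FileNotFoundError`, matching the GetQemuImageInfoVDS error in the engine log below.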

===
2018-09-10 00:21:02,467-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-18) [b00b0404-1d3e-4e59-8dfd-8caa9d9d9b96] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM 10.65.177.143 command GetQemuImageInfoVDS failed: Command ['/usr/bin/qemu-img', 'info', '--output', 'json', '-f', 'qcow2', '-U', u'/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/84c8b978-bcb1-4f17-9cb0-2429c2adebd9'] failed with rc=1 out='' err="qemu-img: Could not open '/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/84c8b978-bcb1-4f17-9cb0-2429c2adebd9': Could not open '/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/84c8b978-bcb1-4f17-9cb0-2429c2adebd9': No such file or directory\n"
===

Now the database will show 84c8b978 as an active image but the VM will be using cf7aee7d.

===
              image_guid              |            image_group_id            |               parentid               | imagestatus | active 
--------------------------------------+--------------------------------------+--------------------------------------+-------------+--------
 cf7aee7d-4e24-4c81-9d54-ddd56459348d | 0a400feb-3b50-4d46-9205-6bbdebb2b73b | 00000000-0000-0000-0000-000000000000 |           1 | f
 84c8b978-bcb1-4f17-9cb0-2429c2adebd9 | 0a400feb-3b50-4d46-9205-6bbdebb2b73b | cf7aee7d-4e24-4c81-9d54-ddd56459348d |           1 | t

virsh -r domblklist test_snap
Target     Source
------------------------------------------------
sda        /rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/cf7aee7d-4e24-4c81-9d54-ddd56459348d
===


Version-Release number of selected component (if applicable):

rhvm-4.2.5.2-0.1.el7ev.noarch

How reproducible:

100%

Steps to Reproduce:

1. Make a VM enter the "not responding" state. I did this by setting a low value for vm_command_timeout.
2. Create a snapshot on this VM.

Actual results:

RHV allows creating a snapshot of a "not responding" VM, which produces inconsistent snapshots.

Expected results:

I think RHV should prevent the snapshot operation on a "not responding" VM.
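The fix shipped in ovirt-engine-4.2.7.3 blocks live snapshot creation for an unresponsive VM (see the Doc Text above). A guard of that kind can be sketched as follows; the function name, status strings, and allowed-status set are illustrative, not the actual engine validation code:

```python
# Illustrative subset of VM statuses in which a snapshot request is accepted.
ALLOWED_SNAPSHOT_STATUSES = {"Up", "Down", "Paused"}


def validate_snapshot_allowed(vm_status):
    """Return (ok, message); refuse snapshots for an unresponsive VM."""
    if vm_status == "NotResponding":
        return False, "Cannot create a snapshot: the VM is not responding."
    if vm_status not in ALLOWED_SNAPSHOT_STATUSES:
        return False, f"Cannot create a snapshot while the VM is {vm_status}."
    return True, ""
```

Rejecting the request up front keeps the engine from creating a new image row in the database that the running VM will never switch to.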

Additional info:

(Originally by Nijin Ashok)

Comment 4 RHV bug bot 2018-10-08 14:22:43 UTC
This issue can also lead to ReduceVolume being executed on the active image the VM is using, if multiple snapshot operations have failed and we then try to delete one of those snapshots.

For example, suppose two snapshot operations (snap_1 and snap_2) have failed because of this bug.

===
engine=# select image_guid,image_group_id,parentid,vm_snapshot_id,imagestatus,active from images where image_group_id in 
(select image_group_id from vm_images_view where disk_id in 
(select device_id from vm_device where vm_id = (select vm_guid from vm_static where vm_name = 'test_vm') and device = 'disk')) ;
              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            | imagestatus | ac
tive 
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+-------------+---
-----
 36228393-eed2-46d9-a889-d417b8bdd25f | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | 00000000-0000-0000-0000-000000000000 | 7ae324c4-5245-418d-b548-75c99317b5fd |           1 | f
 d8f210db-0efe-43b9-939b-fcf7b04620ad | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | 36228393-eed2-46d9-a889-d417b8bdd25f | e2f3b03c-1a5b-42bb-9257-ebb7e18bdefe |           1 | f
 e47bd3d0-c045-4c3b-ab49-bf051e0d03d3 | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | d8f210db-0efe-43b9-939b-fcf7b04620ad | 8d4518af-8ff7-4a0c-8cc2-1818aaa15705 |           1 | t


virsh -r domblklist test_vm
Target     Source
------------------------------------------------
hdc        -
hdd        /var/run/vdsm/payload/ec4aac3d-c710-45c8-b75d-39ce4b74cdf9.7e994787fed2f0cc041bf0799baea18a.img
sda        /rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/d7ea2159-dd3f-41d9-b76c-a4a39aac7832/36228393-eed2-46d9-a889-d417b8bdd25f

engine=# select description from snapshots where snapshot_id in ('7ae324c4-5245-418d-b548-75c99317b5fd','e2f3b03c-1a5b-42bb-9257-ebb7e18bdefe','8d4518af-8ff7-4a0c-8cc2-1818aaa15705');
 description 
-------------
 snap_1
 Active VM
 snap_2
(3 rows)
===

Image 36228393 is the active image used by the VM. If we now try to delete snap_1, which is an "internal" snapshot, ReduceImageVDSCommand is also initiated on image 36228393.

===
engine log

2018-09-10 07:19:31,167-04 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-8) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13 (Failed with error imageErr and code 13)

2018-09-10 07:19:32,142-04 INFO  [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] Merge command (jobId = null) has completed for images '36228393-eed2-46d9-a889-d417b8bdd25f'..'d8f210db-0efe-43b9-939b-fcf7b04620ad'

2018-09-10 07:19:49,461-04 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ReduceImageVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-1) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] START, ReduceImageVDSCommand( ReduceImageVDSCommandParameters:{storagePoolId='0aa2db2e-9bf5-11e8-8497-001a4a17015d', ignoreFailoverLimit='false', storageDomainId='c21260fa-bc27-4bd3-9841-99da4e2d0402', imageGroupId='d7ea2159-dd3f-41d9-b76c-a4a39aac7832', imageId='36228393-eed2-46d9-a889-d417b8bdd25f', allowActive='true'}), log id: 2539e1af
2018-09-10 07:19:49,854-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-1) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] EVENT_ID: USER_REDUCE_DISK_FINISHED_SUCCESS(382), Disk 'GlanceDisk-9a3800a' has been successfully reduced.

vdsm log.

2018-09-10 11:19:49,654+0530 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Calling 'StoragePool.reduceVolume' in bridge with {u'allowActive': True, u'storagepoolID': u'0aa2db2e-9bf5-11e8-8497-001a4a17015d', u'imageID': u'd7ea2159-dd3f-41d9-b76c-a4a39aac7832', u'volumeID': u'36228393-eed2-46d9-a889-d417b8bdd25f', u'storagedomainID': u'c21260fa-bc27-4bd3-9841-99da4e2d0402'} (__init__:590)

2018-09-10 11:19:50,319+0530 INFO  (tasks/1) [storage.LVM] Reducing LV c21260fa-bc27-4bd3-9841-99da4e2d0402/36228393-eed2-46d9-a889-d417b8bdd25f to 3743 megabytes (force=True) (lvm:1242)
2018-09-10 11:19:50,319+0530 DEBUG (tasks/1) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /sbin/lvm lvreduce --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360014053f404fa44d844d9198cfee437|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --autobackup n --force --size 3743m c21260fa-bc27-4bd3-9841-99da4e2d0402/36228393-eed2-46d9-a889-d417b8bdd25f (cwd None) (commands:65)
====

This can lead to two major issues.

[1] Data corruption.

vdsm executes lvreduce on the image the VM is actively writing to, which can lead to data corruption.

[2] Incorrect block threshold value.

Since the active LV was reduced, the VM is still using the old threshold value, which may be higher than the current LV size.

virsh -r domstats test_vm --block |grep -i threshold
  block.2.threshold=4429185024

lvs|grep 36228393
  36228393-eed2-46d9-a889-d417b8bdd25f c21260fa-bc27-4bd3-9841-99da4e2d0402 -wi-ao----   3.75g

So if the VM writes more data, the LV is never extended: the threshold value is now higher than the LV size, so the block-threshold event never fires, and the VM eventually ends up paused with an ENOSPC error.
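The arithmetic can be checked directly from the output above: the registered threshold (4429185024 bytes, about 4.12 GiB) exceeds the reduced LV size (3.75 GiB), so allocation can never cross the threshold before the LV is full. A minimal sketch, with the event condition simplified from libvirt's block-threshold semantics:

```python
GiB = 1024 ** 3

threshold = 4429185024       # block.2.threshold from virsh domstats
lv_size = int(3.75 * GiB)    # reduced LV size reported by lvs (4026531840)


def extension_event_fires(allocation, threshold, lv_size):
    # The guest can only allocate up to the LV size; the block-threshold
    # event fires once allocation reaches the registered threshold.
    return min(allocation, lv_size) >= threshold


# Even with the LV completely full, the event never fires,
# so the LV is never extended and the guest eventually hits ENOSPC.
assert not extension_event_fires(lv_size, threshold, lv_size)
```

With a correctly sized LV (threshold below the LV size) the event would fire and trigger extension before the volume fills.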

I think other existing BZs can also lead to a condition like this, for example bug 1583424. Please let me know if we need to open a new bug specifically for this.

(Originally by Nijin Ashok)

Comment 5 RHV bug bot 2018-10-08 14:22:48 UTC
(In reply to nijin ashok from comment #2)
> I think the other existing BZ's also can lead condition like this. For
> example this bug 1583424. 

Typo here. I meant this bug 1554369.

(Originally by Nijin Ashok)

Comment 6 RHV bug bot 2018-10-08 14:22:52 UTC
Michal / Arik, your thoughts? Storage-wise, the snapshot creation seems to work correctly, but changing the active snapshot of the VM fails because the VM is not responsive.

(Originally by Tal Nisan)

Comment 7 RHV bug bot 2018-10-08 14:22:56 UTC
We shouldn't allow any action on the VM that involves communication with libvirt/qemu/guest/anything while the VM is Not Responding. That state signals a problem in communication with *some* component, without differentiating which actions could potentially still work.

(Originally by michal.skrivanek)

Comment 8 RHV bug bot 2018-10-08 14:23:00 UTC
Quite often the VM is not responding because the host is overloaded and communication is lagging and timing out. A snapshot would make things even worse, so I believe we should forbid it.

(Originally by michal.skrivanek)

Comment 11 Evelina Shames 2018-10-25 07:08:51 UTC
Verified.

Engine: 4.2.7.3-0.1.el7ev
VDSM: vdsm-4.20.43-1.el7ev.x86_64

Comment 14 errata-xmlrpc 2018-11-05 15:03:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3480

Comment 15 Daniel Gur 2019-08-28 13:13:57 UTC
sync2jira

Comment 16 Daniel Gur 2019-08-28 13:18:11 UTC
sync2jira