+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1626907 +++
======================================================================

Description of problem:

If a VM is not responding and we try to create a snapshot on it, the operation creates an inconsistent image structure in the database.

The VM went into "Not Responding":

===
2018-09-10 00:17:45,301-04 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [] VM '67d94860-76a1-42c3-b55e-34abccaa2253'(test_snap) moved from 'Up' --> 'NotResponding'

2018-09-10 00:17:45,318-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [] EVENT_ID: VM_NOT_RESPONDING(126), VM test_snap is not responding.
===

RHV allows creating a snapshot on this VM, but it will not initiate SnapshotVDSCommand. "prepareImage" is done only as part of the "snapshot" command, so the new LV is never activated on the host where the VM is running, and "qemu-img info" fails with "No such file or directory":

===
2018-09-10 00:21:02,467-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-18) [b00b0404-1d3e-4e59-8dfd-8caa9d9d9b96] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM 10.65.177.143 command GetQemuImageInfoVDS failed: Command ['/usr/bin/qemu-img', 'info', '--output', 'json', '-f', 'qcow2', '-U', u'/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/84c8b978-bcb1-4f17-9cb0-2429c2adebd9'] failed with rc=1 out='' err="qemu-img: Could not open '/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/84c8b978-bcb1-4f17-9cb0-2429c2adebd9': Could not open '/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/84c8b978-bcb1-4f17-9cb0-2429c2adebd9': No such file or directory\n"
===

Now the database shows 84c8b978 as the active image, while the VM is actually using cf7aee7d:

===
              image_guid              |            image_group_id            |               parentid               | imagestatus | active
--------------------------------------+--------------------------------------+--------------------------------------+-------------+--------
 cf7aee7d-4e24-4c81-9d54-ddd56459348d | 0a400feb-3b50-4d46-9205-6bbdebb2b73b | 00000000-0000-0000-0000-000000000000 |           1 | f
 84c8b978-bcb1-4f17-9cb0-2429c2adebd9 | 0a400feb-3b50-4d46-9205-6bbdebb2b73b | cf7aee7d-4e24-4c81-9d54-ddd56459348d |           1 | t

virsh -r domblklist test_snap
 Target     Source
------------------------------------------------
 sda        /rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/0a400feb-3b50-4d46-9205-6bbdebb2b73b/cf7aee7d-4e24-4c81-9d54-ddd56459348d
===

Version-Release number of selected component (if applicable):
rhvm-4.2.5.2-0.1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Make a VM go not responding. I set a low value for vm_command_timeout.
2. Create a snapshot on this VM.

Actual results:
RHV allows creating a snapshot on a "Not Responding" VM, which creates inconsistent snapshots.

Expected results:
The snapshot operation should be prevented on a "Not Responding" VM.

Additional info:

(Originally by Nijin Ashok)
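For anyone triaging a similar report, here is a minimal diagnostic sketch (assumption: the VG/LV names are the storage domain and volume UUIDs taken from the logs above) that checks the LVM activation state the failing "qemu-img info" call depends on:

===
#!/usr/bin/env python3
# Diagnostic sketch, not part of RHV/vdsm: check whether the new snapshot
# LV is active on this host before querying it with qemu-img.
import subprocess

VG = "c21260fa-bc27-4bd3-9841-99da4e2d0402"  # storage domain (from the logs above)
LV = "84c8b978-bcb1-4f17-9cb0-2429c2adebd9"  # new snapshot volume

attr = subprocess.check_output(
    ["lvs", "--noheadings", "-o", "lv_attr", "%s/%s" % (VG, LV)],
    text=True).strip()

# The 5th character of lv_attr is the activation state: 'a' means active.
if attr[4] != "a":
    print("LV %s/%s is not active on this host, so qemu-img info on its "
          "device path fails with 'No such file or directory'" % (VG, LV))
===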
This issue can also lead to ReduceVolume being executed on the active image the VM is using, if there are multiple failed snapshot operations and we then try to delete one of those snapshots. For example, suppose two snapshot operations (snap_1 and snap_2) failed because of this bug:

===
engine=# select image_guid,image_group_id,parentid,vm_snapshot_id,imagestatus,active from images where image_group_id in (select image_group_id from vm_images_view where disk_id in (select device_id from vm_device where vm_id = (select vm_guid from vm_static where vm_name = 'test_vm') and device = 'disk'));

              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            | imagestatus | active
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+-------------+--------
 36228393-eed2-46d9-a889-d417b8bdd25f | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | 00000000-0000-0000-0000-000000000000 | 7ae324c4-5245-418d-b548-75c99317b5fd |           1 | f
 d8f210db-0efe-43b9-939b-fcf7b04620ad | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | 36228393-eed2-46d9-a889-d417b8bdd25f | e2f3b03c-1a5b-42bb-9257-ebb7e18bdefe |           1 | f
 e47bd3d0-c045-4c3b-ab49-bf051e0d03d3 | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | d8f210db-0efe-43b9-939b-fcf7b04620ad | 8d4518af-8ff7-4a0c-8cc2-1818aaa15705 |           1 | t

virsh -r domblklist test_vm
 Target     Source
------------------------------------------------
 hdc        -
 hdd        /var/run/vdsm/payload/ec4aac3d-c710-45c8-b75d-39ce4b74cdf9.7e994787fed2f0cc041bf0799baea18a.img
 sda        /rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/d7ea2159-dd3f-41d9-b76c-a4a39aac7832/36228393-eed2-46d9-a889-d417b8bdd25f

engine=# select description from snapshots where snapshot_id in ('7ae324c4-5245-418d-b548-75c99317b5fd','e2f3b03c-1a5b-42bb-9257-ebb7e18bdefe','8d4518af-8ff7-4a0c-8cc2-1818aaa15705');
 description
-------------
 snap_1
 Active VM
 snap_2
(3 rows)
===

36228393 is the active image in use by the VM. If we now try to delete snap_1, which is an "internal" snapshot, the engine also initiates ReduceImageVDSCommand on image 36228393.
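As a hedged illustration, the engine/libvirt mismatch above can be detected from the host by comparing the engine's active='t' image with what QEMU reports; the helper below is hypothetical, and the hard-coded UUID is simply the active image from the query above:

===
#!/usr/bin/env python3
# Consistency-check sketch (hypothetical helper, not RHV code): compare
# the volume the engine marks active='t' with the volume QEMU actually
# writes to, according to 'virsh -r domblklist'.
import subprocess

def qemu_active_volume(vm_name, target="sda"):
    # Return the last path component (the volume UUID) of the target's source.
    out = subprocess.check_output(
        ["virsh", "-r", "domblklist", vm_name], text=True)
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == target:
            return fields[1].rsplit("/", 1)[-1]
    return None

# image_guid with active='t' in the engine query above:
engine_active = "e47bd3d0-c045-4c3b-ab49-bf051e0d03d3"

qemu_active = qemu_active_volume("test_vm")
if qemu_active and qemu_active != engine_active:
    print("MISMATCH: engine active image %s, but QEMU writes to %s" %
          (engine_active, qemu_active))
===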
===
engine log:

2018-09-10 07:19:31,167-04 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-8) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13 (Failed with error imageErr and code 13)

2018-09-10 07:19:32,142-04 INFO [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] Merge command (jobId = null) has completed for images '36228393-eed2-46d9-a889-d417b8bdd25f'..'d8f210db-0efe-43b9-939b-fcf7b04620ad'

2018-09-10 07:19:49,461-04 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ReduceImageVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-1) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] START, ReduceImageVDSCommand( ReduceImageVDSCommandParameters:{storagePoolId='0aa2db2e-9bf5-11e8-8497-001a4a17015d', ignoreFailoverLimit='false', storageDomainId='c21260fa-bc27-4bd3-9841-99da4e2d0402', imageGroupId='d7ea2159-dd3f-41d9-b76c-a4a39aac7832', imageId='36228393-eed2-46d9-a889-d417b8bdd25f', allowActive='true'}), log id: 2539e1af

2018-09-10 07:19:49,854-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-1) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] EVENT_ID: USER_REDUCE_DISK_FINISHED_SUCCESS(382), Disk 'GlanceDisk-9a3800a' has been successfully reduced.

vdsm log:

2018-09-10 11:19:49,654+0530 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Calling 'StoragePool.reduceVolume' in bridge with {u'allowActive': True, u'storagepoolID': u'0aa2db2e-9bf5-11e8-8497-001a4a17015d', u'imageID': u'd7ea2159-dd3f-41d9-b76c-a4a39aac7832', u'volumeID': u'36228393-eed2-46d9-a889-d417b8bdd25f', u'storagedomainID': u'c21260fa-bc27-4bd3-9841-99da4e2d0402'} (__init__:590)

2018-09-10 11:19:50,319+0530 INFO (tasks/1) [storage.LVM] Reducing LV c21260fa-bc27-4bd3-9841-99da4e2d0402/36228393-eed2-46d9-a889-d417b8bdd25f to 3743 megabytes (force=True) (lvm:1242)

2018-09-10 11:19:50,319+0530 DEBUG (tasks/1) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /sbin/lvm lvreduce --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360014053f404fa44d844d9198cfee437|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --autobackup n --force --size 3743m c21260fa-bc27-4bd3-9841-99da4e2d0402/36228393-eed2-46d9-a889-d417b8bdd25f (cwd None) (commands:65)
===

This can lead to two major issues:

[1] Data corruption. vdsm executes lvreduce on an image the VM is actively writing to, which can corrupt data.

[2] Incorrect block threshold value. Since the active LV has been reduced, the VM is still using the old threshold limit, which may be higher than the current size of the active LV.
===
virsh -r domstats test_vm --block | grep -i threshold
  block.2.threshold=4429185024

lvs | grep 36228393
  36228393-eed2-46d9-a889-d417b8bdd25f c21260fa-bc27-4bd3-9841-99da4e2d0402 -wi-ao---- 3.75g
===

So if the VM writes more data, the LV will never be extended: the registered threshold (4429185024 bytes) is now beyond the end of the reduced 3.75 GiB LV (4026531840 bytes), so the threshold event never fires, and the VM eventually ends up paused with an ENOSPC error.

I think other existing BZs can also lead to a condition like this, for example bug 1583424. Please let me know if we need to open a new bug specifically for this.

(Originally by Nijin Ashok)
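The stale-threshold condition can be checked with the same two commands; a minimal sketch (names taken from the outputs above, adjust for your environment):

===
#!/usr/bin/env python3
# Sketch of the stale-threshold check: if a registered block threshold is
# at or beyond the (reduced) LV size, libvirt never fires the threshold
# event, the LV is never extended, and the guest eventually hits ENOSPC.
import subprocess

VM = "test_vm"
VG = "c21260fa-bc27-4bd3-9841-99da4e2d0402"
LV = "36228393-eed2-46d9-a889-d417b8bdd25f"

stats = subprocess.check_output(
    ["virsh", "-r", "domstats", VM, "--block"], text=True)
thresholds = [int(line.split("=")[1]) for line in stats.splitlines()
              if ".threshold=" in line]

lv_size = int(float(subprocess.check_output(
    ["lvs", "--noheadings", "--units", "b", "--nosuffix",
     "-o", "lv_size", "%s/%s" % (VG, LV)], text=True).strip()))

for threshold in thresholds:
    if threshold >= lv_size:
        print("Stale threshold %d >= LV size %d: extension will never "
              "trigger and the VM will pause with ENOSPC" %
              (threshold, lv_size))
===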
(In reply to nijin ashok from comment #2)
> I think other existing BZs can also lead to a condition like this, for
> example bug 1583424.

Typo here. I meant bug 1554369.

(Originally by Nijin Ashok)
Michal / Arik, your thoughts? Storage-wise the snapshot creation seems to be working correctly, but changing the active snapshot of the VM fails since the VM is not responsive.

(Originally by Tal Nisan)
We shouldn't allow any action on the VM that involves communication with libvirt/qemu/the guest/anything else while the VM is Not Responding. The state signals a problem in communication with *some* component, without differentiating which actions could potentially still work.

(Originally by michal.skrivanek)
Quite often the VM is not responding because the host is overloaded and communication is lagging and timing out. A snapshot would make things even worse, so I believe we should forbid it.

(Originally by michal.skrivanek)
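To make the proposed rule concrete, a pseudocode sketch (illustrative names only; the actual validation would live in the engine's Java command code):

===
# Pseudocode sketch of the proposed guard, not actual ovirt-engine code.
ACTIONS_NEEDING_QEMU = {"CreateSnapshot", "LiveMerge", "LiveMigrate"}

def validate(vm_status, action):
    # Refuse anything that must talk to libvirt/qemu/the guest while the
    # VM is Not Responding: we cannot tell which calls would still work.
    if vm_status == "NotResponding" and action in ACTIONS_NEEDING_QEMU:
        return False, "Cannot run %s: VM is not responding" % action
    return True, None
===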
Verified.

Engine: 4.2.7.3-0.1.el7ev
VDSM: vdsm-4.20.43-1.el7ev.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3480