Bug 1637066
Summary: | [downstream clone - 4.2.7] Live Snapshot creation on a "not responding" VM will fail during "GetQemuImageInfoVDS" | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot> |
Component: | ovirt-engine | Assignee: | shani <sleviim> |
Status: | CLOSED ERRATA | QA Contact: | Evelina Shames <eshames> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.2.5 | CC: | ahadas, apinnick, bscalio, ebenahar, gveitmic, lveyde, michal.skrivanek, ratamir, Rhev-m-bugs, sleviim, tnisan |
Target Milestone: | ovirt-4.2.7 | Keywords: | ZStream |
Target Release: | --- | Flags: | lsvaty: testing_plan_complete- |
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | ovirt-engine-4.2.7.3 | Doc Type: | Bug Fix |
Doc Text: | The current release blocks the creation of a live snapshot of an unresponsive virtual machine. |
Story Points: | --- |
Clone Of: | 1626907 | Environment: | |
Last Closed: | 2018-11-05 15:03:18 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1626907 | ||
Bug Blocks: |
Description
RHV bug bot 2018-10-08 14:22:25 UTC
This issue can also lead to ReduceVolume being executed on an active image that the VM is using, if multiple snapshot operations failed because of this bug and we then try to delete one of those snapshots. For example, suppose two snapshot operations (snap_1 and snap_2) failed:

===
engine=# select image_guid,image_group_id,parentid,vm_snapshot_id,imagestatus,active
         from images
         where image_group_id in
           (select image_group_id from vm_images_view
            where disk_id in
              (select device_id from vm_device
               where vm_id = (select vm_guid from vm_static where vm_name = 'test_vm')
                 and device = 'disk'));

              image_guid              |            image_group_id            |               parentid               |            vm_snapshot_id            | imagestatus | active
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+-------------+--------
 36228393-eed2-46d9-a889-d417b8bdd25f | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | 00000000-0000-0000-0000-000000000000 | 7ae324c4-5245-418d-b548-75c99317b5fd |           1 | f
 d8f210db-0efe-43b9-939b-fcf7b04620ad | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | 36228393-eed2-46d9-a889-d417b8bdd25f | e2f3b03c-1a5b-42bb-9257-ebb7e18bdefe |           1 | f
 e47bd3d0-c045-4c3b-ab49-bf051e0d03d3 | d7ea2159-dd3f-41d9-b76c-a4a39aac7832 | d8f210db-0efe-43b9-939b-fcf7b04620ad | 8d4518af-8ff7-4a0c-8cc2-1818aaa15705 |           1 | t

virsh -r domblklist test_vm
Target     Source
------------------------------------------------
hdc        -
hdd        /var/run/vdsm/payload/ec4aac3d-c710-45c8-b75d-39ce4b74cdf9.7e994787fed2f0cc041bf0799baea18a.img
sda        /rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/d7ea2159-dd3f-41d9-b76c-a4a39aac7832/36228393-eed2-46d9-a889-d417b8bdd25f

engine=# select description from snapshots
         where snapshot_id in ('7ae324c4-5245-418d-b548-75c99317b5fd',
                               'e2f3b03c-1a5b-42bb-9257-ebb7e18bdefe',
                               '8d4518af-8ff7-4a0c-8cc2-1818aaa15705');
 description
-------------
 snap_1
 Active VM
 snap_2
(3 rows)
===

Image 36228393 is the active image in use by the VM. If we now try to delete snap_1, which is an "internal" snapshot, the engine also initiates ReduceImageVDSCommand on image 36228393:

=== engine log
2018-09-10 07:19:31,167-04 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-8) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13 (Failed with error imageErr and code 13)
2018-09-10 07:19:32,142-04 INFO [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] Merge command (jobId = null) has completed for images '36228393-eed2-46d9-a889-d417b8bdd25f'..'d8f210db-0efe-43b9-939b-fcf7b04620ad'
2018-09-10 07:19:49,461-04 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ReduceImageVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-1) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] START, ReduceImageVDSCommand( ReduceImageVDSCommandParameters:{storagePoolId='0aa2db2e-9bf5-11e8-8497-001a4a17015d', ignoreFailoverLimit='false', storageDomainId='c21260fa-bc27-4bd3-9841-99da4e2d0402', imageGroupId='d7ea2159-dd3f-41d9-b76c-a4a39aac7832', imageId='36228393-eed2-46d9-a889-d417b8bdd25f', allowActive='true'}), log id: 2539e1af
2018-09-10 07:19:49,854-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-1) [abc511cf-7e4b-4ab7-93b3-64ea5061bc10] EVENT_ID: USER_REDUCE_DISK_FINISHED_SUCCESS(382), Disk 'GlanceDisk-9a3800a' has been successfully reduced.
===

=== vdsm log
2018-09-10 11:19:49,654+0530 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Calling 'StoragePool.reduceVolume' in bridge with {u'allowActive': True, u'storagepoolID': u'0aa2db2e-9bf5-11e8-8497-001a4a17015d', u'imageID': u'd7ea2159-dd3f-41d9-b76c-a4a39aac7832', u'volumeID': u'36228393-eed2-46d9-a889-d417b8bdd25f', u'storagedomainID': u'c21260fa-bc27-4bd3-9841-99da4e2d0402'} (__init__:590)
2018-09-10 11:19:50,319+0530 INFO (tasks/1) [storage.LVM] Reducing LV c21260fa-bc27-4bd3-9841-99da4e2d0402/36228393-eed2-46d9-a889-d417b8bdd25f to 3743 megabytes (force=True) (lvm:1242)
2018-09-10 11:19:50,319+0530 DEBUG (tasks/1) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /sbin/lvm lvreduce --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360014053f404fa44d844d9198cfee437|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --autobackup n --force --size 3743m c21260fa-bc27-4bd3-9841-99da4e2d0402/36228393-eed2-46d9-a889-d417b8bdd25f (cwd None) (commands:65)
===

This can lead to two major issues:

[1] Data corruption. vdsm executes lvreduce on an image the VM is actively writing to, which can corrupt the data.

[2] Incorrect block threshold value. Because the active LV has been reduced, the VM keeps using the old threshold limit, which is now higher than the actual LV size:

virsh -r domstats test_vm --block | grep -i threshold
block.2.threshold=4429185024
lvs | grep 36228393
  36228393-eed2-46d9-a889-d417b8bdd25f c21260fa-bc27-4bd3-9841-99da4e2d0402 -wi-ao---- 3.75g

So if the VM keeps writing data, the LV will never be extended: the registered threshold (about 4.12 GiB) is higher than the current LV size (3.75 GiB), so the threshold event never fires, and the VM eventually ends up paused with an ENOSPC error.

I think the other existing BZs can also lead to a condition like this.
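To make the second issue concrete, here is a minimal sketch (not actual vdsm code; the sizes are taken from the logs in this report) of why the block-threshold mechanism can no longer extend the LV once it has been erroneously reduced below the registered threshold:

```python
# Hypothetical sketch of the block-threshold logic described above;
# not vdsm code. Sizes come from the virsh/lvs output in this report.
GiB = 1024 ** 3

lv_size = int(3.75 * GiB)   # LV size after the erroneous lvreduce (3.75g)
threshold = 4429185024      # block.2.threshold still registered with libvirt

# The threshold event (which would trigger an LV extension) fires only when
# the guest's write offset crosses `threshold`. After the reduce, the
# threshold lies beyond the end of the LV, so the guest hits ENOSPC first.
assert threshold > lv_size

def next_event(write_offset: int) -> str:
    if write_offset >= lv_size:
        return "VM paused (ENOSPC)"
    if write_offset >= threshold:
        return "threshold event -> extend LV"
    return "ok"

# The guest reaches the end of the LV before ever crossing the threshold:
print(next_event(lv_size))  # "VM paused (ENOSPC)"
```

With a correctly sized LV the threshold sits below the LV end, so the extension event fires before the guest runs out of space; the reduce inverts that ordering.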
For example, bug 1583424. Please let me know if we need to open a new bug specifically for this. (Originally by Nijin Ashok)

(In reply to nijin ashok from comment #2)
> I think the other existing BZ's also can lead condition like this. For
> example this bug 1583424.

Typo here. I meant bug 1554369. (Originally by Nijin Ashok)

Michal / Arik, your thoughts? Storage-wise the snapshot creation seems to work correctly, but changing the active snapshot of the VM fails because the VM is not responsive. (Originally by Tal Nisan)

We shouldn't allow any action on the VM that involves communication with libvirt/qemu/the guest/anything else while the VM is Not Responding. That status signals a problem in communication with *some* component, without distinguishing which actions could still work. (Originally by michal.skrivanek)

Quite often a VM is not responding because the host is overloaded and communication is lagging and timing out. A snapshot would make things even worse, so I believe we should forbid it. (Originally by michal.skrivanek)

Verified.
Engine: 4.2.7.3-0.1.el7ev
VDSM: vdsm-4.20.43-1.el7ev.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3480
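The fix noted in the Doc Text above amounts to validating the VM's status before the live-snapshot command runs, instead of letting it fail later in GetQemuImageInfoVDS. A minimal sketch of that idea (the names here are hypothetical; the real change lives in ovirt-engine's snapshot command validation, which this report does not show):

```python
# Hypothetical sketch of the validation the fix adds; the actual change is
# in ovirt-engine's live-snapshot command, not shown in this report.
from enum import Enum

class VmStatus(Enum):
    UP = "Up"
    NOT_RESPONDING = "NotResponding"
    DOWN = "Down"

# Assumption for this sketch: a live snapshot is only attempted when Up.
LIVE_SNAPSHOT_OK = {VmStatus.UP}

def validate_live_snapshot(status: VmStatus) -> tuple:
    """Reject the operation up front when the VM is unresponsive, rather
    than failing mid-flow and leaving images in an illegal state."""
    if status not in LIVE_SNAPSHOT_OK:
        return False, f"Cannot create live snapshot: VM is {status.value}"
    return True, "ok"
```

For example, `validate_live_snapshot(VmStatus.NOT_RESPONDING)` returns `(False, "Cannot create live snapshot: VM is NotResponding")`, which matches Michal's point that no action requiring libvirt/qemu communication should proceed in that state.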