Bug 2126602 - engine allows destroying the storage domain when there are uncleared tasks and allows moving the SPM host to maintenance mode, which can even lead to OVF corruption
Summary: engine allows destroying the storage domain when there are uncleared tasks and allows moving the SPM host to maintenance mode, which can even lead to OVF corruption
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.7
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.3-async
Target Release: ---
Assignee: Benny Zlotnik
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks: 1541529
 
Reported: 2022-09-14 05:49 UTC by nijin ashok
Modified: 2022-11-30 07:53 UTC
CC List: 9 users

Fixed In Version: ovirt-engine-4.5.3.3
Doc Type: Bug Fix
Doc Text:
Previously, when a host failed to get the list of existing disks on an imported storage domain, it proceeded as if there were no disks, which was not necessarily true. As a result, new OVF_STORE disks might be created despite the existence of such disks, and might corrupt them. In this release, the attachment of the storage domain fails if there is a failure in getting the list of existing disks.
Clone Of:
Environment:
Last Closed: 2022-11-30 07:53:04 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
- Github oVirt ovirt-engine pull 733 (Merged): core: fail storage domain attach if getImagesList fails (2022-11-10 10:59:10 UTC)
- Github oVirt ovirt-engine pull 736 (Merged): core: fail storage domain attach if getImagesList fails (2022-11-10 10:59:12 UTC)
- Red Hat Issue Tracker RHV-47906 (2022-09-14 07:59:24 UTC)
- Red Hat Product Errata RHBA-2022:8695 (2022-11-30 07:53:08 UTC)

Description nijin ashok 2022-09-14 05:49:41 UTC
Description of problem:

The customer hit this issue while performing a DR drill.

The logs below are from my test environment and match what the customer saw.
SPM was dell-r530-3.
The other host dell-r740-3 was in maintenance.

The storage domain was moved to maintenance mode, which failed because of uncleared tasks.

~~~
2022-09-14 10:04:16,121+05 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [40a905b2-9cea-464a-bd8a-a6a55c96bf09] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', storageDomainId='ad170e60-073b-4349-a599-206bf1d8cb9f', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='75'}), log id: 7c1707a8

2022-09-14 10:04:16,167+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [40a905b2-9cea-464a-bd8a-a6a55c96bf09] EVENT_ID: USER_DEACTIVATE_STORAGE_DOMAIN_FAILED(969), Failed to deactivate Storage Domain iscsi (Data Center new_dc).
~~~

The SPM stop also failed with an uncleared-tasks error:

~~~
2022-09-14 10:04:16,150+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [40a905b2-9cea-464a-bd8a-a6a55c96bf09] EVENT_ID: VDS_ALERT_NOT_STOPPING_SPM_UNCLEARED_TASKS(9,030), Not stopping SPM on vds dell-r530-3, pool id 12c8ff21-0310-4f36-b4a0-f8e328a94a20 as there are uncleared tasks Task '1c6396b2-0869-4bc6-925c-fbb04c96e927', status 'finished'
~~~

However, the SD moved to Unknown status at this point:

~~~
2022-09-14 10:04:16,151+05 INFO  [org.ovirt.engine.core.bll.CommandCompensator] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [40a905b2-9cea-464a-bd8a-a6a55c96bf09] Command [id=366986d3-9cb4-45ec-8ec6-b3806cf9d6a1]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: EntityStatusSnapshot:{id='StoragePoolIsoMapId:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', storageId='ad170e60-073b-4349-a599-206bf1d8cb9f'}', status='Unknown'}.
~~~

This allowed the user to destroy the SD even though it still had uncleared tasks:

~~~
2022-09-14 10:04:21,659+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-405796) [0277440d-c37b-415e-a228-a57a53bc4824] EVENT_ID: USER_FORCE_REMOVE_STORAGE_DOMAIN(981), Storage Domain iscsi was forcibly removed by admin@internal-authz
~~~

Then SPM dell-r530-3 was moved to maintenance mode and was powered down.

~~~
2022-09-14 10:05:20,952+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-101) [f137d0df-6179-4106-b20a-1e132bc5e9d0] EVENT_ID: USER_VDS_MAINTENANCE_WITHOUT_REASON(620), Host dell-r530-3 was switched to Maintenance mode by admin@internal-authz.
~~~

Activated the other host.

~~~
2022-09-14 10:05:29,072+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-85) [180e9898] EVENT_ID: VDS_DETECTED(13), Status of host dell-r740-3 was set to Up.
~~~

Then tried to import and attach the SD. GetImagesList failed with the error below:

~~~
2022-09-14 10:12:45,909+05 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImagesListVDSCommand] (EE-ManagedThreadFactory-engine-Thread-405859) [e7bd4b4c-d4fd-4020-9ff0-15308cc2e5cd] START, GetImagesListVDSCommand( GetImagesListVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', sdUUID='ad170e60-073b-4349-a599-206bf1d8cb9f'}), log id: d18ce68
2022-09-14 10:12:45,910+05 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to dell-r530-3/10.65.176.76
2022-09-14 10:12:45,910+05 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connected to dell-r530-3/10.65.176.76:54321
2022-09-14 10:12:45,910+05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engine-Thread-405859) [e7bd4b4c-d4fd-4020-9ff0-15308cc2e5cd] IrsBroker::Failed::GetImagesListVDS
2022-09-14 10:12:45,910+05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImagesListVDSCommand] (EE-ManagedThreadFactory-engine-Thread-405859) [e7bd4b4c-d4fd-4020-9ff0-15308cc2e5cd] Command 'GetImagesListVDSCommand( GetImagesListVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', sdUUID='ad170e60-073b-4349-a599-206bf1d8cb9f'})' execution failed: IRSProtocolException:
2022-09-14 10:12:45,910+05 ERROR [org.ovirt.engine.core.bll.storage.disk.image.GetUnregisteredDisksQuery] (EE-ManagedThreadFactory-engine-Thread-405859) [e7bd4b4c-d4fd-4020-9ff0-15308cc2e5cd] Query 'GetUnregisteredDisksQuery' failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSProtocolException: IRSProtocolException:  (Failed with error ENGINE and code 5001)
~~~

The engine was trying to contact the old SPM, which was in maintenance and powered down, so GetImagesListVDSCommand failed and the engine concluded that the domain has no OVF_STORE disks.

~~~
2022-09-14 10:12:45,911+05 WARN  [org.ovirt.engine.core.bll.storage.pool.AddStoragePoolWithStoragesCommand] (EE-ManagedThreadFactory-engine-Thread-405859) [] There are no OVF_STORE disks on storage domain id ad170e60-073b-4349-a599-206bf1d8cb9f
~~~
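
To make the problematic logic concrete: the engine treated a failed GetImagesList as "the domain has no disks" and therefore went on to create new OVF_STORE disks; the pull requests linked above ("core: fail storage domain attach if getImagesList fails") later change this so the attach aborts instead. A minimal sketch in Python pseudocode; the real code is the ovirt-engine Java attach flow, and every name below is a hypothetical stand-in:

~~~
# Illustrative pseudocode only -- not the actual ovirt-engine (Java) code.

def get_images_list(sd_uuid):
    """Stand-in for GetImagesListVDSCommand; fails when the old SPM is unreachable."""
    raise RuntimeError("IRSProtocolException")

def attach_old(sd_uuid):
    # Buggy flow: a failed listing is treated as "no disks", so new
    # OVF_STORE disks are created even though real ones exist on storage.
    try:
        images = get_images_list(sd_uuid)
    except Exception:
        images = []
    if not images:
        print(f"no OVF_STORE found on {sd_uuid}; creating new OVF_STORE disks")
    else:
        print(f"registering {len(images)} existing disks")

def attach_fixed(sd_uuid):
    # Fixed flow (per the linked PRs): the listing failure propagates
    # and the attach fails, so nothing is created on stale information.
    images = get_images_list(sd_uuid)   # raises -> attach aborts
    print(f"registering {len(images)} existing disks")
~~~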

Started the server dell-r530-3 and activated it:

~~~
2022-09-14 10:34:08,488+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-9) [2fba19f2] EVENT_ID: VDS_DETECTED(13), Status of host dell-r530-3 was set to Up.
~~~

Attaching the SD now worked, but since the OVF_STORE disks were not registered, there were no VMs to import.

~~~
2022-09-14 10:35:00,170+05 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-406052) [42d9952f] START, ActivateStorageDomainVDSCommand( ActivateStorageDomainVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', storageDomainId='ad170e60-073b-4349-a599-206bf1d8cb9f'}), log id: 551e6a7c
2022-09-14 10:35:00,170+05 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to dell-r530-3/10.65.176.76
2022-09-14 10:35:00,170+05 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connected to dell-r530-3/10.65.176.76:54321
2022-09-14 10:35:30,148+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-406052) [519b4721] EVENT_ID: USER_ACTIVATED_STORAGE_DOMAIN(966), Storage Domain iscsi (Data Center new_dc) was activated by admin@internal-authz
~~~

Ran an OVF update on the SD, which created two new OVF_STORE disks.

~~~
2022-09-14 10:36:24,634+05 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.CreateOvfVolumeForStorageDomainCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [185cec60] Running command: CreateOvfVolumeForStorageDomainCommand internal: true. Entities affected :  ID: ad170e60-073b-4349-a599-206bf1d8cb9f Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN

2022-09-14 10:36:24,750+05 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [185cec60] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', storageDomainId='ad170e60-073b-4349-a599-206bf1d8cb9f', imageGroupId='6640a4fb-25bf-4032-af8a-a8b72714e75a', imageSizeInBytes='134217728', volumeFormat='RAW', newImageId='c012da8c-ffd7-4b50-8886-8857be2ba310', imageType='Preallocated', newImageDescription='{"DiskAlias":"OVF_STORE","DiskDescription":"OVF_STORE"}', imageInitialSizeInBytes='0', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000', shouldAddBitmaps='false', legal='true', sequenceNumber='1', bitmap='null'}), log id: 1915e011

2022-09-14 10:36:25,527+05 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [185cec60] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', storageDomainId='ad170e60-073b-4349-a599-206bf1d8cb9f', imageGroupId='048c9d95-23b2-4864-9d8a-096e51219f92', imageSizeInBytes='134217728', volumeFormat='RAW', newImageId='ac7e18ca-39b9-45d4-b2ce-7cae420dd5a2', imageType='Preallocated', newImageDescription='{"DiskAlias":"OVF_STORE","DiskDescription":"OVF_STORE"}', imageInitialSizeInBytes='0', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000', shouldAddBitmaps='false', legal='true', sequenceNumber='1', bitmap='null'}), log id: 6fbeef72
~~~

The SD now has 4 OVF_STORE disks, but only the 2 new ones, which contain no data, are registered in the engine database.

If the SD is detached and attached again, all 4 OVF_STORE disks will get registered, and on the next OVF update the engine will write the empty OVF data to the original ones, losing all the VMs' OVF metadata.
    

Version-Release number of selected component (if applicable):

rhvm-4.5.1.3-0.28.el8ev.noarch

How reproducible:

Reproduced 1 out of 5 times I tried.

Steps to Reproduce:

1. Create a task in the SD. I used vdsm-client to delete an image, which creates a "Finished" task that does not get cleared (see the example command after this list).
2. With one host active, put the SD into maintenance mode.
3. When the SD moves to unknown, destroy it. You have to be quick here since it will be in unknown status only for 3-5 seconds.
4. Put the old SPM in maintenance, shut it down, and activate the other host.
5. Import the same SD.
6. Attaching the SD will fail (this was not reproducible every time).
7. If it fails, activate the old SPM and try attaching again. It should work now.  
8. There will not be any VMs to import.
9. Try OVF update and it will create 2 new OVFs.
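
For step 1, the uncleared task was created by deleting an image directly through vdsm-client on the SPM host. A rough example of the kind of invocation, where the image UUID is a placeholder, the pool and domain UUIDs are the ones from the logs above, and the exact parameter names for Image.delete are from memory and may differ between vdsm versions:

~~~
# Placeholders only; check the vdsm API schema for the exact Image.delete parameters.
vdsm-client Image delete \
    imageID=<image-uuid> \
    storagepoolID=12c8ff21-0310-4f36-b4a0-f8e328a94a20 \
    storagedomainID=ad170e60-073b-4349-a599-206bf1d8cb9f \
    postZero=false force=false
~~~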

Actual results:

The engine allows destroying the storage domain when there are uncleared tasks and allows moving the SPM host to maintenance mode, which can even lead to OVF corruption.

Expected results:

The engine should not allow destroying the SD while there are uncleared tasks, as this leaves the DC in an inconsistent state. The destroy operation should fail if there are pending tasks, with a proper message to the user; users can then manually clean up the tasks.

Additional info:

Comment 4 nijin ashok 2022-09-22 09:10:21 UTC
Also, note that the new OVF_STORE disks were created because the old OVF_STORE failed to register due to the GetImagesListVDSCommand failure. I think we also need another check to confirm whether an OVF_STORE is already present on the SD just before creating a new one. This would take care of OVF_STORE corruption in cases where GetImagesListVDSCommand fails for some reason. Please let me know if this requires a new bug.
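
A minimal sketch of the kind of guard being suggested here, written as Python pseudocode with hypothetical helper names (the real implementation would belong in ovirt-engine's Java OVF-update flow):

~~~
# Hypothetical illustration of the suggested check -- not actual engine code.

def list_ovf_store_disks_on_storage(sd_uuid):
    """Stand-in for querying the storage itself (not the engine DB) for OVF_STORE disks."""
    return ["ovf-disk-1", "ovf-disk-2"]

def update_ovf_store(sd_uuid):
    existing = list_ovf_store_disks_on_storage(sd_uuid)  # must succeed, never be guessed
    if existing:
        print(f"reusing {len(existing)} OVF_STORE disks already on {sd_uuid}")
    else:
        print(f"no OVF_STORE on {sd_uuid}; safe to create new ones")
~~~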

Comment 12 Arik 2022-11-06 07:44:43 UTC
Benny, please document the fix and move to 'modified'

Comment 15 sshmulev 2022-11-14 07:20:04 UTC
Please add verification steps for this bug

Comment 16 sshmulev 2022-11-14 11:15:56 UTC
Verified according to the following flow:

1. Leave one host active
2. Add 'raise Exception('ex')' in the 'getImagesList' function (a sketch follows this list).
3. Restart vdsm
4. Detach and remove a storage domain
5. import the SD
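
For step 2, the injected failure looks roughly like the sketch below. It is assumed (not confirmed here) that the relevant method is the vdsm-side getImagesList, e.g. in vdsm/storage/hsm.py; the exact location and signature may differ between vdsm versions:

~~~
# Hypothetical sketch of the fault injection from step 2, not an exact patch.
# Assumed location: the HSM method backing GetImagesListVDSCommand in
# vdsm/storage/hsm.py; adjust to the actual method in your vdsm version.
    def getImagesList(self, sdUUID, options=None):
        raise Exception('ex')   # injected failure used for this verification
        # ... the original body is unreachable while the line above is present
~~~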

Expected behavior:
The attachment should fail.

Actual results:
Error while executing action Attach Storage Domain: Internal Engine Error

2022-11-14 13:04:32,353+02 ERROR [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (default task-40) [] Exception: java.lang.RuntimeException: Failed to retrieve unregistered disks


After removing the Exception line from the 'getImagesList' function, it was possible to attach the SD with OVF disks inside it.

Comment 17 sshmulev 2022-11-14 12:05:45 UTC
Versions verified on:
engine-4.5.3.3-1.el8ev
vdsm-4.50.3.5-1.el8ev

Comment 19 errata-xmlrpc 2022-11-30 07:53:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.5.3-2] update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8695

