Description of problem:

The customer hit this issue during a DR drill. The logs below are from my test environment and match the customer's. The SPM was dell-r530-3; the other host, dell-r740-3, was in maintenance.

Moving the storage domain to maintenance mode failed because of uncleared tasks:

~~~
2022-09-14 10:04:16,121+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [40a905b2-9cea-464a-bd8a-a6a55c96bf09] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', storageDomainId='ad170e60-073b-4349-a599-206bf1d8cb9f', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='75'}), log id: 7c1707a8
2022-09-14 10:04:16,167+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [40a905b2-9cea-464a-bd8a-a6a55c96bf09] EVENT_ID: USER_DEACTIVATE_STORAGE_DOMAIN_FAILED(969), Failed to deactivate Storage Domain iscsi (Data Center new_dc).
~~~

SpmStop also failed with an uncleared-tasks error:

~~~
2022-09-14 10:04:16,150+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [40a905b2-9cea-464a-bd8a-a6a55c96bf09] EVENT_ID: VDS_ALERT_NOT_STOPPING_SPM_UNCLEARED_TASKS(9,030), Not stopping SPM on vds dell-r530-3, pool id 12c8ff21-0310-4f36-b4a0-f8e328a94a20 as there are uncleared tasks Task '1c6396b2-0869-4bc6-925c-fbb04c96e927', status 'finished'
~~~

However, the SD moved to Unknown status at this point:

~~~
2022-09-14 10:04:16,151+05 INFO [org.ovirt.engine.core.bll.CommandCompensator] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [40a905b2-9cea-464a-bd8a-a6a55c96bf09] Command [id=366986d3-9cb4-45ec-8ec6-b3806cf9d6a1]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: EntityStatusSnapshot:{id='StoragePoolIsoMapId:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', storageId='ad170e60-073b-4349-a599-206bf1d8cb9f'}', status='Unknown'}.
~~~

This allowed the user to destroy the SD even though it still had uncleared tasks:

~~~
2022-09-14 10:04:21,659+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-405796) [0277440d-c37b-415e-a228-a57a53bc4824] EVENT_ID: USER_FORCE_REMOVE_STORAGE_DOMAIN(981), Storage Domain iscsi was forcibly removed by admin@internal-authz
~~~

The SPM dell-r530-3 was then moved to maintenance mode and powered down:

~~~
2022-09-14 10:05:20,952+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-101) [f137d0df-6179-4106-b20a-1e132bc5e9d0] EVENT_ID: USER_VDS_MAINTENANCE_WITHOUT_REASON(620), Host dell-r530-3 was switched to Maintenance mode by admin@internal-authz.
~~~

Activated the other host:
~~~
2022-09-14 10:05:29,072+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-85) [180e9898] EVENT_ID: VDS_DETECTED(13), Status of host dell-r740-3 was set to Up.
~~~

Tried to import and attach the SD. The image-list query failed with the error below:

~~~
2022-09-14 10:12:45,909+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImagesListVDSCommand] (EE-ManagedThreadFactory-engine-Thread-405859) [e7bd4b4c-d4fd-4020-9ff0-15308cc2e5cd] START, GetImagesListVDSCommand( GetImagesListVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', sdUUID='ad170e60-073b-4349-a599-206bf1d8cb9f'}), log id: d18ce68
2022-09-14 10:12:45,910+05 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to dell-r530-3/10.65.176.76
2022-09-14 10:12:45,910+05 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connected to dell-r530-3/10.65.176.76:54321
2022-09-14 10:12:45,910+05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engine-Thread-405859) [e7bd4b4c-d4fd-4020-9ff0-15308cc2e5cd] IrsBroker::Failed::GetImagesListVDS
2022-09-14 10:12:45,910+05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImagesListVDSCommand] (EE-ManagedThreadFactory-engine-Thread-405859) [e7bd4b4c-d4fd-4020-9ff0-15308cc2e5cd] Command 'GetImagesListVDSCommand( GetImagesListVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', sdUUID='ad170e60-073b-4349-a599-206bf1d8cb9f'})' execution failed: IRSProtocolException:
2022-09-14 10:12:45,910+05 ERROR [org.ovirt.engine.core.bll.storage.disk.image.GetUnregisteredDisksQuery] (EE-ManagedThreadFactory-engine-Thread-405859) [e7bd4b4c-d4fd-4020-9ff0-15308cc2e5cd] Query 'GetUnregisteredDisksQuery' failed: EngineException:
org.ovirt.engine.core.vdsbroker.irsbroker.IRSProtocolException: IRSProtocolException: (Failed with error ENGINE and code 5001)
~~~

The engine tried to contact the old SPM, which was in maintenance and powered down, so GetImagesListVDSCommand failed and the engine concluded that the domain has no OVF_STORE:

~~~
2022-09-14 10:12:45,911+05 WARN [org.ovirt.engine.core.bll.storage.pool.AddStoragePoolWithStoragesCommand] (EE-ManagedThreadFactory-engine-Thread-405859) [] There are no OVF_STORE disks on storage domain id ad170e60-073b-4349-a599-206bf1d8cb9f
~~~

Started the server dell-r530-3 and activated it:

~~~
2022-09-14 10:34:08,488+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-9) [2fba19f2] EVENT_ID: VDS_DETECTED(13), Status of host dell-r530-3 was set to Up.
~~~

Attaching the SD worked now, but since the OVF_STORE was not registered, there were no VMs to import:

~~~
2022-09-14 10:35:00,170+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-406052) [42d9952f] START, ActivateStorageDomainVDSCommand( ActivateStorageDomainVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', storageDomainId='ad170e60-073b-4349-a599-206bf1d8cb9f'}), log id: 551e6a7c
2022-09-14 10:35:00,170+05 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to dell-r530-3/10.65.176.76
2022-09-14 10:35:00,170+05 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connected to dell-r530-3/10.65.176.76:54321
2022-09-14 10:35:30,148+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-406052) [519b4721] EVENT_ID: USER_ACTIVATED_STORAGE_DOMAIN(966), Storage Domain iscsi (Data Center new_dc) was activated by admin@internal-authz
~~~

Ran an OVF update on the SD, which
created two new OVFs:

~~~
2022-09-14 10:36:24,634+05 INFO [org.ovirt.engine.core.bll.storage.ovfstore.CreateOvfVolumeForStorageDomainCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [185cec60] Running command: CreateOvfVolumeForStorageDomainCommand internal: true. Entities affected : ID: ad170e60-073b-4349-a599-206bf1d8cb9f Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2022-09-14 10:36:24,750+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [185cec60] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', storageDomainId='ad170e60-073b-4349-a599-206bf1d8cb9f', imageGroupId='6640a4fb-25bf-4032-af8a-a8b72714e75a', imageSizeInBytes='134217728', volumeFormat='RAW', newImageId='c012da8c-ffd7-4b50-8886-8857be2ba310', imageType='Preallocated', newImageDescription='{"DiskAlias":"OVF_STORE","DiskDescription":"OVF_STORE"}', imageInitialSizeInBytes='0', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000', shouldAddBitmaps='false', legal='true', sequenceNumber='1', bitmap='null'}), log id: 1915e011
2022-09-14 10:36:25,527+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [185cec60] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='12c8ff21-0310-4f36-b4a0-f8e328a94a20', ignoreFailoverLimit='false', storageDomainId='ad170e60-073b-4349-a599-206bf1d8cb9f', imageGroupId='048c9d95-23b2-4864-9d8a-096e51219f92', imageSizeInBytes='134217728', volumeFormat='RAW', newImageId='ac7e18ca-39b9-45d4-b2ce-7cae420dd5a2', imageType='Preallocated', newImageDescription='{"DiskAlias":"OVF_STORE","DiskDescription":"OVF_STORE"}', imageInitialSizeInBytes='0', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000', shouldAddBitmaps='false', legal='true', sequenceNumber='1', bitmap='null'}), log id: 6fbeef72
~~~

Now the SD has 4 OVF_STORE disks, but only the 2 new ones, which contain no data, are in the DB. If we detach and attach the SD again, all 4 will get registered, and the next OVF update will write the empty OVF data over the original disks, losing all VM OVFs.

Version-Release number of selected component (if applicable):
rhvm-4.5.1.3-0.28.el8ev.noarch

How reproducible:
Reproduced 1 out of 5 attempts.

Steps to Reproduce:
1. Create a task in the SD. I used vdsm-client to delete an image, which creates a "Finished" task that does not get cleared.
2. With one host active, put the SD into maintenance mode.
3. When the SD moves to Unknown, destroy it. You have to be quick here, since it stays in Unknown status for only 3-5 seconds.
4. Put the old SPM into maintenance, shut it down, and activate the other host.
5. Import the same SD.
6. Attaching the SD will fail (not reproducible every time).
7. If it fails, activate the old SPM and try attaching again; it should work now.
8. There will be no VMs to import.
9. Run an OVF update; it will create 2 new OVF_STORE disks.

Actual results:
The engine allows destroying the storage domain while there are uncleared tasks and allows moving the SPM host to maintenance mode, which can even corrupt the OVFs.

Expected results:
The engine should not allow destroying the SD while there are uncleared tasks, as this leaves the DC in an inconsistent state. Destroy should fail with a proper message to the user when tasks remain; users can then clean up the tasks manually.

Additional info:
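The expected result above amounts to a pre-destroy guard. A minimal sketch of such a check follows; all names here (`destroy_storage_domain`, the task dicts) are hypothetical illustrations, not the engine's actual API:

```python
# Illustrative sketch of the proposed guard: refuse to destroy a storage
# domain while the pool still holds uncleared SPM tasks, and tell the user
# which tasks to clear instead of leaving the DC inconsistent.

def destroy_storage_domain(domain_id, uncleared_tasks):
    """Fail fast with an actionable message when uncleared tasks remain."""
    if uncleared_tasks:
        ids = ", ".join(t["id"] for t in uncleared_tasks)
        raise RuntimeError(
            f"Cannot destroy storage domain {domain_id}: "
            f"uncleared tasks remain ({ids}); clear them manually first"
        )
    return f"storage domain {domain_id} destroyed"
```

With a guard like this, the destroy in step 3 of the reproduction would fail with the task ID in the message, matching the expected results.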
Also note that the new OVF_STORE disks were created because registering the old ones failed after GetImagesListVDSCommand failed. I think we also need a check that confirms whether an OVF_STORE already exists on the SD just before creating a new one. That would prevent OVF store corruption whenever GetImagesListVDSCommand fails for some reason. Please let me know if this requires a new bug.
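The suggested pre-creation check could look roughly like this. This is a sketch, not engine code: `fetch_image_list` and the dict shape are assumptions, standing in for the image-list query, which is assumed to raise on communication failure as GetImagesListVDSCommand did here:

```python
# Sketch of the suggested safety check: only create new OVF_STORE disks when
# the image list was read successfully AND shows no existing OVF_STORE.
# A failed fetch must never be treated as "no OVF_STORE present".

def should_create_new_ovf_store(fetch_image_list):
    try:
        images = fetch_image_list()
    except Exception as exc:
        # The image-list call failed (e.g. old SPM unreachable): abort the
        # OVF update instead of assuming the domain is empty.
        raise RuntimeError(f"cannot verify OVF_STORE presence: {exc}") from exc
    return not any(img.get("description") == "OVF_STORE" for img in images)
```

The key design point is distinguishing "verified empty" from "could not verify": only the former may trigger creation of fresh OVF_STORE disks.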
Benny, please document the fix and move to 'modified'
Please add verification steps for this bug
Verified according to the following flow:
1. Leave one host active.
2. Add 'raise Exception('ex')' in the 'getImagesList' function.
3. Restart vdsm.
4. Detach and remove a storage domain.
5. Import the SD.

Expected behavior: the attach should fail.

Actual results: the attach failed with "Error while executing action Attach Storage Domain: Internal Engine Error":

~~~
2022-11-14 13:04:32,353+02 ERROR [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (default task-40) [] Exception: java.lang.RuntimeException: Failed to retrieve unregistered disks
~~~

After removing the Exception line from the 'getImagesList' function, it was possible to attach the SD with the OVF disks inside it.
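The fault injection used in this verification can be modeled in isolation: make the image-list call raise and check that the attach-side logic propagates the failure instead of treating the domain as empty. Everything below is a self-contained mock (`FakeHSM`, `get_unregistered_disks` are illustrative names, not vdsm/engine code):

```python
# Model of the verification flow: patch getImagesList to raise (as done in
# vdsm for this test) and confirm the caller now fails loudly rather than
# concluding the domain has no OVF_STORE disks.

class FakeHSM:
    def getImagesList(self, sdUUID):
        # healthy path: return the image group IDs found on the domain
        return ["6640a4fb-25bf-4032-af8a-a8b72714e75a"]

def get_unregistered_disks(hsm, sd_uuid):
    try:
        return hsm.getImagesList(sd_uuid)
    except Exception as exc:
        # fixed behavior: surface the failure so Attach Storage Domain aborts
        raise RuntimeError("Failed to retrieve unregistered disks") from exc

hsm = FakeHSM()

def _broken(sdUUID):
    raise Exception("ex")  # mirrors the injected raise Exception('ex')

hsm.getImagesList = _broken  # simulate the patched vdsm
```

Removing the patch (using a plain `FakeHSM()`) restores the healthy path, matching the observation that the attach succeeds once the injected exception is removed.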
Versions verified on: engine-4.5.3.3-1.el8ev vdsm-4.50.3.5-1.el8ev
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.5.3-2] update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8695