Created attachment 1380998 [details]
logs

Description of problem:
Disk removal with postZeros='true', initiated as part of live storage migration, fails with a NullPointerException.

Version-Release number of selected component (if applicable):
rhvm-4.2.1-0.2.el7.noarch
vdsm-4.20.11-1.el7ev.x86_64
libvirt-3.9.0-6.el7.x86_64
qemu-kvm-rhev-2.10.0-15.el7.x86_64

How reproducible:
Not always

Steps to Reproduce:
1. Create a 60G disk on a block domain with "wipe after delete" set to true
2. Remove the disk

Actual results:

2018-01-14 14:22:56,857+02 ERROR [org.ovirt.engine.core.bll.job.JobRepositoryImpl] (EE-ManagedThreadFactory-engine-Thread-218345) [35610449] Failed to save step 'f6bef7d3-6902-4840-a270-972b26e61898', 'FINALIZING': CallableStatementCallback; SQL [{call insertstep(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table "step" violates foreign key constraint "fk_step_job"
  Detail: Key (job_id)=(8381b310-d76a-4ad8-a5ec-cb55105a2054) is not present in table "job".
  Where: SQL statement "INSERT INTO step (
    step_id,
    parent_step_id,
    job_id,
    step_type,
    description,
    .
    .
    .

2018-01-14 14:22:56,869+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-218345) [35610449] [within thread]: endAction for action type RemoveImage threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.job.ExecutionHandler.endJob(ExecutionHandler.java:520) [bll.jar:]
        at org.ovirt.engine.core.bll.job.ExecutionHandler.endTaskJobIfNeeded(ExecutionHandler.java:851) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endStepsAndJobIfNeeded(CommandBase.java:606) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:579) [bll.jar:]

Eventually, the disk is removed from the storage.

It might be related to the zeroing process done as part of the disk removal, which took ~20 minutes:

2018-01-14 14:07:37,367+0200 INFO  (jsonrpc/4) [vdsm.api] START deleteImage(sdUUID='ab4e8afa-669d-4d6d-9fc5-65da708a96ce', spUUID='cd0dcf40-4bda-4c0a-9c54-4508fe8d1e49', imgUUID='e0528294-511e-4aaa-a615-0bce3b699871', postZero='true', force='false', discard=True) from=::ffff:10.35.161.182,36072, flow_id=35610449, task_id=11483837-602e-4a6c-ac57-624076478fd2 (api:46)
2018-01-14 14:07:37,370+0200 DEBUG (jsonrpc/4) [storage.ResourceManager] Trying to register resource '00_storage.e0528294-511e-4aaa-a615-0bce3b699871_ab4e8afa-669d-4d6d-9fc5-65da708a96ce' for lock type 'exclusive' (resourceManager:495)
2018-01-14 14:27:45,877+0200 DEBUG (tasks/9) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/blockSD/25f1eed9-e7ec-4453-8c8e-f3a461e85d63/images/e0528294-511e-4aaa-a615-0bce3b699871 (blockSD:698)

Expected results:
Disk deletion should succeed

Additional info:

Disk creation:
2018-01-14 13:58:49,883+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (default task-29) [disks_create_0e17ddae-60f8-4b93] START, CreateImageVDSCommand( CreateImageVDSCommandParameters:{storagePoolId='cd0dcf40-4bda-4c0a-9c54-4508fe8d1e49', ignoreFailoverLimit='false', storageDomainId='ab4e8afa-669d-4d6d-9fc5-65da708a96ce', imageGroupId='e0528294-511e-4aaa-a615-0bce3b699871', imageSizeInBytes='64424509440', volumeFormat='RAW', newImageId='3233b08d-c0ba-4a73-8721-a07d1e00dc40', imageType='Preallocated', newImageDescription='{"DiskAlias":"disk_type_1413584940","DiskDescription":""}', imageInitialSizeInBytes='0'}), log id: 5881d51d

Disk deletion (host3-vdsm.log):
2018-01-14 14:07:37,239+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.RemoveImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [35610449] Running command: RemoveImageCommand internal: true. Entities affected : ID: ab4e8afa-669d-4d6d-9fc5-65da708a96ce Type: Storage
2018-01-14 14:07:37,360+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [35610449] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='cd0dcf40-4bda-4c0a-9c54-4508fe8d1e49', ignoreFailoverLimit='false', storageDomainId='ab4e8afa-669d-4d6d-9fc5-65da708a96ce', imageGroupId='e0528294-511e-4aaa-a615-0bce3b699871', postZeros='true', discard='true', forceDelete='false'}), log id: 5b09e964
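For reference, the vdsm timestamps above (deleteImage START at 14:07:37, image dir removed at 14:27:45) together with the image size from CreateImageVDSCommand give a rough wipe throughput, assuming the whole window was spent zeroing; a quick back-of-the-envelope check:

```python
from datetime import datetime

# Timestamps from the vdsm log above: deleteImage START and image-dir removal.
start = datetime.strptime("2018-01-14 14:07:37", "%Y-%m-%d %H:%M:%S")
end = datetime.strptime("2018-01-14 14:27:45", "%Y-%m-%d %H:%M:%S")
elapsed_s = (end - start).total_seconds()  # 1208 seconds, i.e. ~20 minutes

# Image size from CreateImageVDSCommand: imageSizeInBytes='64424509440' (60 GiB).
size_bytes = 64424509440

# Effective wipe throughput, assuming the whole window was spent zeroing.
throughput_mb_s = size_bytes / elapsed_s / 1e6
print(f"elapsed: {elapsed_s:.0f}s, wipe throughput: ~{throughput_mb_s:.0f} MB/s")
```

So a ~20-minute removal is plausible for a preallocated 60G volume being zeroed at roughly 50 MB/s; faster storage would shorten the race window described below.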
Steps to Reproduce:
1. Create a VM with an attached 60G disk that resides on a block domain with "wipe after delete" set to true
2. Start the VM and move the disk to a different domain
I assume it's a regression?
It's a bit hard to say. Since we've not seen it in previous versions (checked in the latest 4.1 automation execution we've done), then, theoretically, yes. But this test execution was recently modified (converted to pytest parametrize), which might be why the bug is only being exposed now.
To reproduce, you need a situation where removal of the source image takes a long time (you can use WAD, i.e. wipe after delete): long enough that LiveMigrateDiskCommand finishes before the RemoveImage of the source image ends. You also need to make sure that the endAction of the RemoveImage command (in the engine log: "Attempting to endAction 'RemoveImage'") occurs only after the job is removed from the DB; otherwise you will not hit the exception even without the fix. It's pretty hard to make this happen: I used the debugger and changed the vdsm code that wipes volumes to sleep for a few minutes only when it wipes the source image. Also, in order to make sure the job is cleaned up quickly after LiveMigrateDiskCommand ends, I changed the config values "JobCleanupRateInMinutes" and "SucceededJobCleanupTimeInMinutes" to 1 (I guess changing only one of them would be enough; I didn't check).
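To make the race described above concrete, here is a simplified Python analogue (not engine code; all names are hypothetical stand-ins): the job-cleanup task deletes the job row while RemoveImage is still wiping, so when endAction finally runs it looks up a job that no longer exists, which in the engine surfaces as the NPE in ExecutionHandler.endJob():

```python
import threading
import time

# Hypothetical in-memory stand-ins for the engine's "job" and "step" tables.
jobs = {"8381b310": {"status": "STARTED"}}
jobs_lock = threading.Lock()

def cleanup_finished_jobs():
    """Analogue of the periodic job cleanup (JobCleanupRateInMinutes):
    removes the job row while RemoveImage is still wiping the image."""
    with jobs_lock:
        jobs.pop("8381b310", None)

def remove_image_end_action():
    """Analogue of RemoveImage's endAction: tries to finalize against a
    job that was already cleaned up, like ExecutionHandler.endJob()
    dereferencing a job that is no longer in the DB."""
    time.sleep(0.1)  # the long postZero wipe delays endAction past cleanup
    with jobs_lock:
        job = jobs.get("8381b310")  # None: the cleanup thread won the race
    # Without a None check this is the Python analogue of the engine's NPE.
    return job["status"]  # raises TypeError: 'NoneType' is not subscriptable

cleanup = threading.Thread(target=cleanup_finished_jobs)
cleanup.start()
cleanup.join()
try:
    remove_image_end_action()
except TypeError as exc:
    print(f"endAction failed: {exc}")
```

On RHV/oVirt, the two config values mentioned above can typically be changed with the engine-config tool (e.g. `engine-config -s JobCleanupRateInMinutes=1` followed by an engine restart), though the exact invocation for your version should be checked.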
Tested multiple times using:
ovirt-engine-4.2.2.5-0.1.el7.noarch

No exception found. Moving to VERIFIED.
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th, 2018. Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.