Bug 1534241 - RemoveImage, with postZero='true', initiated as part of live storage migration, sometimes fails with NullPointerException
Summary: RemoveImage, with postZero='true', initiated as part of live storage migratio...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.1
Hardware: x86_64
OS: Unspecified
high
high
Target Milestone: ovirt-4.2.2
: ---
Assignee: Idan Shaby
QA Contact: Yosi Ben Shimon
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-01-14 13:00 UTC by Elad
Modified: 2018-03-29 11:08 UTC (History)
6 users (show)

Fixed In Version: ovirt-engine-4.2.2.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-29 11:08:05 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2+
ylavi: blocker+


Attachments (Terms of Use)
logs (9.81 MB, application/x-gzip)
2018-01-14 13:00 UTC, Elad
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 88797 0 'None' 'MERGED' 'core: delete LiveStorageMigrationHelper' 2019-11-23 14:18:18 UTC
oVirt gerrit 88798 0 'None' 'MERGED' 'core: fix LiveMigrateDiskCommand race' 2019-11-23 14:18:18 UTC
oVirt gerrit 88890 0 'None' 'MERGED' 'core: delete LiveStorageMigrationHelper' 2019-11-23 14:18:18 UTC
oVirt gerrit 88891 0 'None' 'MERGED' 'core: fix LiveMigrateDiskCommand race' 2019-11-23 14:18:18 UTC

Description Elad 2018-01-14 13:00:06 UTC
Created attachment 1380998 [details]
logs

Description of problem:
Disk removal, with postZeros='true', initiated as part of live storage migration, fails on with NullPointerException

Version-Release number of selected component (if applicable):
rhvm-4.2.1-0.2.el7.noarch
vdsm-4.20.11-1.el7ev.x86_64
libvirt-3.9.0-6.el7.x86_64
qemu-kvm-rhev-2.10.0-15.el7.x86_64


How reproducible:
Not always

Steps to Reproduce:
1. Create a 60G disk on block domain with wipe after delete as true
2. Remove the disk


Actual results:

2018-01-14 14:22:56,857+02 ERROR [org.ovirt.engine.core.bll.job.JobRepositoryImpl] (EE-ManagedThreadFactory-engine-Thread-218345) [35610449] Failed to save step 'f6bef7d3-6902-4840-a270-972b26e61898', 'FINALIZIN
G': CallableStatementCallback; SQL [{call insertstep(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table "step" violates foreign key constraint "fk_step_job"
  Detail: Key (job_id)=(8381b310-d76a-4ad8-a5ec-cb55105a2054) is not present in table "job".
  Where: SQL statement "INSERT INTO step (
        step_id,
        parent_step_id,
        job_id,
        step_type,
        description,
.
.
.

2018-01-14 14:22:56,869+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-218345) [35610449] [within thread]: endAction for action type RemoveImage threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.job.ExecutionHandler.endJob(ExecutionHandler.java:520) [bll.jar:]
        at org.ovirt.engine.core.bll.job.ExecutionHandler.endTaskJobIfNeeded(ExecutionHandler.java:851) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endStepsAndJobIfNeeded(CommandBase.java:606) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:579) [bll.jar:]



Eventually, the disk is being removed from the storage.
It might be related to the zeroing process being done as part of the disk removal, it took ~20 minutes:


2018-01-14 14:07:37,367+0200 INFO  (jsonrpc/4) [vdsm.api] START deleteImage(sdUUID='ab4e8afa-669d-4d6d-9fc5-65da708a96ce', spUUID='cd0dcf40-4bda-4c0a-9c54-4508fe8d1e49', imgUUID='e0528294-511e-4aaa-a615-0bce3b699871', postZero='true', force='false', discard=True) from=::ffff:10.35.161.182,36072, flow_id=35610449, task_id=11483837-602e-4a6c-ac57-624076478fd2 (api:46)
2018-01-14 14:07:37,370+0200 DEBUG (jsonrpc/4) [storage.ResourceManager] Trying to register resource '00_storage.e0528294-511e-4aaa-a615-0bce3b699871_ab4e8afa-669d-4d6d-9fc5-65da708a96ce' for lock type 'exclusive' (resourceManager:495)



2018-01-14 14:27:45,877+0200 DEBUG (tasks/9) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/blockSD/25f1eed9-e7ec-4453-8c8e-f3a461e85d63/images/e0528294-511e-4aaa-a615-0bce3b699871 (blo
ckSD:698)



Expected results:
Disk deletion should succeed

Additional info:

Disk creation:

2018-01-14 13:58:49,883+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (default task-29) [disks_create_0e17ddae-60f8-4b93] START, CreateImageVDSCommand( CreateImageVDSCommandParameters:{storagePoolId='cd0dcf40-4bda-4c0a-9c54-4508fe8d1e49', ignoreFailoverLimit='false', storageDomainId='ab4e8afa-669d-4d6d-9fc5-65da708a96ce', imageGroupId='e0528294-511e-4aaa-a615-0bce3b699871', imageSizeInBytes='64424509440', volumeFormat='RAW', newImageId='3233b08d-c0ba-4a73-8721-a07d1e00dc40', imageType='Preallocated', newImageDescription='{"DiskAlias":"disk_type_1413584940","DiskDescription":""}', imageInitialSizeInBytes='0'}), log id: 5881d51d


Disk deletion:

host3-vdsm.log:

2018-01-14 14:07:37,239+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.RemoveImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [35610449] Running command: RemoveImageCommand internal: true. Entities affected :  ID: ab4e8afa-669d-4d6d-9fc5-65da708a96ce Type: Storage
2018-01-14 14:07:37,360+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [35610449] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='cd0dcf40-4bda-4c0a-9c54-4508fe8d1e49', ignoreFailoverLimit='false', storageDomainId='ab4e8afa-669d-4d6d-9fc5-65da708a96ce', imageGroupId='e0528294-511e-4aaa-a615-0bce3b699871', postZeros='true', discard='true', forceDelete='false'}), log id: 5b09e964

Comment 1 Elad 2018-01-14 14:44:20 UTC
Steps to Reproduce:
1. Create a VM with an attached 60G disk, resides on block domain with wipe after delete as true
2. start the VM and move the disk to a different domain

Comment 2 Yaniv Kaul 2018-01-15 07:06:00 UTC
I assume it's a regression?

Comment 3 Elad 2018-01-15 07:55:54 UTC
It's a bit hard to say. Since we've not seen it in previous versions (checked in latest 4.1 automation execution we've done), then, theoretically, yes. But, this test execution was modified (to pytest parameterize) which might be the reason for the bug to be exposed only now.

Comment 4 Idan Shaby 2018-03-13 13:59:33 UTC
To reproduce, you should have a situation where the source image removal takes a lot of time (you can use WAD), enough time so that the LiveMigrateDiskCommand makes it and finishes before the RemoveImage of the source image ends.
Also, you should make sure that the endAction of the RemoveImage command (from the engine log: "Attempting to endAction 'RemoveImage'") occurs only after the job is removed from the DB (unless you will not have the exception even without the fix).

It's pretty hard to make it happen, I used the debugger and changed vdsm code that wipes volumes to sleep for a few minutes only when it wipes the source image. Also, in order to make sure that the job is cleaned quickly after LiveMigrateDiskCommand ends, I changed the config values "JobCleanupRateInMinutes" and "SucceededJobCleanupTimeInMinutes" to 1 (I guess that I could change only one of them, didn't check).

Comment 5 Yosi Ben Shimon 2018-03-26 10:47:43 UTC
Tested multiple times using:
ovirt-engine-4.2.2.5-0.1.el7.noarch

No exception found.

Moving to VERIFIED.

Comment 6 Sandro Bonazzola 2018-03-29 11:08:05 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.