Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1534241

Summary:

RemoveImage, with postZero='true', initiated as part of live storage migration, sometimes fails with NullPointerException

Product:

[oVirt] ovirt-engine

Reporter:

Elad <ebenahar>

Component:

BLL.Storage

Assignee:

Idan Shaby <ishaby>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Yosi Ben Shimon <ybenshim>

Severity:

high

Docs Contact:

Priority:

high

Version:

4.2.1

CC:

amureini, bugs, ebenahar, lveyde, tnisan, ylavi

Target Milestone:

ovirt-4.2.2

Keywords:

Automation

Target Release:

---

Flags:

rule-engine: ovirt-4.2+
ylavi: blocker+

Hardware:

x86_64

OS:

Unspecified

Whiteboard:

Fixed In Version:

ovirt-engine-4.2.2.4

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-03-29 11:08:05 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Storage

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
logs	none

Description Elad 2018-01-14 13:00:06 UTC

Created attachment 1380998 [details]
logs

Description of problem:
Disk removal, with postZeros='true', initiated as part of live storage migration, fails on with NullPointerException

Version-Release number of selected component (if applicable):
rhvm-4.2.1-0.2.el7.noarch
vdsm-4.20.11-1.el7ev.x86_64
libvirt-3.9.0-6.el7.x86_64
qemu-kvm-rhev-2.10.0-15.el7.x86_64


How reproducible:
Not always

Steps to Reproduce:
1. Create a 60G disk on block domain with wipe after delete as true
2. Remove the disk


Actual results:

2018-01-14 14:22:56,857+02 ERROR [org.ovirt.engine.core.bll.job.JobRepositoryImpl] (EE-ManagedThreadFactory-engine-Thread-218345) [35610449] Failed to save step 'f6bef7d3-6902-4840-a270-972b26e61898', 'FINALIZIN
G': CallableStatementCallback; SQL [{call insertstep(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table "step" violates foreign key constraint "fk_step_job"
  Detail: Key (job_id)=(8381b310-d76a-4ad8-a5ec-cb55105a2054) is not present in table "job".
  Where: SQL statement "INSERT INTO step (
        step_id,
        parent_step_id,
        job_id,
        step_type,
        description,
.
.
.

2018-01-14 14:22:56,869+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-218345) [35610449] [within thread]: endAction for action type RemoveImage threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.job.ExecutionHandler.endJob(ExecutionHandler.java:520) [bll.jar:]
        at org.ovirt.engine.core.bll.job.ExecutionHandler.endTaskJobIfNeeded(ExecutionHandler.java:851) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endStepsAndJobIfNeeded(CommandBase.java:606) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:579) [bll.jar:]



Eventually, the disk is being removed from the storage.
It might be related to the zeroing process being done as part of the disk removal, it took ~20 minutes:


2018-01-14 14:07:37,367+0200 INFO  (jsonrpc/4) [vdsm.api] START deleteImage(sdUUID='ab4e8afa-669d-4d6d-9fc5-65da708a96ce', spUUID='cd0dcf40-4bda-4c0a-9c54-4508fe8d1e49', imgUUID='e0528294-511e-4aaa-a615-0bce3b699871', postZero='true', force='false', discard=True) from=::ffff:10.35.161.182,36072, flow_id=35610449, task_id=11483837-602e-4a6c-ac57-624076478fd2 (api:46)
2018-01-14 14:07:37,370+0200 DEBUG (jsonrpc/4) [storage.ResourceManager] Trying to register resource '00_storage.e0528294-511e-4aaa-a615-0bce3b699871_ab4e8afa-669d-4d6d-9fc5-65da708a96ce' for lock type 'exclusive' (resourceManager:495)



2018-01-14 14:27:45,877+0200 DEBUG (tasks/9) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/blockSD/25f1eed9-e7ec-4453-8c8e-f3a461e85d63/images/e0528294-511e-4aaa-a615-0bce3b699871 (blo
ckSD:698)



Expected results:
Disk deletion should succeed

Additional info:

Disk creation:

2018-01-14 13:58:49,883+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (default task-29) [disks_create_0e17ddae-60f8-4b93] START, CreateImageVDSCommand( CreateImageVDSCommandParameters:{storagePoolId='cd0dcf40-4bda-4c0a-9c54-4508fe8d1e49', ignoreFailoverLimit='false', storageDomainId='ab4e8afa-669d-4d6d-9fc5-65da708a96ce', imageGroupId='e0528294-511e-4aaa-a615-0bce3b699871', imageSizeInBytes='64424509440', volumeFormat='RAW', newImageId='3233b08d-c0ba-4a73-8721-a07d1e00dc40', imageType='Preallocated', newImageDescription='{"DiskAlias":"disk_type_1413584940","DiskDescription":""}', imageInitialSizeInBytes='0'}), log id: 5881d51d


Disk deletion:

host3-vdsm.log:

2018-01-14 14:07:37,239+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.RemoveImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [35610449] Running command: RemoveImageCommand internal: true. Entities affected :  ID: ab4e8afa-669d-4d6d-9fc5-65da708a96ce Type: Storage
2018-01-14 14:07:37,360+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [35610449] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='cd0dcf40-4bda-4c0a-9c54-4508fe8d1e49', ignoreFailoverLimit='false', storageDomainId='ab4e8afa-669d-4d6d-9fc5-65da708a96ce', imageGroupId='e0528294-511e-4aaa-a615-0bce3b699871', postZeros='true', discard='true', forceDelete='false'}), log id: 5b09e964

Comment 1 Elad 2018-01-14 14:44:20 UTC

Steps to Reproduce:
1. Create a VM with an attached 60G disk, resides on block domain with wipe after delete as true
2. start the VM and move the disk to a different domain

Comment 2 Yaniv Kaul 2018-01-15 07:06:00 UTC

I assume it's a regression?

Comment 3 Elad 2018-01-15 07:55:54 UTC

It's a bit hard to say. Since we've not seen it in previous versions (checked in latest 4.1 automation execution we've done), then, theoretically, yes. But, this test execution was modified (to pytest parameterize) which might be the reason for the bug to be exposed only now.

Comment 4 Idan Shaby 2018-03-13 13:59:33 UTC

To reproduce, you should have a situation where the source image removal takes a lot of time (you can use WAD), enough time so that the LiveMigrateDiskCommand makes it and finishes before the RemoveImage of the source image ends.
Also, you should make sure that the endAction of the RemoveImage command (from the engine log: "Attempting to endAction 'RemoveImage'") occurs only after the job is removed from the DB (unless you will not have the exception even without the fix).

It's pretty hard to make it happen, I used the debugger and changed vdsm code that wipes volumes to sleep for a few minutes only when it wipes the source image. Also, in order to make sure that the job is cleaned quickly after LiveMigrateDiskCommand ends, I changed the config values "JobCleanupRateInMinutes" and "SucceededJobCleanupTimeInMinutes" to 1 (I guess that I could change only one of them, didn't check).

Comment 5 Yosi Ben Shimon 2018-03-26 10:47:43 UTC

Tested multiple times using:
ovirt-engine-4.2.2.5-0.1.el7.noarch

No exception found.

Moving to VERIFIED.

Comment 6 Sandro Bonazzola 2018-03-29 11:08:05 UTC

This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.