Bug 1651874 - Live merge failed with NPE on endAction of DestroyImage
Summary: Live merge failed with NPE on endAction of DestroyImage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.7
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: ovirt-4.3.5
: 4.3.5
Assignee: Dana
QA Contact: Avihai
URL:
Whiteboard:
: 1694888 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-11-21 05:50 UTC by Germano Veit Michel
Modified: 2019-08-28 13:17 UTC (History)
12 users (show)

Fixed In Version: ovirt-engine-4.3.5.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-12 11:53:27 UTC
oVirt Team: Infra
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3380701 Troubleshoot None RHV: Retrying a failed snapshot deletion fails with "General command validation failure" 2019-06-05 09:13:48 UTC
Red Hat Knowledge Base (Solution) 3954081 None None Live Merge fails at the very last stage with a Null Pointer Exception 2019-03-01 17:30:22 UTC
Red Hat Product Errata RHEA-2019:2431 None None None 2019-08-12 11:53:39 UTC
oVirt gerrit 100677 master MERGED engine: add json mixin to ignore isLiveMerge() 2019-06-12 13:39:11 UTC
oVirt gerrit 100786 ovirt-engine-4.3 MERGED engine: add json mixin to ignore isLiveMerge() 2019-06-13 08:23:49 UTC

Description Germano Veit Michel 2018-11-21 05:50:24 UTC
Description of problem:

The VM has 3 disks, snapshot deletion triggered 3 merges, all 3 disks merged fine and got to the DESTROY_IMAGE step. On this step, 2 of the disks succeeded and 1 failed. 

The failure on the engine is as follows:

2018-11-20 03:35:30,770+11 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-9) [432f4ed3-bc9c-49ef-9021-d743a3ebebc9] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='ba7fd744-6f1b-4598-9b4d-a93ef907ea3e', ignoreFailoverLimit='false', storageDomainId='bce96ddf-4372-4169-8cf2-014f75d718b2', imageGroupId='9ec4328d-cea3-4616-bc81-d464637a2898', imageId='00000000-0000-0000-0000-000000000000', imageList='[2d4223f2-1e19-4679-80f8-c6fecef9f4e6]', postZero='true', force='false'}), log id: 2f7366d5

2018-11-20 03:35:38,983+11 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-2263) [432f4ed3-bc9c-49ef-9021-d743a3ebebc9] [within thread]: endAction for action type DestroyImage threw an exception.: org.apache.commons.lang.SerializationException: org.codehaus.jackson.map.JsonMappingException: (was java.lang.NullPointerException) (through reference chain: org.ovirt.engine.core.common.action.DestroyImageParameters["liveMerge"])
        at org.ovirt.engine.core.utils.serialization.json.JsonObjectSerializer.writeJsonAsString(JsonObjectSerializer.java:81) [utils.jar:]
        at org.ovirt.engine.core.utils.serialization.json.JsonObjectSerializer.serialize(JsonObjectSerializer.java:64) [utils.jar:]
        at org.ovirt.engine.core.dao.CommandEntityDaoImpl.serializeParameters(CommandEntityDaoImpl.java:112) [dal.jar:]
        at org.ovirt.engine.core.dao.CommandEntityDaoImpl.createFullParametersMapper(CommandEntityDaoImpl.java:96) [dal.jar:]
        at org.ovirt.engine.core.dao.CommandEntityDaoImpl.saveOrUpdate(CommandEntityDaoImpl.java:147) [dal.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandsCacheImpl.saveOrUpdateWithoutTransaction(CommandsCacheImpl.java:112) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandsCacheImpl.updateCommandStatus(CommandsCacheImpl.java:87) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandsRepository.updateCommandStatus(CommandsRepository.java:158) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.updateCommandStatus(CommandCoordinatorImpl.java:160) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorUtil.updateCommandStatus(CommandCoordinatorUtil.java:341) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.setCommandStatus(CommandBase.java:2294) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.setCommandStatus(CommandBase.java:2286) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.handleCommandExecutionEnded(CommandBase.java:623) [bll.jar:]

The deleteVolume worked fine on VDSM, it activated the LV, zeroed it, teardown, removed the lv and returned.

2018-11-20 03:35:30,773+1100 INFO  (jsonrpc/3) [vdsm.api] START deleteVolume(sdUUID=u'bce96ddf-4372-4169-8cf2-014f75d718b2', spUUID=u'ba7fd744-6f1b-4598-9b4d-a93ef907ea3e', imgUUID=u'9ec4328d-cea3-4616-bc81-d464637a2898', volumes=[u'2d4223f2-1e19-4679-80f8-c6fecef9f4e6'], postZero=u'true', force=u'false', discard=False) from=::ffff:192.168.18.179,59360, flow_id=432f4ed3-bc9c-49ef-9021-d743a3ebebc9, task_id=556cb20b-5ba8-41db-b709-cdf759e6a4a8 (api:46)

I don't see anything wrong with the json returned to the engine:

2018-11-20 03:35:38,975+1100 INFO  (jsonrpc/1) [vdsm.api] FINISH getAllTasksStatuses return={'allTasksStatus': {'556cb20b-5ba8-41db-b709-cdf759e6a4a8': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '556cb20b-5ba8-41db-b709-cdf759e6a4a8'}}} from=::ffff:192.168.18.179,50904, task_id=cbe6ea78-2acf-49b6-935d-1ad7cc92d422 (api:52)


Version-Release number of selected component (if applicable):
rhvm-4.2.7.4-0.1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64

How reproducible:
Unknown

Comment 4 Sandro Bonazzola 2019-01-28 09:44:29 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 14 Tal Nisan 2019-05-28 09:50:16 UTC
Martin, we cannot seem be able to reproduce but it appears the NPE occurs while serializing to JSON, can someone from Infra please have a look?

Comment 15 Martin Perina 2019-05-28 10:36:14 UTC
From the stacktrace it seems that either DestroyImageParameters [1] or StorageDomainParametersBase [2] classes are not serializable to Json, my guess is that the problem is Phase enum in StorageDomainParametersBase. If so, then we need to write some MixIn class similar to those we already have in [3]. Ravi, could you please take a look and suggest the fix?



[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/action/DestroyImageParameters.java
[2] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/action/StorageDomainParametersBase.java
[3] https://github.com/oVirt/ovirt-engine/tree/master/backend/manager/modules/utils/src/main/java/org/ovirt/engine/core/utils/serialization/json

Comment 16 Benny Zlotnik 2019-06-05 09:13:29 UTC
Just a note: this bug happens at around the same time every time 03:35:35 which is the time the CommandEntityCleanupManager runs, you can see it's running in 4.1 (Gordon's logs) one but not in >4.2, probably because of the migration from quartz to jboss scheduled threads.

While CommandEntityCleanupManager should not cleanup commands not older than 30 days, perhaps there is some issue with the reload of the commands entity cache?

Comment 17 Eyal Shenitzky 2019-06-05 09:13:49 UTC
*** Bug 1694888 has been marked as a duplicate of this bug. ***

Comment 18 Eyal Shenitzky 2019-06-05 09:16:26 UTC
see also bug 1694888 for newer logs.

Comment 20 Avihai 2019-06-17 11:36:21 UTC
Hi,

We/QE needs a clear reproduction scenario so we can qa_ack/Nack(if we do not have HW/capacity) this bug and later verify.

If this is something that is pure code, as it seems its serialization issue with JSON and the user/QE cannot reproduce it with a user flow you have the green light to verify this as DEV.

Comment 21 Martin Perina 2019-06-17 13:00:55 UTC
(In reply to Avihai from comment #20)
> Hi,
> 
> We/QE needs a clear reproduction scenario so we can qa_ack/Nack(if we do not
> have HW/capacity) this bug and later verify.
> 
> If this is something that is pure code, as it seems its serialization issue
> with JSON and the user/QE cannot reproduce it with a user flow you have the
> green light to verify this as DEV.

We don't have a clear reproducing steps, we have fixed the code according to the provided stacktrace. According to the code we know this issue can happen when you execute DestroyImageCommand and engine saves parameters of this command as JSON object into database.

Comment 22 Benny Zlotnik 2019-06-17 13:17:20 UTC
(In reply to Martin Perina from comment #21)
> (In reply to Avihai from comment #20)
> > Hi,
> > 
> > We/QE needs a clear reproduction scenario so we can qa_ack/Nack(if we do not
> > have HW/capacity) this bug and later verify.
> > 
> > If this is something that is pure code, as it seems its serialization issue
> > with JSON and the user/QE cannot reproduce it with a user flow you have the
> > green light to verify this as DEV.
> 
> We don't have a clear reproducing steps, we have fixed the code according to
> the provided stacktrace. According to the code we know this issue can happen
> when you execute DestroyImageCommand and engine saves parameters of this
> command as JSON object into database.

I was able to reproduce the issue (with breakpoints) after Moti asked me to verify the patch:
The issue is here being that CommandEntityCleanupManager forces a refresh of the CommandsCache which retrieves the commands from the command_entities table.
Since the DestroyImageParameters is persisted without the parent parameters, this refresh causes them to be reloaded without it and causes the NPE later on,
this also explains why this issue only happens when CommandEntityCleanupManager runs as it's the only things the forces a refresh.
To reproduce one could set CommandEntityCleanupTime to some other time and have multiple concurrent live merge commands run at the same time

Comment 23 Avihai 2019-06-17 13:56:50 UTC
> I was able to reproduce the issue (with breakpoints) after Moti asked me to
> verify the patch:
> The issue is here being that CommandEntityCleanupManager forces a refresh of
> the CommandsCache which retrieves the commands from the command_entities
> table.
> Since the DestroyImageParameters is persisted without the parent parameters,
> this refresh causes them to be reloaded without it and causes the NPE later
> on,
> this also explains why this issue only happens when
> CommandEntityCleanupManager runs as it's the only things the forces a
> refresh.
> To reproduce one could set CommandEntityCleanupTime to some other time and
> have multiple concurrent live merge commands run at the same time

So as there is no way to reproduce this without using break-points in the engine, right?
If so, as you already verified this fix, can you please verify this bug?

Comment 25 errata-xmlrpc 2019-08-12 11:53:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2431

Comment 26 Daniel Gur 2019-08-28 13:13:20 UTC
sync2jira

Comment 27 Daniel Gur 2019-08-28 13:17:32 UTC
sync2jira


Note You need to log in before you can comment on or make changes to this bug.