Description of problem: The VM has 3 disks, snapshot deletion triggered 3 merges, all 3 disks merged fine and got to the DESTROY_IMAGE step. On this step, 2 of the disks succeeded and 1 failed. The failure on the engine is as follows: 2018-11-20 03:35:30,770+11 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-9) [432f4ed3-bc9c-49ef-9021-d743a3ebebc9] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='ba7fd744-6f1b-4598-9b4d-a93ef907ea3e', ignoreFailoverLimit='false', storageDomainId='bce96ddf-4372-4169-8cf2-014f75d718b2', imageGroupId='9ec4328d-cea3-4616-bc81-d464637a2898', imageId='00000000-0000-0000-0000-000000000000', imageList='[2d4223f2-1e19-4679-80f8-c6fecef9f4e6]', postZero='true', force='false'}), log id: 2f7366d5 2018-11-20 03:35:38,983+11 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-2263) [432f4ed3-bc9c-49ef-9021-d743a3ebebc9] [within thread]: endAction for action type DestroyImage threw an exception.: org.apache.commons.lang.SerializationException: org.codehaus.jackson.map.JsonMappingException: (was java.lang.NullPointerException) (through reference chain: org.ovirt.engine.core.common.action.DestroyImageParameters["liveMerge"]) at org.ovirt.engine.core.utils.serialization.json.JsonObjectSerializer.writeJsonAsString(JsonObjectSerializer.java:81) [utils.jar:] at org.ovirt.engine.core.utils.serialization.json.JsonObjectSerializer.serialize(JsonObjectSerializer.java:64) [utils.jar:] at org.ovirt.engine.core.dao.CommandEntityDaoImpl.serializeParameters(CommandEntityDaoImpl.java:112) [dal.jar:] at org.ovirt.engine.core.dao.CommandEntityDaoImpl.createFullParametersMapper(CommandEntityDaoImpl.java:96) [dal.jar:] at org.ovirt.engine.core.dao.CommandEntityDaoImpl.saveOrUpdate(CommandEntityDaoImpl.java:147) [dal.jar:] at org.ovirt.engine.core.bll.tasks.CommandsCacheImpl.saveOrUpdateWithoutTransaction(CommandsCacheImpl.java:112) [bll.jar:] at org.ovirt.engine.core.bll.tasks.CommandsCacheImpl.updateCommandStatus(CommandsCacheImpl.java:87) [bll.jar:] at org.ovirt.engine.core.bll.tasks.CommandsRepository.updateCommandStatus(CommandsRepository.java:158) [bll.jar:] at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.updateCommandStatus(CommandCoordinatorImpl.java:160) [bll.jar:] at org.ovirt.engine.core.bll.tasks.CommandCoordinatorUtil.updateCommandStatus(CommandCoordinatorUtil.java:341) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.setCommandStatus(CommandBase.java:2294) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.setCommandStatus(CommandBase.java:2286) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.handleCommandExecutionEnded(CommandBase.java:623) [bll.jar:] The deleteVolume worked fine on VDSM, it activated the LV, zeroed it, teardown, removed the lv and returned. 2018-11-20 03:35:30,773+1100 INFO (jsonrpc/3) [vdsm.api] START deleteVolume(sdUUID=u'bce96ddf-4372-4169-8cf2-014f75d718b2', spUUID=u'ba7fd744-6f1b-4598-9b4d-a93ef907ea3e', imgUUID=u'9ec4328d-cea3-4616-bc81-d464637a2898', volumes=[u'2d4223f2-1e19-4679-80f8-c6fecef9f4e6'], postZero=u'true', force=u'false', discard=False) from=::ffff:192.168.18.179,59360, flow_id=432f4ed3-bc9c-49ef-9021-d743a3ebebc9, task_id=556cb20b-5ba8-41db-b709-cdf759e6a4a8 (api:46) I don't see anything wrong with the json returned to the engine: 2018-11-20 03:35:38,975+1100 INFO (jsonrpc/1) [vdsm.api] FINISH getAllTasksStatuses return={'allTasksStatus': {'556cb20b-5ba8-41db-b709-cdf759e6a4a8': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '556cb20b-5ba8-41db-b709-cdf759e6a4a8'}}} from=::ffff:192.168.18.179,50904, task_id=cbe6ea78-2acf-49b6-935d-1ad7cc92d422 (api:52) Version-Release number of selected component (if applicable): rhvm-4.2.7.4-0.1.el7ev.noarch vdsm-4.20.43-1.el7ev.x86_64 How reproducible: Unknown
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Martin, we cannot seem be able to reproduce but it appears the NPE occurs while serializing to JSON, can someone from Infra please have a look?
From the stacktrace it seems that either DestroyImageParameters [1] or StorageDomainParametersBase [2] classes are not serializable to Json, my guess is that the problem is Phase enum in StorageDomainParametersBase. If so, then we need to write some MixIn class similar to those we already have in [3]. Ravi, could you please take a look and suggest the fix? [1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/action/DestroyImageParameters.java [2] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/action/StorageDomainParametersBase.java [3] https://github.com/oVirt/ovirt-engine/tree/master/backend/manager/modules/utils/src/main/java/org/ovirt/engine/core/utils/serialization/json
Just a note: this bug happens at around the same time every time 03:35:35 which is the time the CommandEntityCleanupManager runs, you can see it's running in 4.1 (Gordon's logs) one but not in >4.2, probably because of the migration from quartz to jboss scheduled threads. While CommandEntityCleanupManager should not cleanup commands not older than 30 days, perhaps there is some issue with the reload of the commands entity cache?
*** Bug 1694888 has been marked as a duplicate of this bug. ***
see also bug 1694888 for newer logs.
Hi, We/QE needs a clear reproduction scenario so we can qa_ack/Nack(if we do not have HW/capacity) this bug and later verify. If this is something that is pure code, as it seems its serialization issue with JSON and the user/QE cannot reproduce it with a user flow you have the green light to verify this as DEV.
(In reply to Avihai from comment #20) > Hi, > > We/QE needs a clear reproduction scenario so we can qa_ack/Nack(if we do not > have HW/capacity) this bug and later verify. > > If this is something that is pure code, as it seems its serialization issue > with JSON and the user/QE cannot reproduce it with a user flow you have the > green light to verify this as DEV. We don't have a clear reproducing steps, we have fixed the code according to the provided stacktrace. According to the code we know this issue can happen when you execute DestroyImageCommand and engine saves parameters of this command as JSON object into database.
(In reply to Martin Perina from comment #21) > (In reply to Avihai from comment #20) > > Hi, > > > > We/QE needs a clear reproduction scenario so we can qa_ack/Nack(if we do not > > have HW/capacity) this bug and later verify. > > > > If this is something that is pure code, as it seems its serialization issue > > with JSON and the user/QE cannot reproduce it with a user flow you have the > > green light to verify this as DEV. > > We don't have a clear reproducing steps, we have fixed the code according to > the provided stacktrace. According to the code we know this issue can happen > when you execute DestroyImageCommand and engine saves parameters of this > command as JSON object into database. I was able to reproduce the issue (with breakpoints) after Moti asked me to verify the patch: The issue is here being that CommandEntityCleanupManager forces a refresh of the CommandsCache which retrieves the commands from the command_entities table. Since the DestroyImageParameters is persisted without the parent parameters, this refresh causes them to be reloaded without it and causes the NPE later on, this also explains why this issue only happens when CommandEntityCleanupManager runs as it's the only things the forces a refresh. To reproduce one could set CommandEntityCleanupTime to some other time and have multiple concurrent live merge commands run at the same time
> I was able to reproduce the issue (with breakpoints) after Moti asked me to > verify the patch: > The issue is here being that CommandEntityCleanupManager forces a refresh of > the CommandsCache which retrieves the commands from the command_entities > table. > Since the DestroyImageParameters is persisted without the parent parameters, > this refresh causes them to be reloaded without it and causes the NPE later > on, > this also explains why this issue only happens when > CommandEntityCleanupManager runs as it's the only things the forces a > refresh. > To reproduce one could set CommandEntityCleanupTime to some other time and > have multiple concurrent live merge commands run at the same time So as there is no way to reproduce this without using break-points in the engine, right? If so, as you already verified this fix, can you please verify this bug?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:2431
sync2jira