Bug 1509629 - Cold merge failed to remove all volumes
Summary: Cold merge failed to remove all volumes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ovirt-4.1.9
: ---
Assignee: Ala Hino
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks: 1518562 1532567
TreeView+ depends on / blocked
 
Reported: 2017-11-05 09:42 UTC by Raz Tamir
Modified: 2018-01-24 10:39 UTC (History)
7 users (show)

Fixed In Version: ovirt-engine-4.1.9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1532567 (view as bug list)
Environment:
Last Closed: 2018-01-24 10:39:58 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments (Terms of Use)
engine and vdsm logs (1.96 MB, application/x-gzip)
2017-11-05 09:42 UTC, Raz Tamir
no flags Details
engine log (683.71 KB, application/x-gzip)
2017-11-23 10:13 UTC, Raz Tamir
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 83612 0 master MERGED core: Add complete step to cold merge flow 2021-01-20 01:56:37 UTC
oVirt gerrit 84821 0 master MERGED core: Add end procedure when running DestroyImageCommand 2021-01-20 01:57:18 UTC
oVirt gerrit 85227 0 ovirt-engine-4.1 MERGED core: Add end procedure when running DestroyImageCommand 2021-01-20 01:56:37 UTC

Description Raz Tamir 2017-11-05 09:42:49 UTC
Created attachment 1348061 [details]
engine and vdsm logs

Description of problem:

In our automation (tier 1) we see that a clod merge operation is failing to remove all volumes.

2017-11-02 19:40:00,671+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-6690) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'DestroyImage',
2017-11-02 19:40:00,672+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-6690) [] [within thread]: endAction for action type DestroyImage threw an exception.: java.lang.NullPointerExceptio
n
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:351) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:347) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:160) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.lambda$endActionIfNecessary$0(CommandAsyncTask.java:112) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:96) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_144]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_144]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent-1.0.jar:]
        at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)

2017-11-02 19:40:00,672+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-6690) [] CommandAsyncTask::HandleEndActionResult: endAction for action type 'DestroyImage' threw an unrecoverable R
untimeException the task will be cleared.

As a result, not all volumes get removed


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171101152059.git8b85834.el7.centos.noarch


How reproducible:
Fails on all storage types

Steps to Reproduce:
1. Create VM with OS disk + 4 disks
2. Create files on the VM's disks and create snapshot
3. cold merge the snapshot

Actual results:
Cold merge failed - before cold merge: 20 volumes, after cold merge: 18 volumes, snapshot contains 5 disks

Expected results:
Cold merge succeeds - before cold merge: 20 volumes, after cold merge: 15 volumes, snapshot contains 5 disks

Additional info:
* No exception in vdsm side
* Regression because this case passed in previous commits

Comment 1 Ala Hino 2017-11-05 13:09:09 UTC
I was able to reproduce the issue; actually, I see the same exception in the engine log, however, the snapshot is removed.

Can you please confirm whether the snapshot removed in your env?

Comment 2 Ala Hino 2017-11-05 13:14:23 UTC
Based on the following message:

Snapshot 'snapshot_18894_fcp_1' deletion for VM 'vm_TestCase18894_0219280637' has been completed

Looks that we are hitting the same issue - the snapshot is removed but there is the NPE. I will look into this.

Comment 3 Raz Tamir 2017-11-05 13:15:26 UTC
Yes, the snapshot removed

Comment 4 Ala Hino 2017-11-05 13:19:43 UTC
(In reply to Raz Tamir from comment #3)
> Yes, the snapshot removed

In the description you mentioned "we see that a clod merge operation is failing to remove all volumes", what do you mean by that? Do the volumes still exist after cold merge?

Comment 5 Raz Tamir 2017-11-05 13:24:55 UTC
Yes,

This is the output of our tests:
Cold merge failed - before cold merge: 20 volumes, after cold merge: 18 volumes, snapshot contains 5 disks

It checks lvs before the remove snap and after

Comment 6 Raz Tamir 2017-11-23 10:13:06 UTC
I still see the same issue in engine.log:
2017-11-23 12:02:37,743+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-28260) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'DestroyImage',
2017-11-23 12:02:37,744+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-28260) [] [within thread]: endAction for action type DestroyImage threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:351) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:347) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:160) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.lambda$endActionIfNecessary$0(CommandAsyncTask.java:112) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:96) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_151]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent-1.0.jar:]
        at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)

2017-11-23 12:02:37,746+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-28260) [] CommandAsyncTask::HandleEndActionResult: endAction for action type 'DestroyImage' threw an unrecoverable RuntimeException the task will be cleared.
2017-11-23 12:02:37,748+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [] BaseAsyncTask::onTaskEndSuccess: Task 'f06cf00c-76d8-4329-8a8a-fcdec3008c56' (Parent Command 'DestroyImage', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully.
2017-11-23 12:02:37,749+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [] CommandAsyncTask::endActionIfNecessary: All tasks of command 'fac1cac6-8f5d-49ba-af34-3b3b8d13e151' has ended -> executing 'endAction'

Comment 7 Raz Tamir 2017-11-23 10:13:30 UTC
Created attachment 1358115 [details]
engine log

Comment 8 Ala Hino 2017-11-23 11:24:34 UTC
What's the engine version?

Comment 9 Raz Tamir 2017-11-23 11:27:06 UTC
ovirt-engine-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos

Comment 10 Ala Hino 2017-11-27 08:23:00 UTC
Raz,

Can you please set log level to DEBUG, try again and send the logs?
If possible to give access to the env after the test fails, that would be great.

Comment 11 Raz Tamir 2017-11-28 11:55:55 UTC
(In reply to Ala Hino from comment #10)
> Raz,
> 
> Can you please set log level to DEBUG, try again and send the logs?
The provided logs are with DEBUG level
> If possible to give access to the env after the test fails, that would be
> great.

Comment 12 Ala Hino 2017-11-28 12:01:50 UTC
(In reply to Raz Tamir from comment #11)
> (In reply to Ala Hino from comment #10)
> > Raz,
> > 
> > Can you please set log level to DEBUG, try again and send the logs?
> The provided logs are with DEBUG level
> > If possible to give access to the env after the test fails, that would be
> > great.

The hosts are with DEBUG logs but not the engine.
This is an engine side bug.

Comment 14 Ala Hino 2017-11-29 11:11:15 UTC
(In reply to Yaniv Kaul from comment #13)
> Ala - Engine log in DEBUG is available @
> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-
> x86_64/2493/artifact/exported-artifacts/basic-suite-master__logs/test_logs/
> basic-suite-master/post-007_sd_reattach.py/lago-basic-suite-master-engine/
> _var_log/ovirt-engine/engine.log
> 
> Please analyze.
> 
> (CI test failure)

Same error as found by automation.

Comment 15 Raz Tamir 2017-12-07 20:17:05 UTC
Passed automation run on rhvm-4.2.0-0.6.el7

Verified

Comment 16 Ala Hino 2017-12-10 10:15:16 UTC
This bug was verified on 4.2 but the issue encountered on 4.1.8 too.
Re-targeting this bug to 4.1.9.

Comment 17 Raz Tamir 2017-12-10 10:27:46 UTC
Relevant log

engine.log:

2017-12-08 07:08:07,117+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-48) [1ec95d85] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'DestroyImage',
2017-12-08 07:08:07,118+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-48) [1ec95d85] [within thread]: endAction for action type DestroyImage threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:335) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:340) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:154) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.lambda$endActionIfNecessary$0(CommandAsyncTask.java:106) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:84) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_151]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]

2017-12-08 07:08:07,118+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-48) [1ec95d85] CommandAsyncTask::HandleEndActionResult: endAction for action type 'DestroyImage' threw an unrecoverable RuntimeException the task will be cleared.

Comment 18 nijin ashok 2018-01-05 16:35:20 UTC
One of the customer environment is having the same issue on live merge as well. IIUC, we have fix in both cold merge and live merge as per https://gerrit.ovirt.org/#/c/85227/. However, would like to confirm because the title only mentions cold merge.

Could you please confirm if the fix is there for both cold and live merge?

Comment 19 Ala Hino 2018-01-08 08:07:43 UTC
The fix applies both for cold and live merge.

Comment 20 Raz Tamir 2018-01-14 09:36:42 UTC
Passed automation run of tier 1 - Verified
ovirt-engine-4.1.9-0.2.el7
vdsm-4.19.44-1.el7ev.x86_64

Comment 21 Sandro Bonazzola 2018-01-24 10:39:58 UTC
This bugzilla is included in oVirt 4.1.9 release, published on Jan 24th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.1.9 release, published on Jan 24th 2018, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.