Bug 1509629

Summary: Cold merge failed to remove all volumes
Product: [oVirt] ovirt-engine Reporter: Raz Tamir <ratamir>
Component: BLL.StorageAssignee: Ala Hino <ahino>
Status: CLOSED CURRENTRELEASE QA Contact: Raz Tamir <ratamir>
Severity: high Docs Contact:
Priority: urgent    
Version: 4.2.0CC: ahino, amureini, bugs, lveyde, nashok, ratamir, ylavi
Target Milestone: ovirt-4.1.9Keywords: Automation, Regression
Target Release: ---Flags: rule-engine: ovirt-4.1+
rule-engine: ovirt-4.2+
rule-engine: blocker+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.1.9 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1532567 (view as bug list) Environment:
Last Closed: 2018-01-24 10:39:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1518562, 1532567    
Attachments:
Description Flags
engine and vdsm logs
none
engine log none

Description Raz Tamir 2017-11-05 09:42:49 UTC
Created attachment 1348061 [details]
engine and vdsm logs

Description of problem:

In our automation (tier 1) we see that a clod merge operation is failing to remove all volumes.

2017-11-02 19:40:00,671+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-6690) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'DestroyImage',
2017-11-02 19:40:00,672+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-6690) [] [within thread]: endAction for action type DestroyImage threw an exception.: java.lang.NullPointerExceptio
n
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:351) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:347) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:160) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.lambda$endActionIfNecessary$0(CommandAsyncTask.java:112) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:96) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_144]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_144]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent-1.0.jar:]
        at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)

2017-11-02 19:40:00,672+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-6690) [] CommandAsyncTask::HandleEndActionResult: endAction for action type 'DestroyImage' threw an unrecoverable R
untimeException the task will be cleared.

As a result, not all volumes get removed


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171101152059.git8b85834.el7.centos.noarch


How reproducible:
Fails on all storage types

Steps to Reproduce:
1. Create VM with OS disk + 4 disks
2. Create files on the VM's disks and create snapshot
3. cold merge the snapshot

Actual results:
Cold merge failed - before cold merge: 20 volumes, after cold merge: 18 volumes, snapshot contains 5 disks

Expected results:
Cold merge succeeds - before cold merge: 20 volumes, after cold merge: 15 volumes, snapshot contains 5 disks

Additional info:
* No exception in vdsm side
* Regression because this case passed in previous commits

Comment 1 Ala Hino 2017-11-05 13:09:09 UTC
I was able to reproduce the issue; actually, I see the same exception in the engine log, however, the snapshot is removed.

Can you please confirm whether the snapshot removed in your env?

Comment 2 Ala Hino 2017-11-05 13:14:23 UTC
Based on the following message:

Snapshot 'snapshot_18894_fcp_1' deletion for VM 'vm_TestCase18894_0219280637' has been completed

Looks that we are hitting the same issue - the snapshot is removed but there is the NPE. I will look into this.

Comment 3 Raz Tamir 2017-11-05 13:15:26 UTC
Yes, the snapshot removed

Comment 4 Ala Hino 2017-11-05 13:19:43 UTC
(In reply to Raz Tamir from comment #3)
> Yes, the snapshot removed

In the description you mentioned "we see that a clod merge operation is failing to remove all volumes", what do you mean by that? Do the volumes still exist after cold merge?

Comment 5 Raz Tamir 2017-11-05 13:24:55 UTC
Yes,

This is the output of our tests:
Cold merge failed - before cold merge: 20 volumes, after cold merge: 18 volumes, snapshot contains 5 disks

It checks lvs before the remove snap and after

Comment 6 Raz Tamir 2017-11-23 10:13:06 UTC
I still see the same issue in engine.log:
2017-11-23 12:02:37,743+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-28260) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'DestroyImage',
2017-11-23 12:02:37,744+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-28260) [] [within thread]: endAction for action type DestroyImage threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:351) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:347) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:160) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.lambda$endActionIfNecessary$0(CommandAsyncTask.java:112) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:96) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_151]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent-1.0.jar:]
        at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)

2017-11-23 12:02:37,746+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-28260) [] CommandAsyncTask::HandleEndActionResult: endAction for action type 'DestroyImage' threw an unrecoverable RuntimeException the task will be cleared.
2017-11-23 12:02:37,748+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [] BaseAsyncTask::onTaskEndSuccess: Task 'f06cf00c-76d8-4329-8a8a-fcdec3008c56' (Parent Command 'DestroyImage', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully.
2017-11-23 12:02:37,749+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [] CommandAsyncTask::endActionIfNecessary: All tasks of command 'fac1cac6-8f5d-49ba-af34-3b3b8d13e151' has ended -> executing 'endAction'

Comment 7 Raz Tamir 2017-11-23 10:13:30 UTC
Created attachment 1358115 [details]
engine log

Comment 8 Ala Hino 2017-11-23 11:24:34 UTC
What's the engine version?

Comment 9 Raz Tamir 2017-11-23 11:27:06 UTC
ovirt-engine-4.2.0-0.0.master.20171121184703.git173fe83.el7.centos

Comment 10 Ala Hino 2017-11-27 08:23:00 UTC
Raz,

Can you please set log level to DEBUG, try again and send the logs?
If possible to give access to the env after the test fails, that would be great.

Comment 11 Raz Tamir 2017-11-28 11:55:55 UTC
(In reply to Ala Hino from comment #10)
> Raz,
> 
> Can you please set log level to DEBUG, try again and send the logs?
The provided logs are with DEBUG level
> If possible to give access to the env after the test fails, that would be
> great.

Comment 12 Ala Hino 2017-11-28 12:01:50 UTC
(In reply to Raz Tamir from comment #11)
> (In reply to Ala Hino from comment #10)
> > Raz,
> > 
> > Can you please set log level to DEBUG, try again and send the logs?
> The provided logs are with DEBUG level
> > If possible to give access to the env after the test fails, that would be
> > great.

The hosts are with DEBUG logs but not the engine.
This is an engine side bug.

Comment 14 Ala Hino 2017-11-29 11:11:15 UTC
(In reply to Yaniv Kaul from comment #13)
> Ala - Engine log in DEBUG is available @
> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-
> x86_64/2493/artifact/exported-artifacts/basic-suite-master__logs/test_logs/
> basic-suite-master/post-007_sd_reattach.py/lago-basic-suite-master-engine/
> _var_log/ovirt-engine/engine.log
> 
> Please analyze.
> 
> (CI test failure)

Same error as found by automation.

Comment 15 Raz Tamir 2017-12-07 20:17:05 UTC
Passed automation run on rhvm-4.2.0-0.6.el7

Verified

Comment 16 Ala Hino 2017-12-10 10:15:16 UTC
This bug was verified on 4.2 but the issue encountered on 4.1.8 too.
Re-targeting this bug to 4.1.9.

Comment 17 Raz Tamir 2017-12-10 10:27:46 UTC
Relevant log

engine.log:

2017-12-08 07:08:07,117+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-48) [1ec95d85] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'DestroyImage',
2017-12-08 07:08:07,118+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-48) [1ec95d85] [within thread]: endAction for action type DestroyImage threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:335) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:340) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:154) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.lambda$endActionIfNecessary$0(CommandAsyncTask.java:106) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:84) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_151]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]

2017-12-08 07:08:07,118+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-48) [1ec95d85] CommandAsyncTask::HandleEndActionResult: endAction for action type 'DestroyImage' threw an unrecoverable RuntimeException the task will be cleared.

Comment 18 nijin ashok 2018-01-05 16:35:20 UTC
One of the customer environment is having the same issue on live merge as well. IIUC, we have fix in both cold merge and live merge as per https://gerrit.ovirt.org/#/c/85227/. However, would like to confirm because the title only mentions cold merge.

Could you please confirm if the fix is there for both cold and live merge?

Comment 19 Ala Hino 2018-01-08 08:07:43 UTC
The fix applies both for cold and live merge.

Comment 20 Raz Tamir 2018-01-14 09:36:42 UTC
Passed automation run of tier 1 - Verified
ovirt-engine-4.1.9-0.2.el7
vdsm-4.19.44-1.el7ev.x86_64

Comment 21 Sandro Bonazzola 2018-01-24 10:39:58 UTC
This bugzilla is included in oVirt 4.1.9 release, published on Jan 24th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.1.9 release, published on Jan 24th 2018, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.