Bug 1728212

Summary: [downstream clone - 4.3.5] Jobs are not properly cleaned after a failed task.
Product: Red Hat Enterprise Virtualization Manager Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: ovirt-engineAssignee: Benny Zlotnik <bzlotnik>
Status: CLOSED ERRATA QA Contact: Yosi Ben Shimon <ybenshim>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.2.6CC: aefrat, bzlotnik, pelauter, Rhev-m-bugs, tnisan
Target Milestone: ovirt-4.3.5Keywords: ZStream
Target Release: ---Flags: lsvaty: testing_plan_complete-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1648345 Environment:
Last Closed: 2019-08-12 11:53:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1648345    
Bug Blocks:    
Attachments:
Description Flags
logs - with sleep none

Description RHV bug bot 2019-07-09 10:36:12 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1648345 +++
======================================================================

Created attachment 1503647 [details]
Engine logs

Description of problem:
The job is not removed if there is an exception in the endAction of the RemoveSnapshot.

Version-Release number of selected component (if applicable):
RHV 4.2.6

How reproducible:
Under some circumstances

Steps to Reproduce:
Hard to say. It requires a storage-related issue during the end action.

Actual results:
Jot is in state STARTED

Expected results:
The JOB is removed, when the issue is sorted out.

Additional information:
jobID = 14d992a6-1c64-40fe-b0d8-21f399c826bb
CorrelationID = d466a458-60a5-4038-9c70-8a82746a29cb

2018-10-12 23:07:58,883+02 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommand] (EE-ManagedThreadFactory-engine-Thread-287984) [d466a458-60a5-4038-9c70-8a82746a29cb] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommand' successfully.
...
org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-287984) [d466a458-60a5-4038-9c70-8a82746a29cb] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM zahqinfrhevh02 command GetVolumeInfoVDS failed: Resource timeout: ()
2018-10-12 23:09:58,933+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-287984) [d466a458-60a5-4038-9c70-8a82746a29cb] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand' return value '
VolumeInfoReturn:{status='Status [code=851, message=Resource timeout: ()]'}
'
...

2018-10-12 23:14:00,869+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-287984) [] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_FAILURE(357), Failed to delete snapshot '_GX_BACKUP_zahqdevlecm01_26560_6988_zahqinfvsa01' for VM 'zahqdevlecm01'.
2018-10-12 23:14:00,869+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-287984) [] [within thread]: endAction for action type RemoveSnapshot threw an exception.: javax.ejb.EJBTransactionRolledbackException
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleInCallerTx(CMTTxInterceptor.java:160) [wildfly-ejb3-7.1.4.GA-redhat-1.jar:7.1.4.GA-redhat-1]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:257) [wildfly-ejb3-7.1.4.GA-redhat-1.jar:7.1.4.GA-redhat-1]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:381) [wildfly-ejb3-7.1.4.GA-redhat-1.jar:7.1.4.GA-redhat-1]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:244) [wildfly-ejb3-7.1.4.GA-redhat-1.jar:7.1.4.GA-redhat-1]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)
...
Caused by: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.handleForwardMerge(RemoveSnapshotSingleDiskCommandBase.java:198) [bll.jar:]
    at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.lambda$syncDbRecords$0(RemoveSnapshotSingleDiskCommandBase.java:161) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [utils.jar:]
    at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.syncDbRecords(RemoveSnapshotSingleDiskCommandBase.java:151) [bll.jar:]
    at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommand.endSuccessfully(RemoveSnapshotSingleDiskCommand.java:115) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:675) [bll.jar:]
...

(Originally by Roman Hodain)

Comment 2 RHV bug bot 2019-07-09 10:36:15 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

(Originally by Sandro Bonazzola)

Comment 4 RHV bug bot 2019-07-09 10:36:19 UTC
IIUC Ravi's patch[1] should handle these cases. Ravi, can you keep me honest?


[1] - https://gerrit.ovirt.org/#/c/95873/

(Originally by Benny Zlotnik)

Comment 5 RHV bug bot 2019-07-09 10:36:20 UTC
(In reply to Benny Zlotnik from comment #4)
> IIUC Ravi's patch[1] should handle these cases. Ravi, can you keep me honest?
> 
> 
> [1] - https://gerrit.ovirt.org/#/c/95873/

A lot of these issues should be fixed in 4.3 as part of BZ 1633777. I have not verified the specific flow described in comment #1, but I have verified as part of BZ 163377 that exceptions thrown in endAction of AddDisk and AddVmTemplate commands were handled properly.

(Originally by rnori)

Comment 6 RHV bug bot 2019-07-09 10:36:22 UTC
I tested throwing exceptions in live/cold merge endActions and it seems to be working correctly, moving to modified

(Originally by Benny Zlotnik)

Comment 7 RHV bug bot 2019-07-09 10:36:23 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)

Comment 8 RHV bug bot 2019-07-09 10:36:25 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)

Comment 9 RHV bug bot 2019-07-09 10:36:26 UTC
From the initial description, it is hard to understand what is the scenario here.
Please provide a clear scenario so we can verify this issue.

(Originally by Avihai Efrat)

Comment 10 RHV bug bot 2019-07-09 10:36:28 UTC
(In reply to Avihai from comment #9)
> From the initial description, it is hard to understand what is the scenario
> here.
> Please provide a clear scenario so we can verify this issue.

You could try to block the connection to vdsm when once it starts executing the endAction, indicated by this log message:
"Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommand' successfully"

(Originally by Benny Zlotnik)

Comment 11 RHV bug bot 2019-07-09 10:36:29 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)

Comment 12 Avihai 2019-07-09 14:14:49 UTC
Benny, can you please add the relevant fix patches to this bug?

Comment 13 Benny Zlotnik 2019-07-09 14:17:39 UTC
There are none, I believe it was fixed as part of anoter bug

Comment 15 Avihai 2019-07-11 03:55:11 UTC
I'm testing it using this setup:
- 1 VM with 1 disk on iscsi SD
- 1 Snapshot
- 1 active host (the other two are in maintenance)

I'm listening to the engine log and wait for:
"Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand' successfully"

Once this line appears in the log, I block the connection to the host using iptables:
# iptables --delete OUTPUT --destination storage-ge17-vdsm1.scl.lab.tlv.redhat.com --jump DROP --protocol all

I don't see any exceptions or failures - only the ones that related to the blocked connection.
The snapshot deleted successfully and no uncleared jobs in the database as well.

Is this the expected result or I'm missing something in my steps?
I think that I didn't even hit the issue, so maybe I can test it somehow else.
Or maybe the connection was blocked too late?

Comment 16 RHV bug bot 2019-07-11 12:37:00 UTC
INFO: Bug status (ON_QA) wasn't changed but the folowing should be fixed:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 17 Yosi Ben Shimon 2019-07-14 08:39:55 UTC
Tested again on the latest version:
ovirt-engine-4.3.5.4-0.1.el7.noarch

During snapshot creation and deletion, the method getInfo (Volume class on VDSM) get called 3 times:
- Once when creating the snapshot
- Twice when deleting it

So I've injected a counter and a time.sleep() to the python file API.py in Volume.getInfo()
When the LAST getInfo() get called, I:
1. Printed to the console for indication
1. Pause the python code for 2 minute
2. Block the connection from the host to the storage domain

The steps I've used are:
1. Create a VM with a disk on iscsi storage domain
2. Create a snapshot
3. Inject the code to Volume.getInfo() in API.py file
3. tail -f and wait for the log I've injected
4. when hitting the sleep in step 3 - I've blocked the connection from that host to the iscsi storage using iptables

I've seen exception but:
- The snapshot failed to delete as expected
- No job got stuck (all were either "FINISH" or "FAILED")


From the engine log:

Ending the ColdMergeSnapshotSingleDiskCommand:

2019-07-14 10:40:21,527+03 INFO  [org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Ending command
 'org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand' successfully.


Getting the volume info (last getInfo()):

2019-07-14 10:40:21,532+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Command 'GetVolumeI
nfoVDSCommand(HostName = host_mixed_1, GetVolumeInfoVDSCommandParameters:{hostId='bbf14fb7-b273-4c68-92f3-55e986deff54', storagePoolId='d01b2205-b585-455d-adee-feb620b3c1fd', storageDomainId='bfba2c50-fc45-4ab0-
99e8-f63fecfb9fa7', imageGroupId='976bf5c0-6ad7-442e-b0b7-718273d77ab4', imageId='6cbfbbeb-084e-4085-9653-b5609f43414b'})' execution failed: java.net.ConnectException: Connection refused
2019-07-14 10:40:21,532+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] FINISH, GetVolumeIn
foVDSCommand, return: , log id: 3c96d7fa
2019-07-14 10:40:21,532+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Failed to get the vo
lume information, marking as FAILED
2019-07-14 10:40:21,532+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] FINISH, GetImageInfo
VDSCommand, return: , log id: 5501c63b


transaction rolled back:


2019-07-14 10:40:21,548+03 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] transaction rolled back


Callback failure:


2019-07-14 10:40:21,560+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Failed invoking callback end m
ethod 'onSucceeded' for command 'cd0c9ede-0452-4958-a58b-74a2fdbf1456' with exception 'null', the callback is marked for end method retries but max number of retries have been attempted. The command will be mark
ed as Failed.
2019-07-14 10:40:21,561+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Exception in invoking callback
 of command ColdMergeSnapshotSingleDisk (cd0c9ede-0452-4958-a58b-74a2fdbf1456): NullPointerException: 
2019-07-14 10:40:21,561+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Error invoking callback method
 'onSucceeded' for 'SUCCEEDED' command 'cd0c9ede-0452-4958-a58b-74a2fdbf1456'

2019-07-14 10:40:21,561+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Exception: java.lang.NullPoint
erException
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.handleBackwardMerge(RemoveSnapshotSingleDiskCommandBase.java:253) [bll.jar:]
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.lambda$syncDbRecords$0(RemoveSnapshotSingleDiskCommandBase.java:174) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [utils.jar:]
        at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.syncDbRecords(RemoveSnapshotSingleDiskCommandBase.java:162) [bll.jar:]
        at org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand.endSuccessfully(ColdMergeSnapshotSingleDiskCommand.java:119) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:699) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:654) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1966) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:519) [bll.jar:]
        at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.endAction(ChildCommandsCallbackBase.java:134) [bll.jar:]
        at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.onSucceeded(ChildCommandsCallbackBase.java:166) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.endCallback(CommandCallbacksPoller.java:69) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:166) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109) [bll.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_222-ea]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_222-ea]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_222-ea]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_222-ea]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_222-ea]
        at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]

2019-07-14 10:40:22,485+03 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to storage-ge6-vdsm1.scl.lab.tlv.redhat.com/10.35.82.71
2019-07-14 10:40:22,488+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Unable to RefreshCapabilities: ConnectException: Connection refused
2019-07-14 10:40:22,489+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{hostId='bbf14fb7-b273-4c68-92f3-55e986deff54', vds='Host[host_mixed_1,bbf14fb7-b273-4c68-92f3-55e986deff54]'})' execution failed: java.net.ConnectException: Connection refused
2019-07-14 10:40:22,588+03 ERROR [org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [416571f1-74bf-4202-9024-4414086a62a5] Ending command 'org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand' with failure.
2019-07-14 10:40:23,611+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [416571f1-74bf-4202-9024-4414086a62a5] Command 'RemoveSnapshot' id: '4dff8ebc-3794-4132-a1c6-a5925d213c37' child commands '[cd0c9ede-0452-4958-a58b-74a2fdbf1456]' executions were completed, status 'FAILED'
2019-07-14 10:40:24,648+03 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [416571f1-74bf-4202-9024-4414086a62a5] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand' with failure.
2019-07-14 10:40:24,689+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [416571f1-74bf-4202-9024-4414086a62a5] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_FAILURE(357), Failed to delete snapshot 'snap_5' for VM 'test_vm'.


The timing looks ok and seems like I've hit the issue.
Please ack for this one.



*  Attaching engine & VDSM logs
** Engine correlation id: 416571f1-74bf-4202-9024-4414086a62a5

Comment 18 Yosi Ben Shimon 2019-07-14 08:40:37 UTC
Created attachment 1590486 [details]
logs - with sleep

Comment 19 Yosi Ben Shimon 2019-07-14 08:42:12 UTC
Missed VDSM version in comment #17:
vdsm-4.30.24-2.el7ev.x86_64

Comment 20 Benny Zlotnik 2019-07-14 11:57:23 UTC
Looks good

Comment 21 Yosi Ben Shimon 2019-07-14 12:09:02 UTC
Great.
Moving to VERIFIED

Comment 22 RHV bug bot 2019-07-18 11:02:11 UTC
INFO: Bug status (VERIFIED) wasn't changed but the folowing should be fixed:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 24 errata-xmlrpc 2019-08-12 11:53:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2431

Comment 25 Daniel Gur 2019-08-28 13:15:02 UTC
sync2jira

Comment 26 Daniel Gur 2019-08-28 13:20:05 UTC
sync2jira