Bug 1728212

Summary: [downstream clone - 4.3.5] Jobs are not properly cleaned after a failed task.

Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Version: 4.2.6
Target Milestone: ovirt-4.3.5
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1648345
Last Closed: 2019-08-12 11:53:28 UTC
Regression: ---
Documentation: ---
Verified Versions:
oVirt Team: Storage
Cloudforms Team: ---
Embargoed:
Bug Depends On: 1648345
Bug Blocks:

Reporter: RHV bug bot <rhv-bugzilla-bot>
Assignee: Benny Zlotnik <bzlotnik>
QA Contact: Yosi Ben Shimon <ybenshim>
Docs Contact:
CC: aefrat, bzlotnik, pelauter, Rhev-m-bugs, tnisan
Keywords: ZStream
Flags: lsvaty: testing_plan_complete-
Story Points: ---
Environment:
Type: ---
Mount Type: ---
CRM:
Category: ---
RHEL 7.3 requirements from Atomic Host:
Target Upstream Version:
Attachments:
Description (RHV bug bot, 2019-07-09 10:36:12 UTC):
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1. (Originally by Sandro Bonazzola)

IIUC, Ravi's patch [1] should handle these cases. Ravi, can you keep me honest?
[1] https://gerrit.ovirt.org/#/c/95873/
(Originally by Benny Zlotnik)

(In reply to Benny Zlotnik from comment #4)
> IIUC, Ravi's patch [1] should handle these cases. Ravi, can you keep me honest?
> [1] https://gerrit.ovirt.org/#/c/95873/

A lot of these issues should be fixed in 4.3 as part of BZ 1633777. I have not verified the specific flow described in comment #1, but I have verified as part of BZ 1633777 that exceptions thrown in endAction of the AddDisk and AddVmTemplate commands were handled properly. (Originally by rnori)

I tested throwing exceptions in live/cold merge endActions and it seems to be working correctly; moving to MODIFIED. (Originally by Benny Zlotnik)

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3.z': '?'}']. For more info please contact: rhv-devops (Originally by rhv-bugzilla-bot)

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3.z': '?'}']. For more info please contact: rhv-devops (Originally by rhv-bugzilla-bot)

From the initial description, it is hard to understand what the scenario is here. Please provide a clear scenario so we can verify this issue. (Originally by Avihai Efrat)

(In reply to Avihai from comment #9)
> From the initial description, it is hard to understand what the scenario is here.
> Please provide a clear scenario so we can verify this issue.

You could try to block the connection to vdsm once it starts executing the endAction, indicated by this log message:
"Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommand' successfully"
(Originally by Benny Zlotnik)

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3.z': '?'}']. For more info please contact: rhv-devops (Originally by rhv-bugzilla-bot)

Benny, can you please add the relevant fix patches to this bug?

There are none; I believe it was fixed as part of another bug.

I'm testing it using this setup:
- 1 VM with 1 disk on an iSCSI SD
- 1 snapshot
- 1 active host (the other two are in maintenance)

I'm listening to the engine log and waiting for:
"Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand' successfully"

Once this line appears in the log, I block the connection to the host using iptables:
# iptables --delete OUTPUT --destination storage-ge17-vdsm1.scl.lab.tlv.redhat.com --jump DROP --protocol all

I don't see any exceptions or failures, only the ones related to the blocked connection. The snapshot was deleted successfully and there are no uncleared jobs in the database either. Is this the expected result, or am I missing something in my steps?

I think that I didn't even hit the issue, so maybe I can test it some other way. Or maybe the connection was blocked too late?
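For reference, the trigger-and-block approach described in the comments above (wait for the "Ending command ... successfully" message, then cut the connection to the host) could be automated along these lines. This is a minimal sketch, not part of the original comments: the engine log path is the usual default, the host name is a placeholder, and the use of an iptables DROP rule via subprocess is an assumption for illustration.

```python
#!/usr/bin/env python
# Minimal sketch (assumptions noted above): tail engine.log on the engine host
# and block outgoing traffic to the VDSM host as soon as endAction begins.
import subprocess
import time

ENGINE_LOG = "/var/log/ovirt-engine/engine.log"      # default engine log path
VDSM_HOST = "vdsm-host.example.com"                   # hypothetical host name
TRIGGER = ("Ending command "
           "'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommand' "
           "successfully")

with open(ENGINE_LOG) as log:
    log.seek(0, 2)  # start tailing from the current end of the file
    while True:
        line = log.readline()
        if not line:
            time.sleep(0.1)
            continue
        if TRIGGER in line:
            # Drop all traffic from the engine to the VDSM host so that the
            # follow-up calls made during endAction fail.
            subprocess.check_call(
                ["iptables", "-I", "OUTPUT", "-d", VDSM_HOST, "-j", "DROP"])
            print("blocked connection to", VDSM_HOST)
            break
```

As the following comment notes, blocking manually after spotting the message may simply be too late; reacting to the log programmatically narrows that window.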
INFO: Bug status (ON_QA) wasn't changed but the following should be fixed: [No relevant external trackers attached]. For more info please contact: rhv-devops

Tested again on the latest version: ovirt-engine-4.3.5.4-0.1.el7.noarch.

During snapshot creation and deletion, the method getInfo (Volume class on VDSM) gets called 3 times:
- once when creating the snapshot
- twice when deleting it

So I injected a counter and a time.sleep() into the Python file API.py, in Volume.getInfo(). When the LAST getInfo() gets called, I:
1. Print an indication to the console
2. Pause the Python code for two minutes
3. Block the connection from the host to the storage domain
(A sketch of the injected code appears after this comment.)

The steps I used are:
1. Create a VM with a disk on an iSCSI storage domain
2. Create a snapshot
3. Inject the code into Volume.getInfo() in the API.py file
4. tail -f and wait for the log line I injected
5. When hitting the sleep injected in step 3, block the connection from that host to the iSCSI storage using iptables

I saw exceptions, but:
- The snapshot failed to delete, as expected
- No job got stuck (all were either "FINISH" or "FAILED")

From the engine log:

Ending the ColdMergeSnapshotSingleDiskCommand:
2019-07-14 10:40:21,527+03 INFO [org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Ending command 'org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand' successfully.

Getting the volume info (last getInfo()):
2019-07-14 10:40:21,532+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Command 'GetVolumeInfoVDSCommand(HostName = host_mixed_1, GetVolumeInfoVDSCommandParameters:{hostId='bbf14fb7-b273-4c68-92f3-55e986deff54', storagePoolId='d01b2205-b585-455d-adee-feb620b3c1fd', storageDomainId='bfba2c50-fc45-4ab0-99e8-f63fecfb9fa7', imageGroupId='976bf5c0-6ad7-442e-b0b7-718273d77ab4', imageId='6cbfbbeb-084e-4085-9653-b5609f43414b'})' execution failed: java.net.ConnectException: Connection refused
2019-07-14 10:40:21,532+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] FINISH, GetVolumeInfoVDSCommand, return: , log id: 3c96d7fa
2019-07-14 10:40:21,532+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Failed to get the volume information, marking as FAILED
2019-07-14 10:40:21,532+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] FINISH, GetImageInfoVDSCommand, return: , log id: 5501c63b

Transaction rolled back:
2019-07-14 10:40:21,548+03 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] transaction rolled back

Callback failure:
2019-07-14 10:40:21,560+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Failed invoking callback end method 'onSucceeded' for command 'cd0c9ede-0452-4958-a58b-74a2fdbf1456' with exception 'null', the callback is marked for end method retries but max number of retries have been attempted. The command will be marked as Failed.
2019-07-14 10:40:21,561+03 INFO [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Exception in invoking callback of command ColdMergeSnapshotSingleDisk (cd0c9ede-0452-4958-a58b-74a2fdbf1456): NullPointerException:
2019-07-14 10:40:21,561+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Error invoking callback method 'onSucceeded' for 'SUCCEEDED' command 'cd0c9ede-0452-4958-a58b-74a2fdbf1456'
2019-07-14 10:40:21,561+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-27) [416571f1-74bf-4202-9024-4414086a62a5] Exception: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.handleBackwardMerge(RemoveSnapshotSingleDiskCommandBase.java:253) [bll.jar:]
    at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.lambda$syncDbRecords$0(RemoveSnapshotSingleDiskCommandBase.java:174) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [utils.jar:]
    at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.syncDbRecords(RemoveSnapshotSingleDiskCommandBase.java:162) [bll.jar:]
    at org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand.endSuccessfully(ColdMergeSnapshotSingleDiskCommand.java:119) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:699) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:654) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1966) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:519) [bll.jar:]
    at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.endAction(ChildCommandsCallbackBase.java:134) [bll.jar:]
    at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.onSucceeded(ChildCommandsCallbackBase.java:166) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.endCallback(CommandCallbacksPoller.java:69) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:166) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109) [bll.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_222-ea]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_222-ea]
    at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
    at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_222-ea]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_222-ea]
    at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_222-ea]
    at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
2019-07-14 10:40:22,485+03 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to storage-ge6-vdsm1.scl.lab.tlv.redhat.com/10.35.82.71
2019-07-14 10:40:22,488+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Unable to RefreshCapabilities: ConnectException: Connection refused
2019-07-14 10:40:22,489+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-80) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{hostId='bbf14fb7-b273-4c68-92f3-55e986deff54', vds='Host[host_mixed_1,bbf14fb7-b273-4c68-92f3-55e986deff54]'})' execution failed: java.net.ConnectException: Connection refused
2019-07-14 10:40:22,588+03 ERROR [org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [416571f1-74bf-4202-9024-4414086a62a5] Ending command 'org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand' with failure.
2019-07-14 10:40:23,611+03 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [416571f1-74bf-4202-9024-4414086a62a5] Command 'RemoveSnapshot' id: '4dff8ebc-3794-4132-a1c6-a5925d213c37' child commands '[cd0c9ede-0452-4958-a58b-74a2fdbf1456]' executions were completed, status 'FAILED'
2019-07-14 10:40:24,648+03 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [416571f1-74bf-4202-9024-4414086a62a5] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand' with failure.
2019-07-14 10:40:24,689+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [416571f1-74bf-4202-9024-4414086a62a5] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_FAILURE(357), Failed to delete snapshot 'snap_5' for VM 'test_vm'.

The timing looks OK and it seems like I've hit the issue. Please ack for this one.

* Attaching engine & VDSM logs
** Engine correlation id: 416571f1-74bf-4202-9024-4414086a62a5

Created attachment 1590486 [details]
logs - with sleep
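The instrumentation injected into VDSM's API.py, as described in the comment above, can be sketched roughly as follows. This is only an illustration of the technique: the real body of Volume.getInfo() and its surrounding VDSM code are not reproduced, and the counter name, threshold constant, and logging call are assumptions.

```python
# Sketch of the described injection into VDSM's API.py (illustrative only).
import logging
import time

# Injected module-level state: getInfo() is reached three times in the
# create-snapshot + delete-snapshot flow, so the third call is the last one.
_GETINFO_CALLS = 0
_LAST_CALL = 3


class Volume(object):
    def getInfo(self):
        global _GETINFO_CALLS
        _GETINFO_CALLS += 1
        if _GETINFO_CALLS == _LAST_CALL:
            # Print an indication that the last call was reached, then pause
            # for two minutes so the connection from the host to the storage
            # domain can be blocked (e.g. with iptables) while the engine is
            # still waiting for the reply.
            logging.warning("last getInfo() call reached - sleeping 120s")
            time.sleep(120)
        # ... the original VDSM getInfo() body would continue here ...
```

The two-minute pause gives the tester a deterministic window in which to cut storage connectivity, which is what produces the endAction failure seen in the engine log above.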
Missed the VDSM version in comment #17: vdsm-4.30.24-2.el7ev.x86_64

Looks good.

Great, moving to VERIFIED.

INFO: Bug status (VERIFIED) wasn't changed but the following should be fixed: [No relevant external trackers attached]. For more info please contact: rhv-devops

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2431