Description of problem:
A VM was cloned from a template. The sequence on the engine and SPM host seemed to complete, at least the task was removed from the database. However, the newly-created VM and its disk image remain locked, which prevents any subsequent use of this template.

Version-Release number of selected component (if applicable):
RHV 4.2.5
RHEL 7.5 host (SPM) with vdsm-4.20.35-1

How reproducible:
Not reproducible; no known steps to reproduce.
It looks like command monitoring has stopped for some reason. On a normal execution, AddVmFromTemplate fires CreateCloneOfTemplate, which fires CopyImageGroupWithData. At that point, CopyImageGroupWithData should be monitored as a child of CreateCloneOfTemplate, resulting in log messages like:

[org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [82947a98-d464-4d80-ab04-22ac8d0074c8] Command 'CreateCloneOfTemplate' (id: '4e4608d3-d792-45d6-b9ab-70e9d50cbb9c') waiting on child command id: '470d0df7-9f60-4735-ae23-66775aa44493' type:'CopyImageGroupWithData' to complete

AddVmFromTemplate should likewise be waiting on its child command(s):

2018-10-03 18:00:31,767+03 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [82947a98-d464-4d80-ab04-22ac8d0074c8] Command 'AddVmFromTemplate' (id: '8113f23e-d196-448e-a571-45c5ed06bad3') waiting on child command id: '4e4608d3-d792-45d6-b9ab-70e9d50cbb9c' type:'CreateCloneOfTemplate' to complete

until at some point the child commands are done and AddVmFromTemplate finishes:

2018-10-03 18:07:06,298+03 INFO [org.ovirt.engine.core.bll.AddVmFromTemplateCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [82947a98-d464-4d80-ab04-22ac8d0074c8] Ending command 'org.ovirt.engine.core.bll.AddVmFromTemplateCommand' successfully.

However, in the reported case there are no relevant log messages from ConcurrentChildCommandsExecutionCallback and no log message for AddVmFromTemplate finishing. In general, I don't see any log from ConcurrentChildCommandsExecutionCallback or SerialChildCommandsExecutionCallback in the attached log, so there is a chance that CoCo stopped monitoring all executed commands on this environment. Ravi, can you please take a look?
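For anyone triaging a similar case, a way to see which commands CoCo still considers in-flight is to look at the command_entities table in the engine database. This is a minimal sketch, assuming the default 'engine' database and that this release's schema has the callback_enabled/callback_notified columns; verify the column names against your schema before relying on it:

# Hypothetical diagnostic query: list commands whose callbacks are enabled
# but were never notified, i.e. commands the CoCo poller should still be handling.
su - postgres -c "psql -d engine -c \"
  SELECT command_id, command_type, status, created_at
  FROM command_entities
  WHERE callback_enabled = true AND callback_notified = false
  ORDER BY created_at;\""

If such rows keep piling up while ConcurrentChildCommandsExecutionCallback stays silent in engine.log, that matches the behavior described above.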
Anything in server.log, perhaps?
@Arik

Do you have a reproducer? I tried cloning a VM from a template and the AddVmFromTemplateCommand succeeded and the disk is unlocked.
(In reply to Ravi Nori from comment #13)
> @Arik
>
> Do you have a reproducer?

Unfortunately, no. We encountered a similar situation, in which CoCo stopped functioning (with regard to job monitoring), on rhv.tlv and didn't manage to reproduce it back then, and I've been told that the scale team has hit it several times quite recently. Apparently, there is no easy way to reproduce it. We usually hear about it once jobs stop being monitored, which causes commands to get stuck (it doesn't necessarily affect AddVmFromTemplate; it can happen to other commands as well), and we don't know what exactly happens when the monitoring stops. IIRC, Roy prepared a script/configuration that should allow us to instrument the code once it happens so we can get more information, but I don't know if it has ever been applied. Roy, can you please shed some light on this?

> I tried cloning a VM from template and the AddVmFromTemplateCommand
> succeeded and the disk is unlocked.

Right, I also tested this flow; it generally works properly when CoCo functions well.
Arik, please see the byteman.sh script I posted under the contrib folder: https://github.com/oVirt/ovirt-engine/blob/master/contrib/byteman.sh
I believe you'll figure out from the script how to tailor this to CoCo; if not, let me know. For the record, a slightly modified version was used by QE to put an artificial delay into one of our commands to reproduce a scenario, and it works.
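To make the artificial-delay trick concrete, here is a minimal hand-rolled sketch of the kind of rule the contrib script drives. The Byteman install path, the PID lookup, and the target command class are assumptions for illustration (pick whichever command you want to slow down); the actual script handles attaching and submitting for you:

# Attach the Byteman agent to the running engine JVM (PID lookup is an assumption;
# adjust to however the engine process is identified on your host).
ENGINE_PID=$(pgrep -f 'jboss-modules.*ovirt-engine')
/opt/byteman/bin/bminstall.sh "$ENGINE_PID"

# delay-command.btm: hold a command's executeCommand() for 30 seconds so the
# window in which CoCo has to keep polling its callback is easy to observe.
cat > /tmp/delay-command.btm <<'EOF'
RULE delay CopyImageGroupWithData execution
CLASS org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupWithDataCommand
METHOD executeCommand
AT ENTRY
IF TRUE
DO Thread.sleep(30000)
ENDRULE
EOF

# Load the rule into the attached agent.
/opt/byteman/bin/bmsubmit.sh -l /tmp/delay-command.btm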
Patch attached to wrong bug
Note that we hit the same issue on the latest oVirt 4.1. Commands related to creating disks (adding disks, creating snapshots, creating templates) get stuck at some point. The workaround is restarting ovirt-engine; ConcurrentChildCommandsExecutionCallback appears in the logs after the engine restart.
(In reply to Pavel Gashev from comment #17)
> Note that we hit the same issue on the latest oVirt 4.1. Commands related to
> creating disks (add disks, creating snapshots, templates) stuck at some
> point. Workaround is restarting ovirt-engine.
> ConcurrentChildCommandsExecutionCallback appears in logs after engine
> restart.

Did you mean oVirt 4.2?
No, I mean that oVirt 4.1 has this bug too.
OK, I hit it again on oVirt 4.1.11.2 / vdsm 4.19.51. After it got stuck on one VM, I executed the script and tried to make a snapshot on another VM. It got stuck in exactly the same way as described at https://access.redhat.com/solutions/3629221

There is nothing special in the trace log, only GetStatsVDS, GetAllVmStatsVDS, SpmStatusVDS, and:

2018-11-16 08:56:25,250Z INFO [stdout] (DefaultQuartzScheduler10) *** execute vdsbroker HSMGetAllTasksStatusesVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d]
2018-11-16 08:56:25,257Z INFO [stdout] (DefaultQuartzScheduler10) *** execute vdsbroker HSMGetAllTasksStatusesVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d] return value: {17ee7349-c57b-486d-9ac5-b67452c751fe=AsyncTaskStatus:{status='finished', result='success', exception='null', message=''}}
2018-11-16 08:56:25,314Z INFO [stdout] (org.ovirt.thread.pool-7-thread-32) *** execute vdsbroker HSMClearTaskVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d]
2018-11-16 08:56:25,328Z INFO [stdout] (org.ovirt.thread.pool-7-thread-32) *** execute vdsbroker HSMClearTaskVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d] return value: null

The disk is created, but the engine doesn't even try to go further and create the snapshot. After an engine restart, the scheduler ran ConcurrentChildCommandsExecutionCallback and everything proceeded:

2018-11-16 09:10:23,494Z INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler8) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] Command 'CreateAllSnapshotsFromVm' id: '12906297-8160-423e-9f62-1d217a7aac62' child commands '[5e9de915-2637-410a-8a55-3518226ce751]' executions were completed, status 'SUCCEEDED'
2018-11-16 09:10:29,604Z INFO [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand' successfully.
2018-11-16 09:10:29,652Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] START, SnapshotVDSCommand(HostName = c5b6, SnapshotVDSCommandParameters:{runAsync='true', hostId='61d48552-ff67-4a67-8aea-289d8d9a6151', vmId='876bba52-080b-460f-a190-33f378a5b273'}), log id: c9eb87a
2018-11-16 09:10:30,730Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] FINISH, SnapshotVDSCommand, log id: c9eb87a
2018-11-16 09:10:30,749Z INFO [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand' successfully.
There is probably a real issue here, but triggering it is tricky. We don't hit it in CI, and I'm pretty sure the scale and performance QE team didn't see it either. So in order to make progress we need a thread dump; otherwise it is hard to tell what went wrong. A debug log would also be nice. Reminder: you can set the debug log level at runtime.
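For whoever hits this next, a rough sketch of collecting both, assuming the engine JVM runs as the ovirt user on the bundled WildFly and that jboss-cli.sh can reach the local management interface; the PID lookup, paths, controller address, and logger name are assumptions to adjust per setup:

# Thread dump of the engine JVM with standard JDK tooling
# (run jstack as the same user that owns the JVM; 'ovirt' is an assumption).
ENGINE_PID=$(pgrep -f 'jboss-modules.*ovirt-engine')
sudo -u ovirt jstack "$ENGINE_PID" > /tmp/engine-threads-$(date +%s).txt

# Raise the bll log level to DEBUG at runtime through the WildFly CLI.
# The jboss-cli.sh path and the 127.0.0.1:8706 controller address are assumptions;
# use :write-attribute(name=level,value=DEBUG) instead of :add if the logger already exists.
/usr/share/ovirt-engine-wildfly/bin/jboss-cli.sh --connect --controller=127.0.0.1:8706 \
  '/subsystem=logging/logger=org.ovirt.engine.core.bll:add(level=DEBUG)'

If jstack is unavailable, kill -3 on the engine PID also writes a thread dump to wherever the JVM's stdout goes.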
Created attachment 1507894 [details] Thread dump for 4.1.11
Please find the thread dump attached. I have also submitted the following rule via bmsubmit:

RULE trace invokeCallbackMethods
CLASS org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller
METHOD invokeCallbackMethods
AT ENTRY
IF TRUE
DO traceln("*** execute invokeCallbackMethods")
ENDRULE

It shows no traces, so invokeCallbackMethods is NOT being executed.
Note that the Quartz threads can be constantly busy with statistics if you have more than 10 hosts. The more hosts you have, the more VMs are running on them, and the slower the PostgreSQL disk is, the higher the chance of hitting this bug. Moving the engine to an SSD resolves the issue.
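A quick way to sanity-check the starvation theory from a thread dump; the thread-name patterns below are based on the scheduler thread names visible in the logs above and may differ between releases:

# Take a thread dump (same assumptions on PID lookup and JVM user as earlier).
ENGINE_PID=$(pgrep -f 'jboss-modules.*ovirt-engine')
sudo -u ovirt jstack "$ENGINE_PID" > /tmp/engine-threads.txt

# How many scheduler worker threads exist...
grep -c -E '"DefaultQuartzScheduler|"engineScheduled' /tmp/engine-threads.txt

# ...and how many of them are currently RUNNABLE (busy) rather than waiting.
grep -A 1 -E '"DefaultQuartzScheduler|"engineScheduled' /tmp/engine-threads.txt \
  | grep -c 'java.lang.Thread.State: RUNNABLE'

If all of them are RUNNABLE in statistics or database calls and none is anywhere near CommandCallbacksPoller, that would support the explanation above.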
Ravi, unless you have another suggestion, since the steps to reproduce here are unclear, we'll verify based on regression testing results.
We don't have a reproducer for the issue, so verifying based on regression testing results sounds good.
No regressions related to this were found. Moving to VERIFIED.
Last execution environment:
ovirt-engine-4.3.0.4-0.1.el7.noarch
vdsm-4.30.8-2.el7ev.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1085