Bug 1633777 - VM cloned from template remains locked, subsequent clones fail
Summary: VM cloned from template remains locked, subsequent clones fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.5
Hardware: Unspecified
OS: Linux
unspecified
high
Target Milestone: ovirt-4.3.0
: ---
Assignee: Ravi Nori
QA Contact: Yosi Ben Shimon
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-09-27 18:12 UTC by Gordon Watson
Modified: 2020-02-25 12:36 UTC (History)
14 users (show)

Fixed In Version: ovirt-engine-4.3.0_rc
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-08 12:38:22 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Thread dump for 4.1.11 (12.51 KB, application/x-gzip)
2018-11-22 09:05 UTC, Pavel Gashev
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3629221 0 None None None 2018-09-27 22:04:14 UTC
Red Hat Product Errata RHEA-2019:1085 0 None None None 2019-05-08 12:38:39 UTC
oVirt gerrit 95872 0 'None' MERGED engine: Add debug and info logs to CommandCallbackPoller 2020-08-05 18:46:35 UTC
oVirt gerrit 95873 0 'None' MERGED engine: Introduce CommandCallback finalizeCommand method 2020-08-05 18:46:34 UTC
oVirt gerrit 95874 0 'None' MERGED engine: Fix negative flow in AddImageFromStratchCommand 2020-08-05 18:46:34 UTC
oVirt gerrit 95875 0 'None' MERGED engine: Handle exceptions during compensate 2020-08-05 18:46:34 UTC
oVirt gerrit 95876 0 'None' MERGED engine: Handle EXECUTION_FAILED status in callback 2020-08-05 18:46:34 UTC
oVirt gerrit 95877 0 'None' MERGED engine: Wrap all statements in CommandCallbacksPoller in try catch 2020-08-05 18:46:34 UTC

Description Gordon Watson 2018-09-27 18:12:32 UTC
Description of problem:

A VM was cloned from a template, the sequence on the engine and SPM host seemed to complete, at least the task was removed from the database. However, the newly-created VM and its disk image remain locked.

This prevents any subsequent use of this template.


Version-Release number of selected component (if applicable):

RHV 4.2.5
RHEL 7.5 host (SPM) w/vdsm-4.20.35-1


How reproducible:

Not.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 10 Arik 2018-10-03 15:47:57 UTC
Seems like the commands monitoring has stopped for some reason.

On normal execution, AddVmFromTemplate fires CreateCloneOfTemplate that fires CopyImageGroupWithData. At that point, CopyImageGroupWithData should be monitored as a child of CreateCloneOfTemplate, resulting in log printings like:

[org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [82947a98-d464-4d80-ab04-22ac8d0074c8] Command 'CreateCloneOfTemplate' (id: '4e4608d3-d792-45d6-b9ab-70e9d50cbb9c') waiting on child command id: '470d0df7-9f60-4735-ae23-66775aa44493' type:'CopyImageGroupWithData' to complete

And also AddVmFromTemplate should be waiting on its child command(s):

2018-10-03 18:00:31,767+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [82947a98-d464-4d80-ab04-22ac8d0074c8] Command 'AddVmFromTemplate' (id: '8113f23e-d196-448e-a571-45c5ed06bad3') waiting on child command id: '4e4608d3-d792-45d6-b9ab-70e9d50cbb9c' type:'CreateCloneOfTemplate' to complete

Until at some point the child commands are done and AddVmFromTemplate is finished:

2018-10-03 18:07:06,298+03 INFO  [org.ovirt.engine.core.bll.AddVmFromTemplateCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [82947a98-d464-4d80-ab04-22ac8d0074c8] Ending command 'org.ovirt.engine.core.bll.AddVmFromTemplateCo
mmand' successfully.

However, there are no relevant log printings from ConcurrentChildCommandsExecutionCallback in the reported case and no log for AddVmFromTemplate being finished.

In general, I don't see any log from ConcurrentChildCommandsExecutionCallback nor from SerialChildCommandsExecutionCallback in the attached log. So there is a chance that CoCo stopped monitoring all the executed commands on this environment.

Ravi, can you please take a look?

Comment 11 Michal Skrivanek 2018-10-04 06:49:16 UTC
anything in server.log perhaps?

Comment 13 Ravi Nori 2018-10-17 10:57:36 UTC
@Arik

Do you have a reproducer?

I tried cloning a VM from template and the AddVmFromTemplateCommand succeeded and the disk is unlocked.

Comment 14 Arik 2018-10-17 20:06:05 UTC
(In reply to Ravi Nori from comment #13)
> @Arik
> 
> Do you have a reproducer?

Unfortunately, no.

We encountered a similar situation in which CoCo stopped functioning (in regards to jobs monitoring) on rhv.tlv and didn't manage to reproduce it back then and I've been told that the scale team encountered that several times quite recently.

Apparently, there is no easy way to reproduce it. We are usually told about that when the jobs stopped being monitored which causes commands to get stuck (it doesn't necessarily affect AddVmFromTemplate, it may happen to other commands as well) and we don't know what happens exactly when the monitoring stops.

IIRC, Roy prepared a script/configuration that should allow us to instrument the code once it happens so we'll be able to get more information, but I don't know if it has ever been applied.

Roy, can you please shed some light on this?

> 
> I tried cloning a VM from template and the AddVmFromTemplateCommand
> succeeded and the disk is unlocked.

Right, I also tested this flow - it generally works properly when CoCo functions well.

Comment 15 Roy Golan 2018-10-22 10:48:15 UTC
Arik please the see the byteman.sh script I posted under the contrib folder https://github.com/oVirt/ovirt-engine/blob/master/contrib/byteman.sh

I believe you'll figure out from the script how to talior this to coco, if not let me know. For the record a slightly modified version was used by QE to put an artificial delay into one of our command to reproduce a scenario and it works.

Comment 16 Martin Perina 2018-10-30 14:14:30 UTC
Patch attached to wrong bug

Comment 17 Pavel Gashev 2018-10-31 09:26:05 UTC
Note that we hit the same issue on the latest oVirt 4.1. Commands related to creating disks (add disks, creating snapshots, templates) stuck at some point. Workaround is restarting ovirt-engine. ConcurrentChildCommandsExecutionCallback appears in logs after engine restart.

Comment 18 Marina Kalinin 2018-11-08 12:58:49 UTC
(In reply to Pavel Gashev from comment #17)
> Note that we hit the same issue on the latest oVirt 4.1. Commands related to
> creating disks (add disks, creating snapshots, templates) stuck at some
> point. Workaround is restarting ovirt-engine.
> ConcurrentChildCommandsExecutionCallback appears in logs after engine
> restart.

Did you mean oVirt 4.2?

Comment 19 Pavel Gashev 2018-11-08 13:04:49 UTC
No, I mean that oVirt 4.1 has this bug too.

Comment 20 Pavel Gashev 2018-11-16 09:45:33 UTC
Ok, I hit it again on oVirt 4.1.11.2 / vdsm 4.19.51.

After it stuck on one VM, I executed the script and tried to make a snapshot on another VM. It stuck exactly in the same way as described at https://access.redhat.com/solutions/3629221

There is nothing special in the trace log. Only GetStatsVDS, GetAllVmStatsVDS, SpmStatusVDS and:

2018-11-16 08:56:25,250Z INFO  [stdout] (DefaultQuartzScheduler10) *** execute vdsbroker HSMGetAllTasksStatusesVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d]
2018-11-16 08:56:25,257Z INFO  [stdout] (DefaultQuartzScheduler10) *** execute vdsbroker HSMGetAllTasksStatusesVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d] return value: {17ee7349-c57b-486d-9ac5-b67452c751fe=AsyncTaskStatus:{status='finished', result='success', exception='null', message=''}}
2018-11-16 08:56:25,314Z INFO  [stdout] (org.ovirt.thread.pool-7-thread-32) *** execute vdsbroker HSMClearTaskVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d]
2018-11-16 08:56:25,328Z INFO  [stdout] (org.ovirt.thread.pool-7-thread-32) *** execute vdsbroker HSMClearTaskVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d] return value: null

Disk is created, but the engine doesn't even try to go further and make snapshot.

After engine restart the scheduler received ConcurrentChildCommandsExecutionCallback and everything went further: 

2018-11-16 09:10:23,494Z INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler8) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] Command 'CreateAllSnapshotsFromVm' id: 
'12906297-8160-423e-9f62-1d217a7aac62' child commands '[5e9de915-2637-410a-8a55-3518226ce751]' executions were completed, status 'SUCCEEDED'
2018-11-16 09:10:29,604Z INFO  [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand' successfully.
2018-11-16 09:10:29,652Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] START, SnapshotVDSCommand(HostName = c5b6, SnapshotVDSCommandParameters:{runAsync='true', hostId='61d48552-ff67-4a67-8aea-289d8d9a6151', vmId='876bba52-080b-460f-a190-33f378a5b273'}), log id: c9eb87a
2018-11-16 09:10:30,730Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] FINISH, SnapshotVDSCommand, log id: c9eb87a
2018-11-16 09:10:30,749Z INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand' successfully.

Comment 21 Roy Golan 2018-11-21 07:50:32 UTC
There is probably a real issue here but triggering it is tricky. We don't hit it in CI, and I'm pretty sure that scale and performance QE team didn't see that as well. So in order to make progress we must get a thread dump - otherwise it is hard to tell what went wrong. A Debug log would also be nice. Reminder - you can set the debug log level at runtime

Comment 22 Pavel Gashev 2018-11-22 09:05:23 UTC
Created attachment 1507894 [details]
Thread dump for 4.1.11

Comment 23 Pavel Gashev 2018-11-22 09:14:33 UTC
Please find attached the thread dump.

Also I have bmsubmitted the following rule:

RULE trace invokeCallbackMethods
CLASS org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller
METHOD invokeCallbackMethods
AT ENTRY
IF TRUE
DO traceln("*** execute invokeCallbackMethods")
ENDRULE

It shows no traces. So invokeCallbackMethods is NOT executed.

Comment 24 Pavel Gashev 2018-12-17 12:56:58 UTC
Note that the Quartz threads could be constantly busy by statistics if you have more than 10 hosts. More hosts you have, more VMs are running on hosts, slower disk for PostgreSQL - more chances to hit this bug. Moving the engine to SSD resolves the issue.

Comment 26 Elad 2019-02-06 12:28:33 UTC
Ravi, unless you have any other suggestion, as the steps to reproduce here are unclear, we'll verify based on regression testing results.

Comment 27 Ravi Nori 2019-02-06 12:42:07 UTC
We don't have a reproducer for the issue so verifying based on regression testing results sounds good.

Comment 28 Yosi Ben Shimon 2019-02-07 12:45:00 UTC
No regressions related to this were found.
Currently, moving to VERIFIED.

Comment 29 Yosi Ben Shimon 2019-02-07 14:02:59 UTC
Last execution envrinoment:
ovirt-engine-4.3.0.4-0.1.el7.noarch
vdsm-4.30.8-2.el7ev.x86_64

Comment 31 errata-xmlrpc 2019-05-08 12:38:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 32 Daniel Gur 2019-08-28 13:12:18 UTC
sync2jira

Comment 33 Daniel Gur 2019-08-28 13:16:30 UTC
sync2jira


Note You need to log in before you can comment on or make changes to this bug.