Description of problem:
A VM was cloned from a template. The sequence on the engine and SPM host seemed to complete, at least the task was removed from the database. However, the newly-created VM and its disk image remain locked, which prevents any subsequent use of this template.

Version-Release number of selected component (if applicable):
RHV 4.2.5
RHEL 7.5 host (SPM) with vdsm-4.20.35-1

How reproducible:
Not reproducible; no known steps to reproduce.
It looks like command monitoring has stopped for some reason. On a normal execution, AddVmFromTemplate fires CreateCloneOfTemplate, which fires CopyImageGroupWithData. At that point, CopyImageGroupWithData should be monitored as a child of CreateCloneOfTemplate, resulting in log messages like:

[org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [82947a98-d464-4d80-ab04-22ac8d0074c8] Command 'CreateCloneOfTemplate' (id: '4e4608d3-d792-45d6-b9ab-70e9d50cbb9c') waiting on child command id: '470d0df7-9f60-4735-ae23-66775aa44493' type:'CopyImageGroupWithData' to complete

AddVmFromTemplate should likewise be waiting on its child command(s):

2018-10-03 18:00:31,767+03 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [82947a98-d464-4d80-ab04-22ac8d0074c8] Command 'AddVmFromTemplate' (id: '8113f23e-d196-448e-a571-45c5ed06bad3') waiting on child command id: '4e4608d3-d792-45d6-b9ab-70e9d50cbb9c' type:'CreateCloneOfTemplate' to complete

until at some point the child commands are done and AddVmFromTemplate finishes:

2018-10-03 18:07:06,298+03 INFO [org.ovirt.engine.core.bll.AddVmFromTemplateCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [82947a98-d464-4d80-ab04-22ac8d0074c8] Ending command 'org.ovirt.engine.core.bll.AddVmFromTemplateCommand' successfully.

However, in the reported case there are no relevant log messages from ConcurrentChildCommandsExecutionCallback and no log message for AddVmFromTemplate finishing. In general, I don't see any log from ConcurrentChildCommandsExecutionCallback or SerialChildCommandsExecutionCallback in the attached log, so there is a chance that CoCo stopped monitoring all executed commands on this environment. Ravi, can you please take a look?
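For anyone triaging a similar case, a way to see which commands CoCo still considers in-flight is to look at the command_entities table in the engine database. This is a minimal sketch, assuming the default 'engine' database and that this release's schema has the callback_enabled/callback_notified columns; verify the column names against your schema before relying on it:

# Hypothetical diagnostic query: list commands whose callbacks are enabled
# but were never notified, i.e. commands the CoCo poller should still be handling.
su - postgres -c "psql -d engine -c \"
  SELECT command_id, command_type, status, created_at
  FROM command_entities
  WHERE callback_enabled = true AND callback_notified = false
  ORDER BY created_at;\""

If such rows keep piling up while ConcurrentChildCommandsExecutionCallback stays silent in engine.log, that matches the behavior described above.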
Anything in server.log, perhaps?
@Arik

Do you have a reproducer? I tried cloning a VM from a template and the AddVmFromTemplateCommand succeeded and the disk is unlocked.
(In reply to Ravi Nori from comment #13)
> @Arik
>
> Do you have a reproducer?

Unfortunately, no. We encountered a similar situation, in which CoCo stopped functioning (with regard to job monitoring), on rhv.tlv and didn't manage to reproduce it back then, and I've been told that the scale team has hit it several times quite recently. Apparently, there is no easy way to reproduce it. We usually hear about it once jobs stop being monitored, which causes commands to get stuck (it doesn't necessarily affect AddVmFromTemplate; it can happen to other commands as well), and we don't know what exactly happens when the monitoring stops. IIRC, Roy prepared a script/configuration that should allow us to instrument the code once it happens so we can get more information, but I don't know if it has ever been applied. Roy, can you please shed some light on this?

> I tried cloning a VM from template and the AddVmFromTemplateCommand
> succeeded and the disk is unlocked.

Right, I also tested this flow; it generally works properly when CoCo functions well.
Arik, please see the byteman.sh script I posted under the contrib folder: https://github.com/oVirt/ovirt-engine/blob/master/contrib/byteman.sh
I believe you'll figure out from the script how to tailor this to CoCo; if not, let me know. For the record, a slightly modified version was used by QE to put an artificial delay into one of our commands to reproduce a scenario, and it works.
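To make the artificial-delay trick concrete, here is a minimal hand-rolled sketch of the kind of rule the contrib script drives. The Byteman install path, the PID lookup, and the target command class are assumptions for illustration (pick whichever command you want to slow down); the actual script handles attaching and submitting for you:

# Attach the Byteman agent to the running engine JVM (PID lookup is an assumption;
# adjust to however the engine process is identified on your host).
ENGINE_PID=$(pgrep -f 'jboss-modules.*ovirt-engine')
/opt/byteman/bin/bminstall.sh "$ENGINE_PID"

# delay-command.btm: hold a command's executeCommand() for 30 seconds so the
# window in which CoCo has to keep polling its callback is easy to observe.
cat > /tmp/delay-command.btm <<'EOF'
RULE delay CopyImageGroupWithData execution
CLASS org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupWithDataCommand
METHOD executeCommand
AT ENTRY
IF TRUE
DO Thread.sleep(30000)
ENDRULE
EOF

# Load the rule into the attached agent.
/opt/byteman/bin/bmsubmit.sh -l /tmp/delay-command.btm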
Patch attached to wrong bug
Note that we hit the same issue on the latest oVirt 4.1. Commands related to creating disks (adding disks, creating snapshots, creating templates) get stuck at some point. The workaround is restarting ovirt-engine; ConcurrentChildCommandsExecutionCallback appears in the logs after the engine restart.
(In reply to Pavel Gashev from comment #17)
> Note that we hit the same issue on the latest oVirt 4.1. Commands related to
> creating disks (add disks, creating snapshots, templates) stuck at some
> point. Workaround is restarting ovirt-engine.
> ConcurrentChildCommandsExecutionCallback appears in logs after engine
> restart.

Did you mean oVirt 4.2?
No, I mean that oVirt 4.1 has this bug too.
OK, I hit it again on oVirt 4.1.11.2 / vdsm 4.19.51. After it got stuck on one VM, I executed the script and tried to make a snapshot on another VM. It got stuck in exactly the same way as described at https://access.redhat.com/solutions/3629221

There is nothing special in the trace log, only GetStatsVDS, GetAllVmStatsVDS, SpmStatusVDS, and:

2018-11-16 08:56:25,250Z INFO [stdout] (DefaultQuartzScheduler10) *** execute vdsbroker HSMGetAllTasksStatusesVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d]
2018-11-16 08:56:25,257Z INFO [stdout] (DefaultQuartzScheduler10) *** execute vdsbroker HSMGetAllTasksStatusesVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d] return value: {17ee7349-c57b-486d-9ac5-b67452c751fe=AsyncTaskStatus:{status='finished', result='success', exception='null', message=''}}
2018-11-16 08:56:25,314Z INFO [stdout] (org.ovirt.thread.pool-7-thread-32) *** execute vdsbroker HSMClearTaskVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d]
2018-11-16 08:56:25,328Z INFO [stdout] (org.ovirt.thread.pool-7-thread-32) *** execute vdsbroker HSMClearTaskVDS on Host[c6b1,e0932853-908d-4def-ac2f-32794aec216d] return value: null

The disk is created, but the engine doesn't even try to go further and create the snapshot. After an engine restart, the scheduler ran ConcurrentChildCommandsExecutionCallback and everything proceeded:

2018-11-16 09:10:23,494Z INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler8) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] Command 'CreateAllSnapshotsFromVm' id: '12906297-8160-423e-9f62-1d217a7aac62' child commands '[5e9de915-2637-410a-8a55-3518226ce751]' executions were completed, status 'SUCCEEDED'
2018-11-16 09:10:29,604Z INFO [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand' successfully.
2018-11-16 09:10:29,652Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] START, SnapshotVDSCommand(HostName = c5b6, SnapshotVDSCommandParameters:{runAsync='true', hostId='61d48552-ff67-4a67-8aea-289d8d9a6151', vmId='876bba52-080b-460f-a190-33f378a5b273'}), log id: c9eb87a
2018-11-16 09:10:30,730Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] FINISH, SnapshotVDSCommand, log id: c9eb87a
2018-11-16 09:10:30,749Z INFO [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (DefaultQuartzScheduler6) [1b828b66-3103-4a2e-a9f1-aa0afbc5a249] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand' successfully.
There is probably a real issue here, but triggering it is tricky. We don't hit it in CI, and I'm pretty sure the scale and performance QE team didn't see it either. So in order to make progress we need a thread dump; otherwise it is hard to tell what went wrong. A debug log would also be nice. Reminder: you can set the debug log level at runtime.
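For whoever hits this next, a rough sketch of collecting both, assuming the engine JVM runs as the ovirt user on the bundled WildFly and that jboss-cli.sh can reach the local management interface; the PID lookup, paths, controller address, and logger name are assumptions to adjust per setup:

# Thread dump of the engine JVM with standard JDK tooling
# (run jstack as the same user that owns the JVM; 'ovirt' is an assumption).
ENGINE_PID=$(pgrep -f 'jboss-modules.*ovirt-engine')
sudo -u ovirt jstack "$ENGINE_PID" > /tmp/engine-threads-$(date +%s).txt

# Raise the bll log level to DEBUG at runtime through the WildFly CLI.
# The jboss-cli.sh path and the 127.0.0.1:8706 controller address are assumptions;
# use :write-attribute(name=level,value=DEBUG) instead of :add if the logger already exists.
/usr/share/ovirt-engine-wildfly/bin/jboss-cli.sh --connect --controller=127.0.0.1:8706 \
  '/subsystem=logging/logger=org.ovirt.engine.core.bll:add(level=DEBUG)'

If jstack is unavailable, kill -3 on the engine PID also writes a thread dump to wherever the JVM's stdout goes.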
Created attachment 1507894 [details] Thread dump for 4.1.11
Please find the thread dump attached. I have also submitted the following rule via bmsubmit:

RULE trace invokeCallbackMethods
CLASS org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller
METHOD invokeCallbackMethods
AT ENTRY
IF TRUE
DO traceln("*** execute invokeCallbackMethods")
ENDRULE

It shows no traces, so invokeCallbackMethods is NOT being executed.
Note that the Quartz threads can be constantly busy with statistics if you have more than 10 hosts. The more hosts you have, the more VMs are running on them, and the slower the PostgreSQL disk is, the higher the chance of hitting this bug. Moving the engine to an SSD resolves the issue.
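A quick way to sanity-check the starvation theory from a thread dump; the thread-name patterns below are based on the scheduler thread names visible in the logs above and may differ between releases:

# Take a thread dump (same assumptions on PID lookup and JVM user as earlier).
ENGINE_PID=$(pgrep -f 'jboss-modules.*ovirt-engine')
sudo -u ovirt jstack "$ENGINE_PID" > /tmp/engine-threads.txt

# How many scheduler worker threads exist...
grep -c -E '"DefaultQuartzScheduler|"engineScheduled' /tmp/engine-threads.txt

# ...and how many of them are currently RUNNABLE (busy) rather than waiting.
grep -A 1 -E '"DefaultQuartzScheduler|"engineScheduled' /tmp/engine-threads.txt \
  | grep -c 'java.lang.Thread.State: RUNNABLE'

If all of them are RUNNABLE in statistics or database calls and none is anywhere near CommandCallbacksPoller, that would support the explanation above.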
Ravi, unless you have another suggestion, since the steps to reproduce here are unclear, we'll verify based on regression testing results.
We don't have a reproducer for the issue, so verifying based on regression testing results sounds good.
No regressions related to this were found. Moving to VERIFIED.
Last execution environment:
ovirt-engine-4.3.0.4-0.1.el7.noarch
vdsm-4.30.8-2.el7ev.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1085