+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1641430 +++
======================================================================
Description of problem:
Several VMs went through cold reboot treatment within a short time of each other, but they were never started again. The sequence from the engine log follows; a rough model of the flow is sketched after the thread dump at the end of step 6.
1. Reboot
2018-10-22 03:40:55,311+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] VM 'af567438-41b6-47a4-bf6a-e8099b08792f'(<removed>) moved from 'Up' --> 'RebootInProgress'
2. Destroy
2018-10-22 03:40:58,488+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-5) [] START, DestroyVDSCommand(HostName = <removed>, DestroyVmVDSCommandParameters:{hostId='b662414b-d7b7-4a93-8093-5a2c3086e7a1', vmId='af567438-41b6-47a4-bf6a-e8099b08792f', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 36ad9ff9
2018-10-22 03:40:58,496+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-5) [] VM 'af567438-41b6-47a4-bf6a-e8099b08792f'(<removed>) moved from 'RebootInProgress' --> 'Down'
3. Down:
2018-10-22 03:40:58,570+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-5) [] EVENT_ID: VM_DOWN(61), VM <removed> is down. Exit message: Down as a part of the reboot process
4. Cold Reboot:
2018-10-22 03:40:58,571+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-5) [] add VM 'af567438-41b6-47a4-bf6a-e8099b08792f'(<removed>) to cold reboot treatment
2018-10-22 03:40:58,579+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-5) [] EVENT_ID: COLD_REBOOT_VM_DOWN(9,611), VM <removed> is down as a part of cold reboot process
2018-10-22 03:40:58,579+02 INFO [org.ovirt.engine.core.bll.VdsEventListener] (ForkJoinPool-1-worker-5) [] VM is down as a part of cold reboot process. Attempting to restart. VM Name '<removed>', VM Id 'af567438-41b6-47a4-bf6a-e8099b08792f
5. UpdateVm
2018-10-22 03:40:58,619+02 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (EE-ManagedThreadFactory-engine-Thread-389) [62524a71] Running command: ProcessDownVmCommand internal: true.
2018-10-22 03:40:58,868+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (EE-ManagedThreadFactory-engine-Thread-389) [651462f2] Running command: UpdateVmCommand internal: true. Entities affected : ID: af567438-41b6-47a4-bf6a-e8099b08792f Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2018-10-22 03:40:58,952+02 INFO [org.ovirt.engine.core.bll.UpdateGraphicsDeviceCommand] (EE-ManagedThreadFactory-engine-Thread-389) [551097ef] Running command: UpdateGraphicsDeviceCommand internal: true. Entities affected : ID: af567438-41b6-47a4-bf6a-e8099b08792f Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2018-10-22 03:40:58,972+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-389) [551097ef] EVENT_ID: SYSTEM_UPDATE_VM(253), VM <removed> configuration was updated by system.
2018-10-22 03:40:58,977+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (EE-ManagedThreadFactory-engine-Thread-389) [551097ef] Lock freed to object 'EngineLock:{exclusiveLocks='[<removed>=VM_NAME]', sharedLocks='[af567438-41b6-47a4-bf6a-e8099b08792f=VM]'}'
6. Nothing happens. Later the user attempts to start the VM manually, and the start gets stuck:
2018-10-22 05:45:57,537+02 INFO [org.ovirt.engine.core.bll.RunVmCommand] (default task-101) [a91813db-5ad2-4d6e-aa65-64b990422981] Running command: RunVmCommand internal: false. Entities affected : ID: af567438-41b6-47a4-bf6a-e8099b08792f Type: VMAction group RUN_VM with role type USER
job_id | action_type | description | status | correlation_id
--------------------------------------+-------------+-------------------------+---------+--------------------------------------
291f6a6d-6030-4f99-858e-b4781ff75021 | RunVm | Launching VM <removed> | STARTED | a91813db-5ad2-4d6e-aa65-64b990422981
task-101 is doing nothing:
"default task-101" #467 prio=5 os_prio=0 tid=0x00000000090d9800 nid=0x208c waiting on condition [0x00007f5872d63000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006c7755630> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
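
The stack above is the signature of an idle thread-pool worker rather than a blocked RunVmCommand: the thread is parked in LinkedBlockingQueue.take() inside ThreadPoolExecutor.getTask(), i.e. it has no task and is just waiting for the queue to hand it one. A minimal, JDK-only sketch that reproduces the same stack shape (nothing below is engine code; the pool size and thread name are made up for illustration):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IdleWorkerDump {
    public static void main(String[] args) throws Exception {
        // Fixed pool backed by a LinkedBlockingQueue; its idle worker parks in
        // the same frames as the dump above.
        ExecutorService pool = Executors.newFixedThreadPool(1, r -> new Thread(r, "demo-worker"));
        pool.submit(() -> { /* short task, returns immediately */ }).get();
        Thread.sleep(200); // let the worker go back to waiting for the next task

        for (ThreadInfo ti : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            if ("demo-worker".equals(ti.getThreadName())) {
                // Prints WAITING (parking) -> LinkedBlockingQueue.take()
                // -> ThreadPoolExecutor.getTask(), the same shape as "default task-101".
                System.out.println(ti);
            }
        }
        pool.shutdownNow();
    }
}

So the dump does not show RunVm blocked on this thread; whatever was supposed to continue the flow behind the STARTED job either never ran or has already left the thread.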
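
For readers unfamiliar with the flow in steps 1-5, here is a rough, purely hypothetical model of the ordering visible in the log. None of the class or method names below are the real org.ovirt.engine code paths; they only mirror the sequence above: the guest reboot becomes a forced destroy, the resulting Down event both marks the VM for cold reboot treatment and schedules an asynchronous restart attempt, and in parallel ProcessDownVmCommand applies what appears to be pending configuration via UpdateVmCommand. The restart is the step that never shows up afterwards in the log.

// Hypothetical sketch only: these names are NOT the real oVirt engine classes
// or methods; they just model the ordering visible in the engine log above.
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ColdRebootFlowSketch {
    private final Set<UUID> coldRebootVms = ConcurrentHashMap.newKeySet();
    private final ExecutorService engineThreads = Executors.newCachedThreadPool();

    // Steps 1-2: guest reboot detected -> the engine destroys the VM on the host.
    void onRebootInProgress(UUID vmId) {
        destroyOnHost(vmId);                 // DestroyVDSCommand, gracefully=false
    }

    // Steps 3-4: the VM is reported Down -> remember it for cold reboot and
    // schedule an asynchronous restart attempt ("Attempting to restart ...").
    void onVmDown(UUID vmId) {
        coldRebootVms.add(vmId);             // "add VM ... to cold reboot treatment"
        engineThreads.submit(() -> {
            if (coldRebootVms.remove(vmId)) {
                runVm(vmId);                 // expected next step, never seen in the log
            }
        });
        // Step 5: in parallel, ProcessDownVmCommand applies pending configuration
        // changes through UpdateVmCommand on another engine thread.
        engineThreads.submit(() -> applyPendingConfiguration(vmId));
    }

    void destroyOnHost(UUID vmId) { /* host-side destroy */ }
    void runVm(UUID vmId) { /* RunVmCommand equivalent */ }
    void applyPendingConfiguration(UUID vmId) { /* UpdateVmCommand equivalent */ }

    public static void main(String[] args) {
        // Usage: simulate one VM going through the sequence from the log.
        ColdRebootFlowSketch engine = new ColdRebootFlowSketch();
        UUID vmId = UUID.fromString("af567438-41b6-47a4-bf6a-e8099b08792f");
        engine.onRebootInProgress(vmId);
        engine.onVmDown(vmId);
        engine.engineThreads.shutdown();
    }
}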
Version-Release number of selected component (if applicable):
rhvm-4.2.6.4-0.1.el7ev.noarch
How reproducible:
Unknown at this point; this is the initial investigation.
(Originally by Germano Veit Michel)
Did not reproduce on latest 4.2.8.
ovirt-hosted-engine-setup-2.2.33-0.0.master.20181105145335.git470d6c6.el7.noarch
ovirt-hosted-engine-ha-2.2.19-0.0.master.20181002122327.20181002122322.gitb449616.el7.noarch
ovirt-engine-appliance-4.2-20181118.1.el7.noarch
Moving to verified.
Please reopen if you are still getting the issue.
In order to reproduce, I created 6 different VMs from a VM pool, rebooted and powered them off one by one, changed and applied the VM pool's settings while doing so, and then started all 6 VMs just fine.
I repeated the same steps for several iterations and did not hit the issue.
Moving to verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:0121