Bug 1641430 - During cold reboot treatment, RunVm did not run for some VMs
Summary: During cold reboot treatment, RunVm did not run for some VMs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.6
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.0
Target Release: 4.3.0
Assignee: Ravi Nori
QA Contact: Liran Rotenberg
URL:
Whiteboard:
Depends On:
Blocks: 1647384
 
Reported: 2018-10-22 05:16 UTC by Germano Veit Michel
Modified: 2022-03-13 15:49 UTC
CC List: 16 users

Fixed In Version: ovirt-engine-4.3.0_alpha
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1647384 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:38:43 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
Logs from the 21st from the engine (1.48 MB, application/x-gzip)
2018-10-22 11:42 UTC, Roman Hodain


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44322 0 None None None 2021-12-10 18:12:44 UTC
Red Hat Product Errata RHEA-2019:1085 0 None None None 2019-05-08 12:39:03 UTC
oVirt gerrit 95070 0 master MERGED engine: During cold reboot treatment, RunVm did not run for some VMs 2021-01-21 08:01:35 UTC
oVirt gerrit 95192 0 ovirt-engine-4.2 MERGED engine: During cold reboot treatment, RunVm did not run for some VMs 2021-01-21 08:01:35 UTC

Description Germano Veit Michel 2018-10-22 05:16:56 UTC
Description of problem:

Several VMs went through cold reboot treatment within a short time of each other, but they were never started again.

1. Reboot

2018-10-22 03:40:55,311+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] VM 'af567438-41b6-47a4-bf6a-e8099b08792f'(<removed>) moved from 'Up' --> 'RebootInProgress'

2. Destroy

2018-10-22 03:40:58,488+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-5) [] START, DestroyVDSCommand(HostName = <removed>, DestroyVmVDSCommandParameters:{hostId='b662414b-d7b7-4a93-8093-5a2c3086e7a1', vmId='af567438-41b6-47a4-bf6a-e8099b08792f', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 36ad9ff9

2018-10-22 03:40:58,496+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-5) [] VM 'af567438-41b6-47a4-bf6a-e8099b08792f'(<removed>) moved from 'RebootInProgress' --> 'Down'

3. Down:

2018-10-22 03:40:58,570+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-5) [] EVENT_ID: VM_DOWN(61), VM <removed> is down. Exit message: Down as a part of the reboot process

4. Cold Reboot:

2018-10-22 03:40:58,571+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-5) [] add VM 'af567438-41b6-47a4-bf6a-e8099b08792f'(<removed>) to cold reboot treatment

2018-10-22 03:40:58,579+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-5) [] EVENT_ID: COLD_REBOOT_VM_DOWN(9,611), VM <removed> is down as a part of cold reboot process

2018-10-22 03:40:58,579+02 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (ForkJoinPool-1-worker-5) [] VM is down as a part of cold reboot process. Attempting to restart. VM Name '<removed>', VM Id 'af567438-41b6-47a4-bf6a-e8099b08792f

5. UpdateVm

2018-10-22 03:40:58,619+02 INFO  [org.ovirt.engine.core.bll.ProcessDownVmCommand] (EE-ManagedThreadFactory-engine-Thread-389) [62524a71] Running command: ProcessDownVmCommand internal: true.

2018-10-22 03:40:58,868+02 INFO  [org.ovirt.engine.core.bll.UpdateVmCommand] (EE-ManagedThreadFactory-engine-Thread-389) [651462f2] Running command: UpdateVmCommand internal: true. Entities affected :  ID: af567438-41b6-47a4-bf6a-e8099b08792f Type: VMAction group EDIT_VM_PROPERTIES with role type USER

2018-10-22 03:40:58,952+02 INFO  [org.ovirt.engine.core.bll.UpdateGraphicsDeviceCommand] (EE-ManagedThreadFactory-engine-Thread-389) [551097ef] Running command: UpdateGraphicsDeviceCommand internal: true. Entities affected :  ID: af567438-41b6-47a4-bf6a-e8099b08792f Type: VMAction group EDIT_VM_PROPERTIES with role type USER

2018-10-22 03:40:58,972+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-389) [551097ef] EVENT_ID: SYSTEM_UPDATE_VM(253), VM <removed> configuration was updated by system.

2018-10-22 03:40:58,977+02 INFO  [org.ovirt.engine.core.bll.UpdateVmCommand] (EE-ManagedThreadFactory-engine-Thread-389) [551097ef] Lock freed to object 'EngineLock:{exclusiveLocks='[<removed>=VM_NAME]', sharedLocks='[af567438-41b6-47a4-bf6a-e8099b08792f=VM]'}'

6. Nothing happens. Later, the user attempts to start the VM manually, and the RunVm command gets stuck:

2018-10-22 05:45:57,537+02 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-101) [a91813db-5ad2-4d6e-aa65-64b990422981] Running command: RunVmCommand internal: false. Entities affected :  ID: af567438-41b6-47a4-bf6a-e8099b08792f Type: VMAction group RUN_VM with role type USER


                job_id                | action_type |       description       | status  |            correlation_id            
--------------------------------------+-------------+-------------------------+---------+--------------------------------------
 291f6a6d-6030-4f99-858e-b4781ff75021 | RunVm       | Launching VM <removed> | STARTED | a91813db-5ad2-4d6e-aa65-64b990422981
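
For reference, the rows above appear to come from the engine database's job table. Below is a minimal sketch for listing RunVm jobs stuck in STARTED state directly from that table; the JDBC URL, database name, user and password are placeholders for illustration, not values taken from this report, and the column names are those shown in the output above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch only: list RunVm jobs still reported as STARTED.
// Connection details are placeholders; adjust to the actual engine DB.
public class ListStuckRunVmJobs {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/engine", "engine", "<password>");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT job_id, action_type, description, status, correlation_id "
                 + "FROM job WHERE action_type = 'RunVm' AND status = 'STARTED'")) {
            while (rs.next()) {
                System.out.printf("%s | %s | %s | %s | %s%n",
                        rs.getString("job_id"), rs.getString("action_type"),
                        rs.getString("description"), rs.getString("status"),
                        rs.getString("correlation_id"));
            }
        }
    }
}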


task-101 is doing nothing:

"default task-101" #467 prio=5 os_prio=0 tid=0x00000000090d9800 nid=0x208c waiting on condition [0x00007f5872d63000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c7755630> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
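
That stack is the signature of an idle thread-pool worker: the thread is parked in LinkedBlockingQueue.take() inside ThreadPoolExecutor.getTask(), waiting for new work, rather than blocked while executing RunVmCommand. A minimal, self-contained sketch (plain JDK, nothing engine-specific) that produces the same WAITING (parking) stack in a thread dump:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: once the single submitted task finishes, the pool worker returns
// to LinkedBlockingQueue.take() inside getTask(), and a thread dump taken with
// jstack <pid> shows the same park/await/take frames as "default task-101" above.
public class IdlePoolWorkerDump {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                1, r -> new Thread(r, "idle-worker-demo"));
        pool.submit(() -> System.out.println("request handled"));
        // Keep the JVM alive long enough to capture the thread dump.
        TimeUnit.SECONDS.sleep(60);
        pool.shutdown();
    }
}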


Version-Release number of selected component (if applicable):
rhvm-4.2.6.4-0.1.el7ev.noarch

How reproducible:
Unknown at this point; this is the initial investigation.

Comment 14 Roman Hodain 2018-10-22 11:42:50 UTC
Created attachment 1496370 [details]
Logs from the 21st from the engine

Comment 23 Liran Rotenberg 2019-01-07 08:59:01 UTC
Verified on:
ovirt-engine-4.3.0-0.5.alpha1.el7.noarch
vdsm-4.30.4-1.el7ev.x86_64

Steps:
1. Run a number of VMs.
2. Change a VM property that only takes effect on the next run (ballooning, for example).
3. Reboot the VMs via SSH (cold reboot) at the same time; see the sketch after these steps.

# The idea of rebooting at the same time is to create a load peak on the engine.
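
A minimal sketch of step 3, firing the in-guest reboots in parallel so they hit the engine at roughly the same time; the guest addresses and key-based root SSH access are assumptions for illustration, and any mechanism that reboots the guests simultaneously serves the same purpose.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: SSH into each guest and issue "reboot" concurrently.
// Guest addresses are placeholders; assumes password-less root SSH to the guests.
public class ConcurrentGuestReboot {
    public static void main(String[] args) throws Exception {
        List<String> guests = List.of(
                "vm-01.example.com", "vm-02.example.com", "vm-03.example.com");
        ExecutorService pool = Executors.newFixedThreadPool(guests.size());
        for (String guest : guests) {
            pool.submit(() -> {
                try {
                    new ProcessBuilder("ssh", "root@" + guest, "reboot")
                            .inheritIO().start().waitFor();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.MINUTES);
    }
}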

Result:
After doing the above steps in a loop, the issue did not reproduce.

Comment 25 errata-xmlrpc 2019-05-08 12:38:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

