Description of problem:
I can't reproduce this on 3.6.11 or on 4.1.5. A customer hit it on 3.6.11. It looks like ProcessDownVmCommand should not have been called. A race? See the details:

1) The user grabs a VM:

2017-09-06 16:56:07,822 INFO [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-3) [24c593e1] Succeeded giving user '82a64380-5b61-4c34-b5e9-a54471f28bc9' permission to Vm '7074cd6a-96cc-4956-a549-5e2323e7806d'
2017-09-06 16:56:08,037 INFO [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [3b525f84] Running command: RunVmCommand internal: true. Entities affected : ID: 7074cd6a-96cc-4956-a549-5e2323e7806d Type: VMAction group RUN_VM with role type USER
2017-09-06 16:56:08,039 INFO [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [3b525f84] Creating stateless snapshot for VM 'XXXX-127' (7074cd6a-96cc-4956-a549-5e2323e7806d)
2017-09-06 16:56:15,131 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (org.ovirt.thread.pool-6-thread-33) [3b525f84] START, CreateVmVDSCommand( CreateVmVDSCommandParameters:{runAsync='true', hostId='6b33b03b-9a0f-4c3d-8281-10aeb6d0e2e4', vmId='7074cd6a-96cc-4956-a549-5e2323e7806d', vm='VM [XXXX-127]'}), log id: 1704041c
2017-09-06 16:57:16,556 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-32) [] VM '7074cd6a-96cc-4956-a549-5e2323e7806d'(XXXX-127) moved from 'PoweringUp' --> 'Up'

2) Restart the Manager:

2017-09-06 17:03:20,033 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 44) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

3) Several of these appear right at the beginning, apparently one for each VM:

2017-09-06 17:03:37,006 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-9) [6be3ad2e] Running command: ProcessDownVmCommand internal: true.

4) It detaches the user:

2017-09-06 17:03:37,048 INFO [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (org.ovirt.thread.pool-6-thread-9) [7bb00284] Deleting snapshots for stateless vm '7074cd6a-96cc-4956-a549-5e2323e7806d'
2017-09-06 17:03:37,054 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-6-thread-9) [47733928] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 7074cd6a-96cc-4956-a549-5e2323e7806d Type: VM
2017-09-06 17:03:37,233 WARN [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (org.ovirt.thread.pool-6-thread-9) [5067ca5d] CanDoAction of action 'RestoreAllSnapshots' failed for user SYSTEM. Reasons: VAR__ACTION__REVERT_TO,VAR__TYPE__SNAPSHOT,ACTION_TYPE_FAILED_VM_IS_NOT_DOWN

5) And yes, the VM shows as Up shortly after, status=Up:

2017-09-06 17:04:00,866 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-22) [] FINISH, FullListVDSCommand, return: [... vmId=7074cd6a-96cc-4956-a549-5e2323e7806d, status=Up... ], log id: 50922a21

Why did it try to restore? Just to double-check:

     vm_name     | console_disconnect_action
-----------------+---------------------------
 XXXX-127        | LOCK_SCREEN

Version-Release number of selected component (if applicable):
rhevm-3.6.11.3-0.1.el6.noarch

How reproducible:
Unknown.
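For reference, that last check was run against the engine database; a query along these lines returns the value shown above. This is only a sketch: it assumes the default 'engine' database name and that console_disconnect_action is a column of vm_static, as in the 3.6 schema.

  -- assumed: default 'engine' DB, console_disconnect_action on vm_static (3.6 schema)
  select vm_name, console_disconnect_action
    from vm_static
   where vm_name = 'XXXX-127';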
The host is 3.5 (vdsm-4.16.38, right?), so it doesn't use VM status events. There's not much useful information in the logs; we would need a reproducer or debug-level logs to be able to investigate further.
Hi Michal,

(In reply to Michal Skrivanek from comment #8)
> The host is 3.5 (vdsm-4.16.38, right?), so it doesn't use VM status events.

Exactly.

> There's not much useful information in the logs; we would need a reproducer
> or debug-level logs to be able to investigate further.

Could you be more specific about which loggers you need at DEBUG level? Just org.ovirt?
Perhaps org.ovirt.engine.core.bll would be enough; also the vdsm side. Do we have a reproducer? The fact that the host is 3.5 might be significant.
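For reference, a sketch of how those levels are typically raised; the file paths and element placement below are assumptions based on a stock 3.6 engine and an el6 vdsm, so adjust to the actual install.

On the engine side, add a logger to the JBoss logging subsystem in /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in and restart ovirt-engine (the file handler's level may also need lowering so DEBUG messages reach engine.log):

  <!-- assumed placement: inside the existing logging subsystem definition -->
  <logger category="org.ovirt.engine.core.bll">
      <level name="DEBUG"/>
  </logger>

On the vdsm side, vdsm 4.16 usually logs at DEBUG already; if not, the levels in /etc/vdsm/logger.conf (for example in the [logger_root] and [logger_vds] sections) can be set to DEBUG and vdsmd restarted.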
(In reply to Michal Skrivanek from comment #10)
> Perhaps org.ovirt.engine.core.bll would be enough; also the vdsm side. Do we
> have a reproducer? The fact that the host is 3.5 might be significant.

Thanks. Yes, I'll try it now, and if it doesn't reproduce we will ask the customer.
Michal, I've reproduced it on 3.6.11 + 4.16.38, so it seems to be confined to RHEL 6 hosts. Still, I think we should understand what is going on to ensure it can't happen on newer versions. I've attached the engine debug and vdsm logs to the case. Here is what we are after:

2017-09-08 16:54:14,671 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-10) [1a4e9094] Running command: ProcessDownVmCommand internal: true.
2017-09-08 16:54:14,776 INFO [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (org.ovirt.thread.pool-6-thread-10) [4a8039c4] Running command: DetachUserFromVmFromPoolCommand internal: true. Entities affected : ID: 7967e300-7317-4457-986c-873eab585a2c Type: VmPool

It still comes out of nowhere to me...
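One more data point that might help pin down the trigger (a hypothetical diagnostic, assuming the 3.6 schema where vm_dynamic caches the run state as an integer, with Down stored as 0): checking what status the engine had persisted for the affected VM right after the restart would show whether it briefly considered the VM Down before ProcessDownVmCommand fired.

  -- hypothetical check against the engine DB; '<vm guid>' is a placeholder
  select vm_guid, status, exit_status, exit_message
    from vm_dynamic
   where vm_guid = '<vm guid>';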