Cause:
If a 3.6 engine had a 3.5 cluster with 3.5 VDSM on it, after restart the monitoring got confused and considered all the running VMs to be shut off during restart.
Consequence:
As a consequence, it tried to clear them up. One of the steps of this cleanup is to remove the user permissions from the pooled VMs. Hence, after engine restart, all the pooled VMs have lost their belonging to the user who have taken them.
Fix:
Fixed monitoring to properly find the VMs also on initialization
Result:
Now, after engine restart, the pooled VMs will still be assigned to the user who have taken them before engine restart.
DescriptionGermano Veit Michel
2017-09-07 04:43:40 UTC
Description of problem:
I can't reproduce this on 3.6.11 or on 4.1.5. Customer hit on 3.6.11.
Looks like ProcessDownVmCommand should not have been called. Race?
See the details:
1) User Grabs a VM
2017-09-06 16:56:07,822 INFO [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-3) [24c593e1] Succeeded giving user '82a64380-5b61-4c34-b5e9-a54471f28bc9' permission to Vm '7074cd6a-96cc-4956-a549-5e2323e7806d'
2017-09-06 16:56:08,037 INFO [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [3b525f84] Running command: RunVmCommand internal: true. Entities affected : ID: 7074cd6a-96cc-4956-a549-5e2323e7806d Type: VMAction group RUN_VM with role type USER
2017-09-06 16:56:08,039 INFO [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [3b525f84] Creating stateless snapshot for VM 'XXXX-127' (7074cd6a-96cc-4956-a549-5e2323e7806d)
2017-09-06 16:56:15,131 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (org.ovirt.thread.pool-6-thread-33) [3b525f84] START, CreateVmVDSCommand( CreateVmVDSCommandParameters:{runAsync='true', hostId='6b33b03b-9a0f-4c3d-8281-10aeb6d0e2e4', vmId='7074cd6a-96cc-4956-a549-5e2323e7806d', vm='VM [XXXX-127]'}), log id: 1704041c
2017-09-06 16:57:16,556 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-32) [] VM '7074cd6a-96cc-4956-a549-5e2323e7806d'(XXXX-127) moved from 'PoweringUp' --> 'Up'
2) Restart the Manager:
2017-09-06 17:03:20,033 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 44) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.
3) Several of These right at the beginning, apparently one for each VM.
2017-09-06 17:03:37,006 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-9) [6be3ad2e] Running command: ProcessDownVmCommand internal: true.
4) Detaches User:
2017-09-06 17:03:37,048 INFO [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (org.ovirt.thread.pool-6-thread-9) [7bb00284] Deleting snapshots for stateless vm '7074cd6a-96cc-4956-a549-5e2323e7806d'
2017-09-06 17:03:37,054 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-6-thread-9) [47733928] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 7074cd6a-96cc-4956-a549-5e2323e7806d Type: VM
2017-09-06 17:03:37,233 WARN [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (org.ovirt.thread.pool-6-thread-9) [5067ca5d] CanDoAction of action 'RestoreAllSnapshots' failed for user SYSTEM. Reasons: VAR__ACTION__REVERT_TO,VAR__TYPE__SNAPSHOT,ACTION_TYPE_FAILED_VM_IS_NOT_DOWN
5) And yes, it shows the VM as up shortly after, status=Up.
2017-09-06 17:04:00,866 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-22) [] FINISH, FullListVDSCommand, return: [... vmId=7074cd6a-96cc-4956-a549-5e2323e7806d, status=Up... ], log id: 50922a21
Why it tried to restore?
Just to double check:
vm_name | console_disconnect_action
-----------------+---------------------------
XXXX-127 | LOCK_SCREEN
Version-Release number of selected component (if applicable):
rhevm-3.6.11.3-0.1.el6.noarch
How reproducible:
Unknown.
the host is 3.5 (vdsm-4.16.38, right?). Then it doesn't use VM status events. There's not much useful info in logs, we would need a reproducer or a debug level logs to be able to investigate further
Comment 9Germano Veit Michel
2017-09-08 06:14:33 UTC
Hi Michal,
(In reply to Michal Skrivanek from comment #8)
> the host is 3.5 (vdsm-4.16.38, right?). Then it doesn't use VM status
> events.
Exactly.
> There's not much useful info in logs, we would need a reproducer or
> a debug level logs to be able to investigate further
Could you be more specific on what loggers you need on DEBUG level? Just org.ovirt?
Comment 10Michal Skrivanek
2017-09-08 06:17:29 UTC
perhaps org.ovirt.engine.core.bll would be enough,also the vdsm side. Do we have a reproducer? The fact that the host in 3.5 might be significant
Comment 11Germano Veit Michel
2017-09-08 06:20:16 UTC
(In reply to Michal Skrivanek from comment #10)
> perhaps org.ovirt.engine.core.bll would be enough,also the vdsm side. Do we
> have a reproducer? The fact that the host in 3.5 might be significant
Thanks.
Yes, I'll try it now and if it doesn't reproduce we will ask the customer.
Comment 14Germano Veit Michel
2017-09-08 07:10:33 UTC
Michal,
I've reproduced in on 3.6.11 + 4.16.38. So it seems to be confined to RHEL6 hosts. Still, I think we should understand what is going on to ensure it can't happen on newer versions. Attached engine debug and vdsm logs to the case.
Here is what we are after:
2017-09-08 16:54:14,671 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-10) [1a4e9094] Running command: ProcessDownVmCommand internal: true.
2017-09-08 16:54:14,776 INFO [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (org.ovirt.thread.pool-6-thread-10) [4a8039c4] Running command: DetachUserFromVmFromPoolCommand internal: true. Entities affected : ID: 7967e300-7317-4457-986c-873eab585a2c Type: VmPool
Still comes out of nowhere to me...
Description of problem: I can't reproduce this on 3.6.11 or on 4.1.5. Customer hit on 3.6.11. Looks like ProcessDownVmCommand should not have been called. Race? See the details: 1) User Grabs a VM 2017-09-06 16:56:07,822 INFO [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-3) [24c593e1] Succeeded giving user '82a64380-5b61-4c34-b5e9-a54471f28bc9' permission to Vm '7074cd6a-96cc-4956-a549-5e2323e7806d' 2017-09-06 16:56:08,037 INFO [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [3b525f84] Running command: RunVmCommand internal: true. Entities affected : ID: 7074cd6a-96cc-4956-a549-5e2323e7806d Type: VMAction group RUN_VM with role type USER 2017-09-06 16:56:08,039 INFO [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [3b525f84] Creating stateless snapshot for VM 'XXXX-127' (7074cd6a-96cc-4956-a549-5e2323e7806d) 2017-09-06 16:56:15,131 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (org.ovirt.thread.pool-6-thread-33) [3b525f84] START, CreateVmVDSCommand( CreateVmVDSCommandParameters:{runAsync='true', hostId='6b33b03b-9a0f-4c3d-8281-10aeb6d0e2e4', vmId='7074cd6a-96cc-4956-a549-5e2323e7806d', vm='VM [XXXX-127]'}), log id: 1704041c 2017-09-06 16:57:16,556 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-32) [] VM '7074cd6a-96cc-4956-a549-5e2323e7806d'(XXXX-127) moved from 'PoweringUp' --> 'Up' 2) Restart the Manager: 2017-09-06 17:03:20,033 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 44) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'. 3) Several of These right at the beginning, apparently one for each VM. 2017-09-06 17:03:37,006 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-9) [6be3ad2e] Running command: ProcessDownVmCommand internal: true. 4) Detaches User: 2017-09-06 17:03:37,048 INFO [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (org.ovirt.thread.pool-6-thread-9) [7bb00284] Deleting snapshots for stateless vm '7074cd6a-96cc-4956-a549-5e2323e7806d' 2017-09-06 17:03:37,054 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-6-thread-9) [47733928] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 7074cd6a-96cc-4956-a549-5e2323e7806d Type: VM 2017-09-06 17:03:37,233 WARN [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (org.ovirt.thread.pool-6-thread-9) [5067ca5d] CanDoAction of action 'RestoreAllSnapshots' failed for user SYSTEM. Reasons: VAR__ACTION__REVERT_TO,VAR__TYPE__SNAPSHOT,ACTION_TYPE_FAILED_VM_IS_NOT_DOWN 5) And yes, it shows the VM as up shortly after, status=Up. 2017-09-06 17:04:00,866 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-22) [] FINISH, FullListVDSCommand, return: [... vmId=7074cd6a-96cc-4956-a549-5e2323e7806d, status=Up... ], log id: 50922a21 Why it tried to restore? Just to double check: vm_name | console_disconnect_action -----------------+--------------------------- XXXX-127 | LOCK_SCREEN Version-Release number of selected component (if applicable): rhevm-3.6.11.3-0.1.el6.noarch How reproducible: Unknown.