1489251 – [3.6 only] User is Detached from VmPool when engine restarts

Bug 1489251 - [3.6 only] User is Detached from VmPool when engine restarts

Summary: [3.6 only] User is Detached from VmPool when engine restarts

Keywords:
Status:	CLOSED DUPLICATE of bug 1490090
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.6.10
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	ovirt-3.6.13
Target Release:	---
Assignee:	Tomas Jelinek
QA Contact:	Nisim Simsolo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-09-07 04:43 UTC by Germano Veit Michel
Modified:	2020-12-14 09:55 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: If a 3.6 engine had a 3.5 cluster with 3.5 VDSM on it, after restart the monitoring got confused and considered all the running VMs to be shut off during restart. Consequence: As a consequence, it tried to clear them up. One of the steps of this cleanup is to remove the user permissions from the pooled VMs. Hence, after engine restart, all the pooled VMs have lost their belonging to the user who have taken them. Fix: Fixed monitoring to properly find the VMs also on initialization Result: Now, after engine restart, the pooled VMs will still be assigned to the user who have taken them before engine restart.
Clone Of:
Environment:
Last Closed:	2017-11-13 13:59:43 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3182601	0	None	None	None	2017-09-13 15:24:37 UTC
oVirt gerrit	81617	0	None	None	None	2017-09-11 10:43:53 UTC

Description Germano Veit Michel 2017-09-07 04:43:40 UTC

Description of problem:

I can't reproduce this on 3.6.11 or on 4.1.5. Customer hit on 3.6.11.
Looks like ProcessDownVmCommand should not have been called. Race?

See the details:

1) User Grabs a VM

2017-09-06 16:56:07,822 INFO  [org.ovirt.engine.core.bll.AttachUserToVmFromPoolAndRunCommand] (ajp-/127.0.0.1:8702-3) [24c593e1] Succeeded giving user '82a64380-5b61-4c34-b5e9-a54471f28bc9' permission to Vm '7074cd6a-96cc-4956-a549-5e2323e7806d'

2017-09-06 16:56:08,037 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [3b525f84] Running command: RunVmCommand internal: true. Entities affected :  ID: 7074cd6a-96cc-4956-a549-5e2323e7806d Type: VMAction group RUN_VM with role type USER

2017-09-06 16:56:08,039 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [3b525f84] Creating stateless snapshot for VM 'XXXX-127' (7074cd6a-96cc-4956-a549-5e2323e7806d)

2017-09-06 16:56:15,131 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (org.ovirt.thread.pool-6-thread-33) [3b525f84] START, CreateVmVDSCommand( CreateVmVDSCommandParameters:{runAsync='true', hostId='6b33b03b-9a0f-4c3d-8281-10aeb6d0e2e4', vmId='7074cd6a-96cc-4956-a549-5e2323e7806d', vm='VM [XXXX-127]'}), log id: 1704041c

2017-09-06 16:57:16,556 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-32) [] VM '7074cd6a-96cc-4956-a549-5e2323e7806d'(XXXX-127) moved from 'PoweringUp' --> 'Up'

2) Restart the Manager:

2017-09-06 17:03:20,033 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 44) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

3) Several of These right at the beginning, apparently one for each VM.

2017-09-06 17:03:37,006 INFO  [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-9) [6be3ad2e] Running command: ProcessDownVmCommand internal: true.

4) Detaches User:

2017-09-06 17:03:37,048 INFO  [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (org.ovirt.thread.pool-6-thread-9) [7bb00284] Deleting snapshots for stateless vm '7074cd6a-96cc-4956-a549-5e2323e7806d'

2017-09-06 17:03:37,054 INFO  [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-6-thread-9) [47733928] Running command: RestoreStatelessVmCommand internal: true. Entities affected :  ID: 7074cd6a-96cc-4956-a549-5e2323e7806d Type: VM

2017-09-06 17:03:37,233 WARN  [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (org.ovirt.thread.pool-6-thread-9) [5067ca5d] CanDoAction of action 'RestoreAllSnapshots' failed for user SYSTEM. Reasons: VAR__ACTION__REVERT_TO,VAR__TYPE__SNAPSHOT,ACTION_TYPE_FAILED_VM_IS_NOT_DOWN

5) And yes, it shows the VM as up shortly after, status=Up.

2017-09-06 17:04:00,866 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-22) [] FINISH, FullListVDSCommand, return: [... vmId=7074cd6a-96cc-4956-a549-5e2323e7806d, status=Up... ], log id: 50922a21

Why it tried to restore?

Just to double check:

     vm_name     | console_disconnect_action 
-----------------+---------------------------
 XXXX-127        | LOCK_SCREEN


Version-Release number of selected component (if applicable):
rhevm-3.6.11.3-0.1.el6.noarch

How reproducible:
Unknown.

Comment 8 Michal Skrivanek 2017-09-08 06:05:56 UTC

the host is 3.5 (vdsm-4.16.38, right?). Then it doesn't use VM status events. There's not much useful info in logs, we would need a reproducer or a debug level logs to be able to investigate further

Comment 9 Germano Veit Michel 2017-09-08 06:14:33 UTC

Hi Michal,

(In reply to Michal Skrivanek from comment #8)
> the host is 3.5 (vdsm-4.16.38, right?). Then it doesn't use VM status
> events. 

Exactly.

> There's not much useful info in logs, we would need a reproducer or
> a debug level logs to be able to investigate further

Could you be more specific on what loggers you need on DEBUG level? Just org.ovirt?

Comment 10 Michal Skrivanek 2017-09-08 06:17:29 UTC

perhaps org.ovirt.engine.core.bll would be enough,also the vdsm side. Do we have  a reproducer? The fact that the host in 3.5 might be significant

Comment 11 Germano Veit Michel 2017-09-08 06:20:16 UTC

(In reply to Michal Skrivanek from comment #10)
> perhaps org.ovirt.engine.core.bll would be enough,also the vdsm side. Do we
> have  a reproducer? The fact that the host in 3.5 might be significant

Thanks.

Yes, I'll try it now and if it doesn't reproduce we will ask the customer.

Comment 14 Germano Veit Michel 2017-09-08 07:10:33 UTC

Michal,

I've reproduced in on 3.6.11 + 4.16.38. So it seems to be confined to RHEL6 hosts. Still, I think we should understand what is going on to ensure it can't happen on newer versions. Attached engine debug and vdsm logs to the case.

Here is what we are after:

2017-09-08 16:54:14,671 INFO  [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-6-thread-10) [1a4e9094] Running command: ProcessDownVmCommand internal: true.

2017-09-08 16:54:14,776 INFO  [org.ovirt.engine.core.bll.DetachUserFromVmFromPoolCommand] (org.ovirt.thread.pool-6-thread-10) [4a8039c4] Running command: DetachUserFromVmFromPoolCommand internal: true. Entities affected :  ID: 7967e300-7317-4457-986c-873eab585a2c Type: VmPool

Still comes out of nowhere to me...

Note You need to log in before you can comment on or make changes to this bug.