Similar to BZ1561302, also on the same version (4.1.7). That one was closed because we did not have enough data. It would be nice to confirm this is fixed in later versions, or to get a fix if it is not.
Ryan, looks like a deadlock in the scheduler when updating pending resources?
(In reply to Michal Skrivanek from comment #7)
> Ryan, looks like a deadlock in the scheduler when updating pending resources?

Yes, but a very old version, and I don't think this is reproducible anymore. Still going through the sosreport.

There are no TransactionReapers here, but I don't think comment #1 is right. This object was already locked. Immediately after the lines in comment #1:

2019-04-09 09:54:27,260+09 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-37) [946813d6-bcd6-4ad9-926c-41ef11f6d4b4] Failed to Acquire Lock to object 'EngineLock:{exclusiveLocks='[7384419d-f121-44f3-8643-515ca48d751c=VM]', sharedLocks=''}'
2019-04-09 09:54:27,271+09 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-37) [946813d6-bcd6-4ad9-926c-41ef11f6d4b4] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_OBJECT_LOCKED

But, unless we're missing logs, the last time this VM was touched was over a year ago, and that lock (a shared one) was freed. The last time there was an exclusive lock was 2018-01-27.

Ravi/Martin, it's still 4.1.7, but were some jboss issues around locking resolved in the meantime?
(In reply to Ryan Barry from comment #8)
> There are no TransactionReapers here, but I don't think comment #1 is right.
> This object was already locked. Immediately after the lines in comment #1:
>
> 2019-04-09 09:54:27,260+09 INFO  [org.ovirt.engine.core.bll.RunVmCommand]
> (default task-37) [946813d6-bcd6-4ad9-926c-41ef11f6d4b4] Failed to Acquire
> Lock to object
> 'EngineLock:{exclusiveLocks='[7384419d-f121-44f3-8643-515ca48d751c=VM]',
> sharedLocks=''}'
> 2019-04-09 09:54:27,271+09 WARN  [org.ovirt.engine.core.bll.RunVmCommand]
> (default task-37) [946813d6-bcd6-4ad9-926c-41ef11f6d4b4] Validation of
> action 'RunVm' failed for user admin@internal-authz. Reasons:
> VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_OBJECT_LOCKED

Pardon my ignorance, just trying to learn a bit more. Isn't the failure above simply because the lock for the VM object was acquired one minute earlier, here?

2019-04-09 09:53:59,786+09 INFO  [org.ovirt.engine.core.bll.RunVmOnceCommand] (default task-61) [538c7c08-01bb-46fc-92a4-4bd13ffa33e6] Lock Acquired to object 'EngineLock:{exclusiveLocks='[7384419d-f121-44f3-8643-515ca48d751c=VM]', sharedLocks=''}'

In other words, the initial attempt locked the VM and is now stuck on some other lock, so new attempts to start it fail because the VM is locked. No?
(In reply to Michal Skrivanek from comment #7)
> Ryan, looks like a deadlock in the scheduler when updating pending resources?

Sorry, not really a deadlock, just a stuck getVdsStats blocking the update in VdsManager::updatePendingData during scheduling. I do not see any other problem; it just waits on results from the host numerous times. We did solve a few stuck host monitoring cases in 4.2, but it's hard to pinpoint which one, I guess.
Germano, yeah, we need the Vds (VdsManager) lock both when scheduling and when updating periodic host stats. When the latter gets stuck (e.g. because vdsm never returns), the lock stays held and further VM starts get stuck in the scheduling phase.
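Roughly, the pattern looks like this (toy sketch only, not the actual engine code; the class/method/thread names just mirror the ones in the thread dump):

import java.util.concurrent.TimeUnit;

public class VdsLockContentionSketch {

    // Stand-in for org.ovirt.engine.core.vdsbroker.VdsManager
    static class FakeVdsManager {
        // Mirrors VdsManager.onTimer(): runs periodically and holds the
        // object monitor while doing a blocking vdsm call (getVdsStats).
        synchronized void onTimer() throws InterruptedException {
            TimeUnit.HOURS.sleep(1); // simulate a getVdsStats call that never returns
        }

        // Mirrors VdsManager.updatePendingData(): called from
        // SchedulingManager.schedule() and needs the same monitor.
        synchronized void updatePendingData() {
            System.out.println("pending data updated");
        }
    }

    public static void main(String[] args) throws Exception {
        FakeVdsManager manager = new FakeVdsManager();

        Thread monitoring = new Thread(() -> {
            try {
                manager.onTimer();      // grabs the monitor and never releases it
            } catch (InterruptedException ignored) {
            }
        }, "DefaultQuartzScheduler8");
        monitoring.start();

        TimeUnit.MILLISECONDS.sleep(100); // let the monitoring thread win the lock

        Thread scheduling = new Thread(manager::updatePendingData, "default task-61");
        scheduling.start();             // BLOCKED (on object monitor), like in the dump

        TimeUnit.SECONDS.sleep(1);
        System.out.println("scheduling thread state: " + scheduling.getState()); // BLOCKED
        System.exit(0);
    }
}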
Thanks Michal! So the actual problem is here:

"default task-61" #245 prio=5 os_prio=0 tid=0x00007f0ba80ea5d0 nid=0x6e3 waiting for monitor entry [0x00007f0b76dba000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.ovirt.engine.core.vdsbroker.VdsManager.updatePendingData(VdsManager.java:474)
    - waiting to lock <0x000000008a90d0a8> (a org.ovirt.engine.core.vdsbroker.VdsManager)
    at org.ovirt.engine.core.bll.scheduling.pending.PendingResourceManager.notifyHostManagers(PendingResourceManager.java:227)
    at org.ovirt.engine.core.bll.scheduling.SchedulingManager.schedule(SchedulingManager.java:340)
    at org.ovirt.engine.core.bll.RunVmCommand.getVdsToRunOn(RunVmCommand.java:838)
    at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:228)
    at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:402)
    at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:327)
    at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:120)
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251)

Which is held by this quartz thread:

"DefaultQuartzScheduler8" #164 prio=5 os_prio=0 tid=0x00007f0bbc015c60 nid=0x661 waiting on condition [0x00007f0b6f717000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x000000008a8e4ad8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.ovirt.vdsm.jsonrpc.client.utils.LockWrapper.<init>(LockWrapper.java:16)
    at org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker.registerTrackingRequest(ResponseTracker.java:81)
    at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.retryCall(JsonRpcClient.java:101)
    at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:81)
    at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70)
    at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getVdsStats(JsonRpcVdsServer.java:322)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:20)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14)
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:407)
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:384)
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:111)
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:84)
    at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:273)
    - locked <0x000000008a90d0a8> (a org.ovirt.engine.core.vdsbroker.VdsManager)
    at sun.reflect.GeneratedMethodAccessor136.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:83)
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:55)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Shouldn't a GetStatsVDSCommand time out after 180s?
There were many improvements around threading and locking in HostMonitoring in 4.2, so I definitely recommend an upgrade. But just in case, Ravi, could you please take a look?
No, there's no timeout, unfortunately. We had a lot of these (especially with gluster recently) where the call just gets stuck because the host never answers. If other calls work, we don't even detect it, since the heartbeat still works and even storage pool monitoring can continue, so the host stays Up as if nothing happened... Let's wait for Ravi's final confirmation, but we are confident enough that it's solved in 4.3 (on the host side).
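Just to illustrate what a client-side timeout would even mean here, a toy sketch (this is not what the engine code does; the names are made up):

import java.util.concurrent.*;

public class BoundedStatsCallSketch {

    // Hypothetical stand-in for a blocking getVdsStats call that may never return.
    static String blockingGetVdsStats() throws InterruptedException {
        TimeUnit.HOURS.sleep(1); // simulate vdsm never answering
        return "stats";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> stats = executor.submit(BoundedStatsCallSketch::blockingGetVdsStats);
        try {
            // Bound the wait instead of blocking forever while holding the VdsManager monitor.
            System.out.println(stats.get(3, TimeUnit.SECONDS)); // 180s in the question above
        } catch (TimeoutException e) {
            System.out.println("getVdsStats did not answer in time, giving up");
            stats.cancel(true);
        } finally {
            executor.shutdownNow();
        }
    }
}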
From the logs it looks like the threading improvements in 4.2 should resolve the issue described here. HostMonitoring no longer uses GetStats; it executes the async variant of the command, GetStatsAsync, which should not block. There are also lots of improvements around command execution in 4.3. IMO in 4.3 we should not see these errors.
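The shape of the change, schematically (illustrative sketch only, not the real engine/jsonrpc API; names and signatures are made up):

import java.util.Collections;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncStatsSketch {

    // Old pattern: the monitoring timer blocks on the wire call,
    // holding the VdsManager monitor until vdsm answers (or never).
    static Map<String, Object> getStatsBlocking() {
        // ... blocking JSON-RPC round trip would happen here ...
        return Collections.singletonMap("cpuLoad", 0.42);
    }

    // New pattern: fire the request and return immediately; the result
    // is handled in a callback, so the caller holds no lock across the call.
    static CompletableFuture<Map<String, Object>> getStatsAsync() {
        return CompletableFuture.supplyAsync(AsyncStatsSketch::getStatsBlocking);
    }

    public static void main(String[] args) throws Exception {
        getStatsAsync().thenAccept(stats ->
                System.out.println("host stats arrived: " + stats));
        // The monitoring thread is free to do other work here.
        TimeUnit.SECONDS.sleep(1);
    }
}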
Perfect, thanks guys. So I'm closing this as a dup of the BZ that introduced GetStatsAsync. If we see this again on 4.2+ I'll open a new BZ.

*** This bug has been marked as a duplicate of bug 1430876 ***