Similar to BZ1561302, also on the same version (4.1.7). That one was closed because we did not have enough data. It would be nice to confirm this is fixed in later versions, or to get a fix if it is not.
Ryan, looks like a deadlock in the scheduler when updating pending resources?
(In reply to Michal Skrivanek from comment #7)
> Ryan, looks like a deadlock in the scheduler when updating pending resources?

Yes, but a very old version, and I don't think this is reproducible anymore. Still going through the sosreport.

There are no TransactionReapers here, but I don't think comment #1 is right. This object was already locked. Immediately after the lines in comment #1:

2019-04-09 09:54:27,260+09 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-37) [946813d6-bcd6-4ad9-926c-41ef11f6d4b4] Failed to Acquire Lock to object 'EngineLock:{exclusiveLocks='[7384419d-f121-44f3-8643-515ca48d751c=VM]', sharedLocks=''}'
2019-04-09 09:54:27,271+09 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-37) [946813d6-bcd6-4ad9-926c-41ef11f6d4b4] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_OBJECT_LOCKED

But, unless we're missing logs, the last time this VM was touched was over a year ago, and that lock (a shared one) was freed. The last time there was an exclusive lock was 2018-01-27.

Ravi/Martin, it's still 4.1.7, but were some jboss issues around locking resolved in the meantime?
(In reply to Ryan Barry from comment #8)
> There are no TransactionReapers here, but I don't think comment #1 is right.
> This object was already locked. Immediately after the lines in comment #1:
>
> 2019-04-09 09:54:27,260+09 INFO  [org.ovirt.engine.core.bll.RunVmCommand]
> (default task-37) [946813d6-bcd6-4ad9-926c-41ef11f6d4b4] Failed to Acquire
> Lock to object
> 'EngineLock:{exclusiveLocks='[7384419d-f121-44f3-8643-515ca48d751c=VM]',
> sharedLocks=''}'
> 2019-04-09 09:54:27,271+09 WARN  [org.ovirt.engine.core.bll.RunVmCommand]
> (default task-37) [946813d6-bcd6-4ad9-926c-41ef11f6d4b4] Validation of
> action 'RunVm' failed for user admin@internal-authz. Reasons:
> VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_OBJECT_LOCKED

Pardon my ignorance, just trying to learn a bit more. Isn't the failure above simply because the lock for the VM object was acquired one minute earlier, here?

2019-04-09 09:53:59,786+09 INFO  [org.ovirt.engine.core.bll.RunVmOnceCommand] (default task-61) [538c7c08-01bb-46fc-92a4-4bd13ffa33e6] Lock Acquired to object 'EngineLock:{exclusiveLocks='[7384419d-f121-44f3-8643-515ca48d751c=VM]', sharedLocks=''}'

In other words, the initial attempt locked the VM and is now stuck on some other lock, so new attempts to start it fail because the VM is locked. No?
(In reply to Michal Skrivanek from comment #7)
> Ryan, looks like a deadlock in the scheduler when updating pending resources?

Sorry, not really a deadlock, just a stuck getVdsStats blocking the update in VdsManager::updatePendingData during scheduling. I do not see any other problem; it just waits on results from the host numerous times. We did solve a few stuck host monitoring cases in 4.2, but it's hard to pinpoint which one, I guess.
Germano, yeah, we need the Vds (VdsManager) lock both when scheduling and when updating periodic host stats. When the latter gets stuck (e.g. because vdsm never returns), the lock stays held and further VM starts get stuck in the scheduling phase.
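Roughly, the pattern looks like this (toy sketch only, not the actual engine code; the class/method/thread names just mirror the ones in the thread dump):

import java.util.concurrent.TimeUnit;

public class VdsLockContentionSketch {

    // Stand-in for org.ovirt.engine.core.vdsbroker.VdsManager
    static class FakeVdsManager {
        // Mirrors VdsManager.onTimer(): runs periodically and holds the
        // object monitor while doing a blocking vdsm call (getVdsStats).
        synchronized void onTimer() throws InterruptedException {
            TimeUnit.HOURS.sleep(1); // simulate a getVdsStats call that never returns
        }

        // Mirrors VdsManager.updatePendingData(): called from
        // SchedulingManager.schedule() and needs the same monitor.
        synchronized void updatePendingData() {
            System.out.println("pending data updated");
        }
    }

    public static void main(String[] args) throws Exception {
        FakeVdsManager manager = new FakeVdsManager();

        Thread monitoring = new Thread(() -> {
            try {
                manager.onTimer();      // grabs the monitor and never releases it
            } catch (InterruptedException ignored) {
            }
        }, "DefaultQuartzScheduler8");
        monitoring.start();

        TimeUnit.MILLISECONDS.sleep(100); // let the monitoring thread win the lock

        Thread scheduling = new Thread(manager::updatePendingData, "default task-61");
        scheduling.start();             // BLOCKED (on object monitor), like in the dump

        TimeUnit.SECONDS.sleep(1);
        System.out.println("scheduling thread state: " + scheduling.getState()); // BLOCKED
        System.exit(0);
    }
}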
Thanks Michal! So the actual problem is here:

"default task-61" #245 prio=5 os_prio=0 tid=0x00007f0ba80ea5d0 nid=0x6e3 waiting for monitor entry [0x00007f0b76dba000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.ovirt.engine.core.vdsbroker.VdsManager.updatePendingData(VdsManager.java:474)
    - waiting to lock <0x000000008a90d0a8> (a org.ovirt.engine.core.vdsbroker.VdsManager)
    at org.ovirt.engine.core.bll.scheduling.pending.PendingResourceManager.notifyHostManagers(PendingResourceManager.java:227)
    at org.ovirt.engine.core.bll.scheduling.SchedulingManager.schedule(SchedulingManager.java:340)
    at org.ovirt.engine.core.bll.RunVmCommand.getVdsToRunOn(RunVmCommand.java:838)
    at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:228)
    at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:402)
    at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:327)
    at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:120)
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251)

Which is held by this quartz thread:

"DefaultQuartzScheduler8" #164 prio=5 os_prio=0 tid=0x00007f0bbc015c60 nid=0x661 waiting on condition [0x00007f0b6f717000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x000000008a8e4ad8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.ovirt.vdsm.jsonrpc.client.utils.LockWrapper.<init>(LockWrapper.java:16)
    at org.ovirt.vdsm.jsonrpc.client.internal.ResponseTracker.registerTrackingRequest(ResponseTracker.java:81)
    at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.retryCall(JsonRpcClient.java:101)
    at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:81)
    at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70)
    at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getVdsStats(JsonRpcVdsServer.java:322)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:20)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14)
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:407)
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:384)
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:111)
    at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:84)
    at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:273)
    - locked <0x000000008a90d0a8> (a org.ovirt.engine.core.vdsbroker.VdsManager)
    at sun.reflect.GeneratedMethodAccessor136.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:83)
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:55)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Shouldn't a GetStatsVDSCommand time out after 180s?
There were many improvements around threading and locking in HostMonitoring in 4.2, so I definitely recommend an upgrade. But just in case, Ravi, could you please take a look?
No, there's no timeout, unfortunately. We had a lot of these (especially with gluster recently) where the call just gets stuck because the host never answers. If other calls work, we don't even detect it, since the heartbeat still works and even storage pool monitoring can continue, so the host stays Up as if nothing happened... Let's wait for Ravi's final confirmation, but we are confident enough that it's solved in 4.3 (on the host side).
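Just to illustrate what a client-side timeout would even mean here, a toy sketch (this is not what the engine code does; the names are made up):

import java.util.concurrent.*;

public class BoundedStatsCallSketch {

    // Hypothetical stand-in for a blocking getVdsStats call that may never return.
    static String blockingGetVdsStats() throws InterruptedException {
        TimeUnit.HOURS.sleep(1); // simulate vdsm never answering
        return "stats";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> stats = executor.submit(BoundedStatsCallSketch::blockingGetVdsStats);
        try {
            // Bound the wait instead of blocking forever while holding the VdsManager monitor.
            System.out.println(stats.get(3, TimeUnit.SECONDS)); // 180s in the question above
        } catch (TimeoutException e) {
            System.out.println("getVdsStats did not answer in time, giving up");
            stats.cancel(true);
        } finally {
            executor.shutdownNow();
        }
    }
}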
From the logs it looks like the threading improvements in 4.2 should resolve the issue described here. HostMonitoring no longer uses GetStats; it executes the async variant of the command, GetStatsAsync, which should not block. There are also lots of improvements around command execution in 4.3. IMO in 4.3 we should not see these errors.
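The shape of the change, schematically (illustrative sketch only, not the real engine/jsonrpc API; names and signatures are made up):

import java.util.Collections;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncStatsSketch {

    // Old pattern: the monitoring timer blocks on the wire call,
    // holding the VdsManager monitor until vdsm answers (or never).
    static Map<String, Object> getStatsBlocking() {
        // ... blocking JSON-RPC round trip would happen here ...
        return Collections.singletonMap("cpuLoad", 0.42);
    }

    // New pattern: fire the request and return immediately; the result
    // is handled in a callback, so the caller holds no lock across the call.
    static CompletableFuture<Map<String, Object>> getStatsAsync() {
        return CompletableFuture.supplyAsync(AsyncStatsSketch::getStatsBlocking);
    }

    public static void main(String[] args) throws Exception {
        getStatsAsync().thenAccept(stats ->
                System.out.println("host stats arrived: " + stats));
        // The monitoring thread is free to do other work here.
        TimeUnit.SECONDS.sleep(1);
    }
}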
Perfect, thanks guys. So I'm closing this as a dup of the BZ that introduced GetStatsAsync. If we see this again on 4.2+ I'll open a new BZ.

*** This bug has been marked as a duplicate of bug 1430876 ***