Description of problem:

RunVmCommand gets stuck, effectively forever, due to an exclusive lock in the engine:

2016-12-07 22:26:57,021 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-177) [22daee98] Lock Acquired to object 'EngineLock:{exclusiveLocks='[2dab1cb1-708e-4430-996b-1ff957de071c=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'

From the attached thread dump we can easily identify the lock, which is held by ReactorClient.connect.

Note: all of my hosts are bare metal and up & running; I also have a few in maintenance.

"org.ovirt.thread.pool-6-thread-42" #284 prio=5 os_prio=0 tid=0x00007f276402a800 nid=0x4c7a waiting for monitor entry [0x00007f289e7e9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.ovirt.engine.core.vdsbroker.VdsManager.updatePendingData(VdsManager.java:452)
	- waiting to lock <0x000000008d02c858> (a java.lang.Object)
	at org.ovirt.engine.core.bll.scheduling.pending.PendingResourceManager.notifyHostManagers(PendingResourceManager.java:236)
	at org.ovirt.engine.core.bll.scheduling.SchedulingManager.schedule(SchedulingManager.java:324)
	at org.ovirt.engine.core.bll.RunVmCommand.getVdsToRunOn(RunVmCommand.java:820)

"DefaultQuartzScheduler9" #240 prio=5 os_prio=0 tid=0x00007f267404d000 nid=0x4c4e waiting on condition [0x00007f28a5919000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00000000fa5f5230> (a java.util.concurrent.FutureTask)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
	at java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.connect(ReactorClient.java:132)
	at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.getClient(JsonRpcClient.java:134)
	at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:81)
	at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70)
	at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getVdsStats(JsonRpcVdsServer.java:272)
	at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:20)
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110)
	at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
	at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
	at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:451)
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:390)
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114)
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:84)
	at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:238)
	- locked <0x000000008d02c858> (a java.lang.Object)

Version-Release number of selected component (if applicable):
4.0.6-1

How reproducible:
Not clear. This environment was running fine for me with 2.2K VMs; due to maintenance on the storage I had to restart ovirt-engine.

Steps to Reproduce:
1. Not clear, but currently an engine restart.
2.
3.

Actual results:
VMs can't start.

Expected results:
Running a VM should be independent of host monitoring, or should time out instead of waiting forever on the exclusive lock.

Additional info:
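For illustration only, here is a minimal, self-contained Java sketch of the contention visible in the thread dump. The class and member names (HostMonitorSketch, hostLock, connector) are invented for this example and are not the real oVirt engine code; it just models a monitoring timer that holds a per-host lock while blocked on an unbounded FutureTask.get() inside a connect attempt, which starves the RunVm scheduling path that needs the same lock.

import java.util.concurrent.*;

public class HostMonitorSketch {

    private final Object hostLock = new Object();           // plays the role of the VdsManager monitor
    private final ExecutorService connector = Executors.newSingleThreadExecutor();

    // Roughly models VdsManager.onTimer -> HostMonitoring.refresh -> ReactorClient.connect
    void onTimer() throws Exception {
        synchronized (hostLock) {
            Future<Void> connect = connector.submit(() -> {
                Thread.sleep(Long.MAX_VALUE);                // simulate a connect to an unresponsive host
                return null;
            });
            connect.get();                                   // unbounded wait: the lock is never released
            // A bounded wait would give the lock back after a deadline instead, e.g.:
            // connect.get(10, TimeUnit.SECONDS);
        }
    }

    // Roughly models RunVmCommand -> SchedulingManager.schedule -> VdsManager.updatePendingData
    void updatePendingData() {
        synchronized (hostLock) {                            // blocks forever while onTimer holds the lock
            System.out.println("pending data updated");
        }
    }

    public static void main(String[] args) throws Exception {
        HostMonitorSketch m = new HostMonitorSketch();
        new Thread(() -> { try { m.onTimer(); } catch (Exception ignored) { } }).start();
        Thread.sleep(500);                                   // let the timer thread grab the lock first
        m.updatePendingData();                               // never returns, like the stuck RunVmCommand
    }
}

Running the sketch hangs by design, mirroring what happens to RunVmCommand here; a timeout on the connect future (or not holding the lock across the network call) is the kind of change that would avoid it.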
oVirt assets profile: 1 DC, 1 cluster, 12 SDs, 27 hosts (active), 2.2K VMs.
Isn't it a duplicate of bug 1401585? Seems the same to me.
*** This bug has been marked as a duplicate of bug 1401585 ***
(In reply to Roy Golan from comment #4)
> 
> *** This bug has been marked as a duplicate of bug 1401585 ***

It is the same root cause but a different use case: here we can't start any VM, which is critical. What about making this bug depend on bug 1401585?
I think we can add a comment on the other bug to make sure VM startup is verified correctly.

Can you?
(In reply to Oved Ourfali from comment #6)
> I think we can add a comment on the other bug to make sure VM startup is
> verified correctly.
> 
> Can you?

But if it's only a comment, some people may miss it, and if someone else encounters this issue it won't be visible enough. Please let's keep it as is (as a separate bug) that depends on bug 1401585; once that one is resolved we can easily resolve this one as well. And again, the two bugs represent different use cases. What do you think?
I think that if you write a comment saying "Please verify starting a VM", then it won't be missed. You can put a needinfo on the QE contact to make sure it isn't missed.

A root cause can have tons of issues around it, and I think in this case automation is the tool to make sure things work properly.
(In reply to Oved Ourfali from comment #8)
> I think that if you write a comment saying "Please verify starting a VM",
> then it won't be missed. You can put a needinfo on the QE contact to make
> sure it isn't missed.
> 
> A root cause can have tons of issues around it, and I think in this case
> automation is the tool to make sure things work properly.

It looks like different teams will need to verify it on different use cases.

While it is a critical issue that I would prefer to monitor closely, would you mind if we keep it open and ON_QA, just so we can track it better from the scale perspective?
(In reply to Gil Klein from comment #9)
> (In reply to Oved Ourfali from comment #8)
> > I think that if you write a comment saying "Please verify starting a VM",
> > then it won't be missed. You can put a needinfo on the QE contact to make
> > sure it isn't missed.
> > 
> > A root cause can have tons of issues around it, and I think in this case
> > automation is the tool to make sure things work properly.
> 
> It looks like different teams will need to verify it on different use cases.
> 
> While it is a critical issue that I would prefer to monitor closely, would
> you mind if we keep it open and ON_QA, just so we can track it better from
> the scale perspective?

I think it is redundant, but it's up to you.
When you found this issue, which version of jsonrpc did you use?

Please give steps to reproduce. How many hosts and VMs are there?
(In reply to Piotr Kliczewski from comment #11)
> When you found this issue, which version of jsonrpc did you use?
> 
> Please give steps to reproduce. How many hosts and VMs are there?

How reproducible: not clear, since this environment was running fine for me with 2.2K VMs; due to maintenance on the storage I had to restart ovirt-engine.

Steps to Reproduce:
1. Not clear, but currently an engine restart.

The default RHEV-M build installation is 4.0.6-1, which ships vdsm-jsonrpc-java-1.2.9-1.el7ev.noarch.