Bug 1402597 - [scale] - vm failed to start due to blocking monitor lock
Summary: [scale] - vm failed to start due to blocking monitor lock
Keywords:
Status: CLOSED DUPLICATE of bug 1401585
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.0.6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Roy Golan
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On: 1364791
Blocks:
 
Reported: 2016-12-07 22:51 UTC by Eldad Marciano
Modified: 2019-04-28 08:37 UTC
CC List: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-12-08 07:45:12 UTC
oVirt Team: Infra
Embargoed:


Attachments:

Description Eldad Marciano 2016-12-07 22:51:23 UTC
Description of problem:
RunVmCommand gets stuck essentially forever while holding the exclusive engine lock on the VM:
2016-12-07 22:26:57,021 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-177) [22daee98] Lock Acquired to object 'EngineLock:{exclusiveLocks='[2dab1cb1-708e-4430-996b-1ff957de071c=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'

From the attached thread dump (TD) we can easily identify the lock, which is held by a thread blocked in ReactorClient.connect.

Note: all of my hosts are bare metal and up & running; I also have a few in maintenance.


"org.ovirt.thread.pool-6-thread-42" #284 prio=5 os_prio=0 tid=0x00007f276402a800 nid=0x4c7a waiting for monitor entry [0x00007f289e7e9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.ovirt.engine.core.vdsbroker.VdsManager.updatePendingData(VdsManager.java:452)
        - waiting to lock <0x000000008d02c858> (a java.lang.Object)
        at org.ovirt.engine.core.bll.scheduling.pending.PendingResourceManager.notifyHostManagers(PendingResourceManager.java:236)
        at org.ovirt.engine.core.bll.scheduling.SchedulingManager.schedule(SchedulingManager.java:324)
        at org.ovirt.engine.core.bll.RunVmCommand.getVdsToRunOn(RunVmCommand.java:820)




"DefaultQuartzScheduler9" #240 prio=5 os_prio=0 tid=0x00007f267404d000 nid=0x4c4e waiting on condition [0x00007f28a5919000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000fa5f5230> (a java.util.concurrent.FutureTask)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
        at java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.connect(ReactorClient.java:132)
        at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.getClient(JsonRpcClient.java:134)
        at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:81)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getVdsStats(JsonRpcVdsServer.java:272)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:20)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110)
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:451)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:390)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:84)
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:238)
        - locked <0x000000008d02c858> (a java.lang.Object)
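For context, the two traces above show a classic contention pattern: the monitoring thread ("DefaultQuartzScheduler9") takes the per-host monitor in VdsManager.onTimer and then blocks on a network connect, while the RunVmCommand path waits for the same monitor in updatePendingData. Below is a minimal, self-contained Java sketch of that pattern only; class and method names are illustrative stand-ins, not the actual oVirt engine code.

import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the contention seen in the thread dump: a monitoring
 * thread holds a per-host monitor while performing a blocking network call,
 * and the run-VM path blocks on the same monitor. Names are illustrative only.
 */
public class MonitorContentionSketch {

    private final Object hostLock = new Object(); // stands in for the VdsManager lock object

    // analogue of VdsManager.onTimer -> HostMonitoring.refresh -> ReactorClient.connect
    void refreshHostStats() {
        synchronized (hostLock) {
            blockingConnect(); // the monitor stays held while the connect hangs
        }
    }

    // analogue of RunVmCommand -> SchedulingManager.schedule -> VdsManager.updatePendingData
    void updatePendingData() {
        synchronized (hostLock) { // shows up as BLOCKED in a thread dump until the monitor is released
            System.out.println("pending data updated");
        }
    }

    private void blockingConnect() {
        try {
            TimeUnit.SECONDS.sleep(30); // simulates a connect to an unresponsive host (shortened here)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MonitorContentionSketch sketch = new MonitorContentionSketch();
        Thread monitoring = new Thread(sketch::refreshHostStats, "HostMonitoring");
        Thread runVm = new Thread(sketch::updatePendingData, "RunVmCommand");
        monitoring.start();
        Thread.sleep(100); // let the monitoring thread grab the monitor first
        runVm.start();     // a jstack taken now shows the RunVmCommand thread BLOCKED on the monitor
        monitoring.join();
        runVm.join();
    }
}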



Version-Release number of selected component (if applicable):
4.0.6-1

How reproducible:
Not clear. This environment was running with 2.2K VMs; due to maintenance on the storage I had to restart oVirt.

Steps to Reproduce:
1. Not clear; currently, an engine restart.
2.
3.

Actual results:
VMs can't start.

Expected results:
VM commands should be independent, or should time out instead of blocking indefinitely on the exclusive lock.
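As an illustration of the timeout idea only (not the engine's actual locking code), the shared per-host state could be guarded by a lock that supports timed acquisition, so the run-VM path fails fast with a retriable error instead of parking forever behind a hung monitoring cycle. A minimal sketch with hypothetical names:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch of the timeout idea suggested above; not the engine's actual code.
 */
public class TimedLockSketch {

    private final ReentrantLock hostLock = new ReentrantLock();

    boolean updatePendingData() {
        try {
            // fail fast instead of blocking forever on the per-host lock
            if (!hostLock.tryLock(5, TimeUnit.SECONDS)) {
                System.err.println("host busy refreshing stats, failing the command as retriable");
                return false;
            }
            try {
                System.out.println("pending data updated");
                return true;
            } finally {
                hostLock.unlock();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        new TimedLockSketch().updatePendingData();
    }
}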

Additional info:

Comment 2 Eldad Marciano 2016-12-07 23:00:32 UTC
Ovirt assets profile:
1 DC, 1 cluster, 12 SDs, 27 hosts (active), 2.2K VMs.

Comment 3 Oved Ourfali 2016-12-08 05:50:06 UTC
Isn't it a duplicate of bug 1401585?
Seems the same to me.

Comment 4 Roy Golan 2016-12-08 07:45:12 UTC

*** This bug has been marked as a duplicate of bug 1401585 ***

Comment 5 Eldad Marciano 2016-12-08 08:07:35 UTC
(In reply to Roy Golan from comment #4)
> 
> *** This bug has been marked as a duplicate of bug 1401585 ***

It is the same root cause but a different use case; here we can't start any VM, which is critical.

What about making this bug depend on bug 1401585?

Comment 6 Oved Ourfali 2016-12-08 08:10:57 UTC
I think we can add a comment for the other bug to make sure to verify VM startup correctly.

Can you?

Comment 7 Eldad Marciano 2016-12-08 08:19:32 UTC
(In reply to Oved Ourfali from comment #6)
> I think we can add a comment for the other bug to make sure to verify VM
> startup correctly.
> 
> Can you?

But if it's just a comment, some people may miss it, and if someone else encounters this it won't be visible enough.
Please let's keep it as is (a separate bug) that depends on bug 1401585; once that one is resolved we can easily resolve this one as well.
And again, the two bugs represent different use cases.
What do you think?

Comment 8 Oved Ourfali 2016-12-08 08:22:21 UTC
I think that if you write a comment with:
Please verify starting a VM, then it won't be missed. You can put a needinfo on the QE contact to make sure that isn't missed.

A root cause can have tons of issues around it, and I think in this case automation is the tool to make sure things work properly.

Comment 9 Gil Klein 2016-12-08 08:37:48 UTC
(In reply to Oved Ourfali from comment #8)
> I think that if you write a comment with:
> Please verify starting a VM, then it won't be missed. You can put a needinfo
> on the QE contact to make sure that isn't missed.
> 
> A root cause can have tons of issues around it, and I think in this case
> automation is the tool to make sure things work properly.
It looks like different teams will need to verify it on different use cases.

While it is a critical issue that I would prefer to monitor closely, would you mind if we keep it open and move it to ON_QA, just so we can track it better from the scale perspective?

Comment 10 Oved Ourfali 2016-12-08 08:41:22 UTC
(In reply to Gil Klein from comment #9)
> (In reply to Oved Ourfali from comment #8)
> > I think that if you write a comment with:
> > Please verify starting a VM, then it won't be missed. You can put a needinfo
> > on the QE contact to make sure that isn't missed.
> > 
> > A root cause can have tons of issues around it, and I think in this case
> > automation is the tool to make sure things work properly.
> It looks like different teams will need to verify it on different use cases.
> 
> While it is a critical issue that I would prefer to monitor closely, would
> you mind if we keep it open and move it to ON_QA, just so we can track it
> better from the scale perspective?

I think it is redundant, but up to you.

Comment 11 Piotr Kliczewski 2016-12-08 09:15:55 UTC
Which version of jsonrpc were you using when you found this issue?

Please give steps to reproduce. How many hosts and vms are there?

Comment 12 Eldad Marciano 2016-12-08 09:31:33 UTC
(In reply to Piotr Kliczewski from comment #11)
> Which version of jsonrpc were you using when you found this issue?
> 
> Please give steps to reproduce. How many hosts and vms are there?

How reproducible:
Not clear. This environment was running with 2.2K VMs; due to maintenance on the storage I had to restart oVirt.

Steps to Reproduce:
1. Not clear; currently, an engine restart.

The default RHEV-M build installation, 4.0.6-1, which includes vdsm-jsonrpc-java-1.2.9-1.el7ev.noarch.

