Bug 1597062 - setupNetworks holding vds lock for too long
Summary: setupNetworks holding vds lock for too long
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.10
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.3.0
Target Release: 4.3.0
Assignee: Alona Kaplan
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-02 00:46 UTC by Germano Veit Michel
Modified: 2021-09-09 14:52 UTC
CC: 13 users

Fixed In Version: ovirt-engine-4.3.0_alpha
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-08 12:37:51 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43544 0 None None None 2021-09-09 14:52:17 UTC
Red Hat Product Errata RHEA-2019:1085 0 None None None 2019-05-08 12:38:14 UTC
oVirt gerrit 93139 0 master MERGED engine: infinite loop of setupNetworksTask.isDone() 2020-12-17 17:40:41 UTC

Description Germano Veit Michel 2018-07-02 00:46:11 UTC
Description of problem:

A host does not seem to activate; the engine has been waiting to grab a lock for several hours.

2018-06-28 17:13:59,804+02 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (default task-1) [296f9531-f270-40ca-84cd-8a70d95c346c] Lock Acquired to object 'EngineLock:{exclusiveLocks='[d0a78da6-dcf4-450f-96d4-e6ddf7f25ca6=VDS]', sharedLocks=''}'

2018-06-28 17:13:59,852+02 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (org.ovirt.thread.pool-6-thread-11) [296f9531-f270-40ca-84cd-8a70d95c346c] Running command: ActivateVdsCommand internal: false. Entities affected :  ID: d0a78da6-dcf4-450f-96d4-e6ddf7f25ca6 Type: VDSAction group MANIPULATE_HOST with role type ADMIN

2018-06-28 17:13:59,852+02 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (org.ovirt.thread.pool-6-thread-11) [296f9531-f270-40ca-84cd-8a70d95c346c] Before acquiring lock in order to prevent monitoring for host 'host.example.com' from data-center 'datacenter'

2018-06-28 17:13:59,852+02 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-6-thread-11) [296f9531-f270-40ca-84cd-8a70d95c346c] Failed to acquire lock and wait lock 'HostEngineLock:{exclusiveLocks='[d0a78da6-dcf4-450f-96d4-e6ddf7f25ca6=VDS_INIT]', sharedLocks=''}'

[.... 15 hours later... ] 

2018-06-29 07:54:53,056+02 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-6-thread-11) [296f9531-f270-40ca-84cd-8a70d95c346c] Failed to acquire lock and wait lock 'HostEngineLock:{exclusiveLocks='[d0a78da6-dcf4-450f-96d4-e6ddf7f25ca6=VDS_INIT]', sharedLocks=''}' 

Here is the thread, still waiting:

"org.ovirt.thread.pool-6-thread-11" #243 prio=5 os_prio=0 tid=0x00007f1bfc365ef0 nid=0xcdb4 waiting on condition [0x00007f1b75ea9000]                                                          
   java.lang.Thread.State: WAITING (parking)                                                                                                                                                   
        at sun.misc.Unsafe.park(Native Method)                                                                                                                                                 
        - parking to wait for  <0x00000006ce548628> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)                                                                  
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)                                                                                                                   
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)                                                                   
        at org.ovirt.engine.core.bll.lock.InMemoryLockManager.acquireLockWait(InMemoryLockManager.java:98)                                                                                     
        [....]                                                                       
        at org.ovirt.engine.core.bll.lock.LockManager$739599139$Proxy$_$$_Weld$EnterpriseProxy$.acquireLockWait(Unknown Source)                                                                
        at org.ovirt.engine.core.bll.HostLocking.acquireMonitorLock(HostLocking.java:34)                                                                                                       
        at org.ovirt.engine.core.bll.VdsCommand.acquireMonitorLock(VdsCommand.java:267)                                                                                                        
        at org.ovirt.engine.core.bll.ActivateVdsCommand.executeCommand(ActivateVdsCommand.java:64)                                                                                             
        [....]                                                 

But I cannot find what is holding 0x00000006ce548628.

Version-Release number of selected component (if applicable):
rhevm-4.1.10.3-0.1.el7.noarch

How reproducible:
Unknown

Comment 3 Germano Veit Michel 2018-07-02 00:57:22 UTC
The 'Not Responding' messages are due to this: https://bugzilla.redhat.com/show_bug.cgi?id=1574580

Comment 4 Moti Asayag 2018-07-03 11:29:00 UTC
Hi Germano,

Some background about engine's in-memory locking:
The engine's locks are managed in-memory. Each lock has the locked entity (e.g. a host, identified by its ID) and the locking group (e.g. VDS, VDS_INIT and more).

In order to watch the existing locks, you should use JMX as explained at
https://gerrit.ovirt.org/#/c/72316/ (see the jconsole screenshot attached there).

The 'showall' command should list the existing locks. However, that won't be enough, since we'd like to know which flow/command took the lock in the first place and hasn't released it.

Note that the engine's locks are expected to be released when a command completes, by restarting the engine, or by using the JMX API (releaseLock for a single lock, or clear for all of the locks).
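To make the model concrete, here is a minimal illustrative sketch of such an in-memory lock table keyed by (entity id, lock group). The class and method names below are hypothetical, not the actual EngineLock/InMemoryLockManager API:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an in-memory lock table keyed by (entity id, lock group).
public class SimpleLockTable {

    // Lock groups similar to the ones mentioned above.
    public enum LockGroup { VDS, VDS_INIT, HOST_NETWORK }

    // Key is "entityId=group", e.g. "d0a78da6-...=VDS_INIT"; value is the owning command id.
    private final Map<String, UUID> exclusiveLocks = new ConcurrentHashMap<>();

    private static String key(UUID entityId, LockGroup group) {
        return entityId + "=" + group;
    }

    /** Try to acquire; returns true if this command now owns the lock. */
    public boolean tryAcquire(UUID entityId, LockGroup group, UUID commandId) {
        return exclusiveLocks.putIfAbsent(key(entityId, group), commandId) == null;
    }

    /** Release only if held by the given command (mirrors "released when the command completes"). */
    public void release(UUID entityId, LockGroup group, UUID commandId) {
        exclusiveLocks.remove(key(entityId, group), commandId);
    }

    /** Rough analogue of the JMX 'showall' output. */
    public String showAll() {
        return exclusiveLocks.toString();
    }
}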

regarding:

> But I cannot find what is holding 0x00000006ce548628.

So I reviewed the log and followed the host (lxf207s001, d0a78da6-dcf4-450f-96d4-e6ddf7f25ca6).

A setup network command was triggered for that host: 

2018-06-28 17:00:09,028+02 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (default task-101) [7ca676c6-2e90-4790-b1a0-90a355520ac8] Lock Acquired to object 'EngineLock:{exclusiveLocks='[HOST_NETWORKd0a78da6-dcf4-450f-96d4-e6ddf7f25ca6=HOST_NETWORK]', sharedLocks=''}'

First, the lock for (host, HOST_NETWORK) is taken: HOST_NETWORKd0a78da6-dcf4-450f-96d4-e6ddf7f25ca6=HOST_NETWORK

2018-06-28 17:00:09,317+02 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (default task-101) [7ca676c6-2e90-4790-b1a0-90a355520ac8] Before acquiring lock in order to prevent monitoring for host 'lxf207s001' from data-center 'lxf207f208d001'
2018-06-28 17:00:09,317+02 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (default task-101) [7ca676c6-2e90-4790-b1a0-90a355520ac8] Lock acquired, from now a monitoring of host will be skipped for host 'lxf207s001' from data-center 'lxf207f208d001'

Next, the lock for (host, VDS_INIT) is taken. This is exactly the lock the ActivateVdsCommand is waiting for.

In the brackets you can see (default task-101), which is the thread name. We'll use it to find the exact thread in the thread-dump you've attached to this bug. But first, following this command, it seems that neither the monitoring lock nor the setup networks command's own lock was ever released.

The HostSetupNetworksCommand started at 2018-06-28 17:00:09, and by the end of the log file (2018-06-29 03:33) it had not finished. Therefore the locks required to re-enable host monitoring are still held by that command, which leads to the repeated messages of:

2018-06-29 03:33:01,467+02 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-6-thread-11) [296f9531-f270-40ca-84cd-8a70d95c346c] Failed to acquire lock and wait lock 'HostEngineLock:{exclusiveLocks='[d0a78da6-dcf4-450f-96d4-e6ddf7f25ca6=VDS_INIT]', sharedLocks=''}'
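(As a side note, the expectation is that this monitoring lock is held only for the lifetime of the command and released on every exit path. A rough sketch of that expectation, with stand-in method names rather than the engine's actual API:

// Illustrative only; the method names are stand-ins, not verbatim engine code.
public class MonitorLockSketch {

    public void runSetupNetworks() {
        acquireMonitorLock();               // blocks monitoring for this host (VDS_INIT group)
        try {
            invokeSetupNetworksCommand();   // if this never returns, the finally block never
                                            // runs, and monitoring + ActivateVdsCommand stay blocked
        } finally {
            releaseMonitorLock();           // normally restores monitoring when the command ends
        }
    }

    private void acquireMonitorLock() { /* acquire (host, VDS_INIT) exclusively */ }
    private void releaseMonitorLock() { /* release (host, VDS_INIT) */ }
    private void invokeSetupNetworksCommand() { /* send SetupNetworks to VDSM and poll */ }
}

So the question is why the command never finished.)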

Looking at the thread from the thread-dump shows that the command is stuck on:

"default task-101" #387 prio=5 os_prio=0 tid=0x00007f1b8800eea0 nid=0xce64 waiting on condition [0x00007f1b6cf95000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.waitBetweenPolls(HostSetupNetworkPoller.java:74)
	at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.poll(HostSetupNetworkPoller.java:32)
	at org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand.invokeSetupNetworksCommand(HostSetupNetworksCommand.java:429)
	at org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand.executeCommand(HostSetupNetworksCommand.java:290)
	at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1255)
	at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1395)
	at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2059)
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164)
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103)
	at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1455)
	at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397)
	at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13)
	at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:516)
	at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:498)
	at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:451

The exact piece of code is here:
https://github.com/oVirt/ovirt-engine/blob/ovirt-4.1.4.z/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/network/host/HostSetupNetworkPoller.java#L71

Since the last log entry for that thread in engine.log is:
2018-06-28 17:16:46,774+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (default task-101) [7ca676c6-2e90-4790-b1a0-90a355520ac8] Timeout waiting for VDSM response: Internal timeout occured

it means the while loop never exits (otherwise we would have seen additional poll failures):

        while (timestampOfEndOfPreviousInvocation != null
                && currentTimestamp() - timestampOfEndOfPreviousInvocation < POLLING_BREAK_IN_NANOS) {
            try {
                Thread.sleep(POLLING_BREAK);
            } catch (InterruptedException e) {
                // ignore.
            }
        }

The only way this while loop can run forever is if currentTimestamp(), which calls System.nanoTime(), returns a value smaller than the previous timestamp.

According to various blogs, System.nanoTime() can misbehave in multi-CPU environments, which could cause this while loop to run for a very long time.
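Just to illustrate the idea (this is only a sketch, not the actual fix that was merged), a polling break that is also capped by a fixed number of sleeps cannot spin forever even if the clock misbehaves; the constants and names below are illustrative:

import java.util.concurrent.TimeUnit;

// Sketch: bound the wait by an iteration cap as well as elapsed time, so a
// misbehaving System.nanoTime() cannot keep the loop spinning indefinitely.
public class BoundedPollingBreak {

    private static final long POLLING_BREAK_MILLIS = 1000;
    private static final long POLLING_BREAK_NANOS = TimeUnit.MILLISECONDS.toNanos(POLLING_BREAK_MILLIS);
    private static final int MAX_SLEEPS = 10;   // hard cap, independent of the clock

    public static void waitBetweenPolls(Long timestampOfEndOfPreviousInvocation) {
        if (timestampOfEndOfPreviousInvocation == null) {
            return;
        }
        int sleeps = 0;
        while (System.nanoTime() - timestampOfEndOfPreviousInvocation < POLLING_BREAK_NANOS
                && sleeps++ < MAX_SLEEPS) {
            try {
                Thread.sleep(POLLING_BREAK_MILLIS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve interrupt status instead of ignoring it
                return;
            }
        }
    }
}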

Germano, could you describe the setup this ovirt-engine java process is running on?
Also, is it possible to grab another thread-dump from the same environment, along with the latest engine.log?
Would it be possible to reproduce this issue with TRACE debug level?

Regardless of all the above, the HostSetupNetworksCommand shouldn't run for several hours, and this should be fixed.

Comment 5 Germano Veit Michel 2018-07-04 23:38:10 UTC
Hi Moti, 

Thanks a lot for the explanation ;)

(In reply to Moti Asayag from comment #4)
> Germano, could you describe the setup on this ovirt-engine java process is
> running on ?
It is a VMware VM with 12 CPUs. TSC is the clocksource and the CPUs support constant tsc.

> Also, is it possible to grab another thread-dump from the same env and also
> the latest version of the engine.log ?
I will request it. Not sure if the problem is still there.

> Will it be possible to reproduce this issue with TRACE debug level ?
I don't think this is easily reproducible. They have already run the same SetupNetworks operation on dozens of hosts, but this lockup happened only on this one. Also, this is a very busy production environment where scale issues have already been spotted; there are several DCs with several dozen hosts each. I don't think we should request debug logs unless it can be reproduced on a test environment.

Would vdsm logs from the host help? Maybe for a particular host we can enable debug.

> Regardless all of the above, the HostSetupNetworkCommand shouldn't last
> several hours and this should be fixed.
Makes sense.

Keeping the needinfo on me to gather new engine.log/console.log.

Comment 6 Moti Asayag 2018-07-05 07:57:41 UTC
(In reply to Germano Veit Michel from comment #5)
> Hi Moti, 
> 
> Thanks a lot for the explanation ;)
> 
> (In reply to Moti Asayag from comment #4)
> > Germano, could you describe the setup on this ovirt-engine java process is
> > running on ?
> It is a VMware VM with 12 CPUs. TSC is the clocksource and the CPUs support
> constant tsc.
> 
> > Also, is it possible to grab another thread-dump from the same env and also
> > the latest version of the engine.log ?
> I will request it. Not sure if the problem is still there.
> 
> > Will it be possible to reproduce this issue with TRACE debug level ?
> I dont think this is easily reproducible. They did the same SetupNetworks
> operation on dozens of hosts already, but this lockup only happened on this
> one. Also this is a very busy production environment where scale issues have
> already been spotted, there are several DCs with several dozens of hosts. I
> don't think we should request debug logs unless it can be reproduced on a
> test environment.
> 
> Would vdsm logs from the host help? Maybe for a particular host we can
> enable debug.

Yes, vdsm.log and supervdsm.log of lxf207s001 can shed light on the failure of setup networks.

> 
> > Regardless all of the above, the HostSetupNetworkCommand shouldn't last
> > several hours and this should be fixed.
> Makes sense.
> 
> Keeping the needinfo on me to gather new engine.log/console.log.

Comment 7 Germano Veit Michel 2018-07-06 01:36:19 UTC
Hi Moti,

I've got a new LogCollector from the engine side.

The problem went away here:

1. Last message:

2018-06-29 09:07:13,866+02 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-6-thread-11) [296f9531-f270-40ca-84cd-8a70d95c346c] Failed to acquire lock and wait lock 'HostEngineLock:{exclusiveLocks='[d0a78da6-dcf4-450f-96d4-e6ddf7f25ca6=VDS_INIT]', sharedLocks=''}'  

2. SetupNetworks finished  
                                  
2018-06-29 09:07:13,901+02 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (default task-101) [7ca676c6-2e90-4790-b1a0-90a355520ac8] Host setup networks finished. Lock released. Monitoring can run now for host 'lxf207s001' from data-center 'lxf207f208d001'  

3. But it seems it just finished because the engine was restarted:

2018-06-29 09:07:13,901+02 ERROR [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (default task-101) [7ca676c6-2e90-4790-b1a0-90a355520ac8] Command 'org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand' failed: WFLYEJB0421: Invocation cannot proceed as component is shutting down  

So the console.log I got with the new LogCollector is useless, as the engine was restarted and the problem is not there anymore.

I'll try to get some logs from the host.

Comment 9 Germano Veit Michel 2018-07-13 00:08:35 UTC
Moti, we have the host logs.

It looks like before the setupNetworks that hung, there was a previous setupNetworks that failed the connectivity check. It does not look like the engine and vdsm were able to talk at all when the subsequent setupNetworks hung.

This is the previous setupNetworks:

MainProcess|jsonrpc/5::ERROR::2018-06-28 16:05:21,499::supervdsmServer::94::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks                                                         
Traceback (most recent call last):                                                                                                                                                             
  File "/usr/share/vdsm/supervdsmServer", line 92, in wrapper                                                                                                                                  
    res = func(*args, **kwargs)                                                                                                                                                                
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 218, in setupNetworks                                                                                                      
    _setup_networks(networks, bondings, options)                                                                                                                                               
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 239, in _setup_networks                                                                                                    
    netswitch.setup(networks, bondings, options, in_rollback)                                                                                                                                  
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch.py", line 132, in setup                                                                                                        
    _setup_legacy(legacy_nets, legacy_bonds, options, in_rollback)                                                                                                                             
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch.py", line 155, in _setup_legacy                                                                                                
    connectivity.check(options)                                                                                                                                                                
  File "/usr/lib/python2.7/site-packages/vdsm/network/connectivity.py", line 46, in check                                                                                                      
    'connectivity check failed')                                                                                                                                                               
ConfigNetworkError: (10, 'connectivity check failed') 

And then we had the one that hung, which you identified in comment #4. It does not look like that one ever reached the host.

Comment 12 Moti Asayag 2018-07-15 08:12:40 UTC
Thanks for the additional info; I'm moving this bug to the Network team to examine the failure.

Comment 14 Lukas Svaty 2018-07-24 08:12:12 UTC
Hi, can you provide reproduction steps?

Comment 15 Dan Kenigsberg 2018-07-24 19:32:36 UTC
As I understand it, the bug shows up very rarely, when a host loses connectivity while setupNetworks is initiating. The fix is rather delicate, and I prefer not to backport it to 4.2.6 unless explicitly requested and justified by the customer.

Comment 16 Alona Kaplan 2018-08-06 16:18:22 UTC
(In reply to Lukas Svaty from comment #14)
> Hi, can you provide reproduction steps?

I reproduced it using a breakpoint.
You can try stopping vdsm and then immediately sending a setup networks request.
But as Dan said, the bug shows up very rarely, so it would be hard to reproduce.

Comment 17 Lukas Svaty 2018-12-10 13:48:01 UTC
No sufficient verification steps; moving this to VERIFIED. Please reopen if you are able to reproduce it.

Comment 19 errata-xmlrpc 2019-05-08 12:37:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 22 Daniel Gur 2019-08-28 13:14:26 UTC
sync2jira

Comment 23 Daniel Gur 2019-08-28 13:19:28 UTC
sync2jira

