
Bug 1641836

Summary: After network outage, 2 hosts hit vdsTimeout for commands until engine restarted.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: vdsm-jsonrpc-java
Assignee: Ravi Nori <rnori>
Status: CLOSED DUPLICATE
QA Contact: Lukas Svaty <lsvaty>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.2.6
CC: gveitmic, lsurette, mperina, Rhev-m-bugs, rnori
Target Milestone: ---
Flags: lsvaty: testing_plan_complete-
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-01 00:28:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2018-10-23 00:26:38 UTC
Description of problem:

1. Network outage around 09:22am
2. HE HA VM fails over, restarted at 09:28
3. Network is restored shortly after, but out of the 5 hosts, 2 remained Not Responding until the engine was restarted again. These are blade01 and blade05.

These 2 hosts were constantly failing with vdsTimeout, but their VDSM logs are clean. The commands the engine is complaining about (vdsTimeout) never reached those hosts. Upon engine restart they went Up.

2018-10-19 10:12:07,604+13 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-12) [] Command 'GetAllVmStatsVDSCommand(HostName = blade05, VdsIdVDSCommandParametersBase:{hostId='56847cc8-d0a6-4f0d-842e-d979d05b8fe3'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

2018-10-19 10:12:07,609+13 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] Command 'GetAllVmStatsVDSCommand(HostName = blade01, VdsIdVDSCommandParametersBase:{hostId='9ca5b2e4-d554-4989-9f4d-cb1d22514aaf'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

And we have 2 threads apparently waiting on a jsonrpc response (a short illustrative sketch of this waiting pattern follows the traces below):

"EE-ManagedThreadFactory-engineScheduled-Thread-59" #212 daemon prio=5 os_prio=0 tid=0x0000000006c60800 nid=0x7b1 waiting on condition [0x00007f1db031a000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000ff3163a8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at org.ovirt.vdsm.jsonrpc.client.internal.BatchCall.get(BatchCall.java:60)
        at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:53)
        at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:20)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.lazyEval(FutureMap.java:112)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.get(FutureMap.java:247)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VMInfoListReturn.<init>(VMInfoListReturn.java:15)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getAllVmStats(JsonRpcVdsServer.java:389)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand.executeVdsBrokerCommand(GetAllVmStatsVDSCommand.java:22)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVdsCommandWithNetworkEvent(VdsBrokerCommand.java:123)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)

"EE-ManagedThreadFactory-engineScheduled-Thread-65" #219 daemon prio=5 os_prio=0 tid=0x0000000006e2f000 nid=0x7b9 waiting on condition [0x00007f1db2d42000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000ff3162e8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at org.ovirt.vdsm.jsonrpc.client.internal.BatchCall.get(BatchCall.java:60)
        at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:53)
        at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:20)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.lazyEval(FutureMap.java:112)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.get(FutureMap.java:247)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VMInfoListReturn.<init>(VMInfoListReturn.java:15)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getAllVmStats(JsonRpcVdsServer.java:389)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand.executeVdsBrokerCommand(GetAllVmStatsVDSCommand.java:22)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVdsCommandWithNetworkEvent(VdsBrokerCommand.java:123)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)


"SSL Stomp Reactor" #126 daemon prio=5 os_prio=0 tid=0x00000000065bd000 nid=0x74f runnable [0x00007f1db5992000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00000000918e9c28> (a sun.nio.ch.Util$3)
        - locked <0x00000000918e9c18> (a java.util.Collections$UnmodifiableSet)
        - locked <0x0000000091793b80> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.select(Reactor.java:47)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:59)

Version-Release number of selected component (if applicable):
vdsm-jsonrpc-java-1.4.14-1.el7ev.noarch
ovirt-engine-4.2.6.4-0.1.el7ev.noarch

How reproducible:
Unknown

Comment 4 Piotr Kliczewski 2018-10-23 09:01:35 UTC
Based on the logs I see that the engine was down for some time:

2018-10-19 09:24:52,003+13 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] Timer update runtime info failed. Exception:
2018-10-19 09:28:03,478+13 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 42) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

When the engine started there were attempts to reconnect to the hosts, and we still see networking issues at:

2018-10-19 09:28:31,757+13 ERROR [org.ovirt.engine.core.vdsbroker.HostDevListByCapsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-7) [b5ae4d2] Command 'HostDevListByCapsVDSCommand(HostName = blade03, VdsIdAndVdsVDSCommandParametersBase:{hostId='96817980-8f48-4f19-9ab1-ee1656b513ec', vds='Host[blade03,96817980-8f48-4f19-9ab1-ee1656b513ec]'})' execution failed: java.net.UnknownHostException: wlg-bladechassis1-blade03: Name or service not known

The last ERROR message from the log is:

2018-10-19 10:12:07,609+13 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] Command 'GetAllVmStatsVDSCommand(HostName = blade01, VdsIdVDSCommandParametersBase:{hostId='9ca5b2e4-d554-4989-9f4d-cb1d22514aaf'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

It is important to note that there were no communication issues in the log prior to the restart. There was only:

2018-10-19 08:56:10,365+13 ERROR [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default task-27) [f33c20be-78ee-4b29-aba8-547a717e17ce] Command 'org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand' failed: CallableStatementCallback; SQL [{call insertdiskvmelement(?, ?, ?, ?, ?, ?)}]; ERROR: duplicate key value violates unique constraint "pk_disk_vm_element"

which could be handled as a separate BZ.

Germano, please provide the full engine log. I would like to see at what point the hosts recovered. Please also provide the logs from the failing hosts.

Comment 5 Germano Veit Michel 2018-10-24 00:46:48 UTC
Hi Piotr,

Let me try to clarify it.

(In reply to Piotr Kliczewski from comment #4)
> Based on the logs I see that the engine was down for some time:
> 
> 2018-10-19 09:24:52,003+13 ERROR
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] Timer update runtime
> info failed. Exception:
> 2018-10-19 09:28:03,478+13 INFO 
> [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread
> Pool -- 42) [] Loaded file
> '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

Correct, this is the HostedEngine failover at 09:28, item 2 on comment #0.

> 
> When the engine started there were attempts to reconnect to the hosts and we
> see still networking issues at:
> 
> 2018-10-19 09:28:31,757+13 ERROR
> [org.ovirt.engine.core.vdsbroker.HostDevListByCapsVDSCommand]
> (EE-ManagedThreadFactory-engine-Thread-7) [b5ae4d2] Command
> 'HostDevListByCapsVDSCommand(HostName = blade03,
> VdsIdAndVdsVDSCommandParametersBase:{hostId='96817980-8f48-4f19-9ab1-
> ee1656b513ec', vds='Host[blade03,96817980-8f48-4f19-9ab1-ee1656b513ec]'})'
> execution failed: java.net.UnknownHostException: wlg-bladechassis1-blade03:
> Name or service not known

I think this is the part where I may have confused you.

This restart at 09:28 happened during the network outage (which caused the HostedEngine failover); it is not the restart that solved the problem. This is where the suspected engine hang occurred, due to some bad network issue, and the engine then never recovered until it was restarted again later.

The restart that fixed the problem was at 10:24.

> 
> The last ERROR message from the log is:
> 
> 2018-10-19 10:12:07,609+13 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand]
> (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] Command
> 'GetAllVmStatsVDSCommand(HostName = blade01,
> VdsIdVDSCommandParametersBase:{hostId='9ca5b2e4-d554-4989-9f4d-
> cb1d22514aaf'})' execution failed: VDSGenericException: VDSNetworkException:
> Message timeout which can be caused by communication issues

This is already during the problem period. The network outage ended at around 09:29. You can see that after that only blade01 and blade05 have problems. These are the 2 that were stuck and were fixed by the engine restart.

> Germano please provide full engine log. I would like to see at what point
> the hosts recovered and please provide failing hosts logs as well.

Yes, I now realize I only gave you logs covering the first restart. I'm attaching the right one, which also covers the second restart. Still, the console.log is from before the second restart, which is correct.

Comment 16 Germano Veit Michel 2018-11-01 00:28:10 UTC
All good. Thanks, I just wanted to make sure this is not a different bug.
By the way, our own IT just hit the same on 4.2.6.

*** This bug has been marked as a duplicate of bug 1582379 ***