1380356 – Engine confused by external network provider not responding to create-port command

Bug 1380356 - Engine confused by external network provider not responding to create-port command

Summary: Engine confused by external network provider not responding to create-port co...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	General
Sub Component:
Version:	4.1.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-4.1.1
Target Release:	4.1.1.3
Assignee:	Dominik Holler
QA Contact:	Meni Yakove
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2016-09-29 11:42 UTC by Mor
Modified:	2017-04-21 09:53 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-04-21 09:53:31 UTC
oVirt Team:	Network
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.1+ ylavi: exception+

Attachments	(Terms of Use)
oVirt engine log (7.30 MB, text/plain) 2016-09-29 11:42 UTC, Mor	no flags	Details
console with thread dump (111.06 KB, text/plain) 2016-10-05 14:02 UTC, Mor	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	70135	master	MERGED	engine: Add timeout to communication with network provider	2020-05-12 04:44:15 UTC
oVirt gerrit	72985	ovirt-engine-4.1	MERGED	engine: Add timeout to communication with network provider	2020-05-12 04:44:15 UTC
oVirt gerrit	72992	ovirt-engine-4.1.1.z	MERGED	engine: Add timeout to communication with network provider	2020-05-12 04:44:15 UTC

Description Mor 2016-09-29 11:42:54 UTC

Created attachment 1205931 [details]
oVirt engine log

Description of problem:
I am trying to a start a VM on oVirt 4.1 environment, and I'm getting an error in the UI, and the VM doesn't start.

Engine.log relevant error:
2016-09-29 14:20:57,275 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-7) [1c832914] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_OBJECT_LOCKED

Version-Release number of selected component (if applicable):
4.1.0-0.0.master.20160927231316.git7784131.el7.centos

Steps to Reproduce:
1. Run a VM on the engine

Actual results:
Error message is displayed in the UI:
"Cannot run VM. Related operation is currently in progress. Please try again later."

Expected results:
VM should run

Additional info:
Bug in oVirt 3.5: https://bugzilla.redhat.com/show_bug.cgi?id=1169879

engine job table:
engine=# select start_time,action_type,job_id,correlation_id,status from job;
         start_time         |     action_type      |                job_id                | correlation_id |  status  
----------------------------+----------------------+--------------------------------------+----------------+----------
 2016-09-29 14:19:59.651+03 | HostSetupNetworks    | 7a199845-92b2-4d42-ba70-cec670a0d5b0 | 68dbb034       | FINISHED
 2016-09-29 09:29:47.799+03 | RunVm                | 9da0aa38-98d9-4c1d-a8ed-2e4b296f8a0a | 24efd1eb       | FINISHED
 2016-09-29 14:20:08.118+03 | CommitNetworkChanges | f95e2ce6-4005-4495-854f-4387c1ddabd6 | 723dec7        | FINISHED
 2016-09-29 13:28:36.4+03   | RunVm                | bb50a074-42fa-4718-bd34-dad717c82fe6 | 6777824f       | STARTED
 2016-09-29 14:17:34.868+03 | RunVm                | 8901e37d-b645-4aa1-b3d7-706b2ea7c634 | 71d4ae3d       | FINISHED
 2016-09-29 14:22:08.742+03 | HostSetupNetworks    | 29b34608-f84d-4295-aa24-ea22a3ed3402 | 269578bb       | FINISHED
 2016-09-29 14:22:17.81+03  | CommitNetworkChanges | 5d60bc63-3423-46af-9ca8-2557b6c67887 | 37e4ab21       | FINISHED
 2016-09-29 14:16:05.56+03  | RunVm                | 3a0350c7-298b-44be-9f9c-f77e91576701 | 7d9a4a23       | STARTED

Comment 1 Tomas Jelinek 2016-09-30 08:24:22 UTC

@Arik: It looks like some race when the run VM is called twice and the second run fails getting the lock, the VM will never be unlocked.

Please check if it can happen on 4.0 as well

Comment 2 Arik 2016-10-02 08:53:37 UTC

(In reply to Tomas Jelinek from comment #1)
It does not seem to be a race between two RunVm commands as it was in bz 1169879.

Mor,
1. Was this VM configured with external network?
2. How reproducible is it - can we get thread-dump?

Comment 3 Mor 2016-10-05 10:34:18 UTC

Hi Arik,

1. Yes, it is configured with OVN external network
2. Currently, it varies, I'm doing a maintenance on the environment, I'll see if it's still reproducible

Comment 4 Arik 2016-10-05 12:20:11 UTC

I'm almost entirely sure it is somehow related to the network's handling:

In the attached log I see:
2016-09-29 13:28:36,631 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVmDynamicDataVDSCommand] (org.ovirt.thread.pool-6-thread-32) [6777824f] FINISH, UpdateVmDynamicDataVDSCommand

This means that the RunVm command (job: bb50a074-42fa-4718-bd34-dad717c82fe6) reached the point in RunVmCommand#createVm where updateCurrentCd is called.

Then, this command does not write anything to the log. It should have logged the invocation of CreateVm command. The fact that it didn't probably means that it got into a deadlock somewhere after the call to updateCurrentCd and before the call to CreateVm.

There is not way that the settings of cd-path, floppy-path, initrd-url and kernel-url lead to a deadlock. I also do not see how the handling of passthrough-vnics lead to it. The main suspect is therefore the handling of external-network (initParametersForExternalNetworks) - this code interacts with openstack code that I'm not familiar with.

Thread-dump could make things much more clear. Without it, I think the best chances to discover the problem would be by examining the openstack code we interact with.

Comment 5 Mor 2016-10-05 14:02:09 UTC

Created attachment 1207611 [details]
console with thread dump

Comment 6 Arik 2016-10-05 14:06:21 UTC

Indeed, the call to initParametersForExternalNetworks is the problematic part (although it doesn't lead to a deadlock):

"org.ovirt.thread.pool-6-thread-38" #204 prio=5 os_prio=0 tid=0x00007fee5c16a000 nid=0x26ad runnable [0x00007fee52c85000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.jboss.resteasy.client.core.executors.ApacheHttpClient4Executor.execute(ApacheHttpClient4Executor.java:199)
        at org.jboss.resteasy.client.ClientRequest.execute(ClientRequest.java:439)
        at org.jboss.resteasy.client.ClientRequest.httpMethod(ClientRequest.java:686)
        at org.jboss.resteasy.client.ClientRequest.httpMethod(ClientRequest.java:692)
        at com.woorea.openstack.connector.RESTEasyConnector.request(RESTEasyConnector.java:105)
        at com.woorea.openstack.base.client.OpenStackClient.request(OpenStackClient.java:51)
        at com.woorea.openstack.base.client.OpenStackClient.execute(OpenStackClient.java:66)
        at com.woorea.openstack.base.client.OpenStackRequest.execute(OpenStackRequest.java:98)
        at org.ovirt.engine.core.bll.provider.network.openstack.BaseNetworkProviderProxy.locatePort(BaseNetworkProviderProxy.java:365)
        at org.ovirt.engine.core.bll.provider.network.openstack.BaseNetworkProviderProxy.allocate(BaseNetworkProviderProxy.java:238)
        at org.ovirt.engine.core.bll.RunVmCommand.initParametersForExternalNetworks(RunVmCommand.java:602)
        at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:557)
        at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:240)
        at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:413)
        at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:338)

Comment 7 Arik 2016-11-13 12:20:07 UTC

We need to first understand why did the external provider not respond to the create-port command.
Regardless, Engine should protect itself from non-responding external entities. Simple timeout is not enough, since we need to handle the case where timeout expires after the port was created on the external provider. Engine should probably take care of clearing that resource, or it would leak.

Comment 8 Mor 2017-03-06 13:20:16 UTC

To verify this, I'm thinking of shutting-down ovirt-provider-ovn service when a VM is using a pre-defined OVN network, and then try to run it. See how engine handle non-responding provider. 

Dan, do you have any other ideas for additional flow(s) that I should add to check for leaks?

Comment 9 Dan Kenigsberg 2017-03-06 13:36:52 UTC

The bug assignee is Dominik; questions should be directed to him.

I suggest `kill -STOP pid-of-provider` (and -CONT to let it run again)

Comment 10 Dominik Holler 2017-03-06 13:37:49 UTC

Just shutting-down the ovirt-provider-ovn is not enough, since this way the engine would just be unable to connect instead of running in a timeout.

I would verify like this:
1. shutting-down ovirt-provider-ovn service
2. create a not responding replacement by: nc -k -l 9696

Comment 11 Dominik Holler 2017-03-06 13:52:09 UTC

Dan's idea from comment #9 works, too.

Comment 12 Mor 2017-03-06 13:57:46 UTC

Will trt(In reply to Dominik Holler from comment #11)
> Dan's idea from comment #9 works, too.

Will try both suggestions. Thanks for clarifying.

Comment 13 Mor 2017-03-06 16:07:01 UTC

Verified on:
Red Hat Virtualization Manager Version: 4.1.1.3-0.1.el7

I get a generic "Failed to run VM" event, and while it waits for connection timeout, I get "Cannot run VM. Related operation is currently in progress" dialog box. 

I am opening another bug to change the event message to be more specific event like: "Failed to contact provider: <name>. Verify that it is listening and running".

Note You need to log in before you can comment on or make changes to this bug.