Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1380356

Summary: Engine confused by external network provider not responding to create-port command
Product: [oVirt] ovirt-engine Reporter: Mor <mkalfon>
Component: GeneralAssignee: Dominik Holler <dholler>
Status: CLOSED CURRENTRELEASE QA Contact: Meni Yakove <myakove>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1.0CC: ahadas, bugs, danken, dholler, gklein, mkalfon, mmirecki, tjelinek, ylavi
Target Milestone: ovirt-4.1.1Flags: rule-engine: ovirt-4.1+
ylavi: exception+
Target Release: 4.1.1.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-21 09:53:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
oVirt engine log
none
console with thread dump none

Description Mor 2016-09-29 11:42:54 UTC
Created attachment 1205931 [details]
oVirt engine log

Description of problem:
I am trying to a start a VM on oVirt 4.1 environment, and I'm getting an error in the UI, and the VM doesn't start.

Engine.log relevant error:
2016-09-29 14:20:57,275 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-7) [1c832914] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_OBJECT_LOCKED

Version-Release number of selected component (if applicable):
4.1.0-0.0.master.20160927231316.git7784131.el7.centos

Steps to Reproduce:
1. Run a VM on the engine

Actual results:
Error message is displayed in the UI:
"Cannot run VM. Related operation is currently in progress. Please try again later."

Expected results:
VM should run

Additional info:
Bug in oVirt 3.5: https://bugzilla.redhat.com/show_bug.cgi?id=1169879

engine job table:
engine=# select start_time,action_type,job_id,correlation_id,status from job;
         start_time         |     action_type      |                job_id                | correlation_id |  status  
----------------------------+----------------------+--------------------------------------+----------------+----------
 2016-09-29 14:19:59.651+03 | HostSetupNetworks    | 7a199845-92b2-4d42-ba70-cec670a0d5b0 | 68dbb034       | FINISHED
 2016-09-29 09:29:47.799+03 | RunVm                | 9da0aa38-98d9-4c1d-a8ed-2e4b296f8a0a | 24efd1eb       | FINISHED
 2016-09-29 14:20:08.118+03 | CommitNetworkChanges | f95e2ce6-4005-4495-854f-4387c1ddabd6 | 723dec7        | FINISHED
 2016-09-29 13:28:36.4+03   | RunVm                | bb50a074-42fa-4718-bd34-dad717c82fe6 | 6777824f       | STARTED
 2016-09-29 14:17:34.868+03 | RunVm                | 8901e37d-b645-4aa1-b3d7-706b2ea7c634 | 71d4ae3d       | FINISHED
 2016-09-29 14:22:08.742+03 | HostSetupNetworks    | 29b34608-f84d-4295-aa24-ea22a3ed3402 | 269578bb       | FINISHED
 2016-09-29 14:22:17.81+03  | CommitNetworkChanges | 5d60bc63-3423-46af-9ca8-2557b6c67887 | 37e4ab21       | FINISHED
 2016-09-29 14:16:05.56+03  | RunVm                | 3a0350c7-298b-44be-9f9c-f77e91576701 | 7d9a4a23       | STARTED

Comment 1 Tomas Jelinek 2016-09-30 08:24:22 UTC
@Arik: It looks like some race when the run VM is called twice and the second run fails getting the lock, the VM will never be unlocked.

Please check if it can happen on 4.0 as well

Comment 2 Arik 2016-10-02 08:53:37 UTC
(In reply to Tomas Jelinek from comment #1)
It does not seem to be a race between two RunVm commands as it was in bz 1169879.

Mor,
1. Was this VM configured with external network?
2. How reproducible is it - can we get thread-dump?

Comment 3 Mor 2016-10-05 10:34:18 UTC
Hi Arik,

1. Yes, it is configured with OVN external network
2. Currently, it varies, I'm doing a maintenance on the environment, I'll see if it's still reproducible

Comment 4 Arik 2016-10-05 12:20:11 UTC
I'm almost entirely sure it is somehow related to the network's handling:

In the attached log I see:
2016-09-29 13:28:36,631 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVmDynamicDataVDSCommand] (org.ovirt.thread.pool-6-thread-32) [6777824f] FINISH, UpdateVmDynamicDataVDSCommand

This means that the RunVm command (job: bb50a074-42fa-4718-bd34-dad717c82fe6) reached the point in RunVmCommand#createVm where updateCurrentCd is called.

Then, this command does not write anything to the log. It should have logged the invocation of CreateVm command. The fact that it didn't probably means that it got into a deadlock somewhere after the call to updateCurrentCd and before the call to CreateVm.

There is not way that the settings of cd-path, floppy-path, initrd-url and kernel-url lead to a deadlock. I also do not see how the handling of passthrough-vnics lead to it. The main suspect is therefore the handling of external-network (initParametersForExternalNetworks) - this code interacts with openstack code that I'm not familiar with.

Thread-dump could make things much more clear. Without it, I think the best chances to discover the problem would be by examining the openstack code we interact with.

Comment 5 Mor 2016-10-05 14:02:09 UTC
Created attachment 1207611 [details]
console with thread dump

Comment 6 Arik 2016-10-05 14:06:21 UTC
Indeed, the call to initParametersForExternalNetworks is the problematic part (although it doesn't lead to a deadlock):

"org.ovirt.thread.pool-6-thread-38" #204 prio=5 os_prio=0 tid=0x00007fee5c16a000 nid=0x26ad runnable [0x00007fee52c85000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.jboss.resteasy.client.core.executors.ApacheHttpClient4Executor.execute(ApacheHttpClient4Executor.java:199)
        at org.jboss.resteasy.client.ClientRequest.execute(ClientRequest.java:439)
        at org.jboss.resteasy.client.ClientRequest.httpMethod(ClientRequest.java:686)
        at org.jboss.resteasy.client.ClientRequest.httpMethod(ClientRequest.java:692)
        at com.woorea.openstack.connector.RESTEasyConnector.request(RESTEasyConnector.java:105)
        at com.woorea.openstack.base.client.OpenStackClient.request(OpenStackClient.java:51)
        at com.woorea.openstack.base.client.OpenStackClient.execute(OpenStackClient.java:66)
        at com.woorea.openstack.base.client.OpenStackRequest.execute(OpenStackRequest.java:98)
        at org.ovirt.engine.core.bll.provider.network.openstack.BaseNetworkProviderProxy.locatePort(BaseNetworkProviderProxy.java:365)
        at org.ovirt.engine.core.bll.provider.network.openstack.BaseNetworkProviderProxy.allocate(BaseNetworkProviderProxy.java:238)
        at org.ovirt.engine.core.bll.RunVmCommand.initParametersForExternalNetworks(RunVmCommand.java:602)
        at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:557)
        at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:240)
        at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:413)
        at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:338)

Comment 7 Arik 2016-11-13 12:20:07 UTC
We need to first understand why did the external provider not respond to the create-port command.
Regardless, Engine should protect itself from non-responding external entities. Simple timeout is not enough, since we need to handle the case where timeout expires after the port was created on the external provider. Engine should probably take care of clearing that resource, or it would leak.

Comment 8 Mor 2017-03-06 13:20:16 UTC
To verify this, I'm thinking of shutting-down ovirt-provider-ovn service when a VM is using a pre-defined OVN network, and then try to run it. See how engine handle non-responding provider. 

Dan, do you have any other ideas for additional flow(s) that I should add to check for leaks?

Comment 9 Dan Kenigsberg 2017-03-06 13:36:52 UTC
The bug assignee is Dominik; questions should be directed to him.

I suggest `kill -STOP pid-of-provider` (and -CONT to let it run again)

Comment 10 Dominik Holler 2017-03-06 13:37:49 UTC
Just shutting-down the ovirt-provider-ovn is not enough, since this way the engine would just be unable to connect instead of running in a timeout.

I would verify like this:
1. shutting-down ovirt-provider-ovn service
2. create a not responding replacement by: nc -k -l 9696

Comment 11 Dominik Holler 2017-03-06 13:52:09 UTC
Dan's idea from comment #9 works, too.

Comment 12 Mor 2017-03-06 13:57:46 UTC
Will trt(In reply to Dominik Holler from comment #11)
> Dan's idea from comment #9 works, too.

Will try both suggestions. Thanks for clarifying.

Comment 13 Mor 2017-03-06 16:07:01 UTC
Verified on:
Red Hat Virtualization Manager Version: 4.1.1.3-0.1.el7

I get a generic "Failed to run VM" event, and while it waits for connection timeout, I get "Cannot run VM. Related operation is currently in progress" dialog box. 

I am opening another bug to change the event message to be more specific event like: "Failed to contact provider: <name>. Verify that it is listening and running".