Bug 1380356
| Summary: | Engine confused by external network provider not responding to create-port command | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Mor <mkalfon> | ||||||
| Component: | General | Assignee: | Dominik Holler <dholler> | ||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Meni Yakove <myakove> | ||||||
| Severity: | medium | Docs Contact: | |||||||
| Priority: | medium | ||||||||
| Version: | 4.1.0 | CC: | ahadas, bugs, danken, dholler, gklein, mkalfon, mmirecki, tjelinek, ylavi | ||||||
| Target Milestone: | ovirt-4.1.1 | Flags: | rule-engine:
ovirt-4.1+
ylavi: exception+ |
||||||
| Target Release: | 4.1.1.3 | ||||||||
| Hardware: | x86_64 | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2017-04-21 09:53:31 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
|
||||||||
|
Description
Mor
2016-09-29 11:42:54 UTC
@Arik: It looks like some race when the run VM is called twice and the second run fails getting the lock, the VM will never be unlocked. Please check if it can happen on 4.0 as well (In reply to Tomas Jelinek from comment #1) It does not seem to be a race between two RunVm commands as it was in bz 1169879. Mor, 1. Was this VM configured with external network? 2. How reproducible is it - can we get thread-dump? Hi Arik, 1. Yes, it is configured with OVN external network 2. Currently, it varies, I'm doing a maintenance on the environment, I'll see if it's still reproducible I'm almost entirely sure it is somehow related to the network's handling: In the attached log I see: 2016-09-29 13:28:36,631 INFO [org.ovirt.engine.core.vdsbroker.UpdateVmDynamicDataVDSCommand] (org.ovirt.thread.pool-6-thread-32) [6777824f] FINISH, UpdateVmDynamicDataVDSCommand This means that the RunVm command (job: bb50a074-42fa-4718-bd34-dad717c82fe6) reached the point in RunVmCommand#createVm where updateCurrentCd is called. Then, this command does not write anything to the log. It should have logged the invocation of CreateVm command. The fact that it didn't probably means that it got into a deadlock somewhere after the call to updateCurrentCd and before the call to CreateVm. There is not way that the settings of cd-path, floppy-path, initrd-url and kernel-url lead to a deadlock. I also do not see how the handling of passthrough-vnics lead to it. The main suspect is therefore the handling of external-network (initParametersForExternalNetworks) - this code interacts with openstack code that I'm not familiar with. Thread-dump could make things much more clear. Without it, I think the best chances to discover the problem would be by examining the openstack code we interact with. Created attachment 1207611 [details]
console with thread dump
Indeed, the call to initParametersForExternalNetworks is the problematic part (although it doesn't lead to a deadlock):
"org.ovirt.thread.pool-6-thread-38" #204 prio=5 os_prio=0 tid=0x00007fee5c16a000 nid=0x26ad runnable [0x00007fee52c85000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.jboss.resteasy.client.core.executors.ApacheHttpClient4Executor.execute(ApacheHttpClient4Executor.java:199)
at org.jboss.resteasy.client.ClientRequest.execute(ClientRequest.java:439)
at org.jboss.resteasy.client.ClientRequest.httpMethod(ClientRequest.java:686)
at org.jboss.resteasy.client.ClientRequest.httpMethod(ClientRequest.java:692)
at com.woorea.openstack.connector.RESTEasyConnector.request(RESTEasyConnector.java:105)
at com.woorea.openstack.base.client.OpenStackClient.request(OpenStackClient.java:51)
at com.woorea.openstack.base.client.OpenStackClient.execute(OpenStackClient.java:66)
at com.woorea.openstack.base.client.OpenStackRequest.execute(OpenStackRequest.java:98)
at org.ovirt.engine.core.bll.provider.network.openstack.BaseNetworkProviderProxy.locatePort(BaseNetworkProviderProxy.java:365)
at org.ovirt.engine.core.bll.provider.network.openstack.BaseNetworkProviderProxy.allocate(BaseNetworkProviderProxy.java:238)
at org.ovirt.engine.core.bll.RunVmCommand.initParametersForExternalNetworks(RunVmCommand.java:602)
at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:557)
at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:240)
at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:413)
at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:338)
We need to first understand why did the external provider not respond to the create-port command. Regardless, Engine should protect itself from non-responding external entities. Simple timeout is not enough, since we need to handle the case where timeout expires after the port was created on the external provider. Engine should probably take care of clearing that resource, or it would leak. To verify this, I'm thinking of shutting-down ovirt-provider-ovn service when a VM is using a pre-defined OVN network, and then try to run it. See how engine handle non-responding provider. Dan, do you have any other ideas for additional flow(s) that I should add to check for leaks? The bug assignee is Dominik; questions should be directed to him. I suggest `kill -STOP pid-of-provider` (and -CONT to let it run again) Just shutting-down the ovirt-provider-ovn is not enough, since this way the engine would just be unable to connect instead of running in a timeout. I would verify like this: 1. shutting-down ovirt-provider-ovn service 2. create a not responding replacement by: nc -k -l 9696 Dan's idea from comment #9 works, too. Will trt(In reply to Dominik Holler from comment #11) > Dan's idea from comment #9 works, too. Will try both suggestions. Thanks for clarifying. Verified on: Red Hat Virtualization Manager Version: 4.1.1.3-0.1.el7 I get a generic "Failed to run VM" event, and while it waits for connection timeout, I get "Cannot run VM. Related operation is currently in progress" dialog box. I am opening another bug to change the event message to be more specific event like: "Failed to contact provider: <name>. Verify that it is listening and running". |