Created attachment 1205931 [details] oVirt engine log Description of problem: I am trying to a start a VM on oVirt 4.1 environment, and I'm getting an error in the UI, and the VM doesn't start. Engine.log relevant error: 2016-09-29 14:20:57,275 WARN [org.ovirt.engine.core.bll.RunVmCommand] (default task-7) [1c832914] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_OBJECT_LOCKED Version-Release number of selected component (if applicable): 4.1.0-0.0.master.20160927231316.git7784131.el7.centos Steps to Reproduce: 1. Run a VM on the engine Actual results: Error message is displayed in the UI: "Cannot run VM. Related operation is currently in progress. Please try again later." Expected results: VM should run Additional info: Bug in oVirt 3.5: https://bugzilla.redhat.com/show_bug.cgi?id=1169879 engine job table: engine=# select start_time,action_type,job_id,correlation_id,status from job; start_time | action_type | job_id | correlation_id | status ----------------------------+----------------------+--------------------------------------+----------------+---------- 2016-09-29 14:19:59.651+03 | HostSetupNetworks | 7a199845-92b2-4d42-ba70-cec670a0d5b0 | 68dbb034 | FINISHED 2016-09-29 09:29:47.799+03 | RunVm | 9da0aa38-98d9-4c1d-a8ed-2e4b296f8a0a | 24efd1eb | FINISHED 2016-09-29 14:20:08.118+03 | CommitNetworkChanges | f95e2ce6-4005-4495-854f-4387c1ddabd6 | 723dec7 | FINISHED 2016-09-29 13:28:36.4+03 | RunVm | bb50a074-42fa-4718-bd34-dad717c82fe6 | 6777824f | STARTED 2016-09-29 14:17:34.868+03 | RunVm | 8901e37d-b645-4aa1-b3d7-706b2ea7c634 | 71d4ae3d | FINISHED 2016-09-29 14:22:08.742+03 | HostSetupNetworks | 29b34608-f84d-4295-aa24-ea22a3ed3402 | 269578bb | FINISHED 2016-09-29 14:22:17.81+03 | CommitNetworkChanges | 5d60bc63-3423-46af-9ca8-2557b6c67887 | 37e4ab21 | FINISHED 2016-09-29 14:16:05.56+03 | RunVm | 3a0350c7-298b-44be-9f9c-f77e91576701 | 7d9a4a23 | STARTED
@Arik: It looks like some race when the run VM is called twice and the second run fails getting the lock, the VM will never be unlocked. Please check if it can happen on 4.0 as well
(In reply to Tomas Jelinek from comment #1) It does not seem to be a race between two RunVm commands as it was in bz 1169879. Mor, 1. Was this VM configured with external network? 2. How reproducible is it - can we get thread-dump?
Hi Arik, 1. Yes, it is configured with OVN external network 2. Currently, it varies, I'm doing a maintenance on the environment, I'll see if it's still reproducible
I'm almost entirely sure it is somehow related to the network's handling: In the attached log I see: 2016-09-29 13:28:36,631 INFO [org.ovirt.engine.core.vdsbroker.UpdateVmDynamicDataVDSCommand] (org.ovirt.thread.pool-6-thread-32) [6777824f] FINISH, UpdateVmDynamicDataVDSCommand This means that the RunVm command (job: bb50a074-42fa-4718-bd34-dad717c82fe6) reached the point in RunVmCommand#createVm where updateCurrentCd is called. Then, this command does not write anything to the log. It should have logged the invocation of CreateVm command. The fact that it didn't probably means that it got into a deadlock somewhere after the call to updateCurrentCd and before the call to CreateVm. There is not way that the settings of cd-path, floppy-path, initrd-url and kernel-url lead to a deadlock. I also do not see how the handling of passthrough-vnics lead to it. The main suspect is therefore the handling of external-network (initParametersForExternalNetworks) - this code interacts with openstack code that I'm not familiar with. Thread-dump could make things much more clear. Without it, I think the best chances to discover the problem would be by examining the openstack code we interact with.
Created attachment 1207611 [details] console with thread dump
Indeed, the call to initParametersForExternalNetworks is the problematic part (although it doesn't lead to a deadlock): "org.ovirt.thread.pool-6-thread-38" #204 prio=5 os_prio=0 tid=0x00007fee5c16a000 nid=0x26ad runnable [0x00007fee52c85000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160) at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261) at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283) at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251) at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124) at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) at org.jboss.resteasy.client.core.executors.ApacheHttpClient4Executor.execute(ApacheHttpClient4Executor.java:199) at org.jboss.resteasy.client.ClientRequest.execute(ClientRequest.java:439) at org.jboss.resteasy.client.ClientRequest.httpMethod(ClientRequest.java:686) at org.jboss.resteasy.client.ClientRequest.httpMethod(ClientRequest.java:692) at com.woorea.openstack.connector.RESTEasyConnector.request(RESTEasyConnector.java:105) at com.woorea.openstack.base.client.OpenStackClient.request(OpenStackClient.java:51) at com.woorea.openstack.base.client.OpenStackClient.execute(OpenStackClient.java:66) at com.woorea.openstack.base.client.OpenStackRequest.execute(OpenStackRequest.java:98) at org.ovirt.engine.core.bll.provider.network.openstack.BaseNetworkProviderProxy.locatePort(BaseNetworkProviderProxy.java:365) at org.ovirt.engine.core.bll.provider.network.openstack.BaseNetworkProviderProxy.allocate(BaseNetworkProviderProxy.java:238) at org.ovirt.engine.core.bll.RunVmCommand.initParametersForExternalNetworks(RunVmCommand.java:602) at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:557) at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:240) at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:413) at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:338)
We need to first understand why did the external provider not respond to the create-port command. Regardless, Engine should protect itself from non-responding external entities. Simple timeout is not enough, since we need to handle the case where timeout expires after the port was created on the external provider. Engine should probably take care of clearing that resource, or it would leak.
To verify this, I'm thinking of shutting-down ovirt-provider-ovn service when a VM is using a pre-defined OVN network, and then try to run it. See how engine handle non-responding provider. Dan, do you have any other ideas for additional flow(s) that I should add to check for leaks?
The bug assignee is Dominik; questions should be directed to him. I suggest `kill -STOP pid-of-provider` (and -CONT to let it run again)
Just shutting-down the ovirt-provider-ovn is not enough, since this way the engine would just be unable to connect instead of running in a timeout. I would verify like this: 1. shutting-down ovirt-provider-ovn service 2. create a not responding replacement by: nc -k -l 9696
Dan's idea from comment #9 works, too.
Will trt(In reply to Dominik Holler from comment #11) > Dan's idea from comment #9 works, too. Will try both suggestions. Thanks for clarifying.
Verified on: Red Hat Virtualization Manager Version: 4.1.1.3-0.1.el7 I get a generic "Failed to run VM" event, and while it waits for connection timeout, I get "Cannot run VM. Related operation is currently in progress" dialog box. I am opening another bug to change the event message to be more specific event like: "Failed to contact provider: <name>. Verify that it is listening and running".