Bug 1505242 - Engine fails to start when AddVmComman (start VM) job exists
Summary: Engine fails to start when AddVmComman (start VM) job exists
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Network
Version: 4.2.0
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ovirt-4.1.7
: ---
Assignee: Ravi Nori
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-10-23 06:31 UTC by Meni Yakove
Modified: 2017-11-13 12:28 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-13 12:28:36 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments (Terms of Use)
engine logs (16.15 KB, application/zip)
2017-10-23 06:31 UTC, Meni Yakove
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 83103 0 master MERGED engine : Engine fails to start when AddVmComman (start VM) job exists 2017-10-30 13:52:50 UTC
oVirt gerrit 83389 0 ovirt-engine-4.1 MERGED engine : Engine fails to start when AddVmCommand (start VM) job exists 2017-10-30 15:03:24 UTC

Description Meni Yakove 2017-10-23 06:31:59 UTC
Created attachment 1341998 [details]
engine logs

Description of problem:
When start VM job is exists while the engine starting the engine failed to start.


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171013142622.git15e767c.el7.centos.noarch



Steps to Reproduce:
1. Start VM
2. Kill engine

Actual results:
Engine failed to start

Expected results:
Engine started

Additional info: (From email discussions)

Juan Hernandez:
In order to see why that engine failed to start I did a thread stack dump:

  # kill -3 $(pidof ovirt-engine)
  # view /var/log/ovirt-engine/console.log

I found that there is a circular dependency between the Backend and MacPoolPerCluster EJBs. The Backend runs compensation during its initialization, which in some cases uses the MacPoolPerCluster EJB. But this EJB depends explicitly on the Backend, so it won't be deployed till the backend finishes initialization. As a result the initialization of the Backend never finishes. Eventually the application server tries to kill it, but it fails. This leaves the application server running, but useless.

The details of the circular dependency can be found in the console.log file (I left all the log files in /var/log/ovirt-engine/juan) directory. Look for these lines lines:

---8<---
org.ovirt.engine.core.bll.network.macpool.MacPoolPerCluster$Proxy$_$$_Weld$EnterpriseProxy$.getMacPoolById(Unknown Source)
org.ovirt.engine.core.bll.Backend.create(Backend.java:197)
--->8--

This seems to be triggered by this compensation:

---8<---
2017-10-19 17:38:30,965+03 INFO [org.ovirt.engine.core.bll.AddVmCommand] (ServerService Thread Pool -- 61) [] Command [id=4b5a84b2-90cf-4d85-b121-0dc02b90d55f]: Compensating TRANSIENT_ENTITY of org.ovirt.engine.core.common.businessentities.ReleaseMacsTransientCompensation; snapshot: org.ovirt.engine.core.common.businessentities.ReleaseMacsTransientCompensation@28536906.
--->8---

The only way I found to restore that environment was to modify the source so that it doesn't try to use the MacPoolPerCluster EJB in this case:

---8<---
diff --git a/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/ObjectCompensation.java b/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/ObjectCompensation.java
index e6092dca99e..fe22b8d61b0 100644
--- a/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/ObjectCompensation.java
+++ b/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/ObjectCompensation.java
@@ -1,18 +1,15 @@
 package org.ovirt.engine.core.bll;

-import javax.inject.Inject;
 import javax.inject.Singleton;

-import org.ovirt.engine.core.bll.network.macpool.MacPool;
-import org.ovirt.engine.core.bll.network.macpool.MacPoolPerCluster;
 import org.ovirt.engine.core.common.businessentities.BusinessEntitySnapshot;
 import org.ovirt.engine.core.common.businessentities.ReleaseMacsTransientCompensation;
 import org.ovirt.engine.core.common.businessentities.TransientCompensationBusinessEntity;

 @Singleton
 public class ObjectCompensation {
-    @Inject
-    private MacPoolPerCluster macPoolPerCluster;
+    //@Inject
+    //private MacPoolPerCluster macPoolPerCluster;

     public void compensate(CommandBase command, TransientCompensationBusinessEntity entity) {
         switch (entity.getTransientEntityType()) {
@@ -28,7 +25,7 @@ public class ObjectCompensation {
     }

     private void handleReleaseMacsCompensation(ReleaseMacsTransientCompensation releaseMacs) {
-        MacPool macPool = macPoolPerCluster.getMacPoolById(releaseMacs.getMacPoolId());
-        macPool.freeMacs(releaseMacs.getMacs());
+        //MacPool macPool = macPoolPerCluster.getMacPoolById(releaseMacs.getMacPoolId());
+        //macPool.freeMacs(releaseMacs.getMacs());
     }
 }
--->8---

To make this change I replaced the /usr/share/java/ovirt-engine/bll.jar file. The original one is renamed to bll.jar.juan.

This isn't the right solution, of course. Please report this as a bug to the network team, so that they fix int properly.


Dan Kenigsberg:
I think that this dependency loop was made explicit by https://gerrit.ovirt.org/#/q/I98082be352fe1d61f47cfa46bdd0456b9a2348ed and merits a 4.1.7 blocker bug.

Comment 1 Oved Ourfali 2017-10-23 06:34:53 UTC
According to the analysis, this is a network bug.
If you need assistance from Infra, contact us.
Moving to network.

Comment 2 Martin Mucha 2017-10-23 11:29:39 UTC
(In reply to Oved Ourfali from comment #1)
> According to the analysis, this is a network bug.
> If you need assistance from Infra, contact us.
> Moving to network.

Hi, 

@Depends on was added into MacPoolPerCluster, because it fails to initialize it's beans. In Roy Golans words: "Other beans that are not Backend must depend on it otherwise they might end up using some facilities which are not initilized." ~ Pools do not need Backend in any way, @DependsOn is there just because DI failed otherwise. This issue is Network related only in one way, failure is just happening here.

So I think, that true reason is that Backend is severely overgrown (why should Backend do compensation on init?? Cannot that be done in separate bean? etc. etc.), and it should be broken down. Requirement, to use @Depends just to make DI work, is suggestion, that there is something infra-wrong.

But I don't know what analysis you did and its results; please explain in greater depth why do you thinks it's network related.

workaround: We shouldn't do it. But we can try (I don't know what happens then though) to remove @Startup from MacPoolPerCluster, leaving first initialization delay for the first customer. It shouldn't take long, but it was also multiple times decided, that we don't want to do that.

Comment 3 Martin Perina 2017-10-23 13:00:06 UTC
Ravi, could you please help network team to fix that issue?

Comment 4 Ravi Nori 2017-10-23 15:19:05 UTC
Posted a patch to fix the issue

Comment 5 Oved Ourfali 2017-10-24 02:43:16 UTC
I see the nature of the proposed fix is infra related, so moving back to infra.

Comment 6 Michael Burman 2017-11-02 13:32:27 UTC
Verified on - 4.1.7.5-0.1.el7


Note You need to log in before you can comment on or make changes to this bug.