Created attachment 1341998
engine logs

Description of problem:
The engine fails to start if a start VM job exists while the engine is starting.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171013142622.git15e767c.el7.centos.noarch

Steps to Reproduce:
1. Start VM
2. Kill engine

Actual results:
Engine failed to start

Expected results:
Engine started

Additional info:
(From email discussions)

Juan Hernandez:

In order to see why that engine failed to start I did a thread stack dump:

  # kill -3 $(pidof ovirt-engine)
  # view /var/log/ovirt-engine/console.log

I found that there is a circular dependency between the Backend and MacPoolPerCluster EJBs. The Backend runs compensation during its initialization, which in some cases uses the MacPoolPerCluster EJB. But this EJB depends explicitly on the Backend, so it won't be deployed until the Backend finishes initialization. As a result the initialization of the Backend never finishes. Eventually the application server tries to kill it, but it fails. This leaves the application server running, but useless.

The details of the circular dependency can be found in the console.log file (I left all the log files in the /var/log/ovirt-engine/juan directory). Look for these lines:

---8<---
org.ovirt.engine.core.bll.network.macpool.MacPoolPerCluster$Proxy$_$$_Weld$EnterpriseProxy$.getMacPoolById(Unknown Source)
org.ovirt.engine.core.bll.Backend.create(Backend.java:197)
--->8---

This seems to be triggered by this compensation:

---8<---
2017-10-19 17:38:30,965+03 INFO [org.ovirt.engine.core.bll.AddVmCommand] (ServerService Thread Pool -- 61) [] Command [id=4b5a84b2-90cf-4d85-b121-0dc02b90d55f]: Compensating TRANSIENT_ENTITY of org.ovirt.engine.core.common.businessentities.ReleaseMacsTransientCompensation; snapshot: org.ovirt.engine.core.common.businessentities.ReleaseMacsTransientCompensation@28536906.
--->8---

The only way I found to restore that environment was to modify the source so that it doesn't try to use the MacPoolPerCluster EJB in this case:

---8<---
diff --git a/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/ObjectCompensation.java b/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/ObjectCompensation.java
index e6092dca99e..fe22b8d61b0 100644
--- a/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/ObjectCompensation.java
+++ b/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/ObjectCompensation.java
@@ -1,18 +1,15 @@
 package org.ovirt.engine.core.bll;
 
-import javax.inject.Inject;
 import javax.inject.Singleton;
 
-import org.ovirt.engine.core.bll.network.macpool.MacPool;
-import org.ovirt.engine.core.bll.network.macpool.MacPoolPerCluster;
 import org.ovirt.engine.core.common.businessentities.BusinessEntitySnapshot;
 import org.ovirt.engine.core.common.businessentities.ReleaseMacsTransientCompensation;
 import org.ovirt.engine.core.common.businessentities.TransientCompensationBusinessEntity;
 
 @Singleton
 public class ObjectCompensation {
-    @Inject
-    private MacPoolPerCluster macPoolPerCluster;
+    //@Inject
+    //private MacPoolPerCluster macPoolPerCluster;
 
     public void compensate(CommandBase command, TransientCompensationBusinessEntity entity) {
         switch (entity.getTransientEntityType()) {
@@ -28,7 +25,7 @@ public class ObjectCompensation {
     }
 
     private void handleReleaseMacsCompensation(ReleaseMacsTransientCompensation releaseMacs) {
-        MacPool macPool = macPoolPerCluster.getMacPoolById(releaseMacs.getMacPoolId());
-        macPool.freeMacs(releaseMacs.getMacs());
+        //MacPool macPool = macPoolPerCluster.getMacPoolById(releaseMacs.getMacPoolId());
+        //macPool.freeMacs(releaseMacs.getMacs());
     }
 }
--->8---

To make this change I replaced the /usr/share/java/ovirt-engine/bll.jar file. The original one is renamed to bll.jar.juan. This isn't the right solution, of course.
Please report this as a bug to the network team, so that they fix it properly.

Dan Kenigsberg:

I think that this dependency loop was made explicit by https://gerrit.ovirt.org/#/q/I98082be352fe1d61f47cfa46bdd0456b9a2348ed and merits a 4.1.7 blocker bug.
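For readers following the analysis above, the loop can be pictured as a tiny "waits for" graph: Backend waits for MacPoolPerCluster (it calls it during compensation), and MacPoolPerCluster waits for Backend (explicit dependency). The following is only a plain-Java sketch of that idea; `InitOrderSketch`, `addWait` and `findCycle` are hypothetical names invented for illustration and have nothing to do with how the application server or oVirt actually tracks deployment order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not oVirt code: records which bean must be ready
// before another bean can finish its own initialization.
class InitOrderSketch {
    private final Map<String, List<String>> waitsFor = new HashMap<>();

    // Record that `bean` cannot finish initializing until `other` is ready.
    void addWait(String bean, String other) {
        waitsFor.computeIfAbsent(bean, k -> new ArrayList<>()).add(other);
    }

    // Depth-first search from `start`; returns the loop as a path that
    // begins and ends with the same bean, or an empty list if none is found.
    List<String> findCycle(String start) {
        List<String> path = new ArrayList<>();
        return visit(start, path, new HashSet<>()) ? path : new ArrayList<>();
    }

    private boolean visit(String bean, List<String> path, Set<String> finished) {
        if (finished.contains(bean)) {
            return false; // already fully explored, no loop through it
        }
        int seenAt = path.indexOf(bean);
        if (seenAt >= 0) {
            path.subList(0, seenAt).clear(); // drop beans before the loop
            path.add(bean);                  // close the loop
            return true;
        }
        path.add(bean);
        for (String next : waitsFor.getOrDefault(bean, Collections.emptyList())) {
            if (visit(next, path, finished)) {
                return true;
            }
        }
        path.remove(path.size() - 1);
        finished.add(bean);
        return false;
    }

    public static void main(String[] args) {
        InitOrderSketch graph = new InitOrderSketch();
        // Backend runs compensation during init, which calls MacPoolPerCluster.
        graph.addWait("Backend", "MacPoolPerCluster");
        // MacPoolPerCluster explicitly depends on Backend.
        graph.addWait("MacPoolPerCluster", "Backend");
        System.out.println(graph.findCycle("Backend"));
        // prints [Backend, MacPoolPerCluster, Backend]
    }
}
```

With only the second edge (the explicit dependency) the search finds nothing; the loop appears only once the compensation call during Backend initialization is added to the graph.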
According to the analysis, this is a network bug. If you need assistance from Infra, contact us. Moving to network.
(In reply to Oved Ourfali from comment #1)
> According to the analysis, this is a network bug.
> If you need assistance from Infra, contact us.
> Moving to network.

Hi,

@DependsOn was added to MacPoolPerCluster because it otherwise fails to initialize its beans. In Roy Golan's words: "Other beans that are not Backend must depend on it otherwise they might end up using some facilities which are not initialized." The pools do not need Backend in any way; @DependsOn is there only because DI failed without it.

This issue is network related only in that the failure happens to surface here. I think the real cause is that Backend is severely overgrown (why should Backend do compensation on init? Can't that be done in a separate bean? etc.) and should be broken down. Needing @DependsOn just to make DI work suggests that something is wrong on the infra side. But I don't know what analysis you did or what its results were; please explain in greater depth why you think it's network related.

Workaround: We shouldn't do it, but we can try removing @Startup from MacPoolPerCluster (I don't know what happens then, though), which leaves the initialization delay to the first caller. It shouldn't take long, but it has also been decided multiple times that we don't want to do that.
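To illustrate what "leaving the initialization delay to the first caller" means in the workaround above, here is a minimal plain-Java sketch of first-use initialization. It assumes nothing about the real MacPoolPerCluster; `LazySketch` is a hypothetical helper, and the container's own lazy singleton handling is certainly more involved than this.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper, not part of oVirt: builds its value on first use
// instead of at startup, so the first caller pays the initialization cost.
class LazySketch<T> implements Supplier<T> {
    private final Supplier<T> factory;
    private T value;
    private boolean ready;

    LazySketch(Supplier<T> factory) {
        this.factory = factory;
    }

    // synchronized so that concurrent first callers trigger only one build
    @Override
    public synchronized T get() {
        if (!ready) {
            value = factory.get();
            ready = true;
        }
        return value;
    }

    public static void main(String[] args) {
        AtomicInteger builds = new AtomicInteger();
        // Stand-in for an expensive pool initialization (assumption).
        LazySketch<String> pool = new LazySketch<>(() -> {
            builds.incrementAndGet();
            return "mac-pool";
        });
        System.out.println(builds.get()); // prints 0: nothing built at startup
        pool.get();
        pool.get();
        System.out.println(builds.get()); // prints 1: built once, on first use
    }
}
```

The trade-off is the one named in the comment: startup no longer waits on the pool, but the first request that needs it does.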
Ravi, could you please help the network team fix this issue?
Posted a patch to fix the issue
I see the nature of the proposed fix is infra related, so moving back to infra.
Verified on - 4.1.7.5-0.1.el7