Bug 1528292 - Recursion in HostDeviceManager and other services prevents the engine from starting up successfully
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Ravi Nori
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-12-21 13:15 UTC by Yanir Quinn
Modified: 2018-02-12 11:50 UTC (History)
CC List: 11 users

Fixed In Version: ovirt-engine-4.2.1.3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-12 11:50:28 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments
server.log (125.05 KB, text/plain) - 2017-12-21 13:15 UTC, Yanir Quinn
engine.log (405.14 KB, text/plain) - 2017-12-21 13:17 UTC, Yanir Quinn
/var/log/ovirt-engine with DEBUG and full thread dumps (1.65 MB, application/x-gzip) - 2018-01-02 14:53 UTC, Yedidyah Bar David
setup log and answer file - engine and server logs (150.75 KB, application/x-gzip) - 2018-01-08 13:56 UTC, Kobi Hakimi
wildfly debug logs (3.57 MB, application/x-gzip) - 2018-01-16 12:48 UTC, Simone Tiraboschi


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 85898 0 master ABANDONED test, do not merge: Make the engine log in DEBUG level 2018-02-05 11:05:06 UTC
oVirt gerrit 86005 0 master MERGED core: Recursion in HostDeviceManager prevents engine from starting successfully 2018-01-08 10:01:07 UTC
oVirt gerrit 86439 0 master MERGED engine : Recursion in HostDeviceManager prevents engine to startup successfully 2018-01-23 16:55:35 UTC
oVirt gerrit 86486 0 master ABANDONED test, do not merge: Build an appliance with a custom engine 2018-01-23 16:31:14 UTC
oVirt gerrit 86667 0 master MERGED engine: Initialize CommandsRepository before CommandCallbacksPoller 2018-01-23 16:55:38 UTC

Description Yanir Quinn 2017-12-21 13:15:51 UTC
Created attachment 1370865 [details]
server.log

Description of problem:

After a Hosted Engine node-zero deployment (Ansible setup), an "HTTP 404 Not Found" error occurs and WildFly times out.


Version-Release number of selected component (if applicable): 4.2


How reproducible:
Always

Steps to Reproduce:
Using oVirt system tests:
1. Run the he-basic-suite-master script.
2. Wait for the Hosted Engine to deploy.
3. Test scenario 002_bootstrap will run and fail.

Actual results:
An HTTP 404 Not Found error in the server log causes WildFly to time out after 300 seconds.

Expected results:
No errors in server.log 

Additional info:

Comment 1 Yanir Quinn 2017-12-21 13:17:24 UTC
Created attachment 1370866 [details]
engine.log

Comment 2 Simone Tiraboschi 2017-12-21 13:23:02 UTC
Simply running
  systemctl restart ovirt-engine
fixes it.

Comment 3 Yaniv Kaul 2017-12-21 14:21:47 UTC
Note that extending the timeout from 10 to 20 minutes also 'solves' it. I need to look at why that's needed at all, of course, but the failure doesn't always happen.

Comment 4 Simone Tiraboschi 2017-12-22 11:16:43 UTC
An upstream user reported it also after a 4.1 -> 4.2 upgrade of a hosted-engine env, so it doesn't seem specific to the new Ansible-based flow.

Logs on:
https://drive.google.com/file/d/1UQAllZfjueVGkXDsBs09S7THGDFn9YPa/view

Comment 5 Sandro Bonazzola 2017-12-22 11:40:47 UTC
(In reply to Simone Tiraboschi from comment #4)
> An upstream user reported it also after a 4.1 -> 4.2 upgrade of a
> hosted-engine env, so it doesn't seem specific to the new Ansible-based flow.
> 
> Logs on:
> https://drive.google.com/file/d/1UQAllZfjueVGkXDsBs09S7THGDFn9YPa/view

Updated summary accordingly.
Quoting Martin Perina from the users mailing list:

it usually happens when access to storage is really slow or the host itself is overloaded, and WildFly is not able to start up properly before the default 300-second interval is over.

If this is going to happen often, we will have to raise that timeout for all installations.

Comment 6 Simone Tiraboschi 2017-12-22 11:46:12 UTC
(In reply to Sandro Bonazzola from comment #5)
> it usually happens when access to storage is really slow or the host itself
> is overloaded, and WildFly is not able to start up properly before the
> default 300-second interval is over.

It's probably not that simple: 300 seconds is already a huge amount of time.
When I hit this in my tests I noticed that, even before the 300-second timeout, a simple 'systemctl restart ovirt-engine' was enough to get it properly started within a few seconds.

Comment 7 Yedidyah Bar David 2018-01-02 11:10:06 UTC
Are we certain this is really related to load?

It happens to me consistently with ovirt-system-tests and the Ansible hosted-engine flow, aka node-zero.

On the first start of the engine, it's all quick and nice:

2018-01-02 03:16:12,952-05 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 63) WFLYCLINF0002: Started timeout-base cache from ovirt-engine container
2018-01-02 03:16:17,699-05 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "ovirt-web-ui.war" (runtime-name : "ovirt-web-ui.war")
2018-01-02 03:16:17,699-05 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "apidoc.war" (runtime-name : "apidoc.war")
2018-01-02 03:16:17,699-05 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "restapi.war" (runtime-name : "restapi.war")
2018-01-02 03:16:17,699-05 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "engine.ear" (runtime-name : "engine.ear")
2018-01-02 03:16:18,116-05 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server

Then HE-setup does some work with the engine (adds storage, a host, etc. - all working well), then it shuts down the machine, and on the first reboot of the engine VM, it fails:

2018-01-02 03:25:40,551-05 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 67) WFLYCLINF0002: Started timeout-base cache from ovirt-engine container
2018-01-02 03:30:00,597-05 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
    ("core-service" => "management"),
    ("management-interface" => "native-interface")
]'

More than 4 minutes with nothing in server.log. During that time, engine.log does have some entries in the beginning, nothing that seems relevant to me, and then nothing until the next restart:

2018-01-02 03:26:01,079-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DumpXmlsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-16) [] FINISH, DumpXmlsVDSCommand, return: [{vmId=80ed48d7-38ee-4c3c-b4f1-67b7438da3a4, devices=[Ljava.util.Map;@3128dacd, guestDiskMapping={}}], log id: 4a3569dd
2018-01-02 03:36:43,630-05 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 59) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

Then HA times out, shuts down the engine machine, starts it, and then everything is quick and nice again:

2018-01-02 03:36:48,060-05 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 65) WFLYCLINF0002: Started timeout-base cache from ovirt-engine container
2018-01-02 03:36:53,469-05 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "ovirt-web-ui.war" (runtime-name : "ovirt-web-ui.war")
2018-01-02 03:36:53,470-05 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "apidoc.war" (runtime-name : "apidoc.war")
2018-01-02 03:36:53,470-05 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "restapi.war" (runtime-name : "restapi.war")
2018-01-02 03:36:53,470-05 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "engine.ear" (runtime-name : "engine.ear")
2018-01-02 03:36:53,685-05 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server

Comment 8 Artyom 2018-01-02 11:22:25 UTC
It sometimes happens also in a regular HE environment, but in my case there is a traceback in server.log:
2018-01-02 13:14:22,729+02 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
    ("core-service" => "management"),
    ("management-interface" => "native-interface")
]'
2018-01-02 13:14:30,405+02 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 73) WFLYUT0022: Unregistered web context: /ovirt-engine/web-ui
2018-01-02 13:14:30,419+02 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 93) WFLYUT0022: Unregistered web context: /ovirt-engine/rhv
2018-01-02 13:14:30,403+02 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 80) WFLYUT0022: Unregistered web context: /ovirt-engine/apidoc
2018-01-02 13:14:29,742+02 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@4e32357b for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
        at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:511) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1369) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1328) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1301) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractOperationContext$Step.access$300(AbstractOperationContext.java:1185) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractOperationContext.executeResultHandlerPhase(AbstractOperationContext.java:767) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:644) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:370) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1344) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:485) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:387) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:349) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.server.ServerService.boot(ServerService.java:392) [wildfly-server-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.server.ServerService.boot(ServerService.java:365) [wildfly-server-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:299) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]

2018-01-02 13:14:30,607+02 ERROR [org.jboss.as.controller.client] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@4e32357b for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException

Comment 9 Martin Perina 2018-01-02 12:37:32 UTC
So, could someone please reproduce this issue with WildFly logs at DEBUG level? It can be done by editing /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in and changing the following lines:

1. Change INFO to DEBUG in:
      <file-handler name="SERVER">                                                                             
        <level name="DEBUG"/>                                                                          
        <formatter>

2. Change INFO to DEBUG in:
      <root-logger>
        <level name="DEBUG"/>
        <handlers>

With those logs we should be able to understand why the deployment took so much time. Thanks

Comment 10 Yedidyah Bar David 2018-01-02 13:30:31 UTC
85898 is a patch to HE-setup to patch the engine as requested in comment 9.

Comment 11 Yedidyah Bar David 2018-01-02 13:32:25 UTC
BTW, https://gerrit.ovirt.org/85890 indeed makes it work for me. So it might indeed be related to high load.

Comment 12 Juan Hernández 2018-01-02 13:41:45 UTC
Looking at the logs provided by Didi in the `devel` list (see [1]) I see that the last activity performed by the application server service threads is the following:

  2018-01-01 08:19:18,619-05 INFO  [org.ovirt.engine.core.bll.ServiceLoader] (ServerService Thread Pool -- 57) [] Start org.ovirt.engine.core.bll.scheduling.SchedulingManager@104b2434

In a normal start-up sequence, right after that message for the `SchedulingManager` a similar one should appear for the `HostDeviceManager`, as the code calls them in sequence:

  // From InitBackendServicesOnStartupBean.java:
  serviceLoader.load(SchedulingManager.class);

  sessionDataContainer.cleanupEngineSessionsOnStartup();

  serviceLoader.load(HostDeviceManager.class);

But that `HostDeviceManager` message never appears: the application server initialization times out before it is printed.

Looking at the code for `HostDeviceManager`, it looks like during its initialization it tries to reload the devices of the hosts that are `UP`:

  // From HostDeviceManager.java:
  @PostConstruct
  private void init() {
    ...
    backend.runInternalMultipleActions(ActionType.RefreshHostDevices, parameters);
    hostDeviceDao.cleanDownVms();
  }

And the `RefreshHostDevices` command in turn depends on `HostDeviceManager`:

  // From RefreshHostDevicesCommand.java:
  @NonTransactiveCommandAttribute
  public class RefreshHostDevicesCommand ... {
    ...

    @Inject
    private HostDeviceManager hostDeviceManager;

    ...
  }

So the net result is that there is a circular dependency: initializing the `HostDeviceManager` requires a working `HostDeviceManager`. Most probably that is what blocks the initialization of the application server.

This doesn't always happen because the `RefreshHostDevices` command is executed asynchronously, in a different thread. If that thread is fast enough, it will try to use the `HostDeviceManager` before it has finished initializing, and thus it will block forever. If it isn't fast enough, then the `HostDeviceManager` will already be initialized, and there is no problem.

Anyhow, I think we should remove that circular dependency.
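
To make the failure mode concrete, here is a minimal plain-Java sketch of the same pattern (the names are illustrative stand-ins, not the actual Weld/WildFly classes): one thread takes the bean lock and then waits for startup to complete, while the startup thread needs that same lock before it can ever signal completion.

  import java.util.concurrent.CountDownLatch;

  public class StartupDeadlockSketch {

      // Stand-ins: beanLock plays the role of the Weld bean monitor,
      // startupLatch plays the role of the StartupCountdown.
      static final Object beanLock = new Object();
      static final CountDownLatch startupLatch = new CountDownLatch(1);

      public static void main(String[] args) throws InterruptedException {
          // The async RefreshHostDevices-style thread: grabs the bean lock,
          // then waits for startup to complete.
          Thread asyncCommand = new Thread(() -> {
              synchronized (beanLock) {
                  try {
                      startupLatch.await();
                  } catch (InterruptedException ignored) {
                      Thread.currentThread().interrupt();
                  }
              }
          }, "async-command");
          asyncCommand.start();

          Thread.sleep(100); // let the async thread win the race (the bad case)

          // The startup thread: needs the bean lock to finish initialization,
          // but it is held by the async thread, so countDown() is never reached.
          synchronized (beanLock) {
              startupLatch.countDown();
          }
          System.out.println("startup finished"); // never printed: deadlock
      }
  }

Run as-is, the program hangs forever; if the sleep is removed and the main thread wins the race, it completes, which matches the intermittent behaviour described above.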

Please try to obtain a thread dump of the engine when it is blocked in this wait, that can help us determine if this is really what is happening.
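
A thread dump can be taken with `jstack <pid>` or `kill -3 <pid>` against the engine JVM. For completeness, a generic in-process sketch using the standard ThreadMXBean API (not engine code) would be:

  import java.lang.management.ManagementFactory;
  import java.lang.management.ThreadInfo;
  import java.lang.management.ThreadMXBean;

  public class ThreadDumpSketch {
      public static void main(String[] args) {
          ThreadMXBean mx = ManagementFactory.getThreadMXBean();
          // true/true requests locked monitors and synchronizers, which is
          // exactly the "- locked <0x...>" information needed here.
          for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
              System.out.print(info);
          }
          // Note: findDeadlockedThreads() only flags lock-on-lock cycles; a
          // thread parked in Object.wait() on a latch-like object (as in this
          // bug) will not be flagged, so the dump must be read manually.
          long[] deadlocked = mx.findDeadlockedThreads();
          System.out.println("JVM-detected deadlocked threads: "
                  + (deadlocked == null ? "none" : deadlocked.length));
      }
  }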

[1] http://lists.ovirt.org/pipermail/devel/2018-January/032365.html

Comment 13 Yedidyah Bar David 2018-01-02 14:01:52 UTC
Juan, thanks for the detailed analysis. I hope to attach logs with DEBUG level and thread dumps soon; hopefully these can confirm the root cause.

Is there any way to automatically verify that we do not have other such potential loops? Perhaps add a flag somewhere to sleep for a random time at the beginning of each such service-starting thread? Or some other way to make them start in random order? A sketch of that idea follows.
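
As an illustration only (the Service interface and loader below are hypothetical stand-ins, not the actual engine API), such a test-mode randomized start could look like:

  import java.util.List;
  import java.util.concurrent.ThreadLocalRandom;

  // Test-only helper: start each backend service on its own thread after a
  // random delay, so ordering-dependent deadlocks surface more often in CI.
  public class JitteredServiceStarter {

      public interface Service {
          void start();
      }

      public static void startAll(List<Service> services) {
          services.parallelStream().forEach(service -> {
              try {
                  // 0-500 ms of jitter randomizes the effective start order.
                  Thread.sleep(ThreadLocalRandom.current().nextLong(500));
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  return;
              }
              service.start();
          });
      }
  }

Repeating engine startup many times with such jitter enabled would make races like the HostDeviceManager one much more likely to show up in testing.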

Comment 14 Juan Hernández 2018-01-02 14:21:00 UTC
I am not aware of any mechanism or tool to automatically find this kind of circular dependency.

Comment 15 Yedidyah Bar David 2018-01-02 14:53:24 UTC
Created attachment 1375786 [details]
/var/log/ovirt-engine with DEBUG and full thread dumps

Comment 16 Juan Hernández 2018-01-02 16:09:15 UTC
The thread stack dump seems to confirm what I suspected. The thread that initializes the backend is blocked waiting on lock 0x00000000cc238218, for the `HostDeviceManager` instance to be created, which blocks the startup of the application server:

"ServerService Thread Pool -- 68" #124 prio=5 os_prio=0 tid=0x000000000249f7d0 nid=0x773 waiting for monitor entry [0x00007f4f52802000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.jboss.weld.bean.ContextualInstanceStrategy$ApplicationScopedContextualInstanceStrategy.getIfExists(ContextualInstanceStrategy.java:122)
    - waiting to lock <0x00000000cc238218> (a org.jboss.weld.bean.ContextualInstanceStrategy$ApplicationScopedContextualInstanceStrategy)

And then there is the thread that already obtained that lock: the thread that is trying to do the initialization of the `HostDeviceManager`. That thread is blocked waiting for the backend bean to be completely initialized, which will never happen:

"EE-ManagedThreadFactory-engine-Thread-3" #169 daemon prio=5 os_prio=0 tid=0x00000000064d5600 nid=0x7ad in Object.wait() [0x00007f4f4efd6000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000c3a899a8> (a org.jboss.as.ee.component.deployers.StartupCountdown)
    at java.lang.Object.wait(Object.java:502)
    at org.jboss.as.ee.component.deployers.StartupCountdown.await(StartupCountdown.java:45)
    - locked <0x00000000c3a899a8> (a org.jboss.as.ee.component.deployers.StartupCountdown)
    at org.jboss.as.ejb3.deployment.processors.StartupAwaitInterceptor.processInvocation(StartupAwaitInterceptor.java:21)
    ...
    at org.jboss.weld.bean.ContextualInstanceStrategy$ApplicationScopedContextualInstanceStrategy.get(ContextualInstanceStrategy.java:140)
    - locked <0x00000000cc238218> (a org.jboss.weld.bean.ContextualInstanceStrategy$ApplicationScopedContextualInstanceStrategy)

These two threads are in an indirect deadlock.

Comment 17 Kobi Hakimi 2018-01-08 13:54:52 UTC
I have the same issue with a nested hosted-engine environment,
but in my case, for some reason, the VM looks very slow and restarting ovirt-engine does not resolve the issue.

Attached logs and setup details in the automation-nested-1-hosted-engine-logs.tar.gz file.

Comment 18 Kobi Hakimi 2018-01-08 13:56:15 UTC
Created attachment 1378552 [details]
setup log and answer file - engine and server logs

Comment 19 Simone Tiraboschi 2018-01-15 18:08:38 UTC
Still there with ovirt-engine.noarch              4.2.1.1-1.el7.centos @ovirt-4.2-pre

in server.log:

2018-01-15 18:53:26,808+01 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 60) WFLYCLINF0002: Started timeout-base cache from ovirt-engine container
2018-01-15 18:58:05,045+01 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
    ("core-service" => "management"),
    ("management-interface" => "native-interface")
]'
2018-01-15 18:58:05,097+01 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 64) WFLYUT0022: Unregistered web context: '/ovirt-engine/services' from server 'default-server'
2018-01-15 18:58:05,097+01 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 66) WFLYUT0022: Unregistered web context: '/ovirt-engine/api' from server 'default-server'
2018-01-15 18:58:05,098+01 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 67) WFLYUT0022: Unregistered web context: '/ovirt-engine/webadmin' from server 'default-server'
2018-01-15 18:58:05,097+01 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 65) WFLYUT0022: Unregistered web context: '/ovirt-engine/docs' from server 'default-server'
2018-01-15 18:58:05,104+01 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 63) WFLYUT0022: Unregistered web context: '/ovirt-engine' from server 'default-server'
2018-01-15 18:58:05,105+01 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 63) WFLYUT0022: Unregistered web context: '/' from server 'default-server'
2018-01-15 18:58:05,107+01 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 68) WFLYUT0022: Unregistered web context: '/ovirt-engine/sso' from server 'default-server'
2018-01-15 18:58:05,118+01 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 70) WFLYUT0022: Unregistered web context: '/ovirt-engine/apidoc' from server 'default-server'
2018-01-15 18:58:05,134+01 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 69) WFLYUT0022: Unregistered web context: '/ovirt-engine/web-ui' from server 'default-server'
2018-01-15 18:58:05,180+01 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-7) WFLYSRV0028: Stopped deployment apidoc.war (runtime-name: apidoc.war) in 127ms
2018-01-15 18:58:05,181+01 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0028: Stopped deployment restapi.war (runtime-name: restapi.war) in 129ms
2018-01-15 18:58:05,192+01 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-5) WFLYSRV0028: Stopped deployment ovirt-web-ui.war (runtime-name: ovirt-web-ui.war) in 140ms
2018-01-15 18:58:10,049+01 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@49d3206f for operation add-deployer-chains at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
        at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:522) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1483) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1437) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1410) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.access$400(AbstractOperationContext.java:1284) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractOperationContext.executeResultHandlerPhase(AbstractOperationContext.java:857) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:709) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:450) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1402) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:516) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:468) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:430) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.server.ServerService.boot(ServerService.java:437) [wildfly-server-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.server.ServerService.boot(ServerService.java:396) [wildfly-server-3.0.8.Final.jar:3.0.8.Final]
        at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:370) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]

2018-01-15 18:58:10,052+01 ERROR [org.jboss.as.controller.client] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@49d3206f for operation add-deployer-chains at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException

Comment 21 Martin Perina 2018-01-16 08:35:20 UTC
According to the logs this seems like a different issue with the dependencies of HostDeviceManager. Simone, is it possible to attach an engine thread dump so we can see which thread is blocking the initialization?

Comment 22 Simone Tiraboschi 2018-01-16 12:46:24 UTC
Here it is: it's not 100% reproducible.

On the first attempt (on the locally running engine VM) WildFly started correctly; then we moved the disk to the shared storage, restarted the VM from there, and WildFly failed due to:

2018-01-16 12:33:44,249+01 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
    ("core-service" => "management"),
    ("management-interface" => "native-interface")
]'


Then, as expected, since the engine was down, ovirt-ha-agent shut down and restarted the engine VM, and this time WildFly started correctly.

Comment 23 Simone Tiraboschi 2018-01-16 12:48:06 UTC
Created attachment 1381985 [details]
wildfly debug logs

Comment 24 Jiri Belka 2018-01-19 16:45:03 UTC
Our Brno QE env - brq-setup - hit this issue as well. We have been using this env since 3.1; please provide a solution.

Comment 25 Ravi Nori 2018-01-19 17:12:20 UTC
The new patch http://gerrit.ovirt.org/86439 should fix the issue.
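
For context, one generic way to break this kind of construction-time cycle (a sketch only, not necessarily what the merged patch does) is to inject a lazy Instance<> handle instead of the bean itself, so resolution happens when the command runs rather than while the container is still starting:

  import javax.enterprise.inject.Instance;
  import javax.inject.Inject;

  // Sketch: a lazy handle instead of a direct @Inject of the manager bean.
  // The stub interface and method name are illustrative, and a CDI container
  // is required for the injection to actually happen.
  public class RefreshHostDevicesCommandSketch {

      public interface HostDeviceManager {
          void refreshHostDevices();
      }

      @Inject
      private Instance<HostDeviceManager> hostDeviceManager; // nothing resolved at construction time

      protected void executeCommand() {
          // Resolution happens only now, after startup has completed.
          hostDeviceManager.get().refreshHostDevices();
      }
  }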

Comment 26 Nikolai Sednev 2018-01-24 12:10:29 UTC
Worked for me just fine on these components, for FC deployment of node zero:
ovirt-hosted-engine-ha-2.2.5-0.0.master.20180117160214.20180117160210.gitd5b58a8.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.8-0.0.master.20180123120416.gitfa69a78.el7.centos.noarch
ovirt-engine-appliance-4.2-20180121.1.el7.centos.noarch
Linux 3.10.0-693.17.1.el7.x86_64 #1 SMP Sun Jan 14 10:36:03 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Moving to verified.
Please reopen if you still see this issue.

Comment 27 Nikolai Sednev 2018-01-24 12:12:22 UTC
I have not seen this bug happening in ovirt-engine-setup-base-4.2.1.2-0.0.master.20180115100502.gitf91bea1.el7.centos.noarch.

Comment 28 Martin Perina 2018-01-24 13:20:40 UTC
Before verifying this, I'd like Jiri to upgrade http://brq-setup.rhev.lab.eng.brq.redhat.com/ to this week's build (probably 4.2.1.3) and see if the issue is no longer visible, as hitting the issue when deploying a new HE is less severe than reproducing it during an engine upgrade.

Comment 29 Martin Perina 2018-01-24 13:22:17 UTC
Also worth mentioning: during node-zero deployment the issue was not 100% reproducible; it was a race, so repeating the test several times is probably a good idea.

Comment 30 Simone Tiraboschi 2018-01-25 08:54:07 UTC
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-master/15 still failed for the same reason with ovirt-engine-appliance-4.2-20180124.1.el7.centos.ova, which should probably also include https://gerrit.ovirt.org/#/c/86667/

Comment 31 Martin Perina 2018-01-25 10:04:51 UTC
(In reply to Simone Tiraboschi from comment #30)
> http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-
> master/15 still failed for the same reason with
> ovirt-engine-appliance-4.2-20180124.1.el7.centos.ova, which should probably
> also include https://gerrit.ovirt.org/#/c/86667/

http://jenkins.ovirt.org/view/oVirt system tests/job/ovirt-system-tests_manual/2077 finished successfully as it contains ovirt-engine-4.2.1.4-0.0.master.20180124131357.git9958cd1.el7.centos.noarch

Comment 32 Jiri Belka 2018-01-25 10:17:20 UTC
I just updated the brq-setup env to ovirt-engine-backend-4.2.1.3-0.1.el7.noarch (overriding rnori@'s manual fixes) and everything seems OK.

Comment 33 Sandro Bonazzola 2018-02-12 11:50:28 UTC
This bug is included in the oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

