Created attachment 1370865 [details]
server.log

Description of problem:
After Hosted Engine node 0 deployment (ansible setup), an "HTTP 404 Not Found" occurs and WildFly times out.

Version-Release number of selected component (if applicable):
4.2

How reproducible:
Always

Steps to Reproduce:
Using oVirt system tests:
1. Run the he-basic-suite-master script
2. Wait for the Hosted Engine to deploy
3. Test scenario 002_bootstrap will run and fail

Actual results:
"HTTP 404 Not Found" error in server.log, causing WildFly to time out after 300 seconds

Expected results:
No errors in server.log

Additional info:
Created attachment 1370866 [details] engine.log
Simply running `systemctl restart ovirt-engine` fixes it.
Note that extending the timeout from 10 to 20 minutes also 'solves' it. I still need to look into why that is needed at all, of course, but the failure doesn't always happen.
An upstream user reported it also after a 4.1 -> 4.2 upgrade of a hosted-engine env, so it doesn't seem specific to the new ansible-based flow. Logs on: https://drive.google.com/file/d/1UQAllZfjueVGkXDsBs09S7THGDFn9YPa/view
(In reply to Simone Tiraboschi from comment #4)
> An upstream user reported it also after a 4.1 -> 4.2 upgrade of a
> hosted-engine env, so it doesn't seem specific to the new ansible-based flow.
>
> Logs on:
> https://drive.google.com/file/d/1UQAllZfjueVGkXDsBs09S7THGDFn9YPa/view

Updated summary accordingly.

Quoting Martin Perina from the users mailing list: it usually happens when access to storage is really slow or the host itself is overloaded, and WildFly is not able to start up properly before the default 300-second interval is over. If this is going to happen often, we will have to raise that timeout for all installations.
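If raising the timeout is the chosen workaround, WildFly's service container stability timeout can be overridden through the standard `jboss.as.management.blocking.timeout` system property (value in seconds, default 300). The drop-in file name and the `ENGINE_PROPERTIES` wiring below are assumptions about how the engine service passes -D system properties to WildFly, a sketch rather than a confirmed fix:

```
# Hypothetical drop-in file, e.g. /etc/ovirt-engine/engine.conf.d/99-wildfly-timeout.conf.
# Assumes ENGINE_PROPERTIES entries are forwarded to WildFly as -D system properties.
ENGINE_PROPERTIES="${ENGINE_PROPERTIES} jboss.as.management.blocking.timeout=600"
```

A restart of ovirt-engine would be required for such a change to take effect.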
(In reply to Sandro Bonazzola from comment #5)
> it usually happens when access to storage is really slow or the host itself
> is overloaded and WildFly is not able to startup properly until default 300
> seconds interval is over.

It's probably not that simple: 300 seconds is already a long time. When I hit this in my tests I noticed that, even before the 300-second timeout, a simple 'systemctl restart ovirt-engine' was enough to get it properly started in a matter of seconds.
Are we certain this is really related to load? It happens to me consistently with ovirt-system-tests and ansible hosted-engine, aka node zero.

On the first start of the engine, everything is quick and nice:

2018-01-02 03:16:12,952-05 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 63) WFLYCLINF0002: Started timeout-base cache from ovirt-engine container
2018-01-02 03:16:17,699-05 INFO [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "ovirt-web-ui.war" (runtime-name : "ovirt-web-ui.war")
2018-01-02 03:16:17,699-05 INFO [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "apidoc.war" (runtime-name : "apidoc.war")
2018-01-02 03:16:17,699-05 INFO [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "restapi.war" (runtime-name : "restapi.war")
2018-01-02 03:16:17,699-05 INFO [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "engine.ear" (runtime-name : "engine.ear")
2018-01-02 03:16:18,116-05 INFO [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server

Then HE-setup does its work with the engine (adds storage, host, etc., all working well), then it shuts down the machine, and on the first reboot of the engine VM, it fails:

2018-01-02 03:25:40,551-05 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 67) WFLYCLINF0002: Started timeout-base cache from ovirt-engine container
2018-01-02 03:30:00,597-05 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[ ("core-service" => "management"), ("management-interface" => "native-interface") ]'

More than 4 minutes with nothing in server.log.
During that time, engine.log does have some entries at the beginning, nothing that seems relevant to me, and then nothing until the next restart:

2018-01-02 03:26:01,079-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DumpXmlsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-16) [] FINISH, DumpXmlsVDSCommand, return: [{vmId=80ed48d7-38ee-4c3c-b4f1-67b7438da3a4, devices=[Ljava.util.Map;@3128dacd, guestDiskMapping={}}], log id: 4a3569dd
2018-01-02 03:36:43,630-05 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 59) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

Then HA times out, shuts down the engine machine, starts it, and then everything is quick and nice again:

2018-01-02 03:36:48,060-05 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 65) WFLYCLINF0002: Started timeout-base cache from ovirt-engine container
2018-01-02 03:36:53,469-05 INFO [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "ovirt-web-ui.war" (runtime-name : "ovirt-web-ui.war")
2018-01-02 03:36:53,470-05 INFO [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "apidoc.war" (runtime-name : "apidoc.war")
2018-01-02 03:36:53,470-05 INFO [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "restapi.war" (runtime-name : "restapi.war")
2018-01-02 03:36:53,470-05 INFO [org.jboss.as.server] (ServerService Thread Pool -- 25) WFLYSRV0010: Deployed "engine.ear" (runtime-name : "engine.ear")
2018-01-02 03:36:53,685-05 INFO [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server
It sometimes happens also in a regular HE environment, but there I have a traceback in server.log:

2018-01-02 13:14:22,729+02 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[ ("core-service" => "management"), ("management-interface" => "native-interface") ]'
2018-01-02 13:14:30,405+02 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 73) WFLYUT0022: Unregistered web context: /ovirt-engine/web-ui
2018-01-02 13:14:30,419+02 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 93) WFLYUT0022: Unregistered web context: /ovirt-engine/rhv
2018-01-02 13:14:30,403+02 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 80) WFLYUT0022: Unregistered web context: /ovirt-engine/apidoc
2018-01-02 13:14:29,742+02 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@4e32357b for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
    at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:511) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1369) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1328) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1301) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractOperationContext$Step.access$300(AbstractOperationContext.java:1185) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractOperationContext.executeResultHandlerPhase(AbstractOperationContext.java:767) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:644) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:370) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1344) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:485) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:387) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:349) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.server.ServerService.boot(ServerService.java:392) [wildfly-server-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.server.ServerService.boot(ServerService.java:365) [wildfly-server-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:299) [wildfly-controller-2.1.2.Final-redhat-1.jar:2.1.2.Final-redhat-1]
    at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]
2018-01-02 13:14:30,607+02 ERROR [org.jboss.as.controller.client] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@4e32357b for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException
So, could someone please reproduce this issue with WildFly logs in DEBUG? It can be done by editing /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in and changing the following lines:

1. Change INFO to DEBUG in:

   <file-handler name="SERVER">
     <level name="DEBUG"/>
     <formatter>

2. Change INFO to DEBUG in:

   <root-logger>
     <level name="DEBUG"/>
     <handlers>

With those logs we should be able to understand why the deployment took that much time. Thanks
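The two edits above can be previewed on a scratch copy before touching the real file. This sketch uses a hypothetical minimal excerpt of the XML and a blanket INFO-to-DEBUG substitution; on the real ovirt-engine.xml.in you would want to scope the change to just the two elements mentioned:

```shell
# Write a hypothetical minimal excerpt of ovirt-engine.xml.in to a scratch file.
cat > /tmp/ovirt-engine.xml.in.sample <<'EOF'
<file-handler name="SERVER">
  <level name="INFO"/>
</file-handler>
<root-logger>
  <level name="INFO"/>
</root-logger>
EOF

# Raise every logger level in the sample from INFO to DEBUG.
sed -i 's|<level name="INFO"/>|<level name="DEBUG"/>|' /tmp/ovirt-engine.xml.in.sample

# Both <level> elements should now read DEBUG.
grep -c 'name="DEBUG"' /tmp/ovirt-engine.xml.in.sample   # prints 2
```

After editing the real file, ovirt-engine would need a restart for the new log levels to apply.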
85898 is an HE-setup patch that patches the engine as requested in comment 9.
BTW, https://gerrit.ovirt.org/85890 indeed makes it work for me. So it might really be related to high load.
Looking at the logs provided by Didi on the `devel` list (see [1]), I see that the last activity performed by the application server service threads is the following:

2018-01-01 08:19:18,619-05 INFO [org.ovirt.engine.core.bll.ServiceLoader] (ServerService Thread Pool -- 57) [] Start org.ovirt.engine.core.bll.scheduling.SchedulingManager@104b2434

In a normal start-up sequence, right after that message for the `SchedulingManager` a similar one should appear for the `HostDeviceManager`, as the code calls them in sequence:

  // From InitBackendServicesOnStartupBean.java:
  serviceLoader.load(SchedulingManager.class);
  sessionDataContainer.cleanupEngineSessionsOnStartup();
  serviceLoader.load(HostDeviceManager.class);

But that `HostDeviceManager` message never appears: the application server initialization times out first. Looking at the code for `HostDeviceManager`, it looks like during its initialization it tries to reload the devices of the hosts that are `UP`:

  // From HostDeviceManager.java:
  @PostConstruct
  private void init() {
      ...
      backend.runInternalMultipleActions(ActionType.RefreshHostDevices, parameters);
      hostDeviceDao.cleanDownVms();
  }

And the `RefreshHostDevices` command in turn depends on `HostDeviceManager`:

  // From RefreshHostDevicesCommand.java:
  @NonTransactiveCommandAttribute
  public class RefreshHostDevicesCommand ... {
      ...
      @Inject
      private HostDeviceManager hostDeviceManager;
      ...
  }

So the net result is a circular dependency: initializing the `HostDeviceManager` requires a working `HostDeviceManager`. Most probably that is what blocks the initialization of the application server.

This doesn't always happen because the `RefreshHostDevices` command is executed asynchronously, in a different thread. If that thread is fast enough, it will try to use the `HostDeviceManager` before it has finished initializing, and thus it will block forever. If it isn't fast enough, then the `HostDeviceManager` will already be initialized, and there is no problem.
Anyhow, I think we should remove that circular dependency. Please try to obtain a thread dump of the engine while it is blocked in this wait; that can help us determine whether this is really what is happening. [1] http://lists.ovirt.org/pipermail/devel/2018-January/032365.html
Juan, thanks for the detailed analysis. I hope to attach logs with DEBUG level and thread dumps soon; hopefully they can confirm the root cause. Is there any way to automatically verify that we do not have other such potential loops? Perhaps add a flag somewhere to sleep for a random time at the beginning of each such service-starting thread? Or some other way to make them start in random order?
I am not aware of any mechanism or tool to automatically find this kind of circular dependency.
Created attachment 1375786 [details] /var/log/ovirt-engine with DEBUG and full thread dumps
The thread stack dump seems to confirm what I suspected. The thread that initializes the backend is blocked waiting on lock 0x00000000cc238218, for the `HostDeviceManager` instance to be created, which blocks the startup of the application server:

"ServerService Thread Pool -- 68" #124 prio=5 os_prio=0 tid=0x000000000249f7d0 nid=0x773 waiting for monitor entry [0x00007f4f52802000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.jboss.weld.bean.ContextualInstanceStrategy$ApplicationScopedContextualInstanceStrategy.getIfExists(ContextualInstanceStrategy.java:122)
    - waiting to lock <0x00000000cc238218> (a org.jboss.weld.bean.ContextualInstanceStrategy$ApplicationScopedContextualInstanceStrategy)

And then there is the thread that already obtained that lock, the one trying to do the initialization of the `HostDeviceManager`. That thread is blocked waiting for the backend bean to be completely initialized, which will never happen:

"EE-ManagedThreadFactory-engine-Thread-3" #169 daemon prio=5 os_prio=0 tid=0x00000000064d5600 nid=0x7ad in Object.wait() [0x00007f4f4efd6000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000c3a899a8> (a org.jboss.as.ee.component.deployers.StartupCountdown)
    at java.lang.Object.wait(Object.java:502)
    at org.jboss.as.ee.component.deployers.StartupCountdown.await(StartupCountdown.java:45)
    - locked <0x00000000c3a899a8> (a org.jboss.as.ee.component.deployers.StartupCountdown)
    at org.jboss.as.ejb3.deployment.processors.StartupAwaitInterceptor.processInvocation(StartupAwaitInterceptor.java:21)
    ...
    at org.jboss.weld.bean.ContextualInstanceStrategy$ApplicationScopedContextualInstanceStrategy.get(ContextualInstanceStrategy.java:140)
    - locked <0x00000000cc238218> (a org.jboss.weld.bean.ContextualInstanceStrategy$ApplicationScopedContextualInstanceStrategy)

These two threads are in an indirect deadlock.
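The wait-for-each-other cycle in those two stack traces can be reproduced in miniature outside the application server. The following is an illustrative sketch in plain Python threads (not oVirt code; every name is a stand-in for the beans and monitors above): the startup thread blocks on the bean lock held by the async-init thread, while the async-init thread waits on a startup latch that only the startup thread can release.

```python
import threading
import time

bean_lock = threading.Lock()      # plays the role of monitor 0x00000000cc238218
startup_done = threading.Event()  # plays the role of StartupCountdown

def refresh_host_devices():
    # The async RefreshHostDevices-like thread: grabs the bean lock first...
    with bean_lock:
        # ...then waits for startup to complete, which cannot happen
        # while it holds the lock. Bounded so the sketch terminates.
        startup_done.wait(timeout=2.0)

def server_startup():
    # The "ServerService Thread Pool" thread: blocked on the bean lock.
    acquired = bean_lock.acquire(timeout=0.5)
    if acquired:
        startup_done.set()
        bean_lock.release()
    return acquired

worker = threading.Thread(target=refresh_host_devices)
worker.start()
time.sleep(0.1)                    # let the worker win the race for the lock
deadlocked = not server_startup()  # True: each thread waits on the other
worker.join()
print("indirect deadlock:", deadlocked)
```

Unlike the real boot, which hangs until WildFly's 300-second timeout, the sketch uses bounded waits so it can detect and report the cycle; flipping the `time.sleep` so the startup thread wins the race makes `deadlocked` False, mirroring the "if it isn't fast enough, there is no problem" case from comment 9.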
I have the same issue with a nested hosted-engine environment, but in my case, for some reason, the VM seems very slow and restarting ovirt-engine does not resolve the issue. Logs and setup details attached in the automation-nested-1-hosted-engine-logs.tar.gz file.
Created attachment 1378552 [details] setup log and answer file - engine and server logs
Still there with ovirt-engine.noarch 4.2.1.1-1.el7.centos @ovirt-4.2-pre. In server.log:

2018-01-15 18:53:26,808+01 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 60) WFLYCLINF0002: Started timeout-base cache from ovirt-engine container
2018-01-15 18:58:05,045+01 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[ ("core-service" => "management"), ("management-interface" => "native-interface") ]'
2018-01-15 18:58:05,097+01 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 64) WFLYUT0022: Unregistered web context: '/ovirt-engine/services' from server 'default-server'
2018-01-15 18:58:05,097+01 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 66) WFLYUT0022: Unregistered web context: '/ovirt-engine/api' from server 'default-server'
2018-01-15 18:58:05,098+01 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 67) WFLYUT0022: Unregistered web context: '/ovirt-engine/webadmin' from server 'default-server'
2018-01-15 18:58:05,097+01 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 65) WFLYUT0022: Unregistered web context: '/ovirt-engine/docs' from server 'default-server'
2018-01-15 18:58:05,104+01 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 63) WFLYUT0022: Unregistered web context: '/ovirt-engine' from server 'default-server'
2018-01-15 18:58:05,105+01 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 63) WFLYUT0022: Unregistered web context: '/' from server 'default-server'
2018-01-15 18:58:05,107+01 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 68) WFLYUT0022: Unregistered web context: '/ovirt-engine/sso' from server 'default-server'
2018-01-15 18:58:05,118+01 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 70) WFLYUT0022: Unregistered web context: '/ovirt-engine/apidoc' from server 'default-server'
2018-01-15 18:58:05,134+01 INFO [org.wildfly.extension.undertow] (ServerService Thread Pool -- 69) WFLYUT0022: Unregistered web context: '/ovirt-engine/web-ui' from server 'default-server'
2018-01-15 18:58:05,180+01 INFO [org.jboss.as.server.deployment] (MSC service thread 1-7) WFLYSRV0028: Stopped deployment apidoc.war (runtime-name: apidoc.war) in 127ms
2018-01-15 18:58:05,181+01 INFO [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0028: Stopped deployment restapi.war (runtime-name: restapi.war) in 129ms
2018-01-15 18:58:05,192+01 INFO [org.jboss.as.server.deployment] (MSC service thread 1-5) WFLYSRV0028: Stopped deployment ovirt-web-ui.war (runtime-name: ovirt-web-ui.war) in 140ms
2018-01-15 18:58:10,049+01 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@49d3206f for operation add-deployer-chains at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
    at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:522) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1483) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1437) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1410) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractOperationContext$Step.access$400(AbstractOperationContext.java:1284) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractOperationContext.executeResultHandlerPhase(AbstractOperationContext.java:857) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:709) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:450) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1402) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:516) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:468) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:430) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.server.ServerService.boot(ServerService.java:437) [wildfly-server-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.server.ServerService.boot(ServerService.java:396) [wildfly-server-3.0.8.Final.jar:3.0.8.Final]
    at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:370) [wildfly-controller-3.0.8.Final.jar:3.0.8.Final]
    at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]
2018-01-15 18:58:10,052+01 ERROR [org.jboss.as.controller.client] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@49d3206f for operation add-deployer-chains at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException
Seen also here: http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/2020/
According to the logs it seems like a different issue with the dependencies of HostDeviceManager. Simone, is it possible to attach an engine thread dump so we can see which thread is blocking the initialization?
Here it is: it's not 100% reproducible. On the first attempt (on the locally running engine VM) WildFly started correctly; then we moved the disk to the shared storage and restarted the VM from there, and WildFly failed due to:

2018-01-16 12:33:44,249+01 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[ ("core-service" => "management"), ("management-interface" => "native-interface") ]'

Then, as expected, since the engine was down, ovirt-ha-agent shut down and restarted the engine VM, and this time WildFly started correctly.
Created attachment 1381985 [details] wildfly debug logs
Our Brno QE env - brq-setup - hit this issue as well. We have used this env since 3.1; please provide a solution.
The new patch http://gerrit.ovirt.org/86439 should fix the issue.
Worked for me just fine with these components, for an FC deployment of node zero:

ovirt-hosted-engine-ha-2.2.5-0.0.master.20180117160214.20180117160210.gitd5b58a8.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.8-0.0.master.20180123120416.gitfa69a78.el7.centos.noarch
ovirt-engine-appliance-4.2-20180121.1.el7.centos.noarch
Linux 3.10.0-693.17.1.el7.x86_64 #1 SMP Sun Jan 14 10:36:03 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Moving to verified. Please reopen if you still see this issue.
I no longer see this bug happening, already with ovirt-engine-setup-base-4.2.1.2-0.0.master.20180115100502.gitf91bea1.el7.centos.noarch.
Before verifying this I'd like Jiri to upgrade http://brq-setup.rhev.lab.eng.brq.redhat.com/ to this week's build (probably 4.2.1.3) and see whether the issue is no longer visible, as hitting it when deploying a new HE is less severe than reproducing it during an engine upgrade.
It is also worth mentioning that during node zero deployment the issue was not 100% reproducible; it was a race, so repeating the test several times is probably a good idea.
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-master/15 still failed for the same reason with ovirt-engine-appliance-4.2-20180124.1.el7.centos.ova, which should probably also include https://gerrit.ovirt.org/#/c/86667/.
(In reply to Simone Tiraboschi from comment #30)
> http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-
> master/15 still failed for the same reason with
> ovirt-engine-appliance-4.2-20180124.1.el7.centos.ova which should probably
> include also https://gerrit.ovirt.org/#/c/86667/

http://jenkins.ovirt.org/view/oVirt system tests/job/ovirt-system-tests_manual/2077 finished successfully, as it contains ovirt-engine-4.2.1.4-0.0.master.20180124131357.git9958cd1.el7.centos.noarch.
I just updated the brq-setup env to ovirt-engine-backend-4.2.1.3-0.1.el7.noarch (overriding rnori@'s manual fixes) and everything seems OK.
This bugzilla is included in the oVirt 4.2.1 release, published on Feb 12th 2018. Since the problem described in this bug report should be resolved in the oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.