Bug 1357615

Summary: Failed to add RHEL7.2 as additional hosted-engine-host via WEBUI.
Product: [oVirt] ovirt-host-deploy
Reporter: Nikolai Sednev <nsednev>
Component: Plugins.Hosted-Engine
Assignee: Sandro Bonazzola <sbonazzo>
Status: CLOSED DUPLICATE
QA Contact: Pavel Stehlik <pstehlik>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: master
CC: bugs, msivak, rgolan, stirabos
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Flags: rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-19 09:09:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1167262    
Attachments:
Description             Flags
sosreport from engine   none
sosreport from alma03   none
sosreport from alma04   none

Description Nikolai Sednev 2016-07-18 16:08:30 UTC
Description of problem:
I deployed HE over iSCSI on one RHEL7.2 host, added an NFS data storage domain to it, and then tried to add an additional RHEL7.2 hosted-engine-host, which failed.
In my case both hosts can also "see" 2 additional empty iSCSI LUNs on the iSCSI storage besides the HE's LUN (the 75 GB LUN belongs to HE; the others are empty).

The IQN of the hosted-engine-host was correct.

Version-Release number of selected component (if applicable):
Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-4.el7ev.noarch
rhevm-4.0.1.1-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhevm-branding-rhev-4.0.0-3.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhev-release-4.0.1-2-001.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Hosts:
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.18.x86_64
mom-0.5.5-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1-1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.18.6-1.el7ev.x86_64
sanlock-3.2.4-2.el7_2.x86_64
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
Linux version 3.10.0-327.30.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Wed Jul 13 22:09:46 EDT 2016
Linux 3.10.0-327.30.1.el7.x86_64 #1 SMP Wed Jul 13 22:09:46 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over iSCSI and add an NFS data storage domain so that HE is imported correctly into the engine's WEBUI.
2. Try adding an additional HE host via WEBUI; make sure its IQN is correct and mapped prior to adding it.

Actual results:
Host alma04.qa.lab.tlv.redhat.com reports about one of the Active Storage Domains as Problematic. Host failed to get added as additional hosted-engine-host.

Expected results:
The host should be added successfully via WEBUI.

Additional info:
Sosreports from both hosts and the engine are attached (the host being added is alma04).

Comment 1 Nikolai Sednev 2016-07-18 16:09:52 UTC
Created attachment 1181172 [details]
sosreport from engine

Comment 2 Nikolai Sednev 2016-07-18 16:11:16 UTC
Created attachment 1181173 [details]
sosreport from alma03

Comment 3 Nikolai Sednev 2016-07-18 16:12:23 UTC
Created attachment 1181175 [details]
sosreport from alma04

Comment 4 Simone Tiraboschi 2016-07-18 16:23:24 UTC
This seems to be just a duplicate of 1350763: Add host failed - failed to configure ovirtmgmt network on host since vdsm is still on recovery

It's not hosted-engine specific.


2016-07-18 11:44:53,487 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [148b825d] Correlation ID: 148b825d, Call Stack: null, Custom Event ID: -1, Message: Installing Host alma04.qa.lab.tlv.redhat.com. Stage: Termination.
2016-07-18 11:44:53,620 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to alma04.qa.lab.tlv.redhat.com/10.35.117.26
2016-07-18 11:44:56,620 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.TimeBoundPollVDSCommand] (org.ovirt.thread.pool-6-thread-24) [148b825d] Command 'TimeBoundPollVDSCommand(HostName = alma04.qa.lab.tlv.redhat.com, TimeBoundPollVDSCommandParameters:{runAsync='true', hostId='beff4479-67f4-425e-8484-82ac26dc2fc4'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2016-07-18 11:44:56,620 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.TimeBoundPollVDSCommand] (org.ovirt.thread.pool-6-thread-24) [148b825d] Timeout waiting for VDSM response: null
2016-07-18 11:44:56,620 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient] (org.ovirt.thread.pool-6-thread-32) [] null: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) [rt.jar:1.8.0_101]
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) [rt.jar:1.8.0_101]
	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) [rt.jar:1.8.0_101]
	at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.waitForConnect(SSLStompClient.java:107) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.sendMessage(SSLStompClient.java:78) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:81) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:91) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.lambda$timeBoundPoll$2(JsonRpcVdsServer.java:972) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer$FutureCallable.call(JsonRpcVdsServer.java:458) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer$FutureCallable.call(JsonRpcVdsServer.java:447) [vdsbroker.jar:]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_101]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_101]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_101]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_101]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_101]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_101]

2016-07-18 11:45:00,602 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler7) [39efcb4] Fetched 1 VMs from VDS 'de5723be-3605-420e-9afd-86dc0e08c606'
2016-07-18 11:45:01,621 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to alma04.qa.lab.tlv.redhat.com/10.35.117.26
2016-07-18 11:45:04,764 INFO  [org.ovirt.engine.core.bll.network.NetworkConfigurator] (org.ovirt.thread.pool-6-thread-24) [148b825d] Engine managed to communicate with VDSM agent on host 'alma04.qa.lab.tlv.redhat.com' ('beff4479-67f4-425e-8484-82ac26dc2fc4')
2016-07-18 11:45:05,736 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (org.ovirt.thread.pool-6-thread-24) [54341c1e] Lock Acquired to object 'EngineLock:{exclusiveLocks='[beff4479-67f4-425e-8484-82ac26dc2fc4=<HOST_NETWORK, ACTION_TYPE_FAILED_SETUP_NETWORKS_IN_PROGRESS>]', sharedLocks='null'}'
2016-07-18 11:45:05,829 INFO  [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (org.ovirt.thread.pool-6-thread-24) [54341c1e] Running command: HostSetupNetworksCommand internal: true. Entities affected :  ID: beff4479-67f4-425e-8484-82ac26dc2fc4 Type: VDSAction group CONFIGURE_HOST_NETWORK with role type ADMIN
2016-07-18 11:45:05,835 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (org.ovirt.thread.pool-6-thread-24) [54341c1e] START, HostSetupNetworksVDSCommand(HostName = alma04.qa.lab.tlv.redhat.com, HostSetupNetworksVdsCommandParameters:{runAsync='true', hostId='beff4479-67f4-425e-8484-82ac26dc2fc4', vds='Host[alma04.qa.lab.tlv.redhat.com,beff4479-67f4-425e-8484-82ac26dc2fc4]', rollbackOnFailure='true', connectivityTimeout='120', networks='[HostNetwork:{defaultRoute='true', bonding='false', networkName='ovirtmgmt', nicName='enp3s0f0', vlan='null', mtu='0', vmNetwork='true', stp='false', properties='null', ipv4BootProtocol='DHCP', ipv4Address='null', ipv4Netmask='null', ipv4Gateway='null', ipv6BootProtocol='NONE', ipv6Address='null', ipv6Prefix='null', ipv6Gateway='null'}]', removedNetworks='[]', bonds='[]', removedBonds='[]'}), log id: 7572ebe8
2016-07-18 11:45:05,838 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (org.ovirt.thread.pool-6-thread-24) [54341c1e] FINISH, HostSetupNetworksVDSCommand, log id: 7572ebe8
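
A minimal sketch of how entries like the ones above can be pulled out of an engine.log from the attached sosreport, assuming the line layout visible in this excerpt (timestamp, level, [logger], (thread), [correlation ID], message). The function name and the regular expression are illustrative assumptions, not part of oVirt.

```python
import re

# Layout assumed from the excerpt:
# "<date> <time>,<ms> <LEVEL> [<logger>] (<thread>) [<correlation id>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]*)\)\s+"
    r"\[(?P<corr>[^\]]*)\]\s*"
    r"(?P<msg>.*)$"
)


def filter_engine_log(lines, level=None, correlation_id=None):
    """Yield parsed entries matching the given level and/or correlation ID.

    Hypothetical helper: lines that do not start with a timestamp
    (stack-trace continuations, blank lines) are skipped.
    """
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue
        if level is not None and m.group("level") != level:
            continue
        if correlation_id is not None and m.group("corr") != correlation_id:
            continue
        yield m.groupdict()
```

Filtering with level="ERROR" and correlation_id="148b825d" would isolate the TimeBoundPollVDSCommand timeouts from the host deployment flow shown above.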

*** This bug has been marked as a duplicate of bug 1350763 ***

Comment 5 Nikolai Sednev 2016-07-19 08:18:36 UTC
These two are different bugs: mine was found on 4.0.1, whereas https://bugzilla.redhat.com/show_bug.cgi?id=1350763#c31 was for 3.6.8 and was verified on 2016-07-18 06:39:04 EDT. However, I still hit this bug on 2016-07-18 12:08 EDT, so this bug should be fixed for 4.0.
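
The timing argument can be checked with a short sketch. It assumes a fixed UTC-4 offset for EDT and takes the report time from this bug's Description timestamp (2016-07-18 16:08:30 UTC); the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset; assumes daylight time (EDT = UTC-4) is in effect in July.
EDT = timezone(timedelta(hours=-4), "EDT")

# Verification time quoted for bug 1350763.
verified = datetime(2016, 7, 18, 6, 39, 4, tzinfo=EDT)

# Description timestamp of this bug, given in UTC.
reported_utc = datetime(2016, 7, 18, 16, 8, 30, tzinfo=timezone.utc)
reported_edt = reported_utc.astimezone(EDT)

# Time elapsed between the other bug's verification and this report.
gap = reported_utc - verified

print(reported_edt.strftime("%Y-%m-%d %H:%M %Z"))
print(gap)
```

Under these assumptions the report was filed several hours after the 3.6.8 fix was verified.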

Comment 6 Simone Tiraboschi 2016-07-19 09:09:35 UTC
Ok sorry, 1348103 is the same issue for 4.0.2 so setting this as a duplicate of that.

*** This bug has been marked as a duplicate of bug 1348103 ***

Comment 7 Nikolai Sednev 2016-07-19 09:34:19 UTC
I'm not sure these are the same: in bug 1348103 the addition succeeded after 5 minutes, but in mine it does not; the host stays in this state permanently.

Comment 8 Simone Tiraboschi 2016-07-19 09:40:12 UTC
(In reply to Nikolai Sednev from comment #7)
> I'm not sure these are the same: in bug 1348103 the addition succeeded
> after 5 minutes, but in mine it does not; the host stays in this state
> permanently.

In my opinion, the issue is really the same.

In that case, after 5 minutes the AutoRecoveryManager kicks in and the host comes up by itself, so just retrying on the hosted-engine side is enough to continue.
Maybe we need to better understand why the AutoRecoveryManager is not enough here, but the root issue is really the same.
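
The "just retry" behaviour described above can be sketched as a bounded polling loop. This is only an illustration under stated assumptions: `get_status` is a hypothetical callable supplied by the caller, the "up" status string is a placeholder, and the default timeout is simply chosen to exceed the roughly 5-minute auto-recovery window mentioned in this comment. It is not the hosted-engine-setup implementation.

```python
import time


def wait_for_host_up(get_status, timeout=600.0, interval=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "up" or the timeout expires.

    get_status is a hypothetical callable returning the host status as a
    string. Returns True if the host came up, False if the deadline passed
    first (the situation reported in this bug, where the host never recovers).
    """
    deadline = clock() + timeout
    while True:
        if get_status() == "up":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller that gets False back would know that auto-recovery alone was not enough and that the failure needs further investigation.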