Bug 1216172

Summary: [self-hosted] Can't add 2nd host into self-hosted env: The VDSM host was found in a failed state... Unable to add slot-5b to the manager
Product: Red Hat Enterprise Virtualization Manager Reporter: Jiri Belka <jbelka>
Component: ovirt-hosted-engine-setup Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA QA Contact: Artyom <alukiano>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.5.1 CC: bazulay, gklein, istein, jbelka, lpeer, lsurette, oourfali, pstehlik, pzhukov, sbonazzo, yeylon, ykaul, ylavi
Target Milestone: ovirt-3.6.0-rc Keywords: Regression, TestBlocker, Triaged, ZStream
Target Release: 3.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-3.6.0-alpha1 Doc Type: Bug Fix
Doc Text:
Previously, HostId was treated as an integer on the first host and as a string on additional hosts due to bad parsing of the answerfile, causing setup to fail. Now, this failure has been fixed by treating HostId as an integer on all hosts.
Story Points: ---
Clone Of:
: 1221290 Environment:
Last Closed: 2016-03-09 19:12:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Integration RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1215967, 1221148, 1226670, 1271272    
Bug Blocks: 1221290, 1234915    
Attachments:
Description Flags
logs from 2nd host none
engine logs none
ovirt-hosted-engine-setup-20150529172808-on9y3z.log none

Description Jiri Belka 2015-04-28 16:37:31 UTC
Created attachment 1019789
logs from 2nd host

Description of problem:

I can't add a second host to the self-hosted environment; the first host runs OK.

- There is a problem with hosted-engine --deploy: the rhevm bridge is not created successfully. I created it manually; the IP was still on the underlying (em1) device. Then I re-executed hosted-engine --deploy:

[ INFO  ] Configuring VM
[ INFO  ] Updating hosted-engine configuration
[ INFO  ] Stage: Transaction commit
[ INFO  ] Stage: Closing up
[ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO  ] Still waiting for VDSM host to become operational...
[ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
[ ERROR ] Unable to add slot-5b to the manager
[ INFO  ] Enabling and starting HA services
          Hosted Engine successfully set up
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150428182105.conf'
[ INFO  ] Generating answer file '/etc/ovirt-hosted-engine/answers.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination

vdsm.log is full of Python exceptions...
Thread-47::DEBUG::2015-04-28 18:20:02,837::fileSD::261::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-47::DEBUG::2015-04-28 18:20:02,842::fileSD::261::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n497 bytes (497 B) copied, 0.000312696 s, 1.6 MB/s\n'; <rc> = 0
Thread-47::ERROR::2015-04-28 18:20:02,845::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 23c03bb6-9889-4cbf-b7ad-55b9a2c70653 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 250, in _monitorDomain
    self.nextStatus.hasHostId = self.domain.hasHostId(self.hostId)
  File "/usr/share/vdsm/storage/sd.py", line 483, in hasHostId
    return self._clusterLock.hasHostId(hostId)
  File "/usr/share/vdsm/storage/clusterlock.py", line 261, in hasHostId
    hostId, self._idsPath)
TypeError: argument 2 must be integer<k>, not str

...


MainThread::DEBUG::2015-04-28 18:24:18,293::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
ioprocess communication (36158)::ERROR::2015-04-28 18:24:18,292::__init__::152::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
    raise Exception("FD closed")
Exception: FD closed

Version-Release number of selected component (if applicable):
vdsm-4.16.13.1-1.el7ev.x86_64
ovirt-hosted-engine-setup-1.2.2-3.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. two hosts, one host part of self-hosted engine
2. have rhevm env working
3. add 2nd host into self-hosted engine

Actual results:
setup fails in the end, seems vdsm related

Expected results:
should work, it should be "HA"

Additional info:

Comment 1 Jiri Belka 2015-04-28 16:38:17 UTC
Created attachment 1019790
engine logs

Comment 2 Pavel Stehlik 2015-04-29 06:26:07 UTC
Please retry now. Due to EMC storage policy, the host's IQN needs to be allowed access to the shares. I've just added its IQN to the list.

Comment 3 Jiri Belka 2015-04-29 07:42:43 UTC
No, still same issue.

Comment 6 Jiri Belka 2015-05-29 15:32:03 UTC
While adding the 2nd host with ovirt-hosted-engine-setup-1.3.0-0.0.master.20150518075146.gitdd9741f.el7.noarch:

...
          --== HOSTED ENGINE CONFIGURATION ==--
         
          Enter the name which will be used to identify this host inside the Administrator Portal [hosted_engine_2]: 
          Enter 'admin@internal' user password that will be used for accessing the Administrator Portal: 
          Confirm 'admin@internal' user password: 
[WARNING] Failed to resolve jb-hosted.rhev.lab.eng.brq.redhat.com using DNS, it can be resolved only locally
[ INFO  ] Stage: Setup validation
[ ERROR ] Failed to execute stage 'Setup validation': [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.34.63.199:_jbelka_jb-hosted/5440dfcd-9be7-4e43-97c5-bff83cc20e9b/ha_agent/hosted-engine.metadata'
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150529172837.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination


...
2015-05-29 17:28:34 DEBUG otopi.plugins.ovirt_hosted_engine_setup.pki.vdsmpki plugin.execute:940 execute-output: ('/bin/openssl', 'x509', '-noout', '-text', '-in', '/etc/pki/vdsm/libvirt-spice/server-cert.pem') stderr:


2015-05-29 17:28:34 DEBUG otopi.context context._executeMethod:141 Stage validation METHOD otopi.plugins.ovirt_hosted_engine_setup.sanlock.lockspace.Plugin._validation
2015-05-29 17:28:34 DEBUG otopi.context context._executeMethod:147 condition False
2015-05-29 17:28:34 DEBUG otopi.context context._executeMethod:141 Stage validation METHOD otopi.plugins.ovirt_hosted_engine_setup.storage.storage.Plugin._validation
2015-05-29 17:28:34 DEBUG otopi.context context._executeMethod:155 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 145, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 263, in _validation
    ] + ".metadata",
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 168, in get_all_host_stats_direct
    self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 119, in get_all_stats_direct
    stats = sb.get_raw_stats_for_service_type("client", service_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 125, in get_raw_stats_for_service_type
    f = os.open(path, direct_flag | os.O_RDONLY)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.34.63.199:_jbelka_jb-hosted/5440dfcd-9be7-4e43-97c5-bff83cc20e9b/ha_agent/hosted-engine.metadata'
2015-05-29 17:28:34 ERROR otopi.context context._executeMethod:164 Failed to execute stage 'Setup validation': [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.34.63.199:_jbelka_jb-hosted/5440dfcd-9be7-4e43-97c5-bff83cc20e9b/ha_agent/hosted-engine.metadata'
...

Comment 7 Jiri Belka 2015-05-29 15:32:28 UTC
Created attachment 1032157
ovirt-hosted-engine-setup-20150529172808-on9y3z.log

Comment 8 Jiri Belka 2015-05-29 15:33:57 UTC
The metadata path is a broken symlink; its target under /var/run/vdsm/storage does not exist:

[root@dell-r210ii-13 ~]# ls -l /rhev/data-center/mnt/10.34.63.199:_jbelka_jb-hosted/5440dfcd-9be7-4e43-97c5-bff83cc20e9b/ha_agent/hosted-engine.metadata
lrwxrwxrwx. 1 vdsm kvm 132 May 29 17:15 /rhev/data-center/mnt/10.34.63.199:_jbelka_jb-hosted/5440dfcd-9be7-4e43-97c5-bff83cc20e9b/ha_agent/hosted-engine.metadata -> /var/run/vdsm/storage/5440dfcd-9be7-4e43-97c5-bff83cc20e9b/e2124bb1-bd54-4527-90ce-903e9bf7daf1/1ed25ddd-1fbf-4c16-ac24-1becbf1e6fc7

[root@dell-r210ii-13 ~]# find /var/run/vdsm/
/var/run/vdsm/
/var/run/vdsm/lvm
/var/run/vdsm/lvm/lvm.conf
/var/run/vdsm/client.log
/var/run/vdsm/nets_restored
/var/run/vdsm/svdsm.sock
/var/run/vdsm/v2v
/var/run/vdsm/trackedInterfaces
/var/run/vdsm/sourceRoutes
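
The same check can be scripted. This is a minimal sketch in Python, assuming only the standard library; the helper name describe_link is hypothetical and is not part of VDSM or ovirt-hosted-engine-setup. It distinguishes a dangling symlink (the case shown above) from a path that is simply missing.

```python
import os


def describe_link(path):
    """Classify a path and return (kind, target).

    kind is one of 'symlink', 'broken symlink', 'regular', 'missing';
    target is the link target for symlinks, otherwise None.
    """
    if os.path.islink(path):
        target = os.readlink(path)
        # os.path.exists() follows the link, so it is False when the
        # target is absent even though the link itself is present.
        if os.path.exists(path):
            return "symlink", target
        return "broken symlink", target
    if os.path.exists(path):
        return "regular", None
    return "missing", None
```

Calling describe_link() on the hosted-engine.metadata path from the ls output would be expected to report a broken symlink on this host, since nothing exists under /var/run/vdsm/storage.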

Comment 9 Jiri Belka 2015-05-29 15:35:17 UTC
vdsm-4.17.0-822.git9b11a18.el7.noarch

on RHEL 7.1, vdsm not running yet (the error occurred during hosted-engine --deploy on the 2nd host)

Comment 10 Simone Tiraboschi 2015-06-08 07:59:11 UTC
The original issue was this one

File "/usr/share/vdsm/storage/clusterlock.py", line 261, in hasHostId
    hostId, self._idsPath)
TypeError: argument 2 must be integer<k>, not str

and now it seems OK because setup gets further.
It has also been marked as verified for 3.5.3: https://bugzilla.redhat.com/1221290
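
To illustrate the type mismatch behind that TypeError, here is a minimal sketch in Python. It is not the actual fix: the 'type:value' entry format and both function names (parse_answer_value, normalize_host_id) are assumptions made for illustration. It shows how a host ID read back from an answer file as a string can be coerced to an integer before being passed to code that requires one.

```python
def parse_answer_value(entry):
    """Parse a hypothetical 'type:value' answer-file entry, e.g. 'str:2'.

    Only 'int' is converted; every other type tag yields the raw string.
    """
    type_name, _, value = entry.partition(":")
    if type_name == "int":
        return int(value)
    return value


def normalize_host_id(raw):
    """Return the host ID as an int, whether it arrived as int or str."""
    if isinstance(raw, bool):
        # bool is a subclass of int; reject it explicitly.
        raise TypeError("host ID must not be a boolean")
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        try:
            return int(raw.strip())
        except ValueError:
            pass
    raise ValueError("host ID is not an integer: %r" % (raw,))


# A value stored as 'str:2' comes back as the string '2'; normalizing
# it yields the integer 2 on every host.
host_id = normalize_host_id(parse_answer_value("str:2"))
```

Normalizing at the point where the value is read keeps every later consumer working with one type, rather than relying on each caller to convert.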

With VDSM 4.17 we are facing an additional issue 
2015-05-29 17:28:34 ERROR otopi.context context._executeMethod:164 Failed to execute stage 'Setup validation': [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.34.63.199:_jbelka_jb-hosted/5440dfcd-9be7-4e43-97c5-bff83cc20e9b/ha_agent/hosted-engine.metadata'

which was also reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=1226670

Please handle this separately.

Comment 12 Artyom 2015-09-02 10:46:17 UTC
Verified on ovirt-hosted-engine-setup-1.3.0-0.4.beta.git42eb801.el7ev.noarch
Deployment of an additional host on NFS storage passed without any errors.

Comment 14 errata-xmlrpc 2016-03-09 19:12:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0375.html