Bug 1167074 - [hosted-engine] Deployment fails with "The VDSM host was found in a failed state. Please check engine and bootstrap installation logs."
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 3.5.0
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Simone Tiraboschi
QA Contact: Elad
URL:
Whiteboard:
Duplicates: 1278715 (view as bug list)
Depends On: 1215623 1220824
Blocks: 1207062 1269768
 
Reported: 2014-11-23 10:00 UTC by Elad
Modified: 2016-03-09 19:06 UTC (History)
CC: 24 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, iptables-services was not a required package during hosted-engine deployment, so hosted-engine was unable to configure iptables. Now, iptables-services is required, so hosted-engine can configure iptables correctly.
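The eventual fix (oVirt gerrit 37661, linked below) is a packaging change; a minimal sketch of the relevant spec-file fragment follows. Only the Requires line is confirmed by the patch subject; the surrounding spec context is an assumption.

```spec
# ovirt-hosted-engine-setup.spec (fragment, sketch)
# Pull in the iptables service unit so hosted-engine setup can
# configure and start iptables on EL7 hosts where firewalld is disabled.
Requires: iptables-services
```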
Clone Of:
Clones: 1207062 (view as bug list)
Environment:
Last Closed: 2016-03-09 19:06:18 UTC
oVirt Team: Integration
Target Upstream Version:


Attachments
setup and vdsm logs (1023.99 KB, application/x-gzip)
2014-11-23 10:00 UTC, Elad
no flags Details
updated logs (2.47 MB, application/x-gzip)
2014-11-23 10:05 UTC, Elad
no flags Details
sanlock.log (3.02 KB, application/x-gzip)
2014-11-23 10:06 UTC, Elad
no flags Details
/var/log (457.14 KB, application/x-gzip)
2014-11-26 10:58 UTC, Elad
no flags Details
updated logs (816.36 KB, application/x-gzip)
2014-11-26 11:02 UTC, Elad
no flags Details
sanlock_log_dump (200.93 KB, text/plain)
2014-11-26 17:13 UTC, Simone Tiraboschi
no flags Details
sanlock_status (348 bytes, text/plain)
2014-11-26 17:14 UTC, Simone Tiraboschi
no flags Details
logs from RHEL6 reproduce (2.46 MB, application/x-gzip)
2014-12-07 14:44 UTC, Elad
no flags Details
logs from host and engine (sosreports) and configuration log from deployment of the HE (13.38 MB, application/x-gzip)
2015-12-22 13:23 UTC, Nikolai Sednev
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1236588 high CLOSED Disable ovirt-host-deploy packager plugins while deploying HE 2020-10-14 00:28:05 UTC
Red Hat Product Errata RHEA-2016:0375 normal SHIPPED_LIVE ovirt-hosted-engine-setup bug fix and enhancement update 2016-03-09 23:48:34 UTC
oVirt gerrit 37661 None MERGED packaging: spec: require iptables-services 2020-06-15 10:09:05 UTC

Internal Links: 1236588

Description Elad 2014-11-23 10:00:56 UTC
Created attachment 960409 [details]
setup and vdsm logs

Description of problem:
Deployed hosted-engine using iSCSI. When the RHEV-M installation was completed on the engine's VM, I picked:
(1) Continue setup - engine installation is complete


Then, the installation crashed with:

(1, 2, 3)[1]: 1
[ INFO  ] Engine replied: DB Up!Welcome to Health Status!
          Enter the name of the cluster to which you want to add the host (Default) [Default]: 
[ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO  ] Still waiting for VDSM host to become operational...
[ INFO  ] Still waiting for VDSM host to become operational...
[ INFO  ] Still waiting for VDSM host to become operational...
[ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
[ ERROR ] Unable to add hosted_engine_1 to the manager
          Please shutdown the VM allowing the system to launch it as a monitored service.
          The system will wait until the VM is down.
[ ERROR ] Failed to execute stage 'Closing up': [Errno 111] Connection refused
[ INFO  ] Stage: Clean up
[ ERROR ] Failed to execute stage 'Clean up': [Errno 111] Connection refused
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20141123113619.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination




Version-Release number of selected component (if applicable):

rhev 3.5 vt10.1
vdsm-4.16.7.4-1.el7ev.x86_64
ovirt-hosted-engine-ha-1.2.4-1.el7.noarch
ovirt-hosted-engine-setup-1.2.1-3.el7ev.noarch

Host: 
[root@green-vdsb ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.0 (Maipo)


How reproducible:
Tried once

Steps to Reproduce:
1. hosted-engine --deploy, pick iscsi
2. When the RHEV-M installation is completed on the engine's VM, continue the deployment


Actual results:

Deployed the hosted-engine on a 'clean' environment (no VM running on the host and no storage domain existing on the used LUN). Deployment failed with the error mentioned above.

Expected results:

Deployment should succeed

Additional info:
setup and vdsm logs

Comment 1 Elad 2014-11-23 10:05:27 UTC
Created attachment 960411 [details]
updated logs

from vdsm.log:

Thread-71::DEBUG::2014-11-23 11:34:11,689::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain 31e980f2-ebca-4c73-bb29-0f8a8b06ef48
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('31e980f2-ebca-4c73-bb29-0f8a8b06ef48', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))


Attaching the relevant logs

Comment 2 Elad 2014-11-23 10:06:56 UTC
Created attachment 960412 [details]
sanlock.log

Comment 3 Yedidyah Bar David 2014-11-23 14:51:27 UTC
Looked at the attached logs; I can't see much more than what is already shown above. The sanlock log ends half an hour before this error.

I already talked in private with Elad and he said the relevant data (machines, logs etc.) were already removed. Next time, please keep the environment and also attach logs from the engine VM (especially host-deploy).

I'll leave it for Sandro to have a look too; he might be able to see things I did not.

Comment 4 Yedidyah Bar David 2014-11-24 08:39:45 UTC
David, in case we fail to reproduce, what should QE collect next time to help debug this? all of dom_md? parts? anything else? Thanks!

Comment 5 Yedidyah Bar David 2014-11-24 09:16:55 UTC
Federico, what should be collected to help debug this on vdsm side? Thanks!

Comment 6 Simone Tiraboschi 2014-11-24 13:18:22 UTC
It's not reproducible on RHEL7 with RHEV-M 3.5 vt11.

In the past we had an issue with sanlock and SELinux on RHEL7 but it's now solved.

Tried with:
ovirt-hosted-engine-setup.noarch     1.2.1-4.el7ev           @qa-latest 
selinux-policy.noarch                3.12.1-153.el7_0.12     @rhel7z            
selinux-policy-targeted.noarch       3.12.1-153.el7_0.12     @rhel7z  
vdsm.x86_64                          4.16.7.5-1.el7ev        @qa-latest

Please ensure you have the latest rpms from @rhel7z

Comment 7 Elad 2014-11-24 15:44:24 UTC
(In reply to Simone Tiraboschi from comment #6)
> It's not reproducible on RHEL7 with RHEV-M 3.5 vt11.
> 
> In the past we had an issue with sanlock and SELinux on RHEL7 but it's now
> solved.
> 
> Tried with:
> ovirt-hosted-engine-setup.noarch     1.2.1-4.el7ev           @qa-latest 
> selinux-policy.noarch                3.12.1-153.el7_0.12     @rhel7z        
> 
> selinux-policy-targeted.noarch       3.12.1-153.el7_0.12     @rhel7z  
> vdsm.x86_64                          4.16.7.5-1.el7ev        @qa-latest
> 
> Please ensure you have the latest rpms from @rhel7z

Actually, I'm using newer SELinux packages:

[root@green-vdsb ~]# rpm -qa |grep selinux
libselinux-2.2.2-6.el7.x86_64
selinux-policy-3.13.1-6.el7.noarch
libselinux-ruby-2.2.2-6.el7.x86_64
libselinux-python-2.2.2-6.el7.x86_64
libselinux-utils-2.2.2-6.el7.x86_64
selinux-policy-targeted-3.13.1-6.el7.noarch


This seems to be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1167277 

Please check that; re-opening for now

Comment 8 David Teigland 2014-11-24 15:59:52 UTC
Collecting the output of 'sanlock status' and 'sanlock log_dump' would probably show you what's making the lockspace busy. That part might already be obvious from the vdsm side, I don't know.

Comment 9 Simone Tiraboschi 2014-11-24 17:05:12 UTC
(In reply to Elad from comment #7)
> > Please ensure you have the latest rpms from @rhel7z
> 
> Actually, I'm using newer SElinux packages:

Are you using an alpha release of RHEL 7.1?

Comment 10 Elad 2014-11-25 07:37:09 UTC
(In reply to Simone Tiraboschi from comment #9)
> (In reply to Elad from comment #7)
> > > Please ensure you have the latest rpms from @rhel7z
> > 
> > Actually, I'm using newer SElinux packages:
> 
> Are you using an alpha release of RHEL 7.1?

No, I used that host before to test something with these specific SELinux packages

Comment 11 Simone Tiraboschi 2014-11-25 15:03:23 UTC
It's not reproducible on RHEL-7.1-Alpha-1.2 with
  ovirt-hosted-engine-ha.noarch         1.2.4-1.el7
  ovirt-hosted-engine-setup.noarch      1.2.1-4.el7ev
  selinux-policy.noarch                 3.13.1-9.el7
  selinux-policy-targeted.noarch        3.13.1-9.el7
  vdsm.x86_64                           4.16.7.5-1.el7ev

Closing due to insufficient data. Feel free to reopen if you are able to reproduce it.

Comment 12 Elad 2014-11-26 10:58:25 UTC
Created attachment 961598 [details]
/var/log

Reproduced. Attaching the logs, re-opening.

[ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
[ ERROR ] Unable to add hosted_engine_1 to the manager
          Please shutdown the VM allowing the system to launch it as a monitored service.
          The system will wait until the VM is down.
[ ERROR ] Failed to execute stage 'Closing up': [Errno 111] Connection refused
[ INFO  ] Stage: Clean up
[ ERROR ] Failed to execute stage 'Clean up': [Errno 111] Connection refused
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20141126125221.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination


Red Hat Enterprise Linux Server release 7.0 (Maipo)
ovirt-hosted-engine-setup-1.2.1-3.el7ev.noarch
ovirt-hosted-engine-ha-1.2.4-1.el7.noarch
libselinux-2.2.2-6.el7.x86_64
libselinux-ruby-2.2.2-6.el7.x86_64
selinux-policy-3.12.1-153.el7.noarch
libselinux-utils-2.2.2-6.el7.x86_64
selinux-policy-targeted-3.12.1-153.el7.noarch
libselinux-python-2.2.2-6.el7.x86_64
libvirt-daemon-1.1.1-29.el7_0.3.x86_64
vdsm-4.16.7.4-1.el7ev.x86_64
sanlock-3.1.0-2.el7.x86_64

/var/log attached

Comment 13 Elad 2014-11-26 11:02:16 UTC
Created attachment 961601 [details]
updated logs

Uploading /var/log from the engine as well, plus the answer file

Comment 14 Simone Tiraboschi 2014-11-26 17:12:20 UTC
(In reply to David Teigland from comment #8)
> Collecting the output of 'sanlock status' and 'sanlock log_dump' would
> probably show you what's making the lockspace busy. That part might already
> be obvious from the vdsm side, I don't know.

It's still probably something around sanlock:

Thread-73::DEBUG::2014-11-26 12:50:15,103::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain 7e0e3617-b7c3-47ba-a2ba-ef0fd564cbb9
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('7e0e3617-b7c3-47ba-a2ba-ef0fd564cbb9', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
MainThread::DEBUG::2014-11-26 12:50:15,104::taskManager::90::Storage.TaskManager::(prepareForShutdown) Request to stop all tasks


I'm uploading status and log_dump from that host.

Comment 15 Simone Tiraboschi 2014-11-26 17:13:32 UTC
Created attachment 961721 [details]
sanlock_log_dump

Comment 16 Simone Tiraboschi 2014-11-26 17:14:15 UTC
Created attachment 961722 [details]
sanlock_status

Comment 17 David Teigland 2014-11-26 18:26:39 UTC
rem_lockspace(SANLK_REM_UNUSED) returns -EBUSY if resource leases are still held in the lockspace, which is the case here,

r 7e0e3617-b7c3-47ba-a2ba-ef0fd564cbb9:ddeb2ca1-bd47-4dc0-b561-484161df8a0b:/dev/7e0e3617-b7c3-47ba-a2ba-ef0fd564cbb9/leases:111149056:3 p 15422

2014-11-26 12:20:14+0200 8038 [10967]: cmd_acquire 2,10,15422 ci_in 4 fd 12 count 1
2014-11-26 12:20:14+0200 8038 [10967]: s6:r15 resource 7e0e3617-b7c3-47ba-a2ba-ef0fd564cbb9:ddeb2ca1-bd47-4dc0-b561-484161df8a0b:/dev/7e0e3617-b7c3-47ba-a2ba-ef0fd564cbb9/leases:111149056 for 2,10,15422

...

2014-11-26 12:50:15+0200 9839 [10966]: cmd_rem_lockspace 4,12 7e0e3617-b7c3-47ba-a2ba-ef0fd564cbb9 flags 2
2014-11-26 12:50:15+0200 9839 [10966]: cmd_rem_lockspace 4,12 done -16

If vdsm does not want a resource lease to block the removal of the lockspace, it can drop the REM_UNUSED flag.
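David's explanation can be modeled with a short, self-contained Python sketch. The `SanlockException` name and the `unused` flag mirror the names in the vdsm tracebacks above; the in-memory `rem_lockspace` model is purely illustrative, not the real sanlock binding.

```python
import errno

class SanlockException(Exception):
    """Stand-in for vdsm's sanlock exception: carries the sanlock errno."""
    def __init__(self, err, *args):
        super().__init__(err, *args)
        self.errno = err

def rem_lockspace(held_leases, unused=False):
    """Model of sanlock's rem_lockspace: with the UNUSED flag set,
    removal fails with EBUSY while any resource lease is still held."""
    if unused and held_leases:
        raise SanlockException(errno.EBUSY,
                               "Sanlock lockspace remove failure",
                               "Device or resource busy")
    return "removed"

# The HostedEngine VM still holds its volume lease, so removal with
# the UNUSED flag fails -- the ReleaseHostIdFailure seen in vdsm.log:
try:
    rem_lockspace(held_leases=["ddeb2ca1-..."], unused=True)
except SanlockException as e:
    assert e.errno == errno.EBUSY  # the "done -16" in the log_dump

# Dropping the UNUSED flag lets the removal proceed despite the lease:
assert rem_lockspace(held_leases=["ddeb2ca1-..."], unused=False) == "removed"
```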

Comment 18 Simone Tiraboschi 2014-11-27 11:11:27 UTC
Moving to VDSM team.

Comment 20 Elad 2014-12-07 14:44:37 UTC
Created attachment 965579 [details]
logs from RHEL6 reproduce

Deployment fails every time. Tried with RHEL 7.0 and RHEL 6.6.
The host was reprovisioned before each deployment and the LUNs were created right before it (they are clean).
Attaching logs from the RHEL 6 host.

rhev 3.5 vt13.1
libvirt-0.10.2-46.el6_6.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
vdsm-4.16.8.1-2.el6ev.x86_64
libselinux-utils-2.0.94-5.8.el6.x86_64
libselinux-2.0.94-5.8.el6.x86_64
selinux-policy-3.7.19-260.el6.noarch
selinux-policy-targeted-3.7.19-260.el6.noarch
libselinux-ruby-2.0.94-5.8.el6.x86_64
libselinux-python-2.0.94-5.8.el6.x86_64
sanlock-2.8-1.el6.x86_64

Thread-82::INFO::2014-12-07 16:38:26,071::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing host id for domain 0b0a80b7-539a-4d82-8e8e-8a25257c0b74 (id: 1)
Thread-82::DEBUG::2014-12-07 16:38:26,071::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain 0b0a80b7-539a-4d82-8e8e-8a25257c0b74
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('0b0a80b7-539a-4d82-8e8e-8a25257c0b74', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))

Comment 21 Allon Mureinik 2014-12-09 12:35:03 UTC
Elad, can we retry with selinux-policy-targeted-3.7.19-260.el6_6.1 please?
This is quite possibly an issue we already encountered.

Comment 22 Elad 2014-12-09 13:02:35 UTC
The bug reproduced every time using RHEL 7.0 with:

libselinux-2.2.2-6.el7.x86_64
selinux-policy-3.13.1-6.el7.noarch
libselinux-ruby-2.2.2-6.el7.x86_64
libselinux-python-2.2.2-6.el7.x86_64
libselinux-utils-2.2.2-6.el7.x86_64
selinux-policy-targeted-3.13.1-6.el7.noarch

Using RHEL 6.6, I was not able to reproduce using the latest policy:

libselinux-utils-2.0.94-5.8.el6.x86_64
selinux-policy-targeted-3.7.19-260.el6_6.1.noarch
libselinux-2.0.94-5.8.el6.x86_64
libselinux-ruby-2.0.94-5.8.el6.x86_64
selinux-policy-3.7.19-260.el6_6.1.noarch
libselinux-python-2.0.94-5.8.el6.x86_64

Comment 23 Nir Soffer 2014-12-09 13:45:59 UTC
Elad, can you reproduce this in permissive mode? If not, please attach /var/log/audit/audit.log*

This looks like bug 1160808 which was fixed on rhel 6.6.

Comment 24 Allon Mureinik 2014-12-09 14:43:06 UTC
(In reply to Nir Soffer from comment #23)
> This looks like bug 1160808 which was fixed on rhel 6.6.

For 6.6, this is indeed fixed, as Elad mentioned in comment 22:
> 
> Using RHEL6.6, I'm not manage to reproduce using the latest policy:
> 
> libselinux-utils-2.0.94-5.8.el6.x86_64
> selinux-policy-targeted-3.7.19-260.el6_6.1.noarch
> libselinux-2.0.94-5.8.el6.x86_64
> libselinux-ruby-2.0.94-5.8.el6.x86_64
> selinux-policy-3.7.19-260.el6_6.1.noarch
> libselinux-python-2.0.94-5.8.el6.x86_64


For 7.0, this DOES reproduce:
> The bug reproduced everytime using RHEL7.0 with:
> 
> libselinux-2.2.2-6.el7.x86_64
> selinux-policy-3.13.1-6.el7.noarch
> libselinux-ruby-2.2.2-6.el7.x86_64
> libselinux-python-2.2.2-6.el7.x86_64
> libselinux-utils-2.2.2-6.el7.x86_64
> selinux-policy-targeted-3.13.1-6.el7.noarch

Comment 25 Nir Soffer 2014-12-09 16:38:39 UTC
Looking in vdsm log, we can see that HostedEngine vm was started:

Thread-162::DEBUG::2014-11-20 17:06:10,172::BindingXMLRPC::1132::vds::(wrapper) client [127.0.0.1]::call vmCreate with ({'emulatedMachine': 'rhel6.5.0', 'vmId': '1ee80915-9835-46ae-86e6-9c57a40d7aab', 'devices':
 [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '0ecf3686-ac2f-469a-96d7-c0df22abda01', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'},
 'device': 'cdrom', 'shared': 'false', 'path': '', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'format': 'raw', 'optional': 'false', 'poolID': '00000000-0000-0000-0000-000000000000', 'volumeID': '1e0e7362
-985d-4910-b6d5-e56b46db4c04', 'imageID': '2c12b8fc-4ea4-4b24-b97f-85fd3d7b2240', 'specParams': {}, 'readonly': 'false', 'domainID': '57d4532d-179a-409d-baf4-deefa75d4acd', 'deviceId': '2c12b8fc-4ea4-4b24-b97f-8
5fd3d7b2240', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'device': 'scs
i', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3E:76:D5:D5', 'linkActive': 'true', 'network': 'rhevm', 'bootOrder': '1', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {
}, 'deviceId': '8024ca7a-5a57-4fed-b649-ef3648862858', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'cons
ole', 'specParams': {}, 'type': 'console', 'deviceId': '83951312-e41f-4b70-ab7b-39bbe798ef18', 'alias': 'console0'}], 'smp': '2', 'memSize': '4096', 'cpuType': 'Conroe', 'spiceSecureChannels': 'smain,sdisplay,si
nputs,scursor,splayback,srecord,ssmartcard,susbredir', 'vmName': 'HostedEngine', 'display': 'qxl'},) {}

This vm is holding a lease on the domain 57d4532d-179a-409d-baf4-deefa75d4acd

    <lease>
        <key>1e0e7362-985d-4910-b6d5-e56b46db4c04</key>
        <lockspace>57d4532d-179a-409d-baf4-deefa75d4acd</lockspace>
        <target offset="111149056" path="/dev/57d4532d-179a-409d-baf4-deefa75d4acd/leases"/>
    </lease>

Then local client ask vdsm to stop monitor this domain:

Thread-170::DEBUG::2014-11-20 17:11:01,935::BindingXMLRPC::318::vds::(wrapper) client [127.0.0.1]
Thread-170::INFO::2014-11-20 17:11:01,936::logUtils::44::dispatcher::(wrapper) Run and protect: stopMonitoringDomain(sdUUID='57d4532d-179a-409d-baf4-deefa75d4acd', options=None)

And this operation fails (as expected), because the VM is holding a lease on this domain:

Thread-170::INFO::2014-11-20 17:11:01,937::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 57d4532d-179a-409d-baf4-deefa75d4acd
Thread-71::DEBUG::2014-11-20 17:11:01,938::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 57d4532d-179a-409d-baf4-deefa75d4acd
Thread-71::INFO::2014-11-20 17:11:01,938::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing host id for domain 57d4532d-179a-409d-baf4-deefa75d4acd (id: 1)
Thread-71::DEBUG::2014-11-20 17:11:01,938::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain 57d4532d-179a-409d-baf4-deefa75d4acd
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('57d4532d-179a-409d-baf4-deefa75d4acd', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))

The domain thread does exit normally after such errors - releasing the host id
is the last thing a domain monitor thread does.

I don't see any other storage related error, so I think this bug should
move back to hosted engine developers.

Looking in ovirt-hosted-engine-setup-20141123104216-axscu4.log, this log
is not very useful since a lot of info is **FILTERED**, which makes it
impossible to match with the vdsm log. To make progress with this, an
unfiltered log is needed.

Comment 29 Sandro Bonazzola 2014-12-10 10:20:44 UTC
(In reply to Nir Soffer from comment #25)


> Thread-170::DEBUG::2014-11-20
> 17:11:01,935::BindingXMLRPC::318::vds::(wrapper) client [127.0.0.1]
> Thread-170::INFO::2014-11-20
> 17:11:01,936::logUtils::44::dispatcher::(wrapper) Run and protect:
> stopMonitoringDomain(sdUUID='57d4532d-179a-409d-baf4-deefa75d4acd',
> options=None)
> 
> And this operation fails (expected), because the the vm is holding a lease
> on this domain:

HE setup calls stopMonitoringDomain only at cleanup.

Let's focus on description of the problem:

[ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
[ ERROR ] Unable to add hosted_engine_1 to the manager
          Please shutdown the VM allowing the system to launch it as a monitored service.
          The system will wait until the VM is down.
[ ERROR ] Failed to execute stage 'Closing up': [Errno 111] Connection refused
[ INFO  ] Stage: Clean up
[ ERROR ] Failed to execute stage 'Clean up': [Errno 111] Connection refused
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20141123113619.conf'

stopMonitoringDomain is called only after 
[ INFO  ] Stage: Clean up

the issue here is why we have:
 [Errno 111] Connection refused

and why we have

[ ERROR ] The VDSM host was found in a failed state.

Comment 31 Nir Soffer 2014-12-10 13:42:42 UTC
(In reply to Sandro Bonazzola from comment #29)
> (In reply to Nir Soffer from comment #25)
> 
> 
> > Thread-170::DEBUG::2014-11-20
> > 17:11:01,935::BindingXMLRPC::318::vds::(wrapper) client [127.0.0.1]
> > Thread-170::INFO::2014-11-20
> > 17:11:01,936::logUtils::44::dispatcher::(wrapper) Run and protect:
> > stopMonitoringDomain(sdUUID='57d4532d-179a-409d-baf4-deefa75d4acd',
> > options=None)
> > 
> > And this operation fails (expected), because the the vm is holding a lease
> > on this domain:
> 
> HE setup calls stopMonitoringDomain only at cleanup.
> 
> Let's focus on description of the problem:
> 
> [ ERROR ] The VDSM host was found in a failed state. Please check engine and
> bootstrap installation logs.
> [ ERROR ] Unable to add hosted_engine_1 to the manager
>           Please shutdown the VM allowing the system to launch it as a
> monitored service.
>           The system will wait until the VM is down.
> [ ERROR ] Failed to execute stage 'Closing up': [Errno 111] Connection
> refused
> [ INFO  ] Stage: Clean up
> [ ERROR ] Failed to execute stage 'Clean up': [Errno 111] Connection refused
> [ INFO  ] Generating answer file
> '/var/lib/ovirt-hosted-engine-setup/answers/answers-20141123113619.conf'
> 
> stopMonitoringDomain is called only after 
> [ INFO  ] Stage: Clean up
> 
> the issue here is why we have:
>  [Errno 111] Connection refused

This is a good question, but not a storage issue.

You need to understand what hosted-engine setup was doing up to the point where vdsm stopped listening for connections, and how long vdsm was in this state.

Generally, vdsm should always be up unless you stop the service.

> 
> and why we have
> 
> [ ERROR ] The VDSM host was found in a failed state.

I don't know what "failed state" means.

Comment 32 Doron Fediuck 2014-12-10 14:07:08 UTC
Removing blocked bugs since it was re-targeted.

Comment 33 Yaniv Lavi 2015-03-02 08:28:49 UTC
In order to advance with this issue we need a reproducer to investigate.
Can you please provide the environment needed to debug this issue?

Comment 34 Elad 2015-03-03 08:19:48 UTC
Will answer offline

Comment 35 Simone Tiraboschi 2015-03-10 17:15:15 UTC
Reproduced on RHEL 7.1 with VT14

Thread-104::ERROR::2015-03-10 19:08:42,954::vm::2331::vm.Vm::(_startUnderlyingVm) vmId=`63d035b1-1e12-45f1-b9e0-8a4ee7c7e8e2`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2271, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 3335, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3424, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error: Unable to apply rule 'The name org.fedoraproject.FirewallD1 was not provided by any .service files'
Thread-104::DEBUG::2015-03-10 19:08:43,063::vm::2786::vm.Vm::(setDownStatus) vmId=`63d035b1-1e12-45f1-b9e0-8a4ee7c7e8e2`::Changed state to Down: internal error: Unable to apply rule 'The name org.fedoraproject.FirewallD1 was not provided by any .service files' (code=1)

Comment 36 Sandro Bonazzola 2015-03-11 09:33:33 UTC
Looks like a libvirt restart is needed between stopping firewalld and starting iptables.

Comment 38 Elad 2015-05-19 11:38:05 UTC
Cannot be tested due to https://bugzilla.redhat.com/show_bug.cgi?id=1215623

Comment 39 Elad 2015-08-02 13:51:54 UTC
Hosted-engine deployment succeeds, iptables service is activated.

Used:
ovirt-3.6.0-alpha3
ovirt-hosted-engine-ha-1.3.0-0.0.master.20150615153650.20150615153645.git5f8c290.el7.noarch
ovirt-hosted-engine-setup-1.3.0-0.0.master.20150729070044.git26149d7.el7.noarch
iptables-services-1.4.21-13.el7.x86_64
iptables-1.4.21-13.el7.x86_64
vdsm-4.17.0-1229.git8299061.el7.noarch

Comment 40 Shivakumara K. madegowda 2015-10-17 01:24:11 UTC
I faced this issue and observed that firewalld was inactive. Just to try, I started firewalld; the issue then disappeared and I was able to create a new virtual network.

I am very happy now!

Comment 41 Fabian Deutsch 2015-11-09 15:16:49 UTC
*** Bug 1278715 has been marked as a duplicate of this bug. ***

Comment 42 Nikolai Sednev 2015-12-22 13:21:53 UTC
Reproduced on latest components:
Host:
ovirt-hosted-engine-setup-1.3.2-0.0.master.20151215104504.git25b97ca.el7.centos.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-release36-snapshot-002-2.noarch
vdsm-4.17.13-35.gitcbc2303.el7.centos.noarch
mom-0.5.1-2.el7.noarch
ovirt-engine-sdk-python-3.6.1.1-0.1.20151127.git2400b22.el7.centos.noarch
ovirt-vmconsole-1.0.1-0.0.master.20151105234454.git3e5d52e.el7.noarch
ovirt-release36-002-2.noarch
ovirt-hosted-engine-ha-1.3.3.6-1.20151221120517.gita5d04b3.el7.noarch
ovirt-host-deploy-1.4.2-0.0.master.20151122153544.gitfc808fc.el7.noarch
ovirt-vmconsole-host-1.0.1-0.0.master.20151105234454.git3e5d52e.el7.noarch
ovirt-setup-lib-1.0.1-0.0.master.20151221171731.gitbf50a11.el7.centos.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.5.x86_64
libvirt-client-1.2.17-13.el7_2.2.x86_64
Linux version 3.10.0-327.4.4.el7.x86_64 (mockbuild@x86-019.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Dec 17 15:51:24 EST 2015

Engine:
rhevm-3.6.1.3-0.1.el6.noarch
ovirt-vmconsole-1.0.0-1.el6ev.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-vmconsole-proxy-1.0.0-1.el6ev.noarch
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
ovirt-engine-extension-aaa-jdbc-1.0.4-1.el6ev.noarch
Linux version 2.6.32-573.8.1.el6.x86_64 (mockbuild@x86-033.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Fri Sep 25 19:24:22 EDT 2015



[ INFO  ] Engine replied: DB Up!Welcome to Health Status!
[ INFO  ] Acquiring internal CA cert from the engine
[ INFO  ] The following CA certificate is going to be used, please immediately interrupt if not correct:
[ INFO  ] Issuer: C=US, O=qa.lab.tlv.redhat.com, CN=nsednev-he-1.qa.lab.tlv.redhat.com.31514, Subject: C=US, O=qa.lab.tlv.redhat.com, CN=nsednev-he-1.qa.lab.tlv.redhat.com.31514, Fingerprint (SHA-1): 8C7DA7D7948D9B8446A39DD3F86B105E83F38353
[ INFO  ] Connecting to the Engine
          Enter the name of the cluster to which you want to add the host (Default) [Default]: 
[ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
[ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
[ ERROR ] Unable to add hosted_engine_1 to the manager
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ ERROR ] Failed to execute stage 'Closing up': VDSM did not start within 120 seconds
[ INFO  ] Stage: Clean up
[ ERROR ] Failed to execute stage 'Clean up': [Errno 111] Connection refused
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20151222150453.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination

Comment 43 Nikolai Sednev 2015-12-22 13:23:55 UTC
Created attachment 1108639 [details]
logs from host and engine (sosreports) and configuration log from deployment of the HE

Comment 44 Simone Tiraboschi 2015-12-22 13:40:43 UTC
host-deploy failed deploying the host due to an RPM conflict.

For some reason, that host has qemu-kvm-tools-ev from the upstream oVirt repo and qemu-kvm-tools-rhev (note the -rhev instead of the -ev!) from the RHEV repo.

Can you please try to reproduce on a clean environment?


2015-12-22 15:02:50 INFO otopi.plugins.otopi.packagers.yumpackager yumpackager.info:95 Yum Status: Check Package Signatures
2015-12-22 15:02:50 INFO otopi.plugins.otopi.packagers.yumpackager yumpackager.info:95 Yum Status: Running Test Transaction
Running Transaction Check
2015-12-22 15:02:50 ERROR otopi.plugins.otopi.packagers.yumpackager yumpackager.error:100 Yum Test Transaction Errors:   file /usr/share/qemu-kvm/tscdeadline_latency.flat from install of qemu-kvm-tools-ev-10:2.3.0-31.el7_2.4.1.x86_64 conflicts with file from package qemu-kvm-tools-rhev-10:2.3.0-31.el7_2.5.x86_64

2015-12-22 15:02:50 DEBUG otopi.context context._executeMethod:156 method exception
Traceback (most recent call last):
  File "/tmp/ovirt-aiBYLewGtE/pythonlib/otopi/context.py", line 146, in _executeMethod
    method['method']()
  File "/tmp/ovirt-aiBYLewGtE/otopi-plugins/otopi/packagers/yumpackager.py", line 274, in _packages
    self._miniyum.processTransaction()
  File "/tmp/ovirt-aiBYLewGtE/pythonlib/otopi/miniyum.py", line 1054, in processTransaction
    rpmDisplay=self._RPMCallback(sink=self._sink)
  File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 6503, in processTransaction
    self._doTestTransaction(callback,display=rpmTestDisplay)
  File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 6600, in _doTestTransaction
    raise Errors.YumTestTransactionError, errstring
YumTestTransactionError: Test Transaction Errors:   file /usr/share/qemu-kvm/tscdeadline_latency.flat from install of qemu-kvm-tools-ev-10:2.3.0-31.el7_2.4.1.x86_64 conflicts with file from package qemu-kvm-tools-rhev-10:2.3.0-31.el7_2.5.x86_64

Comment 45 Simone Tiraboschi 2015-12-22 13:49:07 UTC
(In reply to Nikolai Sednev from comment #42)
> Reproduced on latest components:
> Host:
> ovirt-release36-snapshot-002-2.noarch

> Engine:
> rhevm-3.6.1.3-0.1.el6.noarch

OK, the issue is just here: you mixed upstream (on the host) and downstream components (on the engine VM).
Host-deploy from that rhevm will try to deploy the host using downstream components, and this will obviously generate a conflict.

3.6.2 will also solve this one by forcefully disabling the ovirt-host-deploy packager plugins while deploying the host where the engine VM runs (see https://bugzilla.redhat.com/show_bug.cgi?id=1236588 ).

For now please avoid mixing upstream and downstream components.

Comment 49 errata-xmlrpc 2016-03-09 19:06:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0375.html

