Bug 1269176 - Hosted engine deploy fails due to Libvirt cannot be started error
Status: CLOSED CURRENTRELEASE
Product: vdsm
Classification: oVirt
Component: Core
4.17.5
x86_64 Linux
Severity: urgent
: ovirt-3.6.0-rc3
: 4.17.8
Assigned To: Michal Skrivanek
Nikolai Sednev
virt
: AutomationBlocker, TestBlocker, Triaged
Depends On: 1271976 1275571
Blocks: Hosted_Engine_External_GlusterFS 1153278 1172511
Reported: 2015-10-06 10:08 EDT by GenadiC
Modified: 2016-02-10 14:24 EST
24 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-27 02:51:07 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.0+
rule-engine: blocker+
mgoldboi: planning_ack+
rule-engine: devel_ack+
mavital: testing_ack+


Attachments
messages and hosted-engine-setup logs (103.02 KB, application/zip)
2015-10-06 10:08 EDT, GenadiC
vdsm logs (654.00 KB, application/zip)
2015-10-07 04:11 EDT, GenadiC
logs from the host (black) (255.83 KB, application/x-gzip)
2015-10-26 11:39 EDT, Nikolai Sednev

Description GenadiC 2015-10-06 10:08:33 EDT
Created attachment 1080246 [details]
messages and hosted-engine-setup logs

Description of problem:
After a new install of RHEV-H from a USB stick, configuring a VLAN, and running Hosted Engine deploy, the installation failed because libvirt couldn't be started

Version: Red Hat Enterprise Virtualization Hypervisor release 7.1 (20150917.0.el7ev)
Steps to Reproduce:
1. Install RHEVH with USB stick
2. Run TUI and configure VLAN on it
3. Run Hosted Engine deploy

Actual results:
Installation fails as libvirt couldn't be started

Expected results:
Installation should continue 

Additional info:
Comment 1 Simone Tiraboschi 2015-10-06 10:19:33 EDT
The issue seems to be here:

2015-10-06 11:25:06 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:932 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stdout:

Checking configuration status...

libvirt is already configured for vdsm

Running configure...
checking certs..
Reconfiguration of libvirt is done.

2015-10-06 11:25:06 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:937 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stderr:
Error:  ServiceOperationError: _systemctlStart failed
Job for libvirtd.service failed. See 'systemctl status libvirtd.service' and 'journalctl -xn' for details.
 


2015-10-06 11:25:06 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/system/vdsmenv.py", line 148, in _late_setup
RuntimeError: Failed to reconfigure libvirt for VDSM
2015-10-06 11:25:06 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Environment setup': Failed to reconfigure libvirt for VDSM
Comment 2 Fabian Deutsch 2015-10-06 13:47:10 EDT
Genadi, please provide all the logs from a failed run (including vdsm log)
Comment 3 Sandro Bonazzola 2015-10-07 03:58:48 EDT
Dan, is this a regression on vdsm-tool?
Comment 4 GenadiC 2015-10-07 04:11 EDT
Created attachment 1080542 [details]
vdsm logs
Comment 6 Fabian Deutsch 2015-10-07 05:55:10 EDT
This is blocking our RHEV-H automation and the Engine appliance testing
Comment 8 Oved Ourfali 2015-10-07 06:08:34 EDT
(In reply to Fabian Deutsch from comment #6)
> This is blocking our RHEV-H automation and the Engine appliance testing

Does it reproduce 100% of the time?
Comment 9 Elad 2015-10-07 06:24:15 EDT
The bug also reproduces on RHEL (7.2).
Therefore, changing the product to vdsm.
Comment 10 Elad 2015-10-07 06:25:00 EDT
vdsm-4.17.8-1.el7ev.noarch
vdsm-xmlrpc-4.17.8-1.el7ev.noarch
vdsm-infra-4.17.8-1.el7ev.noarch
vdsm-jsonrpc-4.17.8-1.el7ev.noarch
vdsm-cli-4.17.8-1.el7ev.noarch
vdsm-yajsonrpc-4.17.8-1.el7ev.noarch
vdsm-python-4.17.8-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.0-1.el7ev.noarch
Comment 11 Dan Kenigsberg 2015-10-07 07:13:58 EDT
I'm not sure... What is the command-line output of `vdsm-tool configure --force`?

2015-10-06 11:25:06 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:937 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stderr:
Error:  ServiceOperationError: _systemctlStart failed
Comment 12 Simone Tiraboschi 2015-10-07 07:27:08 EDT
(In reply to Oved Ourfali from comment #8)
> (In reply to Fabian Deutsch from comment #6)
> > This is blocking our RHEV-H automation and the Engine appliance testing
> 
> Does it reproduce 100% of the time?

This morning on HE CI jobs it reproduced:
- 5 times over 5 jobs on 7.2 (100%).
- 0 times over 5 jobs on 7.1 (0%).
Comment 13 Michal Skrivanek 2015-10-07 07:39:41 EDT
looks like https://gerrit.ovirt.org/#/c/46772/
HE should use JSON-RPC...as it slowly become an issue as XML-RPC is getting less and less testing
Comment 14 Oved Ourfali 2015-10-07 07:44:26 EDT
(In reply to Michal Skrivanek from comment #13)
> looks like https://gerrit.ovirt.org/#/c/46772/
> HE should use JSON-RPC...as it slowly become an issue as XML-RPC is getting
> less and less testing

Sandro - Why is HE using xmlrpc? I suggest to open another bug to move to jsonrpc, and move this one to virt. 
Michal - what do you say?
Comment 15 Michal Skrivanek 2015-10-07 08:02:12 EDT
Sounds good to me. The actual bug is already fixed in vdsm
Comment 16 Sandro Bonazzola 2015-10-07 08:40:51 EDT
(In reply to Oved Ourfali from comment #14)
> (In reply to Michal Skrivanek from comment #13)
> > looks like https://gerrit.ovirt.org/#/c/46772/
> > HE should use JSON-RPC...as it slowly become an issue as XML-RPC is getting
> > less and less testing
> 
> Sandro - Why is HE using xmlrpc? I suggest to open another bug to move to
> jsonrpc, and move this one to virt. 

Oved whoever introduced JSON-RPC feature forgot about hosted engine. We discovered XMLRPC will disappear in 4.0 in a gerrit comment, so we already opened
Bug #1228641 - [RFE] Switch from XML-RPC to JSON-RPC API



> Michal - what do you say?
Comment 17 Yaniv Bronhaim 2015-10-07 09:00:53 EDT
Even if the patch you refer to is missing, I still see an issue with vdsm startup in this environment: it takes almost a minute for it to become responsive. It has been a long time since I saw such an issue; back then it was related to the VM recovery part. Now it seems related to network initialization. Piotr is looking into it right now.
Comment 18 Sandro Bonazzola 2015-10-07 09:03:38 EDT
(In reply to Sandro Bonazzola from comment #16)

> Oved whoever introduced JSON-RPC feature forgot about hosted engine. We
> discovered XMLRPC will disappear in 4.0 in a gerrit comment, so we already
> opened
> Bug #1228641 - [RFE] Switch from XML-RPC to JSON-RPC API

By the way, according to documentation, JsonRpc is still on development, far away to be ready for use by hosted engine.
http://www.ovirt.org/Features/Design/JsonRpc 
Revision as of 06:07, 10 June 2015 by Ybronhei

Still in design stage.
Comment 19 Oved Ourfali 2015-10-07 09:13:47 EDT
(In reply to Sandro Bonazzola from comment #18)
> (In reply to Sandro Bonazzola from comment #16)
> 
> > Oved whoever introduced JSON-RPC feature forgot about hosted engine. We
> > discovered XMLRPC will disappear in 4.0 in a gerrit comment, so we already
> > opened
> > Bug #1228641 - [RFE] Switch from XML-RPC to JSON-RPC API
> 

Perhaps. But we're changing all 3.6 cluster level hosts to work with jsonrpc, on the way to deprecating xmlrpc.

> By the way, according to documentation, JsonRpc is still on development, far
> away to be ready for use by hosted engine.
> http://www.ovirt.org/Features/Design/JsonRpc 
> Revision as of 06:07, 10 June 2015 by Ybronhei
> 
> Still in design stage.

The jsonrpc feature consists of many phases, so it is still in the works, also in 4.0. But the protocol is ready, and it is jsonrpc.
Comment 20 Elad 2015-10-07 09:19:17 EDT
On my setup the bug reproduces 100% on RHEL 7.2 while deploying a 3.6 hosted engine (even after reboot).

On Genadi's setup, with a 3.5 hosted engine on 7.1 RHEV-H, it didn't reproduce 100%; after a reboot it worked fine.

Yaniv, can you please provide the vdsm-tool configure output that Dan requested? (I'm not sure if it still has the original code)
Comment 21 Piotr Kliczewski 2015-10-07 09:32:14 EDT
Vdsm replies with "Recovering from crash or Initializing" for around 40 seconds.
Checking the logs I can see that from vdsm starting:

MainThread::INFO::2015-10-07 16:13:06,121::vdsm::157::vds::(run) (PID: 17709) I am the actual vdsm 4.17.8-1.el7ev green-vdsb.qa.lab.tlv.redhat.com (3.10.0-320.el7.x86_64)

to HMS being ready:

storageRefresh::DEBUG::2015-10-07 16:13:37,989::hsm::373::Storage.HSM::(storageRefresh) HSM is ready

the time is very similar. It seems to be not related to infra but it is actually storage issue.
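(Editorial aside: the gap Piotr describes can be checked mechanically from the two quoted log lines; this is an illustrative sketch using vdsm's `::`-separated log layout and timestamp format, not code from vdsm itself.)

```python
from datetime import datetime

# The two vdsm log lines quoted above (abbreviated).
start_line = ("MainThread::INFO::2015-10-07 16:13:06,121::vdsm::157::vds::(run) "
              "(PID: 17709) I am the actual vdsm 4.17.8-1.el7ev")
ready_line = ("storageRefresh::DEBUG::2015-10-07 16:13:37,989::hsm::373::"
              "Storage.HSM::(storageRefresh) HSM is ready")

def log_timestamp(line):
    """Extract the timestamp, which is the third '::'-separated field."""
    stamp = line.split("::")[2]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")

delta = log_timestamp(ready_line) - log_timestamp(start_line)
print("vdsm start -> HSM ready: %.1f seconds" % delta.total_seconds())  # 31.9 seconds
```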
Comment 22 Nir Soffer 2015-10-07 09:40:25 EDT
(In reply to Piotr Kliczewski from comment #21)
> Vdsm replies with "Recovering from crash or Initializing" for around 40
> seconds.
> Checking the logs I can see that from vdsm starting:
> 
> MainThread::INFO::2015-10-07 16:13:06,121::vdsm::157::vds::(run) (PID:
> 17709) I am the actual vdsm 4.17.8-1.el7ev green-vdsb.qa.lab.tlv.redhat.com
> (3.10.0-320.el7.x86_64)
> 
> to HMS being ready:
> 
> storageRefresh::DEBUG::2015-10-07
> 16:13:37,989::hsm::373::Storage.HSM::(storageRefresh) HSM is ready
> 
> the time is very similar. It seems to be not related to infra but it is
> actually storage issue.

Vdsm's behavior is expected; the caller cannot assume it will be ready in 30 seconds and needs to be ready to handle such timeouts.

Usually HSM is ready in a few seconds, but if a SCSI/FC scan or an LVM
operation gets stuck, vdsm may remain in this state for several minutes.

This should be improved when bug 880738 is resolved.
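(Editorial aside: the caller-side contract Nir describes, tolerating vdsm staying in "Recovering from crash or Initializing" for minutes, amounts to a poll-until-ready loop with an explicit deadline. A minimal sketch; `is_ready` stands in for whatever status probe the caller actually uses.)

```python
import time

def wait_until_ready(is_ready, timeout=300.0, interval=2.0):
    """Poll is_ready() until it returns True or the deadline passes.

    Returns True when the service became ready, False on timeout.
    The deadline belongs to the caller; vdsm promises nothing about
    how long storage initialization takes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return False

# Example: a fake probe that becomes ready on the third poll.
calls = {"n": 0}
def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_ready(fake_probe, timeout=10.0, interval=0.01))  # True
```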
Comment 23 Nir Soffer 2015-10-07 10:25:44 EDT
Looking at vdsm log on the host:

1. Vdsm starts:

MainThread::INFO::2015-10-07 16:18:09,389::vdsm::157::vds::(run) (PID: 18099) I am the actual vdsm 4.17.8-1.el7ev green-vdsb.qa.lab.tlv.redhat.com (3.10.0-320.el7.x86_64)

2. Starting scsi scan (can take up to 30 seconds)

storageRefresh::DEBUG::2015-10-07 16:18:09,477::iscsi::431::Storage.ISCSI::(rescan) Performing SCSI scan, this will take up to 30 seconds

3. Starting FC scan (can take up to 30 seconds)

storageRefresh::DEBUG::2015-10-07 16:18:09,557::hba::56::Storage.HBA::(rescan) Starting scan


4. Trying to connect to supervdsm (fc-scan is run as root)

storageRefresh::DEBUG::2015-10-07 16:18:09,567::supervdsm::76::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm

5. FC scan timed out

storageRefresh::ERROR::2015-10-07 16:18:39,669::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18151)

(We will look at /var/log/messages to understand why)

6. HSM is ready

storageRefresh::ERROR::2015-10-07 16:18:39,669::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18151)

So the issue was timeout scanning FC devices.

There is no bug in vdsm's storage init phase; the caller should expect
these timeouts.

Having said that, we should improve the startup behavior, since the SCSI and
FC scans and the LVM operations should not be needed while vdsm is not yet
connected to storage. We should delay any such access until the point vdsm is
asked to connect to a storage server (should be a 4.0 RFE).

Looking in the vdsm log:

# grep Storage.HBA /var/log/vdsm/vdsm.log
storageRefresh::DEBUG::2015-10-07 14:21:37,745::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 14:22:07,861::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=14101)
storageRefresh::DEBUG::2015-10-07 15:44:36,718::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::DEBUG::2015-10-07 15:45:09,391::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::DEBUG::2015-10-07 15:45:27,163::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 15:45:57,268::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=16297)
storageRefresh::DEBUG::2015-10-07 16:07:33,312::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:08:03,433::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=17158)
storageRefresh::DEBUG::2015-10-07 16:11:48,454::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:12:18,577::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=17498)
storageRefresh::DEBUG::2015-10-07 16:13:06,274::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:13:36,387::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=17760)
storageRefresh::DEBUG::2015-10-07 16:18:09,557::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:18:39,669::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18151)
storageRefresh::DEBUG::2015-10-07 16:20:37,725::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:21:07,842::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18444)
storageRefresh::DEBUG::2015-10-07 16:21:55,537::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:22:25,633::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18713)

It is clear that something in FC connection is bad in this host.

Running fc-scan manually show:

[root@green-vdsb ~]# /usr/libexec/vdsm/fc-scan -v
fc-scan: Scanning /sys/class/scsi_host/host0/scan
fc-scan: Scanning /sys/class/scsi_host/host4/scan
fc-scan: Scanned /sys/class/scsi_host/host0/scan: 0.00 seconds
fc-scan: Scanned /sys/class/scsi_host/host4/scan: 82.01 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/rescan: 0.01 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:1/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:1/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:2/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:2/rescan: 0.00 seconds

/sys/class/fc_host/host4 is not responding as expected.

Aharon, can you check for errors on the server side explaining this timeout?
Comment 24 Nir Soffer 2015-10-07 10:28:51 EDT
This seems like a server issue, not a vdsm bug.

If hosted engine setup fails because of a 30-second timeout, this is a hosted
engine setup bug; it should be ready for storage timeouts.
Comment 25 Simone Tiraboschi 2015-10-07 11:21:21 EDT
http://jenkins-ci.eng.lab.tlv.redhat.com/job/hosted_engine_3.6_el7_dynamicip_nfs3_install_on_latest/82/

It's also failing for the CI jobs testing the deployment of hosted-engine over NFS on RHEL 7.2.
Those machines are not using iSCSI and don't have FC HBAs.
The same code works on RHEL 7.1.

Maybe there is also an additional storage issue on that host, but the bug is systematically reproducible elsewhere.
Comment 26 Michal Skrivanek 2015-10-07 11:33:11 EDT
based on comment #24, do we have any other problem?
(comment #17) Does it actually fail, or does it just take longer, Yaniv?
If it works then this bug should remain in MODIFIED; if there are HE deployment timeouts failing the installation then this needs to be moved to integration and addressed for 3.6 (I suppose)
Comment 27 Nir Soffer 2015-10-07 11:38:30 EDT
Checking /var/log/messages while performing fc-scan:

# logger "*** starting fc-scan ***"; /usr/libexec/vdsm/fc-scan -v; logger "*** fc-scan finished ***"
fc-scan: Scanning /sys/class/scsi_host/host0/scan
fc-scan: Scanning /sys/class/scsi_host/host4/scan
fc-scan: Scanned /sys/class/scsi_host/host0/scan: 0.00 seconds
fc-scan: Scanned /sys/class/scsi_host/host4/scan: 82.01 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:1/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:1/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:2/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:2/rescan: 0.01 seconds


Oct  7 18:30:46 green-vdsb root: *** starting fc-scan ***
Oct  7 18:32:08 green-vdsb kernel: scsi 4:0:3:0: timing out command, waited 82s
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Sense Key : Illegal Request [current] 
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Add. Sense: Logical unit not supported
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Sense Key : Illegal Request [current] 
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Add. Sense: Logical unit not supported
Oct  7 18:32:08 green-vdsb root: *** fc-scan finished ***

Chris, what can cause this timeout (kernel: scsi 4:0:3:0: timing out command, waited 82s), and what is the meaning of errors above?
Comment 28 Simone Tiraboschi 2015-10-07 11:47:46 EDT
Maybe we have to split this into two distinct bugs:
the original issue, from which the bug title 'Hosted engine deploy fails due to Libvirt cannot be started error' comes.
It fully reproduces on RHEL 7.2 and all of our HE CI jobs are failing there.

Took this one on NFS3 as a reference (just to exclude any iSCSI or FC interferences):
http://jenkins-ci.eng.lab.tlv.redhat.com/job/hosted_engine_3.6_el7_dynamicip_nfs3_install_on_latest/82/

hosted-engine-setup calls '/bin/vdsm-tool configure --force' in blocking mode without any timeout.

but on RHEL 7.2 it's failing:

2015-10-07 06:08:51 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:936 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stdout:

Checking configuration status...

multipath requires configuration

2015-10-07 06:08:51 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:941 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stderr:
Traceback (most recent call last):
  File "/bin/vdsm-tool", line 219, in main
    return tool_command[cmd]["command"](*args)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/__init__.py", line 38, in wrapper
    func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 137, in configure
    if _should_configure(c, args.force)]
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 327, in _should_configure
    configured = _isconfigured(c)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 113, in _isconfigured
    return getattr(module, 'isconfigured', lambda: configurators.NO)()
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/passwd.py", line 43, in isconfigured
    _, out, _ = utils.execCmd(script)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 679, in execCmd
    deathSignal=deathSignal, childUmask=childUmask)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 51, in __init__
    stderr=PIPE)
  File "/usr/lib64/python2.7/subprocess.py", line 712, in __init__
    restore_sigpipe=restore_sigpipe)
TypeError: _execute_child_v276() got an unexpected keyword argument 'restore_sigpipe'


Under /var/log/vdsm/upgrade.log we found the corresponding error:

MainThread::ERROR::2015-10-07 06:08:47,330::upgrade::94::upgrade::(apply_upgrade) Failed to run upgrade-unified-persistence
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/tool/upgrade.py", line 92, in apply_upgrade
    upgrade.run(ns, args)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/unified_persistence.py", line 193, in run
    run()
  File "/usr/lib/python2.7/site-packages/vdsm/tool/unified_persistence.py", line 44, in run
    networks, bondings = _getNetInfo()
  File "/usr/lib/python2.7/site-packages/vdsm/tool/unified_persistence.py", line 131, in _getNetInfo
    networks = _processNetworks(netinfo)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/unified_persistence.py", line 53, in _processNetworks
    defaultGateway = getDefaultGateway()
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 319, in getDefaultGateway
    output = routeShowGateways('main')
  File "/usr/lib/python2.7/site-packages/vdsm/ipwrapper.py", line 537, in routeShowGateways
    return _execCmd(command)
  File "/usr/lib/python2.7/site-packages/vdsm/ipwrapper.py", line 526, in _execCmd
    returnCode, output, error = execCmd(command)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 679, in execCmd
    deathSignal=deathSignal, childUmask=childUmask)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 51, in __init__
    stderr=PIPE)
  File "/usr/lib64/python2.7/subprocess.py", line 712, in __init__
    restore_sigpipe=restore_sigpipe)
TypeError: _execute_child_v276() got an unexpected keyword argument 'restore_sigpipe'


The same code works on RHEL 7.1, so in my opinion this is a regression in vdsm-tool.
It's systematically happening with vdsm-4.17.8-12.gitbabda51.el7.noarch, so if it has been recently fixed the patch is still not there.
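(Editorial aside: comment 28 notes that hosted-engine-setup invokes vdsm-tool in blocking mode with no timeout. A modern Python 3 sketch of bounding such an external call with `subprocess.run`'s `timeout`; the original code ran on Python 2.7 via cpopen, where no such parameter existed, and the command below is a harmless stand-in for `('/bin/vdsm-tool', 'configure', '--force')`.)

```python
import subprocess

def run_with_timeout(cmd, timeout):
    """Run cmd, capturing output, but give up after `timeout` seconds
    instead of blocking forever like a bare blocking call would."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return None, "", "timed out after %s seconds" % timeout

# Harmless stand-in command; the real caller would pass the vdsm-tool argv.
rc, out, err = run_with_timeout(["echo", "configured"], timeout=5)
print(rc, out.strip())  # 0 configured
```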
Comment 29 Simone Tiraboschi 2015-10-07 11:56:12 EDT
It's probably just a duplicate of this one:
https://bugzilla.redhat.com/show_bug.cgi?id=1264667
Comment 30 Simone Tiraboschi 2015-10-07 12:23:15 EDT
The CI jobs are running with python-cpopen-1.3-4.el7.x86_64 which comes from EPEL:
https://bodhi.fedoraproject.org/updates/?packages=python-cpopen

So it's probably time to release an updated version with that fix.
Comment 31 Nir Soffer 2015-10-07 12:26:28 EDT
(In reply to Simone Tiraboschi from comment #28)
> TypeError: _execute_child_v276() got an unexpected keyword argument
> 'restore_sigpipe'

You are using an old cpopen: Python on 7.2 modified an internal API, which
broke cpopen. Please update cpopen to 1.4.
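(Editorial aside: the TypeError in the tracebacks arises because cpopen hooks subprocess internals and passes a keyword, `restore_sigpipe`, that the newer Python's private `_execute_child` no longer accepts. The general failure mode, calling a function with a keyword it does not take, can be detected up front by inspecting the signature. An illustration of the mechanism, not what cpopen actually does; the two `execute_child_*` functions below are dummies.)

```python
import inspect

def accepts_kwarg(func, name):
    """Return True if func can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD
               for p in params.values())

# Dummy stand-in for the older internal API, which took the keyword:
def execute_child_old(args, restore_sigpipe=True):
    return args

# Dummy stand-in for the newer internal API, which dropped it:
def execute_child_v276(args):
    return args

print(accepts_kwarg(execute_child_old, "restore_sigpipe"))   # True
print(accepts_kwarg(execute_child_v276, "restore_sigpipe"))  # False
```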
Comment 32 Oved Ourfali 2015-10-07 12:31:12 EDT
According to https://bugzilla.redhat.com/show_bug.cgi?id=1264667#c9 you can also just update to the latest python version in the CI env.
Comment 33 Simone Tiraboschi 2015-10-07 12:42:35 EDT
(In reply to Oved Ourfali from comment #32)
> According to https://bugzilla.redhat.com/show_bug.cgi?id=1264667#c9 you can
> also just update to the latest python version in the CI env.

Nice, but that CI job is already updating everything with what is available.
Comment 35 Chris Leech 2015-10-07 19:50:46 EDT
(In reply to Nir Soffer from comment #27)
> Checking /var/log/messages while performing fc-scan:
> 
> # logger "*** starting fc-scan ***"; /usr/libexec/vdsm/fc-scan -v; logger
> "*** fc-scan finished ***"
> fc-scan: Scanning /sys/class/scsi_host/host0/scan
> fc-scan: Scanning /sys/class/scsi_host/host4/scan
> fc-scan: Scanned /sys/class/scsi_host/host0/scan: 0.00 seconds
> fc-scan: Scanned /sys/class/scsi_host/host4/scan: 82.01 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:
> 0:0/4:0:0:0/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:
> 0:0/4:0:0:0/rescan: 0.00 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:
> 0:1/4:0:1:0/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:
> 0:1/4:0:1:0/rescan: 0.00 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:0/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:0/rescan: 0.00 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:1/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:1/rescan: 0.00 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:2/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:2/rescan: 0.01 seconds
> 
> 
> Oct  7 18:30:46 green-vdsb root: *** starting fc-scan ***
> Oct  7 18:32:08 green-vdsb kernel: scsi 4:0:3:0: timing out command, waited
> 82s
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Read Capacity(10)
> failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Sense Key : Illegal
> Request [current] 
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Add. Sense: Logical
> unit not supported
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Read Capacity(10)
> failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Sense Key : Illegal
> Request [current] 
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Add. Sense: Logical
> unit not supported
> Oct  7 18:32:08 green-vdsb root: *** fc-scan finished ***
> 
> Chris, what can cause this timeout (kernel: scsi 4:0:3:0: timing out
> command, waited 82s), and what is the meaning of errors above?

Was the target side storage reconfigured?  The "Logical unit not supported" response indicates that the LUs for sdb and sdc no longer exist on the target (or targets I guess, as there are multiple targets with one LU each it looks like).  So targets 0 and 1 are responding but each failing Read Capacity(10) on LU 0, and target 3 failed to respond at all.
Comment 36 Elad 2015-10-08 03:28:11 EDT
(In reply to Nir Soffer from comment #23)
> Aharon, can you for errors in the server side explaining this timeout?

No FC errors on storage server side, also, the devices are available for the host:

[root@green-vdsb scsi_host]# cat /proc/scsi/scsi
.
.
Host: scsi4 Channel: 00 Id: 02 Lun: 00
  Vendor: XtremIO  Model: XtremApp         Rev: 4010
  Type:   RAID                             ANSI  SCSI revision: 06
Host: scsi4 Channel: 00 Id: 02 Lun: 01
  Vendor: XtremIO  Model: XtremApp         Rev: 4010
  Type:   Direct-Access                    ANSI  SCSI revision: 06
Host: scsi4 Channel: 00 Id: 02 Lun: 02
  Vendor: XtremIO  Model: XtremApp         Rev: 4010
  Type:   Direct-Access                    ANSI  SCSI revision: 06
.
.


Also, the bug is 100% reproduced on hosts that are not connected to FC as stated on comment #25
Comment 37 Elad 2015-10-08 03:29:45 EDT
Re-setting the need-info flags for Yaniv
Comment 38 Michal Skrivanek 2015-10-08 03:34:05 EDT
(In reply to Elad from comment #36)
> Also, the bug is 100% reproduced on hosts that are not connected to FC as
> stated on comment #25

which bug? Again, please separate the discussion about possible storage scan-related issues and the original reported problem, and cpopen. The latter two are already addressed
Comment 39 Elad 2015-10-08 04:02:44 EDT
(In reply to Michal Skrivanek from comment #38)
> (In reply to Elad from comment #36)
> > Also, the bug is 100% reproduced on hosts that are not connected to FC as
> > stated on comment #25
> 
> which bug? Again, please separate the discussion about possible storage
> scan-related issues and the original reported problem, and cpopen. The
> latter two are already addressed

I'm talking about the hosted-engine failure
Comment 41 Nir Soffer 2015-10-08 07:14:04 EDT
(In reply to Elad from comment #36)
> (In reply to Nir Soffer from comment #23)
> > Aharon, can you for errors in the server side explaining this timeout?
> 
> No FC errors on storage server side, also, the devices are available for the
> host:
> 
> [root@green-vdsb scsi_host]# cat /proc/scsi/scsi
> .
> .
> Host: scsi4 Channel: 00 Id: 02 Lun: 00
>   Vendor: XtremIO  Model: XtremApp         Rev: 4010
>   Type:   RAID                             ANSI  SCSI revision: 06
> Host: scsi4 Channel: 00 Id: 02 Lun: 01
>   Vendor: XtremIO  Model: XtremApp         Rev: 4010
>   Type:   Direct-Access                    ANSI  SCSI revision: 06
> Host: scsi4 Channel: 00 Id: 02 Lun: 02
>   Vendor: XtremIO  Model: XtremApp         Rev: 4010
>   Type:   Direct-Access                    ANSI  SCSI revision: 06

Elad, please open a separate kernel bug for this issue. Clearly there
is something wrong here; a scan timing out after 82 seconds and the errors
mentioned in comment 35 are not OK.

Please use the command seen in comment 27 on host running 7.2 and 7.1
connected to the same storage server and attach the command output and
the relevant errors seen in /var/log/messages while the command was
running.
Comment 43 Nir Soffer 2015-10-08 07:38:18 EDT
This is not a storage bug, and the unrelated storage/kernel issue
will be tracked in another bug.
Comment 44 Nir Soffer 2015-10-08 07:39:09 EDT
Jeff, please check comment 42.
Comment 48 Elad 2015-10-08 10:50:24 EDT
> Elad, please open a separate bug kernel bug for this issue. Clearly there
> is something wrong here, scanning timeout after 82 seconds and errors
> mentioned in comment 35 are not ok.
> 
> Please use the command seen in comment 27 on host running 7.2 and 7.1
> connected to the same storage server and attach the command output and
> the relevant errors seen in /var/log/messages while the command was
> running.

Nir, on both 7.1 and 7.2 it behaves the same, the fc scan takes ~80 seconds using XtremIO storage server.

Tried with another storage server (EMC-VNX) and the scan takes less than a second.
Comment 50 Nir Soffer 2015-10-14 11:46:17 EDT
(In reply to Elad from comment #48)
> > Elad, please open a separate bug kernel bug for this issue. Clearly there
> > is something wrong here, scanning timeout after 82 seconds and errors
> > mentioned in comment 35 are not ok.
> > 
> > Please use the command seen in comment 27 on host running 7.2 and 7.1
> > connected to the same storage server and attach the command output and
> > the relevant errors seen in /var/log/messages while the command was
> > running.
> 
> Nir, on both 7.1 and 7.2 it behaves the same, the fc scan takes ~80 seconds
> using XtremIO storage server.
> 
> Tried with another storage server (EMC-VNX) and the scan takes less than a
> second.

So this is an issue with the storage server, not with 7.2 or 7.1.
Comment 51 Yaniv Lavi 2015-10-15 03:23:08 EDT
Moving out to 3.6.1, since it's dependent on the storage server.
Comment 52 Yaniv Lavi 2015-10-15 03:25:16 EDT
Moving back since this is MODIFIED.
Comment 53 Michal Skrivanek 2015-10-15 08:43:25 EDT
Note: this bug doesn't track the storage issue mentioned above.
Comment 54 Red Hat Bugzilla Rules Engine 2015-10-18 04:34:03 EDT
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 55 Yaniv Lavi 2015-10-20 07:22:56 EDT
Can you please check whether the dependent bugs are really needed to resolve this issue? If not, can you move them to "See Also"?
Comment 56 Michal Skrivanek 2015-10-20 18:12:58 EDT
They are not. This bug tracks only the regression in XML-RPC needed for hosted engine; that is fixed. Everything else is unrelated.
Comment 57 Nikolai Sednev 2015-10-26 11:36:37 EDT
On a clean installation of Red Hat Enterprise Virtualization Hypervisor release 7.2 (20151025.0.el7ev), I ran the deployment as shown below and it failed with the error:

[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]: 
          Configuration files: ['/tmp/tmpAM2ccM']
          Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20151026152224-k4x1eo.log
          Version: otopi-1.4.0 (otopi-1.4.0-1.el7ev)
[ INFO  ] Hardware supports virtualization
[ INFO  ] Stage: Environment packages setup
[ INFO  ] Stage: Programs detection
[ INFO  ] Stage: Environment setup
[ ERROR ] Failed to execute stage 'Environment setup': Failed to reconfigure libvirt for VDSM
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20151026152232.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
Something went wrong setting up hosted engine, or the setup process was cancelled.

Press any key to continue...

If I then follow these steps:
[root@black-vdsb ~]# vdsm-tool configure --force

Checking configuration status...

Current revision of multipath.conf detected, preserving
libvirt is already configured for vdsm
SUCCESS: ssl configured to true. No conflicts

Running configure...
checking certs..
Reconfiguration of libvirt is done.

Done configuring modules to VDSM.

I'm getting the same results:
[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]: 
          Configuration files: ['/tmp/tmpRaZH19']
          Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20151026153500-e9t177.log
          Version: otopi-1.4.0 (otopi-1.4.0-1.el7ev)
[ INFO  ] Hardware supports virtualization
[ INFO  ] Stage: Environment packages setup
[ INFO  ] Stage: Programs detection
[ INFO  ] Stage: Environment setup
[ ERROR ] Failed to execute stage 'Environment setup': Failed to start service 'vdsmd'
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20151026153512.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
Something went wrong setting up hosted engine, or the setup process was cancelled.

Press any key to continue...



Please see the attached logs.
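For anyone hitting the same "Failed to start service 'vdsmd'" error, a hedged troubleshooting sketch. The unit names libvirtd/vdsmd are taken from the messages above; everything else is generic systemd inspection and the script only prints what it finds:

```shell
#!/bin/sh
# Sketch: collect the usual evidence when hosted-engine setup reports
# "Failed to start service 'vdsmd'". The unit names come from the error
# messages in this bug; the systemctl/journalctl flags are standard.
inspect_deploy_services() {
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemd not available on this host"
        return 0
    fi
    for svc in libvirtd vdsmd; do
        echo "== $svc =="
        # Prints active/inactive/failed; never abort on a bad unit.
        systemctl is-active "$svc" || true
        # The last 20 journal lines usually show why the unit failed.
        journalctl -u "$svc" --no-pager -n 20 2>/dev/null || true
    done
    echo "inspection done"
}

inspect_deploy_services
```

The journal output for the failed unit is usually more useful than the setup log alone, and is worth attaching alongside the hosted-engine-setup logs.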
Comment 58 Nikolai Sednev 2015-10-26 11:39 EDT
Created attachment 1086545 [details]
logs from the host (black)
Comment 59 Red Hat Bugzilla Rules Engine 2015-10-26 11:41:28 EDT
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Comment 60 Nikolai Sednev 2015-10-27 05:35:07 EDT
I've created a separate bug 1275571 so as not to mix the two issues, and have marked this bug as depending on bug 1275571. This is because verification is currently blocked: deployment fails on a network bridge creation issue.
Comment 61 Red Hat Bugzilla Rules Engine 2015-11-02 07:26:54 EST
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 62 Nikolai Sednev 2015-11-15 13:04:41 EST
Works for me; I failed to reproduce these errors on Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151113.123.el7ev) during both iSCSI and NFS deployments. Installed components:

mom-0.5.1-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
libvirt-1.2.17-13.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64
vdsm-4.17.10.1-0.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-node-branding-rhev-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-3.el7ev.noarch
ovirt-setup-lib-1.0.0-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-node-lib-config-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-selinux-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-3.el7ev.noarch
ovirt-node-plugin-snmp-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-rhn-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-legacy-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-cim-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-1.4.1-0.0.master.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
Comment 63 Sandro Bonazzola 2015-11-27 02:51:07 EST
Since oVirt 3.6.0 has been released, moving from VERIFIED to CLOSED CURRENTRELEASE.