Bug 1269176 - Hosted engine deploy fails due to Libvirt cannot be started error
Status: CLOSED CURRENTRELEASE
Product: vdsm
Classification: oVirt
Component: Core
4.17.5
x86_64 Linux
Severity: urgent
: ovirt-3.6.0-rc3
: 4.17.8
Assigned To: Michal Skrivanek
Nikolai Sednev
virt
: AutomationBlocker, TestBlocker, Triaged
Depends On: 1271976 1275571
Blocks: Hosted_Engine_External_GlusterFS 1153278 1172511
Reported: 2015-10-06 10:08 EDT by GenadiC
Modified: 2016-02-10 14:24 EST
24 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-27 02:51:07 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.0+
rule-engine: blocker+
mgoldboi: planning_ack+
rule-engine: devel_ack+
mavital: testing_ack+


Attachments
messages and hosted-engine-setup logs (103.02 KB, application/zip)
2015-10-06 10:08 EDT, GenadiC
vdsm logs (654.00 KB, application/zip)
2015-10-07 04:11 EDT, GenadiC
logs from the host (black) (255.83 KB, application/x-gzip)
2015-10-26 11:39 EDT, Nikolai Sednev

Description GenadiC 2015-10-06 10:08:33 EDT
Created attachment 1080246 [details]
messages and hosted-engine-setup logs

Description of problem:
After a new install of RHEV-H from a USB stick, configuring a VLAN, and running Hosted Engine deploy, the installation failed because libvirt couldn't be started

Version: Red Hat Enterprise Virtualization Hypervisor release 7.1 (20150917.0.el7ev)
Steps to Reproduce:
1. Install RHEVH with USB stick
2. Run TUI and configure VLAN on it
3. Run Hosted Engine deploy

Actual results:
Installation fails as libvirt couldn't be started

Expected results:
Installation should continue 

Additional info:
Comment 1 Simone Tiraboschi 2015-10-06 10:19:33 EDT
The issue seems to be here:

2015-10-06 11:25:06 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:932 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stdout:

Checking configuration status...

libvirt is already configured for vdsm

Running configure...
checking certs..
Reconfiguration of libvirt is done.

2015-10-06 11:25:06 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:937 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stderr:
Error:  ServiceOperationError: _systemctlStart failed
Job for libvirtd.service failed. See 'systemctl status libvirtd.service' and 'journalctl -xn' for details.
 


2015-10-06 11:25:06 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/system/vdsmenv.py", line 148, in _late_setup
RuntimeError: Failed to reconfigure libvirt for VDSM
2015-10-06 11:25:06 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Environment setup': Failed to reconfigure libvirt for VDSM
Comment 2 Fabian Deutsch 2015-10-06 13:47:10 EDT
Genadi, please provide all the logs from a failed run (including vdsm log)
Comment 3 Sandro Bonazzola 2015-10-07 03:58:48 EDT
Dan, is this a regression on vdsm-tool?
Comment 4 GenadiC 2015-10-07 04:11 EDT
Created attachment 1080542 [details]
vdsm logs
Comment 6 Fabian Deutsch 2015-10-07 05:55:10 EDT
This is blocking our RHEV-H automation and the Engine appliance testing
Comment 8 Oved Ourfali 2015-10-07 06:08:34 EDT
(In reply to Fabian Deutsch from comment #6)
> This is blocking our RHEV-H automation and the Engine appliance testing

Does it reproduce 100% of the time?
Comment 9 Elad 2015-10-07 06:24:15 EDT
The bug also reproduces on RHEL (7.2).
Therefore, changing the product to vdsm.
Comment 10 Elad 2015-10-07 06:25:00 EDT
vdsm-4.17.8-1.el7ev.noarch
vdsm-xmlrpc-4.17.8-1.el7ev.noarch
vdsm-infra-4.17.8-1.el7ev.noarch
vdsm-jsonrpc-4.17.8-1.el7ev.noarch
vdsm-cli-4.17.8-1.el7ev.noarch
vdsm-yajsonrpc-4.17.8-1.el7ev.noarch
vdsm-python-4.17.8-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.0-1.el7ev.noarch
Comment 11 Dan Kenigsberg 2015-10-07 07:13:58 EDT
I'm not sure... What is the command-line output of `vdsm-tool configure --force`?

2015-10-06 11:25:06 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:937 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stderr:
Error:  ServiceOperationError: _systemctlStart failed
Comment 12 Simone Tiraboschi 2015-10-07 07:27:08 EDT
(In reply to Oved Ourfali from comment #8)
> (In reply to Fabian Deutsch from comment #6)
> > This is blocking our RHEV-H automation and the Engine appliance testing
> 
> Does it reproduce 100% of the time?

This morning on HE CI jobs it reproduced:
- 5 times over 5 jobs on 7.2 (100%).
- 0 times over 5 jobs on 7.1 (0%).
Comment 13 Michal Skrivanek 2015-10-07 07:39:41 EDT
looks like https://gerrit.ovirt.org/#/c/46772/
HE should use JSON-RPC...as it slowly become an issue as XML-RPC is getting less and less testing
Comment 14 Oved Ourfali 2015-10-07 07:44:26 EDT
(In reply to Michal Skrivanek from comment #13)
> looks like https://gerrit.ovirt.org/#/c/46772/
> HE should use JSON-RPC...as it slowly become an issue as XML-RPC is getting
> less and less testing

Sandro - Why is HE using xmlrpc? I suggest to open another bug to move to jsonrpc, and move this one to virt. 
Michal - what do you say?
Comment 15 Michal Skrivanek 2015-10-07 08:02:12 EDT
Sounds good to me. The actual bug is already fixed in vdsm
Comment 16 Sandro Bonazzola 2015-10-07 08:40:51 EDT
(In reply to Oved Ourfali from comment #14)
> (In reply to Michal Skrivanek from comment #13)
> > looks like https://gerrit.ovirt.org/#/c/46772/
> > HE should use JSON-RPC...as it slowly become an issue as XML-RPC is getting
> > less and less testing
> 
> Sandro - Why is HE using xmlrpc? I suggest to open another bug to move to
> jsonrpc, and move this one to virt. 

Oved whoever introduced JSON-RPC feature forgot about hosted engine. We discovered XMLRPC will disappear in 4.0 in a gerrit comment, so we already opened
Bug #1228641 - [RFE] Switch from XML-RPC to JSON-RPC API



> Michal - what do you say?
Comment 17 Yaniv Bronhaim 2015-10-07 09:00:53 EDT
Even if the patch you refer to is missing, I still see an issue with vdsm startup in this environment: it takes almost a minute for it to become responsive. It has been a long time since I saw such an issue; back then it was related to the VM recovery part. Now it seems related to network initialization. Piotr is looking into it right now.
Comment 18 Sandro Bonazzola 2015-10-07 09:03:38 EDT
(In reply to Sandro Bonazzola from comment #16)

> Oved whoever introduced JSON-RPC feature forgot about hosted engine. We
> discovered XMLRPC will disappear in 4.0 in a gerrit comment, so we already
> opened
> Bug #1228641 - [RFE] Switch from XML-RPC to JSON-RPC API

By the way, according to documentation, JsonRpc is still on development, far away to be ready for use by hosted engine.
http://www.ovirt.org/Features/Design/JsonRpc 
Revision as of 06:07, 10 June 2015 by Ybronhei

Still in design stage.
Comment 19 Oved Ourfali 2015-10-07 09:13:47 EDT
(In reply to Sandro Bonazzola from comment #18)
> (In reply to Sandro Bonazzola from comment #16)
> 
> > Oved whoever introduced JSON-RPC feature forgot about hosted engine. We
> > discovered XMLRPC will disappear in 4.0 in a gerrit comment, so we already
> > opened
> > Bug #1228641 - [RFE] Switch from XML-RPC to JSON-RPC API
> 

Perhaps. But we're changing all 3.6 cluster level hosts to work with jsonrpc, on the way to deprecating xmlrpc.

> By the way, according to documentation, JsonRpc is still on development, far
> away to be ready for use by hosted engine.
> http://www.ovirt.org/Features/Design/JsonRpc 
> Revision as of 06:07, 10 June 2015 by Ybronhei
> 
> Still in design stage.

The jsonrpc feature consists of many phases, so it is still in the works, also in 4.0. But the protocol is ready, and it is jsonrpc.
Comment 20 Elad 2015-10-07 09:19:17 EDT
On my setup the bug reproduces 100% on RHEL 7.2 while deploying a 3.6 hosted engine (even after reboot).

On Genadi's setup, with a 3.5 hosted engine on 7.1 RHEV-H, it didn't reproduce 100%; after a reboot it worked fine.

Yaniv, can you please provide the vdsm-tool configure output that Dan requested? (I'm not sure if it still has the original code)
Comment 21 Piotr Kliczewski 2015-10-07 09:32:14 EDT
Vdsm replies with "Recovering from crash or Initializing" for around 40 seconds.
Checking the logs I can see that from vdsm starting:

MainThread::INFO::2015-10-07 16:13:06,121::vdsm::157::vds::(run) (PID: 17709) I am the actual vdsm 4.17.8-1.el7ev green-vdsb.qa.lab.tlv.redhat.com (3.10.0-320.el7.x86_64)

to HMS being ready:

storageRefresh::DEBUG::2015-10-07 16:13:37,989::hsm::373::Storage.HSM::(storageRefresh) HSM is ready

the time is very similar. It seems to be not related to infra but it is actually storage issue.
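(Editorial aside: the gap Piotr describes can be checked mechanically from the two quoted log lines; this is an illustrative sketch using vdsm's `::`-separated log layout and timestamp format, not code from vdsm itself.)

```python
from datetime import datetime

# The two vdsm log lines quoted above (abbreviated).
start_line = ("MainThread::INFO::2015-10-07 16:13:06,121::vdsm::157::vds::(run) "
              "(PID: 17709) I am the actual vdsm 4.17.8-1.el7ev")
ready_line = ("storageRefresh::DEBUG::2015-10-07 16:13:37,989::hsm::373::"
              "Storage.HSM::(storageRefresh) HSM is ready")

def log_timestamp(line):
    """Extract the timestamp, which is the third '::'-separated field."""
    stamp = line.split("::")[2]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")

delta = log_timestamp(ready_line) - log_timestamp(start_line)
print("vdsm start -> HSM ready: %.1f seconds" % delta.total_seconds())  # 31.9 seconds
```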
Comment 22 Nir Soffer 2015-10-07 09:40:25 EDT
(In reply to Piotr Kliczewski from comment #21)
> Vdsm replies with "Recovering from crash or Initializing" for around 40
> seconds.
> Checking the logs I can see that from vdsm starting:
> 
> MainThread::INFO::2015-10-07 16:13:06,121::vdsm::157::vds::(run) (PID:
> 17709) I am the actual vdsm 4.17.8-1.el7ev green-vdsb.qa.lab.tlv.redhat.com
> (3.10.0-320.el7.x86_64)
> 
> to HMS being ready:
> 
> storageRefresh::DEBUG::2015-10-07
> 16:13:37,989::hsm::373::Storage.HSM::(storageRefresh) HSM is ready
> 
> the time is very similar. It seems to be not related to infra but it is
> actually storage issue.

Vdsm's behavior is expected; the caller cannot assume it will be ready in 30 seconds and needs to be ready to handle such timeouts.

Usually HSM is ready in a few seconds, but if a SCSI/FC scan or an LVM
operation gets stuck, vdsm may remain in this state for several minutes.

This should be improved when bug 880738 is resolved.
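(Editorial aside: the caller-side contract Nir describes, tolerating vdsm staying in "Recovering from crash or Initializing" for minutes, amounts to a poll-until-ready loop with an explicit deadline. A minimal sketch; `is_ready` stands in for whatever status probe the caller actually uses.)

```python
import time

def wait_until_ready(is_ready, timeout=300.0, interval=2.0):
    """Poll is_ready() until it returns True or the deadline passes.

    Returns True when the service became ready, False on timeout.
    The deadline belongs to the caller; vdsm promises nothing about
    how long storage initialization takes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return False

# Example: a fake probe that becomes ready on the third poll.
calls = {"n": 0}
def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_ready(fake_probe, timeout=10.0, interval=0.01))  # True
```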
Comment 23 Nir Soffer 2015-10-07 10:25:44 EDT
Looking at vdsm log on the host:

1. Vdsm starts:

MainThread::INFO::2015-10-07 16:18:09,389::vdsm::157::vds::(run) (PID: 18099) I am the actual vdsm 4.17.8-1.el7ev green-vdsb.qa.lab.tlv.redhat.com (3.10.0-320.el7.x86_64)

2. Starting scsi scan (can take up to 30 seconds)

storageRefresh::DEBUG::2015-10-07 16:18:09,477::iscsi::431::Storage.ISCSI::(rescan) Performing SCSI scan, this will take up to 30 seconds

3. Starting FC scan (can take up to 30 seconds)

storageRefresh::DEBUG::2015-10-07 16:18:09,557::hba::56::Storage.HBA::(rescan) Starting scan


4. Trying to connect to supervdsm (fc-scan is run as root)

storageRefresh::DEBUG::2015-10-07 16:18:09,567::supervdsm::76::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm

5. FC scan timed out

storageRefresh::ERROR::2015-10-07 16:18:39,669::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18151)

(We will look at /var/log/messages to understand why)

6. HSM is ready

storageRefresh::ERROR::2015-10-07 16:18:39,669::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18151)

So the issue was timeout scanning FC devices.

There is no bug in vdsm's storage init phase; the caller should expect
these timeouts.

Having said that, we should improve the startup behavior, since the SCSI and
FC scans and the LVM operations should not be needed while vdsm is not yet
connected to storage. We should delay any such access until the point vdsm is
asked to connect to a storage server (should be a 4.0 RFE).

Looking in the vdsm log:

# grep Storage.HBA /var/log/vdsm/vdsm.log
storageRefresh::DEBUG::2015-10-07 14:21:37,745::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 14:22:07,861::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=14101)
storageRefresh::DEBUG::2015-10-07 15:44:36,718::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::DEBUG::2015-10-07 15:45:09,391::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::DEBUG::2015-10-07 15:45:27,163::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 15:45:57,268::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=16297)
storageRefresh::DEBUG::2015-10-07 16:07:33,312::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:08:03,433::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=17158)
storageRefresh::DEBUG::2015-10-07 16:11:48,454::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:12:18,577::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=17498)
storageRefresh::DEBUG::2015-10-07 16:13:06,274::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:13:36,387::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=17760)
storageRefresh::DEBUG::2015-10-07 16:18:09,557::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:18:39,669::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18151)
storageRefresh::DEBUG::2015-10-07 16:20:37,725::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:21:07,842::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18444)
storageRefresh::DEBUG::2015-10-07 16:21:55,537::hba::56::Storage.HBA::(rescan) Starting scan
storageRefresh::ERROR::2015-10-07 16:22:25,633::hba::60::Storage.HBA::(rescan) Scan failed: Timeout scanning (pid=18713)

It is clear that something in FC connection is bad in this host.

Running fc-scan manually show:

[root@green-vdsb ~]# /usr/libexec/vdsm/fc-scan -v
fc-scan: Scanning /sys/class/scsi_host/host0/scan
fc-scan: Scanning /sys/class/scsi_host/host4/scan
fc-scan: Scanned /sys/class/scsi_host/host0/scan: 0.00 seconds
fc-scan: Scanned /sys/class/scsi_host/host4/scan: 82.01 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/rescan: 0.01 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:1/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:1/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:2/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:2/rescan: 0.00 seconds

/sys/class/fc_host/host4 is not responding as expected.

Aharon, can you check for errors on the server side explaining this timeout?
Comment 24 Nir Soffer 2015-10-07 10:28:51 EDT
This seems like a server issue, not a vdsm bug.

If hosted engine setup fails because of a 30-second timeout, this is a hosted
engine setup bug; it should be ready for storage timeouts.
Comment 25 Simone Tiraboschi 2015-10-07 11:21:21 EDT
http://jenkins-ci.eng.lab.tlv.redhat.com/job/hosted_engine_3.6_el7_dynamicip_nfs3_install_on_latest/82/

It's also failing for the CI jobs testing the deployment of hosted-engine over NFS on RHEL 7.2.
Those machines are not using iSCSI and don't have FC HBAs.
The same code works on RHEL 7.1.

Maybe there is also an additional storage issue on that host, but the bug is systematically reproducible elsewhere.
Comment 26 Michal Skrivanek 2015-10-07 11:33:11 EDT
based on comment #24, do we have any other problem?
(comment #17) Does it actually fail, or does it just take longer, Yaniv?
If it works then this bug should remain in MODIFIED; if there are HE deployment timeouts failing the installation then this needs to be moved to integration and addressed for 3.6 (I suppose)
Comment 27 Nir Soffer 2015-10-07 11:38:30 EDT
Checking /var/log/messages while performing fc-scan:

# logger "*** starting fc-scan ***"; /usr/libexec/vdsm/fc-scan -v; logger "*** fc-scan finished ***"
fc-scan: Scanning /sys/class/scsi_host/host0/scan
fc-scan: Scanning /sys/class/scsi_host/host4/scan
fc-scan: Scanned /sys/class/scsi_host/host0/scan: 0.00 seconds
fc-scan: Scanned /sys/class/scsi_host/host4/scan: 82.01 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:0/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:0/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:1/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:1/rescan: 0.00 seconds
fc-scan: Rescanning device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:2/rescan
fc-scan: Rescanned device /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:0:2/4:0:2:2/rescan: 0.01 seconds


Oct  7 18:30:46 green-vdsb root: *** starting fc-scan ***
Oct  7 18:32:08 green-vdsb kernel: scsi 4:0:3:0: timing out command, waited 82s
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Sense Key : Illegal Request [current] 
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Add. Sense: Logical unit not supported
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Sense Key : Illegal Request [current] 
Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Add. Sense: Logical unit not supported
Oct  7 18:32:08 green-vdsb root: *** fc-scan finished ***

Chris, what can cause this timeout (kernel: scsi 4:0:3:0: timing out command, waited 82s), and what is the meaning of errors above?
Comment 28 Simone Tiraboschi 2015-10-07 11:47:46 EDT
Maybe we have to split this into two distinct bugs:
the original issue, from which the bug title 'Hosted engine deploy fails due to Libvirt cannot be started error' comes.
It fully reproduces on RHEL 7.2 and all of our HE CI jobs are failing there.

Took this one on NFS3 as a reference (just to exclude any iSCSI or FC interferences):
http://jenkins-ci.eng.lab.tlv.redhat.com/job/hosted_engine_3.6_el7_dynamicip_nfs3_install_on_latest/82/

hosted-engine-setup calls '/bin/vdsm-tool configure --force' in blocking mode without any timeout.

but on RHEL 7.2 it's failing:

2015-10-07 06:08:51 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:936 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stdout:

Checking configuration status...

multipath requires configuration

2015-10-07 06:08:51 DEBUG otopi.plugins.ovirt_hosted_engine_setup.system.vdsmenv plugin.execute:941 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stderr:
Traceback (most recent call last):
  File "/bin/vdsm-tool", line 219, in main
    return tool_command[cmd]["command"](*args)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/__init__.py", line 38, in wrapper
    func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 137, in configure
    if _should_configure(c, args.force)]
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 327, in _should_configure
    configured = _isconfigured(c)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 113, in _isconfigured
    return getattr(module, 'isconfigured', lambda: configurators.NO)()
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/passwd.py", line 43, in isconfigured
    _, out, _ = utils.execCmd(script)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 679, in execCmd
    deathSignal=deathSignal, childUmask=childUmask)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 51, in __init__
    stderr=PIPE)
  File "/usr/lib64/python2.7/subprocess.py", line 712, in __init__
    restore_sigpipe=restore_sigpipe)
TypeError: _execute_child_v276() got an unexpected keyword argument 'restore_sigpipe'


Under /var/log/vdsm/upgrade.log we found the corresponding error:

MainThread::ERROR::2015-10-07 06:08:47,330::upgrade::94::upgrade::(apply_upgrade) Failed to run upgrade-unified-persistence
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/tool/upgrade.py", line 92, in apply_upgrade
    upgrade.run(ns, args)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/unified_persistence.py", line 193, in run
    run()
  File "/usr/lib/python2.7/site-packages/vdsm/tool/unified_persistence.py", line 44, in run
    networks, bondings = _getNetInfo()
  File "/usr/lib/python2.7/site-packages/vdsm/tool/unified_persistence.py", line 131, in _getNetInfo
    networks = _processNetworks(netinfo)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/unified_persistence.py", line 53, in _processNetworks
    defaultGateway = getDefaultGateway()
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 319, in getDefaultGateway
    output = routeShowGateways('main')
  File "/usr/lib/python2.7/site-packages/vdsm/ipwrapper.py", line 537, in routeShowGateways
    return _execCmd(command)
  File "/usr/lib/python2.7/site-packages/vdsm/ipwrapper.py", line 526, in _execCmd
    returnCode, output, error = execCmd(command)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 679, in execCmd
    deathSignal=deathSignal, childUmask=childUmask)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 51, in __init__
    stderr=PIPE)
  File "/usr/lib64/python2.7/subprocess.py", line 712, in __init__
    restore_sigpipe=restore_sigpipe)
TypeError: _execute_child_v276() got an unexpected keyword argument 'restore_sigpipe'


The same code works on RHEL 7.1, so in my opinion this is a regression in vdsm-tool.
It's systematically happening with vdsm-4.17.8-12.gitbabda51.el7.noarch, so if it has been recently fixed the patch is still not there.
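(Editorial aside: comment 28 notes that hosted-engine-setup invokes vdsm-tool in blocking mode with no timeout. A modern Python 3 sketch of bounding such an external call with `subprocess.run`'s `timeout`; the original code ran on Python 2.7 via cpopen, where no such parameter existed, and the command below is a harmless stand-in for `('/bin/vdsm-tool', 'configure', '--force')`.)

```python
import subprocess

def run_with_timeout(cmd, timeout):
    """Run cmd, capturing output, but give up after `timeout` seconds
    instead of blocking forever like a bare blocking call would."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return None, "", "timed out after %s seconds" % timeout

# Harmless stand-in command; the real caller would pass the vdsm-tool argv.
rc, out, err = run_with_timeout(["echo", "configured"], timeout=5)
print(rc, out.strip())  # 0 configured
```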
Comment 29 Simone Tiraboschi 2015-10-07 11:56:12 EDT
It's probably just a duplicate of this one:
https://bugzilla.redhat.com/show_bug.cgi?id=1264667
Comment 30 Simone Tiraboschi 2015-10-07 12:23:15 EDT
The CI jobs are running with python-cpopen-1.3-4.el7.x86_64 which comes from EPEL:
https://bodhi.fedoraproject.org/updates/?packages=python-cpopen

So it's probably time to release an updated version with that fix.
Comment 31 Nir Soffer 2015-10-07 12:26:28 EDT
(In reply to Simone Tiraboschi from comment #28)
> TypeError: _execute_child_v276() got an unexpected keyword argument
> 'restore_sigpipe'

You are using an old cpopen: Python on 7.2 modified an internal API, which
broke cpopen. Please update cpopen to 1.4.
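(Editorial aside: the TypeError in the tracebacks arises because cpopen hooks subprocess internals and passes a keyword, `restore_sigpipe`, that the newer Python's private `_execute_child` no longer accepts. The general failure mode, calling a function with a keyword it does not take, can be detected up front by inspecting the signature. An illustration of the mechanism, not what cpopen actually does; the two `execute_child_*` functions below are dummies.)

```python
import inspect

def accepts_kwarg(func, name):
    """Return True if func can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD
               for p in params.values())

# Dummy stand-in for the older internal API, which took the keyword:
def execute_child_old(args, restore_sigpipe=True):
    return args

# Dummy stand-in for the newer internal API, which dropped it:
def execute_child_v276(args):
    return args

print(accepts_kwarg(execute_child_old, "restore_sigpipe"))   # True
print(accepts_kwarg(execute_child_v276, "restore_sigpipe"))  # False
```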
Comment 32 Oved Ourfali 2015-10-07 12:31:12 EDT
According to https://bugzilla.redhat.com/show_bug.cgi?id=1264667#c9 you can also just update to the latest python version in the CI env.
Comment 33 Simone Tiraboschi 2015-10-07 12:42:35 EDT
(In reply to Oved Ourfali from comment #32)
> According to https://bugzilla.redhat.com/show_bug.cgi?id=1264667#c9 you can
> also just update to the latest python version in the CI env.

Nice, but that CI job is already updating everything with what is available.
Comment 35 Chris Leech 2015-10-07 19:50:46 EDT
(In reply to Nir Soffer from comment #27)
> Checking /var/log/messages while performing fc-scan:
> 
> # logger "*** starting fc-scan ***"; /usr/libexec/vdsm/fc-scan -v; logger
> "*** fc-scan finished ***"
> fc-scan: Scanning /sys/class/scsi_host/host0/scan
> fc-scan: Scanning /sys/class/scsi_host/host4/scan
> fc-scan: Scanned /sys/class/scsi_host/host0/scan: 0.00 seconds
> fc-scan: Scanned /sys/class/scsi_host/host4/scan: 82.01 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:
> 0:0/4:0:0:0/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-0/target4:
> 0:0/4:0:0:0/rescan: 0.00 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:
> 0:1/4:0:1:0/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-1/target4:
> 0:1/4:0:1:0/rescan: 0.00 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:0/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:0/rescan: 0.00 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:1/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:1/rescan: 0.00 seconds
> fc-scan: Rescanning device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:2/rescan
> fc-scan: Rescanned device
> /sys/devices/pci0000:00/0000:00:04.0/0000:07:00.1/host4/rport-4:0-2/target4:
> 0:2/4:0:2:2/rescan: 0.01 seconds
> 
> 
> Oct  7 18:30:46 green-vdsb root: *** starting fc-scan ***
> Oct  7 18:32:08 green-vdsb kernel: scsi 4:0:3:0: timing out command, waited
> 82s
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Read Capacity(10)
> failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Sense Key : Illegal
> Request [current] 
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:0:0: [sdb] Add. Sense: Logical
> unit not supported
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Read Capacity(10)
> failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Sense Key : Illegal
> Request [current] 
> Oct  7 18:32:08 green-vdsb kernel: sd 4:0:1:0: [sdc] Add. Sense: Logical
> unit not supported
> Oct  7 18:32:08 green-vdsb root: *** fc-scan finished ***
> 
> Chris, what can cause this timeout (kernel: scsi 4:0:3:0: timing out
> command, waited 82s), and what is the meaning of errors above?

Was the target side storage reconfigured?  The "Logical unit not supported" response indicates that the LUs for sdb and sdc no longer exist on the target (or targets I guess, as there are multiple targets with one LU each it looks like).  So targets 0 and 1 are responding but each failing Read Capacity(10) on LU 0, and target 3 failed to respond at all.
Comment 36 Elad 2015-10-08 03:28:11 EDT
(In reply to Nir Soffer from comment #23)
> Aharon, can you for errors in the server side explaining this timeout?

No FC errors on storage server side, also, the devices are available for the host:

[root@green-vdsb scsi_host]# cat /proc/scsi/scsi
.
.
Host: scsi4 Channel: 00 Id: 02 Lun: 00
  Vendor: XtremIO  Model: XtremApp         Rev: 4010
  Type:   RAID                             ANSI  SCSI revision: 06
Host: scsi4 Channel: 00 Id: 02 Lun: 01
  Vendor: XtremIO  Model: XtremApp         Rev: 4010
  Type:   Direct-Access                    ANSI  SCSI revision: 06
Host: scsi4 Channel: 00 Id: 02 Lun: 02
  Vendor: XtremIO  Model: XtremApp         Rev: 4010
  Type:   Direct-Access                    ANSI  SCSI revision: 06
.
.


Also, the bug is 100% reproduced on hosts that are not connected to FC as stated on comment #25
Comment 37 Elad 2015-10-08 03:29:45 EDT
Re-setting the need-info flags for Yaniv
Comment 38 Michal Skrivanek 2015-10-08 03:34:05 EDT
(In reply to Elad from comment #36)
> Also, the bug is 100% reproduced on hosts that are not connected to FC as
> stated on comment #25

which bug? Again, please separate the discussion about possible storage scan-related issues and the original reported problem, and cpopen. The latter two are already addressed
Comment 39 Elad 2015-10-08 04:02:44 EDT
(In reply to Michal Skrivanek from comment #38)
> (In reply to Elad from comment #36)
> > Also, the bug is 100% reproduced on hosts that are not connected to FC as
> > stated on comment #25
> 
> which bug? Again, please separate the discussion about possible storage
> scan-related issues and the original reported problem, and cpopen. The
> latter two are already addressed

I'm talking about the hosted-engine failure
Comment 41 Nir Soffer 2015-10-08 07:14:04 EDT
(In reply to Elad from comment #36)
> (In reply to Nir Soffer from comment #23)
> > Aharon, can you for errors in the server side explaining this timeout?
> 
> No FC errors on storage server side, also, the devices are available for the
> host:
> 
> [root@green-vdsb scsi_host]# cat /proc/scsi/scsi
> .
> .
> Host: scsi4 Channel: 00 Id: 02 Lun: 00
>   Vendor: XtremIO  Model: XtremApp         Rev: 4010
>   Type:   RAID                             ANSI  SCSI revision: 06
> Host: scsi4 Channel: 00 Id: 02 Lun: 01
>   Vendor: XtremIO  Model: XtremApp         Rev: 4010
>   Type:   Direct-Access                    ANSI  SCSI revision: 06
> Host: scsi4 Channel: 00 Id: 02 Lun: 02
>   Vendor: XtremIO  Model: XtremApp         Rev: 4010
>   Type:   Direct-Access                    ANSI  SCSI revision: 06

Elad, please open a separate kernel bug for this issue. Clearly there
is something wrong here; a scan timing out after 82 seconds and the errors
mentioned in comment 35 are not OK.

Please use the command seen in comment 27 on host running 7.2 and 7.1
connected to the same storage server and attach the command output and
the relevant errors seen in /var/log/messages while the command was
running.
Comment 43 Nir Soffer 2015-10-08 07:38:18 EDT
This is not a storage bug, and the unrelated storage/kernel issue
will be tracked in another bug.
Comment 44 Nir Soffer 2015-10-08 07:39:09 EDT
Jeff, please check comment 42.
Comment 48 Elad 2015-10-08 10:50:24 EDT
> Elad, please open a separate bug kernel bug for this issue. Clearly there
> is something wrong here, scanning timeout after 82 seconds and errors
> mentioned in comment 35 are not ok.
> 
> Please use the command seen in comment 27 on host running 7.2 and 7.1
> connected to the same storage server and attach the command output and
> the relevant errors seen in /var/log/messages while the command was
> running.

Nir, on both 7.1 and 7.2 it behaves the same, the fc scan takes ~80 seconds using XtremIO storage server.

Tried with another storage server (EMC-VNX) and the scan takes less than a second.
Comment 50 Nir Soffer 2015-10-14 11:46:17 EDT
(In reply to Elad from comment #48)
> > Elad, please open a separate bug kernel bug for this issue. Clearly there
> > is something wrong here, scanning timeout after 82 seconds and errors
> > mentioned in comment 35 are not ok.
> > 
> > Please use the command seen in comment 27 on host running 7.2 and 7.1
> > connected to the same storage server and attach the command output and
> > the relevant errors seen in /var/log/messages while the command was
> > running.
> 
> Nir, on both 7.1 and 7.2 it behaves the same, the fc scan takes ~80 seconds
> using XtremIO storage server.
> 
> Tried with another storage server (EMC-VNX) and the scan takes less than a
> second.

So this is an issue with the storage server, not with 7.2 or 7.1.
Comment 51 Yaniv Lavi 2015-10-15 03:23:08 EDT
Moving out to 3.6.1, since it's dependent on the storage server.
Comment 52 Yaniv Lavi 2015-10-15 03:25:16 EDT
Moving back since this is MODIFIED.
Comment 53 Michal Skrivanek 2015-10-15 08:43:25 EDT
Note: this bug doesn't track the storage issue mentioned above.
Comment 54 Red Hat Bugzilla Rules Engine 2015-10-18 04:34:03 EDT
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 55 Yaniv Lavi 2015-10-20 07:22:56 EDT
Can you please check whether the dependent bugs are really needed to resolve this issue? If not, can you move them to "See Also"?
Comment 56 Michal Skrivanek 2015-10-20 18:12:58 EDT
They are not. This bug tracks only the regression in XML-RPC needed for hosted engine; that is fixed. Everything else is unrelated.
Comment 57 Nikolai Sednev 2015-10-26 11:36:37 EDT
On a clean installation of Red Hat Enterprise Virtualization Hypervisor release 7.2 (20151025.0.el7ev), I ran the deployment as shown below and it failed with the error:

[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]: 
          Configuration files: ['/tmp/tmpAM2ccM']
          Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20151026152224-k4x1eo.log
          Version: otopi-1.4.0 (otopi-1.4.0-1.el7ev)
[ INFO  ] Hardware supports virtualization
[ INFO  ] Stage: Environment packages setup
[ INFO  ] Stage: Programs detection
[ INFO  ] Stage: Environment setup
[ ERROR ] Failed to execute stage 'Environment setup': Failed to reconfigure libvirt for VDSM
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20151026152232.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
Something went wrong setting up hosted engine, or the setup process was cancelled.

Press any key to continue...

If I then follow these steps:
[root@black-vdsb ~]# vdsm-tool configure --force

Checking configuration status...

Current revision of multipath.conf detected, preserving
libvirt is already configured for vdsm
SUCCESS: ssl configured to true. No conflicts

Running configure...
checking certs..
Reconfiguration of libvirt is done.

Done configuring modules to VDSM.

I'm getting the same results:
[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]: 
          Configuration files: ['/tmp/tmpRaZH19']
          Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20151026153500-e9t177.log
          Version: otopi-1.4.0 (otopi-1.4.0-1.el7ev)
[ INFO  ] Hardware supports virtualization
[ INFO  ] Stage: Environment packages setup
[ INFO  ] Stage: Programs detection
[ INFO  ] Stage: Environment setup
[ ERROR ] Failed to execute stage 'Environment setup': Failed to start service 'vdsmd'
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20151026153512.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
Something went wrong setting up hosted engine, or the setup process was cancelled.

Press any key to continue...



Please see the attached logs.
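For anyone hitting the same "Failed to start service 'vdsmd'" error, a hedged troubleshooting sketch. The unit names libvirtd/vdsmd are taken from the messages above; everything else is generic systemd inspection and the script only prints what it finds:

```shell
#!/bin/sh
# Sketch: collect the usual evidence when hosted-engine setup reports
# "Failed to start service 'vdsmd'". The unit names come from the error
# messages in this bug; the systemctl/journalctl flags are standard.
inspect_deploy_services() {
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemd not available on this host"
        return 0
    fi
    for svc in libvirtd vdsmd; do
        echo "== $svc =="
        # Prints active/inactive/failed; never abort on a bad unit.
        systemctl is-active "$svc" || true
        # The last 20 journal lines usually show why the unit failed.
        journalctl -u "$svc" --no-pager -n 20 2>/dev/null || true
    done
    echo "inspection done"
}

inspect_deploy_services
```

The journal output for the failed unit is usually more useful than the setup log alone, and is worth attaching alongside the hosted-engine-setup logs.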
Comment 58 Nikolai Sednev 2015-10-26 11:39 EDT
Created attachment 1086545 [details]
logs from the host (black)
Comment 59 Red Hat Bugzilla Rules Engine 2015-10-26 11:41:28 EDT
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Comment 60 Nikolai Sednev 2015-10-27 05:35:07 EDT
I've created a separate bug 1275571 so as not to mix the two issues, and have marked this bug as depending on bug 1275571. This is because verification is currently blocked: deployment fails on a network bridge creation issue.
Comment 61 Red Hat Bugzilla Rules Engine 2015-11-02 07:26:54 EST
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 62 Nikolai Sednev 2015-11-15 13:04:41 EST
Works for me; I failed to reproduce these errors on Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151113.123.el7ev) during both iSCSI and NFS deployments. Installed components:

mom-0.5.1-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
libvirt-1.2.17-13.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64
vdsm-4.17.10.1-0.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-node-branding-rhev-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-3.el7ev.noarch
ovirt-setup-lib-1.0.0-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-node-lib-config-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-selinux-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-3.el7ev.noarch
ovirt-node-plugin-snmp-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-rhn-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-legacy-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-cim-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-1.4.1-0.0.master.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
Comment 63 Sandro Bonazzola 2015-11-27 02:51:07 EST
Since oVirt 3.6.0 has been released, moving from VERIFIED to CLOSED CURRENTRELEASE.