Created attachment 1045043 [details]
sosreport

Description of problem:
Deployment of hosted engine failed on iSCSI storage, with a vdsm exception:
RuntimeError: Broken communication with supervdsm. Failed call to readSessionInfo

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.3.0-0.0.master.20150623153111.git68138d4.el7.noarch
vdsm-4.17.0-1054.git562e711.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Run hosted-engine --deploy
2. Choose iSCSI storage and enter all the necessary details
3.

Actual results:
After entering all the iSCSI storage details, deployment fails with an exception:
RuntimeError: Error block device action: ()

Expected results:
Deployment continues and completes without any errors.

Additional info:
See all logs in the attached sosreport.
In SuperVdsm, the call succeeds:

MainProcess|Thread-24::DEBUG::2015-06-30 17:03:28,242::supervdsmServer::114::SuperVdsm.ServerCallback::(wrapper) return getScsiSerial with SXtremIO_XtremApp_PSNT_Not_Set
MainProcess|Thread-24::DEBUG::2015-06-30 17:03:28,244::supervdsmServer::107::SuperVdsm.ServerCallback::(wrapper) call readSessionInfo with (1,) {}
MainProcess|Thread-24::DEBUG::2015-06-30 17:03:28,244::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) /sbin/iscsiadm -m iface -I default (cwd None)
MainProcess|Thread-24::DEBUG::2015-06-30 17:03:28,251::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-24::DEBUG::2015-06-30 17:03:28,244::supervdsmServer::114::SuperVdsm.ServerCallback::(wrapper) return readSessionInfo with IscsiSession(id=1, iface=<IscsiInterface name='default' transport='tcp' netIfaceName='None'>, target=IscsiTarget(portal=IscsiPortal(hostname='10.35.146.129', port=3260), tpgt=1, iqn='iqn.2008-05.com.xtremio:001e675b8ee0'), credentials=<storage.iscsi.ChapCredentials object at 0x7f49b0138d50>)

But VDSM is unable to get the result from it:

Thread-24::DEBUG::2015-06-30 17:03:28,252::supervdsm::76::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm
Thread-24::ERROR::2015-06-30 17:03:28,258::task::866::Storage.TaskManager.Task::(_setError) Task=`a5426845-cbba-4e7c-90e3-427cc2736e9b`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1984, in getDeviceList
    devices = self._getDeviceList(storageType=storageType, guids=guids)
  File "/usr/share/vdsm/storage/hsm.py", line 2014, in _getDeviceList
    for dev in multipath.pathListIter(guids):
  File "/usr/share/vdsm/storage/multipath.py", line 304, in pathListIter
    sess = iscsi.getSessionInfo(sessionID)
  File "/usr/share/vdsm/storage/iscsi.py", line 87, in getSessionInfo
    return supervdsm.getProxy().readSessionInfo(sessionID)
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
    % self._funcName)
RuntimeError: Broken communication with supervdsm. Failed call to readSessionInfo

Yaniv - we need infra's help here please.
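For readers unfamiliar with the call path: the traceback above ends in vdsm's supervdsm proxy, which forwards privileged calls (here readSessionInfo) to supervdsmd over an IPC channel and turns any local failure into the generic "Broken communication" error. The following is only a minimal sketch of that pattern, not the actual vdsm source; the names _SupervdsmCall and call_remote are hypothetical.

    class _SupervdsmCall(object):
        def __init__(self, proxy, func_name):
            self._proxy = proxy
            self._funcName = func_name

        def __call__(self, *args, **kwargs):
            try:
                # Forward the call over the IPC channel to supervdsmd.
                return self._proxy.call_remote(self._funcName, args, kwargs)
            except Exception:
                # The remote side may have answered correctly (as the
                # supervdsm log shows), yet the local read of the reply can
                # still fail, e.g. if the channel broke or the read was
                # interrupted, producing the error seen in the traceback.
                raise RuntimeError("Broken communication with supervdsm. "
                                   "Failed call to %s" % self._funcName)

The point of the sketch is that the error message only tells us the reply never arrived on the vdsm side; it does not say why supervdsmd went away.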
I see in supervdsm.log many failures after:

MainThread::DEBUG::2015-06-30 17:02:24,434::__init__::47::blivet::(register_device_format) registered device format class BIOSBoot as biosboot
MainThread::DEBUG::2015-06-30 17:02:24,474::storage_log::69::blivet::(log_exception_info) IGNORED: Caught exception, continuing.
MainThread::DEBUG::2015-06-30 17:02:24,475::storage_log::72::blivet::(log_exception_info) IGNORED: Problem description: failed to get initiator name from iscsi firmware
MainThread::DEBUG::2015-06-30 17:02:24,475::storage_log::73::blivet::(log_exception_info) IGNORED: Begin exception details.
MainThread::DEBUG::2015-06-30 17:02:24,475::storage_log::76::blivet::(log_exception_info) IGNORED: Traceback (most recent call last):
MainThread::DEBUG::2015-06-30 17:02:24,475::storage_log::76::blivet::(log_exception_info) IGNORED:   File "/usr/lib/python2.7/site-packages/blivet/iscsi.py", line 87, in __init__
MainThread::DEBUG::2015-06-30 17:02:24,475::storage_log::76::blivet::(log_exception_info) IGNORED:     initiatorname = libiscsi.get_firmware_initiator_name()
MainThread::DEBUG::2015-06-30 17:02:24,475::storage_log::76::blivet::(log_exception_info) IGNORED: IOError: Unknown error
MainThread::DEBUG::2015-06-30 17:02:24,475::storage_log::77::blivet::(log_exception_info) IGNORED: End exception details.
MainThread::DEBUG::2015-06-30 17:02:24,482::supervdsmServer::486::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm

After this call to libiscsi.get_firmware_initiator_name, supervdsmd is restarted. Please figure out why it kills the process. It is not related to the communication between vdsm and supervdsm; the broken communication happens once after each supervdsm crash, but the crash is the bug here.
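Note that, according to the log lines above, blivet itself already treats this lookup as non-fatal: it catches the IOError and logs it as "IGNORED: Caught exception, continuing." The sketch below only illustrates that log-and-continue pattern as an assumption drawn from the log; it is not a copy of blivet or supervdsm code, and the import path for libiscsi is assumed.

    import logging

    import libiscsi  # python bindings; import path assumed for illustration

    log = logging.getLogger("blivet")

    try:
        # The call that appears in the ignored traceback above.
        initiatorname = libiscsi.get_firmware_initiator_name()
    except IOError:
        # blivet logs the failure and continues, so this exception alone
        # should not bring supervdsmd down; something else kills the process.
        log.debug("IGNORED: failed to get initiator name from iscsi firmware",
                  exc_info=True)
        initiatorname = None

So the open question in this comment is why supervdsmd still dies right after this point even though the exception is swallowed.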
Moving to VDSM according to comment #1 and comment #2
This is not a storage issue; this is an error in supervdsm, probably related to a multiprocessing call failing after receiving a signal.

These issues started when we added zombiereaper to supervdsm. Each time a process ends, supervdsm gets a SIGCHLD signal. If a multiprocessing call is interrupted by the signal, our code typically fails in the wrong way, because we do not check and handle EINTR in such calls.

This may also be a duplicate of https://bugzilla.redhat.com/1259310.

Greg, do you have some insight on this?

I think this should move to infra.
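To make the EINTR hypothesis concrete, here is a minimal sketch of the retry pattern that such calls would need; this is an assumption about the kind of fix implied above, not vdsm's actual code. On Python 2 (before PEP 475), a SIGCHLD delivered while a read on the multiprocessing connection is blocked surfaces as IOError/OSError with errno EINTR, and a caller that does not retry sees it as a broken channel.

    import errno


    def eintr_safe(func, *args, **kwargs):
        """Retry func until it completes without being interrupted by a signal."""
        while True:
            try:
                return func(*args, **kwargs)
            except (IOError, OSError) as e:
                if e.errno != errno.EINTR:
                    raise
                # Interrupted by a signal (e.g. SIGCHLD from zombiereaper);
                # the call did not fail, so simply retry it.


    # Hypothetical usage with a multiprocessing connection object `conn`:
    # result = eintr_safe(conn.recv)

Without a wrapper like this, every child-process exit reaped by zombiereaper can make an otherwise healthy proxy call look like "Broken communication with supervdsm".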
(In reply to Nir Soffer from comment #5)
> This is not a storage issue; this is an error in supervdsm, probably
> related to a multiprocessing call failing after receiving a signal.
>
> These issues started when we added zombiereaper to supervdsm. Each time
> a process ends, supervdsm gets a SIGCHLD signal. If a multiprocessing call
> is interrupted by the signal, our code typically fails in the wrong way,
> because we do not check and handle EINTR in such calls.
>
> This may also be a duplicate of https://bugzilla.redhat.com/1259310.
>
> Greg, do you have some insight on this?

It seems very similar to the bug you mention, which was hosted-engine setup failing with file storage. Given that this started with zombiereaper and also involves supervdsm's _runcmd(), I think what you're proposing is likely. It is probably worth checking whether it still happens now that the supervdsm fix for bug 1259310 is merged.
Why isn't this marked as a duplicate of bug 1259310?
Verified on ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch.
Deployment succeeded without any errors, so I believe Elad can move it to ON_QA.
I'm not sure why this bug is ON_QA if there is no patch that fixes the issue reported here. Should we verify this bug or close it as a DUP of the other one (1259310)?
Moving to VERIFIED as this has the TestOnly keyword.
Tested hosted-engine deployment over iSCSI and it succeeded.

Verified using:
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch
vdsm-4.17.10.1-0.el7ev.noarch
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE