Bug 1169290

Summary: [hosted-engine] [iSCSI support] connectStoragePools fails with "SSLError: The read operation timed out" while adding a new host to the setup
Product: Red Hat Enterprise Virtualization Manager Reporter: Elad <ebenahar>
Component: ovirt-hosted-engine-setupAssignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA QA Contact: Elad <ebenahar>
Severity: urgent Docs Contact:
Priority: high    
Version: 3.5.0CC: amureini, bazulay, dfediuck, didi, fsimonce, gklein, istein, lpeer, lsurette, lveyde, nsoffer, sbonazzo, scohen, sherold, stirabos, yeylon, ykaul
Target Milestone: ovirt-3.6.0-rcKeywords: Reopened, ZStream
Target Release: 3.6.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, the self-hosted engine was talking with the VDSM-wrapping vdsClient utility. This could cause SSL timeout errors for long sync commands. With this release, the self-hosted engine uses the vdscli library for storage operations instead of vdsClient, reducing SSL timeouts.
Story Points: ---
Clone Of:
: 1190636 (view as bug list) Environment:
Last Closed: 2016-03-09 19:06:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Integration RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1190207, 1215623, 1220824, 1247942    
Bug Blocks: 1157385, 1190636, 1234915    
Attachments:
Description Flags
/var/log from both host and engine.log
none
logs from engine and vdsm from manual attempt none

Description Elad 2014-12-01 09:22:25 UTC
Created attachment 963189 [details]
/var/log from both host and engine.log

Description of problem:
I tried to add a second host to the hosted-engine setup. Executed hosted-engine --deploy on the second host, picked the LUN which is already in use by the first host as engine disk. The deployment failed with a "The read operation timed out" error.

Version-Release number of selected component (if applicable):
rhel7 installed on both hosts
rhev 3.5 vt12

How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted engine on a host, pick iSCSI. Once the setup is up, created a storage domain (NFS in my case)
2. Deploy hosted engine on an additional host. Pick the LUN where the engine disk located 

          Please specify the lun id (1, 2) [1]: 
          The specified storage location already contains a data domain. Is this an additional host setup (Yes, No)[Yes]? 
[ INFO  ] Installing on additional host

 

Actual results:
The operation fails with:
[ INFO  ] Stage: Setup validation
[ ERROR ] Failed to execute stage 'Setup validation': The read operation timed out
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20141201105208.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination

In hosted-engine-setup log:

20**FILTERED**4-**FILTERED**2-0**FILTERED** **FILTERED**0:52:08 DEBUG otopi.context context._executeMethod:**FILTERED**52 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line **FILTERED**42, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line **FILTERED**96, in _validation
    self._storagePoolConnection()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 594, in _storagePoolConnection
    status, message = method(args=method_args)
  File "/usr/share/vdsm/vdsClient.py", line 920, in connectStoragePool
    pool = self.s.connectStoragePool(*connectArguments)
  File "/usr/lib64/python2.7/xmlrpclib.py", line **FILTERED**224, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line **FILTERED**578, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line **FILTERED**264, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line **FILTERED**294, in single_request
    response = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line **FILTERED**045, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + **FILTERED**)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib64/python2.7/ssl.py", line 24**FILTERED**, in recv
    return self.read(buflen)
  File "/usr/lib64/python2.7/ssl.py", line **FILTERED**60, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out
20**FILTERED**4-**FILTERED**2-0**FILTERED** **FILTERED**0:52:08 ERROR otopi.context context._executeMethod:**FILTERED**6**FILTERED** Failed to execute stage 'Setup validation': The read operation timed out


Expected results:
Second host should be able to connect to the pool

Additional info: /var/log from both host and engine.log

Comment 1 Elad 2014-12-01 09:28:03 UTC
Created attachment 963192 [details]
logs from engine and vdsm from manual attempt

I was able to add the host to the DC manually (not via hosted engine deployment).
The host connected successfully to the storage pool:

Thread-22::INFO::2014-12-01 11:22:51,921::logUtils::47::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True

Attaching the logs

Comment 2 Sandro Bonazzola 2014-12-01 09:41:26 UTC
This traceback has been caused by a timeout from VDSM (vdsm log 1 on second host):

Thread-89::WARNING::2014-12-01 10:52:08,639::fileSD::688::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/10.35.64.81:_export_elad_1
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 674, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 89, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

which seems a teamporary connection issue with the storage and it seems correct to fail to validate a broken connection to storage.

If more investigation on why the storage failure happened I suggest to move to storage team. Closing as not a bug for now, feel free to reopen if you've more details on the root cause of this failure or if it happens always (so it's not a temporary connection issue).

Comment 3 Elad 2014-12-01 09:47:18 UTC
Sandro, this failure happened several times during the day (100% reproducible). You can check the attached logs, there are several setup logs generated from each failure. 
For making sure this is not a teamporary connection issue with the storage, I tested it also manually, as stated in comment #1

Comment 4 Sandro Bonazzola 2014-12-01 10:22:06 UTC
Nir, Federico, can you take a look and tell what's going on on the storage side while Elad is trying to deploy second host?

Comment 5 Simone Tiraboschi 2014-12-02 16:42:21 UTC
Not reproduce with:
 rhel66 installed on both hosts
 rhevm 3.5 vt12

Comment 6 Sandro Bonazzola 2014-12-03 12:04:37 UTC
Not reproducible with
  rhel70 installed on both hosts (HE from vt13/brew)
  rhel66 in the hosted engine vm with rhevm 3.5 vt12.

Simone, can you sync with Elad and try to do a forensic analysis on his systems?

Comment 7 Simone Tiraboschi 2014-12-05 15:06:24 UTC
Ok, found in vdsm.log for the corresponding timestamp:

Thread-89::WARNING::2014-12-01 10:52:08,639::fileSD::688::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/10.35.64.81:_export_elad_1
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 674, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 89, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-85::DEBUG::2014-12-01 10:52:08,640::fileUtils::142::Storage.fileUtils::(createdir) Creating directory: /rhev/data-center/mnt/blockSD/1fac6cbf-ccb1-47cb-95d1-743f1fc4ff52/images
Thread-85::DEBUG::2014-12-01 10:52:08,640::fileUtils::142::Storage.fileUtils::(createdir) Creating directory: /rhev/data-center/mnt/blockSD/1fac6cbf-ccb1-47cb-95d1-743f1fc4ff52/images
Thread-85::WARNING::2014-12-01 10:52:08,641::fileUtils::149::Storage.fileUtils::(createdir) Dir /rhev/data-center/mnt/blockSD/1fac6cbf-ccb1-47cb-95d1-743f1fc4ff52/images already exists
Thread-85::DEBUG::2014-12-01 10:52:08,641::fileUtils::142::Storage.fileUtils::(createdir) Creating directory: /rhev/data-center/mnt/blockSD/1fac6cbf-ccb1-47cb-95d1-743f1fc4ff52/dom_md
Thread-85::WARNING::2014-12-01 10:52:08,641::fileUtils::149::Storage.fileUtils::(createdir) Dir /rhev/data-center/mnt/blockSD/1fac6cbf-ccb1-47cb-95d1-743f1fc4ff52/dom_md already exists

and something before:

Thread-93859::ERROR::2014-12-01 08:38:57,425::dispatcher::79::Storage.Dispatcher::(wrapper) Secured object is not in safe state
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 71, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1179, in prepare
    raise self.error
SecureError: Secured object is not in safe state
Thread-93859::DEBUG::2014-12-01 08:38:57,426::stompReactor::163::yajsonrpc.StompServer::(send) Sending response

It seams quite related to this one:
https://bugzilla.redhat.com/show_bug.cgi?id=1159314

Elad, could you please report the NFS server configuration for 10.35.64.81:/export/elad/1 ?

Comment 8 Simone Tiraboschi 2014-12-05 15:12:21 UTC
Hosted engine is deployed over iSCSI but there is also a shared domain over NFS.

VDSM is failing accessing the NFS storage domain metadata and this is enough to generate a timeout on the connectStoragePool from hosted engine.

Comment 9 Simone Tiraboschi 2014-12-05 15:36:40 UTC
Allon,
it seams some quite strange:

hosted engine send a connectStoragePool for a storage pool on the iSCSI storage domain,
VDSM fails getting info for an other storage domain over NFS (probably something related to https://bugzilla.redhat.com/1159314 )

Thread-89::WARNING::2014-12-01 10:52:08,639::fileSD::688::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/10.35.64.81:_export_elad_1
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 674, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 89, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

And this is enough to propagate that to hosted-engine request for the iSCSI storage pool:
2014-12-01 10:52:08 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 196, in _validation
    self._storagePoolConnection()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 594, in _storagePoolConnection
    status, message = method(args=method_args)
  File "/usr/share/vdsm/vdsClient.py", line 920, in connectStoragePool
    pool = self.s.connectStoragePool(*connectArguments)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1224, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1578, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1264, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1294, in single_request
    response = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib64/python2.7/ssl.py", line 241, in recv
    return self.read(buflen)
  File "/usr/lib64/python2.7/ssl.py", line 160, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out
2014-12-01 10:52:08 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Setup validation': The read operation timed out

shouldn't the iSCSI storage pool be connectable also if the NFS one is failing for any reasons?

Comment 10 Adam Litke 2014-12-09 16:32:41 UTC
It looks like the root cause is that the connectStoragePool call to vdsm is taking one minute to complete:

Thread-22::DEBUG::2014-12-01 11:21:51,601::__init__::469::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.connect' in bridge with {u'masterVersion': 2, u'domainDict': {u'0825d293-a9fa-4483-a526-cf822d0e4fb1': u'active'}, u'storagepoolID': u'00000002-0002-0002-0002-000000000248', u'scsiKey': u'00000002-0002-0002-0002-000000000248', u'masterSdUUID': u'0825d293-a9fa-4483-a526-cf822d0e4fb1', u'hostID': 2}
...
Thread-22::DEBUG::2014-12-01 11:22:51,924::__init__::500::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.connect' in bridge with True

This causes ovirt-hosted-engine-setup to time out while waiting for the command to complete.

The reason the command takes so long is because part of the connectStoragePool flow is to scan the metadata of storage domains associated with the storage pool being connected.  In this scenario the domain 0825d293-a9fa-4483-a526-cf822d0e4fb1 associated with the mount point /rhev/data-center/mnt/10.35.64.81:_export_elad_1 is not accessible.  We eventually time out while trying to read the domain info and connectStoragePool is successful.  Unfortunately hosted-engine has given up already.

My point of view is that this bug should not block the release since we're dealing with a case of failing storage.  The workaround for this bug would be to ensure that all storage domains are accessible to the host.

Comment 11 Allon Mureinik 2014-12-09 18:36:16 UTC
(In reply to Adam Litke from comment #10)
> It looks like the root cause is that the connectStoragePool call to vdsm is
> taking one minute to complete:
> 
> Thread-22::DEBUG::2014-12-01
> 11:21:51,601::__init__::469::jsonrpc.JsonRpcServer::(_serveRequest) Calling
> 'StoragePool.connect' in bridge with {u'masterVersion': 2, u'domainDict':
> {u'0825d293-a9fa-4483-a526-cf822d0e4fb1': u'active'}, u'storagepoolID':
> u'00000002-0002-0002-0002-000000000248', u'scsiKey':
> u'00000002-0002-0002-0002-000000000248', u'masterSdUUID':
> u'0825d293-a9fa-4483-a526-cf822d0e4fb1', u'hostID': 2}
> ...
> Thread-22::DEBUG::2014-12-01
> 11:22:51,924::__init__::500::jsonrpc.JsonRpcServer::(_serveRequest) Return
> 'StoragePool.connect' in bridge with True
> 
> This causes ovirt-hosted-engine-setup to time out while waiting for the
> command to complete.
> 
> The reason the command takes so long is because part of the
> connectStoragePool flow is to scan the metadata of storage domains
> associated with the storage pool being connected.  In this scenario the
> domain 0825d293-a9fa-4483-a526-cf822d0e4fb1 associated with the mount point
> /rhev/data-center/mnt/10.35.64.81:_export_elad_1 is not accessible.  We
> eventually time out while trying to read the domain info and
> connectStoragePool is successful.  Unfortunately hosted-engine has given up
> already.
So basically, there's nothing we /can/ do on the storage side - HE setup needs a larger timeout, or even to wait indefinitely for the storage call to either succeed or fail.

Comment 12 Simone Tiraboschi 2014-12-11 14:12:01 UTC
Unfortunately we are already simply calling connectStoragePool as a synchronous command over vdsCli.
I don't see any way on hosted-engine side to force a different timeout on the low level SSL connection.

Nir, do you see any way to force a different SSL timeout from vdsCli?

Comment 13 Nir Soffer 2014-12-11 14:44:59 UTC
(In reply to Simone Tiraboschi from comment #12)
> Unfortunately we are already simply calling connectStoragePool as a
> synchronous command over vdsCli.
> I don't see any way on hosted-engine side to force a different timeout on
> the low level SSL connection.
> 
> Nir, do you see any way to force a different SSL timeout from vdsCli?

No, but if vdsClient or vdscli uses a timeout of 60 seconds, this is a bug in that component.

Comment 14 Simone Tiraboschi 2014-12-11 16:22:38 UTC
(In reply to Nir Soffer from comment #13)
> No, but if vdsClient or vdscli uses a timeout of 60 seconds, this is a bug
> in that component.

Yes, it seams so.
I'm opening a new bug on that.

Comment 15 Sandro Bonazzola 2014-12-12 14:32:53 UTC
Simone can you set the new bug for blocking this one?

Comment 16 Simone Tiraboschi 2015-02-06 17:06:34 UTC
Fixing https://bugzilla.redhat.com/1190207 should be enough to fix this one. No additional change is needed.
Test only.

Comment 18 Simone Tiraboschi 2015-02-09 08:24:35 UTC
Yes, it would be better to have additional tests on it.

Comment 20 Sandro Bonazzola 2015-02-20 11:07:51 UTC
Automated message: can you please update doctext or set it as not required?

Comment 22 Elad 2015-05-19 11:38:42 UTC
Cannot be tested due to https://bugzilla.redhat.com/show_bug.cgi?id=1215623

Comment 23 Elad 2015-08-03 08:08:42 UTC
Adding an additional host to a hosted-engine setup deployed over iSCSI or FC fails as reported https://bugzilla.redhat.com/show_bug.cgi?id=1247942.

Comment 24 Elad 2015-11-09 08:38:06 UTC
- Hosted-engine deployed on one host over iSCSI
- Created NFS storage domain in the DC
- On a second host, blocked connectivity to the storage domain in the hosted-engine setup and started hosted-engine deployment on that host. The deployment wasn't aborted for 5 minutes.
- After 5 minutes, restored connectivity to the storage domain from the second host.
- Deployment of the second host resumed automatically when the connectivity restored and it finished successfully.

Verified using 
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch
vdsm-4.17.10.1-0.el7ev.noarch

Comment 26 errata-xmlrpc 2016-03-09 19:06:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0375.html