Bug 1169290
Summary: | [hosted-engine] [iSCSI support] connectStoragePool fails with "SSLError: The read operation timed out" while adding a new host to the setup
---|---
Product: | Red Hat Enterprise Virtualization Manager
Reporter: | Elad <ebenahar>
Component: | ovirt-hosted-engine-setup
Assignee: | Simone Tiraboschi <stirabos>
Status: | CLOSED ERRATA
QA Contact: | Elad <ebenahar>
Severity: | urgent
Docs Contact: |
Priority: | high
Version: | 3.5.0
CC: | amureini, bazulay, dfediuck, didi, fsimonce, gklein, istein, lpeer, lsurette, lveyde, nsoffer, sbonazzo, scohen, sherold, stirabos, yeylon, ykaul
Target Milestone: | ovirt-3.6.0-rc
Keywords: | Reopened, ZStream
Target Release: | 3.6.0
Hardware: | x86_64
OS: | Unspecified
Whiteboard: |
Fixed In Version: |
Doc Type: | Bug Fix
Doc Text: |
Previously, the self-hosted engine talked to VDSM through the vdsClient wrapper utility, which could hit SSL timeout errors on long synchronous commands. With this release, the self-hosted engine uses the vdscli library for storage operations instead of vdsClient, reducing SSL timeouts.
Story Points: | ---
Clone Of: |
Cloned To: | 1190636 (view as bug list)
Environment: |
Last Closed: | 2016-03-09 19:06:54 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | Integration
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: | 1190207, 1215623, 1220824, 1247942
Bug Blocks: | 1157385, 1190636, 1234915
Attachments: |
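The Doc Text above says the fix replaced vdsClient calls with the vdscli library. As a purely illustrative sketch of that direction (the `timeout` keyword is an assumption based on bug 1190207 in the dependency list, not a confirmed API; check the installed `vdsm.vdscli` before relying on it):

```python
# Hypothetical sketch: talk to the local VDSM through vdscli rather than
# the vdsClient wrapper, with an explicit client-side timeout so that
# long synchronous verbs such as connectStoragePool are not cut off at
# the default 60 seconds. The timeout kwarg is assumed, not confirmed.
from vdsm import vdscli

server = vdscli.connect(timeout=600)
print(server.getVdsCapabilities()['status'])
```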
Created attachment 963192 [details]
logs from engine and vdsm from manual attempt
I was able to add the host to the DC manually (not via hosted-engine deployment).
The host connected successfully to the storage pool:

```
Thread-22::INFO::2014-12-01 11:22:51,921::logUtils::47::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
```

Attaching the logs.
This traceback was caused by a timeout from VDSM (vdsm log 1 on the second host):

```
Thread-89::WARNING::2014-12-01 10:52:08,639::fileSD::688::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/10.35.64.81:_export_elad_1
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 674, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 89, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
```

This looks like a temporary connection issue with the storage, and it seems correct to fail validation on a broken connection to storage. If more investigation into why the storage failure happened is needed, I suggest moving this to the storage team. Closing as not a bug for now; feel free to reopen if you have more details on the root cause of this failure, or if it always happens (meaning it is not a temporary connection issue).

Sandro, this failure happened several times during the day (100% reproducible). You can check the attached logs; there are several setup logs generated from each failure. To make sure this is not a temporary connection issue with the storage, I also tested it manually, as stated in comment #1.

Nir, Federico, can you take a look and tell what's going on on the storage side while Elad is trying to deploy the second host?

Not reproducible with:
- rhel66 installed on both hosts
- rhevm 3.5 vt12

Not reproducible with:
- rhel70 installed on both hosts (HE from vt13/brew)
- rhel66 in the hosted-engine VM with rhevm 3.5 vt12

Simone, can you sync with Elad and try to do a forensic analysis on his systems?
OK, found in vdsm.log for the corresponding timestamp:

```
Thread-89::WARNING::2014-12-01 10:52:08,639::fileSD::688::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/10.35.64.81:_export_elad_1
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 674, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 89, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-85::DEBUG::2014-12-01 10:52:08,640::fileUtils::142::Storage.fileUtils::(createdir) Creating directory: /rhev/data-center/mnt/blockSD/1fac6cbf-ccb1-47cb-95d1-743f1fc4ff52/images
Thread-85::WARNING::2014-12-01 10:52:08,641::fileUtils::149::Storage.fileUtils::(createdir) Dir /rhev/data-center/mnt/blockSD/1fac6cbf-ccb1-47cb-95d1-743f1fc4ff52/images already exists
Thread-85::DEBUG::2014-12-01 10:52:08,641::fileUtils::142::Storage.fileUtils::(createdir) Creating directory: /rhev/data-center/mnt/blockSD/1fac6cbf-ccb1-47cb-95d1-743f1fc4ff52/dom_md
Thread-85::WARNING::2014-12-01 10:52:08,641::fileUtils::149::Storage.fileUtils::(createdir) Dir /rhev/data-center/mnt/blockSD/1fac6cbf-ccb1-47cb-95d1-743f1fc4ff52/dom_md already exists
```

and something before:

```
Thread-93859::ERROR::2014-12-01 08:38:57,425::dispatcher::79::Storage.Dispatcher::(wrapper) Secured object is not in safe state
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 71, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1179, in prepare
    raise self.error
SecureError: Secured object is not in safe state
Thread-93859::DEBUG::2014-12-01 08:38:57,426::stompReactor::163::yajsonrpc.StompServer::(send) Sending response
```

It seems quite related to this one: https://bugzilla.redhat.com/show_bug.cgi?id=1159314

Elad, could you please report the NFS server configuration for 10.35.64.81:/export/elad/1?

Hosted engine is deployed over iSCSI, but there is also a shared domain over NFS. VDSM is failing to access the NFS storage domain metadata, and this is enough to cause a timeout on the connectStoragePool call from hosted engine.
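For illustration, the failing step inside collectMetaFiles can be sketched as below. This is distilled from the traceback above, not taken from VDSM's source: the IOProcess constructor arguments are assumptions (the traceback only shows that a `self.timeout` exists), and the glob pattern under the dead NFS mount is invented.

```python
# Sketch only: VDSM delegates filesystem access to an ioprocess helper so
# that an unresponsive NFS mount cannot hang the main process. The helper
# enforces its own timeout and raises ioprocess.Timeout, which is the
# "Connection timed out" seen in the traceback above.
from ioprocess import IOProcess, Timeout

iop = IOProcess(timeout=60)  # timeout kwarg assumed from self.timeout in the traceback
try:
    iop.glob('/rhev/data-center/mnt/10.35.64.81:_export_elad_1/*/dom_md/*')
except Timeout:
    print('NFS mount unresponsive; metadata scan gave up')
```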
Allon, it seems quite strange: hosted engine sends a connectStoragePool for a storage pool on the iSCSI storage domain, and VDSM fails getting info for another storage domain over NFS (probably something related to https://bugzilla.redhat.com/1159314):

```
Thread-89::WARNING::2014-12-01 10:52:08,639::fileSD::688::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/10.35.64.81:_export_elad_1
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 674, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 89, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
```

And this is enough to propagate the failure to the hosted-engine request for the iSCSI storage pool:

```
2014-12-01 10:52:08 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 196, in _validation
    self._storagePoolConnection()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 594, in _storagePoolConnection
    status, message = method(args=method_args)
  File "/usr/share/vdsm/vdsClient.py", line 920, in connectStoragePool
    pool = self.s.connectStoragePool(*connectArguments)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1224, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1578, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1264, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1294, in single_request
    response = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib64/python2.7/ssl.py", line 241, in recv
    return self.read(buflen)
  File "/usr/lib64/python2.7/ssl.py", line 160, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out
2014-12-01 10:52:08 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Setup validation': The read operation timed out
```

Shouldn't the iSCSI storage pool be connectable even if the NFS one is failing for some reason?
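The client side of that failure is plain xmlrpclib behavior. A minimal sketch of the failure mode, assuming the roughly 60-second client-side default that the later comments point at; the port and verb arguments are taken from the logs, but treat the whole snippet as illustrative rather than the setup's actual code path:

```python
import socket
import ssl
import xmlrpclib

# With an interpreter-wide 60-second socket timeout, any VDSM verb that
# legitimately needs longer (here connectStoragePool spent a full minute
# scanning a dead NFS domain) dies in the client's SSL read, even though
# VDSM eventually returns True on its side.
socket.setdefaulttimeout(60)
server = xmlrpclib.ServerProxy('https://localhost:54321')  # VDSM XML-RPC endpoint

try:
    server.connectStoragePool(
        '00000002-0002-0002-0002-000000000248',  # storage pool UUID (from the logs)
        2,                                       # host ID
        '00000002-0002-0002-0002-000000000248',  # scsiKey
        '0825d293-a9fa-4483-a526-cf822d0e4fb1',  # master SD UUID
        2)                                       # master version
except (ssl.SSLError, socket.error) as err:
    print('client gave up before the server answered: %s' % err)
```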
It looks like the root cause is that the connectStoragePool call to vdsm is taking one minute to complete:

```
Thread-22::DEBUG::2014-12-01 11:21:51,601::__init__::469::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.connect' in bridge with {u'masterVersion': 2, u'domainDict': {u'0825d293-a9fa-4483-a526-cf822d0e4fb1': u'active'}, u'storagepoolID': u'00000002-0002-0002-0002-000000000248', u'scsiKey': u'00000002-0002-0002-0002-000000000248', u'masterSdUUID': u'0825d293-a9fa-4483-a526-cf822d0e4fb1', u'hostID': 2}
...
Thread-22::DEBUG::2014-12-01 11:22:51,924::__init__::500::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.connect' in bridge with True
```

This causes ovirt-hosted-engine-setup to time out while waiting for the command to complete.

The command takes so long because part of the connectStoragePool flow is to scan the metadata of the storage domains associated with the storage pool being connected. In this scenario the domain 0825d293-a9fa-4483-a526-cf822d0e4fb1, associated with the mount point /rhev/data-center/mnt/10.35.64.81:_export_elad_1, is not accessible. We eventually time out while trying to read the domain info, and connectStoragePool succeeds; unfortunately, hosted-engine has already given up.

My point of view is that this bug should not block the release, since we're dealing with a case of failing storage. The workaround would be to ensure that all storage domains are accessible to the host.

(In reply to Adam Litke from comment #10)
> It looks like the root cause is that the connectStoragePool call to vdsm is
> taking one minute to complete:
> [...]
> We eventually time out while trying to read the domain info and
> connectStoragePool is successful. Unfortunately hosted-engine has given up
> already.

So basically, there's nothing we /can/ do on the storage side - HE setup needs a larger timeout, or even to wait indefinitely for the storage call to either succeed or fail.

Unfortunately we are already simply calling connectStoragePool as a synchronous command over vdsCli. I don't see any way on the hosted-engine side to force a different timeout on the low-level SSL connection.

Nir, do you see any way to force a different SSL timeout from vdsCli?
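For reference, the generic stdlib way to force a longer per-connection timeout on a plain xmlrpclib proxy is a custom Transport, sketched below. This is illustrative only, not the fix that shipped (which moved to vdscli); VDSM's SSL connections would need the same treatment on the SafeTransport side.

```python
import xmlrpclib

class TimeoutTransport(xmlrpclib.Transport):
    """Transport that applies an explicit socket timeout per connection."""

    def __init__(self, timeout, use_datetime=0):
        xmlrpclib.Transport.__init__(self, use_datetime)
        self._timeout = timeout

    def make_connection(self, host):
        conn = xmlrpclib.Transport.make_connection(self, host)
        conn.timeout = self._timeout  # httplib honors this when opening the socket
        return conn

# Hypothetical usage: give long synchronous verbs ten minutes instead of one.
server = xmlrpclib.ServerProxy('http://localhost:54321',
                               transport=TimeoutTransport(600))
```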
(In reply to Simone Tiraboschi from comment #12)
> Unfortunately we are already simply calling connectStoragePool as a
> synchronous command over vdsCli.
> I don't see any way on hosted-engine side to force a different timeout on
> the low level SSL connection.
>
> Nir, do you see any way to force a different SSL timeout from vdsCli?

No, but if vdsClient or vdscli uses a timeout of 60 seconds, this is a bug in that component.

(In reply to Nir Soffer from comment #13)
> No, but if vdsClient or vdscli uses a timeout of 60 seconds, this is a bug
> in that component.

Yes, it seems so. I'm opening a new bug on that.

Simone, can you set the new bug as blocking this one?

Fixing https://bugzilla.redhat.com/1190207 should be enough to fix this one. No additional change is needed; test only.

Yes, it would be better to have additional tests on it.

Automated message: can you please update the doctext or set it as not required?

Cannot be tested due to https://bugzilla.redhat.com/show_bug.cgi?id=1215623

Adding an additional host to a hosted-engine setup deployed over iSCSI or FC fails, as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1247942.

Verification flow:
- Hosted-engine deployed on one host over iSCSI.
- Created an NFS storage domain in the DC.
- On a second host, blocked connectivity to the storage domain in the hosted-engine setup and started hosted-engine deployment on that host. The deployment wasn't aborted for 5 minutes.
- After 5 minutes, restored connectivity to the storage domain from the second host.
- Deployment of the second host resumed automatically when connectivity was restored and finished successfully.

Verified using:
- ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
- ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch
- vdsm-4.17.10.1-0.el7ev.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0375.html
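The verification above shows the new behavior: deployment now waits out a storage outage instead of aborting on the first timeout. A hedged sketch of that pattern, purely illustrative; the actual logic lives in the fixes tracked by bugs 1190207 and 1247942, and `connect_pool` here is a hypothetical stand-in for the storage verb:

```python
import time

def wait_for_storage(connect_pool, retries=30, delay=10):
    """Retry a storage verb until it succeeds or the budget runs out.

    Illustrative only: models the observed behavior where deployment
    stalls while storage is unreachable and resumes once it is back.
    """
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            return connect_pool()
        except Exception as err:  # e.g. a VDSM timeout or connection error
            last_err = err
            print('attempt %d failed (%s); retrying in %ds' % (attempt, err, delay))
            time.sleep(delay)
    raise RuntimeError('storage never became reachable: %s' % last_err)
```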
Created attachment 963189 [details]
/var/log from both hosts and engine.log

Description of problem:
I tried to add a second host to the hosted-engine setup. I executed hosted-engine --deploy on the second host and picked the LUN already in use by the first host as the engine disk. The deployment failed with a "The read operation timed out" error.

Version-Release number of selected component (if applicable):
rhel7 installed on both hosts
rhev 3.5 vt12

How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted engine on a host, pick iSCSI. Once the setup is up, create a storage domain (NFS in my case).
2. Deploy hosted engine on an additional host. Pick the LUN where the engine disk is located:

```
Please specify the lun id (1, 2) [1]:
The specified storage location already contains a data domain. Is this an additional host setup (Yes, No)[Yes]?
[ INFO ] Installing on additional host
```

Actual results:
The operation fails with:

```
[ INFO ] Stage: Setup validation
[ ERROR ] Failed to execute stage 'Setup validation': The read operation timed out
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20141201105208.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
```

In hosted-engine-setup log:

```
2014-12-01 10:52:08 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 196, in _validation
    self._storagePoolConnection()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 594, in _storagePoolConnection
    status, message = method(args=method_args)
  File "/usr/share/vdsm/vdsClient.py", line 920, in connectStoragePool
    pool = self.s.connectStoragePool(*connectArguments)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1224, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1578, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1264, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1294, in single_request
    response = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib64/python2.7/ssl.py", line 241, in recv
    return self.read(buflen)
  File "/usr/lib64/python2.7/ssl.py", line 160, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out
2014-12-01 10:52:08 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Setup validation': The read operation timed out
```

Expected results:
The second host should be able to connect to the pool.

Additional info:
/var/log from both hosts and engine.log