Red Hat Bugzilla – Bug 1480071
Concurrent registrations seems to have frequent RHSM timeouts
Last modified: 2017-11-02 08:16:24 EDT
Description of problem:
In the Satellite 6.2.x releases, we could easily complete nearly 75 concurrent content host registrations. With the Satellite 6.3 Snap releases, this number has come down to about 40 concurrent registrations.
The error that appears frequently during the registrations is "Unable to establish server identity: Timed Out"
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Could you provide as much debug information as possible to pinpoint what the issues are? Without any data we can't do any analysis of where the issues actually are.
(In reply to Ivan Necas from comment #1)
> Could you provide as much debug information as possible to pinpoint what the
> issues are? Without any data we can't do any analysis of where the issues
> actually are.
While running concurrent registrations for content hosts (75 hosts registering concurrently to Satellite 6.3), we see a number of hosts fail to register to Satellite. The error displayed is:
Unable to verify server's identity: timed out
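For anyone trying to reproduce the load described above, here is a minimal sketch of driving many registrations at once. It assumes the content hosts are Docker containers (the log's virt facts mention docker) and that an activation key is used; the container names, organization label, and key name are hypothetical, and the `run` parameter exists only so the command runner can be swapped out.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def register_host(container, org, activation_key, run=subprocess.run):
    """Run `subscription-manager register` inside one container.

    Returns (container, return code, stripped stderr).
    """
    cmd = [
        "docker", "exec", container,
        "subscription-manager", "register",
        "--org", org,
        "--activationkey", activation_key,
    ]
    result = run(cmd, capture_output=True, text=True)
    return container, result.returncode, result.stderr.strip()


def register_all(containers, org, activation_key, run=subprocess.run):
    """Start one registration per container at roughly the same time."""
    # One worker per container so that no registration waits for a free thread.
    workers = max(1, len(containers))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(register_host, c, org, activation_key, run)
            for c in containers
        ]
        results = [f.result() for f in futures]
    # A non-zero exit status is treated as a failed registration.
    failed = [r for r in results if r[1] != 0]
    return results, failed
```

Calling `register_all(["host-1", "host-2"], "Example_Org", "example-key")` would return every result plus the subset that failed, which gives a success count to compare between Satellite versions.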
The RHSM log of a failed host contains the following trace:
2017-08-18 09:16:52,477 [INFO] subscription-manager:380:MainThread @hwprobe.py:916 - collected virt facts: virt.is_guest=True, virt.host_type=lxc, docker, virt.uuid=Not Set
2017-08-18 09:16:52,478 [INFO] subscription-manager:380:MainThread @facts.py:139 - Loading custom facts from: /etc/rhsm/facts/katello.facts
2017-08-18 09:23:52,851 [ERROR] subscription-manager:380:MainThread @managercli.py:174 - Error during registration: timed out
2017-08-18 09:23:52,851 [ERROR] subscription-manager:380:MainThread @managercli.py:175 - timed out
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/subscription_manager/managercli.py", line 1136, in _do_command
  File "/usr/lib64/python2.7/site-packages/rhsm/connection.py", line 928, in registerConsumer
    return self.conn.request_post(url, params)
  File "/usr/lib64/python2.7/site-packages/rhsm/connection.py", line 697, in request_post
    return self._request("POST", method, params)
  File "/usr/lib64/python2.7/site-packages/rhsm/connection.py", line 591, in _request
    response = conn.getresponse()
  File "/usr/lib64/python2.7/httplib.py", line 1089, in getresponse
  File "/usr/lib64/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 228, in read
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 213, in _read_bio
    return m2.ssl_read(self.ssl, size, self._timeout)
SSLTimeoutError: timed out
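The timestamps in this log show how long the client waited for the server's response before giving up. As a small sketch (the helper names are made up for illustration), the gap between the last INFO line and the first ERROR line can be computed like this:

```python
from datetime import datetime

# Two lines copied from the log above.
LOG_LINES = [
    "2017-08-18 09:16:52,478 [INFO] subscription-manager:380:MainThread "
    "@facts.py:139 - Loading custom facts from: /etc/rhsm/facts/katello.facts",
    "2017-08-18 09:23:52,851 [ERROR] subscription-manager:380:MainThread "
    "@managercli.py:174 - Error during registration: timed out",
]


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def gap_before_first_error(lines):
    """Seconds between the first [ERROR] line and the line just before it."""
    previous = None
    for line in lines:
        if "[ERROR]" in line and previous is not None:
            return (parse_ts(line) - parse_ts(previous)).total_seconds()
        previous = line
    return None


gap = gap_before_first_error(LOG_LINES)
```

For the two lines above the gap is about 420 seconds, so this client waited roughly seven minutes on the registration POST before the SSL read timed out.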
As a partial workaround, I tried increasing the RHSM timeout. This does increase the total number of hosts that register successfully, but it does not help much.
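The comment does not say which setting was changed. One plausible candidate is the `server_timeout` option in the `[server]` section of `/etc/rhsm/rhsm.conf`; treat that choice as an assumption. The sketch below edits a sample configuration held in a string rather than the real file, and the hostname and the value 600 are placeholders.

```python
import configparser
import io

# Hypothetical, trimmed-down stand-in for /etc/rhsm/rhsm.conf.
SAMPLE_CONF = """[server]
hostname = satellite.example.com
"""


def with_server_timeout(conf_text, seconds):
    """Return conf_text with [server] server_timeout set to `seconds`."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    if not parser.has_section("server"):
        parser.add_section("server")
    # Assumption: server_timeout is the option that controls the HTTP timeout.
    parser.set("server", "server_timeout", str(seconds))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = with_server_timeout(SAMPLE_CONF, 600)
```

Note that `configparser` drops comments when it writes a file back, so for a real host it is safer to change the value with a text editor or with the client's own configuration tooling.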
In Satellite 6.2, we could easily reach 75 concurrent host registrations, but the same load causes failures with Satellite 6.3.
Please let me know if you require logs from any specific component.
Michael: is this something we're aware of, and should it be addressed by a later cp version? Is there another bz we could close this as a dupe of?
This is likely not a complete fix (I have not analyzed the Katello interactions), but Candlepin 2.1.3-1 includes code that significantly reduces lock contention during bind. That contention often occurs during concurrent registration using activation keys that attach to a single pool.
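To see why contention on a single pool can surface as client timeouts, consider a toy model. It is not how Candlepin works internally: it assumes every bind takes the same lock and holds it for a fixed time, so requests are served strictly one after another. All numbers are hypothetical.

```python
def completed_before_timeout(n_clients, hold_seconds, timeout_seconds):
    """Count clients served within the timeout under full serialization.

    Toy model: the k-th client (1-based) finishes after k * hold_seconds,
    because it has to wait for the k - 1 clients ahead of it.
    """
    finish_times = [k * hold_seconds for k in range(1, n_clients + 1)]
    return sum(1 for t in finish_times if t <= timeout_seconds)


# Hypothetical inputs: 75 clients, 5 s of lock hold time each, 180 s timeout.
served = completed_before_timeout(75, 5, 180)
```

In this model, raising the client timeout lets more requests through but does not shorten the queue, whereas reducing the time each request holds the lock helps every client at once.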
We are going to be including Candlepin 2.1 in 6.3 which has some registration performance fixes that we hope will help with this situation.
Flagged as a regression, as this degraded from 6.2.
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1480071#c7, it seems the issue is no longer reproducible. We have also pulled in a new Candlepin version, which should have a positive influence on performance. I'm closing this for now; feel free to re-open it, with more details on what the issue is, if it reappears.