Bug 1480071 - Concurrent registrations seems to have frequent RHSM timeouts
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Performance
Hardware: Unspecified Unspecified
Priority: unspecified
Severity: low
Target Milestone: GA
Target Release: --
Assigned To: satellite6-bugs
Keywords: Regression, Triaged
Reported: 2017-08-10 01:32 EDT by sbadhwar
Modified: 2017-11-02 08:16 EDT
8 users

Last Closed: 2017-11-02 08:16:24 EDT
Type: Bug

Attachments: None
Description sbadhwar 2017-08-10 01:32:03 EDT
Description of problem:
In the Satellite 6.2.x releases, we could easily complete nearly 75 concurrent content host registrations. With the Satellite 6.3 Snap releases, this number has dropped to around 40 concurrent registrations.

The error that appears frequently during registration is "Unable to verify server's identity: timed out".

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:
Comment 1 Ivan Necas 2017-08-10 11:21:17 EDT
Could you provide as much debug information as possible to pinpoint what the issues are? Without any data we can't do any analysis of where the issues actually are.
Comment 2 sbadhwar 2017-08-21 07:29:54 EDT
(In reply to Ivan Necas from comment #1)
> Could you provide as much debug information as possible to pinpoint what the
> issues are? Without any data we can't do any analysis of where the issues
> actually are.

Hello Ivan,

While doing concurrent registrations of content hosts (75 hosts registering concurrently to Satellite 6.3), we see a number of hosts failing to register to Satellite. The error displayed is:

Unable to verify server's identity: timed out

On checking the RHSM log of a failed host, the following trace is present:

2017-08-18 09:16:52,477 [INFO] subscription-manager:380:MainThread @hwprobe.py:916 - collected virt facts: virt.is_guest=True, virt.host_type=lxc, docker, virt.uuid=Not Set
2017-08-18 09:16:52,478 [INFO] subscription-manager:380:MainThread @facts.py:139 - Loading custom facts from: /etc/rhsm/facts/katello.facts
2017-08-18 09:23:52,851 [ERROR] subscription-manager:380:MainThread @managercli.py:174 - Error during registration: timed out
2017-08-18 09:23:52,851 [ERROR] subscription-manager:380:MainThread @managercli.py:175 - timed out
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/subscription_manager/managercli.py", line 1136, in _do_command
  File "/usr/lib64/python2.7/site-packages/rhsm/connection.py", line 928, in registerConsumer
    return self.conn.request_post(url, params)
  File "/usr/lib64/python2.7/site-packages/rhsm/connection.py", line 697, in request_post
    return self._request("POST", method, params)
  File "/usr/lib64/python2.7/site-packages/rhsm/connection.py", line 591, in _request
    response = conn.getresponse()
  File "/usr/lib64/python2.7/httplib.py", line 1089, in getresponse
  File "/usr/lib64/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 228, in read
    return self._read_bio(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 213, in _read_bio
    return m2.ssl_read(self.ssl, size, self._timeout)
SSLTimeoutError: timed out
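
For reference, the time the client spent waiting on the server can be read off the two timestamps around the stall in the log above. The following is a minimal sketch in Python 3, assuming the log line format shown in the excerpt; the helper names are hypothetical and are not part of subscription-manager.

```python
from datetime import datetime

# Two lines copied from the rhsm.log excerpt above: the last entry before
# the stall and the first error after it.
LOG_EXCERPT = """\
2017-08-18 09:16:52,478 [INFO] subscription-manager:380:MainThread @facts.py:139 - Loading custom facts from: /etc/rhsm/facts/katello.facts
2017-08-18 09:23:52,851 [ERROR] subscription-manager:380:MainThread @managercli.py:174 - Error during registration: timed out
"""


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def stall_seconds(excerpt):
    """Return the seconds elapsed between the first and last non-empty line."""
    lines = [line for line in excerpt.splitlines() if line.strip()]
    return (parse_timestamp(lines[-1]) - parse_timestamp(lines[0])).total_seconds()


print("client waited %.1f seconds" % stall_seconds(LOG_EXCERPT))
```

For this excerpt the gap is roughly 420 seconds, i.e. about seven minutes between loading the custom facts and the timeout error.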

As a partial workaround, I tried increasing the RHSM timeout. This does raise the total number of hosts that register successfully, but it does not help much.
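
For anyone who wants to try the same timeout change, here is a minimal Python 3 sketch. It assumes the timeout is the `server_timeout` option in the `[server]` section of /etc/rhsm/rhsm.conf (please verify this against your subscription-manager version). The sample configuration, the hostname, and the value 300 are hypothetical; this comment does not state which value was actually used. The sketch works on a string instead of the real file, and note that `configparser` drops comments when it rewrites a file.

```python
import configparser
import io

# Hypothetical stand-in for /etc/rhsm/rhsm.conf; the hostname is made up.
SAMPLE_CONF = """\
[server]
hostname = satellite.example.com
port = 443
"""


def set_server_timeout(conf_text, seconds):
    """Return conf_text with [server] server_timeout set to `seconds`."""
    # RawConfigParser leaves '%(name)s' style values untouched.
    parser = configparser.RawConfigParser()
    parser.read_string(conf_text)
    if not parser.has_section("server"):
        parser.add_section("server")
    # Assumption: 'server_timeout' is the option name read by the client.
    parser.set("server", "server_timeout", str(seconds))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = set_server_timeout(SAMPLE_CONF, 300)
print(updated)
```

A longer timeout only gives slow requests more time to finish; it does not make the server respond faster, which fits the limited improvement seen here.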

In Satellite 6.2, we could easily reach a concurrent registration count of 75 hosts, but the same load causes problems with Satellite 6.3.

Please let me know if you require logs from any specific component.

Saurabh Badhwar
Comment 3 Ivan Necas 2017-08-23 04:34:42 EDT
Michael: is this something we're aware of that should be addressed with a later cp version? Is there another BZ we could close this as a dupe of?
Comment 4 Barnaby Court 2017-08-28 14:55:28 EDT
It is likely not a complete fix (I have not analyzed the Katello interactions), but Candlepin 2.1.3-1 includes code that significantly reduces lock contention during bind, which often occurs during concurrent registration using activation keys that attach to a single pool.
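
To illustrate why contention on a single pool can show up as client timeouts, here is a toy model in Python 3. It is not Candlepin's implementation: it simply assumes that binds against one pool are fully serialized and that each takes a fixed time. The function name and all numbers below are hypothetical.

```python
def completed_within_timeout(hosts, bind_seconds, timeout_seconds):
    """Count hosts whose serialized bind finishes before the client timeout.

    Toy assumption: all hosts start at the same time and the pool lock
    lets exactly one bind run at a time, each taking bind_seconds.
    """
    finished = 0
    for position in range(1, hosts + 1):
        # The host at this queue position waits for everyone ahead of it.
        completion_time = position * bind_seconds
        if completion_time <= timeout_seconds:
            finished += 1
    return finished


# Hypothetical numbers: 75 hosts, a 180 second client timeout, and two
# different per-bind durations.
for bind_seconds in (2.0, 4.0):
    ok = completed_within_timeout(75, bind_seconds, 180)
    print("bind=%.1fs -> %d of 75 hosts finish in time" % (bind_seconds, ok))
```

In this model, doubling the time spent holding the lock cuts the number of hosts that finish before the timeout, so shortening the locked section helps even if nothing else changes.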
Comment 9 Mike McCune 2017-08-29 12:08:09 EDT
We are going to be including Candlepin 2.1 in 6.3 which has some registration performance fixes that we hope will help with this situation.
Comment 10 Mike McCune 2017-08-29 12:08:44 EDT
Flagged as a regression, since this degraded from 6.2.
Comment 13 Ivan Necas 2017-11-02 08:16:24 EDT
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1480071#c7, the issue seems to be no longer reproducible, and we pulled in a new Candlepin version that should also have a positive influence on performance. I'm closing it for now; feel free to re-open, with more details on what the issue is, if it re-appears.
