Bug 1480071

Summary: Concurrent registrations seem to have frequent RHSM timeouts
Product: Red Hat Satellite
Reporter: sbadhwar
Component: Performance
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED WORKSFORME
QA Contact:
Severity: low
Docs Contact:
Priority: unspecified
Version: 6.3.0
CC: alosadag, bbuckingham, bcourt, inecas, mmccune, mstead, psuriset, sbadhwar
Target Milestone: Unspecified
Keywords: Regression, Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-11-02 12:16:24 UTC
Type: Bug

Description sbadhwar 2017-08-10 05:32:03 UTC
Description of problem:
In the Satellite 6.2.x releases, we were easily able to accomplish nearly 75 concurrent content host registrations. With the Satellite 6.3 Snap releases, this number has come down to only about 40 concurrent registrations.

The error that appears frequently during the registrations is "Unable to establish server identity: Timed Out"

Comment 1 Ivan Necas 2017-08-10 15:21:17 UTC
Could you provide as much debug information as possible to pinpoint what the issues are? Without any data we can't do any analysis of where the issues actually are.

Comment 2 sbadhwar 2017-08-21 11:29:54 UTC
(In reply to Ivan Necas from comment #1)
> Could you provide as much debug information as possible to pinpoint what the
> issues are? Without any data we can't do any analysis of where the issues
> actually are.

Hello Ivan,

While running concurrent registrations of content hosts (75 hosts registering concurrently to Satellite 6.3), we see a number of hosts failing to register to Satellite. The error displayed is:

Unable to verify server's identity: timed out

On checking the RHSM log of a failed host, the following trace is present:

2017-08-18 09:16:52,477 [INFO] subscription-manager:380:MainThread @hwprobe.py:916 - collected virt facts: virt.is_guest=True, virt.host_type=lxc, docker, virt.uuid=Not Set
2017-08-18 09:16:52,478 [INFO] subscription-manager:380:MainThread @facts.py:139 - Loading custom facts from: /etc/rhsm/facts/katello.facts
2017-08-18 09:23:52,851 [ERROR] subscription-manager:380:MainThread @managercli.py:174 - Error during registration: timed out
2017-08-18 09:23:52,851 [ERROR] subscription-manager:380:MainThread @managercli.py:175 - timed out
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/subscription_manager/managercli.py", line 1136, in _do_command
    content_tags=self.installed_mgr.tags)
  File "/usr/lib64/python2.7/site-packages/rhsm/connection.py", line 928, in registerConsumer
    return self.conn.request_post(url, params)
  File "/usr/lib64/python2.7/site-packages/rhsm/connection.py", line 697, in request_post
    return self._request("POST", method, params)
  File "/usr/lib64/python2.7/site-packages/rhsm/connection.py", line 591, in _request
    response = conn.getresponse()
  File "/usr/lib64/python2.7/httplib.py", line 1089, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 228, in read
    return self._read_bio(size)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Connection.py", line 213, in _read_bio
    return m2.ssl_read(self.ssl, size, self._timeout)
SSLTimeoutError: timed out
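
The gap between the last [INFO] line and the first [ERROR] line in the log above gives a rough upper bound on how long the client waited for the registration response. A minimal sketch of that arithmetic, using only the two timestamps from the log (the helper name seconds_between is made up for this illustration):

```python
from datetime import datetime

# Timestamp layout used by the RHSM log lines quoted above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two RHSM log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


# Copied from the log: the last [INFO] line and the first [ERROR] line.
last_info = "2017-08-18 09:16:52,478"
first_error = "2017-08-18 09:23:52,851"

elapsed = seconds_between(last_info, first_error)
print(f"client waited about {elapsed:.1f} s before timing out")
```

This is only an upper bound: the log does not show the moment the POST was actually sent, so the real wait on the socket may have been somewhat shorter.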


As a partial workaround, I tried increasing the RHSM timeout. This does increase the total number of hosts that register successfully, but it does not help much.
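
The comment does not say which setting was changed or to what value. As an illustration only, the sketch below assumes the client-side setting is server_timeout in the [server] section of /etc/rhsm/rhsm.conf, and works on an in-memory sample rather than the real file. The hostname, the starting value of 180, and the new value of 420 are placeholders, not values taken from this report.

```python
import configparser
import io

# Hypothetical excerpt of /etc/rhsm/rhsm.conf; all values are placeholders.
SAMPLE_RHSM_CONF = """\
[server]
hostname = satellite.example.com
server_timeout = 180
"""


def with_server_timeout(conf_text: str, seconds: int) -> str:
    """Return conf_text with [server] server_timeout set to `seconds`."""
    # interpolation=None keeps any '%(...)s' values in a real file untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    parser.set("server", "server_timeout", str(seconds))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = with_server_timeout(SAMPLE_RHSM_CONF, 420)
print(updated)
```

Raising the client timeout only lets each host wait longer. It does not make the server answer faster, which is consistent with the limited improvement described above.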

In Satellite 6.2, we were easily able to reach 75 concurrent registrations, but the same load causes registration failures with Satellite 6.3.

Please let me know if you require logs from any specific component.

Regards,
Saurabh Badhwar

Comment 3 Ivan Necas 2017-08-23 08:34:42 UTC
Michael: is this something we're aware of that should be addressed with a later Candlepin (cp) version? Is there another BZ we could close this as a dupe of?

Comment 4 Barnaby Court 2017-08-28 18:55:28 UTC
This is likely not a complete fix (I have not analyzed the Katello interactions), but Candlepin 2.1.3-1 includes code that significantly reduces lock contention during bind, which often occurs during concurrent registration using activation keys that attach to a single pool.
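
To see why contention on a single pool can surface as client-side timeouts, here is a toy model, not Candlepin's actual locking code. It assumes every bind holds one shared lock for a fixed time and that all clients arrive at once, so the client in queue position k finishes after k times the hold time. The client count matches the report, but the hold times and the 180-second timeout are hypothetical numbers chosen for illustration.

```python
from typing import List


def completion_times(clients: int, hold_seconds: float) -> List[float]:
    """Finish time of each bind when all clients queue on one lock."""
    return [hold_seconds * position for position in range(1, clients + 1)]


def count_timeouts(clients: int, hold_seconds: float,
                   timeout_seconds: float) -> int:
    """Number of clients whose bind finishes after their timeout expires."""
    finished = completion_times(clients, hold_seconds)
    return sum(1 for t in finished if t > timeout_seconds)


# Hypothetical inputs: 75 simultaneous clients, 180 s client timeout.
clients = 75
timeout = 180.0
for hold in (2.0, 4.0):
    failed = count_timeouts(clients, hold, timeout)
    print(f"hold={hold:.0f}s per bind: {failed} of {clients} clients time out")
```

In this model, doubling the time the lock is held moves the result from no failures to 30 of 75 clients timing out, which is why shortening the locked section can matter more than raising client timeouts.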

Comment 9 Mike McCune 2017-08-29 16:08:09 UTC
We are going to be including Candlepin 2.1 in 6.3 which has some registration performance fixes that we hope will help with this situation.

Comment 10 Mike McCune 2017-08-29 16:08:44 UTC
Flagged as a regression, as this degraded from 6.2.

Comment 13 Ivan Necas 2017-11-02 12:16:24 UTC
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1480071#c7, it seems the issue is no longer reproducible, and we pulled in a new Candlepin version, which should also have a positive influence on performance. I'm closing it for now; feel free to re-open it, with more details on what the issue is, if it reappears.