Bug 720665 - Load balancing fails with 'problem making ssl connection' on rhel6 and fedora clients
Summary: Load balancing fails with 'problem making ssl connection' on rhel6 and fedora...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Update Infrastructure for Cloud Providers
Classification: Red Hat
Component: Tools
Version: 2.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: ---
Assignee: Jay Dobies
QA Contact: wes hayutin
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-07-12 12:41 UTC by Kedar Bidarkar
Modified: 2012-05-31 12:57 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-05-31 12:57:58 UTC
Target Upstream Version:


Attachments (Terms of Use)

Description Kedar Bidarkar 2011-07-12 12:41:23 UTC
Description of problem:



As per the doc bug on load balancer, "That mirror list round robins the order of CDS instances returned", this is the reason I see this issue only for alternate yum tries after yum clean all. yum is having issues fetching from the other url when the primary url is active. I think this will cause a lot of load on a single CDS node, risking the load balancing.


[root@dhcp201-140 ~]# yum clean all
Loaded plugins: rhui-lb
Cleaning up Everything
[root@dhcp201-140 ~]# yum install zsh 
Loaded plugins: rhui-lb
https://dhcp201-131.englab.pnq.redhat.com/pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/repodata/repomd.xml: [Errno 14] problem making ssl connection
Trying other mirror.
rhui-rhel-server-6-releases                                                                                                                               | 3.7 kB     00:00     
rhui-rhel-server-6-releases/primary_db                                                                                                                    | 2.9 MB     00:00     
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package zsh.x86_64 0:4.3.10-4.1.el6 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

=================================================================================================================================================================================
 Package                          Arch                                Version                                     Repository                                                Size
=================================================================================================================================================================================
Installing:
 zsh                              x86_64                              4.3.10-4.1.el6                              rhui-rhel-server-6-releases                              2.1 M

Transaction Summary
=================================================================================================================================================================================
Install       1 Package(s)
Upgrade       0 Package(s)

Total download size: 2.1 M
Installed size: 4.8 M
Is this ok [y/N]: n
Exiting on user Command
Complete!
[root@dhcp201-140 ~]# yum clean all
Loaded plugins: rhui-lb
Cleaning up Everything
[root@dhcp201-140 ~]# yum install zsh 
Loaded plugins: rhui-lb
rhui-rhel-server-6-releases                                                                                                                               | 3.7 kB     00:00     
rhui-rhel-server-6-releases/primary_db                                                                                                                    | 2.9 MB     00:00     
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package zsh.x86_64 0:4.3.10-4.1.el6 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

=================================================================================================================================================================================
 Package                          Arch                                Version                                     Repository                                                Size
=================================================================================================================================================================================
Installing:
 zsh                              x86_64                              4.3.10-4.1.el6                              rhui-rhel-server-6-releases                              2.1 M

Transaction Summary
=================================================================================================================================================================================
Install       1 Package(s)
Upgrade       0 Package(s)

Total download size: 2.1 M
Installed size: 4.8 M
Is this ok [y/N]: n
Exiting on user Command
Complete!

Version-Release number of selected component (if applicable):

[root@dhcp201-137 noarch]# rpm -qav | grep -ie pulp -ie rh-rhui-tools
pulp-common-0.0.204-1.el6.noarch
pulp-client-0.0.204-1.el6.noarch
rh-rhui-tools-2.0.37-1.el6.noarch
pulp-0.0.204-1.el6.noarch


How reproducible:



Steps to Reproduce:
1.yum clean all
2. yum install zsh 
3. yum clean all
4. yum install zsh
  
Actual results:

Throws error saying 'problem making ssl connection' when using the other url from the mirrorlist.

Expected results:

Shouldn't be facing ssl connection issue on consecutive attempt of installation.

Additional info:


1) It works perfectly for the rhel5 clients, verified also from the /var/log/httpd/ssl_request.log file on both the machines.

2) Also on the primary load balancer node if httpd service is shutdown, the other node then serves requests.

Comment 1 Jay Dobies 2011-07-12 13:25:46 UTC
There's a lot missing from this bug report. Again, instead of just dumping a bunch of console output, please describe your environment. There is really not much for me to go on in this.

Have you tried to make a yum connection directly to both of the CDS instances to see if they are both working? How about wget?

For that matter, what is it that you're trying to test? A normal working environment or a situation where one of the CDS instances has failed?


I also have questions on the report contents itself:

"As per the doc bug on load balancer, "That mirror list round robins the order
of CDS instances returned", this is the reason I see this issue only for
alternate yum tries after yum clean all."

I don't understand. Are you saying one CDS is down?

"yum is having issues fetching from the
other url when the primary url is active."

What do you mean by "when the primary url is active"? What does active mean? Is the server for the primary URL up or down?

"I think this will cause a lot of load on a single CDS node, risking the load balancing.""

So are you saying the load balancing isn't evenly distributing the load?

"2) Also on the primary load balancer node if httpd service is shutdown, the
other node then serves requests."

This makes it sound like you were trying to test a downed CDS, in which case of course there will be a lot of load on a single CDS. But again, I don't have a clue what it is you're trying to do so I'm left guessing.

Comment 2 Kedar Bidarkar 2011-07-12 15:32:20 UTC
My mistake, I think didn't use the right terminology.

CDS1 = primary load balancer = primary url = here dhcp201-136
CDS2 = other url  = here dhcp201-131
active mean = apache/httpd  is running

I am not trying downed cds, but just to rule out some issue with CDS2/dhcp201-131 node , I shutdown the apache service on CDS1/dhcp201-136 and now able to fetch stuff from CDS2/dhcp201-131 without any "problem making ssl connection" issue. This behaviour is weired. 

I have two CDS nodes.

>>>>> Have you tried to make a yum connection directly to both of the CDS instances
to see if they are both working? How about wget?

Yes, yum connection directly to both of the CDS instances works. 


>>>>> For that matter, what is it that you're trying to test? A normal working
environment or a situation where one of the CDS instances has failed?

A normal working environment, both CDS nodes running httpd service.

The issue is that the requests are being sent only to a single CDS node in the case of rhel6 and fedora clients, this works perfectly fine in the case of rhel5.

I believe that "mirror list round robins the order of CDS instances returned", this happens I guess in the case of rhel6 and fedora clients , but for some reason when the CDS node is not primary load balancer , it throws an error saying "Problem making ssl connection".(NOTE: While a direct yum connection request to this node is perfect)

Trying to say that whenever the other url(non primary load balancer url ) is being used or lets say here being returned, its having "problem making ssl connection" ( Both the CDS nodes are running the apache services)

##########################################################################
Below, here I think the negative testing approach may explain things more clearly, which is being a bit tough because of load balancing and round robin stuff.


negative testing => Below ouput is when, trying to download without entitlement certs.

===============================================================================
[root@dhcp201-140 ~]# yum install zsh 
Loaded plugins: rhui-lb
https://dhcp201-131.englab.pnq.redhat.com/pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/repodata/repomd.xml: [Errno 14] problem making ssl connection
Trying other mirror.
https://dhcp201-136.englab.pnq.redhat.com/pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/repodata/repomd.xml: [Errno 14] HTTP Error 401 : https://dhcp201-136.englab.pnq.redhat.com/pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/repodata/repomd.xml 
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: rhui-rhel-server-6-releases. Please verify its path and try again

===============================================================================

FRom the above output we can see that the error here it is again , 
a)  problem making ssl connection , for https://dhcp201-131.englab.pnq.redhat.com  which is the CDS2 node or other url I am referring.

b) The error should have been the same in both the cases(for both the url) as Authorization is Required.

###############################################################################
>>>>>> So are you saying the load balancing isn't evenly distributing the load?

yes, in the case of only rhel 6 and fedora and not rhel5.

Comment 3 Jay Dobies 2011-07-13 19:00:31 UTC
Ok, I think I get what you're saying:

- You have two CDS instances running (we'll call one primary and one secondary).

- Yum is configured to use the load balancer on the primary.

- Every other request results in a "problem making ssl connection" error. This lines up to only requests that get balanced to the secondary CDS instance.

- This issue is not seen on RHEL 5, only RHEL 6 clients. RHEL 5 clients connecting to the primary CDS as the load balancer can retrieve content from both CDS instances without this error.

- You can use yum on the RHEL 6 client to connect directly to the secondary CDS correctly. Keep in mind that when testing this, the ID of the repository can _not_ start with "rhui-", otherwise the RHUI load balancing plugin will kick in and affect the destination URL. The safest bet to elliminate all of this would be to use curl directly against both CDS instances to retrieve repomd.xml.

- If you shut down the primary and the RHEL 6 client fails over to using the secondary as the load balancer, then subsequent yum requests to the secondary CDS actually succeed.

Is all of that correct? 

What version of Fedora did you also test this on? If it turns out to be an issue in the yum plugin, I'd guess it's a version thing (since Fedora and RHEL 6 yum versions are probably closer than they are to RHEL 5).

Comment 4 Kedar Bidarkar 2011-07-14 06:52:51 UTC
I have to change the bugs name.

Fedora is F14

>>>>>>If you shut down the primary and the RHEL 6 client fails over to using the
secondary as the load balancer, then subsequent yum requests to the secondary
CDS actually succeed.

1) FAILOVER is working fine, but the issue is with LOAD BALANCING feature.

2) Always, all the requests go to the CDS1 node in the case of RHEL6 and F14 clients, when both the CDS apache servers are running.

3) I see no log updates in ssl_request.log for CDS2 node.

4) As, this works perfectly for RHEL5, may be the issue is with yum plugin, not sure though.

5) Also while connecting directly to the secondary CDS it works, I try this way by disabling the plugin in rhui-lb.conf file, and using the below url.

baseurl=https://dhcp201-131.englab.pnq.redhat.com/pulp/repos/content/dist/rhel/rhui/server-6/releases/$releasever/$basearch/os

Comment 5 Jay Dobies 2011-07-14 12:50:01 UTC
So it's not that the load balancer isn't balancing the load across CDS 1 and CDS 2, it's that when the client attempts to go to CDS 2 for content it's getting this error? The LB on CDS 1 is successfully rotating the returned mirror list, but in cases where the client is getting CDS 2 as the first in the list it fails with the problem making ssl connection error?

So it's not really a load balancing bug, since the load is still being balanced. It's that the calls to CDS 2, when accessed through the load balancing functionality, fail. But only in that case; directly to CDS 2 works correctly?

Comment 6 Jay Dobies 2011-07-14 13:12:33 UTC
Removing needinfo; Kedar confirmed the last comment in chat.

Comment 7 Kedar Bidarkar 2011-07-19 10:01:34 UTC
Also observed with pulp 211 and rhui 2.0.40

Comment 8 Kedar Bidarkar 2011-07-19 10:11:42 UTC
May be related to bug , https://bugzilla.redhat.com/show_bug.cgi?id=716951

Also,

If nss and curl rpms are upgraded to 

=> nss 3.12.9-9
=> curl-7.19.7-26

then the actual load-balancing functionality is working fine. 
(the comment 5 issue not observed after upgrading to above nss and curl rpms)

Comment 9 Jay Dobies 2011-07-25 19:30:05 UTC
I was able to reproduce this using the upgraded versions of NSS and curl.

I think this has something to do with NSS locally caching the certificates. I have two CDS instances set up, "cds-1" and "cds-2". I can connect to cds-1 fine. When I try to use curl to go to cds-2, I get this:

[root@ip-10-195-79-175 yum.repos.d]# curl --cacert /etc/pki/entitlement/ca.crt --cert /etc/pki/entitlement/product/content.crt --key /etc/pki/entitlement/key.pem -v https://cds-2/pulp/repos//content/dist/rhel/rhui/server/5Server/x86_64/rhui/1.2/os/repodata/repomd.xml
* About to connect() to cds-2 port 443 (#0)
*   Trying 10.83.106.3... connected
* Connected to cds-2 (10.83.106.3) port 443 (#0)
* Initializing NSS with certpath: /etc/pki/nssdb
*   CAfile: /etc/pki/entitlement/ca.crt
  CApath: none
* NSS error -8054
* Closing connection #0
* SSL connect error
curl: (35) SSL connect error


The NSS error -8054 has to do with certificate revocation lists. The cert isn't revoked per se, but I think NSS is getting confused.

My initial thought was that it was in the server-side SSL certs, but in this case both have unique serial numbers. I think this is a client-side thing.

Comment 10 Jay Dobies 2011-07-25 19:32:07 UTC
I lied, it's not about CRLs. The description of -8054 is:

"You are attempting to import a cert with the same issuer/serial as an existing cert, but that is not the same cert."

That's probably what's happening; I'm trying to use the same client-side cert as was used previously and it's barfing on the import.

Comment 11 Jay Dobies 2011-07-26 13:13:26 UTC
Kedar -

Can you check the serial numbers of all of your CDS instances' SSL certificates and your client-side entitlement certificate? You can find it in the `openssl x509 -text -in <cert` view.

I suspect the issue may be that there's a duplicate serial number in there somewhere and NSS is viewing that as a security breach (rightfully so).

Comment 12 Jay Dobies 2011-07-26 13:18:46 UTC
commit 1248a82a1235415934e2f6b3bc52930603fb834e
Author: Jay Dobies <jason.dobies>
Date:   Tue Jul 26 09:16:42 2011 -0400

    720665 - Default the serial number to 100. This should alleviate any
    potential overlap in serial numbers if the same CA was used to sign the CDS SSL
    certificates.

rhui-2.0/tools/src/rhui/tools/launcher.py

Comment 13 Kedar Bidarkar 2011-07-26 14:32:11 UTC
[root@ip-10-91-17-24 build]# openssl x509 -in cds1i-950da0f4.crt -noout   -serial
serial=08
[root@ip-10-91-17-24 build]# openssl x509 -in cds2i-970da0f6.crt -noout   -serial
serial=08
[root@ip-10-91-17-24 build]# openssl x509 -in rhuai-910da0f0.crt -noout   -serial
serial=08


And also a single CA was being used for all SSL certs.

We will try to use different serial numbers now and proceed with the testing.

Comment 14 Kedar Bidarkar 2011-07-27 15:27:26 UTC
After using different serial numbers for CDS nodes, the load balancing stuff is working fine for rhel6 nodes

[root@ip-10-172-38-137 ~]# rpm -qav | grep -ie "^curl" -ie "nss-3"
nss-3.12.7-2.el6.x86_64
curl-7.19.7-16.el6.x86_64


Need to still verify on fedora clients.

Comment 15 Kedar Bidarkar 2011-07-29 11:18:10 UTC
LOAD BALANCING stuff is working fine.


CDS2 ssl_request_log
10.88.30.207 - /CN=Red Hat Update Infrastructure [28/Jul/2011:14:12:36 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/repodata/repomd.xml HTTP/1.1" 200 3828
10.88.30.207 - /CN=Red Hat Update Infrastructure [28/Jul/2011:14:12:36 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/repodata/7e94eeabef95726e82318ff8b73520540d2747feb3c0425130fcafdc560340c1-primary.sqlite.bz2 HTTP/1.1" 200 3002152
10.88.30.207 - /CN=Red Hat Update Infrastructure [28/Jul/2011:14:12:43 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/Packages/zsh-4.3.10-4.1.el6.x86_64.rpm HTTP/1.1" 200 2240608

CDS1 ssl_request log primary load balancer

10.88.30.207 - - [28/Jul/2011:14:12:35 -0400] "GET /pulp/mirror/?members HTTP/1.1" 200 58
10.88.30.207 - - [28/Jul/2011:14:12:35 -0400] "GET /pulp/mirror/content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os HTTP/1.1" 200 216
10.88.30.207 - - [28/Jul/2011:14:13:21 -0400] "GET /pulp/mirror/?members HTTP/1.1" 200 58
10.88.30.207 - - [28/Jul/2011:14:13:21 -0400] "GET /pulp/mirror/content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os HTTP/1.1" 200 216
10.88.30.207 - /CN=Red Hat Update Infrastructure [28/Jul/2011:14:13:21 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/repodata/repomd.xml HTTP/1.1" 200 3828
10.88.30.207 - /CN=Red Hat Update Infrastructure [28/Jul/2011:14:13:21 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/repodata/7e94eeabef95726e82318ff8b73520540d2747feb3c0425130fcafdc560340c1-primary.sqlite.bz2 HTTP/1.1" 200 3002152
10.88.30.207 - /CN=Red Hat Update Infrastructure [28/Jul/2011:14:13:27 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server-6/releases/6Server/x86_64/os/Packages/zsh-4.3.10-4.1.el6.x86_64.rpm HTTP/1.1" 200 2240608


fed14 cds1 ssl_request log

fed14/os/repodata/repomd.xml HTTP/1.1" 200 2693
10.192.198.80 - "" [29/Jul/2011:03:33:32 -0400] "GET /pulp/repos//baseos/x86_64/fed14/os/repodata/primary.sqlite.bz2 HTTP/1.1" 200 8497
10.192.198.80 - "" [29/Jul/2011:03:33:34 -0400] "GET /pulp/repos//baseos/x86_64/fed14/os/zsh-4.3.10-5.fc14.x86_64.rpm HTTP/1.1" 200 2240844
10.86.222.208 - - [29/Jul/2011:03:58:33 -0400] "GET /pulp/mirror/?members HTTP/1.1" 200 58
10.86.222.208 - "" [29/Jul/2011:03:58:34 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server/5Server/x86_64/os/repodata/repomd.xml HTTP/1.1" 401 498
10.86.222.208 - - [29/Jul/2011:04:58:33 -0400] "GET /pulp/mirror/?members HTTP/1.1" 200 58
10.86.222.208 - "" [29/Jul/2011:04:58:34 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server/5Server/x86_64/os/repodata/repomd.xml HTTP/1.1" 401 498


cds2 ssl_request log

10.192.198.80 - "" [29/Jul/2011:03:33:04 -0400] "GET /pulp/repos//baseos/x86_64/fed14/os/repodata/repomd.xml HTTP/1.1" 200 2693
10.192.198.80 - "" [29/Jul/2011:03:33:05 -0400] "GET /pulp/repos//baseos/x86_64/fed14/os/repodata/primary.sqlite.bz2 HTTP/1.1" 200 8497
10.192.198.80 - "" [29/Jul/2011:03:33:16 -0400] "GET /pulp/repos//baseos/x86_64/fed14/os/zsh-4.3.10-5.fc14.x86_64.rpm HTTP/1.1" 200 2240844
10.86.222.208 - "" [29/Jul/2011:03:58:34 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server/5Server/x86_64/os/repodata/repomd.xml HTTP/1.1" 401 497
10.86.222.208 - "" [29/Jul/2011:04:58:34 -0400] "GET /pulp/repos//content/dist/rhel/rhui/server/5Server/x86_64/os/repodata/repomd.xml HTTP/1.1" 401 497

Comment 16 wes hayutin 2011-08-01 21:40:51 UTC
moving to release pending

Comment 17 wes hayutin 2012-05-31 12:57:58 UTC
closing out, product released


Note You need to log in before you can comment on or make changes to this bug.