Bug 947639 - RHN Proxy doesn't work if separated from parent by a slow enough network
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: Server
Version: 1.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Stephen Herr
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Duplicates: 1003888
Depends On: 767443 949648 949651 967114
Blocks: space20 1335589 sat5-errata
Reported: 2013-04-02 22:32 UTC by Stephen Herr
Modified: 2018-12-09 17:09 UTC (History)
CC List: 12 users

Fixed In Version: rhnlib-2.5.64-1 rhn-client-tools-1.10.3-1 yum-rhn-plugin-1.10.1-1 spacewalk-proxy-1.10.1-1 rhncfg-5.10.45-1 spacewalk-backend-1.10.25-1
Doc Type: Bug Fix
Doc Text:
Clone Of: 767443
Environment:
Last Closed: 2013-08-02 13:15:20 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2620871 0 None None None 2018-12-09 17:09:44 UTC

Comment 1 Stephen Herr 2013-04-02 22:38:07 UTC
Description of problem:
We have a Satellite server in our data center in the US. We have proxies in our Amsterdam and Singapore data centers that use the US satellite server as their parent.

On clients if we attempt to run yum makecache we are seeing on RHEL6 errors like these when the clients attempt to download larger repodata files:
Error: failed to retrieve repodata/filelists.xml.gz from rhel-x86_64-server-6 error was [Errno 12] Timeout on https://satproxy01.intranet.prod.int.ams2.redhat.com/XMLRPC/GET-REQ/rhel-x86_64-server-6/repodata/filelists.xml.gz:
(28, 'Operation too slow. Less than 1 bytes/sec transfered the last 60 seconds')

and on RHEL5:
Error: failed to retrieve repodata/other.xml.gz from rhel-x86_64-server-5 error was [Errno 12] Timeout: <urlopen error >

Similar timeouts occur when they attempt to download large rpm's.

What this appears to come down to is that squid must download the entire repodata file or rpm before it can start serving it to the client. Since it can take well over 60 seconds to cross the network from the satellite server to the proxy, the 60-second timeout in /usr/share/yum-plugins/rhnplugin.py causes the client to give up.
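
As a rough illustration of the arithmetic behind this, the sketch below assumes the proxy forwards nothing until it holds the whole file, so the client's idle time equals the proxy's download time. The function name, file size and link speed are hypothetical and chosen only to make the numbers easy to follow.

```python
def client_gives_up(file_bytes, link_bytes_per_sec, client_timeout_sec=60):
    """Return True if a store-and-forward fetch outlasts the client timeout.

    Assumes the client receives no bytes until the proxy has the whole file,
    so the time the client sits idle is the proxy's full download time.
    """
    fetch_seconds = file_bytes / float(link_bytes_per_sec)
    return fetch_seconds > client_timeout_sec


# Hypothetical numbers: a 30 MiB metadata file over a 256 KiB/s link
# needs 120 seconds, which is longer than the 60-second default.
size = 30 * 1024 * 1024
link = 256 * 1024
print(client_gives_up(size, link))        # default 60-second timeout
print(client_gives_up(size, link, 720))   # timeout raised to 720 seconds
```

With these numbers the first call reports a timeout and the second does not, which matches the behaviour seen after raising the value to 720.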

Increasing this number sufficiently (I chose 720) exposes a second timeout issue. Instead of timeouts, the client will start receiving 404 errors like these:
RHEL5: 
Error: failed to retrieve repodata/filelists.xml.gz from rhel-x86_64-server-5 error was [Errno 14] HTTP Error 404: Not Found 

RHEL6: 
Error: failed to retrieve repodata/other.xml.gz from rhel-x86_64-server-6 error was [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404"

Looking at /var/log/httpd/error_log you can get an idea where this is coming from:
[Tue Dec 13 14:22:37 2011] [error] Exception Handler Information
[Tue Dec 13 14:22:37 2011] [error] Traceback (most recent call last):
[Tue Dec 13 14:22:37 2011] [error] File "/usr/share/rhn/proxy/rhnShared.py", line 195, in _serverCommo
[Tue Dec 13 14:22:37 2011] [error] status, headers, bodyFd = self._proxy2server(data)
[Tue Dec 13 14:22:37 2011] [error] File "/usr/share/rhn/proxy/rhnShared.py", line 359, in _proxy2server
[Tue Dec 13 14:22:37 2011] [error] response = http_connection.getresponse()
[Tue Dec 13 14:22:37 2011] [error] File "/usr/lib/python2.6/site-packages/rhn/connections.py", line 129, in getresponse
[Tue Dec 13 14:22:37 2011] [error] response.begin()
[Tue Dec 13 14:22:37 2011] [error] File "/usr/lib64/python2.6/httplib.py", line 391, in begin
[Tue Dec 13 14:22:37 2011] [error] version, status, reason = self._read_status()
[Tue Dec 13 14:22:37 2011] [error] File "/usr/lib64/python2.6/httplib.py", line 349, in _read_status
[Tue Dec 13 14:22:37 2011] [error] line = self.fp.readline()
[Tue Dec 13 14:22:37 2011] [error] File "/usr/lib64/python2.6/socket.py", line 433, in readline
[Tue Dec 13 14:22:37 2011] [error] data = recv(1)
[Tue Dec 13 14:22:37 2011] [error] timeout: timed out

So I added the 'timeout=720' on line 551 of /usr/lib64/python2.6/socket.py and everything started working; I was able to download repodata and rpm's that were consistently failing prior to modifying these timeouts.
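
The "timeout: timed out" at the bottom of the traceback is what Python raises when a socket read exceeds the socket's timeout. A minimal sketch that reproduces it locally is below. It is not the proxy code: the function name is hypothetical, and a local listener that never replies stands in for the slow satellite.

```python
import socket


def read_times_out(timeout_sec):
    """Connect to a local listener that never replies; report whether recv times out."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    # The timeout given here applies to later reads on the socket as well.
    client = socket.create_connection(listener.getsockname(), timeout=timeout_sec)
    try:
        client.recv(1)                # same call as the last frame of the traceback
        return False
    except socket.timeout:            # str() of this exception is "timed out"
        return True
    finally:
        client.close()
        listener.close()


print(read_times_out(0.2))
```

The read fails after 0.2 seconds because no byte ever arrives. Raising the timeout only helps when the peer does eventually answer, as the satellite does in the case described above.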

So, basically what is needed is a way to sanely modify these timeouts via a config file on clients and proxies...

Version-Release number of selected component (if applicable):
5.4.1

How reproducible:
Always

Steps to Reproduce:
1. Setup a satellite server
2. Setup a proxy on the other side of a sufficiently slow connection
3. Run yum update or yum makecache on your client
  
Actual results:
Timeouts

Expected results:
You can download updates as expected

Additional info:

--- Additional comment from Miroslav Suchý on 2011-12-14 03:34:10 EST ---

Quick investigation:
this can be done by setting:
  self.sock.settimeout(CFG['sometimeout'])
in getresponse() or __init__() of HTTPConnection class in connections.py

The trouble is that this is rhnlib code, which knows nothing about CFG, so we have to pass it from rhnShared.py in _proxy2server()
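
A minimal sketch of that idea, assuming Python 3's http.client rather than the Python 2 httplib that rhnlib used at the time. The class, the helper function and the CFG dictionary are hypothetical stand-ins, not the real rhnlib or proxy code. The point is only that the library class accepts a plain number while the caller is the one that reads the configuration.

```python
import http.client

# Hypothetical stand-in for the proxy-side configuration object.
CFG = {"timeout": 720}


class LibraryConnection(http.client.HTTPConnection):
    """Library-side class: takes a plain timeout value and never touches CFG."""

    def __init__(self, host, port=None, timeout=None):
        super().__init__(host, port)
        self.response_timeout = timeout

    def connect(self):
        super().connect()
        # Apply the caller-supplied timeout to every later read on the socket.
        # None means the socket blocks indefinitely.
        self.sock.settimeout(self.response_timeout)


def proxy2server_connection(host, port, cfg):
    """Caller-side helper: reads the config and hands the value down."""
    return LibraryConnection(host, port, timeout=cfg["timeout"])


# Constructing the object does not open a network connection.
conn = proxy2server_connection("satellite.example.com", 80, CFG)
print(conn.response_timeout)
```

Keeping the configuration lookup in the caller means the library stays usable by other programs that have no CFG object at all.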

--- Additional comment from Ricky Nelson on 2013-01-08 14:58:38 EST ---

With BZ 783958 resolved now, want to go ahead and put this through QA again? I don't believe it'll resolve this completely though. The yum.conf modification will allow the client to choose the quicker route, but what about clients that someone has no control over?

It seems to me that the customer wanted a method of configuring RHN Proxy to automatically have a timeout? Part of me thinks that this would mean that we would need to have a lower timeout already in there which does not seem logical to me to put in place. We might break existing environments with that kind of change.

Jason, Can you comment here on what it is exactly that you're looking for from the Proxy side? The client side is doable now, but I'm not sure what we can do from the Proxy side of this.

--- Additional comment from Jason Montleon on 2013-01-08 15:45:06 EST ---

There were two separate timeout issues.

One was between clients and the proxies, which should now be fixed with the 5.9 release, by allowing us to set a timeout in the yum.conf as we are already able to on RHEL 6. I'm not worried about the various groups using the satellites or proxies having to adjust timeouts, as long as it's possible.

The second issue is between the proxies and the satellite server.
https://bugzilla.redhat.com/show_bug.cgi?id=783928
https://bugzilla.redhat.com/show_bug.cgi?id=789092 

I see references all over these bugs to adding timeouts in /etc/sysconfig/up2date, /etc/rhn/rhn.conf, and /etc/yum.conf. I'm not sure how creating a timeout option in yum.conf fixes 783928, unless RHN Proxy is going to honor the timeout set in /etc/yum.conf as well, which possibly seems to be the case looking at https://bugzilla.redhat.com/show_bug.cgi?id=783958. I'm not really clear on how it is supposed to work.

--- Additional comment from Jason Montleon on 2013-01-10 12:36:13 EST ---

I don't think that RHEL 6.3 clients honor the timeout in yum.conf for rhn as has been previously stated.

I had a client receiving the error:
error was [Errno 12] Timeout on https://satproxy01.rdu.redhat.com/XMLRPC/GET-REQ/production-rhel-x86_64-workstation-6/repodata/primary.xml.gz: (28, 'Operation too slow. Less than 1 bytes/sec transfered the last 60 seconds')

I added a timeout in yum.conf and the error did not change.

Modifying the self.timeout in /usr/share/yum-plugins/rhnplugin.py to something greater than 60 once again got it working.

--- Additional comment from Stephen Herr on 2013-04-01 16:22:35 EDT ---

Okay, so there has been a bit of confusion about this issue, especially in previous bugs. To clear it up, there are two problems noted in comment 0. Given a sufficiently slow Proxy -> Hosted / Satellite connection, the user is seeing:

1) yum-rhn-plugin times out. User sees the "less than 1 byte/sec transferred" message on the client.

Bug 789092 is about this issue, should be fixed in the recent releases.

2) Even given that #1 is working, the proxy itself is timing out in its connection to Hosted / Satellite.

I believe this is due to the fact that we hardcode a 120 second timeout for ssl connections in /usr/lib/python2.6/site-packages/rhn/SSL.py on the proxy. This is currently the case for both Proxy 5.4 (the reported version) and Proxy 5.5. A code update to make this value configurable should correct problem #2.

Bug 783928 appears to have been another report of problem #2 that was incorrectly closed as a duplicate of problem #1.

Jason, can you confirm that setting DEFAULT_TIMEOUT in /usr/lib/python2.6/site-packages/rhn/SSL.py (and restarting httpd) on the proxy resolves issue #2?

You are correct in comment 13 that the issue #1 is not fixed in RHEL 6.3. It was fixed in yum-rhn-plugin-0.9.1-41-el6 and later, while -40 is the latest available in the RHEL 6.3 channels.

--- Additional comment from Stephen Herr on 2013-04-02 18:16:49 EDT ---

In the investigation of this bug I found a third potential place where the connection can time out.

On the client, in yum-rhn-plugin, it logs into the server and gets a list of subscribed channels. These requests use the hardcoded SSL.DEFAULT_TIMEOUT of 2 minutes. In order to fix this we'd have to do more work in yum-rhn-plugin (which I am not doing as part of this bug).

I do not think this is actually a problem, however. Requests to get metadata files time out on a slow network because the files are large, the Proxy needs to get them before it can send them to the Client, and sometimes the Satellite needs to generate them before it can send them to the Proxy. The data loads on these two requests would be minuscule, so they should never really time out even on the slowest networks (unless you are simulating a slow network by just making httpd sleep for a while before answering every request, like I am). I would say let's wait and see if anyone ever hits this before we worry about it.

--- Additional comment from Stephen Herr on 2013-04-02 18:29:19 EDT ---

Note to QA:

The easiest way to test this is to add

"import time; time.sleep(125)"

to the first line of the handle method in /usr/share/rhn/wsgi/wsgiHandler.py *on the Satellite*, and then restart httpd.

To work around the problem noted in comment 19, change DEFAULT_TIMEOUT in /usr/lib/python2.6/site-packages/rhn/SSL.py *on the client* to something larger, say 200. This will also mask problem #1 noted in comment 18, but that's okay because we've already released a fix for that.

You can then proceed to run 'yum clean all' and 'yum clean metadata' *on the client*, and mess with config values *on the proxy* to make sure the proxy is correctly reading the timeout config values.

There is now a timeout = 120 option in /usr/share/rhn/config-defaults/rhn_proxy.conf on the proxy, and you can add "proxy.timeout = <whatever>" to /etc/rhn/rhn.conf on the proxy to override that default. If the proxy timeout is larger than the time the satellite is sleeping then it should work fine, just slowly. If not you will see connection errors. You will have to restart httpd on the proxy whenever you change a config value.
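
The default-plus-override behaviour described above can be sketched as follows. This is an illustration only, not the actual Spacewalk configuration code: the parser, the function names and the sample file contents are hypothetical, and only the "timeout = 120" default and the "proxy.timeout" override key come from the note above.

```python
def parse_conf(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve(component, key, defaults_text, override_text):
    """Return the '<component>.<key>' override if present, else the default."""
    defaults = parse_conf(defaults_text)
    overrides = parse_conf(override_text)
    return overrides.get(component + "." + key, defaults.get(key))


# Default as shipped in config-defaults/rhn_proxy.conf.
PROXY_DEFAULTS = "timeout = 120\n"
# Hypothetical override added to /etc/rhn/rhn.conf on the proxy.
RHN_CONF = "# local overrides\nproxy.timeout = 200\n"

SATELLITE_SLEEP = 125  # seconds, as in the QA test above

with_override = int(resolve("proxy", "timeout", PROXY_DEFAULTS, RHN_CONF))
default_only = int(resolve("proxy", "timeout", PROXY_DEFAULTS, ""))

print(with_override, with_override > SATELLITE_SLEEP)   # override outlasts the sleep
print(default_only, default_only > SATELLITE_SLEEP)     # default is too short
```

With the 125-second sleep from the QA note, the overridden value of 200 is long enough for the request to complete, while the 120-second default is not.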

Comment 2 Stephen Herr 2013-04-02 22:44:19 UTC
Spacewalk master: 7e713aa5c3a853ff1703faad2b59768ab399713f

Comment 3 Stephen Herr 2013-04-03 19:00:00 UTC
I changed my mind about not modifying yum-rhn-plugin in this bug; I decided it would be best to make every possible avenue for a timeout (at least from yum-rhn-plugin's perspective) configurable so that this won't bite people again. To that end, there is now a new commit in Spacewalk master. There is now no need to modify DEFAULT_TIMEOUT on the client or proxy.

b2c9611f28d5339aa74361ea4ff7040a772f33a5

Comment 4 Stephen Herr 2013-04-09 15:38:41 UTC
We also need to update rhncfg and make rhnlib conflict with any older versions of rhncfg:
5d118ad32395d4ac116518d3d88bdc68d51f178d
2b98828a0237431fa368be2737d4e31d973506f5

Comment 5 Stephen Herr 2013-05-02 12:51:56 UTC
Apparently spacewalk-backend needs to be updated too:
b5005ff67e10176e49f4957eb65a0b98c72e2bd9

Comment 6 Stephen Herr 2013-05-02 13:56:31 UTC
Mark new rhnlib as conflicting with old spacewalk-backend:
c79cb9bbc5c4efb8ef46398db37acadb925eaa9e

Comment 7 Stephen Herr 2013-05-23 19:13:45 UTC
Also: 491ecc1a95450c1f8ee24a732c94942f10c9992a

rhnlib does not conflict with spacewalk-backend as long as the following two commits are either both present or both absent:
491ecc1a95450c1f8ee24a732c94942f10c9992a
b5005ff67e10176e49f4957eb65a0b98c72e2bd9
If only 491ecc1 is present, sat-sync returns an error. rhnlib should not conflict with old spacewalk-backend; this was just an error that appeared temporarily in Spacewalk nightly and then disappeared with b5005ff.

Reverting commit in comment 6: 15567d2be34fb145e19f38f8d7cb49a285851dfd

Comment 8 Stephen Herr 2013-06-19 21:00:50 UTC
c524392bce72ff9db4f59c8467690a0e82ecc429

rhnlib update to make timeout work correctly.

Comment 9 Tomáš Kašpárek 2013-08-02 13:15:20 UTC
Fix for this bug is present in Spacewalk 2.0, closing this bug as CURRENTRELEASE.

Comment 10 Gennadii Altukhov 2016-09-13 09:12:51 UTC
*** Bug 1003888 has been marked as a duplicate of this bug. ***

