+++ This bug was initially created as a clone of Bug #630875 +++

Description of problem:
rhn_check, which is triggered by rhnsd, hangs forever if the satellite server crashed.

Version-Release number of selected component (if applicable):
0.4.20-33.el5_5.2

How reproducible:
Not sure, I'm trying to provoke it again.

Steps to Reproduce:
1. Crash the satellite server
2. Wait until the systems get the status "inactive"
3. Start up the Satellite server again

Actual results:
The rhn_check process hangs and does nothing.

Expected results:
rhn_check should terminate after a timeout to give rhnsd the chance to start rhn_check again -> Systems will get state active again.

Additional info:
The crash of the satellite server was a strange one: the system was pingable, but access to the RHN Satellite was no longer possible, and the same applied to ssh etc. After restarting the Sat Server, an lsof on the PID of rhn_check shows an established https connection to the satellite.

--- Additional comment from jhutar on 2011-01-31 22:10:03 EST ---

QA: This will need more testing. Please make sure you find a way to reproduce this on the OLD version, as this might be important (system stuck in "inactive" state forever because of Satellite crash/restart).

--- Additional comment from luc on 2011-02-10 07:51:11 EST ---

Hi Jan,

It is quite hard to reproduce this. Maybe the best way is to drop a fork bomb like ":(){ :|:& };:" on the satellite; then rhn_check hangs. I don't think that it is bound to a specific phase of rhn_check. On a clean shutdown rhn_check bails out with an error message:

* Satellite shut down after firing rhn_check:
server:~# rhn_check
Error: Server Unavailable. Please try later.

* Firing rhn_check after the shutdown:
server:~# rhn_check
Could not retrieve action from <RetryServer for sat.example.com/XMLRPC>. Possible networking problem?

Thanks,
Luc

--- Additional comment from msuchy on 2011-03-16 06:18:27 EDT ---

Steps to reproduce:
1. Shut down the satellite
2. Instead of the satellite, run:
nc -l 0.0.0.0 80
or
nc -l 0.0.0.0 443
3. On the client, run:
rhn_check

rhn_check will be stuck forever, waiting for a response.

For the connection we use httplib.HTTPConnection from Python. It accepts a timeout parameter, which will solve this problem. But this timeout was added in Python 2.6, whereas RHEL5 has Python 2.4. So I'm afraid I cannot fix it in RHEL5. I will close this bug and clone it to RHEL6, where a fix is possible.

One note on the fix: it will only help in the situation where the Satellite dies completely but does not close the connection. If it is "just" under heavy load (as suggested in #3) and sends at least one byte before the timeout, then httplib will not time out.
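To illustrate the reproducer and the effect of the timeout parameter, here is a minimal sketch. It is not the rhnlib code: it uses Python 3's http.client (the successor of httplib) rather than the Python 2.6 module discussed here, and the helper names start_silent_server and request_with_timeout are made up for this example. The listening socket stands in for "nc -l": it lets clients connect but never sends a reply.

```python
import http.client
import socket


def start_silent_server():
    """Listen on a free localhost port and never reply (stand-in for `nc -l`).

    The kernel completes incoming connections via the listen backlog, so a
    client can connect and send its request, but nothing ever answers.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
    srv.listen(1)
    return srv


def request_with_timeout(port, timeout):
    """Send one request; report whether a response arrived or the read timed out."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=timeout)
    try:
        conn.request("GET", "/XMLRPC")
        conn.getresponse()  # blocks until data arrives or the timeout expires
        return "response"
    except socket.timeout:
        return "timed out"
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_silent_server()
    print(request_with_timeout(server.getsockname()[1], timeout=2))
    server.close()
```

Without the timeout argument, getresponse() would block indefinitely, which matches the hang described in this report. The timeout applies to each blocking socket operation rather than to the request as a whole, which is why a server that keeps trickling bytes would not trigger it.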
Note for developer:
The change will be here:

--- /usr/lib/python2.6/site-packages/rhn/connections.py.orig 2011-03-16 11:37:41.369889498 +0100
+++ /usr/lib/python2.6/site-packages/rhn/connections.py 2011-03-16 11:24:46.604918969 +0100
@@ -64,7 +64,7 @@
     response_class = HTTPResponse
 
     def __init__(self, host, port=None):
-        httplib.HTTPConnection.__init__(self, host, port)
+        httplib.HTTPConnection.__init__(self, host, port, timeout=30)
         self._cb_rs = []
         self._cb_ws = []
         self._cb_ex = []

The change must be made in all classes in this module.

And of course, it would be nice to be able to set the timeout in the /etc/sysconfig/rhn/up2date config file. However, the rhnlib package has no way to read this file. So the timeout value will need to be propagated from the code in the rhn-client-tools packages that use the rhn.connections module (this may go through several layers).
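One way the propagation could look is sketched below. This is only an illustration under assumptions, not the actual rhnlib or rhn-client-tools code: it is written against Python 3's http.client, and the class names, the make_connection helper, and the "timeout" settings key are all hypothetical. The idea is that the connection classes accept a timeout argument with a default, and the caller, which is able to read its own configuration, passes the value down.

```python
import http.client

DEFAULT_TIMEOUT = 30  # seconds; same value as in the patch above


class TimeoutHTTPConnection(http.client.HTTPConnection):
    """Plain HTTP connection whose timeout is chosen by the caller."""

    def __init__(self, host, port=None, timeout=DEFAULT_TIMEOUT):
        super().__init__(host, port, timeout=timeout)


class TimeoutHTTPSConnection(http.client.HTTPSConnection):
    """HTTPS variant; every connection class needs the same parameter."""

    def __init__(self, host, port=None, timeout=DEFAULT_TIMEOUT):
        super().__init__(host, port, timeout=timeout)


def make_connection(host, use_ssl=False, settings=None):
    """Build a connection, taking the timeout from caller-supplied settings.

    `settings` is a plain dict standing in for whatever the client tools read
    from their config file; the "timeout" key is a hypothetical name.
    """
    settings = settings or {}
    timeout = float(settings.get("timeout", DEFAULT_TIMEOUT))
    cls = TimeoutHTTPSConnection if use_ssl else TimeoutHTTPConnection
    return cls(host, timeout=timeout)


if __name__ == "__main__":
    conn = make_connection("sat.example.com", settings={"timeout": "120"})
    print(conn.timeout)  # no network traffic happens until a request is made
```

Keeping the default in the library means callers that never pass a value still get a bounded wait, while the client tools can override it without the library having to parse their config file.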
spacewalk.git master: 6c2a93bcb7efa9873aee956f0cf7355177d4cc59 satellite.git CLIENT-RHEL-6: ae670a56be85d4ef50720d829dda71066640e88c
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause: A bug in rhnlib code. Consequence: Network operations would hang forever in cases when connection to RHN / RHN Satellite would be established but idle. Fix: Establish a timeout for HTTP connections to RHN / RHN Satellite. Result: Idle HTTP connections would timeout after a predefined time interval.
Verified with rhnlib-2.5.22-11.el6.

Notes:
1. The problem occurs only with the http (port 80) version.
2. Used the "nc -l 0.0.0.0 80" reproducer.
3. The old version didn't time out.
4. New version:

[root@XYZ ~]# time rhn_check -vv
Could not retrieve action from <RetryServer for dell-pe-sc1435-02.rhts.englab.brq.redhat.com/XMLRPC>. Possible networking problem?

real 2m0.382s
user 0m0.110s
sys 0m0.030s
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,7 +1 @@
-Cause: A bug in rhnlib code.
+Due to an error in the rhnlib code, network operations would have become unresponsive when an HTTP connection to Red Hat Network (RHN) or RHN Satellite became idle. The code has been modified to use timeout for HTTP connections. Network operations are now terminated after predefined time interval and can be restarted.
-
-Consequence: Network operations would hang forever in cases when connection to RHN / RHN Satellite would be established but idle.
-
-Fix: Establish a timeout for HTTP connections to RHN / RHN Satellite.
-
-Result: Idle HTTP connections would timeout after a predefined time interval.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1665.html