Bug 688095 - rhn_check hangs forever when sat not available
Summary: rhn_check hangs forever when sat not available
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: rhnlib
Version: 6.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Milan Zázrivec
QA Contact: Martin Minar
URL:
Whiteboard:
Depends On: 630875
Blocks:
Reported: 2011-03-16 10:20 UTC by Miroslav Suchý
Modified: 2016-07-04 00:56 UTC

Fixed In Version: rhnlib-2.5.22-11.el6
Doc Type: Bug Fix
Doc Text:
Due to an error in the rhnlib code, network operations became unresponsive when an HTTP connection to Red Hat Network (RHN) or RHN Satellite became idle. The code has been modified to use a timeout for HTTP connections. Network operations are now terminated after a predefined time interval and can be restarted.
Clone Of: 630875
Environment:
Last Closed: 2011-12-06 16:50:07 UTC
Target Upstream Version:


Links
System: Red Hat Product Errata
ID: RHBA-2011:1665
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: rhnlib bug fix update
Last Updated: 2011-12-06 00:50:18 UTC

Description Miroslav Suchý 2011-03-16 10:20:17 UTC
+++ This bug was initially created as a clone of Bug #630875 +++

Description of problem:
rhn_check, which is triggered by rhnsd, hangs forever if the Satellite server crashes.

Version-Release number of selected component (if applicable):
0.4.20-33.el5_5.2

How reproducible:
Not sure, I'm trying to provoke it again

Steps to Reproduce:
1. Crash the satellite server
2. Wait until the systems get the status "inactive"
3. Start up Satellite server again
  
Actual results:
rhn_check process hangs and does nothing


Expected results:
rhn_check should terminate after a timeout so that rhnsd gets the chance to start rhn_check again and the systems return to the "active" state.

Additional info:
The crash of the satellite server was a strange one: the system was still pingable, but access to the RHN Satellite was no longer possible. The same applied to ssh etc.

After restarting the Satellite server, running lsof on the PID of rhn_check shows an established HTTPS connection to the satellite.

--- Additional comment from jhutar on 2011-01-31 22:10:03 EST ---

QA: This will need more testing. Please make sure you find a way to reproduce it on the OLD version, as this might be important (a system stuck in the "inactive" state forever because of a Satellite crash/restart).

--- Additional comment from luc on 2011-02-10 07:51:11 EST ---

Hi Jan,

It is quite hard to reproduce this. Maybe the best way is to run a fork bomb such as
":(){ :|:& };:" on the satellite; rhn_check then hangs.

I don't think that it is bound to a specific phase of rhn_check.

On a clean shutdown rhn_check bails out with an error message:

* Satellite shut down after rhn_check was started:
server:~# rhn_check 
Error: Server Unavailable. Please try later.

* rhn_check started after the shutdown:
server:~# rhn_check 
Could not retrieve action from <RetryServer for sat.example.com/XMLRPC>.
Possible networking problem?

Thanks,

Luc

--- Additional comment from msuchy on 2011-03-16 06:18:27 EDT ---

Steps to reproduce:
1. Shut down the satellite.
2. In place of the satellite, run:
 nc -l 0.0.0.0 80
or
 nc -l 0.0.0.0 443
3. On the client, run:
 rhn_check

rhn_check will hang forever, waiting for a response.

For the connection we use httplib.HTTPConnection from Python. It accepts a timeout parameter, which will solve this problem. But this timeout parameter was added in Python 2.6, whereas RHEL5 has Python 2.4.
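
The hang and the effect of the timeout parameter can be sketched as follows. This is only an illustration under stated assumptions, not the rhnlib code: it uses Python 3's http.client (the successor of Python 2's httplib), a local listener on an ephemeral port stands in for "nc -l", and the helper names start_silent_listener and request_with_timeout are hypothetical.

```python
import http.client
import socket
import threading
import time


def start_silent_listener():
    """Listen on an ephemeral local port, accept connections, never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listener was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, held


def request_with_timeout(port, timeout):
    """Return "timeout" if no response arrives within `timeout` seconds."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=timeout)
    try:
        conn.request("POST", "/XMLRPC", body=b"")
        conn.getresponse()  # blocks until data arrives or the timeout expires
        return "response"
    except socket.timeout:
        return "timeout"
    finally:
        conn.close()


server, held = start_silent_listener()
port = server.getsockname()[1]
started = time.monotonic()
result = request_with_timeout(port, timeout=1.0)
elapsed = time.monotonic() - started
print(result, round(elapsed, 1))
server.close()
```

Without the timeout argument, the getresponse() call in this sketch would block for as long as the listener keeps the connection open, which matches the behaviour described above.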

So I'm afraid I cannot fix it in RHEL5. I will close this bug and clone it to RHEL6, where a fix is possible.

One note on the fix: it will only help in the situation where the Satellite dies completely but does not close the connection. If it is "just" under heavy load (as suggested in comment #3) and sends at least one byte before the timeout expires, then httplib will not time out.
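
This limitation can be illustrated with a small sketch, again assuming Python 3's http.client rather than the httplib used by rhnlib. The timeout applies to each blocking socket operation, not to the request as a whole, so a server that trickles out one byte at a time keeps the client waiting longer than the timeout without an error. The helper start_drip_server and the delay values are hypothetical.

```python
import http.client
import socket
import threading
import time

# A complete, valid HTTP response (40 bytes) that the server sends slowly.
RESPONSE = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"


def start_drip_server(delay):
    """Accept one connection and send RESPONSE one byte every `delay` seconds."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    def serve_once():
        conn, _ = server.accept()
        with conn:
            # Send each byte immediately instead of letting TCP coalesce them.
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            conn.recv(65536)  # read and ignore the request
            for i in range(len(RESPONSE)):
                time.sleep(delay)
                conn.sendall(RESPONSE[i:i + 1])

    threading.Thread(target=serve_once, daemon=True).start()
    return server


TIMEOUT = 0.5  # timeout for each socket operation, in seconds
server = start_drip_server(delay=0.03)  # 40 bytes * 0.03 s = about 1.2 s in total
conn = http.client.HTTPConnection(
    "127.0.0.1", server.getsockname()[1], timeout=TIMEOUT
)
started = time.monotonic()
conn.request("GET", "/XMLRPC")
response = conn.getresponse()
body = response.read()
elapsed = time.monotonic() - started
conn.close()
server.close()
print(response.status, body, elapsed > TIMEOUT)
```

The whole exchange takes longer than TIMEOUT, yet no timeout is raised, because every individual read returns data in time. Bounding the total duration would need a separate mechanism, such as an overall deadline in the caller.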

Comment 1 Miroslav Suchý 2011-03-16 10:25:38 UTC
Note for developer:
The change will be here:
--- /usr/lib/python2.6/site-packages/rhn/connections.py.orig    2011-03-16 11:37:41.369889498 +0100
+++ /usr/lib/python2.6/site-packages/rhn/connections.py 2011-03-16 11:24:46.604918969 +0100
@@ -64,7 +64,7 @@
     response_class = HTTPResponse
     
     def __init__(self, host, port=None):
-        httplib.HTTPConnection.__init__(self, host, port)
+        httplib.HTTPConnection.__init__(self, host, port, timeout=30)
         self._cb_rs = []
         self._cb_ws = []
         self._cb_ex = []

The change must be made in all connection classes in this module. And of course, it would be nice to be able to set the timeout in the /etc/sysconfig/rhn/up2date config file. However, the rhnlib package has no way to read this file, so the timeout value will need to be propagated from the code in the rhn-client-tools packages that uses the rhn.connections module (this may go through several layers).
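
One way the propagation could look is sketched below. This is an assumption-laden illustration, not the change that was committed: it uses Python 3's http.client instead of httplib, and the class names, the "timeout" config key, and the helpers timeout_from_config and make_connection are all hypothetical. The idea is that the connection classes only accept a timeout argument, while the caller, which can read the config file, decides the value.

```python
import http.client

DEFAULT_TIMEOUT = 30  # seconds; the literal used in the patch above


class TimeoutHTTPConnection(http.client.HTTPConnection):
    """HTTP connection whose timeout is chosen by the caller."""

    def __init__(self, host, port=None, timeout=DEFAULT_TIMEOUT):
        super().__init__(host, port, timeout=timeout)


class TimeoutHTTPSConnection(http.client.HTTPSConnection):
    """HTTPS variant; every connection class needs the same parameter."""

    def __init__(self, host, port=None, timeout=DEFAULT_TIMEOUT):
        super().__init__(host, port, timeout=timeout)


def timeout_from_config(config, default=DEFAULT_TIMEOUT):
    """Return the timeout from an already-parsed config mapping.

    `config` is a plain dict supplied by the caller; the "timeout" key is a
    hypothetical option name. Missing or invalid values fall back to `default`.
    """
    raw = config.get("timeout")
    if raw is None:
        return default
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    return value if value > 0 else default


def make_connection(host, config, use_ssl=False):
    """Caller-side factory: read the timeout once and pass it down."""
    cls = TimeoutHTTPSConnection if use_ssl else TimeoutHTTPConnection
    return cls(host, timeout=timeout_from_config(config))


conn = make_connection("sat.example.com", {"timeout": "120"})
print(conn.timeout)
```

Constructing the connection objects does not open a socket, so the sketch can be run without a reachable server.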

Comment 2 Milan Zázrivec 2011-08-08 11:22:53 UTC
spacewalk.git master: 6c2a93bcb7efa9873aee956f0cf7355177d4cc59
satellite.git CLIENT-RHEL-6: ae670a56be85d4ef50720d829dda71066640e88c

Comment 4 Milan Zázrivec 2011-08-08 15:07:47 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: A bug in rhnlib code.

Consequence: Network operations would hang forever in cases when connection to RHN / RHN Satellite would be established but idle.

Fix: Establish a timeout for HTTP connections to RHN / RHN Satellite.

Result: Idle HTTP connections would timeout after a predefined time interval.

Comment 5 Martin Minar 2011-08-09 12:28:49 UTC
Verified with rhnlib-2.5.22-11.el6.

Notes:
1. The problem occurs only with the http (port 80) version.
2. Used the "nc -l 0.0.0.0 80" reproducer.
3. The old version didn't time out.
4. New version:
[root@XYZ ~]# time rhn_check -vv
Could not retrieve action from <RetryServer for dell-pe-sc1435-02.rhts.englab.brq.redhat.com/XMLRPC>.
Possible networking problem?

real	2m0.382s
user	0m0.110s
sys	0m0.030s

Comment 6 Miroslav Svoboda 2011-08-26 11:51:15 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,7 +1 @@
-Cause: A bug in rhnlib code.
+Due to an error in the rhnlib code, network operations would have become unresponsive when an HTTP connection to Red Hat Network (RHN) or RHN Satellite became idle. The code has been modified to use timeout for HTTP connections. Network operations are now terminated after predefined time interval and can be restarted.
-
-Consequence: Network operations would hang forever in cases when connection to RHN / RHN Satellite would be established but idle.
-
-Fix: Establish a timeout for HTTP connections to RHN / RHN Satellite.
-
-Result: Idle HTTP connections would timeout after a predefined time interval.

Comment 7 errata-xmlrpc 2011-12-06 16:50:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1665.html

