331001 – osad stops working if rhn_check blocks forever

Summary: osad stops working if rhn_check blocks forever

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	yum-rhn-plugin
Sub Component:
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	John Matthews
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-10-13 21:45 UTC by Sven Trenkel
Modified:	2008-05-21 14:27 UTC
CC List:	2 users
Fixed In Version:	RHBA-2008-0360
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-21 14:27:24 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0360	0	normal	SHIPPED_LIVE	yum-rhn-plugin bug fix update	2008-05-20 12:45:04 UTC

Description Sven Trenkel 2007-10-13 21:45:05 UTC

From Bugzilla Helper:
User-Agent: Opera/9.23 (Windows NT 5.1; U; en)

Description of problem:
There is a race condition that can prevent the satellite server from pushing 
scheduled actions to its clients.

rhn_check uses blocking sockets for its communication with the satellite 
server. This means that it can block forever in a "read" system call on this 
socket if no more data arrives. osad allows only one instance of rhn_check to 
run at a time, so if this happens osad will still receive the push events but 
no new rhn_check will connect to the satellite to act on them.

Version-Release number of selected component (if applicable):
rhn-check-0.4.13-1.el5.noarch

How reproducible:
Sometimes


Steps to Reproduce:
1. Schedule some action on the satellite server. Anything will do but 
installing a really big package will help to get the timing right.
2. After osad has started safe-rhn-check which has started rhn_check, wait till 
rhn_check has established a tcp connection to the satellite server. Once 
rhn_check has begun downloading the package, interrupt the connection between 
the client and the server while rhn_check does a "read" on the socket. Yes this 
can be very tricky.
3. If rhn_check is blocked in the syscall, wait until the satellite gives up on 
this connection. At this point rhn_check will never return from this syscall 
unless it's killed by a signal.

Actual Results:
rhn_check blocks forever in the read syscall. osad will continue to call 
safe-rhn-check but safe-rhn-check won't call rhn_check because another instance 
of rhn_check is already running. Pushing from the satellite to this client no 
longer works.

Expected Results:
After a few minutes of doing nothing rhn_check should close the stale 
connection and either die or retry.

Additional info:
Output from "ps auxf":

root     21615  0.0  0.0  75608  2972 ?        S    Oct01   0:00 python /usr/sbin/safe-rhn-check
root     21616  0.0  0.4 226148 22072 ?        S    Oct01   0:00  \_ /usr/bin/python /usr/sbin/rhn_check

Oct01 was 12 days ago.

strace -p 21616         
Process 21616 attached - interrupt to quit
read(19,  <unfinished ...>
Process 21616 detached

This is the syscall that never returns.


netstat -np | grep 21616
tcp        0      0 10.30.3.204:49284           10.30.3.99:443              ESTABLISHED 21616/python
tcp       38      0 10.30.3.204:49281           10.30.3.99:443              CLOSE_WAIT  21616/python
tcp       38      0 10.30.3.204:49280           10.30.3.99:443              CLOSE_WAIT  21616/python
tcp       38      0 10.30.3.204:49283           10.30.3.99:443              CLOSE_WAIT  21616/python
tcp       38      0 10.30.3.204:49282           10.30.3.99:443              CLOSE_WAIT  21616/python
tcp       38      0 10.30.3.204:49279           10.30.3.99:443              CLOSE_WAIT  21616/python
tcp       38      0 10.30.3.204:49278           10.30.3.99:443              CLOSE_WAIT  21616/python
unix  2      [ ]         DGRAM                    1827438 21616/python        

10.30.3.204 is the IP of this system, 10.30.3.99 is the IP of the satellite.
The satellite server doesn't know anything about these connections.


There are several ways to solve this problem. The best would probably be to use 
non-blocking sockets (which should be the only kind of sockets used by a 
daemon). Another workaround might be to enable the tcp keepalive feature by 
setting the TCP_KEEPIDLE and TCP_KEEPINTVL socket options.
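
For illustration, the keepalive workaround could look roughly like the sketch below. It is not taken from rhn_check; the function name enable_keepalive and the timing values are made up for this example. Note that SO_KEEPALIVE has to be switched on as well, otherwise the TCP_KEEP* tuning options have no effect. Once the probes go unanswered, the kernel fails the connection and a blocked read returns with an error instead of hanging.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive probes on sock.

    Hypothetical helper; the default values are illustrative only.
    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is dropped
    """
    # Keepalive is off by default and must be enabled explicitly.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The TCP_KEEP* constants are platform-specific (available on Linux),
    # so only set the ones this platform exposes.
    tuning = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s, idle=30, interval=5, count=3)
    print("keepalive enabled:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

With these example values a dead peer would be noticed after roughly idle + interval * count seconds.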

Comment 1 Clifford Perry 2007-10-16 15:27:04 UTC

Hi there, 
The sanity check of osad to make sure that not more than one rhn_check is
running at any time is a sane and correct thing to do. If rhn_check is hanging,
we need to investigate and determine the cause of rhn_check hanging. 

I am moving this from the Satellite product and onto the RHEL product and
propose to investigate further for the rhn_check command (which is shipped with
RHEL).

Comment 2 RHEL Program Management 2007-10-16 15:35:04 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Red Hat Bugzilla 2007-10-26 00:53:43 UTC

User jslagle's account has been closed

Comment 4 RHEL Program Management 2007-11-03 01:35:17 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Justin Sherrill 2007-12-17 17:13:50 UTC

Fixed in rev 134967.


Instead of allowing multiple instances of rhn_check to run at the same time, we
decided to simply put a timeout on the socket.  If a socket operation does not
complete within a certain time, the rhn_check process will exit.  
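
The actual change in rev 134967 is not shown in this report. As a rough sketch of the general technique only (the names read_with_timeout and demo are invented for this example, and it returns None where the fix described above exits the process), a socket timeout turns an endless blocking read into an error the caller can handle:

```python
import socket


def read_with_timeout(host, port, timeout, nbytes=4096):
    """Connect and read once; return None if nothing arrives in `timeout` seconds.

    Hypothetical helper, not the yum-rhn-plugin code.
    """
    # The timeout passed here applies to connect() and to later recv() calls.
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Without a timeout this recv() would block indefinitely.
        return None
    finally:
        sock.close()


def demo():
    """Read from a local listener that never sends any data."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    server.listen(1)               # kernel completes the handshake; nothing is sent
    port = server.getsockname()[1]
    try:
        return read_with_timeout("127.0.0.1", port, timeout=0.5)
    finally:
        server.close()


if __name__ == "__main__":
    print("data received:", demo())
```

The half-second timeout only keeps the demo short; a real client would pick a value large enough to tolerate slow but healthy transfers.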

I simulated your situation by using VMWare and "disconnecting" the network
interface while a package download was occurring.  In my tests with the fix
applied, rhn_check exits, and once the network connection is restored osad
re-connects to the satellite server after a few minutes and the action is then
picked up correctly (since the previous rhn_check has exited).  Be aware that it
can take several minutes (sometimes up to 10) for osad to re-connect.

Comment 6 Justin Sherrill 2007-12-17 17:15:48 UTC

Also note that this fix was made in yum-rhn-plugin, since that is what
rhn_check uses to download the package, and so this fix is only for RHEL 5
(which is what it was reported against).

Comment 8 Cameron Meadors 2008-04-29 18:28:49 UTC

I set up an action to install eclipse (big package) and then waited until the
client checked in again.  Then yanked the network.  rhn_check timed out with the
error:

Could not submit to <RetryServer for xmlrpc.rhn.redhat.com/XMLRPC>.
Possible networking problem?

No blocking here.

Comment 10 errata-xmlrpc 2008-05-21 14:27:24 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0360.html
