Red Hat Bugzilla – Bug 331001
osad stops working if rhn_check blocks forever
Last modified: 2008-05-21 10:27:24 EDT
Description of problem:
There is a race condition that can prevent the satellite server from pushing
scheduled actions to its clients.
rhn_check uses blocking sockets for its communication with the satellite
server. This means that it's possible that it will block forever in a "read"
system call on this socket if no more data arrives. osad allows only one
instance of rhn_check to run at a time, so if this happens osad will still
receive the push events but will never connect to the satellite to act on them.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Schedule some action on the satellite server. Anything will do but
installing a really big package will help to get the timing right.
2. After osad has started safe-rhn-check which has started rhn_check, wait till
rhn_check has established a tcp connection to the satellite server. Once
rhn_check has begun downloading the package, interrupt the connection between
the client and the server while rhn_check does a "read" on the socket. Yes this
can be very tricky.
3. If rhn_check is blocked in the syscall, wait until the satellite gives up on
this connection. At this point rhn_check will never return from this syscall
unless it's killed by a signal.
rhn_check blocks forever in the read syscall. osad will continue to call
safe-rhn-check, but safe-rhn-check won't call rhn_check because another instance of
rhn_check is already running. Pushing from the satellite to this client no
longer works.
After a few minutes of doing nothing rhn_check should close the stale
connection and either die or retry.
Output from "ps auxf":
root 21615 0.0 0.0 75608 2972 ? S Oct01 0:00 python /usr/
root 21616 0.0 0.4 226148 22072 ? S Oct01 0:00 \_ /usr/bin/
Oct01 was 12 days ago.
strace -p 21616
Process 21616 attached - interrupt to quit
read(19, <unfinished ...>
Process 21616 detached
This is the syscall that never returns.
netstat -np | grep 21616
tcp 0 0 10.30.3.204:49284 10.30.3.99:443
tcp 38 0 10.30.3.204:49281 10.30.3.99:443
tcp 38 0 10.30.3.204:49280 10.30.3.99:443
tcp 38 0 10.30.3.204:49283 10.30.3.99:443
tcp 38 0 10.30.3.204:49282 10.30.3.99:443
tcp 38 0 10.30.3.204:49279 10.30.3.99:443
tcp 38 0 10.30.3.204:49278 10.30.3.99:443
unix 2 [ ] DGRAM 1827438 21616/python
10.30.3.204 is the IP of this system, 10.30.3.99 is the IP of the satellite.
The satellite server doesn't know anything about these connections.
There are several ways to solve this problem. The best would probably be to use
non-blocking sockets (which should be the only kind of sockets used by a
daemon). Another workaround might be to enable the TCP keepalive feature
(SO_KEEPALIVE) and tune it with the TCP_KEEPIDLE and TCP_KEEPINTVL socket
options.
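The keepalive workaround could look roughly like the sketch below. This is only an illustration, not code from rhn_check: the function name enable_keepalive and the idle/interval/count values are assumptions, and the TCP_KEEP* options are Linux-specific, so they are guarded with hasattr. Note that SO_KEEPALIVE has to be switched on as well, since the TCP_KEEP* options only tune the probes.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive probes on an existing socket (hypothetical helper).

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the kernel drops the connection
    """
    # Keepalive is off by default; the TCP_KEEP* options below only tune it.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants exist on Linux; skip them on platforms without them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With these example values, a peer that has silently gone away would be detected after roughly idle + interval * count seconds, at which point the blocked read fails with an error instead of hanging forever.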
The sanity check in osad that makes sure no more than one rhn_check is
running at any time is a sane and correct thing to do. If rhn_check is hanging, we
need to investigate and determine the cause of rhn_check hanging.
I am moving this from the Satellite product to the RHEL product and
propose to investigate further for the rhn_check command (which is shipped with
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
Fixed in rev 134967.
Instead of allowing multiple instances of rhn_check to run at the same time, we
decided to simply put a timeout on the socket. If a socket operation does not
complete within a certain time, the rhn_check process will exit.
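For readers unfamiliar with socket timeouts, the idea can be sketched as below. This is not the code from rev 134967; the helper name recv_or_give_up and the timeout value are assumptions made for illustration only.

```python
import socket

def recv_or_give_up(sock, nbytes, timeout_seconds):
    """Read up to nbytes, or return None if no data arrives in time.

    Hypothetical helper: a real client would probably log the failure
    and exit rather than return None.
    """
    # After settimeout(), blocking calls raise socket.timeout instead of
    # waiting forever when the peer has silently gone away.
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None

if __name__ == "__main__":
    # A connected pair where the far end never sends anything stands in
    # for the dead satellite connection.
    near, far = socket.socketpair()
    print(recv_or_give_up(near, 16, 0.5))
    near.close()
    far.close()
```

The caller gets control back after the timeout and can exit, which lets the next safe-rhn-check invocation start a fresh rhn_check.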
I simulated your situation by using VMware and "disconnecting" the network
interface while a package download was in progress. In my tests with the fix
applied, rhn_check exits. After the network connection is restored, osad
reconnects to the satellite server within a few minutes, and the action is then
picked up correctly (since the previous rhn_check has exited). Be aware that it
can take several minutes (sometimes up to 10) for osad to reconnect.
Also note that the fix was made in yum-rhn-plugin, since that is what
rhn_check uses to download the package, so this fix is only for RHEL 5
(which is what the bug was reported against).
I set up an action to install eclipse (big package) and then waited until the
client checked in again. Then I yanked the network. rhn_check timed out with the
error:
Could not submit to <RetryServer for xmlrpc.rhn.redhat.com/XMLRPC>.
Possible networking problem?
No blocking here.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.