From Bugzilla Helper:
User-Agent: Opera/9.23 (Windows NT 5.1; U; en)

Description of problem:
There is a race condition that can prevent the satellite server from pushing scheduled actions to its clients. rhn_check uses blocking sockets for its communication with the satellite server. This means it can block forever in a "read" system call on such a socket if no more data arrives. osad allows only one instance of rhn_check to run at a time, so if this happens osad will still receive the push events but will not connect to the satellite to act on them.

Version-Release number of selected component (if applicable):
rhn-check-0.4.13-1.el5.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Schedule some action on the satellite server. Anything will do, but installing a really big package will help to get the timing right.
2. After osad has started safe-rhn-check, which has started rhn_check, wait until rhn_check has established a TCP connection to the satellite server. Once rhn_check has begun downloading the package, interrupt the connection between the client and the server while rhn_check does a "read" on the socket. Yes, this can be very tricky.
3. If rhn_check is blocked in the syscall, wait until the satellite gives up on this connection. At this point rhn_check will never return from this syscall unless it is killed by a signal.

Actual Results:
rhn_check blocks forever in the read syscall. osad will continue to call safe-rhn-check, but safe-rhn-check won't call rhn_check because another instance of rhn_check is already running. Pushing from the satellite to this client no longer works.

Expected Results:
After a few minutes of doing nothing, rhn_check should close the stale connection and either die or retry.

Additional info:
Output from "ps auxf":

root 21615 0.0 0.0  75608  2972 ? S Oct01 0:00 python /usr/sbin/safe-rhn-check
root 21616 0.0 0.4 226148 22072 ? S Oct01 0:00  \_ /usr/bin/python /usr/sbin/rhn_check

Oct01 was 12 days ago.
strace -p 21616
Process 21616 attached - interrupt to quit
read(19, <unfinished ...>
Process 21616 detached

This is the syscall that never returns.

netstat -np | grep 21616
tcp   0  0 10.30.3.204:49284 10.30.3.99:443 ESTABLISHED 21616/python
tcp  38  0 10.30.3.204:49281 10.30.3.99:443 CLOSE_WAIT  21616/python
tcp  38  0 10.30.3.204:49280 10.30.3.99:443 CLOSE_WAIT  21616/python
tcp  38  0 10.30.3.204:49283 10.30.3.99:443 CLOSE_WAIT  21616/python
tcp  38  0 10.30.3.204:49282 10.30.3.99:443 CLOSE_WAIT  21616/python
tcp  38  0 10.30.3.204:49279 10.30.3.99:443 CLOSE_WAIT  21616/python
tcp  38  0 10.30.3.204:49278 10.30.3.99:443 CLOSE_WAIT  21616/python
unix 2 [ ] DGRAM 1827438 21616/python

10.30.3.204 is the IP of this system, 10.30.3.99 is the IP of the satellite. The satellite server doesn't know anything about these connections.

There are several ways to solve this problem. The best would probably be to use non-blocking sockets (which should be the only kind of sockets used by a daemon). Another workaround might be to enable TCP keepalive on the socket (SO_KEEPALIVE) and tune it with the TCP_KEEPIDLE and TCP_KEEPINTVL socket options.
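To illustrate the keepalive workaround suggested above, here is a minimal sketch in Python. It is not taken from rhn_check; the function name enable_keepalive and the default values are assumptions chosen for illustration, and TCP_KEEPCNT is an extra option not mentioned in the report. The TCP_KEEP* options are platform-specific (available on Linux), so the sketch only sets the ones the local socket module exposes.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive for an existing socket (hypothetical helper).

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the kernel drops the connection
    """
    # SO_KEEPALIVE switches the feature on; without it the TCP_KEEP*
    # options below have no effect.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The tuning knobs are not available on every platform, so only
    # set the ones this Python build knows about.
    tuning = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With the illustrative values above, a peer that has silently disappeared would be detected after roughly idle + interval * count seconds, at which point a blocked read returns with an error instead of hanging indefinitely.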
Hi there,

The sanity check in osad that ensures no more than one rhn_check is running at any time is a sane and correct thing to do. If rhn_check is hanging, we need to investigate and determine the cause of the hang. I am moving this from the Satellite product to the RHEL product and propose to investigate further for the rhn_check command (which is shipped with RHEL).
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
User jslagle's account has been closed
Fixed in rev 134967.

Instead of allowing multiple instances of rhn_check to run at the same time, we decided to simply put a timeout on the socket. If a socket operation does not complete within a certain time, the rhn_check process will exit.

I simulated your situation by using VMware and "disconnecting" the network interface while a package download was occurring. In my tests with the fix applied, rhn_check exits; after the network connection is restored, osad re-connects to the satellite server after a few minutes, and the action is then picked up correctly (since the previous rhn_check has exited). Be aware that it can take several minutes (sometimes up to 10) for osad to re-connect.
Also note that this fix was made in yum-rhn-plugin, since that is what rhn_check actually uses to download the package. The fix is therefore only for RHEL 5 (which is what the bug was reported against).
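For readers unfamiliar with socket timeouts, the general technique described in the fix can be sketched as follows. This is not the actual yum-rhn-plugin change; the function names recv_or_give_up and demo, the 0.5 second timeout, and the loopback "silent peer" are all assumptions made up for illustration.

```python
import socket


def recv_or_give_up(sock, nbytes, timeout_seconds):
    """Read from sock, returning None if nothing arrives in time."""
    # With a timeout set, recv() raises socket.timeout instead of
    # blocking forever when the peer stops sending.
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None


def demo():
    """Connect to a local peer that never sends any data."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()  # accepted, but deliberately silent
    try:
        return recv_or_give_up(client, 4096, 0.5)
    finally:
        client.close()
        peer.close()
        listener.close()
```

Here demo() returns None after about half a second rather than hanging. A caller in the position of rhn_check could treat that result as the signal to close the connection and exit, so that the next push from osad can start a fresh instance.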
I set up an action to install eclipse (a big package) and then waited until the client checked in again. Then I yanked the network. rhn_check timed out with the error:

Could not submit to <RetryServer for xmlrpc.rhn.redhat.com/XMLRPC>. Possible networking problem?

No blocking here.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0360.html