Escalated to Bugzilla from IssueTracker
Hello, This ticket is a follow up of IT#99018. -- Description of the problem -- Client system : RHEL5.2 with osad-5.1.0-7.el5 installed. Satellite : RHN5.1 Satellite running on a RHEL4.7 with jabberpy-0.5-0.12.rhn.rhel4 installed. When osad is running, there's an opened connection to Satellite on port 5222. State of this connection is ESTABLISHED. If for any reason client cannot communicate with Satellite on port 5222, then the connection goes into CLOSE_WAIT state. When the network connection is back, osad opens a new connection with jabberd on port 5222, leaving the CLOSE_WAIT connection as it is. -- How to reproduce -- - register a RHEL5 client to a RHN Satellite. - make sure that osad is running on the client and jabberd is running on the Satellite. - check if there's an opened connection on port 5222 # netstat -anp | grep 5222 tcp 0 0 <client IP>:34269 <Sat IP>:5222 ESTABLISHED 2748/python - switch off jabberd service #service jabberd stop - check the network connection state : # netstat -anp | grep 5222 tcp 1 0 <client IP>:34269 <Sat IP>:5222 CLOSE_WAIT 2748/python - restart jabberd service and check for connections on port 5222 : # netstat -anp | grep 5222 tcp 0 0 <client IP>:39093 <Sat IP>:5222 ESTABLISHED 1807/python tcp 0 0 <client IP>:34238 <Sat IP>:5222 CLOSE_WAIT 1807/python We can see that there are two connections to Satellite on port 5222. Each time jabberd will be restarted a new connection will be opened, stacked with the existing ones. Restarting osad service on client system helps to get rid on these connections. -- Expected results -- These CLOSE_WAIT connections should be closed, leaving only one opened. This event sent from IssueTracker by mpoole [Support Engineering Group] issue 217483
Follow-on from BZ#201166 The problem appears to be the lack of exception handling in the jabber_lib.py code in the osad package. In the process method there is the following, log_debug(5, "Reading %s bytes from ssl socket" % self.BLOCK_SIZE) try: data = self._read(self.BLOCK_SIZE) except SSL.SSL.SysCallError, e: raise SSLError("OpenSSL error; will retry", str(e)) when I set the log level to 5 and changed the above to the following log_debug(5, "Reading %s bytes from ssl socket" % self.BLOCK_SIZE) try: data = self._read(self.BLOCK_SIZE) except: log_debug(5, "Error caught:") log_debug(5, extract_traceback()) the following is noted in the osad.log file 2008-09-02 18:18:54 jabber_lib.process: before select(); timeout None 2008-09-02 18:18:54 jabber_lib.process: select() returned 2008-09-02 18:18:54 jabber_lib.process: Reading 1024 bytes from ssl socket 2008-09-02 18:18:54 jabber_lib.process: Error caught: 2008-09-02 18:18:54 jabber_lib.process: Traceback (most recent call last): File "/usr/share/rhn/osad/jabber_lib.py", line 1037, in process data = self._read(self.BLOCK_SIZE) SysCallError: (-1, 'Unexpected EOF') on changing the code to be log_debug(5, "Reading %s bytes from ssl socket" % self.BLOCK_SIZE) try: data = self._read(self.BLOCK_SIZE) except: log_debug(5, "Closing socket") self._non_ssl_sock.close() we get 2008-09-02 18:50:28 jabber_lib.process: Reading 1024 bytes from ssl socket 2008-09-02 18:50:28 jabber_lib.process: Closing socket 2008-09-02 18:50:28 jabber_lib.main: Sleeping 100 seconds 2008-09-02 18:52:08 osad.setup_config: Skipping config setup; counter=1; interval=65 2008-09-02 18:52:08 jabber_lib.setup_connection: Connecting to dhcp-1-221.fab.redhat.com and the connection restarts cleanly with no lost socket hanging around.
Adding to Sep Customer backlog tracking bug.
Filed upstream (Spacewalk) bug 463143.
Code change committed in upstream (bug 463143).
With /mnt/redhat/devel/candidate-trees/Satellite-5.3.0-RHEL5-re20081128.0 /mnt/redhat/devel/candidate-trees/Satellite-5.3.0-RHEL4-re20081128.0 available, moving ON_QA per Cliff's advice.
Need more information to retest this in 530. After installing osad on the client and jabberd on the satellite, I followed the instructions in the reproduction steps. Even though both services were running, I could not get the 5222 port to display any status other than LISTENING.
(In reply to comment #7) > > Even though both services were running, I could not get the 5222 port to > display any status other than LISTENING. LISTENING is weird. LISTENING means the connection was not made at all. So it looks like you do not have the thing configured / setup yet, to even start the reproduction steps. Please check logs or anything of interest on both ends.
# REOPEN for Satellite-5.3.0-RHEL4-20090616.1 The issue is fixed for a server RHEL5.3 but still exists for server RHEL4.8 installed with a Satellite-5.3.0-RHEL4-20090616.1/tree-x86_64 (the CLOSE_WAIT exists after jabberd restart). How to reproduce: 1. stop osad on client side. 2. stop jabberd on server side. 3. register client to server. 4. start osad on client side. 5. start jabberd on server side. 6. wait approx. 2 minutes (follow by server netstat-s to see connection established to client) 7. restart jabberd on server side. 8. wait approx. 2 minutes (follow by server netstat-s to see a new connection re-established to client) 9. !!! - on client side two connections are there, one is on CLOSE_WAIT state. I still have the servers / client to reproduce. Please feel free to refer for accessing to them (avail till next Mon I think). Versions: server RHEL5.3 - jabberpy-0.5-0.15.el5 server RHEL4.8 - jabberpy-0.5-0.15.el4 client RHEL5.2 - osad-5.1.0-7.el5
The machines are: sun-v60x-01.rhts.bos.redhat.com - server RHEL4.8 pogo-cn1100-01.rhts.bos.redhat.com - server RHEL 5.3 dell-pem710-01.rhts.eng.bos.redhat.com - client with RHEL Server 5.2 version
Garik, the fix went into the *client* package, osad. The machine dell-pem710-01 still has the original version installed, osad-5.1.0-7.el5. The latest version in RHN Tools channels on webqa is osad-5.9.10-4.el5sat.noarch and they contain the code change. Please test with latest packages, on client as well. Thank you.
# VERIFIED The osad version osad-5.9.10-4.el5sat.noarch performs quite well with connections closing in jabberd restarts (i.e. no more CLOSE_WAIT-s). Checked on: RHEL 5.3 (x86_64) and RHEL 4.8 (i386)
checked again with client: x86_64 rhel54-20090805 [root@nec-em14 ~]# netstat -anp | grep 5222 tcp 0 0 10.16.64.156:54457 10.16.65.162:5222 ESTABLISHED 26470/python >>satellite: jabberd manually stopped [root@nec-em14 ~]# netstat -anp | grep 5222 tcp 1 0 10.16.64.156:54457 10.16.65.162:5222 CLOSE_WAIT 26470/python [root@nec-em14 ~]# netstat -anp | grep 5222 >>satellite: jabberd manually started [root@nec-em14 ~]# netstat -anp | grep 5222 tcp 548 0 10.16.64.156:49151 10.16.65.162:5222 ESTABLISHED 26470/python [root@nec-em14 ~]# rpm -q osad osad-5.9.10-5.el5sat -- >> Osa status ping works
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-1434.html