Bug 461274

Summary: Osad leaves connections in CLOSE_WAIT state.
Product: Red Hat Satellite 5 Reporter: Issue Tracker <tao>
Component: OtherAssignee: Jan Pazdziora <jpazdziora>
Status: CLOSED CURRENTRELEASE QA Contact: Garik Khachikyan <gkhachik>
Severity: medium Docs Contact:
Priority: medium    
Version: 511CC: bperkins, cperry, mkoci, mpoole, psklenar, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: sat530 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-10 20:27:55 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 456985, 461291    

Description Issue Tracker 2008-09-05 14:03:40 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2008-09-05 14:03:42 UTC
Hello,

This ticket is a follow up of IT#99018.

-- Description of the problem --

Client system : RHEL5.2 with osad-5.1.0-7.el5 installed.

Satellite : RHN5.1 Satellite running on a RHEL4.7 with  jabberpy-0.5-0.12.rhn.rhel4 installed.

When osad is running, there's an opened connection to Satellite on port 5222. State of this connection is ESTABLISHED. If for any reason client cannot communicate with Satellite on port 5222, then the connection goes into CLOSE_WAIT state. When the network connection is back, osad opens a new connection with jabberd on port 5222, leaving the CLOSE_WAIT connection as it is.

-- How to reproduce --

- register a RHEL5 client to a RHN Satellite.

- make sure that osad is running on the client and jabberd is running on the Satellite.

- check if there's an opened connection on port 5222

# netstat -anp | grep 5222
tcp        0      0 <client IP>:34269            <Sat IP>:5222            ESTABLISHED 2748/python

- switch off jabberd service

#service jabberd stop

- check the network connection state :

# netstat -anp | grep 5222
tcp        1      0 <client IP>:34269            <Sat IP>:5222            CLOSE_WAIT  2748/python

- restart jabberd service and check for connections on port 5222 :

# netstat -anp | grep 5222
tcp        0      0 <client IP>:39093            <Sat IP>:5222            ESTABLISHED 1807/python         
tcp        0      0 <client IP>:34238            <Sat IP>:5222            CLOSE_WAIT  1807/python

We can see that there are two connections to Satellite on port 5222. Each time jabberd will be restarted a new connection will be opened, stacked with the existing ones.
Restarting osad service on client system helps to get rid on these connections.

-- Expected results --

These CLOSE_WAIT connections should be closed, leaving only one opened.
This event sent from IssueTracker by mpoole  [Support Engineering Group]
 issue 217483

Comment 2 Martin Poole 2008-09-05 14:16:36 UTC
Follow-on from BZ#201166

The problem appears to be the lack of exception handling in the jabber_lib.py code in the osad package.

In the process method there is the following,

    log_debug(5, "Reading %s bytes from ssl socket" % self.BLOCK_SIZE)
    try:
        data = self._read(self.BLOCK_SIZE)
    except SSL.SSL.SysCallError, e:
        raise SSLError("OpenSSL error; will retry", str(e))

when I set the log level to 5 and changed the above to the following

    log_debug(5, "Reading %s bytes from ssl socket" % self.BLOCK_SIZE)
    try:
        data = self._read(self.BLOCK_SIZE)
    except:
        log_debug(5, "Error caught:")
        log_debug(5, extract_traceback())

the following is noted in the osad.log file

2008-09-02 18:18:54 jabber_lib.process: before select(); timeout None
2008-09-02 18:18:54 jabber_lib.process: select() returned
2008-09-02 18:18:54 jabber_lib.process: Reading 1024 bytes from ssl socket
2008-09-02 18:18:54 jabber_lib.process: Error caught:
2008-09-02 18:18:54 jabber_lib.process: Traceback (most recent call last):
  File "/usr/share/rhn/osad/jabber_lib.py", line 1037, in process
    data = self._read(self.BLOCK_SIZE)
SysCallError: (-1, 'Unexpected EOF')


on changing the code to be

    log_debug(5, "Reading %s bytes from ssl socket" % self.BLOCK_SIZE)
    try:
        data = self._read(self.BLOCK_SIZE)
    except:
        log_debug(5, "Closing socket")
        self._non_ssl_sock.close()

we get

2008-09-02 18:50:28 jabber_lib.process: Reading 1024 bytes from ssl socket
2008-09-02 18:50:28 jabber_lib.process: Closing socket
2008-09-02 18:50:28 jabber_lib.main: Sleeping 100 seconds
2008-09-02 18:52:08 osad.setup_config: Skipping config setup; counter=1; interval=65
2008-09-02 18:52:08 jabber_lib.setup_connection: Connecting to dhcp-1-221.fab.redhat.com

and the connection restarts cleanly with no lost socket hanging around.

Comment 3 Clifford Perry 2008-09-09 15:15:36 UTC
Adding to Sep Customer backlog tracking bug.

Comment 4 Jan Pazdziora 2008-09-22 08:28:56 UTC
Filed upstream (Spacewalk) bug 463143.

Comment 5 Jan Pazdziora 2008-09-22 08:31:14 UTC
Code change committed in upstream (bug 463143).

Comment 6 Jan Pazdziora 2008-12-02 19:11:38 UTC
With

/mnt/redhat/devel/candidate-trees/Satellite-5.3.0-RHEL5-re20081128.0
/mnt/redhat/devel/candidate-trees/Satellite-5.3.0-RHEL4-re20081128.0

available, moving ON_QA per Cliff's advice.

Comment 7 Jeff Browning 2009-04-27 21:32:13 UTC
Need more information to retest this in 530.

After installing osad on the client and jabberd on the satellite, I followed the instructions in the reproduction steps.

Even though both services were running, I could not get the 5222 port to display any status other than LISTENING.

Comment 8 Jan Pazdziora 2009-04-28 06:34:43 UTC
(In reply to comment #7)
> 
> Even though both services were running, I could not get the 5222 port to
> display any status other than LISTENING.  

LISTENING is weird. LISTENING means the connection was not made at all. So it looks like you do not have the thing configured / setup yet, to even start the reproduction steps.

Please check logs or anything of interest on both ends.

Comment 9 Garik Khachikyan 2009-06-19 08:54:19 UTC
# REOPEN for Satellite-5.3.0-RHEL4-20090616.1

The issue is fixed for a server RHEL5.3 but still exists for server RHEL4.8 installed with a Satellite-5.3.0-RHEL4-20090616.1/tree-x86_64
(the CLOSE_WAIT exists after jabberd restart).

How to reproduce:
1. stop osad on client side.
2. stop jabberd on server side.
3. register client to server.
4. start osad on client side.
5. start jabberd on server side.
6. wait approx. 2 minutes (follow by server netstat-s to see connection established to client)
7. restart jabberd on server side.
8. wait approx. 2 minutes (follow by server netstat-s to see a new connection re-established to client)
9. !!! - on client side two connections are there, one is on CLOSE_WAIT state.

I still have the servers / client to reproduce. Please feel free to refer for accessing to them (avail till next Mon I think).

Versions:
server RHEL5.3 - jabberpy-0.5-0.15.el5 
server RHEL4.8 - jabberpy-0.5-0.15.el4 
client RHEL5.2 - osad-5.1.0-7.el5

Comment 10 Jan Pazdziora 2009-06-19 09:14:00 UTC
The machines are:

sun-v60x-01.rhts.bos.redhat.com - server RHEL4.8
pogo-cn1100-01.rhts.bos.redhat.com - server RHEL 5.3
dell-pem710-01.rhts.eng.bos.redhat.com - client with RHEL Server 5.2 version

Comment 11 Jan Pazdziora 2009-06-19 09:26:03 UTC
Garik, the fix went into the *client* package, osad. The machine dell-pem710-01 still has the original version installed, osad-5.1.0-7.el5. The latest version in RHN Tools channels on webqa is osad-5.9.10-4.el5sat.noarch and they contain the code change. Please test with latest packages, on client as well. Thank you.

Comment 12 Garik Khachikyan 2009-06-19 15:48:06 UTC
# VERIFIED

The osad version osad-5.9.10-4.el5sat.noarch performs quite well with connections closing in jabberd restarts (i.e. no more CLOSE_WAIT-s).

Checked on: RHEL 5.3 (x86_64) and RHEL 4.8 (i386)

Comment 13 Petr Sklenar 2009-08-06 16:10:00 UTC
checked again with

client: x86_64 rhel54-20090805
[root@nec-em14 ~]# netstat -anp | grep 5222
tcp        0      0 10.16.64.156:54457          10.16.65.162:5222           ESTABLISHED 26470/python        

>>satellite: jabberd manually stopped

[root@nec-em14 ~]# netstat -anp | grep 5222
tcp        1      0 10.16.64.156:54457          10.16.65.162:5222           CLOSE_WAIT  26470/python        
[root@nec-em14 ~]# netstat -anp | grep 5222

>>satellite: jabberd manually started

[root@nec-em14 ~]# netstat -anp | grep 5222
tcp      548      0 10.16.64.156:49151          10.16.65.162:5222           ESTABLISHED 26470/python        
[root@nec-em14 ~]# rpm -q osad
osad-5.9.10-5.el5sat

--
>> Osa status ping works

Comment 15 Brandon Perkins 2009-09-10 20:27:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1434.html