Hi,

RHN osad does not seem to clean up connections/sockets properly after the connection gets aborted for some reason; it does not clean up before opening a new connection.

The versions used for this:
  osad-0.9-2.rhel3 (also tested with osad-0.9-5.rhel3)
  jabberpy-0.5-0.7.rhn.rhel3

If we get disconnected from the network a few times we will get:

[root@lthpjstst rhn]# netstat -np | grep :5222
tcp        0      0 10.230.244.51:32955    10.230.52.10:5222      CLOSE_WAIT  27607/python
tcp        0      0 10.230.244.51:32953    10.230.52.10:5222      CLOSE_WAIT  27607/python
tcp        0      0 10.230.244.51:32959    10.230.52.10:5222      ESTABLISHED 27607/python
tcp        0      0 10.230.244.51:32957    10.230.52.10:5222      CLOSE_WAIT  27607/python
[root@lthpjstst rhn]# ps -ef | grep osad
root     27607     1  0 13:15 pts/0    00:00:00 python /usr/sbin/osad --pid-file /var/run/osad.pid
root     28713 15029  0 13:23 pts/0    00:00:00 grep osad
[root@lthpjstst rhn]# rpm -q osad
osad-0.9-5.rhel3
[root@lpgace11a root]# uname -a
Linux lpgace11a 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686 i686 i386 GNU/Linux
[root@lpgace11a root]#

The problem is always reproducible. To reproduce the issue we just need to restart the services a few times, leaving some time in between to allow osad to sleep and retry the jabber connection.

Please let us know if you require more information.
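For illustration, here is a minimal, hypothetical Python sketch of the leak pattern being reported; the class and names are made up for this example, not taken from the osad source. A client that keeps a reference to its old socket and reconnects without closing it leaves the server-closed descriptor stuck in CLOSE_WAIT:

import socket
import time

JABBER_HOST = "10.230.52.10"   # server address taken from the netstat output above
JABBER_PORT = 5222

class LeakyClient:
    # Stands in for the jabber client; it keeps references to its old
    # sockets (as the real client effectively does), so Python never
    # garbage-collects and closes them.
    def __init__(self):
        self.sock = None
        self._stale = []

    def connect(self):
        if self.sock is not None:
            # BUG: the previous socket is neither shut down nor closed.
            # Once the server drops its end, this fd sits in CLOSE_WAIT
            # for the life of the process.
            self._stale.append(self.sock)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect((JABBER_HOST, JABBER_PORT))

client = LeakyClient()
client.connect()
while True:
    try:
        if not client.sock.recv(4096):       # empty read: peer closed the stream
            raise socket.error("peer closed")
    except socket.error:
        time.sleep(150)                      # osad's sleep-and-retry interval
        client.connect()                     # reconnect without any cleanup

Each pass through the except branch adds one CLOSE_WAIT socket, which matches the netstat output above: one ESTABLISHED connection plus one stale entry per reconnect.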
*** Bug 203731 has been marked as a duplicate of this bug. ***
Was able to get the osad client to drop the zombie socket when we run out of servers to connect to, but it kills the service as well, introduces a dangling pid file, and can't reconnect since it's dead. Need some more time on this one. Moving to sat510 triage due to time constraints.
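For what it's worth, a hedged sketch of the shape such a fix could take; the run(), connect(), process(), and disconnect() names here are assumptions for this example, not the shipped patch. The idea is to keep the retry loop alive after dropping the stale socket, and to remove the pid file only when the process genuinely exits:

import atexit
import os
import time

PID_FILE = "/var/run/osad.pid"   # path taken from the ps output above

def _remove_pid_file():
    try:
        os.unlink(PID_FILE)      # no dangling pid file once we really exit
    except OSError:
        pass

atexit.register(_remove_pid_file)

def run(client, servers, retry_sleep=150):
    # 'client' is any object with connect/process/disconnect methods.
    while True:                  # never let "out of servers" kill the daemon
        for server in servers:
            try:
                client.connect(server)
                client.process()          # blocks until the connection drops
            except Exception:
                pass                      # treat any failure as "try the next server"
            finally:
                client.disconnect()       # always release the socket before moving on
        time.sleep(retry_sleep)  # exhausted the list: sleep, then start over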
This is what is going on: when osad gets into this state, it calls jabber.Client.disconnected(self), which in turn calls xmlstream.Client.disconnect. The disconnect method tries to close the connection and then the socket, but only if the process is not alive. In our case the process is always alive, since we go into the sleep state instead, so the older ports are left in CLOSE_WAIT.

client># while true; do echo; date; netstat -npt | grep -i 5222; sleep 150; done

Wed Oct 3 14:19:42 EDT 2007
tcp        0      0 10.10.76.162:33133     10.10.76.168:5222      ESTABLISHED 30378/python
tcp        0      0 10.10.76.162:33111     10.10.76.168:5222      TIME_WAIT   -

Wed Oct 3 14:22:12 EDT 2007
tcp        0      0 10.10.76.162:33133     10.10.76.168:5222      ESTABLISHED 30378/python

Wed Oct 3 14:24:42 EDT 2007
tcp        0      0 10.10.76.162:33133     10.10.76.168:5222      ESTABLISHED 30378/python

Wed Oct 3 14:27:12 EDT 2007
tcp        0      0 10.10.76.162:33133     10.10.76.168:5222      ESTABLISHED 30378/python
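A minimal sketch of the cleanup this analysis points to, assuming a jabberpy-style client that holds its transport in a _sock attribute (the names are illustrative, not the actual patch): close the socket on disconnect regardless of whether the process is still alive.

import socket

class Client:
    def __init__(self):
        self._sock = None

    def disconnect(self):
        # The buggy path skipped the close whenever the process was still
        # alive, which is exactly the state osad sleeps in between retries.
        # Close unconditionally instead.
        if self._sock is not None:
            try:
                self._sock.shutdown(socket.SHUT_RDWR)   # signal the peer we are done
            except socket.error:
                pass                                    # peer may already be gone
            self._sock.close()                          # always release the fd
            self._sock = None                           # drop the last reference

Shutting down before closing lets the peer see an orderly FIN, and releasing the descriptor on every disconnect keeps the retry loop from accumulating CLOSE_WAIT sockets.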
Forgot to mention: the netstat loop shown in the previous comment was run after adding the fix. As we can see, there are no CLOSE_WAIT connections.
verified build 47

[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 0.0.0.0:5222           0.0.0.0:*              LISTEN
tcp        0      0 10.10.76.189:5222      10.10.76.189:32789     ESTABLISHED
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      ESTABLISHED
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     ESTABLISHED
[root@rlx-3-18 ~]# /etc/init.d/rhn-satellite stop
Shutting down rhn-satellite...
Stopping rhn-search...
Stopped rhn-search.
Stopping satellite-httpd: audit(1200931575.079:14): avc: denied { unlink } for pid=2720 comm="httpd" name="jk-runtime-status.2720.lock" dev=dm-0 ino=6357181 scontext=user_u:system_r:httpd_t tcontext=user_u:object_r:httpd_log_t tclass=file
[  OK  ]
waiting for processes to exit
waiting for processes to exit
Stopping RHN Taskomatic...
Stopped RHN Taskomatic.
Shutting down osa-dispatcher:  [  OK  ]
Shutting down rhn-database:    [  OK  ]
Shutting down Jabber router:   [  OK  ]
Done.
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]#
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
tcp        0      0 10.10.76.189:32789     10.10.76.189:5222      TIME_WAIT
tcp        0      0 10.10.76.189:5222      10.10.76.182:42740     FIN_WAIT2
[root@rlx-3-18 ~]# netstat -an | grep 5222
[root@rlx-3-18 ~]# netstat -an | grep 5222
[root@rlx-3-18 ~]#
Looks good. Tested by using the netstat commands above and bringing down the rhn-satellite service; no CLOSE_WAIT states appear.
Satellite 5.1 is GA, so Closed for Current Release.