Bug 808580 - Starting osa-dispatcher sometimes encounters 'Connection refused' python traceback.
Summary: Starting osa-dispatcher sometimes encounters 'Connection refused' python trac...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: Server
Version: 1.8
Hardware: All
OS: Linux
medium
low
Target Milestone: ---
Assignee: Stephen Herr
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On: 748290 834374
Blocks: space18
TreeView+ depends on / blocked
 
Reported: 2012-03-30 18:42 UTC by Stephen Herr
Modified: 2012-11-01 16:21 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 748290
Environment:
Last Closed: 2012-11-01 16:21:54 UTC
Embargoed:


Attachments (Terms of Use)

Description Stephen Herr 2012-03-30 18:42:17 UTC
+++ This bug was initially created as a clone of Bug #748290 +++

Description of problem:
When customer starts all the daemon for our Satellite server, there is an error message while booting the "osa-dispatcher"

-----------
Starting spacewalk services...
Initializing jabberd processes ...
Starting router:                                           [  OK  ]
Starting sm:                                               [  OK  ]
Starting c2s:                                              [  OK  ]
Starting s2s:                                              [  OK  ]
Starting osa-dispatcher: RHN 31458 2011/09/30 09:28:00 +02:00: ('Traceback (most recent call last):\n  File "/usr/share/rhn/osad/jabber_lib.py", line 254, in setup_connection\n    c = self._get_jabber_client(js)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 311, in _get_jabber_client\n    c.connect()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 577, in connect\n    jabber.Client.connect(self)\n  File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 464, in connect\n    else: self._sock.connect((self._hostIP, self._port))\n  File "<string>", line 1, in connect\nerror: (111, \'Connection refused\')\n',)
                                                           [  OK  ]
Starting tomcat5:                                          [  OK  ]
Starting httpd:                                            [  OK  ]
Starting Monitoring ...  
[ OK ]
Starting MonitoringScout ...  
[ OK ]
Starting rhn-search...
Starting cobbler daemon:                                   [  OK  ]
Starting RHN Taskomatic...
SERVING!
Done.

Version-Release number of selected component (if applicable):
Satellite 5.4.1
osa-dispatcher-5.9.38-1.el5sat

How reproducible:
Almost always

Steps to Reproduce:
1. Run 'rhn-satellite start'
2. When osa-dispatcher starts it displays 'Connection refused' traceback.
 
Actual results:
Python traceback on starting osa-dispatcher service.

Expected results:
No traceback.

Additional info:
Customer is using an external Oracle database.  But same error has been observed on other customer systems when using the embedded database.

--- Additional comment from mhuth on 2011-10-23 20:38:04 EDT ---

The problem appears to be a timing issue.  It seems that the jabberd processes haven't completely finished starting up by the time the osa-dispatcher process starts up and tries to connect to them.

## From the osa-dispatcher.log, osa-dispatcher starts up at 09:33:10:
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: rhnSQL/driver_cx_Oracle.connect('Connecting to database', 'SAT01')
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.__init__
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.print_message('socket error',)
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.print_message('Could not connect to jabber server', 'vmsat01.prod.cls.local')

### during the osa-dispatcher startup, the initial connection to the jabberd c2s socket/process fails:
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.setup_connection('ERROR', 'Traceback (most recent call last):\n  File "/usr/share/rhn/osad/jabber_lib.py", line 254, in setup_connection\n    c = self._get_jabber_client(js)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 311, in _get_jabber_client\n    c.connect()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 577, in connect\n    jabber.Client.connect(self)\n  File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 464, in connect\n    else: self._sock.connect((self._hostIP, self._port))\n  File "<string>", line 1, in connect\nerror: (111, \'Connection refused\')\n')
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.setup_connection('Could not connect to any jabber server',)
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.main('Unable to connect to jabber servers, sleeping 10 seconds',)

### In the /var/log/messages at that time I can see the jabberd processes also starting at 09:33:10:
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: starting up
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: process id is 4337, written to /var/lib/jabberd/pid/router.pid
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: loaded user table (1 users)
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: loaded filters (0 rules)
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: [0.0.0.0, port=5347] listening for incoming connections
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: starting up
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: id: vmsat01.prod.cls.local
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: process id is 4344, written to /var/lib/jabberd/pid/sm.pid
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: loading 'db' storage module
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: initialised storage driver 'db'
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: modules search path: /usr/lib64/jabberd
Oct 17 09:33:10 vmsat01 jabberd/c2s[4351]: starting up
...
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: module 'help' added to chain 'disco-extend' (order 1 index 18 seq 1)
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: version: jabberd sm 2.2.8
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: attempting connection to router at 127.0.0.1, port=5347
Oct 17 09:33:11 vmsat01 jabberd/router[4337]: [127.0.0.1, port=45508] connect
Oct 17 09:33:11 vmsat01 jabberd/router[4337]: [127.0.0.1, port=45508] authenticated as jabberd@jabberd-router
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: connection to router established
Oct 17 09:33:11 vmsat01 jabberd/router[4337]: [vmsat01.prod.cls.local] online (bound to 127.0.0.1, port 45508)
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: ready for sessions
... but unfortunately the jabberd processes don't all finish starting up until 09:33:11 - 1 second after osa-dispatcher.

### Then osa-dispatcher tries again 10 seconds later and successfully connects this time:
2011/10/17 09:33:20 +02:00 4369 0.0.0.0: osad/jabber_lib.__init__
2011/10/17 09:33:20 +02:00 4369 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', 'vmsat01.prod.cls.local')
2011/10/17 09:33:21 +02:00 4369 0.0.0.0: osad/osa_dispatcher.fix_connection('Upstream notification server started on port', 1290)
2011/10/17 09:33:21 +02:00 4369 0.0.0.0: osad/jabber_lib.process_forever


Also, I can see in the output of lsof that osa-dispatcher is connected to the c2s socket/process:
osa-dispa  4370        0    7u     IPv4             278879                 TCP 10.246.0.234:60586->10.246.0.234:xmpp-client (ESTABLISHED)  <-- osa-dispatcher connected to c2s
c2s        4351      101    7u     IPv4             278880                 TCP 10.246.0.234:xmpp-client->10.246.0.234:60586 (ESTABLISHED)  <-- c2s connected to osa-dispatcher
... so the osa* infrastructure will actually be working, however osa-dispatcher throws that ugly message on startup because the jabberd processes hadn't completed starting when osa-dispatcher initially tried connecting to them.

--- Additional comment from mhuth on 2011-10-23 20:51:36 EDT ---

Created attachment 529720 [details]
Patch to spacewalk-service script to delay the startup of osa-dispatcher service after jabberd

In this case, the customer is using an external database, so the oracle service doesn't start on their machine.  Normally the oracle service sits between the jabberd and osa-dispatcher services during startup and probably acts like a buffer to delay the osa-dispatcher service startup after the jabberd service (plus osa-dispatcher uses the Oracle database during its own startup).  Because there is no oracle service to start, the osa-dispatcher service starts straight after the jabberd service, and we have seen from the logs that the jabberd service startup hasn't completed by the time osa-dispatcher starts.

This patch delays the startup of the osa-dispatcher service until after tomcat and httpd, thus giving the jabberd services more time to startup before osa-dispatcher tries to make use of jabberd.

Customer verified that making the change to spacewalk-service as per the patch, resolved the issue for them.

Comment 1 Stephen Herr 2012-03-30 18:45:56 UTC
Committing to Spacewalk master: bd6f32f5c5ff9ad7c34f4fe6f457818ae2453e56

Comment 3 Jan Pazdziora 2012-10-30 19:26:11 UTC
Moving ON_QA. Packages that address this bugzilla should now be available in yum repos at http://yum.spacewalkproject.org/nightly/

Comment 4 Jan Pazdziora 2012-11-01 16:21:54 UTC
Spacewalk 1.8 has been released: https://fedorahosted.org/spacewalk/wiki/ReleaseNotes18


Note You need to log in before you can comment on or make changes to this bug.