Bug 808580

Summary: Starting osa-dispatcher sometimes encounters 'Connection refused' python traceback.
Product: [Community] Spacewalk Reporter: Stephen Herr <sherr>
Component: ServerAssignee: Stephen Herr <sherr>
Status: CLOSED CURRENTRELEASE QA Contact: Red Hat Satellite QA List <satqe-list>
Severity: low Docs Contact:
Priority: medium    
Version: 1.8CC: agouny, cperry, mhuth, slukasik, xdmoon
Target Milestone: ---Keywords: EasyFix, Patch
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 748290 Environment:
Last Closed: 2012-11-01 16:21:54 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 748290, 834374    
Bug Blocks: 871344    

Description Stephen Herr 2012-03-30 18:42:17 UTC
+++ This bug was initially created as a clone of Bug #748290 +++

Description of problem:
When customer starts all the daemon for our Satellite server, there is an error message while booting the "osa-dispatcher"

-----------
Starting spacewalk services...
Initializing jabberd processes ...
Starting router:                                           [  OK  ]
Starting sm:                                               [  OK  ]
Starting c2s:                                              [  OK  ]
Starting s2s:                                              [  OK  ]
Starting osa-dispatcher: RHN 31458 2011/09/30 09:28:00 +02:00: ('Traceback (most recent call last):\n  File "/usr/share/rhn/osad/jabber_lib.py", line 254, in setup_connection\n    c = self._get_jabber_client(js)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 311, in _get_jabber_client\n    c.connect()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 577, in connect\n    jabber.Client.connect(self)\n  File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 464, in connect\n    else: self._sock.connect((self._hostIP, self._port))\n  File "<string>", line 1, in connect\nerror: (111, \'Connection refused\')\n',)
                                                           [  OK  ]
Starting tomcat5:                                          [  OK  ]
Starting httpd:                                            [  OK  ]
Starting Monitoring ...  
[ OK ]
Starting MonitoringScout ...  
[ OK ]
Starting rhn-search...
Starting cobbler daemon:                                   [  OK  ]
Starting RHN Taskomatic...
SERVING!
Done.

Version-Release number of selected component (if applicable):
Satellite 5.4.1
osa-dispatcher-5.9.38-1.el5sat

How reproducible:
Almost always

Steps to Reproduce:
1. Run 'rhn-satellite start'
2. When osa-dispatcher starts it displays 'Connection refused' traceback.
 
Actual results:
Python traceback on starting osa-dispatcher service.

Expected results:
No traceback.

Additional info:
Customer is using an external Oracle database.  But same error has been observed on other customer systems when using the embedded database.

--- Additional comment from mhuth on 2011-10-23 20:38:04 EDT ---

The problem appears to be a timing issue.  It seems that the jabberd processes haven't completely finished starting up by the time the osa-dispatcher process starts up and tries to connect to them.

## From the osa-dispatcher.log, osa-dispatcher starts up at 09:33:10:
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: rhnSQL/driver_cx_Oracle.connect('Connecting to database', 'SAT01')
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.__init__
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.print_message('socket error',)
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.print_message('Could not connect to jabber server', 'vmsat01.prod.cls.local')

### during the osa-dispatcher startup, the initial connection to the jabberd c2s socket/process fails:
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.setup_connection('ERROR', 'Traceback (most recent call last):\n  File "/usr/share/rhn/osad/jabber_lib.py", line 254, in setup_connection\n    c = self._get_jabber_client(js)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 311, in _get_jabber_client\n    c.connect()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 577, in connect\n    jabber.Client.connect(self)\n  File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 464, in connect\n    else: self._sock.connect((self._hostIP, self._port))\n  File "<string>", line 1, in connect\nerror: (111, \'Connection refused\')\n')
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.setup_connection('Could not connect to any jabber server',)
2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.main('Unable to connect to jabber servers, sleeping 10 seconds',)

### In the /var/log/messages at that time I can see the jabberd processes also starting at 09:33:10:
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: starting up
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: process id is 4337, written to /var/lib/jabberd/pid/router.pid
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: loaded user table (1 users)
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: loaded filters (0 rules)
Oct 17 09:33:10 vmsat01 jabberd/router[4337]: [0.0.0.0, port=5347] listening for incoming connections
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: starting up
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: id: vmsat01.prod.cls.local
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: process id is 4344, written to /var/lib/jabberd/pid/sm.pid
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: loading 'db' storage module
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: initialised storage driver 'db'
Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: modules search path: /usr/lib64/jabberd
Oct 17 09:33:10 vmsat01 jabberd/c2s[4351]: starting up
...
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: module 'help' added to chain 'disco-extend' (order 1 index 18 seq 1)
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: version: jabberd sm 2.2.8
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: attempting connection to router at 127.0.0.1, port=5347
Oct 17 09:33:11 vmsat01 jabberd/router[4337]: [127.0.0.1, port=45508] connect
Oct 17 09:33:11 vmsat01 jabberd/router[4337]: [127.0.0.1, port=45508] authenticated as jabberd@jabberd-router
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: connection to router established
Oct 17 09:33:11 vmsat01 jabberd/router[4337]: [vmsat01.prod.cls.local] online (bound to 127.0.0.1, port 45508)
Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: ready for sessions
... but unfortunately the jabberd processes don't all finish starting up until 09:33:11 - 1 second after osa-dispatcher.

### Then osa-dispatcher tries again 10 seconds later and successfully connects this time:
2011/10/17 09:33:20 +02:00 4369 0.0.0.0: osad/jabber_lib.__init__
2011/10/17 09:33:20 +02:00 4369 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', 'vmsat01.prod.cls.local')
2011/10/17 09:33:21 +02:00 4369 0.0.0.0: osad/osa_dispatcher.fix_connection('Upstream notification server started on port', 1290)
2011/10/17 09:33:21 +02:00 4369 0.0.0.0: osad/jabber_lib.process_forever


Also, I can see in the output of lsof that osa-dispatcher is connected to the c2s socket/process:
osa-dispa  4370        0    7u     IPv4             278879                 TCP 10.246.0.234:60586->10.246.0.234:xmpp-client (ESTABLISHED)  <-- osa-dispatcher connected to c2s
c2s        4351      101    7u     IPv4             278880                 TCP 10.246.0.234:xmpp-client->10.246.0.234:60586 (ESTABLISHED)  <-- c2s connected to osa-dispatcher
... so the osa* infrastructure will actually be working, however osa-dispatcher throws that ugly message on startup because the jabberd processes hadn't completed starting when osa-dispatcher initially tried connecting to them.

--- Additional comment from mhuth on 2011-10-23 20:51:36 EDT ---

Created attachment 529720 [details]
Patch to spacewalk-service script to delay the startup of osa-dispatcher service after jabberd

In this case, the customer is using an external database, so the oracle service doesn't start on their machine.  Normally the oracle service sits between the jabberd and osa-dispatcher services during startup and probably acts like a buffer to delay the osa-dispatcher service startup after the jabberd service (plus osa-dispatcher uses the Oracle database during its own startup).  Because there is no oracle service to start, the osa-dispatcher service starts straight after the jabberd service, and we have seen from the logs that the jabberd service startup hasn't completed by the time osa-dispatcher starts.

This patch delays the startup of the osa-dispatcher service until after tomcat and httpd, thus giving the jabberd services more time to startup before osa-dispatcher tries to make use of jabberd.

Customer verified that making the change to spacewalk-service as per the patch, resolved the issue for them.

Comment 1 Stephen Herr 2012-03-30 18:45:56 UTC
Committing to Spacewalk master: bd6f32f5c5ff9ad7c34f4fe6f457818ae2453e56

Comment 3 Jan Pazdziora (Red Hat) 2012-10-30 19:26:11 UTC
Moving ON_QA. Packages that address this bugzilla should now be available in yum repos at http://yum.spacewalkproject.org/nightly/

Comment 4 Jan Pazdziora (Red Hat) 2012-11-01 16:21:54 UTC
Spacewalk 1.8 has been released: https://fedorahosted.org/spacewalk/wiki/ReleaseNotes18