+++ This bug was initially created as a clone of Bug #748290 +++ Description of problem: When customer starts all the daemon for our Satellite server, there is an error message while booting the "osa-dispatcher" ----------- Starting spacewalk services... Initializing jabberd processes ... Starting router: [ OK ] Starting sm: [ OK ] Starting c2s: [ OK ] Starting s2s: [ OK ] Starting osa-dispatcher: RHN 31458 2011/09/30 09:28:00 +02:00: ('Traceback (most recent call last):\n File "/usr/share/rhn/osad/jabber_lib.py", line 254, in setup_connection\n c = self._get_jabber_client(js)\n File "/usr/share/rhn/osad/jabber_lib.py", line 311, in _get_jabber_client\n c.connect()\n File "/usr/share/rhn/osad/jabber_lib.py", line 577, in connect\n jabber.Client.connect(self)\n File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 464, in connect\n else: self._sock.connect((self._hostIP, self._port))\n File "<string>", line 1, in connect\nerror: (111, \'Connection refused\')\n',) [ OK ] Starting tomcat5: [ OK ] Starting httpd: [ OK ] Starting Monitoring ... [ OK ] Starting MonitoringScout ... [ OK ] Starting rhn-search... Starting cobbler daemon: [ OK ] Starting RHN Taskomatic... SERVING! Done. Version-Release number of selected component (if applicable): Satellite 5.4.1 osa-dispatcher-5.9.38-1.el5sat How reproducible: Almost always Steps to Reproduce: 1. Run 'rhn-satellite start' 2. When osa-dispatcher starts it displays 'Connection refused' traceback. Actual results: Python traceback on starting osa-dispatcher service. Expected results: No traceback. Additional info: Customer is using an external Oracle database. But same error has been observed on other customer systems when using the embedded database. --- Additional comment from mhuth on 2011-10-23 20:38:04 EDT --- The problem appears to be a timing issue. It seems that the jabberd processes haven't completely finished starting up by the time the osa-dispatcher process starts up and tries to connect to them. ## From the osa-dispatcher.log, osa-dispatcher starts up at 09:33:10: 2011/10/17 09:33:10 +02:00 4369 0.0.0.0: rhnSQL/driver_cx_Oracle.connect('Connecting to database', 'SAT01') 2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.__init__ 2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.print_message('socket error',) 2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.print_message('Could not connect to jabber server', 'vmsat01.prod.cls.local') ### during the osa-dispatcher startup, the initial connection to the jabberd c2s socket/process fails: 2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.setup_connection('ERROR', 'Traceback (most recent call last):\n File "/usr/share/rhn/osad/jabber_lib.py", line 254, in setup_connection\n c = self._get_jabber_client(js)\n File "/usr/share/rhn/osad/jabber_lib.py", line 311, in _get_jabber_client\n c.connect()\n File "/usr/share/rhn/osad/jabber_lib.py", line 577, in connect\n jabber.Client.connect(self)\n File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 464, in connect\n else: self._sock.connect((self._hostIP, self._port))\n File "<string>", line 1, in connect\nerror: (111, \'Connection refused\')\n') 2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.setup_connection('Could not connect to any jabber server',) 2011/10/17 09:33:10 +02:00 4369 0.0.0.0: osad/jabber_lib.main('Unable to connect to jabber servers, sleeping 10 seconds',) ### In the /var/log/messages at that time I can see the jabberd processes also starting at 09:33:10: Oct 17 09:33:10 vmsat01 jabberd/router[4337]: starting up Oct 17 09:33:10 vmsat01 jabberd/router[4337]: process id is 4337, written to /var/lib/jabberd/pid/router.pid Oct 17 09:33:10 vmsat01 jabberd/router[4337]: loaded user table (1 users) Oct 17 09:33:10 vmsat01 jabberd/router[4337]: loaded filters (0 rules) Oct 17 09:33:10 vmsat01 jabberd/router[4337]: [0.0.0.0, port=5347] listening for incoming connections Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: starting up Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: id: vmsat01.prod.cls.local Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: process id is 4344, written to /var/lib/jabberd/pid/sm.pid Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: loading 'db' storage module Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: initialised storage driver 'db' Oct 17 09:33:10 vmsat01 jabberd/sm[4344]: modules search path: /usr/lib64/jabberd Oct 17 09:33:10 vmsat01 jabberd/c2s[4351]: starting up ... Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: module 'help' added to chain 'disco-extend' (order 1 index 18 seq 1) Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: version: jabberd sm 2.2.8 Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: attempting connection to router at 127.0.0.1, port=5347 Oct 17 09:33:11 vmsat01 jabberd/router[4337]: [127.0.0.1, port=45508] connect Oct 17 09:33:11 vmsat01 jabberd/router[4337]: [127.0.0.1, port=45508] authenticated as jabberd@jabberd-router Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: connection to router established Oct 17 09:33:11 vmsat01 jabberd/router[4337]: [vmsat01.prod.cls.local] online (bound to 127.0.0.1, port 45508) Oct 17 09:33:11 vmsat01 jabberd/sm[4344]: ready for sessions ... but unfortunately the jabberd processes don't all finish starting up until 09:33:11 - 1 second after osa-dispatcher. ### Then osa-dispatcher tries again 10 seconds later and successfully connects this time: 2011/10/17 09:33:20 +02:00 4369 0.0.0.0: osad/jabber_lib.__init__ 2011/10/17 09:33:20 +02:00 4369 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', 'vmsat01.prod.cls.local') 2011/10/17 09:33:21 +02:00 4369 0.0.0.0: osad/osa_dispatcher.fix_connection('Upstream notification server started on port', 1290) 2011/10/17 09:33:21 +02:00 4369 0.0.0.0: osad/jabber_lib.process_forever Also, I can see in the output of lsof that osa-dispatcher is connected to the c2s socket/process: osa-dispa 4370 0 7u IPv4 278879 TCP 10.246.0.234:60586->10.246.0.234:xmpp-client (ESTABLISHED) <-- osa-dispatcher connected to c2s c2s 4351 101 7u IPv4 278880 TCP 10.246.0.234:xmpp-client->10.246.0.234:60586 (ESTABLISHED) <-- c2s connected to osa-dispatcher ... so the osa* infrastructure will actually be working, however osa-dispatcher throws that ugly message on startup because the jabberd processes hadn't completed starting when osa-dispatcher initially tried connecting to them. --- Additional comment from mhuth on 2011-10-23 20:51:36 EDT --- Created attachment 529720 [details] Patch to spacewalk-service script to delay the startup of osa-dispatcher service after jabberd In this case, the customer is using an external database, so the oracle service doesn't start on their machine. Normally the oracle service sits between the jabberd and osa-dispatcher services during startup and probably acts like a buffer to delay the osa-dispatcher service startup after the jabberd service (plus osa-dispatcher uses the Oracle database during its own startup). Because there is no oracle service to start, the osa-dispatcher service starts straight after the jabberd service, and we have seen from the logs that the jabberd service startup hasn't completed by the time osa-dispatcher starts. This patch delays the startup of the osa-dispatcher service until after tomcat and httpd, thus giving the jabberd services more time to startup before osa-dispatcher tries to make use of jabberd. Customer verified that making the change to spacewalk-service as per the patch, resolved the issue for them.
Committing to Spacewalk master: bd6f32f5c5ff9ad7c34f4fe6f457818ae2453e56
Moving ON_QA. Packages that address this bugzilla should now be available in yum repos at http://yum.spacewalkproject.org/nightly/
Spacewalk 1.8 has been released: https://fedorahosted.org/spacewalk/wiki/ReleaseNotes18