Bug 716064
Summary: | osa-dispatcher not notifying | |
---|---|---|---
Product: | [Community] Spacewalk | Reporter: | JDavis4102
Component: | Server | Assignee: | Milan Zázrivec <mzazrivec>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Red Hat Satellite QA List <satqe-list>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 1.4 | CC: | david, helzerr, JDavis4102, jpazdziora, luc, mheldebr, mzazrivec, rhbugzilla, rich.shappard, sander.grendelman, v.motoska
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | osad-5.10.44-1 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | 662593 | Environment: |
Last Closed: | 2012-11-01 16:19:03 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 871344 | |
Description
JDavis4102
2011-06-23 16:08:10 UTC
Aligning under space16.

Determined the following code line has allowed remote commands to continue to work. This is by no means a fix, just a workaround, but it may help in resolving the root issue.

In /usr/share/rhn/osad/osa_dispatcher.py, at line 170, add the following under:

    else:
        log_debug(5, "Not notifying jabber nodes")

Add this line at 170:

    self._tcp_server.notify_jabber_nodes()

(In reply to comment #2)
> Determined the following code line has allowed remote commands to continue to
> work. This is by no means a fix as this is just a workaround but may help in
> resolving the root issue.
>
> Line 170 in /usr/share/rhn/osad/osa_dispatcher.py add the following under:
> else:
> log_debug(5,"Not notifying jabber nodes")
>
> Add this line at 170 -->> self._tcp_server.notify_jabber_nodes()

When you apply this change on your installation, does it reveal anything about the root cause?

Jan,

The only thing adding this line shows is that the jabber services sm and c2s fail to restart after a while. I have to remove the jabber DB to get them to come back. I also have a system that tries to connect but does not show as online in the Spacewalk UI, even after pinging. This has only started to show up since adding that line. All other clients seem to be up and running with that line of code added.

Any update as to a resolution for this issue? Adding the line above (comment 3) has resolved the issue, but it seems this is only a workaround and does not actually resolve the root cause. Any idea as to what this could be?

(In reply to comment #4)
> Jan,
>
> The only thing it shows adding this line is the jabber service sm and c2s fails
> to restart after a while. I have to remove the jabber DB to get it to come

What is on the console, what is in the log, what does strace reveal?

> back. Also I have a system that trys to connect but is unable to display online
> in Spacewalk UI even after pinging. This only has started to show up after
> adding that line.
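The control flow the workaround touches can be sketched as a stand-alone stub. The `FakeDispatcher` and `FakeTcpServer` classes below are invented for illustration; the real logic lives in /usr/share/rhn/osad/osa_dispatcher.py, and only the attribute name `_tcp_server` and the method `notify_jabber_nodes()` come from the bug report.

```python
# Minimal sketch of the branch the workaround changes. These classes are
# illustrative stubs, not the actual osad implementation.

def log_debug(level, msg):
    pass  # stand-in for osad's logger


class FakeTcpServer:
    """Stub standing in for the dispatcher's jabber-facing server."""
    def __init__(self):
        self.notifications = 0

    def notify_jabber_nodes(self):
        self.notifications += 1


class FakeDispatcher:
    def __init__(self):
        self._tcp_server = FakeTcpServer()

    def process_once(self, have_client_data):
        if have_client_data:
            log_debug(5, "Reading client data")
        else:
            log_debug(5, "Not notifying jabber nodes")
            # The reporter's workaround: force a notification even in the
            # branch explicitly marked "Not notifying jabber nodes".
            self._tcp_server.notify_jabber_nodes()


d = FakeDispatcher()
d.process_once(have_client_data=False)
print(d._tcp_server.notifications)  # 1
```

As the Spacewalk team points out later in the thread, notifying from a branch deliberately labeled "Not notifying jabber nodes" masks the symptom rather than fixing the starvation that routes execution there.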
> All other clients seem to be up and running with adding that

So that change caused a regression. What exactly changed in your environment since bug 662593 comment 42, when you reported the issue as resolved? And: can you please rephrase what the reproducer steps are, exactly, as the word "issue" is morphing its meaning with each new comment.

The problem we are having is that clients are able to ping but unable to pick up actions without performing the following:

On the Spacewalk Application server:

    service osa-dispatcher stop
    service jabberd stop
    rm -f /var/lib/jabberd/db/*
    service jabberd start
    service osa-dispatcher start

On the Spacewalk Proxy server:

    service jabberd stop
    rm -f /var/lib/jabberd/db/*
    service jabberd start

On the clients:

    service osad stop
    rm -f /etc/sysconfig/rhn/osad-auth.conf
    service osad start

Once the above commands have been run, the client is able to receive actions again for a short time and then goes back to not working. This happens randomly, and I am unable to reproduce it in my Development and Test environments. Keep in mind that we have over 4000 clients connected to Spacewalk and about 18 proxy servers.

Adding the line listed below has provided a workaround to keep the service up and running.

In /usr/share/rhn/osad/osa_dispatcher.py, at line 170, add the following under:

    else:
        log_debug(5, "Not notifying jabber nodes")

Add this line at 170:

    self._tcp_server.notify_jabber_nodes()

The other issues noted in my last comment have been resolved by other means and are not related to this issue. Sorry for the confusion on this.
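The server-side reset steps above can be wrapped in a small script so they can be previewed before execution. This is a sketch of the reporter's procedure only; the Spacewalk team states elsewhere in this bug that you should never need to remove the auth db, so treat it as a workaround, not a fix. The `run` helper and `DRY_RUN` variable are additions for safe preview.

```shell
#!/bin/sh
# Sketch of the server-side reset procedure from the comment above.
# DRY_RUN defaults to 1 (print the commands only); set DRY_RUN=0 to
# actually execute them on a Spacewalk application server.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

run service osa-dispatcher stop
run service jabberd stop
run rm -f /var/lib/jabberd/db/*
run service jabberd start
run service osa-dispatcher start
```

The proxy and client procedures differ only in the service list and the file removed (/etc/sysconfig/rhn/osad-auth.conf on clients).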
(In reply to comment #7)
> The problem we are having is that clients are able to ping but unable to pick
> up actions without performing the following:
>
> On the Spacewalk Application server:
>
> service osa-dispatcher stop
> service jabberd stop
> rm -f /var/lib/jabberd/db/*
> service jabberd start
> service osa-dispatcher start
>
> On the Spacewalk Proxy server:
>
> service jabberd stop
> rm -f /var/lib/jabberd/db/*
> service jabberd start
>
> On the clients,
>
> service osad stop
> rm -f /etc/sysconfig/rhn/osad-auth.conf
> service osad start
>
> Once the above commands have been ran the client is able to receive actions
> again for a short time and then goes back to not working. This happens randomly
> and unable to reproduce in my Development and Test environments. Keep in mind
> that we have over 4000 clients connected to Spacewalk and about 18 proxy
> servers.
>
> Adding the line listed below has provided a workaround in order to keep the
> service up and going.
>
> Line 170 in /usr/share/rhn/osad/osa_dispatcher.py add the following under:
> else:
> log_debug(5,"Not notifying jabber nodes")
>
> Add this line at 170 -->> self._tcp_server.notify_jabber_nodes()

Alright. The message from the Spacewalk team is: you should *never* need to remove the auth db, osad-auth.conf, or do similar things. It might work for you for a while, but it does not address the root cause of the problems.

We should also not add the self._tcp_server.notify_jabber_nodes() call to a branch in the code which is specifically marked as "Not notifying jabber nodes". Thus, we need to figure out why the code execution gets to this branch and not to the other branches where it would actually notify the clients.

As I mentioned earlier, we will need logs, strace outputs, and console outputs. Also, output of lsof will help determine what the programs are doing.

By the way, which Spacewalk version is this, and what client versions do you use?
As for the 4000 clients -- yes, the number might be the cause of the problems. We might be hitting the limit on the number of open files or something similar here. But we need to *know* if this really is the case, or if something else is suboptimal in the code or in the setup. So again, we will need much more detailed information from both the server and client side, from the time when this situation appears, to investigate and fix it.

Sincerely,

Jan

Please also note that the nightly client repo has an osad package with fixes for bug 691847. While it's hard to say if these issues are related or not, you might want to upgrade to the latest osad packages just to debug the same code upstream is working with.

I will update to the latest current release and let you know. We can't go to nightly as this is a production environment.

(In reply to comment #10)
> I will update to latest current release and let you know. We can't go to
> nightly as this is a production environment.

Any updates from the Spacewalk 1.6 situation?

I am unable to upgrade to 1.6 at this time due to production restrictions. I am unable to update my Development and Test environments, as these environments were not seeing the issue. I will update this bug report once I am able to update to 1.6/current release. The update should happen around the April/May time frame.

I have upgraded to Spacewalk 1.6, both server and clients, and I have the same issue. The clients can be pinged but the osa-dispatcher never notifies the clients.
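The open-files theory raised above can be checked with standard tools. A quick sketch, assuming the usual jabberd daemon names (c2s, sm, router); lsof may require root to inspect another user's processes:

```shell
#!/bin/sh
# Report the file-descriptor limit for the current shell. On a Spacewalk
# server, compare this against the number of descriptors jabberd holds.

echo "fd limit for this shell: $(ulimit -n)"

# Uncomment on the server (requires lsof and running jabberd daemons):
# for p in c2s sm router; do
#     pid=$(pidof "$p") && echo "$p ($pid): $(lsof -p "$pid" | wc -l) open files"
# done
```

If the counted descriptors approach the limit with ~4000 clients connected, raising the limit for jabberd would be the next experiment.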
I enabled debugging in my osa-dispatcher.log and I think I can see why nothing is being notified of pending actions:

    2012/01/19 14:47:23 -05:00 21008 0.0.0.0: rhnSQL/driver_postgresql._execute_wrapper('Executing SQL: "\n select id, name, shared_key, jabber_id\n from rhnPushClient\n where state_id = %(online_id)s\n and last_ping_time is not null\n and next_action_time is null\n and jabber_id is not null\n " with bind params: {online_id: 1}',)

I have looked in the rhnPushClient table and I see that all active clients have a last_ping_time set to NULL. For a brief period of time after a manual ping from the web UI, the last_ping_time value is present with a correct time stamp, but then it is reset to NULL shortly afterwards. Whatever is setting the NULL value in last_ping_time is preventing the query from finding actions for clients to notify.

Also, scheduling an update via the web UI is not setting next_action_time to a value other than NULL. Scheduling an action and then manually pinging put values into both next_action_time and last_ping_time, which were both cleared by some other process shortly afterwards.
I'm sure this was working with the base 1.6 release. The packages that I updated before osad stopped working are:

    Jan 17 12:46:40 Updated: spacewalk-base-minimal-1.6.38-1.el5.noarch
    Jan 17 12:46:40 Updated: spacewalk-java-config-1.6.102-1.el5.noarch
    Jan 17 12:46:40 Updated: spacewalk-java-lib-1.6.102-1.el5.noarch
    Jan 17 12:46:40 Updated: spacewalk-java-postgresql-1.6.102-1.el5.noarch
    Jan 17 12:46:47 Updated: libxml2-2.6.26-2.1.12.el5_7.2.x86_64
    Jan 17 12:46:47 Updated: libxml2-python-2.6.26-2.1.12.el5_7.2.x86_64
    Jan 17 12:46:48 Updated: spacewalk-taskomatic-1.6.102-1.el5.noarch
    Jan 17 12:46:50 Updated: spacewalk-java-1.6.102-1.el5.noarch
    Jan 17 12:46:51 Updated: kernel-headers-2.6.18-274.17.1.el5.x86_64
    Jan 17 12:46:51 Updated: spacewalk-html-1.6.38-1.el5.noarch
    Jan 17 12:47:04 Installed: kernel-2.6.18-274.17.1.el5.x86_64
    Jan 17 12:47:11 Installed: kernel-devel-2.6.18-274.17.1.el5.x86_64
    Jan 17 12:47:11 Updated: spacewalk-pxt-1.6.38-1.el5.noarch
    Jan 17 12:47:12 Updated: spacewalk-base-1.6.38-1.el5.noarch
    Jan 17 12:47:12 Updated: spacewalk-grail-1.6.38-1.el5.noarch
    Jan 17 12:47:12 Updated: spacewalk-sniglets-1.6.38-1.el5.noarch
    Jan 19 11:52:32 Installed: python-lxml-2.0.11-1.el5.x86_64

(In reply to comment #13)
> I have upgraded to spacewalk 1.6 both server and clients and I have the same
> issue. The clients can be pinged but the osa dispatcher never notifies the
> clients.
>
> I enabled debugging in my osa-dispatcher.log and I think I can see why nothing
> is being notified of pending actions:
>
> 2012/01/19 14:47:23 -05:00 21008 0.0.0.0:
> rhnSQL/driver_postgresql._execute_wrapper('Executing SQL: "\n select id,
> name, shared_key, jabber_id\n from rhnPushClient\n where
> state_id = %(online_id)s\n and last_ping_time is not null\n
> and next_action_time is null\n and jabber_id is not null\n " with
> bind params: {online_id: 1}',)
>
> I have looked in the rhnPushClient table and I see that all active clients have
> a last_ping_time set to NULL.
> For a brief period of time after a manual ping
> from the web ui the last_ping_time value is present with a correct time stamp,
> but then this is reset to NULL shortly afterwards. Whatever is setting the
> NULL value in last_ping_time is preventing the query from finding actions for
> clients to notify.
>
> Also, scheduling an update is not setting next_action_time to a value other
> than NULL using the web ui. Scheduling an action and then manually pinging put
> values into both the next_action_time and last_ping_time which were both
> cleared by some other process shortly afterwards.

I don't think your observation is correct. The SQL query you point out is

    _query_fetch_clients_to_be_pinged = rhnSQL.Statement("""
        select id, name, shared_key, jabber_id
          from rhnPushClient
         where state_id = :online_id
           and last_ping_time is not null
           and next_action_time is null
           and jabber_id is not null
    """)

So obviously the last_ping_time is set, and when the ping is sent, it is cleared -- that's why you see it there for a while, before osa-dispatcher picks it up and acts on it. I don't think it's related to your notification problem.

What osad version do you have on the clients? Please make sure you have at least version 5.10.30-1.

Clients are at osad-5.10.34-1.el5. Server is:

    osa-dispatcher-5.10.34-1.el5
    osa-dispatcher-selinux-5.10.34-1.el5

Upgraded to Spacewalk 1.7. Clients are still not picking up actions but update their status when pinged. I had to remove /var/lib/jabberd/db/* due to sm complaining after my upgrade:

    Mar 15 13:29:25 server jabberd/sm[31368]: db: couldn't open environment: DB_VERSION_MISMATCH: Database environment version mismatch
    Mar 15 13:29:25 server jabberd/sm[31368]: initialisation of storage driver 'db' failed

Clients are at osad-5.10.41-1.el5. Server is:

    osa-dispatcher-selinux-5.10.41-1.el5
    osa-dispatcher-5.10.41-1.el5

What's in the osa-dispatcher.log file?
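The NULL semantics of the ping query Jan quotes above can be reproduced outside Spacewalk. This toy sqlite3 model keeps only the columns the query inspects (the real rhnPushClient schema is richer), and shows that a row whose last_ping_time has been cleared back to NULL is simply not a candidate for pinging, rather than evidence of a bug:

```python
import sqlite3

# Toy model of rhnPushClient: only the columns the osa-dispatcher
# ping query filters on. Schema simplified for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""create table rhnPushClient (
    id integer, name text, state_id integer,
    last_ping_time text, next_action_time text, jabber_id text)""")
conn.executemany(
    "insert into rhnPushClient values (?,?,?,?,?,?)",
    [
        # Ping in flight: last_ping_time set, next_action_time NULL.
        (1, "client-a", 1, "2012-01-19 14:47:00", None, "a@jabber"),
        # Idle client: last_ping_time already cleared back to NULL.
        (2, "client-b", 1, None, None, "b@jabber"),
    ])

rows = conn.execute("""
    select id, name from rhnPushClient
    where state_id = :online_id
      and last_ping_time is not null
      and next_action_time is null
      and jabber_id is not null
""", {"online_id": 1}).fetchall()
print(rows)  # [(1, 'client-a')] -- only the in-flight ping matches
```

This matches Jan's explanation: last_ping_time being NULL for most clients is the steady state, not the cause of the missing notifications.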
Pinged server before update:

    2012/03/23 08:51:35 -05:00 4646 0.0.0.0: osad/dispatcher_client._message_callback('Ping response',)

I scheduled the update here; nothing is logged and the update is still waiting. Pinged the server after about 8 minutes to verify it was still listening:

    2012/03/23 08:59:13 -05:00 4646 0.0.0.0: osad/dispatcher_client._message_callback('Ping response',)

This is the default log level. What log level would you like me to try logging at for more output?

I am not sure if this is related, but I will go ahead and post it in case it helps. With my added temp fix above, I now get the following error since I upgraded to 1.6:

    2012/03/28 16:26:08 -07:00 3785 0.0.0.0: osad/jabber_lib.main('ERROR', 'Error caught:')
    2012/03/28 16:26:08 -07:00 3785 0.0.0.0: osad/jabber_lib.main('ERROR', 'Traceback (most recent call last):
      File "/usr/share/rhn/osad/jabber_lib.py", line 120, in main
        self.process_forever(c)
      File "/usr/share/rhn/osad/jabber_lib.py", line 178, in process_forever
        self.process_once(client)
      File "/usr/share/rhn/osad/osa_dispatcher.py", line 178, in process_once
        client.retrieve_roster()
      File "/usr/share/rhn/osad/jabber_lib.py", line 714, in retrieve_roster
        stanza = self.get_one_stanza()
      File "/usr/share/rhn/osad/jabber_lib.py", line 786, in get_one_stanza
        self.process(timeout=tm)
      File "/usr/share/rhn/osad/jabber_lib.py", line 1048, in process
        self._parser.Parse(data)
      File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 269, in unknown_endtag
        self.dispatch(self._mini_dom)
      File "/usr/share/rhn/osad/jabber_lib.py", line 814, in _orig_dispatch
        jabber.Client.dispatch(self, stanza)
      File "/usr/lib/python2.4/site-packages/jabber/jabber.py", line 290, in dispatch
        else: handler['func'](self,stanza)
      File "/usr/share/rhn/osad/jabber_lib.py", line 369, in dispatch
        callback(client, stanza)
      File "/usr/share/rhn/osad/dispatcher_client.py", line 145, in _message_callback
        sig = self._check_signature_from_message(stanza, actions)
      File "/usr/share/rhn/osad/jabber_lib.py", line 1310, in _check_signature_from_message
        sig = self._check_signature(stanza, actions=actions)
      File "/usr/share/rhn/osad/dispatcher_client.py", line 69, in _check_signature
        row = lookup_client_by_name(x_client_id)
      File "/usr/share/rhn/osad/dispatcher_client.py", line 214, in lookup_client_by_name
        raise InvalidClientError(client_name)
    InvalidClientError: d79c1f5d23dca47d')
    2012/03/28 16:26:18 -07:00 3785 0.0.0.0: osad/jabber_lib.__init__
    2012/03/28 16:26:18 -07:00 3785 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', 'removed for security')
    2012/03/28 16:26:18 -07:00 3785 0.0.0.0: osad/jabber_lib.register('ERROR', 'Invalid password')

One thing I would like to add: I am unable to upgrade to the latest release, as my company has strict restrictions on when I can upgrade. I just upgraded to 1.6, as this is the release I had in both Dev and Test before 1.7 came out. I would really like to get this resolved some way. This issue has been seen since 1.4, and maybe even 1.2 (I can't remember that far back). Please let me know what logs and any other additional information you need, and I will be glad to provide it to you.

spacewalk.git master: c55818af8d22109b7e9d1310d694cae1561e0e41

From the commit message:

    Previously, in a single loop pass, we only notified clients if there
    was no more data to read from clients:

        if (there_is_client_data_to_read):
            read_the_client_data()
        else:
            push_scheduled_actions_to_clients()

    This schema would allow for starvation of the else branch. All you
    needed for the starvation was data to be read from clients, since it
    would always be preferred to the event pushing.
    We want to prevent the starvation using the following logic:

        if (there_is_client_data_to_read):
            read_the_client_data()
        if (the_client_socket_is_write_ready):
            push_scheduled_actions_to_clients()

If you are interested in testing the above fix, please look for osa-dispatcher-5.10.44-1 in the latest Spacewalk nightly to see whether it fixes the issue described in this bug report.

I have installed and tested the nightly osa-dispatcher-5.10.44-1 package on the server and restarted osa-dispatcher, and a target system now picked up an action shortly after I scheduled one. I will do some more testing to confirm that all systems are working as expected.

Would I be able to add the updated code to a 1.6 install of Spacewalk, or are there other modifications that were made that require me to upgrade to 1.7?

(In reply to comment #26)
> Would I be able to add the updated code to a 1.6 install of Spacewalk or is
> there other modifications that were made that I need to upgrade to 1.7?

Yes, the updated osa-dispatcher should work with Spacewalk 1.6 as well.

I have updated to the latest osa-dispatcher package and am still having the same issue. The log that is generated is below. It was stable for 1 week, almost to the hour. Not sure what could be going on.
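The before/after shapes from the commit message can be simulated to show the starvation concretely. This is purely illustrative pseudologic, not the dispatcher's actual select loop: `run_loop` counts how many loop passes read client data versus pushed scheduled actions, with the client socket assumed write-ready every pass.

```python
def run_loop(iterations, readable_each_pass, fixed):
    """Simulate the dispatcher loop; return (reads, pushes) counts."""
    reads = pushes = 0
    for _ in range(iterations):
        there_is_client_data_to_read = readable_each_pass
        the_client_socket_is_write_ready = True
        if there_is_client_data_to_read:
            reads += 1
            if not fixed:
                continue  # old if/else shape: push branch never reached
        if the_client_socket_is_write_ready:
            pushes += 1
    return reads, pushes

# Old logic with a steady stream of client data: pushes are starved.
print(run_loop(100, True, fixed=False))  # (100, 0)
# Fixed if/if logic: reads and pushes both happen each pass.
print(run_loop(100, True, fixed=True))   # (100, 100)
```

With ~4000 chatty clients there is nearly always data to read, which is why the else branch could be starved indefinitely in production yet never in a small test environment.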
/var/log/rhn/osa-dispatcher.log:

    2012/05/01 02:59:45 -07:00 5981 0.0.0.0: osad/jabber_lib.main('ERROR', 'Error caught:')
    2012/05/01 02:59:45 -07:00 5981 0.0.0.0: osad/jabber_lib.main('ERROR', 'Traceback (most recent call last):
      File "/usr/share/rhn/osad/jabber_lib.py", line 120, in main
        self.process_forever(c)
      File "/usr/share/rhn/osad/jabber_lib.py", line 178, in process_forever
        self.process_once(client)
      File "/usr/share/rhn/osad/osa_dispatcher.py", line 182, in process_once
        client.retrieve_roster()
      File "/usr/share/rhn/osad/jabber_lib.py", line 714, in retrieve_roster
        stanza = self.get_one_stanza()
      File "/usr/share/rhn/osad/jabber_lib.py", line 786, in get_one_stanza
        self.process(timeout=tm)
      File "/usr/share/rhn/osad/jabber_lib.py", line 1048, in process
        self._parser.Parse(data)
      File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 269, in unknown_endtag
        self.dispatch(self._mini_dom)
      File "/usr/share/rhn/osad/jabber_lib.py", line 814, in _orig_dispatch
        jabber.Client.dispatch(self, stanza)
      File "/usr/lib/python2.4/site-packages/jabber/jabber.py", line 290, in dispatch
        else: handler['func'](self,stanza)
      File "/usr/share/rhn/osad/jabber_lib.py", line 369, in dispatch
        callback(client, stanza)
      File "/usr/share/rhn/osad/dispatcher_client.py", line 145, in _message_callback
        sig = self._check_signature_from_message(stanza, actions)
      File "/usr/share/rhn/osad/jabber_lib.py", line 1310, in _check_signature_from_message
        sig = self._check_signature(stanza, actions=actions)
      File "/usr/share/rhn/osad/dispatcher_client.py", line 69, in _check_signature
        row = lookup_client_by_name(x_client_id)
      File "/usr/share/rhn/osad/dispatcher_client.py", line 214, in lookup_client_by_name
        raise InvalidClientError(client_name)
    InvalidClientError: 910f7d384484c87b')
    2012/05/01 02:59:55 -07:00 5981 0.0.0.0: osad/jabber_lib.__init__
    2012/05/01 02:59:55 -07:00 5981 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', 'Removed for security')
    2012/05/01 02:59:55 -07:00 5981 0.0.0.0: osad/jabber_lib.register('ERROR', 'Invalid password')

(In reply to comment #28)

This actually seems to be a different problem than what was reported originally:

    osad/jabber_lib.register('ERROR', 'Invalid password')

Is your auth db OK? (/var/lib/jabberd/db/authreg.db in the default setup.)

Does removing the jabberd db and restarting jabberd and osa-dispatcher help the issue?

(In reply to comment #29)
> (In reply to comment #28)
>
> This actually seems to be a different problem than what was reported
> originally:
>
> osad/jabber_lib.register('ERROR', 'Invalid password')
>
> Is your auth db OK? (/var/lib/jabberd/db/authreg.db in the default setup).
>
> Does removing jabberd db and restarting jabberd & osa-dispatcher help
> the issue?

This may be a different issue; as of right now, other than that error, everything seems to be up. It is hard to tell, as both problems cause osad to not pick up actions. How would I check to see if the auth db is OK? When I get the "Invalid password" error, I have to stop jabberd and osa-dispatcher, remove the jabberd db, and then restart the services. This gets everything back up and going for a while, until I have to do it again.

Any ideas as to comment #30? It seems that the original bug that was opened may have been resolved. It is hard to tell, as both problems cause osad to not pick up actions. Should I open a new bug regarding these new error logs that I am getting? Thank you for your time and have a great day!

(In reply to comment #31)
> Any ideas as to comment #30? It seems that the original bug that was opened may
> have been resolved. It is hard to tell as both case osad to not pick up
> actions. Should I open a new bug regarding these new error logs that I am
> getting?

Yes. We consider the original

    2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)

issue resolved as per comment 23.
And the bugzilla will be CLOSED as resolved when Spacewalk 1.8 is released.

Moving ON_QA. Packages that address this bugzilla should now be available in the yum repos at http://yum.spacewalkproject.org/nightly/

Spacewalk 1.8 has been released: https://fedorahosted.org/spacewalk/wiki/ReleaseNotes18