Bug 716064 - osa-dispatcher not notifying
Summary: osa-dispatcher not notifying
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: Server
Version: 1.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Milan Zázrivec
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On:
Blocks: space18
 
Reported: 2011-06-23 16:08 UTC by JDavis4102
Modified: 2012-11-01 16:19 UTC (History)
11 users

Fixed In Version: osad-5.10.44-1
Doc Type: Bug Fix
Doc Text:
Clone Of: 662593
Environment:
Last Closed: 2012-11-01 16:19:03 UTC
Embargoed:


Attachments

Description JDavis4102 2011-06-23 16:08:10 UTC
+++ This bug was initially created as a clone of Bug #662593 +++

Description of problem:
After 1.1->1.2 upgrade we are not able to notify osad enabled clients. Log is full of:
2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)

The following steps don't resolve the issue:
sed -i "s/1\.1-client/1\.2-client/" /etc/yum.repos.d/spacewalk-client.repo
/etc/init.d/osad stop
rm -f /etc/sysconfig/rhn/osad-auth.conf
yum -y update osad yum-rhn-plugin rhnsd rhnlib rhn-setup rhn-client-tools rhn-check
/etc/init.d/osad restart
/etc/init.d/rhnsd restart

We had this issue with Spacewalk 1.1 as well, when clients had re-registered. Our workaround was to delete all osad-enabled clients from Spacewalk and the jabber DBs, and re-register the clients. But this is a very time-consuming solution.

Version-Release number of selected component (if applicable):
jabberd-2.2.11-2.el5
jabberpy-0.5-0.17.el5
osad-5.9.44-1.el5
osa-dispatcher-5.9.44-1.el5
spacewalk-setup-1.2.16-1.el5
spacewalk-setup-jabberd-1.3.1-1.el5

Actual results:
2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)
Clients are not picking the tasks up.

Expected results:
Clients should pick the tasks up seamlessly.

--- Additional comment from marcus.moeller on 2011-01-11 08:56:24 EST ---

Dear David,

are you able to ping the clients? Here, OSA ping works but actions are not picked up.

Kind Regards
Marcus

--- Additional comment from david.hrbac on 2011-01-11 09:59:47 EST ---

Marcus,
we had been able to ping, but one day the clients stopped picking up the tasks. For a few weeks now we have not been able to even ping. Clients are picking up actions with rhn_check only. BTW, ping is not working even against the spacewalk server itself. :o(
Regards,
David Hrbáč

--- Additional comment from marcus.moeller on 2011-01-13 09:41:00 EST ---

Dear David,

it seems to depend on the number of clients registered to jabberd.

With a large number of systems, jabber_connection.jid_available(jabber_id) seems to report that the node is not available (even if it is).

Next,

 rfds, wfds, efds = select.select([client, self._tcp_server], [], [], npi)

reports the client socket as readable (in rfds), which leads to 'Not notifying jabber nodes'.
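
Roughly, the pattern being described is something like this (a simplified sketch of one dispatcher loop pass, not the actual osa-dispatcher source):

    import select

    def log_debug(level, msg):
        # stand-in for osad's log_debug()
        print(msg)

    def process_once_sketch(client_sock, server_sock, pending_actions,
                            poll_interval=5.0):
        # One simplified pass of the dispatch loop (pre-fix behaviour).
        rfds, wfds, efds = select.select([client_sock, server_sock], [], [],
                                         poll_interval)
        if not rfds:
            # Nothing readable from the jabber side: this is the only branch
            # that pushes scheduled actions out to the clients.
            while pending_actions:
                pending_actions.pop(0)   # pretend each action gets pushed
        else:
            # Something was readable, so this pass only services the reads.
            # With enough connected clients there is almost always something
            # to read, the push branch is starved, and the log fills with:
            log_debug(5, "Not notifying jabber nodes")
            for sock in rfds:
                sock.recv(4096)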

Kind Regards
Marcus

--- Additional comment from david.hrbac on 2011-01-14 02:12:07 EST ---

Marcus,
Well, not sure about it. We have only 14 boxes registered within our testing Spacewalk instance. I'm very disappointed with Spacewalk so far. We have been evaluating Spacewalk for a few weeks; during that period I have found a lot of bugs, reported them, and found workarounds.
Regards,
David

--- Additional comment from marcus.moeller on 2011-01-14 03:09:31 EST ---

Dear David,

we have a production and a test environment, both running 1.2 and both upgraded from previous releases.

Within the production environment we manage about 500 clients, the test env is (as the name implies) just for testing purpose.

I have registered one of the clients to the test environment to make sure that it's not a general problem and ping/remote commands work there.

Unregistering a system from production env and re-registering does not seem to help so I tried to figure out the differences.

Our testing environment has just a few systems connected (2-3), so I think this might be the reason.

Overall the connection handling is just broken at the moment. It does not correctly detect whether a connection is established or not (as mentioned in Comment 3). I am not yet sure under what circumstances that happens, but I guess it has something to do with the number of connected clients.

Kind Regards
Marcus

--- Additional comment from JDavis4102 on 2011-02-01 22:05:05 EST ---

I am also having this issue.

I can ping but no clients are picking up the actions.

I have just added about 10 servers and it seems that this stopped osad from picking up actions. It was working before this.

I have restarted everything. Removed the DB and auth configurations and restarted osad and nothing. The only thing I have not done is remove all clients. I would like to not do that.

--- Additional comment from JDavis4102 on 2011-02-02 12:35:46 EST ---

Is there any ETA as to a fix for this? We use OSAD/OSA-dispatcher for our normal system management and would hate to lose this functionality. Thank you for your time and have a great day!

--- Additional comment from mzazrivec on 2011-02-03 13:38:38 EST ---

(In reply to comment #0)
> Description of problem:
> After 1.1->1.2 upgrade we are not able to notify osad enabled clients. Log is
> full of:
> 2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not
> notifying jabber nodes',)
> 
> The following steps don't resolve the issue:
> sed -i "s/1\.1-client/1\.2-client/" /etc/yum.repos.d/spacewalk-client.repo
> /etc/init.d/osad stop
> rm -f /etc/sysconfig/rhn/osad-auth.conf
> yum -y update osad yum-rhn-plugin rhnsd rhnlib rhn-setup rhn-client-tools
> rhn-check
> /etc/init.d/osad restart
> /etc/init.d/rhnsd restart

Why the above procedure?

> We had this issue also with 1.1 Spacewalk, when clients had re-registered. Our
> workaround was to delete all osad enabled clients from Spacewalk, jabber DBs,
> and re-register clients. But this is very time consuming solution.

As far as system re-registration is concerned, there's a separate bug report
dealing with a situation where the push functionality stops working after
system's re-registration:

    https://bugzilla.redhat.com/show_bug.cgi?id=590608

--- Additional comment from mzazrivec on 2011-02-03 13:45:54 EST ---

(In reply to comment #1)
> Dear David,
> 
> are you able to ping the clients? As here, OSA ping works but actions are not
> picked up.

What does 'OSA ping works' mean exactly?

In the Spacewalk webui, do you see the system online?

When pinging the system via webui, do you see the ping time stamp being updated?

--- Additional comment from mzazrivec on 2011-02-03 13:51:08 EST ---

(In reply to comment #6)
> I am also having this issue.
> 
> I can ping but no clients are picking up the actions.
> 
> I have just added about 10 servers and it seems that this broke osad from
> picking up actions. It was working before this.
> 
> I have restarted everything. Removed the DB and auth configurations and
> restarted osad and nothing. The only thing I have not done is remove all
> clients. I would like to not do that.

Could you please try to do the following on your Spacewalk server:

# service osa-dispatcher stop
# service jabberd stop

Edit the following three files:

/etc/jabberd/c2s.xml
/etc/jabberd/s2s.xml
/etc/jabberd/router.xml

and change the <max_fds>...</max_fds> value in each of them (e.g. double it; a small helper sketch follows these steps).

# service jabberd start
# service osa-dispatcher start
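
If it helps, here is a small helper that applies the max_fds change described above (illustrative only -- editing the three files by hand works just as well, and the paths assume the default Spacewalk jabberd layout):

    import re

    conf_files = ("/etc/jabberd/c2s.xml",
                  "/etc/jabberd/s2s.xml",
                  "/etc/jabberd/router.xml")

    for path in conf_files:
        text = open(path).read()
        # double whatever value is currently configured in <max_fds>...</max_fds>
        text = re.sub(r"<max_fds>(\d+)</max_fds>",
                      lambda m: "<max_fds>%d</max_fds>" % (int(m.group(1)) * 2),
                      text)
        open(path, "w").write(text)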

--- Additional comment from JDavis4102 on 2011-02-03 19:38:29 EST ---

Milan,

I have made the changes as requested. 

I performed a ping on all clients and the ping information was updated in the web UI. I then scheduled a remote command to run, but the servers didn't pick up the action.

It seems modifying the max_fds setting (from 1024 to 2048) has not resolved my issue.

Thank you for your time and have a great day!

--- Additional comment from david.hrbac on 2011-02-04 03:37:48 EST ---

Milan,
those settings have nothing to do with this issue. We are experiencing this issue even with small Spacewalk instances having about 15 clients.
Thanks,
David Hrbáč

--- Additional comment from david.hrbac on 2011-02-10 04:42:47 EST ---

Hi,
neither the upgrade to Spacewalk 1.3 nor the latest osad-5.9.55-1 solves the issue. I'm still not able to ping the clients. osa-dispatcher is still not sending the notifications.
Regards,
David Hrbáč

--- Additional comment from JDavis4102 on 2011-02-14 14:53:59 EST ---

Hello Milan, 

Has there been any progress with this issue? Thank you for your time and have a great day!

--- Additional comment from v.motoska on 2011-02-15 06:23:31 EST ---

Hi, I've spent some time on this issue. After looking at osad/jabber_lib.py I just commented out the roster in the jabber server configuration and restarted the clients and the dispatcher. Now it works like a dream.

It looks like the client can't get the subscription to the dispatcher when the roster is enabled.

--- Additional comment from david.hrbac on 2011-02-15 07:23:40 EST ---

Vlado,
thanks for the pointer. It really seems to help. Just for the record: Vlado is talking about /etc/jabberd/sm.xml and commenting out the roster* modules within the file. I'm attaching the patch.
Thanks!
David

--- Additional comment from david.hrbac on 2011-02-15 07:24:46 EST ---

Created attachment 478869 [details]
Vlado's solution

--- Additional comment from tvr1 on 2011-02-15 15:07:02 EST ---

I can validate that removing the roster does indeed seem to "fix" the issue on 1.2. Off to try against 1.3

--- Additional comment from tvr1 on 2011-02-15 17:55:39 EST ---

Works for the 1.3 issues as well. I am happily back in the land of a working osad-enabled spacewalk install.

Thanks Vlado!

--- Additional comment from JDavis4102 on 2011-02-15 17:59:01 EST ---

I can validate that removing the roster modules does resolve this issue on 1.2.

Thank you so much!!! :)

--- Additional comment from helzerr on 2011-02-16 12:11:42 EST ---

Works for me in 1.3; thank you!

I had to perform these additional steps after modifying sm.xml:

On the Spacewalk / Jabber server,

  service osa-dispatcher stop
  service jabberd stop
  rm -f /var/lib/jabberd/db/*
  service jabberd start
  service osa-dispatcher start

On the clients,

  service osad stop
  rm /etc/sysconfig/rhn/osad-auth.conf
  service osad start

--- Additional comment from mzazrivec on 2011-02-21 12:31:55 EST ---

Fixed in spacewalk.git master: 2ca8629f4d2bd681bd1db48b4672059fb1cdc653

The fix above is to ensure presence subscription works with standard
Spacewalk jabberd setup as created by spacewalk-setup-jabberd (i.e.
no need to disable roster module at all).

--- Additional comment from JDavis4102 on 2011-02-24 18:31:59 EST ---

Milan, 

I have tried the fix above by modifying my /usr/share/rhn/osad/osa_dispatcher.py with the changes noted in the latest patch. I have removed the module roster comments and restarted all services. It doesn't resolve this issue. 

I added the comments back to the modules and everything came up and started working again without modifying /usr/share/rhn/osad/osa_dispatcher.py.

So right now I have /usr/share/rhn/osad/osa_dispatcher.py modified as you submitted with <module>roster*</module> commented and osad is picking up actions as expected.

One issue that is now in my environment is that when a server reboots or osad is restarted, it doesn't fully connect to the Spacewalk server (this issue was happening before modifying osa_dispatcher.py). To get osad to attach back to the server I have to remove /etc/sysconfig/rhn/osad-auth.conf and restart osad, and then everything starts as expected.

--- Additional comment from mzazrivec on 2011-02-25 09:35:08 EST ---

(In reply to comment #23)
> Milan, 
> 
> I have tried the fix above by modifying my
> /usr/share/rhn/osad/osa_dispatcher.py with the changes noted in the latest
> patch. I have removed the module roster comments and restarted all services.
> It doesn't resolve this issue.

Actually, you'd need to do the following:

1. Stop osad(s) on the client(s)
2. Stop osa-dispatcher and jabberd on the Spacewalk server
3. Apply the patch from comment #22
4. On the Spacewalk server: rm -f /var/lib/jabberd/db/*
5. Make sure you're using standard Spacewalk configured sm.xml
6. Start jabberd
7. Start osa-dispatcher
8. Start osad(s) on your client(s)

--- Additional comment from JDavis4102 on 2011-02-25 12:43:03 EST ---

Milan, 

I performed the steps listed and it seems to have resolved my issue in my Development and Test environments. That said, this issue wasn't initially seen in these environments, so I am not sure if this has totally resolved it. I will be pushing to the Production environment soon and will let you know how it turns out there.

Thank you for your work and hope you have a great day!

Kind regards,
JD

--- Additional comment from JDavis4102 on 2011-03-04 13:53:01 EST ---

I have released to the Production environment and it seems that this patch has not resolved my issue. OSA ping works but actions are not being sent to the client. 

It has, however, resolved the issue of restarting osad without having to remove /etc/sysconfig/rhn/osad-auth.conf.

--- Additional comment from mzazrivec on 2011-03-04 14:14:47 EST ---

(In reply to comment #26)
> I have released to the Production environment and it seems that this patch has
> not resolved my issue. OSA ping works but actions are not being sent to the
> client. 
> 
> This has resolved the issue of being able to restart osad without having to
> remove /etc/sysconfig/rhn/osad-auth.conf.

This is quite odd actually. Could you please do the following for me?

1. Perform steps one to six from comment #24
2. On your Spacewalk as root:
    # osa-dispatcher -N -vvvvvvvvvv >& osa-dispatcher.log
3. On one of your clients as root:
    # osad -N -vvvvvvvvvv >& osad.log

Once you see that the client system in question shows as online in the webui, try
a ping, then schedule some remote action. Leave it running for a while
(e.g. a minute), then Ctrl-C osad and osa-dispatcher and attach both
log files to this bug report (feel free to obfuscate hostnames in
the log files if you don't feel like exposing them).

--- Additional comment from JDavis4102 on 2011-03-04 15:30:07 EST ---

Is there any way to do this on only one of the clients? I have 300+ servers connected and it takes a while to make the change on them all.

--- Additional comment from mzazrivec on 2011-03-04 15:57:55 EST ---

(In reply to comment #28)
> Is there anyway to only do this on one of the clients. I have about 300+
> servers connected and it takes a while to make the change on them all.

OK, in that case just do steps 2. and 3. from comment #27 and attach both
log files please.

--- Additional comment from JDavis4102 on 2011-03-04 20:05:09 EST ---

Created attachment 482404 [details]
Spacewalk logs

--- Additional comment from JDavis4102 on 2011-03-04 20:06:40 EST ---

It seems that the issue has become intermittent. After performing the requested actions it seems to have started working. Not sure what is going on now.

--- Additional comment from mzazrivec on 2011-03-07 05:38:26 EST ---

(In reply to comment #31)
> It seems that the issue has become intermittent. After performing the requested
> actions it seems to have started working. Not sure what is going on now.

OK, I see both osa-dispatcher and osad subscribed to each other's presence,
ping works, client picked up the scheduled action.

Seeing that you have 300+ systems connected to your Spacewalk, may I also
suggest increasing the max_fds settings as described in comment #10 (I have seen
cases where this was necessary in environments with many client systems).

Thanks.

--- Additional comment from JDavis4102 on 2011-03-08 00:30:27 EST ---

Created attachment 482838 [details]
New-Spacewalk-OSA-Logs

I have performed the actions you have requested. It doesn't seem to resolve my issue. 

However, I was able to replicate the original issue even with max_fds modifications. Attached you will find the logs in question.

--- Additional comment from JDavis4102 on 2011-03-09 14:21:04 EST ---

Any ideas as to what I could try next? I thank you for the assistance you have provided thus far.

--- Additional comment from JDavis4102 on 2011-03-11 17:06:30 EST ---

I have some updates to this issue. I currently have OSAD actions working. I have restarted each client and removed the jabber DB and everything started working. Then about 24-36 hours later everything stopped receiving actions. I restarted osa-dispatcher and it started working again. I am not sure what may be causing this and if this is the same issue. Please let me know if I need to open another bug for this issue.

--- Additional comment from JDavis4102 on 2011-03-11 17:17:50 EST ---

Created attachment 483834 [details]
Traceback error

I also started receiving traceback logs from a proxy server when I perform a remote command action.

--- Additional comment from rhbugzilla on 2011-03-12 23:25:14 EST ---

I was experiencing problems similar to others posting here.  Pings would work, but osad was not picking up actions.  Spacewalk was upgraded from 1.0->1.1->1.2->1.3  and osad stopped working somewhere along the way.

To get my jabber configuration back to normal, I ended up doing:
On spacewalk server:
service osa-dispatcher stop
rpm --nodeps --erase jabberd
mv /etc/jabberd /etc/jabberd.old
rm -f /var/lib/jabberd/db/*
yum install jabberd
spacewalk-setup-jabberd
service jabberd start
service osa-dispatcher start

Then on clients I had to run:
service osad stop
rm -f /etc/sysconfig/rhn/osad-auth.conf
service osad start

Now osad is working, and hopefully will still be working tomorrow.

--- Additional comment from mzazrivec on 2011-03-14 06:51:22 EDT ---

(In reply to comment #36)
> Created attachment 483834 [details]
> Traceback error

The important part from the traceback shown is:

    ORA-12519: TNS:no appropriate service handler found

which simply suggests osa-dispatcher had problems with database connection.

When setting up the Oracle DB (XE or 10g / 11g), did you alter system
processes as described in https://fedorahosted.org/spacewalk/wiki/OracleXeSetup ?

Specifically I mean that 

    alter system set processes = 400 scope=spfile;

part. If so, what value did you set the processes to? Our documentation
suggests 400, but if that's not enough, increasing the amount and restarting
the server may help.

> I also started receiving traceback logs from a proxy server when I perform a
> remote command action.

Are we talking Spacewalk proxy here? What do those tracebacks look like?

--- Additional comment from JDavis4102 on 2011-03-14 12:05:08 EDT ---

I have an Oracle Standalone 11g DB. I have not made the recommended change as I thought it was only for XE and not 11g. I will go ahead and make the change. One question before I do: is the suggested 400 per client? If so I would need to bump that number up a lot.

When I said proxy server, I meant I was getting the tracebacks from the Application server that was having problems with the proxy. The traceback I uploaded to this bug contains these logs.

Could the issue with connections be the cause for the clients to not receive actions?

--- Additional comment from mzazrivec on 2011-03-14 12:26:45 EDT ---

(In reply to comment #39)
> I have an Oracle Standalone 11g DB. I have not made the recommended change
> as I thought it was only for XE and not 11g.

Sure. But there are several resources on the internet suggesting that increasing
the db processes helps with ORA-12519 (regardless of what Oracle version
you're using).

> I will go ahead and make the change.
> One question before I do is the suggestion 400 for each client? If so I would
> need to bump that number up a lot.

You mean altering the number of processes to (400 * no_of_clients) ? No.

Just check what the current settings are (in sqlplus, type:
show parameter processes) and try to increase the value.

> When I said proxy server I was getting the tracebacks from the Application
> server that was having problems with the proxy. The traceback I have uploaded
> to this bug are these logs.
> 
> Could the issue with connections be the cause for the clients to not receive
> actions?

Yes. osa-dispatcher is the component which retrieves the scheduled actions
from database (and therefore needs a db connection) and pushes them to the
client systems via jabber network.

--- Additional comment from JDavis4102 on 2011-03-14 13:18:34 EDT ---

Currently it is set to 150. I will go ahead and make the changes. Just a heads up: I restarted the osa-dispatcher service, and since the restart everything seems to be working. It has continued to work over the weekend and today. Not sure what is going on now.

--- Additional comment from JDavis4102 on 2011-03-29 14:42:35 EDT ---

Issue appears to be resolved.

--- Additional comment from msuchy on 2011-04-11 03:45:19 EDT ---

Mass moving to ON_QA before release of Spacewalk 1.4

--- Additional comment from msuchy on 2011-04-26 05:10:56 EDT ---

Spacewalk 1.4 has been released

--- Additional comment from JDavis4102 on 2011-06-23 12:05:07 EDT ---

The issue seems to have reappeared after applying the patch and performing the steps above. I would also like to add that I am using 13 proxy servers with 1000+ servers connected through them.

Comment 1 Jan Pazdziora 2011-07-20 11:50:19 UTC
Aligning under space16.

Comment 2 JDavis4102 2011-08-23 15:42:27 UTC
I determined that the following code line allows remote commands to continue to work. This is by no means a fix, just a workaround, but it may help in resolving the root issue.

At line 170 in /usr/share/rhn/osad/osa_dispatcher.py, add the following call under:

    else:
        log_debug(5, "Not notifying jabber nodes")

Line added at 170:

    self._tcp_server.notify_jabber_nodes()

Comment 3 Jan Pazdziora 2011-08-30 12:48:41 UTC
(In reply to comment #2)
> Determined the following code line has allowed remote commands to continue to
> work. This is by no means a fix as this is just a workaround but may help in
> resolving the root issue.
> 
> Line 170 in /usr/share/rhn/osad/osa_dispatcher.py add the following under: 
> else:
>            log_debug(5,"Not notifying jabber nodes")
> 
>            Add this line at 170 -->> self._tcp_server.notify_jabber_nodes()

When you apply this change on your installation, does it reveal anything about the root cause?

Comment 4 JDavis4102 2011-09-08 15:49:18 UTC
Jan,

The only thing that shows up after adding this line is that the jabber services sm and c2s fail to restart after a while. I have to remove the jabber DB to get them to come back. Also, I have a system that tries to connect but does not show as online in the Spacewalk UI even after pinging. This has only started to show up after adding that line. All other clients seem to be up and running with that line of code added.

Comment 5 JDavis4102 2011-11-29 16:21:37 UTC
Any update as to a resolution for this issue? Adding the line above (Comment 3) has resolved the issue but it seems this is only a workaround and not actually resolving the root cause. Any idea as to what this could be?

Comment 6 Jan Pazdziora 2011-12-09 10:33:46 UTC
(In reply to comment #4)
> Jan,
> 
> The only thing it shows adding this line is the jabber service sm and c2s fails
> to restart after a while. I have to remove the jabber DB to get it to come

What is on the console, what is in the log, what does strace reveal?

> back. Also I have a system that trys to connect but is unable to display online
> in Spacewalk UI even after pinging. This only has started to show up after
> adding that line. All other clients seem to be up and running with adding that

So that change caused a regression.

What exactly changed in your environment since bug 662593 comment 42, when you reported the issue as resolved? And: can you please restate what the reproducer steps are exactly, as the word "issue" is morphing its meaning with each new comment?

Comment 7 JDavis4102 2011-12-09 19:09:47 UTC
The problem we are having is that clients are able to ping but unable to pick up actions without performing the following:

On the Spacewalk Application server:

  service osa-dispatcher stop
  service jabberd stop
  rm -f /var/lib/jabberd/db/*
  service jabberd start
  service osa-dispatcher start

On the Spacewalk Proxy server:

  service jabberd stop
  rm -f /var/lib/jabberd/db/*
  service jabberd start

On the clients,

  service osad stop
  rm -f /etc/sysconfig/rhn/osad-auth.conf
  service osad start

Once the above commands have been run, the client is able to receive actions again for a short time and then goes back to not working. This happens randomly and I am unable to reproduce it in my Development and Test environments. Keep in mind that we have over 4000 clients connected to Spacewalk and about 18 proxy servers.

Adding the line listed below has provided a workaround in order to keep the service up and going.

At line 170 in /usr/share/rhn/osad/osa_dispatcher.py, add the following call under:

    else:
        log_debug(5, "Not notifying jabber nodes")

Line added at 170:

    self._tcp_server.notify_jabber_nodes()

The other issues noted in my last comment have been resolved by other means and are not related to this issue. Sorry for the confusion.

Comment 8 Jan Pazdziora 2011-12-12 08:40:22 UTC
(In reply to comment #7)
> The problem we are having is that clients are able to ping but unable to pick
> up actions without performing the following:
> 
> On the Spacewalk Application server:
> 
>   service osa-dispatcher stop
>   service jabberd stop
>   rm -f /var/lib/jabberd/db/*
>   service jabberd start
>   service osa-dispatcher start
> 
> On the Spacewalk Proxy server:
> 
>   service jabberd stop
>   rm -f /var/lib/jabberd/db/*
>   service jabberd start
> 
> On the clients,
> 
>   service osad stop
>   rm -f /etc/sysconfig/rhn/osad-auth.conf
>   service osad start
> 
> Once the above commands have been ran the client is able to receive actions
> again for a short time and then goes back to not working. This happens randomly
> and unable to reproduce in my Development and Test environments. Keep in mind
> that we have over 4000 clients connected to Spacewalk and about 18 proxy
> servers. 
> 
> Adding the line listed below has provided a workaround in order to keep the
> service up and going.
> 
> Line 170 in /usr/share/rhn/osad/osa_dispatcher.py add the following under: 
> else:
>            log_debug(5,"Not notifying jabber nodes")
> 
>            Add this line at 170 -->> self._tcp_server.notify_jabber_nodes()

Alright. The message from the Spacewalk team is: you should *never* need to remove the auth db, osad-auth.conf or do similar things. It might work for you for a while but it does not address the root cause of the problems.

We should also not add the self._tcp_server.notify_jabber_nodes() call to a branch in the code which is specifically marked as Not notifying jabber nodes.

Thus, we need to figure out why the code execution gets to this branch and not to the other branches where it would actually notify the clients. As I mentioned earlier, we will need logs, strace outputs, and console outputs. Also, the output of lsof will help determine what the programs are doing.

By the way, which Spacewalk version is this, and what client versions do you use?

As for the 4000 clients -- yes, the number might be the cause of the problems. We might be hitting the limit on the number of open files or something here. But we need to *know* whether this really is the case or whether something else is suboptimal in the code or in the setup. So again, we will need much more detailed information from both the server and the client side, from the time when this situation appears, to investigate and fix it.

Sincerely,

Jan

Comment 9 Jan Pazdziora 2011-12-12 09:02:32 UTC
Please also note that the nightly client repo has an osad package with fixes for bug 691847. While it's hard to say whether these issues are related or not, you might want to upgrade to the latest osad packages just to debug the same code upstream is working with.

Comment 10 JDavis4102 2012-01-03 23:21:27 UTC
I will update to the latest current release and let you know. We can't go to nightly as this is a production environment.

Comment 11 Jan Pazdziora 2012-01-18 09:42:05 UTC
(In reply to comment #10)
> I will update to latest current release and let you know. We can't go to
> nightly as this is a production environment.

Any updates from the Spacewalk 1.6 situation?

Comment 12 JDavis4102 2012-01-18 16:23:52 UTC
I am unable to upgrade to 1.6 at this time due to production restrictions. I am unable to update my Development and Test environments, as these environments were not seeing the issue. I will update this bug report once I am able to update to 1.6/current release. The update should happen around the April/May time frame.

Comment 13 mheldebr 2012-01-19 22:48:32 UTC
I have upgraded both server and clients to Spacewalk 1.6 and I have the same issue. The clients can be pinged but the osa-dispatcher never notifies the clients.

I enabled debugging in my osa-dispatcher.log and I think I can see why nothing is being notified of pending actions:

2012/01/19 14:47:23 -05:00 21008 0.0.0.0: rhnSQL/driver_postgresql._execute_wrapper('Executing SQL: "\n        select id, name, shared_key, jabber_id\n          from rhnPushClient\n         where state_id = %(online_id)s\n           and last_ping_time is not null\n           and next_action_time is null\n           and jabber_id is not null\n    " with bind params: {online_id: 1}',)


I have looked in the rhnPushClient table and I see that all active clients have a last_ping_time set to NULL.  For a brief period of time after a manual ping from the web ui the last_ping_time value is present with a correct time stamp, but then this is reset to NULL shortly afterwards.  Whatever is setting the NULL value in last_ping_time is preventing the query from finding actions for clients to notify.

Also, scheduling an update is not setting next_action_time to a value other than NULL using the web ui.  Scheduling an action and then manually pinging put values into both the next_action_time and last_ping_time which were both cleared by some other process shortly afterwards.

Comment 14 mheldebr 2012-01-20 13:32:20 UTC
I'm sure this was working with the base 1.6 release. The packages that I updated before osad stopped working are:

Jan 17 12:46:40 Updated: spacewalk-base-minimal-1.6.38-1.el5.noarch
Jan 17 12:46:40 Updated: spacewalk-java-config-1.6.102-1.el5.noarch
Jan 17 12:46:40 Updated: spacewalk-java-lib-1.6.102-1.el5.noarch
Jan 17 12:46:40 Updated: spacewalk-java-postgresql-1.6.102-1.el5.noarch
Jan 17 12:46:47 Updated: libxml2-2.6.26-2.1.12.el5_7.2.x86_64
Jan 17 12:46:47 Updated: libxml2-python-2.6.26-2.1.12.el5_7.2.x86_64
Jan 17 12:46:48 Updated: spacewalk-taskomatic-1.6.102-1.el5.noarch
Jan 17 12:46:50 Updated: spacewalk-java-1.6.102-1.el5.noarch
Jan 17 12:46:51 Updated: kernel-headers-2.6.18-274.17.1.el5.x86_64
Jan 17 12:46:51 Updated: spacewalk-html-1.6.38-1.el5.noarch
Jan 17 12:47:04 Installed: kernel-2.6.18-274.17.1.el5.x86_64
Jan 17 12:47:11 Installed: kernel-devel-2.6.18-274.17.1.el5.x86_64
Jan 17 12:47:11 Updated: spacewalk-pxt-1.6.38-1.el5.noarch
Jan 17 12:47:12 Updated: spacewalk-base-1.6.38-1.el5.noarch
Jan 17 12:47:12 Updated: spacewalk-grail-1.6.38-1.el5.noarch
Jan 17 12:47:12 Updated: spacewalk-sniglets-1.6.38-1.el5.noarch
Jan 19 11:52:32 Installed: python-lxml-2.0.11-1.el5.x86_64

Comment 15 Jan Pazdziora 2012-02-24 15:29:14 UTC
(In reply to comment #13)
> I have upgraded to spacewalk 1.6 both server and clients and I have the same
> issue.  The clients can be pinged but the osa dispatcher never notifies the
> clients.
> 
> I enabled debugging in my osa-dispatcher.log and I think I can see why nothing
> is being notified of pending actions:
> 
> 2012/01/19 14:47:23 -05:00 21008 0.0.0.0:
> rhnSQL/driver_postgresql._execute_wrapper('Executing SQL: "\n        select id,
> name, shared_key, jabber_id\n          from rhnPushClient\n         where
> state_id = %(online_id)s\n           and last_ping_time is not null\n          
> and next_action_time is null\n           and jabber_id is not null\n    " with
> bind params: {online_id: 1}',)
> 
> 
> I have looked in the rhnPushClient table and I see that all active clients have
> a last_ping_time set to NULL.  For a brief period of time after a manual ping
> from the web ui the last_ping_time value is present with a correct time stamp,
> but then this is reset to NULL shortly afterwards.  Whatever is setting the
> NULL value in last_ping_time is preventing the query from finding actions for
> clients to notify.
> 
> Also, scheduling an update is not setting next_action_time to a value other
> than NULL using the web ui.  Scheduling an action and then manually pinging put
> values into both the next_action_time and last_ping_time which were both
> cleared by some other process shortly afterwards.

I don't think your observation is correct. The SQL query you point out is

    _query_fetch_clients_to_be_pinged = rhnSQL.Statement("""
        select id, name, shared_key, jabber_id
          from rhnPushClient
         where state_id = :online_id
           and last_ping_time is not null
           and next_action_time is null
           and jabber_id is not null
    """)

So obviously the last_ping_time is set, and when the ping is sent, it is cleared -- that's why you see it there for a while, before osa-dispatcher picks it up and acts on it.

I don't think it's related to your notification problem.

Comment 16 Jan Pazdziora 2012-02-24 15:30:58 UTC
What is the osad you have on the clients? Please make sure you have at least version 5.10.30-1.

Comment 17 mheldebr 2012-02-24 15:41:06 UTC
Clients are at osad-5.10.34-1.el5

Server is:

osa-dispatcher-5.10.34-1.el5
osa-dispatcher-selinux-5.10.34-1.el5

Comment 18 mheldebr 2012-03-15 20:08:30 UTC
Upgraded to Spacewalk 1.7. Clients are still not picking up actions, but their status updates when pinged. I had to remove /var/lib/jabberd/db/* due to sm complaining after my upgrade.

Mar 15 13:29:25 server jabberd/sm[31368]: db: couldn't open environment: DB_VERSION_MISMATCH: Database environment version mismatch
Mar 15 13:29:25 server jabberd/sm[31368]: initialisation of storage driver 'db' failed


Clients are at osad-5.10.41-1.el5

Server is:

osa-dispatcher-selinux-5.10.41-1.el5
osa-dispatcher-5.10.41-1.el5

Comment 19 Jan Pazdziora 2012-03-23 10:46:12 UTC
What's in the osa-dispatcher.log file?

Comment 20 mheldebr 2012-03-23 14:00:35 UTC
Pinged server before update

2012/03/23 08:51:35 -05:00 4646 0.0.0.0: osad/dispatcher_client._message_callback('Ping response',)

I scheduled the update here and nothing is logged and the update is still waiting.  Pinged server after about 8 minutes to verify it was still listening.

2012/03/23 08:59:13 -05:00 4646 0.0.0.0: osad/dispatcher_client._message_callback('Ping response',)

This is the default log level; what log level would you like me to try for more output?

Comment 21 JDavis4102 2012-03-29 15:34:17 UTC
I am not sure if this is related but I will go ahead and post it in case it helps.

With my added temporary fix above, I now get the following error after upgrading to 1.6.

2012/03/28 16:26:08 -07:00 3785 0.0.0.0: osad/jabber_lib.main('ERROR', 'Error caught:')
2012/03/28 16:26:08 -07:00 3785 0.0.0.0: osad/jabber_lib.main('ERROR', 'Traceback (most recent call last):\n  File "/usr/share/rhn/osad/jabber_lib.py", line 120, in main\n    self.process_forever(c)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 178, in process_forever\n    self.process_once(client)\n  File "/usr/share/rhn/osad/osa_dispatcher.py", line 178, in process_once\n    client.retrieve_roster()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 714, in retrieve_roster\n    stanza = self.get_one_stanza()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 786, in get_one_stanza\n    self.process(timeout=tm)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 1048, in process\n    self._parser.Parse(data)\n  File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 269, in unknown_endtag\n    self.dispatch(self._mini_dom)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 814, in _orig_dispatch\n    jabber.Client.dispatch(self, stanza)\n  File "/usr/lib/python2.4/site-packages/jabber/jabber.py", line 290, in dispatch\n    else: handler[\'func\'](self,stanza)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 369, in dispatch\n    callback(client, stanza)\n  File "/usr/share/rhn/osad/dispatcher_client.py", line 145, in _message_callback\n    sig = self._check_signature_from_message(stanza, actions)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 1310, in _check_signature_from_message\n    sig = self._check_signature(stanza, actions=actions)\n  File "/usr/share/rhn/osad/dispatcher_client.py", line 69, in _check_signature\n    row = lookup_client_by_name(x_client_id)\n  File "/usr/share/rhn/osad/dispatcher_client.py", line 214, in lookup_client_by_name\n    raise InvalidClientError(client_name)\nInvalidClientError: d79c1f5d23dca47d\n')
2012/03/28 16:26:18 -07:00 3785 0.0.0.0: osad/jabber_lib.__init__
2012/03/28 16:26:18 -07:00 3785 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', 'removed for security')
2012/03/28 16:26:18 -07:00 3785 0.0.0.0: osad/jabber_lib.register('ERROR', 'Invalid password')

Comment 22 JDavis4102 2012-03-29 15:50:47 UTC
One thing I would like to add: I am unable to upgrade to the latest release as my company has strict restrictions on when I can upgrade. I just upgraded to 1.6 as this is the release I had in both Dev and Test before 1.7 came out. I would really like to get this resolved some way. This issue has been seen since 1.4 and maybe even 1.2 (I can't remember that far back). Please let me know what logs you need and any other additional information and I will be glad to provide it to you.

Comment 23 Milan Zázrivec 2012-04-10 13:43:38 UTC
spacewalk.git master: c55818af8d22109b7e9d1310d694cae1561e0e41

From the commit message:

Previously, in a single loop pass, we only notified clients if there
    was no more data to read from clients:
    
        if (there_is_client_data_to_read):
            read_the_client_data()
        else:
            push_scheduled_actions_to_clients()
    
    This schema would allow for starvation of the else branch. All you
    needed for the starvation was a data to be read from clients, since
    it would always be prefered to the event pushing.
    
    We want to prevent the starvation using the following logic:
    
        if (there_is_client_data_to_read):
            read_the_client_data()
    
        if (the_client_socket_is_write_ready):
            push_scheduled_actions_to_clients()
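
In other words, the fixed loop keys the notification off the write-readiness of the socket rather than the absence of readable data. A minimal sketch of that pattern (simplified, not the literal commit):

    import select

    def process_once_fixed(client_sock, server_sock, pending_actions,
                           poll_interval=5.0):
        # Ask select() about read readiness and write readiness in one pass.
        rfds, wfds, efds = select.select([client_sock, server_sock],
                                         [client_sock], [], poll_interval)
        if rfds:
            # Service incoming client data as before ...
            for sock in rfds:
                sock.recv(4096)
        if client_sock in wfds:
            # ... but also push scheduled actions whenever the connection is
            # writable, so the push branch can no longer be starved.
            while pending_actions:
                pending_actions.pop(0)   # pretend each action gets pushed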

Comment 24 Milan Zázrivec 2012-04-10 13:46:24 UTC
If interested in testing the above fix, please seek for osa-dispatcher-5.10.44-1
from latest Spacewalk nightly to see whether it fixes the issue described
in this bug report.

Comment 25 mheldebr 2012-04-10 19:03:01 UTC
I have installed and tested the nightly osa-dispatcher-5.10.44-1 package on the server, restarted osa-dispatcher, and a target system now picked up an action shortly after I scheduled one. I will do some more testing to confirm that all systems are working as expected.

Comment 26 JDavis4102 2012-04-16 17:00:20 UTC
Would I be able to add the updated code to a 1.6 install of Spacewalk, or are there other modifications that were made that would require me to upgrade to 1.7?

Comment 27 Milan Zázrivec 2012-04-17 08:10:02 UTC
(In reply to comment #26)
> Would I be able to add the updated code to a 1.6 install of Spacewalk or is
> there other modifications that were made that I need to upgrade to 1.7?

Yes, the updated osa-dispatcher should work with Spacewalk 1.6 as well.

Comment 28 JDavis4102 2012-05-01 17:40:52 UTC
I have updated to the latest osa-dispatcher package and am still having the same issue. The log that is generated is below. It was stable for 1 week, almost to the hour. Not sure what could be going on.

/var/log/rhn/osa-dispatcher.log


2012/05/01 02:59:45 -07:00 5981 0.0.0.0: osad/jabber_lib.main('ERROR', 'Error caught:')
2012/05/01 02:59:45 -07:00 5981 0.0.0.0: osad/jabber_lib.main('ERROR', 'Traceback (most recent call last):\n  File "/usr/share/rhn/osad/jabber_lib.py", line 120, in main\n    self.process_forever(c)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 178, in process_forever\n    self.process_once(client)\n  File "/usr/share/rhn/osad/osa_dispatcher.py", line 182, in process_once\n    client.retrieve_roster()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 714, in retrieve_roster\n    stanza = self.get_one_stanza()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 786, in get_one_stanza\n    self.process(timeout=tm)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 1048, in process\n    self._parser.Parse(data)\n  File "/usr/lib/python2.4/site-packages/jabber/xmlstream.py", line 269, in unknown_endtag\n    self.dispatch(self._mini_dom)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 814, in _orig_dispatch\n    jabber.Client.dispatch(self, stanza)\n  File "/usr/lib/python2.4/site-packages/jabber/jabber.py", line 290, in dispatch\n    else: handler[\'func\'](self,stanza)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 369, in dispatch\n    callback(client, stanza)\n  File "/usr/share/rhn/osad/dispatcher_client.py", line 145, in _message_callback\n    sig = self._check_signature_from_message(stanza, actions)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 1310, in _check_signature_from_message\n    sig = self._check_signature(stanza, actions=actions)\n  File "/usr/share/rhn/osad/dispatcher_client.py", line 69, in _check_signature\n    row = lookup_client_by_name(x_client_id)\n  File "/usr/share/rhn/osad/dispatcher_client.py", line 214, in lookup_client_by_name\n    raise InvalidClientError(client_name)\nInvalidClientError: 910f7d384484c87b\n')
2012/05/01 02:59:55 -07:00 5981 0.0.0.0: osad/jabber_lib.__init__
2012/05/01 02:59:55 -07:00 5981 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', 'Removed for security')
2012/05/01 02:59:55 -07:00 5981 0.0.0.0: osad/jabber_lib.register('ERROR', 'Invalid password')

Comment 29 Milan Zázrivec 2012-05-02 12:05:36 UTC
(In reply to comment #28)

This actually seems to be a different problem than what was reported
originally:

    osad/jabber_lib.register('ERROR', 'Invalid password')

Is your auth db OK? (/var/lib/jabberd/db/authreg.db in the default setup).

Does removing jabberd db and restarting jabberd & osa-dispatcher help
the issue?

Comment 30 JDavis4102 2012-05-02 15:20:23 UTC
(In reply to comment #29)
> (In reply to comment #28)
> 
> This actually seems to be a different problem than what was reported
> originally:
> 
>     osad/jabber_lib.register('ERROR', 'Invalid password')
> 
> Is your auth db OK? (/var/lib/jabberd/db/authreg.db in the default setup).
> 
> Does removing jabberd db and restarting jabberd & osa-dispatcher help
> the issue?

This may be a different issue; as of right now, other than that error, everything seems to be up. It is hard to tell, as these both cause osad to not pick up actions.

How would I check to see if the auth db is ok?

When I get the "invalid password" error I have to stop jabber and osa-dispatcher then remove the jabberd db and then restart the services. This will get everything back up and going for a while until I have to do it again.

Comment 31 JDavis4102 2012-05-17 15:28:56 UTC
Any ideas as to comment #30? It seems that the original bug that was opened may have been resolved. It is hard to tell, as both cause osad to not pick up actions. Should I open a new bug regarding these new error logs that I am getting? Thank you for your time and have a great day!

Comment 32 Jan Pazdziora 2012-05-18 07:28:21 UTC
(In reply to comment #31)
> Any ideas as to comment #30? It seems that the original bug that was opened may
> have been resolved. It is hard to tell as both case osad to not pick up
> actions. Should I open a new bug regarding these new error logs that I am
> getting?

Yes. We consider the original

    2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)

issue resolved as per comment 23. And the bugzilla will be CLOSED as resolved when Spacewalk 1.8 is released.

Comment 33 Jan Pazdziora 2012-10-30 19:23:47 UTC
Moving ON_QA. Packages that address this bugzilla should now be available in yum repos at http://yum.spacewalkproject.org/nightly/

Comment 34 Jan Pazdziora 2012-11-01 16:19:03 UTC
Spacewalk 1.8 has been released: https://fedorahosted.org/spacewalk/wiki/ReleaseNotes18

