Bug 662593 - osa-dispatcher not notifying
Summary: osa-dispatcher not notifying
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: Server
Version: 1.2
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Milan Zázrivec
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On:
Blocks: space14
 
Reported: 2010-12-13 10:05 UTC by David Hrbáč
Modified: 2011-06-23 16:05 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 716064 (view as bug list)
Environment:
Last Closed: 2011-04-26 09:10:56 UTC
Embargoed:


Attachments
Vlado's solution (2.92 KB, patch)
2011-02-15 12:24 UTC, David Hrbáč
Spacewalk logs (11.14 KB, application/zip)
2011-03-05 01:05 UTC, JDavis4102
New-Spacewalk-OSA-Logs (4.91 KB, application/zip)
2011-03-08 05:30 UTC, JDavis4102
Traceback error (5.57 KB, text/plain)
2011-03-11 22:17 UTC, JDavis4102

Description David Hrbáč 2010-12-13 10:05:23 UTC
Description of problem:
After 1.1->1.2 upgrade we are not able to notify osad enabled clients. Log is full of:
2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)

The following steps don't resolve the issue:
sed -i "s/1\.1-client/1\.2-client/" /etc/yum.repos.d/spacewalk-client.repo
/etc/init.d/osad stop
rm -f /etc/sysconfig/rhn/osad-auth.conf
yum -y update osad yum-rhn-plugin rhnsd rhnlib rhn-setup rhn-client-tools rhn-check
/etc/init.d/osad restart
/etc/init.d/rhnsd restart

We also had this issue with Spacewalk 1.1 when clients had re-registered. Our workaround was to delete all osad-enabled clients from Spacewalk and the jabber DBs, and then re-register the clients. But this is a very time-consuming solution.

Version-Release number of selected component (if applicable):
jabberd-2.2.11-2.el5
jabberpy-0.5-0.17.el5
osad-5.9.44-1.el5
osa-dispatcher-5.9.44-1.el5
spacewalk-setup-1.2.16-1.el5
spacewalk-setup-jabberd-1.3.1-1.el5

Actual results:
2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)
Clients are not picking the tasks up.
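
To gauge how often the dispatcher logs this message, a log file can be scanned for it. The following is a minimal sketch, assuming every log line starts with the timestamp layout quoted above; the helper name `count_not_notifying` is hypothetical, and the second and third sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the timestamped line format quoted in this report and captures
# the date and the hour. The layout is assumed from the single quoted line.
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2}) (\d{2}):\d{2}:\d{2} .*Not notifying jabber nodes"
)


def count_not_notifying(lines):
    """Return a Counter mapping 'YYYY/MM/DD HH' to the number of matching lines."""
    per_hour = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            per_hour[f"{match.group(1)} {match.group(2)}"] += 1
    return per_hour


# The first line is quoted from the report; the other two are illustrative.
sample = [
    "2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)",
    "2010/12/13 10:56:48 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)",
    "2010/12/13 11:01:02 +02:00 31368 0.0.0.0: some other message",
]
counts = count_not_notifying(sample)
```

For a real log, the list can be replaced by an open file object, since the function only iterates over lines.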

Expected results:
Clients should pick the tasks up seamlessly.

Comment 1 Marcus Moeller 2011-01-11 13:56:24 UTC
Dear David,

are you able to ping the clients? As here, OSA ping works but actions are not picked up.

Kind Regards
Marcus

Comment 2 David Hrbáč 2011-01-11 14:59:47 UTC
Marcus,
we had been able to ping, but one day the clients stopped picking up tasks. For the past few weeks we have not even been able to ping. Clients pick up actions only via rhn_check. BTW, ping does not work even to the Spacewalk server itself. :o(
Regards,
David Hrbáč

Comment 3 Marcus Moeller 2011-01-13 14:41:00 UTC
Dear David,

it seems to depend on the number of clients registered to jabberd.

On a large number of systems jabber_connection.jid_available(jabber_id) seems to report that the Node is not available (even if it is).

Next,

 rfds, wfds, efds = select.select([client, self._tcp_server], [], [], npi)

detects the client as rfds which leads to 'Not notifying jabber nodes'
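
For readers unfamiliar with the call quoted above, here is a small sketch of what select.select() reports. It does not reproduce the osa-dispatcher logic: a local socket pair stands in for the jabber client socket (an assumption made purely for illustration). A socket shows up in rfds only when it has data waiting to be read; otherwise the call returns empty lists once the timeout expires.

```python
import select
import socket

# Two connected sockets stand in for the dispatcher's jabber client socket
# and its peer. This is an illustration, not the osa-dispatcher code.
client, peer = socket.socketpair()

# Nothing has been sent yet: after the 0.1 s timeout, all lists are empty.
rfds, wfds, efds = select.select([client], [], [], 0.1)
idle_ready = client in rfds

# Once the peer sends data, the client socket is reported as readable.
peer.sendall(b"<presence/>")
rfds, wfds, efds = select.select([client], [], [], 0.1)
data_ready = client in rfds

client.close()
peer.close()
```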

Kind Regards
Marcus

Comment 4 David Hrbáč 2011-01-14 07:12:07 UTC
Marcus,
Well, not sure about it. We have only 14 boxes registered within our testing Spacewalk instance. I'm very disappointed with Spacewalk so far. We have been evaluating Spacewalk for a few weeks. I have found a lot of bugs, reported them, and found workarounds during this period.
Regards,
David

Comment 5 Marcus Moeller 2011-01-14 08:09:31 UTC
Dear David,

we have a production and a test environment, both running 1.2 and both upgraded from previous releases.

Within the production environment we manage about 500 clients, the test env is (as the name implies) just for testing purpose.

I have registered one of the clients to the test environment to make sure that it's not a general problem and ping/remote commands work there.

Unregistering a system from production env and re-registering does not seem to help so I tried to figure out the differences.

Our testing environment has just a few systems connected (2-3), so this might be the reason.

Overall, the connection handling is just broken atm. It does not correctly detect whether a connection is established (as mentioned in Comment 3). I am not yet sure under what circumstances that happens, but I guess it has something to do with the number of connected clients.

Kind Regards
Marcus

Comment 6 JDavis4102 2011-02-02 03:05:05 UTC
I am also having this issue.

I can ping but no clients are picking up the actions.

I have just added about 10 servers and it seems that this broke osad from picking up actions. It was working before this.

I have restarted everything. Removed the DB and auth configurations and restarted osad and nothing. The only thing I have not done is remove all clients. I would like to not do that.

Comment 7 JDavis4102 2011-02-02 17:35:46 UTC
Is there any ETA as to a fix for this? We use OSAD/OSA-dispatcher for our normal system management and would hate to lose this functionality. Thank you for your time and have a great day!

Comment 8 Milan Zázrivec 2011-02-03 18:38:38 UTC
(In reply to comment #0)
> Description of problem:
> After 1.1->1.2 upgrade we are not able to notify osad enabled clients. Log is
> full of:
> 2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not
> notifying jabber nodes',)
> 
> The following steps don't resolve the issue:
> sed -i "s/1\.1-client/1\.2-client/" /etc/yum.repos.d/spacewalk-client.repo
> /etc/init.d/osad stop
> rm -f /etc/sysconfig/rhn/osad-auth.conf
> yum -y update osad yum-rhn-plugin rhnsd rhnlib rhn-setup rhn-client-tools
> rhn-check
> /etc/init.d/osad restart
> /etc/init.d/rhnsd restart

Why the above procedure?

> We had this issue also with 1.1 Spacewalk, when clients had re-registered. Our
> workaround was to delete all osad enabled clients from Spacewalk, jabber DBs,
> and re-register clients. But this is very time consuming solution.

As far as system re-registration is concerned, there's a separate bug report
dealing with a situation where the push functionality stops working after
system's re-registration:

    https://bugzilla.redhat.com/show_bug.cgi?id=590608

Comment 9 Milan Zázrivec 2011-02-03 18:45:54 UTC
(In reply to comment #1)
> Dear David,
> 
> are you able to ping the clients? As here, OSA ping works but actions are not
> picked up.

What does 'OSA ping works' mean exactly?

In the Spacewalk webui, do you see the system online?

When pinging the system via webui, do you see the ping time stamp being updated?

Comment 10 Milan Zázrivec 2011-02-03 18:51:08 UTC
(In reply to comment #6)
> I am also having this issue.
> 
> I can ping but no clients are picking up the actions.
> 
> I have just added about 10 servers and it seems that this broke osad from
> picking up actions. It was working before this.
> 
> I have restarted everything. Removed the DB and auth configurations and
> restarted osad and nothing. The only thing I have not done is remove all
> clients. I would like to not do that.

Could you please try to do the following on your Spacewalk server:

# service osa-dispatcher stop
# service jabberd stop

Edit following three files:

/etc/jabberd/c2s.xml
/etc/jabberd/s2s.xml
/etc/jabberd/router.xml

and change the <max_fds>...</max_fds> value in each of them (e.g. double it).

# service jabberd start
# service osa-dispatcher start
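
The max_fds edit can also be scripted. The following is a rough sketch, assuming the value appears as a plain <max_fds>N</max_fds> element; the function name `double_max_fds` and the surrounding <io> element in the sample are illustrative assumptions, not a copy of the real configuration files.

```python
import re


def double_max_fds(xml_text):
    """Return xml_text with every <max_fds>N</max_fds> value doubled."""

    def _double(match):
        return f"<max_fds>{int(match.group(1)) * 2}</max_fds>"

    return re.sub(r"<max_fds>\s*(\d+)\s*</max_fds>", _double, xml_text)


# Illustrative fragment only; the real files contain much more.
sample = "<io>\n  <max_fds>1024</max_fds>\n</io>\n"
updated = double_max_fds(sample)
```

To apply it, each of the three files would be read, passed through the function, and written back while jabberd and osa-dispatcher are stopped; keeping a backup copy of each file first seems prudent.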

Comment 11 JDavis4102 2011-02-04 00:38:29 UTC
Milan,

I have made the changes as requested. 

I performed a ping on all clients and the ping information has updated on the WEB UI. I then scheduled a remote command to run and the servers didn't pick up the action as expected.

It seems modifying the max_fds setting (from 1024 to 2048) has not resolved my issue.

Thank you for your time and have a great day!

Comment 12 David Hrbáč 2011-02-04 08:37:48 UTC
Milan,
those settings have nothing to do with this issue. We are experiencing this issue even with small Spacewalk instances having about 15 clients.
Thanks,
David Hrbáč

Comment 13 David Hrbáč 2011-02-10 09:42:47 UTC
Hi,
neither the upgrade to Spacewalk 1.3 nor the latest osad-5.9.55-1 solves the issue. I'm still not able to ping the clients. osa-dispatcher is still not sending notifications.
Regards,
David Hrbáč

Comment 14 JDavis4102 2011-02-14 19:53:59 UTC
Hello Milan, 

Has there been any progress with this issue? Thank you for your time and have a great day!

Comment 15 Vlado Motoska 2011-02-15 11:23:31 UTC
Hi, I've spent some time on this issue. After looking at osad/jabber_lib.py I just commented out the roster in the jabber server and restarted the clients and dispatcher. Now it works like a dream.

It looks like the client can't get the subscription to dispatcher when the roster is enabled.

Comment 16 David Hrbáč 2011-02-15 12:23:40 UTC
Vlado,
thanks for the point. It really seems to help. Just for the record: Vlado is talking about /etc/jabberd/sm.xml and commenting out roster* within the file. I'm attaching the patch.
Thanks!
David
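
For illustration, the workaround can be sketched as a small text transformation. This is not the attached patch: it assumes each roster entry sits on its own line as <module>roster...</module>, and the chain and module names in the sample are illustrative rather than copied from a real sm.xml.

```python
import re

# Matches a whole line holding an uncommented <module>roster...</module> entry.
ROSTER_RE = re.compile(r"^(\s*)(<module>roster[^<]*</module>)\s*$")


def comment_out_roster(xml_text):
    """Wrap every roster* module line in an XML comment, keeping indentation."""
    out = []
    for line in xml_text.splitlines():
        match = ROSTER_RE.match(line)
        if match:
            out.append(f"{match.group(1)}<!-- {match.group(2)} -->")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Illustrative fragment; a real sm.xml has more chains and modules.
sample = (
    "<chain id='in-sess'>\n"
    "  <module>roster</module>\n"
    "  <module>presence</module>\n"
    "</chain>\n"
)
patched = comment_out_roster(sample)
```

Lines that are already commented out do not match the pattern, so running the function a second time leaves the text unchanged.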

Comment 17 David Hrbáč 2011-02-15 12:24:46 UTC
Created attachment 478869 [details]
Vlado's solution

Comment 18 Tarun Reddy 2011-02-15 20:07:02 UTC
I can validate that removing the roster does indeed seem to "fix" the issue on 1.2. Off to try against 1.3

Comment 19 Tarun Reddy 2011-02-15 22:55:39 UTC
Works for 1.3 issues as well. I am happily back in the land of a working osad-enabled Spacewalk install.

Thanks Vlado!

Comment 20 JDavis4102 2011-02-15 22:59:01 UTC
I can validate that removing the roster modules does resolve this issue on 1.2.

Thank you so much!!! :)

Comment 21 Ron Helzer 2011-02-16 17:11:42 UTC
Works for me in 1.3; thank you!

I had to perform these additional steps after modifying sm.xml:

On the Spacewalk / Jabber server,

  service osa-dispatcher stop
  service jabberd stop
  rm -f /var/lib/jabberd/db/*
  service jabberd start
  service osa-dispatcher start

On the clients,

  service osad stop
  rm /etc/sysconfig/rhn/osad-auth.conf
  service osad start

Comment 22 Milan Zázrivec 2011-02-21 17:31:55 UTC
Fixed in spacewalk.git master: 2ca8629f4d2bd681bd1db48b4672059fb1cdc653

The fix above is to ensure presence subscription works with standard
Spacewalk jabberd setup as created by spacewalk-setup-jabberd (i.e.
no need to disable roster module at all).

Comment 23 JDavis4102 2011-02-24 23:31:59 UTC
Milan, 

I have tried the fix above by modifying my /usr/share/rhn/osad/osa_dispatcher.py with the changes noted in the latest patch. I have removed the module roster comments and restarted all services. It doesn't resolve this issue. 

I added the comments back to the modules and everything came up and started working again without modifying /usr/share/rhn/osad/osa_dispatcher.py.

So right now I have /usr/share/rhn/osad/osa_dispatcher.py modified as you submitted with <module>roster*</module> commented and osad is picking up actions as expected.

One issue that is now in my environment is that when a server reboots or osad is restarted it doesn't fully connect to the Spacewalk server (This issue was happening before modifying osa_dispatcher.py). In order to get OSAD to attach back to the server I have to remove /etc/sysconfig/rhn/osad-auth.conf and then restart OSAD and then everything starts as expected.

Comment 24 Milan Zázrivec 2011-02-25 14:35:08 UTC
(In reply to comment #23)
> Milan, 
> 
> I have tried the fix above by modifying my
> /usr/share/rhn/osad/osa_dispatcher.py with the changes noted in the latest
> patch. I have removed the module roster comments and restarted all services.
> It doesn't resolve this issue.

Actually, you'd need to do the following:

1. Stop osad(s) on the client(s)
2. Stop osa-dispatcher and jabberd on the Spacewalk server
3. Apply the patch from comment #22
4. On the Spacewalk server: rm -f /var/lib/jabberd/db/*
5. Make sure you're using standard Spacewalk configured sm.xml
6. Start jabberd
7. Start osa-dispatcher
8. Start osad(s) on your client(s)
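
Since the order matters here, the eight steps can be written down as data and printed as a checklist. This is only a dry-run sketch: nothing is executed, the function names are hypothetical, and the service commands are the ones used elsewhere in this report (steps 3 and 5 remain manual).

```python
SERVER = "spacewalk-server"
CLIENT = "client"


def reset_plan():
    """Return the ordered (where, action) steps from the list above as data."""
    return [
        (CLIENT, "service osad stop"),
        (SERVER, "service osa-dispatcher stop"),
        (SERVER, "service jabberd stop"),
        (SERVER, "manual: apply the patch from comment #22"),
        (SERVER, "rm -f /var/lib/jabberd/db/*"),
        (SERVER, "manual: confirm sm.xml is the standard Spacewalk-configured one"),
        (SERVER, "service jabberd start"),
        (SERVER, "service osa-dispatcher start"),
        (CLIENT, "service osad start"),
    ]


def render(plan):
    """Format the plan as a numbered checklist; this runs no commands."""
    return "\n".join(
        f"{i}. [{where}] {action}" for i, (where, action) in enumerate(plan, 1)
    )


plan = reset_plan()
text = render(plan)
```

Printing `text` gives a checklist that can be followed by hand or fed to whatever remote-execution tooling is in use.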

Comment 25 JDavis4102 2011-02-25 17:43:03 UTC
Milan, 

I performed the steps listed and it seems to have resolved my issue in my Development and Test environment. However, the issue was not initially seen in these environments, so I am not sure whether this has fully resolved it. I will be pushing to the Production environment soon. Once it is in my Production environment I will let you know how it turns out.

Thank you for your work and hope you have a great day!

Kind regards,
JD

Comment 26 JDavis4102 2011-03-04 18:53:01 UTC
I have released to the Production environment and it seems that this patch has not resolved my issue. OSA ping works but actions are not being sent to the client. 

This has resolved the issue of being able to restart osad without having to remove /etc/sysconfig/rhn/osad-auth.conf.

Comment 27 Milan Zázrivec 2011-03-04 19:14:47 UTC
(In reply to comment #26)
> I have released to the Production environment and it seems that this patch has
> not resolved my issue. OSA ping works but actions are not being sent to the
> client. 
> 
> This has resolved the issue of being able to restart osad without having to
> remove /etc/sysconfig/rhn/osad-auth.conf.

This is quite odd actually. Could you please do the following for me?

1. Perform steps one to six from comment #24
2. On your Spacewalk as root:
    # osa-dispatcher -N -vvvvvvvvvv >& osa-dispatcher.log
3. On one of your clients as root:
    # osad -N -vvvvvvvvvv >& osad.log

Once you see the client system in question shows as online in webui, try
ping, then schedule some remote action. Have it running for a while
(e.g. a minute), then Ctrl-C osad and osa-dispatcher and attach both
log files to this bug report (feel free to obfuscate hostnames in
the log files if you don't feel like exposing it).
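
For anyone who wants to obfuscate hostnames before attaching logs, a small sketch follows. It assumes you supply the list of hostnames yourself; the function name `obfuscate_hostnames`, the example.com names, and the sample log line are all made up for illustration.

```python
import re


def obfuscate_hostnames(text, hostnames):
    """Replace each given hostname with a stable placeholder such as host1."""
    mapping = {name: f"host{i}" for i, name in enumerate(hostnames, 1)}
    if not mapping:
        return text
    # Try longer names first so a full name wins over a shorter one it contains.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Made-up log line and hostnames, used only to show the substitution.
sample = "connecting to spacewalk.example.com from client01.example.com\n"
cleaned = obfuscate_hostnames(
    sample, ["spacewalk.example.com", "client01.example.com"]
)
```

Because each hostname always maps to the same placeholder, the relationship between lines in osad.log and osa-dispatcher.log stays readable after the substitution.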

Comment 28 JDavis4102 2011-03-04 20:30:07 UTC
Is there any way to do this on only one of the clients? I have about 300+ servers connected and it takes a while to make the change on them all.

Comment 29 Milan Zázrivec 2011-03-04 20:57:55 UTC
(In reply to comment #28)
> Is there anyway to only do this on one of the clients. I have about 300+
> servers connected and it takes a while to make the change on them all.

OK, in that case just do steps 2. and 3. from comment #27 and attach both
log files please.

Comment 30 JDavis4102 2011-03-05 01:05:09 UTC
Created attachment 482404 [details]
Spacewalk logs

Comment 31 JDavis4102 2011-03-05 01:06:40 UTC
It seems that the issue has become intermittent. After performing the requested actions it seems to have started working. Not sure what is going on now.

Comment 32 Milan Zázrivec 2011-03-07 10:38:26 UTC
(In reply to comment #31)
> It seems that the issue has become intermittent. After performing the requested
> actions it seems to have started working. Not sure what is going on now.

OK, I see both osa-dispatcher and osad subscribed to each other's presence,
ping works, client picked up the scheduled action.

Seeing you have about 300+ systems connected to your Spacewalk, may I also
suggest to increase max_fds settings as suggested in comment #10 (I saw
cases where this was necessary in environments with many client systems).

Thanks.

Comment 33 JDavis4102 2011-03-08 05:30:27 UTC
Created attachment 482838 [details]
New-Spacewalk-OSA-Logs

I have performed the actions you have requested. It doesn't seem to resolve my issue. 

However, I was able to replicate the original issue even with max_fds modifications. Attached you will find the logs in question.

Comment 34 JDavis4102 2011-03-09 19:21:04 UTC
Any ideas as to what I could try next? I thank you for the assistance you have provided thus far.

Comment 35 JDavis4102 2011-03-11 22:06:30 UTC
I have some updates to this issue. I currently have OSAD actions working. I have restarted each client and removed the jabber DB and everything started working. Then about 24-36 hours later everything stopped receiving actions. I restarted osa-dispatcher and it started working again. I am not sure what may be causing this and if this is the same issue. Please let me know if I need to open another bug for this issue.

Comment 36 JDavis4102 2011-03-11 22:17:50 UTC
Created attachment 483834 [details]
Traceback error

I also started receiving traceback logs from a proxy server when I perform a remote command action.

Comment 37 Trent Johnson 2011-03-13 04:25:14 UTC
I was experiencing problems similar to others posting here.  Pings would work, but osad was not picking up actions.  Spacewalk was upgraded from 1.0->1.1->1.2->1.3  and osad stopped working somewhere along the way.

To get my jabber configuration back to normal, I ended up doing:
On spacewalk server:
service osa-dispatcher stop
rpm --nodeps --erase jabberd
mv /etc/jabberd /etc/jabberd.old
rm -f /var/lib/jabberd/db/*
yum install jabberd
spacewalk-setup-jabberd
service jabberd start
service osa-dispatcher start

Then on clients I had to run:
service osad stop
rm -f /etc/sysconfig/rhn/osad-auth.conf
service osad start

Now osad is working, and hopefully will still be working tomorrow.

Comment 38 Milan Zázrivec 2011-03-14 10:51:22 UTC
(In reply to comment #36)
> Created attachment 483834 [details]
> Traceback error

The important part from the traceback shown is:

    ORA-12519: TNS:no appropriate service handler found

which simply suggests osa-dispatcher had problems with database connection.

When setting up the Oracle DB (XE or 10g / 11g), did you alter system
processes as described in https://fedorahosted.org/spacewalk/wiki/OracleXeSetup ?

Specifically I mean that 

    alter system set processes = 400 scope=spfile;

part. If so, what value did you set the processes to? Our documentation
suggests 400, but if that's not enough, increasing the amount and restarting
the server may help.
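
To check quickly whether a traceback is database-related, the Oracle error codes can be pulled out of it. A minimal sketch, assuming codes appear in the usual ORA-NNNNN form; the helper name is hypothetical and the sample is an abbreviated placeholder that contains only the error line quoted above.

```python
import re

ORA_RE = re.compile(r"\bORA-(\d{5})\b")


def oracle_error_codes(traceback_text):
    """Return the distinct ORA-NNNNN codes in order of first appearance."""
    seen = []
    for code in ORA_RE.findall(traceback_text):
        if code not in seen:
            seen.append(code)
    return ["ORA-" + code for code in seen]


# Abbreviated placeholder; only the ORA line is taken from this report.
sample = (
    "Traceback (most recent call last):\n"
    "  ...\n"
    "ORA-12519: TNS:no appropriate service handler found\n"
)
codes = oracle_error_codes(sample)
```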

> I also started receiving traceback logs from a proxy server when I perform a
> remote command action.

Are we talking Spacewalk proxy here? What do those tracebacks look like?

Comment 39 JDavis4102 2011-03-14 16:05:08 UTC
I have an Oracle Standalone 11g DB. I have not made the recommended change, as I thought it was only for XE and not 11g. I will go ahead and make the change. One question before I do: is the suggested 400 per client? If so, I would need to bump that number up a lot.

When I said proxy server, I meant that I was getting the tracebacks from the Application server that was having problems with the proxy. The traceback I uploaded to this bug comes from those logs.

Could the issue with connections be the cause for the clients to not receive actions?

Comment 40 Milan Zázrivec 2011-03-14 16:26:45 UTC
(In reply to comment #39)
> I have an Oracle Standalone 11g DB. I have not made the recommended change
> as I thought it was only for XE and not 11g.

Sure. But there are several resources on the internet suggesting that increasing
the db processes helps with ORA-12519 (regardless of what Oracle version
you're using).

> I will go ahead and make the change.
> One question before I do is the suggestion 400 for each client? If so I would
> need to bump that number up a lot.

You mean altering the number of processes to (400 * no_of_clients) ? No.

Just check what the current settings are (in sqlplus, type:
show parameter processes) and try to increase the value.

> When I said proxy server I was getting the tracebacks from the Application
> server that was having problems with the proxy. The traceback I have uploaded
> to this bug are these logs.
> 
> Could the issue with connections be the cause for the clients to not receive
> actions?

Yes. osa-dispatcher is the component which retrieves the scheduled actions
from database (and therefore needs a db connection) and pushes them to the
client systems via jabber network.

Comment 41 JDavis4102 2011-03-14 17:18:34 UTC
Currently it is set to 150. I will go ahead and make the changes. Just a heads up: I restarted the osa-dispatcher service, and since the restart everything seems to be working. It has continued to work over the weekend and today. Not sure what is going on now.

Comment 42 JDavis4102 2011-03-29 18:42:35 UTC
Issue appears to be resolved.

Comment 43 Miroslav Suchý 2011-04-11 07:45:19 UTC
Mass moving to ON_QA before release of Spacewalk 1.4

Comment 44 Miroslav Suchý 2011-04-26 09:10:56 UTC
Spacewalk 1.4 has been released

Comment 45 JDavis4102 2011-06-23 16:05:07 UTC
Issue seems to have reappeared after applying patch and performing steps above.

