Description of problem:
After the 1.1->1.2 upgrade we are not able to notify osad-enabled clients. The log is full of:

2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)

The following steps don't resolve the issue:

sed -i "s/1\.1-client/1\.2-client/" /etc/yum.repos.d/spacewalk-client.repo
/etc/init.d/osad stop
rm -f /etc/sysconfig/rhn/osad-auth.conf
yum -y update osad yum-rhn-plugin rhnsd rhnlib rhn-setup rhn-client-tools rhn-check
/etc/init.d/osad restart
/etc/init.d/rhnsd restart

We also had this issue with Spacewalk 1.1 when clients had re-registered. Our workaround was to delete all osad-enabled clients from Spacewalk and from the jabber DBs, and then re-register the clients. But this is a very time-consuming solution.

Version-Release number of selected component (if applicable):
jabberd-2.2.11-2.el5
jabberpy-0.5-0.17.el5
osad-5.9.44-1.el5
osa-dispatcher-5.9.44-1.el5
spacewalk-setup-1.2.16-1.el5
spacewalk-setup-jabberd-1.3.1-1.el5

Actual results:
2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not notifying jabber nodes',)
Clients are not picking the tasks up.

Expected results:
Clients should pick the tasks up seamlessly.
Dear David, are you able to ping the clients? As here, OSA ping works but actions are not picked up. Kind Regards Marcus
Marcus, we used to be able to ping, but one day the clients stopped picking up tasks. For a few weeks now we have not even been able to ping. Clients pick up actions with rhn_check only. BTW, ping does not work even against the Spacewalk server itself. :o( Regards, David Hrbáč
Dear David, it seems to depend on the number of clients registered to jabberd. With a large number of systems, jabber_connection.jid_available(jabber_id) seems to report that the node is not available (even if it is). Next,

rfds, wfds, efds = select.select([client, self._tcp_server], [], [], npi)

returns the client in rfds, which leads to 'Not notifying jabber nodes'.

Kind Regards Marcus
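For readers unfamiliar with the call quoted above, here is a minimal, self-contained sketch of how select.select() reports readable descriptors. It is not the osa-dispatcher code: plain socket pairs stand in for the jabber client and the dispatcher's TCP server, and the helper name readable() is made up for this illustration.

```python
import select
import socket


def readable(socks, timeout):
    """Return the subset of socks that select() reports as ready for reading."""
    rfds, _wfds, _efds = select.select(socks, [], [], timeout)
    return rfds


# Hypothetical stand-ins: `client` plays the jabber client connection,
# `idle` plays a connection with nothing pending.
client, peer = socket.socketpair()
idle, idle_peer = socket.socketpair()

# Pending inbound data is what makes a socket show up in rfds.
peer.sendall(b"presence")

ready = readable([client, idle], 1.0)
print("client readable:", client in ready)
print("idle readable:", idle in ready)

for s in (client, peer, idle, idle_peer):
    s.close()
```

A socket appears in rfds as soon as it has unread inbound data, so a client connection with pending traffic is returned on every call until that data is consumed.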
Marcus, well, I'm not sure about that. We have only 14 boxes registered in our testing Spacewalk instance. I'm very disappointed with Spacewalk so far. We have been evaluating Spacewalk for a few weeks. During this period I have found a lot of bugs, reported them, and found workarounds. Regards, David
Dear David, we have a production and a test environment, both running 1.2 and both upgraded from previous releases. Within the production environment we manage about 500 clients; the test environment is (as the name implies) just for testing purposes. I have registered one of the clients to the test environment to make sure that it's not a general problem, and ping/remote commands work there. Unregistering a system from the production environment and re-registering it does not seem to help, so I tried to figure out the differences. Our testing environment has just a few systems connected (2-3), so this might be the reason. Overall, the connection handling is just broken at the moment. It does not correctly detect whether a connection is established or not (as mentioned in Comment 3). I am not yet sure under what circumstances that happens, but I guess it has something to do with the number of connected clients. Kind Regards Marcus
I am also having this issue. I can ping but no clients are picking up the actions. I have just added about 10 servers and it seems that this broke osad from picking up actions. It was working before this. I have restarted everything. Removed the DB and auth configurations and restarted osad and nothing. The only thing I have not done is remove all clients. I would like to not do that.
Is there any ETA as to a fix for this? We use OSAD/OSA-dispatcher for our normal system management and would hate to lose this functionality. Thank you for your time and have a great day!
(In reply to comment #0)
> Description of problem:
> After 1.1->1.2 upgrade we are not able to notify osad enabled clients. Log is
> full of:
> 2010/12/13 10:56:38 +02:00 31368 0.0.0.0: osad/osa_dispatcher.process_once('Not
> notifying jabber nodes',)
>
> The following steps don't resolve the issue:
> sed -i "s/1\.1-client/1\.2-client/" /etc/yum.repos.d/spacewalk-client.repo
> /etc/init.d/osad stop
> rm -f /etc/sysconfig/rhn/osad-auth.conf
> yum -y update osad yum-rhn-plugin rhnsd rhnlib rhn-setup rhn-client-tools
> rhn-check
> /etc/init.d/osad restart
> /etc/init.d/rhnsd restart

Why the above procedure?

> We had this issue also with 1.1 Spacewalk, when clients had re-registered. Our
> workaround was to delete all osad enabled clients from Spacewalk, jabber DBs,
> and re-register clients. But this is very time consuming solution.

As far as system re-registration is concerned, there's a separate bug report dealing with a situation where the push functionality stops working after system's re-registration:
https://bugzilla.redhat.com/show_bug.cgi?id=590608
(In reply to comment #1)
> Dear David,
>
> are you able to ping the clients? As here, OSA ping works but actions are not
> picked up.

What does 'OSA ping works' mean exactly? In the Spacewalk webui, do you see the system online? When pinging the system via webui, do you see the ping time stamp being updated?
(In reply to comment #6)
> I am also having this issue.
>
> I can ping but no clients are picking up the actions.
>
> I have just added about 10 servers and it seems that this broke osad from
> picking up actions. It was working before this.
>
> I have restarted everything. Removed the DB and auth configurations and
> restarted osad and nothing. The only thing I have not done is remove all
> clients. I would like to not do that.

Could you please try to do the following on your Spacewalk server:

# service osa-dispatcher stop
# service jabberd stop

Edit the following three files:

/etc/jabberd/c2s.xml
/etc/jabberd/s2s.xml
/etc/jabberd/router.xml

and change the <max_fds>...</max_fds> value in each of them (e.g. double it).

# service jabberd start
# service osa-dispatcher start
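The max_fds edit above could be scripted roughly as follows. This is only a sketch: the helper name scale_max_fds() and the sample XML fragment are made up for illustration, and the surrounding <io> element is an assumption about how the jabberd files are laid out. It operates on text so it can be tried without touching the real files.

```python
import re

# Matches <max_fds>NUMBER</max_fds>, tolerating whitespace around the number.
MAX_FDS_RE = re.compile(r"(<max_fds>\s*)(\d+)(\s*</max_fds>)")


def scale_max_fds(xml_text, factor=2):
    """Return xml_text with every <max_fds> value multiplied by factor."""
    def repl(match):
        new_value = int(match.group(2)) * factor
        return f"{match.group(1)}{new_value}{match.group(3)}"
    return MAX_FDS_RE.sub(repl, xml_text)


# Assumed sample fragment, not copied from a real c2s.xml.
sample = "<io>\n  <max_fds>1024</max_fds>\n</io>"
print(scale_max_fds(sample))
```

To apply it for real, one would read each of the three files listed above, pass the contents through scale_max_fds(), and write the result back while jabberd and osa-dispatcher are stopped, keeping a backup copy of each file.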
Milan, I have made the changes as requested. I performed a ping on all clients and the ping information has updated on the WEB UI. I then scheduled a remote command to run and the servers didn't pick up the action as expected. It seems modifying the max_fds setting (from 1024 to 2048) has not resolved my issue. Thank you for your time and have a great day!
Milan, those settings have nothing to do with this issue. We are experiencing this issue even with small Spacewalk instances having about 15 clients. Thanks, David Hrbáč
Hi, neither the upgrade to Spacewalk 1.3 nor the latest osad-5.9.55-1 solves the issue. I'm still not able to ping the clients. osa-dispatcher is still not sending the notifications. Regards, David Hrbáč
Hello Milan, Has there been any progress with this issue? Thank you for your time and have a great day!
Hi, I've spent some time on this issue. After looking at osad/jabber_lib.py I just commented out the roster in the jabber server and restarted the clients and the dispatcher. Now it works like a dream. It looks like the client can't get the subscription to the dispatcher when the roster is enabled.
Vlado, thanks for the pointer. It really seems to help. Just for the record: Vlado is talking about /etc/jabberd/sm.xml and commenting out roster* within the file. I'm attaching the patch. Thanks! David
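For illustration only, the kind of change described here (commenting out the roster* module entries in sm.xml) could be sketched as below. This is not the attached patch: the helper name comment_out_roster() and the sample chain fragment are assumptions, and the sketch treats any <module> entry whose name starts with "roster" as a match.

```python
import re

# Matches <module>roster...</module> entries that are not already wrapped
# in a comment produced by this helper ("<!-- " immediately before them).
ROSTER_RE = re.compile(r"(?<!<!-- )<module>\s*roster[^<]*</module>")


def comment_out_roster(xml_text):
    """Wrap every roster* <module> entry in an XML comment."""
    return ROSTER_RE.sub(lambda m: f"<!-- {m.group(0)} -->", xml_text)


# Assumed sample fragment, not copied from a real sm.xml.
sample = (
    "<chain id='sess-start'>\n"
    "  <module>roster</module>\n"
    "  <module>status</module>\n"
    "</chain>"
)
print(comment_out_roster(sample))
```

Running the helper a second time leaves its own comments untouched, so it is safe to re-apply. Entries commented out by hand with different spacing are not detected and should be checked manually.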
Created attachment 478869
Vlado's solution
I can validate that removing the roster does indeed seem to "fix" the issue on 1.2. Off to try against 1.3
Works for the 1.3 issues as well. I am happily back in the land of a working osad-enabled Spacewalk install. Thanks Vlado!
I can validate that removing the roster modules does resolve this issue on 1.2. Thank you so much!!! :)
Works for me in 1.3; thank you! I had to perform these additional steps after modifying sm.xml:

On the Spacewalk / Jabber server:

service osa-dispatcher stop
service jabberd stop
rm -f /var/lib/jabberd/db/*
service jabberd start
service osa-dispatcher start

On the clients:

service osad stop
rm /etc/sysconfig/rhn/osad-auth.conf
service osad start
Fixed in spacewalk.git master: 2ca8629f4d2bd681bd1db48b4672059fb1cdc653 The fix above is to ensure presence subscription works with standard Spacewalk jabberd setup as created by spacewalk-setup-jabberd (i.e. no need to disable roster module at all).
Milan, I have tried the fix above by modifying my /usr/share/rhn/osad/osa_dispatcher.py with the changes noted in the latest patch. I have removed the module roster comments and restarted all services. It doesn't resolve this issue. I added the comments back to the modules and everything came up and started working again without modifying /usr/share/rhn/osad/osa_dispatcher.py. So right now I have /usr/share/rhn/osad/osa_dispatcher.py modified as you submitted with <module>roster*</module> commented and osad is picking up actions as expected. One issue that is now in my environment is that when a server reboots or osad is restarted it doesn't fully connect to the Spacewalk server (This issue was happening before modifying osa_dispatcher.py). In order to get OSAD to attach back to the server I have to remove /etc/sysconfig/rhn/osad-auth.conf and then restart OSAD and then everything starts as expected.
(In reply to comment #23)
> Milan,
>
> I have tried the fix above by modifying my
> /usr/share/rhn/osad/osa_dispatcher.py with the changes noted in the latest
> patch. I have removed the module roster comments and restarted all services.
> It doesn't resolve this issue.

Actually, you'd need to do the following:

1. Stop osad(s) on the client(s)
2. Stop osa-dispatcher and jabberd on the Spacewalk server
3. Apply the patch from comment #22
4. On the Spacewalk server: rm -f /var/lib/jabberd/db/*
5. Make sure you're using standard Spacewalk configured sm.xml
6. Start jabberd
7. Start osa-dispatcher
8. Start osad(s) on your client(s)
Milan, I performed the steps listed and it seems to have resolved my issue in my Development and Test environments. However, this issue wasn't initially seen in these environments, so I am not sure whether this has fully resolved it. I will be pushing to the Production environment soon. Once it is in my Production environment I will let you know how it turns out. Thank you for your work and hope you have a great day! Kind regards, JD
I have released to the Production environment and it seems that this patch has not resolved my issue. OSA ping works but actions are not being sent to the client. This has resolved the issue of being able to restart osad without having to remove /etc/sysconfig/rhn/osad-auth.conf.
(In reply to comment #26)
> I have released to the Production environment and it seems that this patch has
> not resolved my issue. OSA ping works but actions are not being sent to the
> client.
>
> This has resolved the issue of being able to restart osad without having to
> remove /etc/sysconfig/rhn/osad-auth.conf.

This is quite odd actually. Could you please do the following for me?

1. Perform steps one to six from comment #24
2. On your Spacewalk server as root:
   # osa-dispatcher -N -vvvvvvvvvv >& osa-dispatcher.log
3. On one of your clients as root:
   # osad -N -vvvvvvvvvv >& osad.log

Once you see that the client system in question shows as online in the webui, try a ping, then schedule some remote action. Keep both running for a while (e.g. a minute), then Ctrl-C osad and osa-dispatcher and attach both log files to this bug report (feel free to obfuscate hostnames in the log files if you don't feel like exposing them).
Is there any way to do this on only one of the clients? I have about 300+ servers connected and it takes a while to make the change on all of them.
(In reply to comment #28)
> Is there anyway to only do this on one of the clients. I have about 300+
> servers connected and it takes a while to make the change on them all.

OK, in that case just do steps 2. and 3. from comment #27 and attach both log files please.
Created attachment 482404
Spacewalk logs
It seems that the issue has become intermittent. After performing the requested actions it seems to have started working. Not sure what is going on now.
(In reply to comment #31)
> It seems that the issue has become intermittent. After performing the requested
> actions it seems to have started working. Not sure what is going on now.

OK, I see both osa-dispatcher and osad subscribed to each other's presence, ping works, and the client picked up the scheduled action. Seeing that you have about 300+ systems connected to your Spacewalk, may I also suggest increasing the max_fds settings as suggested in comment #10 (I have seen cases where this was necessary in environments with many client systems). Thanks.
Created attachment 482838
New-Spacewalk-OSA-Logs

I have performed the actions you have requested. It doesn't seem to resolve my issue. However, I was able to replicate the original issue even with max_fds modifications. Attached you will find the logs in question.
Any ideas as to what I could try next? I thank you for the assistance you have provided thus far.
I have some updates to this issue. I currently have OSAD actions working. I have restarted each client and removed the jabber DB and everything started working. Then about 24-36 hours later everything stopped receiving actions. I restarted osa-dispatcher and it started working again. I am not sure what may be causing this and if this is the same issue. Please let me know if I need to open another bug for this issue.
Created attachment 483834
Traceback error

I also started receiving traceback logs from a proxy server when I perform a remote command action.
I was experiencing problems similar to others posting here. Pings would work, but osad was not picking up actions. Spacewalk was upgraded from 1.0->1.1->1.2->1.3 and osad stopped working somewhere along the way. To get my jabber configuration back to normal, I ended up doing:

On spacewalk server:

service osa-dispatcher stop
rpm --nodeps --erase jabberd
mv /etc/jabberd /etc/jabberd.old
rm -f /var/lib/jabberd/db/*
yum install jabberd spacewalk-setup-jabberd
service jabberd start
service osa-dispatcher start

Then on clients I had to run:

service osad stop
rm -f /etc/sysconfig/rhn/osad-auth.conf
service osad start

Now osad is working, and hopefully will still be working tomorrow.
(In reply to comment #36)
> Created attachment 483834
> Traceback error

The important part from the traceback shown is:

ORA-12519: TNS:no appropriate service handler found

which simply suggests osa-dispatcher had problems with its database connection. When setting up the Oracle DB (XE or 10g / 11g), did you alter system processes as described in https://fedorahosted.org/spacewalk/wiki/OracleXeSetup ? Specifically I mean the

alter system set processes = 400 scope=spfile;

part. If so, what value did you set the processes to? Our documentation suggests 400, but if that's not enough, increasing the value and restarting the server may help.

> I also started receiving traceback logs from a proxy server when I perform a
> remote command action.

Are we talking Spacewalk proxy here? What do those tracebacks look like?
I have an Oracle Standalone 11g DB. I have not made the recommended change as I thought it was only for XE and not 11g. I will go ahead and make the change. One question before I do: is the suggested 400 per client? If so, I would need to bump that number up a lot. When I said proxy server, I meant that I was getting the tracebacks from the Application server that was having problems with the proxy. The traceback I have uploaded to this bug is from these logs. Could the issue with connections be the cause of the clients not receiving actions?
(In reply to comment #39)
> I have an Oracle Standalone 11g DB. I have not made the recommended change
> as I thought it was only for XE and not 11g.

Sure. But there are several resources on the internet suggesting that increasing the db processes helps with ORA-12519 (regardless of what Oracle version you're using).

> I will go ahead and make the change.
> One question before I do is the suggestion 400 for each client? If so I would
> need to bump that number up a lot.

You mean altering the number of processes to (400 * no_of_clients) ? No. Just check what the current settings are (in sqlplus, type: show parameter processes) and try to increase the value.

> When I said proxy server I was getting the tracebacks from the Application
> server that was having problems with the proxy. The traceback I have uploaded
> to this bug are these logs.
>
> Could the issue with connections be the cause for the clients to not receive
> actions?

Yes. osa-dispatcher is the component which retrieves the scheduled actions from database (and therefore needs a db connection) and pushes them to the client systems via jabber network.
Currently it is set to 150. I will go ahead and make the changes. Just a heads up. I have restarted osa-dispatcher service and after I have restarted everything seems to be working. It has continued to work over the weekend and today. Not sure what is going on now.
Issue appears to be resolved.
Mass moving to ON_QA before release of Spacewalk 1.4
Spacewalk 1.4 has been released
Issue seems to have reappeared after applying patch and performing steps above.