Created attachment 985764 [details]
A stacktrace when attempting to reactivate an existing system

Description of problem:

See [Spacewalk-list] thread
https://www.redhat.com/archives/spacewalk-list/2015-January/msg00161.html
for more discussion on this topic.

In short, systems just plain do not pick up scheduled tasks. Tasks are
scheduled through the UI (be they package installations/updates, configuration
file publishing, or remote commands) and the API basically says "I've got
nothing for you" when the remote system runs an rhn_check. I've validated that
the system ID matches and that the rhn_check succeeds, and nothing seems to
convince Spacewalk that it should pass the task along to the system checking
in. Spacewalk tools have been updated on affected systems.

A packet capture shows this HTTP response coming back from Spacewalk:

HTTP/1.1 200 OK
Date: Mon, 26 Jan 2015 22:47:55 GMT
Server: Apache
Content-Length: 126
X-RHN-Server-Capability: registration.finish_message(1)=1
X-RHN-Server-Capability: registration.remaining_subscriptions(1)=1
X-RHN-Server-Capability: abrt(1)=1
X-RHN-Server-Capability: registration.update_contact_info(1)=1
X-RHN-Server-Capability: staging_content(1)=1
X-RHN-Server-Capability: applet.has_base_channel(1)=1
X-RHN-Server-Capability: registration.smbios(1)=1
X-RHN-Server-Capability: registration.extended_update_support(1)=1
X-RHN-Server-Capability: rhncfg.filetype.directory(1)=1
X-RHN-Server-Capability: registration.update_systemid(1)=1
X-RHN-Server-Capability: registration.register_osad(1)=1
X-RHN-Server-Capability: registration.delta_packages(1)=1
X-RHN-Server-Capability: cpu_sockets(1)=1
X-RHN-Server-Capability: ipv6(1)=1
X-RHN-Server-Capability: rhncfg.content.base64_decode(1)=1
X-RHN-Server-Capability: xmlrpc.packages.extended_profile(1-2)=1
X-RHN-Server-Capability: xmlrpc.errata.patch_names(1)=1
X-RHN-Server-Capability: xmlrpc.packages.checksums(1)=1
X-RHN-Server-Capability: xmlrpc.login.extra_data(1)=1
X-RHN-Proxy-Version: 0
X-Transport-Info: Extended Capabilities Transport (C) Red Hat, Inc (version 2.5.72-1.el6)
X-RHN-Client-Version: 1
Connection: close
Content-Type: text/xml

<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><string></string></value>
</param>
</params>
</methodResponse>

In addition, attempting to use the Spacewalk reactivation feature results in
this error from rhnreg_ks on re-registration:

Error communicating with server. The message was:
Internal Server Error

along with the attached stack trace generated by the server.

We're seeing this on an assortment of servers and releases (CentOS 5 and 6),
with no particular consistent relationship as to why it's broken.

Version-Release number of selected component (if applicable):

Spacewalk 2.2: spacewalk-base-2.2.33-1.el6.noarch

On one affected client:
rhnlib-2.5.72-1.el6.noarch
spacewalk-backend-libs-2.2.43-1.el6.noarch
rhncfg-client-5.10.73-1.el6.noarch
rhn-setup-2.2.7-1.el6.noarch
rhn-client-tools-2.2.7-1.el6.noarch
rhnsd-5.0.14-1.el6.x86_64
rhn-check-2.2.7-1.el6.noarch
rhncfg-actions-5.10.73-1.el6.noarch
rhncfg-5.10.73-1.el6.noarch

How reproducible:
Very, but not reliably. Some systems just get stuck in this state, and I
haven't figured out why or how to reproduce it.

Steps to Reproduce:
1. Schedule a task in the Spacewalk UI.
2. Run rhn_check on the remote host.

Actual results:
rhn_check completes successfully, but the task remains at "Queued" / "This
action has not yet been picked up."

Expected results:
rhn_check completes successfully and the task is executed.
Additional info:
This is some kind of huge hint, but I don't know why yet:
https://www.redhat.com/archives/spacewalk-list/2015-January/msg00175.html

Setting the creator_id for the server to non-null and then reactivating it
fixes the problem:

spacewalk-sql -i
alter table rhnserver disable trigger rhnserver_log_trig;
update rhnserver set creator_id = X where id = Y;
alter table rhnserver enable trigger rhnserver_log_trig;
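For reference, a query along these lines should list which profiles are in
that state (a sketch, assuming the standard id, name, and creator_id columns
on rhnServer):

select id, name
  from rhnServer
 where creator_id is null
 order by id;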
Traceback from email in comment 1:

Exception Handler Information
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/spacewalk/server/apacheRequest.py", line 122, in call_function
    response = func(*params)
  File "/usr/share/rhn/server/handlers/xmlrpc/registration.py", line 511, in new_system
    architecture, data)
  File "/usr/share/rhn/server/handlers/xmlrpc/registration.py", line 300, in create_system
    newserv.user.reload(newserv.server['creator_id'])
  File "/usr/lib/python2.6/site-packages/spacewalk/server/rhnUser.py", line 246, in reload
    raise rhnException("Could not find contact record for id", user_id)
rhnException: ('Could not find contact record for id', None)
Okay, so after some investigation, I believe the traceback when reactivating
is a different issue from whatever is causing the action to not get picked up,
and reactivating just happens to fix whatever the not-picking-up-actions
problem is. I'll look at fixing the reactivation problem in a minute, but I
need more information about the not-picking-up-actions problem (I can't
reproduce it).

In addition to the normal causes for actions not getting picked up (the client
not calling home, something weird with a system / database timestamp so
Spacewalk does not think the action should be executed yet, the system id on
the client not matching the system profile), I have identified a couple of
other things I'd like you to check for:

1) Is the system profile locked? Some actions will not be scheduled if the
system profile is locked. System Overview page -> Lock Status field

2) Does Spacewalk (incorrectly) think that this system is in the middle of a
reboot?

spacewalk-sql -i
select 1
  from rhnServerAction sa
  join rhnAction a on sa.action_id = a.id
  join rhnActionType at on a.action_type = at.id
 where sa.server_id = <sid>
   and sa.status = 1;

(will return 1 if Spacewalk thinks the system is rebooting)

If neither of those apply and the system is still not picking up actions, I'll
need you to enable more thorough logging on the Spacewalk server and post the
logs here. Enable debug logging by setting "server.debug = 4" in
/etc/rhn/rhn.conf and restarting httpd. Then (as root):

1) tail -f /var/log/rhn/rhn_server_xmlrpc.log -n 0 > out.txt
2) run rhn_check on the client
3) Ctrl-C the tail
4) attach out.txt to this bug or copy it into a comment.

Thanks!
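P.S. If it helps rule out the timestamp possibility at the same time, a
broader query along these lines should show everything outstanding for a
system and when Spacewalk thinks it becomes eligible to run (a sketch,
assuming the stock rhnActionStatus table, the earliest_action column on
rhnAction, and that 0 = Queued and 1 = Picked Up):

select a.id as action_id, at.label, ast.name as status, a.earliest_action
  from rhnServerAction sa
  join rhnAction a on sa.action_id = a.id
  join rhnActionType at on a.action_type = at.id
  join rhnActionStatus ast on sa.status = ast.id
 where sa.server_id = <sid>
   and sa.status in (0, 1)
 order by a.earliest_action;

An action whose earliest_action is in the future will not be handed out yet.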
I have fixed the "traceback during re-activation" problem, but again, I think
this is unrelated to the "not picking up actions" problem.

Committing to Spacewalk master: e3400bd637f7686664ae29ff55f085b0723c1f6d

Fixed in spacewalk-backend-2.3.37-1
Okay, so I went and scheduled a remote command on all of our systems in
Spacewalk with a null creator ID. This task was scheduled at 1/30/15 10:47:00
AM CST. I finally went and checked on it today (over twelve days later). The
remote command was basically just to email me when it executed:

echo "Hello from $(hostname) at $(date)" | mail -s "Pingback from null-creator_id host $(hostname)" cswingler@

Results:
Completed: 224
Failed: 5
Pending: 125

The five failed systems failed for various, valid reasons - three for not
having a "mail" command, and two for not having the permissions set to allow
execution.

Of the 125 that are still pending (twelve days later), 48 are on our Inactive
Systems list and should probably be cleaned out. :)

That still leaves 77 systems that seem to be affected. A spot check shows
that, indeed, they're still failing to pick up tasks. I think this might work
against the hypothesis that a null creator_id is to blame here, and that it
was simply coincidental. But I (sort of) digress.

To answer your questions:

> 1) Is the system profile locked? Some actions will not be scheduled if the
> system profile is locked.

Created an ssm in spacecmd, ran system_details against the 77 systems we've
got here, and:

[admincswingler@spwk-app-chi-01 ~]$ grep -c 'Locked.*False' stuck_executors.txt
77
[admincswingler@spwk-app-chi-01 ~]$ grep -ic 'Locked.*True' stuck_executors.txt
0

Nope. All unlocked.

> 2) Does Spacewalk (incorrectly) think that this system is in the middle of a
> reboot?

spaceschema-# where sa.server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027, 1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338, 1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782, 1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916, 1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020, 1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947, 1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120, 1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979, 1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183, 1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
spaceschema-# and sa.status = 1;
 ?column?
----------
        1
<snip>
        1
(77 rows)

Bingo!
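For completeness, the inactive-vs-stuck split above can be reproduced straight
from the database with something like this (a sketch, assuming
rhnServerInfo.checkin holds the last check-in time, as in the stock schema):

select s.id, s.name, si.checkin as last_checkin
  from rhnServer s
  join rhnServerInfo si on si.server_id = s.id
 where s.id in (<the 125 server ids still pending>)
 order by si.checkin;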
Sorry, "rebooting" was not entirely accurate on my part. The comment I was
looking for called out rebooting specifically, but actually that sql is
looking for any action that is currently in progress. Which makes sense: you
can't start on the next action until you're done with the one you're already
working on.

You can find out what type of action is "stuck" with this query:

select sa.server_id, at.label
  from rhnServerAction sa
  join rhnAction a on sa.action_id = a.id
  join rhnActionType at on a.action_type = at.id
 where sa.server_id in (1000014807, 1000014827, 1000014828, 1000014680,
   1000013882, 1000014931, 1000014932, 1000015027, 1000015252, 1000014809,
   1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338,
   1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229,
   1000015239, 1000013782, 1000014910, 1000014923, 1000014839, 1000014924,
   1000014841, 1000014953, 1000014951, 1000014916, 1000014736, 1000014751,
   1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020,
   1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968,
   1000015118, 1000014947, 1000014912, 1000014911, 1000014821, 1000014822,
   1000014824, 1000014561, 1000015090, 1000015120, 1000014921, 1000015081,
   1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979,
   1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205,
   1000015208, 1000015183, 1000015084, 1000014741, 1000014975, 1000014974,
   1000014815)
   and sa.status = 1;

The fact that the problem is actions that are stuck makes this difficult to
debug, since I have no idea where to look or how they got into this state. If
you're sure that they're not actually in the middle of something, you can
unstick them by just changing all their states to failed:

update rhnServerAction
   set status = 3
 where server_id in (1000014807, 1000014827, 1000014828, 1000014680,
   1000013882, 1000014931, 1000014932, 1000015027, 1000015252, 1000014809,
   1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338,
   1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229,
   1000015239, 1000013782, 1000014910, 1000014923, 1000014839, 1000014924,
   1000014841, 1000014953, 1000014951, 1000014916, 1000014736, 1000014751,
   1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020,
   1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968,
   1000015118, 1000014947, 1000014912, 1000014911, 1000014821, 1000014822,
   1000014824, 1000014561, 1000015090, 1000015120, 1000014921, 1000015081,
   1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979,
   1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205,
   1000015208, 1000015183, 1000015084, 1000014741, 1000014975, 1000014974,
   1000014815)
   and status = 1;

However this doesn't fix the larger problem of how they got stuck in the first
place. If any other information that can shed some light becomes available
(error messages in logs, reproducible situations where an action never
completes, etc.) please report it here so I can fix it.
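One more note: if you'd like the action history to record why those entries
were failed, a variant along these lines should also work (a sketch, assuming
the result_msg and completion_time columns on rhnServerAction, which the stock
schema has):

update rhnServerAction
   set status = 3,
       result_msg = 'Manually marked failed; action was stuck in Picked Up',
       completion_time = current_timestamp
 where server_id in (<same list of server ids as above>)
   and status = 1;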
(In reply to Stephen Herr from comment #6)
> Sorry, "rebooting" was not entirely accurate on my part. The comment I was
> looking for called out rebooting specifically, but actually that sql is
> looking for any action that is currently in progress. Which makes sense, you
> can't start on the next action until you're done with the one you're already
> working on.

Hmm, but maybe this is the problem. _query_queue_get explicitly collects
queued _and_ picked-up actions, and there is also a "remaining_tries" column
in rhnServerAction. So it seems that already picked-up actions are meant to be
executed a second time if the client asks again. But with the current
__reboot_in_progress() method, which does not actually check for a reboot, all
picked-up but "failed" actions will never be triggered again.
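In other words, if __reboot_in_progress() is only meant to block on an actual
reboot, it presumably needs to filter on the action type as well, roughly like
this (a sketch, assuming the reboot action type label is 'reboot.reboot', as
in the stock schema):

select 1
  from rhnServerAction sa
  join rhnAction a on sa.action_id = a.id
  join rhnActionType at on a.action_type = at.id
 where sa.server_id = <sid>
   and sa.status = 1
   and at.label = 'reboot.reboot';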
Yeah I think you're right. Accepting your pull request: https://github.com/spacewalkproject/spacewalk/pull/230
Moving bugs to ON_QA as we move to release Spacewalk 2.3
Spacewalk 2.3 has been released. See https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23