Bug 1187358
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Spacewalk does not announce tasks over XML-RPC during rhn_check | | |
| Product: | [Community] Spacewalk | Reporter: | Christopher Swingler <cswingler> |
| Component: | Server | Assignee: | Stephen Herr <sherr> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Red Hat Satellite QA List <satqe-list> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.2 | CC: | chris+rhbugzilla, flodense, mc, ryan.clough |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | spacewalk-backend-2.3.46-1 osa-dispatcher-5.11.55-1 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-04-14 19:03:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1207293 | | |
| Attachments: | | | |
**Description** (Christopher Swingler, 2015-01-29 20:51:21 UTC)
This is some kind of huge hint, but I don't know why yet: https://www.redhat.com/archives/spacewalk-list/2015-January/msg00175.html

Setting the creator_id for the server to non-null and then reactivating it fixes the problem:

    spacewalk-sql -i
    alter table rhnserver disable trigger rhnserver_log_trig;
    update rhnserver set creator_id = X where id = Y;
    alter table rhnserver enable trigger rhnserver_log_trig;

---

Traceback from email in comment 1:

    Exception Handler Information
    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/spacewalk/server/apacheRequest.py", line 122, in call_function
        response = func(*params)
      File "/usr/share/rhn/server/handlers/xmlrpc/registration.py", line 511, in new_system
        architecture, data)
      File "/usr/share/rhn/server/handlers/xmlrpc/registration.py", line 300, in create_system
        newserv.user.reload(newserv.server['creator_id'])
      File "/usr/lib/python2.6/site-packages/spacewalk/server/rhnUser.py", line 246, in reload
        raise rhnException("Could not find contact record for id", user_id)
    rhnException: ('Could not find contact record for id', None)

---

Okay, so after some investigation, I believe the traceback when reactivating is a different issue from whatever is causing the action to not get picked up, and reactivating just happens to fix whatever the not-picking-up-actions problem is. I'll look at fixing the reactivation problem in a minute, but I need more information about the not-picking-up-actions problem (I can't reproduce it).

In addition to the normal causes for actions not getting picked up (the client not calling home, something weird with the system / database timestamps so Spacewalk does not think the action should be executed yet, the system id on the client not matching the system profile), I have identified a couple of other things I'd like you to check for:

1) Is the system profile locked? Some actions will not be scheduled if the system profile is locked. System Overview page -> Lock Status field

2) Does Spacewalk (incorrectly) think that this system is in the middle of a reboot? (A scripted version of this check is sketched after this comment.)

    spacewalk-sql -i
    select 1 from rhnServerAction sa
      join rhnAction a on sa.action_id = a.id
      join rhnActionType at on a.action_type = at.id
     where sa.server_id = <sid> and sa.status = 1;

(will return 1 if Spacewalk thinks the system is rebooting)

If neither of those apply and the system is still not picking up actions, I'll need you to enable more thorough logging on the Spacewalk server and post the logs here. Enable debug logging by setting "server.debug = 4" in /etc/rhn/rhn.conf and restarting httpd. Then (as root):

1) tail -f /var/log/rhn/rhn_server_xmlrpc.log -n 0 > out.txt
2) run rhn_check on the client
3) Ctrl-C the tail
4) attach out.txt to this bug or copy it into a comment.

Thanks!
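For convenience, check 2 can be run against many systems at once. Below is a minimal sketch of the same status query driven from Python with psycopg2, assuming a PostgreSQL-backed Spacewalk and direct database access; the connection settings and the server ID list are placeholders, not values taken from this bug.

```python
# Minimal sketch (not part of the original bug report): list which servers
# have a "Picked Up" (status = 1) action blocking their action queue.
# Connection parameters and SERVER_IDS are placeholders; adjust them for
# your own Spacewalk database.
import psycopg2

SERVER_IDS = [1000010001, 1000010002]  # replace with the system IDs to check

QUERY = """
    select sa.server_id, count(*)
      from rhnServerAction sa
     where sa.server_id = any(%s)
       and sa.status = 1
     group by sa.server_id
"""

conn = psycopg2.connect(dbname="spaceschema", user="spaceuser",
                        password="...", host="localhost")
try:
    with conn.cursor() as cur:
        # psycopg2 adapts the Python list to a PostgreSQL array for any(%s)
        cur.execute(QUERY, (SERVER_IDS,))
        blocked = dict(cur.fetchall())
    for sid in SERVER_IDS:
        if sid in blocked:
            print("%s: %d in-progress action(s), queue is blocked" % (sid, blocked[sid]))
        else:
            print("%s: no in-progress action found" % sid)
finally:
    conn.close()
```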
---

I have fixed the "traceback during re-activation" problem, but again, I think this is unrelated to the "not picking up actions" problem.

Committing to Spacewalk master: e3400bd637f7686664ae29ff55f085b0723c1f6d

Fixed in spacewalk-backend-2.3.37-1

---

Okay, so I went and scheduled a remote command on all of our systems in Spacewalk with a null creator_id. This task was scheduled at 1/30/15 10:47:00 AM CST. I finally went and checked on it today (over twelve days later).

The remote command was basically just to email me when it executed:

    echo "Hello from $(hostname) at $(date)" | mail -s "Pingback from null-creator_id host $(hostname)" cswingler@

Results:

- Completed: 224
- Failed: 5
- Pending: 125

The five failed systems failed for various, valid reasons: three for not having a "mail" command, and two for not having the permissions set to allow execution. Of the 125 that are still pending (twelve days later), 48 are on our Inactive Systems list and should probably be cleaned out. :)

That still leaves 77 systems that seem to be affected. A spot check shows that, indeed, they're still failing to pick up tasks. I think this might work against the hypothesis that a null creator_id is to blame here, and that it was simply coincidental. But I (sort of) digress. To answer your questions:

> 1) Is the system profile locked? Some actions will not be scheduled if the system profile is locked.

Created an ssm in spacecmd, ran system_details against the 77 systems we've got here, and:

    [admincswingler@spwk-app-chi-01 ~]$ grep -c 'Locked.*False' stuck_executors.txt
    77
    [admincswingler@spwk-app-chi-01 ~]$ grep -ic 'Locked.*True' stuck_executors.txt
    0

Nope. All unlocked.

> 2) Does Spacewalk (incorrectly) think that this system is in the middle of a reboot?

    spaceschema-# where sa.server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027, 1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338, 1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782, 1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916, 1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020, 1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947, 1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120, 1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979, 1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183, 1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
    spaceschema-# and sa.status = 1;
     ?column?
    ----------
            1
    <snip>
            1
    (77 rows)

Bingo!

---

Sorry, "rebooting" was not entirely accurate on my part. The comment I was looking for called out rebooting specifically, but actually that sql is looking for any action that is currently in progress. Which makes sense, you can't start on the next action until you're done with the one you're already working on.
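For reference when reading the status checks in this thread: the numeric rhnServerAction.status values come from the rhnActionStatus table. The mapping below is what a stock Spacewalk schema is assumed to ship with (verify against your own database); it is why status = 1 is treated as "in progress" above and why the clean-up statement later in the thread sets stuck rows to 3.

```python
# Assumed stock Spacewalk rhnActionStatus values; verify with:
#   select id, name from rhnActionStatus;
ACTION_STATUS = {
    0: "Queued",     # waiting for the client to pick it up on the next rhn_check
    1: "Picked Up",  # the client has started it; later actions wait behind it
    2: "Completed",
    3: "Failed",
}

def describe_status(status_id):
    """Return a human-readable name for an rhnServerAction.status value."""
    return ACTION_STATUS.get(status_id, "unknown (%r)" % status_id)
```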
You can find out what type of action is "stuck" with this query:

    select sa.server_id, at.label
      from rhnServerAction sa
      join rhnAction a on sa.action_id = a.id
      join rhnActionType at on a.action_type = at.id
     where sa.server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027,
                            1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338,
                            1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782,
                            1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916,
                            1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020,
                            1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947,
                            1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120,
                            1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979,
                            1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183,
                            1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
       and sa.status = 1;

The fact that the problem is actions that are stuck makes this difficult to debug, since I have no idea where to look or how they got into this state. If you're sure that they're not actually in the middle of something, you can unstick them by just changing all their states to failed:

    update rhnServerAction
       set status = 3
     where server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027,
                         1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338,
                         1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782,
                         1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916,
                         1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020,
                         1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947,
                         1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120,
                         1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979,
                         1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183,
                         1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
       and status = 1;

However this doesn't fix the larger problem of how they got stuck in the first place. If any other information that can shed some light becomes available (error messages in logs, reproducible situations where an action never completes, etc.) please report it here so I can fix it.

---

(In reply to Stephen Herr from comment #6)
> Sorry, "rebooting" was not entirely accurate on my part. The comment I was
> looking for called out rebooting specifically, but actually that sql is
> looking for any action that is currently in progress. Which makes sense, you
> can't start on the next action until you're done with the one you're already
> working on.

Hmm, but maybe this is the problem. _query_queue_get explicitly collects queued _and_ picked-up actions. There is also a column "remaining_tries" in rhnServerAction. So it seems that already picked-up actions are executed a second time if the client asks again. But with the current __reboot_in_progress() method, which does not actually check for a reboot, all picked-up but "failed" actions will never be triggered again.

---

Yeah, I think you're right.
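The pull request accepted below presumably follows that line of reasoning: only an in-progress reboot should block handing out further actions during rhn_check, so a non-reboot action stuck in "Picked Up" no longer starves the client. Below is a rough, hypothetical sketch of that idea only; the function names, the 'reboot.reboot' label, and the cursor handling are illustrative assumptions, not the actual spacewalk-backend change.

```python
# Hypothetical illustration of the fix idea; not the real Spacewalk code.
def reboot_in_progress(cursor, server_id):
    """True only if a *reboot* action is currently picked up for this server."""
    cursor.execute("""
        select 1
          from rhnServerAction sa
          join rhnAction a on sa.action_id = a.id
          join rhnActionType at on a.action_type = at.id
         where sa.server_id = %s
           and sa.status = 1                 -- Picked Up
           and at.label = 'reboot.reboot'    -- assumed label of the reboot action type
         limit 1
    """, (server_id,))
    return cursor.fetchone() is not None


def next_action_allowed(cursor, server_id):
    # Before the fix, any picked-up action blocked the queue indefinitely;
    # restricting the check to reboots lets other picked-up actions be offered
    # again (subject to rhnServerAction.remaining_tries) instead of wedging forever.
    return not reboot_in_progress(cursor, server_id)
```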
---

Accepting your pull request: https://github.com/spacewalkproject/spacewalk/pull/230

---

Moving bugs to ON_QA as we move to release Spacewalk 2.3

---

Spacewalk 2.3 has been released. See https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23