Bug 1187358
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Spacewalk does not announce tasks over XML-RPC during rhn_check | | |
| Product: | [Community] Spacewalk | Reporter: | Christopher Swingler <cswingler> |
| Component: | Server | Assignee: | Stephen Herr <sherr> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Red Hat Satellite QA List <satqe-list> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.2 | CC: | chris+rhbugzilla, flodense, mc, ryan.clough |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | spacewalk-backend-2.3.46-1 osa-dispatcher-5.11.55-1 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-04-14 19:03:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1207293 | | |
| Attachments: | | | |
**Description** (Christopher Swingler, 2015-01-29 20:51:21 UTC)
This is some kind of huge hint, but I don't know why yet: https://www.redhat.com/archives/spacewalk-list/2015-January/msg00175.html

Setting the creator_id for the server to non-null and then reactivating it fixes the problem:

    spacewalk-sql -i
    alter table rhnserver disable trigger rhnserver_log_trig;
    update rhnserver set creator_id = X where id = Y;
    alter table rhnserver enable trigger rhnserver_log_trig;

---

Traceback from email in comment 1:

    Exception Handler Information
    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/spacewalk/server/apacheRequest.py", line 122, in call_function
        response = func(*params)
      File "/usr/share/rhn/server/handlers/xmlrpc/registration.py", line 511, in new_system
        architecture, data)
      File "/usr/share/rhn/server/handlers/xmlrpc/registration.py", line 300, in create_system
        newserv.user.reload(newserv.server['creator_id'])
      File "/usr/lib/python2.6/site-packages/spacewalk/server/rhnUser.py", line 246, in reload
        raise rhnException("Could not find contact record for id", user_id)
    rhnException: ('Could not find contact record for id', None)

---

Okay, so after some investigation, I believe the traceback when reactivating is a different issue from whatever is causing the action to not get picked up, and reactivating just happens to fix whatever the not-picking-up-actions problem is. I'll look at fixing the reactivation problem in a minute, but I need more information about the not-picking-up-actions problem (I can't reproduce it).

In addition to the normal causes for actions not getting picked up (the client not calling home, something weird with the system / database timestamps so Spacewalk does not think the action should be executed yet, the system id on the client not matching the system profile), I have identified a couple of other things I'd like you to check for:

1) Is the system profile locked? Some actions will not be scheduled if the system profile is locked. System Overview page -> Lock Status field

2) Does Spacewalk (incorrectly) think that this system is in the middle of a reboot? (A scripted version of this check is sketched after this comment.)

    spacewalk-sql -i
    select 1 from rhnServerAction sa
      join rhnAction a on sa.action_id = a.id
      join rhnActionType at on a.action_type = at.id
     where sa.server_id = <sid> and sa.status = 1;

(will return 1 if Spacewalk thinks the system is rebooting)

If neither of those apply and the system is still not picking up actions, I'll need you to enable more thorough logging on the Spacewalk server and post the logs here. Enable debug logging by setting "server.debug = 4" in /etc/rhn/rhn.conf and restarting httpd. Then (as root):

1) tail -f /var/log/rhn/rhn_server_xmlrpc.log -n 0 > out.txt
2) run rhn_check on the client
3) Ctrl-C the tail
4) attach out.txt to this bug or copy it into a comment.

Thanks!
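For convenience, check 2 can be run against many systems at once. Below is a minimal sketch of the same status query driven from Python with psycopg2, assuming a PostgreSQL-backed Spacewalk and direct database access; the connection settings and the server ID list are placeholders, not values taken from this bug.

```python
# Minimal sketch (not part of the original bug report): list which servers
# have a "Picked Up" (status = 1) action blocking their action queue.
# Connection parameters and SERVER_IDS are placeholders; adjust them for
# your own Spacewalk database.
import psycopg2

SERVER_IDS = [1000010001, 1000010002]  # replace with the system IDs to check

QUERY = """
    select sa.server_id, count(*)
      from rhnServerAction sa
     where sa.server_id = any(%s)
       and sa.status = 1
     group by sa.server_id
"""

conn = psycopg2.connect(dbname="spaceschema", user="spaceuser",
                        password="...", host="localhost")
try:
    with conn.cursor() as cur:
        # psycopg2 adapts the Python list to a PostgreSQL array for any(%s)
        cur.execute(QUERY, (SERVER_IDS,))
        blocked = dict(cur.fetchall())
    for sid in SERVER_IDS:
        if sid in blocked:
            print("%s: %d in-progress action(s), queue is blocked" % (sid, blocked[sid]))
        else:
            print("%s: no in-progress action found" % sid)
finally:
    conn.close()
```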
---

I have fixed the "traceback during re-activation" problem, but again, I think this is unrelated to the "not picking up actions" problem.

Committing to Spacewalk master: e3400bd637f7686664ae29ff55f085b0723c1f6d

Fixed in spacewalk-backend-2.3.37-1

---

Okay, so I went and scheduled a remote command on all of our systems in Spacewalk with a null creator_id. This task was scheduled at 1/30/15 10:47:00 AM CST. I finally went and checked on it today (over twelve days later).

The remote command was basically just to email me when it executed:

    echo "Hello from $(hostname) at $(date)" | mail -s "Pingback from null-creator_id host $(hostname)" cswingler@

Results:

- Completed: 224
- Failed: 5
- Pending: 125

The five failed systems failed for various, valid reasons: three for not having a "mail" command, and two for not having the permissions set to allow execution. Of the 125 that are still pending (twelve days later), 48 are on our Inactive Systems list and should probably be cleaned out. :)

That still leaves 77 systems that seem to be affected. A spot check shows that, indeed, they're still failing to pick up tasks. I think this might work against the hypothesis that a null creator_id is to blame here, and that it was simply coincidental. But I (sort of) digress. To answer your questions:

> 1) Is the system profile locked? Some actions will not be scheduled if the system profile is locked.

Created an ssm in spacecmd, ran system_details against the 77 systems we've got here, and:

    [admincswingler@spwk-app-chi-01 ~]$ grep -c 'Locked.*False' stuck_executors.txt
    77
    [admincswingler@spwk-app-chi-01 ~]$ grep -ic 'Locked.*True' stuck_executors.txt
    0

Nope. All unlocked.

> 2) Does Spacewalk (incorrectly) think that this system is in the middle of a reboot?

    spaceschema-# where sa.server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027, 1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338, 1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782, 1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916, 1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020, 1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947, 1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120, 1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979, 1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183, 1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
    spaceschema-# and sa.status = 1;
     ?column?
    ----------
            1
    <snip>
            1
    (77 rows)

Bingo!

---

Sorry, "rebooting" was not entirely accurate on my part. The comment I was looking for called out rebooting specifically, but actually that sql is looking for any action that is currently in progress. Which makes sense, you can't start on the next action until you're done with the one you're already working on.
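For reference when reading the status checks in this thread: the numeric rhnServerAction.status values come from the rhnActionStatus table. The mapping below is what a stock Spacewalk schema is assumed to ship with (verify against your own database); it is why status = 1 is treated as "in progress" above and why the clean-up statement later in the thread sets stuck rows to 3.

```python
# Assumed stock Spacewalk rhnActionStatus values; verify with:
#   select id, name from rhnActionStatus;
ACTION_STATUS = {
    0: "Queued",     # waiting for the client to pick it up on the next rhn_check
    1: "Picked Up",  # the client has started it; later actions wait behind it
    2: "Completed",
    3: "Failed",
}

def describe_status(status_id):
    """Return a human-readable name for an rhnServerAction.status value."""
    return ACTION_STATUS.get(status_id, "unknown (%r)" % status_id)
```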
You can find out what type of action is "stuck" with this query:

    select sa.server_id, at.label
      from rhnServerAction sa
      join rhnAction a on sa.action_id = a.id
      join rhnActionType at on a.action_type = at.id
     where sa.server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027,
                            1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338,
                            1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782,
                            1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916,
                            1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020,
                            1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947,
                            1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120,
                            1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979,
                            1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183,
                            1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
       and sa.status = 1;

The fact that the problem is actions that are stuck makes this difficult to debug, since I have no idea where to look or how they got into this state. If you're sure that they're not actually in the middle of something, you can unstick them by just changing all their states to failed:

    update rhnServerAction
       set status = 3
     where server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027,
                         1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338,
                         1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782,
                         1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916,
                         1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020,
                         1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947,
                         1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120,
                         1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979,
                         1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183,
                         1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
       and status = 1;

However this doesn't fix the larger problem of how they got stuck in the first place. If any other information that can shed some light becomes available (error messages in logs, reproducible situations where an action never completes, etc.) please report it here so I can fix it.

---

(In reply to Stephen Herr from comment #6)
> Sorry, "rebooting" was not entirely accurate on my part. The comment I was
> looking for called out rebooting specifically, but actually that sql is
> looking for any action that is currently in progress. Which makes sense, you
> can't start on the next action until you're done with the one you're already
> working on.

Hmm, but maybe this is the problem. _query_queue_get explicitly collects queued _and_ picked-up actions. There is also a column "remaining_tries" in rhnServerAction. So it seems that already picked-up actions are executed a second time if the client asks again. But with the current __reboot_in_progress() method, which does not actually check for a reboot, all picked-up but "failed" actions will never be triggered again.

---

Yeah, I think you're right.
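The pull request accepted below presumably follows that line of reasoning: only an in-progress reboot should block handing out further actions during rhn_check, so a non-reboot action stuck in "Picked Up" no longer starves the client. Below is a rough, hypothetical sketch of that idea only; the function names, the 'reboot.reboot' label, and the cursor handling are illustrative assumptions, not the actual spacewalk-backend change.

```python
# Hypothetical illustration of the fix idea; not the real Spacewalk code.
def reboot_in_progress(cursor, server_id):
    """True only if a *reboot* action is currently picked up for this server."""
    cursor.execute("""
        select 1
          from rhnServerAction sa
          join rhnAction a on sa.action_id = a.id
          join rhnActionType at on a.action_type = at.id
         where sa.server_id = %s
           and sa.status = 1                 -- Picked Up
           and at.label = 'reboot.reboot'    -- assumed label of the reboot action type
         limit 1
    """, (server_id,))
    return cursor.fetchone() is not None


def next_action_allowed(cursor, server_id):
    # Before the fix, any picked-up action blocked the queue indefinitely;
    # restricting the check to reboots lets other picked-up actions be offered
    # again (subject to rhnServerAction.remaining_tries) instead of wedging forever.
    return not reboot_in_progress(cursor, server_id)
```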
---

Accepting your pull request: https://github.com/spacewalkproject/spacewalk/pull/230

---

Moving bugs to ON_QA as we move to release Spacewalk 2.3

---

Spacewalk 2.3 has been released. See https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23