Bug 1284494 - Timed out agent/capsule tasks are not cancelled
Summary: Timed out agent/capsule tasks are not cancelled
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: katello-agent
Version: 6.1.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: Unspecified
Assignee: Justin Sherrill
QA Contact: Sachin Ghai
URL: http://projects.theforeman.org/issues...
Whiteboard:
Depends On:
Blocks: 1315285
 
Reported: 2015-11-23 13:34 UTC by Justin Sherrill
Modified: 2019-09-12 09:22 UTC (History)
13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1315285 (view as bug list)
Environment:
Last Closed: 2016-09-28 18:17:43 UTC




Links
System ID Priority Status Summary Last Updated
Foreman Issue Tracker 12692 None None None 2016-04-26 17:05:25 UTC

Description Justin Sherrill 2015-11-23 13:34:49 UTC
Description of problem:

Currently, when a client times out because it is down or has taken too long, the task is not cancelled and a user may reattempt the same action.  Whenever the agent does come back online, it will still pick up the timed-out tasks and unexpectedly process multiple tasks.
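The failure mode can be illustrated with a small, self-contained simulation (the queue and function names are hypothetical stand-ins, not Katello or gofer code): the timed-out request is never removed from the agent's queue, the user retries, and when goferd reconnects it drains both stale requests.

```python
import time
from collections import deque

# Hypothetical stand-in for the agent's message queue; not Katello/gofer code.
agent_queue = deque()

def dispatch(action, timeout=0.01):
    """Server side: enqueue a request and wait briefly for a reply."""
    agent_queue.append(action)
    time.sleep(timeout)      # no reply arrives: goferd is down
    return "timed out"       # Katello marks the task failed, but the
                             # request is NOT removed from the queue

# Host is down; the action times out, and the user retries the same action.
dispatch("install zsh")
dispatch("install zsh")      # retry

# goferd comes back online and drains its queue: both stale requests run.
executed = list(agent_queue)
print(executed)              # ['install zsh', 'install zsh']
```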

Version-Release number of selected component (if applicable):
6.1.3

How reproducible:
Always

Steps to Reproduce:
1.  Register a client with katello-agent installed
2.  Make sure goferd is shut down on the client
3.  Dispatch a package install or errata apply action
4.  Wait for it to timeout
5.  Start goferd

Actual results:
Client picks up the timed-out task

Expected results:
Client does not pick up any tasks, as the task is cancelled.  


Additional info:

Comment 1 Justin Sherrill 2015-11-23 13:35:23 UTC
Multiple requests by customers here:  https://bugzilla.redhat.com/show_bug.cgi?id=1269509

Comment 3 Pavel Moravec 2015-11-30 14:31:41 UTC
Same applies also to capsule sync (and goferd down on the capsule).

Other scenarios include *any* kind of connectivity loss within the chain pulp -> qpidd -> qdrouterd -> goferd, and there are a bunch of possibilities for how the chain can be broken. So the reproducer isn't theoretical; it has been seen in the field.

Comment 4 Jeff Ortel 2015-12-01 23:09:47 UTC
Is katello canceling the task in Pulp?

Comment 5 Justin Sherrill 2015-12-02 11:55:18 UTC
Jeff,

No, this is a katello issue; we need to cancel the task in pulp when we 'time it out'.

Not a pulp issue, at least so far :)

Justin

Comment 6 Justin Sherrill 2015-12-03 16:44:59 UTC
Created redmine issue http://projects.theforeman.org/issues/12692 from this bug

Comment 7 Bryan Kearney 2015-12-03 17:02:49 UTC
Upstream bug assigned to jsherril@redhat.com

Comment 8 Bryan Kearney 2015-12-03 17:02:52 UTC
Upstream bug component is Content Management

Comment 9 Bryan Kearney 2015-12-03 19:02:12 UTC
Upstream bug component is Client & Agent

Comment 10 Bryan Kearney 2015-12-23 17:02:13 UTC
Moving to POST since upstream bug http://projects.theforeman.org/issues/12692 has been closed
-------------
Justin Sherrill
Applied in changeset commit:katello|459c151b76d95ab5e8715780cfc4ae96b6f53e2e.

Comment 13 Sachin Ghai 2016-02-04 11:40:20 UTC
I was trying to reproduce this issue before verifying in the latest 6.1.7 compose.

Steps taken:

1) installed client with katello-agent and registered it with sat server
2) stopped goferd on client
3) triggered package install from sat6 for registered client
4) got warning 'host did not respond within 20 seconds. Is katello-agent installed and goferd running on the Host?'

Now, how can I check whether a task is really cancelled or not? And how can I validate whether, on restarting goferd, the task is picked up by the client?

Do I need to check the messaging queue? If so, how can I check that?

Comment 14 Justin Sherrill 2016-02-04 14:11:27 UTC
Sachin,

A couple things you can do:

a) tail /var/log/messages and look for gofer output performing an action (you may want to run one first and familiarize yourself with its output).

b) Check to see if  the action was actually performed.  

  * shut down goferd
  * schedule an install of the 'screen' package (assuming it's not already installed)
  * wait for the timeout
  * start goferd
  * wait ~60 seconds
  * verify screen is not installed

Comment 15 Sachin Ghai 2016-02-04 17:27:47 UTC
Thank you Justin. This bz is reproducible with Sat 6.1.7 compose1. I tried to install zsh on client.

Logs from server:
=================
 2016-02-04 12:19:10 [I] Processing by Katello::Api::V2::SystemPackagesController#install as JSON
2016-02-04 12:19:10 [I]   Parameters: {"uuid"=>"cb80d298-c094-4df6-b5ca-4138acb0fb62", "packages"=>["zsh"], "api_version"=>"v2", "system_id"=>"cb80d298-c094-4df6-b5ca-4138acb0fb62", "system_package"=>{"uuid"=>"cb80d298-c094-4df6-b5ca-4138acb0fb62", "packages"=>["zsh"]}}

==> /var/log/messages <==
Feb  4 12:19:10 cloud-qe-7 pulp: kombu.transport.qpid:INFO: Connected to qpid with SASL mechanism ANONYMOUS


==> /var/log/messages <==
Feb  4 12:19:31 cloud-qe-7 pulp: pulp.server.async.tasks:INFO: Task canceled: 596cdad7-bdba-4672-99bd-5717719a380d.

==> /var/log/foreman/production.log <==
2016-02-04 12:19:31 [E] Host did not respond within 20 seconds. The task has been cancelled. Is katello-agent installed and goferd running on the Host? (RuntimeError)
/opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.83/app/lib/actions/pulp/consumer/abstract_content_action.rb:35:in `process_timeout'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.9/lib/dynflow/action/polling.rb:23:in `run'



As soon as I restarted the goferd on client, I got following:

==> /var/log/messages <==
Feb  4 12:20:38 cloud-qe-7 pulp: pulp.server.agent.direct.services:INFO: (3037-66400) Task RMI (succeeded): Succeeded
Feb  4 12:20:38 cloud-qe-7 pulp: pulp.server.agent.direct.services:INFO: (3037-66400)   sn : f71038b1-6101-43aa-ae23-c523504b4664
Feb  4 12:20:38 cloud-qe-7 pulp: pulp.server.agent.direct.services:INFO: (3037-66400)   origin : None
Feb  4 12:20:38 cloud-qe-7 pulp: pulp.server.agent.direct.services:INFO: (3037-66400)   timestamp : 2016-02-04T17:20:37Z
Feb  4 12:20:38 cloud-qe-7 pulp: pulp.server.agent.direct.services:INFO: (3037-66400)   user data : {'task_id': '596cdad7-bdba-4672-99bd-5717719a380d', 'consumer_id': 'cb80d298-c094-4df6-b5ca-4138acb0fb62'}
Feb  4 12:20:38 cloud-qe-7 pulp: pulp.server.agent.direct.services:INFO: (3037-66400)   retval:
Feb  4 12:20:38 cloud-qe-7 pulp: pulp.server.agent.direct.services:INFO: (3037-66400) {'reboot': {'scheduled': False, 'details': {}}, 'details': {'rpm': {'details': {'resolved': [{'name': 'zsh', 'qname': 'zsh-5.0.2-14.el7.x86_64', 'epoch': '0', 'version': '5.0.2', 'release': '14.el7', 'arch': 'x86_64', 'repoid': 'Default_Organization_RHEL7_RHEL7_x86_64'}], 'failed': [], 'deps': []}, 'succeeded': True}}, 'succeeded': True, 'num_changes': 1}

Comment 18 Sachin Ghai 2016-02-04 17:31:12 UTC
Installed packages on server:

 rpm -qa | grep katello
rubygem-hammer_cli_katello-0.0.7.21-1.el7sat.noarch
ruby193-rubygem-katello-2.2.0.83-1.el7sat.noarch
katello-2.2.0.18-1.el7sat.noarch
pulp-katello-0.5-1.el7sat.noarch
katello-installer-2.3.25-1.el7sat.noarch
katello-common-2.2.0.18-1.el7sat.noarch
katello-server-ca-1.0-1.noarch
katello-debug-2.2.0.18-1.el7sat.noarch
katello-installer-base-2.3.25-1.el7sat.noarch
katello-default-ca-1.0-1.noarch
katello-certs-tools-2.2.1-1.el7sat.noarch
katello-service-2.2.0.18-1.el7sat.noarch

Comment 19 Justin Sherrill 2016-02-04 18:10:51 UTC
Hrmm it looks like the cancel request went through to pulp:

10.16.96.108 - - [04/Feb/2016:12:19:31 -0500] "DELETE /pulp/api/v2/tasks/596cdad7-bdba-4672-99bd-5717719a380d/ HTTP/1.1" 200 4 "-" "Ruby"



yet the client still picked it up:

Feb  4 17:20:34 shost goferd: [INFO][worker-0] gofer.rmi.dispatcher:600 - call: Content.install() sn=f71038b1-6101-43aa-ae23-c523504b4664 data={u'task_id': u'596cdad7-bdba-4672-99bd-5717719a380d', u'consumer_id': u'cb80d298-c094-4df6-b5ca-4138acb0fb62'}


I even waited ~3 minutes and it still occurred.

Michael, is it expected that cancelling agent tasks in pulp does not work in current 6.1?

Comment 20 Michael Hrivnak 2016-02-05 12:05:37 UTC
Jeff, tag you're it.

Comment 25 Stuart Auchterlonie 2016-02-10 13:10:49 UTC
(In reply to Justin Sherrill from comment #19)
> Hrmm it looks like the cancel request went through to pulp:
> 
> 10.16.96.108 - - [04/Feb/2016:12:19:31 -0500] "DELETE
> /pulp/api/v2/tasks/596cdad7-bdba-4672-99bd-5717719a380d/ HTTP/1.1" 200 4 "-"
> "Ruby"
> 
> 
> 
> yet the client still picked it up:
> 
> Feb  4 17:20:34 shost goferd: [INFO][worker-0] gofer.rmi.dispatcher:600 -
> call: Content.install() sn=f71038b1-6101-43aa-ae23-c523504b4664
> data={u'task_id': u'596cdad7-bdba-4672-99bd-5717719a380d', u'consumer_id':
> u'cb80d298-c094-4df6-b5ca-4138acb0fb62'}
> 
> 
> I even waited ~3 minutes and it still occured.
> 
> michael, is it expected that cancelling agent tasks in pulp does not work in
> current 6.1?

I'm pretty sure it's because there is still a message on the message
queue that the remote goferd picks up.

Perhaps pulp is failing to remove the message from the msg queue?

Comment 26 Jeff Ortel 2016-02-10 20:22:27 UTC
When the pulp task runs, a request message is queued to the agent (goferd).  When the task is canceled, a cancellation message is queued.  The request message cannot be *deleted* from the agent's queue for several reasons.  The cancellation message is received by goferd and the corresponding request is flagged as canceled.  It's the plugin's (for satellite, it's the katelloplugin) responsibility to discard canceled requests or abort processing of requests already in progress.  In goferd 2.7 an improvement was added whereby it discards canceled requests before they are dispatched to the plugin.

The original fix for this BZ addressed cancellation of capsule sync.  When the task is canceled in pulp, the cancel message is sent and processed by goferd.  The pulp code running within the katello plugin checks to see if it has been canceled and aborts.  Since the capsule sync runs relatively long and slow, the cancellation message has ample time to be processed by goferd, and the sync logic running in the plugin aborts in its early stages.

However, when testing this bug as it relates to canceling package installs, the results will be different (as observed).  The pulp code for installing packages used in the katello plugin does not check for cancellation.  Even if it did, it's likely that the request execution logic (in the plugin) will have already passed the cancellation check and called into YUM.  By then, it's too late.

The solution involves creating a cancellation window.  Adding 0.5 seconds of latency to each request creates an opportunity for the cancellation to be received and processed before the request has been dispatched to the plugin for execution.  Gofer 2.7 provides a plugin configuration property that is intended to provide throttling and to adjust the cancellation window.

The complete resolution of this BZ requires gofer 2.7.4+ and an additional property added to /etc/gofer/plugins/katelloplugin.conf.
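The cancellation window described above can be sketched as a toy simulation (illustrative only; the 0.5 s figure comes from the comment, but the code and names are not gofer's): each request is held for `latency` seconds before dispatch, which gives a cancellation message time to arrive and flag the request's serial number so it is discarded instead of executed.

```python
import time
import threading

canceled = set()    # serial numbers flagged by cancellation messages
results = []

def dispatch(sn, action, latency=0.5):
    """Hold each request for `latency` seconds so cancellations can win the race."""
    time.sleep(latency)             # the cancellation window
    if sn in canceled:
        results.append((sn, "discarded"))
        return
    results.append((sn, action()))  # too late to cancel: run it

def cancel(sn):
    canceled.add(sn)

worker = threading.Thread(target=dispatch, args=("sn-1", lambda: "installed"))
worker.start()
cancel("sn-1")      # arrives well within the 0.5 s window
worker.join()
print(results)      # [('sn-1', 'discarded')]
```

Without the latency (set it to 0), the dispatch usually wins the race and the action runs, which is exactly the package-install behavior observed above.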

Comment 27 Stuart Auchterlonie 2016-02-11 12:04:27 UTC
Jeff,

Is there any mileage in utilizing the message expiry feature of qpid/amqp?

e.g. For package installs, the request is placed on the queue with an
expiry time equal to the timeout. If the client is offline and the timeout is
exceeded, the message will be discarded by the broker rather than delivered.

Comment 28 Jeff Ortel 2016-02-11 21:25:58 UTC
Stuart,

I gave using TTL (time-to-live) strong consideration.  It would support the use case you described.  However, goferd reads (batches of) messages from the AMQP queue and queues them in memory using a persistent queue [1].  A message /could/ be consumed from the AMQP queue before the TTL expires, but goferd could then be stopped prior to executing the request.  On start up, the request would resume execution.

But, perhaps, the main use case is when requests pile up (and time out in satellite) because the host/agent is offline, and having the messages expire and be discarded by the broker is good enough?

Let me give a TTL based solution more thought.

In any case, support for a caller-defined TTL for agent-related API calls would need to be added to Pulp.
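The trade-off Jeff describes can be sketched as a toy model (not qpid or goferd code; queue shapes and names are invented for illustration): broker-side TTL discards messages still sitting on the AMQP queue when they expire, but a message already consumed into the agent's local persistent queue survives a restart and runs anyway.

```python
import time

now = time.time()

# (payload, enqueued_at, ttl_seconds) -- toy stand-in for an AMQP queue with TTL.
broker_queue = [("install zsh", now - 30, 20)]  # still on the broker, older than its TTL
local_queue = [("install vim", now - 30, 20)]   # consumed before expiry, then goferd stopped

def drain_broker(queue):
    """Broker honors TTL: expired messages are discarded, never delivered."""
    return [m for (m, t0, ttl) in queue if time.time() - t0 <= ttl]

def resume_local(queue):
    """goferd restart: the local persistent queue replays everything, TTL or not."""
    return [m for (m, _t0, _ttl) in queue]

delivered = drain_broker(broker_queue)  # [] -- the offline-host pile-up case is covered
replayed = resume_local(local_queue)    # ['install vim'] -- the stop/restart gap remains
print(delivered, replayed)
```

This matches the comment's conclusion: TTL handles the "host was offline the whole time" case well, but not a request that was consumed and then resumed after a goferd restart.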

Comment 30 Stuart Auchterlonie 2016-02-17 11:44:51 UTC
Hi Jeff,

Yes, the main use case I see is for the client system to be offline
when the operation was requested.

As one of our customers put it, "having satellite patch systems once the
maintenance window is over, and potentially during the working day,
is a career limiting move"

Especially as this could be some hours after the window.


Regards
Stuart

Comment 33 Mike McCune 2016-03-28 22:49:28 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions

Comment 34 Mike McCune 2016-05-10 14:19:55 UTC
There is no known patch or resolution to this bug, and it should not be in the POST state. Moving to NEW

Comment 35 Jeff Ortel 2016-09-28 14:12:35 UTC
Looks like this was fixed on Feb 6.

https://github.com/Katello/katello-agent/commit/7e81dc80c03ae1a0b0b850221e9da9924e76ae11

Comment 36 Chris Duryee 2016-09-28 18:17:43 UTC
This bz has been verified as fixed in 6.2. I am closing out the 6.1.z clone since we currently do not plan on doing another 6.1.z release. If this is an issue, feel free to re-open.

