Bug 1368103

Summary: ruby consumes 43GB RSS when there are lots of stuck errata apply tasks
Product: Red Hat Satellite
Reporter: Jan Hutař <jhutar>
Component: Tasks Plugin
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Version: 6.2.0
CC: aruzicka, bbuckingham, bkearney, inecas, jcallaha, jhutar, jsherril, psuriset
Target Milestone: Unspecified
Keywords: Reopened, Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Whiteboard: scale_lab
Last Closed: 2017-03-29 02:55:42 UTC
Type: Bug
Bug Blocks: 1352526

Description Jan Hutař 2016-08-18 12:27:22 UTC
I'm doing some Ruby memory experiments, and sometime around 07:59:07 (±10s) ruby went from:

  3290628 ruby

to:

  2149180 ruby

This is in the production.log:

  [...]
  2016-08-18 07:57:28 [app] [I] Completed 200 OK in 998ms (Views: 257.5ms | ActiveRecord: 730.5ms)
  2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating throttle_limiter...
  2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating throttle_limiter...
  2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating client dispatcher...
  2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] stop listening for new events...
  2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating client dispatcher...
  2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] stop listening for new events...
  2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating clock...
  2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating clock...
  2016-08-18 07:59:43 [app] [I] Started GET "/foreman_tasks/tasks/3cee987c-7266-4532-a222-bac6b8355029/sub_tasks" for 213.211.43.129 at 2016-08-18 07:59:43 -0400
  [...]

Any idea what it might be?

I use this crazy one-liner to "measure" ruby RSS usage (well, not only ruby,
but it is in the top 10):

  ps --no-headers -eo rss,comm>a; for comm in $( sed 's/^\s*[0-9]\+\s*\(.*\)$/\1/' a | sort -u ); do size=$( grep "\s$comm" a | sed 's/^\s*\([0-9]\+\)\s*.*$/\1/' | paste -sd+ - | bc ); echo "$size $comm"; done | sort -n | tail
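
Roughly the same per-command RSS totals can be obtained with a somewhat more readable awk variant (an editor's sketch, not part of the original report; it assumes the same `ps` output of RSS in kB followed by the command name, and like the original it groups only by the first token of the command):

  # Sketch: sum RSS (kB) per command name and show the ten biggest consumers.
  ps --no-headers -eo rss,comm \
    | awk '{ rss[$2] += $1 } END { for (c in rss) print rss[c], c }' \
    | sort -n | tail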

and here is what I have done:

  1. I have clients registered through capsules, with goferd/katello-agent running
  2. I have shut down the capsules, so the clients cannot reach the Satellite
  3. I have scheduled an errata update on 400 clients
  4. I was sitting and watching the ruby memory usage

Around that 07:59:07, I had not touched the Satellite at all.

I'm looking into this because Pradeep noticed that at some point ruby was
consuming 43GB RSS, and we believe this is connected with some stuck
tasks (we had some problems with qpidd/qdrouterd/the clients themselves, so
our 10k errata update task was a bit of a mess on the first two tries).

Comment 2 Ivan Necas 2016-08-18 15:12:24 UTC
I'm setting the component to the Tasks plugin. The issue might be connected to the fact that the Dynflow runtime keeps the data of currently running tasks in memory while waiting for events: that could account for the amount of memory consumed if the tasks get stuck.

Solving this would require Dynflow being able to drop some of that in-memory data and load it from the database again on demand.

Another thing that might help would be the Katello bulk actions setting an upper limit on concurrent tasks (Dynflow has support for this), so we would not get to the level of concurrency where the memory issues appear. Another question is whether the timeout limit for the stuck actions could be lowered, so the task finishes sooner (and does not take up that much memory).

Comment 3 Ivan Necas 2016-08-18 15:13:45 UTC
/cc Justin to spread awareness of this issue

Comment 5 Ivan Necas 2016-09-08 12:15:22 UTC
Created redmine issue http://projects.theforeman.org/issues/16488 from this bug

Comment 6 Ivan Necas 2016-11-01 20:45:01 UTC
Would it be possible to reproduce the issue again, but using `ps aux` or something like that, so that we have better granularity on the source of the growth, or even just by periodically capturing the `ps aux` output?
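
A minimal way to do that periodic tracking might look like the following (an editor's sketch only; the interval and log path are arbitrary choices, not something requested in this BZ):

  # Sketch: capture a timestamped `ps aux` snapshot every 60 seconds so the
  # RSS growth can later be correlated with the task timeline.
  while true; do
      { date --iso-8601=seconds; ps aux --sort=-rss; echo; } >> /var/tmp/ps-aux-track.log
      sleep 60
  done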

From the subject, it seemed as if Dynflow was consuming 43GB, but if I understand it correctly, that was the sum of all ruby processes (including Passenger with Foreman and Puppet), and the fact that ruby went from 3290628 to 2149180 means Dynflow had around 1GB at that moment (3290628 kB - 2149180 kB = 1141448 kB, roughly 1.1 GB), right?

The reason I'm asking is that we're thinking about implementing some recycling at a certain memory threshold, and the more input we have about the growth in real-world scenarios, the better the hints on where to set the threshold for exiting the process.

Comment 7 Ivan Necas 2016-11-01 22:42:29 UTC
Another thing I would ask for is the tar from `foreman-rake foreman_tasks:export_tasks tasks=all days=1`, with `days` updated based on the number of days the tasks were running while the memory grew.
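
For example, if the memory grew over roughly the last three days (an illustrative value, not a prescription; adjust `days` to cover the affected period), the export would be:

  # Export all tasks from the last 3 days (illustrative value) and attach
  # the generated tarball to this BZ.
  foreman-rake foreman_tasks:export_tasks tasks=all days=3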

Comment 10 Ivan Necas 2016-12-07 19:33:26 UTC
I'm closing this issue, as we don't have enough information for a diagnosis, and there can be multiple reasons for this happening (some of them already identified in other BZs mentioned in [1]). The reason for closing is to avoid misunderstandings when dealing with several independent issues in one BZ.

Besides possible causes of the issue, [1] also describes how to collect data for faster analysis and what additional information to provide.

If the issue turns up again when performing errata installation, please reopen the BZ and attach the requested data. If it's not related to errata installation, and doesn't seem to be any of the existing bugs mentioned in [1], please file a new BZ for it.

[1] - https://access.redhat.com/site/solutions/2785811

Comment 11 Pradeep Kumar Surisetty 2017-03-29 02:14:25 UTC
We have registered up to 9K hosts with katello-agent and have plans to go to a larger scale.

Just for 9K itself: Ruby @ 50G, passenger-foreman @ 46G, dynflow_executor @ 9G,
as shown in the attached screenshots.