I'm doing some Ruby memory experiments, and sometime around 07:59:07 (±10s) the summed ruby RSS went from 3290628 kB to 2149180 kB. This is in production.log:

```
[...]
2016-08-18 07:57:28 [app] [I] Completed 200 OK in 998ms (Views: 257.5ms | ActiveRecord: 730.5ms)
2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating throttle_limiter...
2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating throttle_limiter...
2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating client dispatcher...
2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] stop listening for new events...
2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating client dispatcher...
2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] stop listening for new events...
2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating clock...
2016-08-18 07:59:03 [foreman-tasks/dynflow] [I] start terminating clock...
2016-08-18 07:59:43 [app] [I] Started GET "/foreman_tasks/tasks/3cee987c-7266-4532-a222-bac6b8355029/sub_tasks" for 213.211.43.129 at 2016-08-18 07:59:43 -0400
[...]
```

Any idea what it might be? I use this crazy pipeline to "measure" ruby RSS usage (well, not only ruby, but it is in the top 10):

```shell
ps --no-headers -eo rss,comm > a
for comm in $( sed 's/^\s*[0-9]\+\s*\(.*\)$/\1/' a | sort -u ); do
  size=$( grep "\s$comm" a | sed 's/^\s*\([0-9]\+\)\s*.*$/\1/' | paste -sd+ - | bc )
  echo "$size $comm"
done | sort -n | tail
```

And here is what I have done:

1. I have clients registered through capsules, with goferd/katello-agent running.
2. I shut down the capsules, so the clients could not reach the Satellite.
3. I scheduled an errata update on 400 clients.
4. I sat watching the ruby memory usage.

Around that 07:59:07 I had not touched the Satellite at all.
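For what it's worth, the same per-command RSS summation can be done in a single awk pass; a minimal sketch (note that the `grep "\s$comm"` in the pipeline above can also match other commands sharing the same suffix, which exact-key grouping in awk avoids):

```shell
# Sum RSS (kB) per command name in one pass, then show the top 10.
# Equivalent to the sed/grep/bc loop, but groups on the exact command
# name rather than a substring match.
ps --no-headers -eo rss,comm \
  | awk '{ rss[$2] += $1 } END { for (c in rss) print rss[c], c }' \
  | sort -n | tail
```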
I'm looking into this because Pradeep noticed that at some point ruby was consuming 43GB RSS, and we believe this is connected with some stuck tasks (we had some problems with qpidd/qdrouterd/the clients themselves, so our 10k errata update task was a bit of a mess on the first two tries).
I'm putting the component on tasks. The issue might be connected to the fact that the Dynflow runtime keeps the data of currently-running tasks in memory while waiting for events: that could account for the memory consumption if the tasks get stuck. Solving it properly would require Dynflow being able to drop some of that in-memory data and load it from the database again on demand. Another thing that might help would be the Katello bulk actions setting an upper limit on concurrent tasks (Dynflow has support for this), so we would not get to the level of concurrency where the memory issues appear. Another question is whether the timeout limit for the stuck actions could be lowered, so the task finishes sooner and does not take up that much memory.
/cc Justin to spread awareness of this issue
Created redmine issue http://projects.theforeman.org/issues/16488 from this bug
Would it be possible to re-reproduce the issue, but using `ps aux` or something like that, so that we have better granularity on the source of the growth, or even just periodically tracking the `ps aux` output? From the subject, it seemed as if Dynflow was consuming 43GB, but if I understand it correctly, that was the sum of all Ruby processes (including Passenger with Foreman and Puppet), and the fact that ruby went from 3290628 to 2149180 means Dynflow had around 1GB at that moment, right? The reason I'm asking is that we're thinking about implementing some recycling at a memory threshold, and the more input we have about the growth in real-world scenarios, the better hints we get on how to set the thresholds for exiting the process.
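Something as simple as this would do for the periodic tracking; a minimal sketch (the output path and the top-20 cutoff are my assumptions, adjust to taste):

```shell
#!/bin/sh
# Append one timestamped `ps aux` snapshot per invocation, sorted by RSS,
# so per-process growth can later be correlated with production.log events.
# Run it from cron (e.g. every minute) for the duration of the reproducer.
OUT=${OUT:-/tmp/ps-snapshots.log}   # assumed path
{
  echo "=== $(date '+%F %T') ==="
  ps aux --sort=-rss | head -20     # top 20 processes by resident set size
} >> "$OUT"
```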
Another thing I would ask for is the tar from `foreman-rake foreman_tasks:export_tasks tasks=all days=1`, with `days` adjusted to the number of days the tasks were running while the memory grew.
I'm closing this issue, as we don't have enough information for a diagnosis, and there can be multiple reasons for this happening (some of them already identified in other BZs mentioned in [1]). The reason for closing is to avoid misunderstandings from dealing with several independent issues in one BZ. Besides possible causes of the issues, [1] also describes how to collect data for faster analysis and provides additional information. If the issue turns up again when performing errata installation, please reopen the BZ and attach the requested data. If it's not related to errata installation and doesn't seem to be any of the existing bugs mentioned in [1], please file a new BZ for it.

[1] - https://access.redhat.com/site/solutions/2785811
We have registered up to 9K clients with katello-agent and have plans to go to a larger scale. Just for 9K: Ruby @ 50G, passenger-foreman @ 46G, Dynflow_executor @ 9G, as shown in the attached screenshots.