Bug 1412307 - dynflow_executor memory leak when publishing CV triggers Actions::Katello::ContentView::CapsuleGenerateAndSync on multiple Capsules
Summary: dynflow_executor memory leak when publishing CV triggers Actions::Katello::Co...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Tasks Plugin
Version: 6.2.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Justin Sherrill
QA Contact: Lukas Pramuk
URL:
Whiteboard:
Duplicates: 1412308
Depends On:
Blocks: 1353215
 
Reported: 2017-01-11 17:26 UTC by Pavel Moravec
Modified: 2021-06-10 11:48 UTC
CC: 17 users

Fixed In Version: tfm-rubygem-foreman-tasks-0.7.14.13-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1416241 1446721
Environment:
Last Closed: 2017-06-20 17:22:02 UTC
Target Upstream Version:
Embargoed:


Attachments
Dynflow patch to enforce garbage collection after action is run (438 bytes, patch)
2017-03-13 15:54 UTC, Ivan Necas
Script for exporting task data into csv (1.34 KB, application/x-shellscript)
2017-03-13 16:01 UTC, Ivan Necas
A simple profiler for dynflow actions (7.05 KB, text/plain)
2017-03-22 17:02 UTC, Ivan Necas


Links
Foreman Issue Tracker 18060 (High, Closed): dynflow_executor memory leak when publishing CV triggers Actions::Katello::ContentView::CapsuleGenerateAndSync on multip... (last updated 2021-02-05 19:09:12 UTC)
Red Hat Product Errata RHBA-2017:1553 (normal, SHIPPED_LIVE): Satellite 6.2.10 Async Bug Release (last updated 2017-06-20 21:19:07 UTC)

Description Pavel Moravec 2017-01-11 17:26:45 UTC
Description of problem:
Setup: Sat6.2.6 with 4 Capsules in Library environment

Publishing a Content View (with some RPM repo in it) triggers an Actions::Katello::ContentView::CapsuleGenerateAndSync task to sync the repo to all Capsules. This step causes a memory leak in the dynflow_executor process.

When I remove the Capsules from the Library LE, the CapsuleGenerateAndSync task is not triggered and the leak is not present.

When I add the Capsules back to the LE and manually invoke in a rake console:

ForemanTasks.sync_task(::Actions::Katello::ContentView::CapsuleGenerateAndSync, ::Katello::ContentView.find(3), ::Katello::KTEnvironment.find(1)) while true

(so the task is kicked off repeatedly one after another), the leak is _not_ present.


Version-Release number of selected component (if applicable):
Sat 6.2.6
tfm-rubygem-dynflow-0.8.13.3-2.el7sat.noarch
tfm-rubygem-hammer_cli_foreman_tasks-0.0.10.3-1.el7sat.noarch
rubygem-smart_proxy_dynflow-0.1.3-1.el7sat.noarch
tfm-rubygem-smart_proxy_dynflow_core-0.1.3-1.el7sat.noarch
tfm-rubygem-foreman-tasks-0.7.14.11-1.el7sat.noarch


How reproducible:
100%


Steps to Reproduce:
1. Have 4 Capsules associated to the Library LE (the more Capsules, the more evident the leak)

2. Run the script below, which in a loop:
- creates a CV
- adds a repo (use a small one there)
- publishes the CV
- removes the previous iteration's CV in 2 steps (the previous CV rather than the current one, to let Satellite sync the content to the Capsules first)

hmr="hammer -u admin -p redhat "

i=0
while true; do
	echo "$(date): working with CV cv_test_$i"
	$hmr content-view create --organization=RedHat --name=cv_test_$i
	$hmr content-view add-repository --repository-id=1 --organization=RedHat --name=cv_test_$i
        $hmr content-view publish --organization=RedHat --name=cv_test_$i
	if [ $i -gt 0 ]; then
		j=$((i-1))
	        $hmr content-view remove-from-environment --lifecycle-environment=Library --organization=RedHat --name=cv_test_$j
        	$hmr content-view delete --organization=RedHat --name=cv_test_$j
	fi
	i=$((i+1))
	sleep 10
done

3. monitor memory usage of dynflow_executor process
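For example, a minimal sampling loop like the one below (a sketch; any periodic ps capture works) appends the executor's full ps line to monitor_ps.txt every 5 minutes, so the RSS column (6th field of `ps aux`) can be extracted later:

while true; do
    # the 6th field of 'ps aux' output is RSS in kB
    ps aux | grep -v grep | grep dynflow_executor$ >> monitor_ps.txt
    sleep 300    # sample every 5 minutes
done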


Actual results:
memory usage grows over time


Expected results:
no memory growth


Additional info:
I *think* the same leak is present when an Actions::Katello::Repository::CapsuleGenerateAndSync task is triggered (seen at a customer, haven't reproduced it yet) - we should check that as well

Comment 1 Justin Sherrill 2017-01-12 20:08:55 UTC
I've been investigating this over the course of a few days and have not been able to locate any specific code paths that can be optimized to reduce this memory usage.  It seems that this is likely a combination of Memory Fragmentation and infrequent Garbage Collection.

I have been able to tune ruby garbage collection for foreman tasks to cause it to run the garbage collection more often.  This seems to have reduced the amount of memory consumed during large capsule syncs considerably.  

I'd be interested in knowing whether this helps customers at all. It may cause some tasks to run a bit slower, but in practice I have not seen that so far.

To apply the tunings simply add the following to /etc/sysconfig/foreman-tasks

RUBY_GC_MALLOC_LIMIT=4000100
RUBY_GC_MALLOC_LIMIT_MAX=16000100
RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR=1.1
RUBY_GC_OLDMALLOC_LIMIT=16000100
RUBY_GC_OLDMALLOC_LIMIT_MAX=16000100

and restart the foreman-tasks service.  

Please do let us know how this helps. These are a good set of default tuning parameters to help reduce memory consumption, but it's possible they could be tuned even further. I didn't want to hold up getting this into customers' hands, though.
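For reference, the change can be applied and verified roughly as follows (a sketch; the file path and service name are taken from the instructions above):

cat >> /etc/sysconfig/foreman-tasks <<'EOF'
RUBY_GC_MALLOC_LIMIT=4000100
RUBY_GC_MALLOC_LIMIT_MAX=16000100
RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR=1.1
RUBY_GC_OLDMALLOC_LIMIT=16000100
RUBY_GC_OLDMALLOC_LIMIT_MAX=16000100
EOF
service foreman-tasks restart
# verify the running executor picked the variables up
ps eww $(pgrep -f 'dynflow_executor$') | grep -o 'RUBY_GC_[A-Z_]*=[0-9.]*'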

Comment 2 Justin Sherrill 2017-01-12 21:15:12 UTC
Created redmine issue http://projects.theforeman.org/issues/18060 from this bug

Comment 4 Pavel Moravec 2017-01-19 07:57:24 UTC
(In reply to Justin Sherrill from comment #1)
> I've been investigating this over the course of a few days and have not been
> able to locate any specific code paths that can be optimized to reduce this
> memory usage.  It seems that this is likely a combination of Memory
> Fragmentation and infrequent Garbage Collection.

I heard that theory and don't object to it. The pity is that customers don't care where the root cause is :-/ (gladly, the increase isn't huge here, but I *think* it is proportional to the size of synced data / number of Capsules / time spent in capsule sync)

> 
> I have been able to tune ruby garbage collection for foreman tasks to cause
> it to run the garbage collection more often.  This seems to have reduced the
> amount of memory consumed during large capsule syncs considerably.  
> 
> I'd be interested in knowing whether this helps customers at all. It may cause
> some tasks to run a bit slower, but in practice I have not seen that so far.
> 
> To apply the tunings simply add the following to /etc/sysconfig/foreman-tasks
> 
> RUBY_GC_MALLOC_LIMIT=4000100
> RUBY_GC_MALLOC_LIMIT_MAX=16000100
> RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR=1.1
> RUBY_GC_OLDMALLOC_LIMIT=16000100
> RUBY_GC_OLDMALLOC_LIMIT_MAX=16000100
> 
> and restart the foreman-tasks service.  
> 
> Please do let us know how this helps. These are a good set of default tuning
> parameters to help reduce memory consumption, but it's possible they could be
> tuned even further. I didn't want to hold up getting this into customers'
> hands, though.

Agreed, this helped to some extent - approx. 1/4 of the increase is gone.

Running the script (*) for 24 hours on the original Sat 6.2.6 and collecting the RSS of dynflow_executor every 5 minutes:

(*) I increased "sleep 10" -> "sleep 60" to prevent frequent situations where capsule sync tasks were waiting on an older instance. Even so, this situation happened several times during each test.

$ grep dynflow_executor$ monitor_ps.txt.orig | awk '{ print $6 }'
435380
455960
473972
496308
506300
549196
586948
591472
599692
610968
611464
611528
621640
629720
642172
644992
672036
699072
716744
717000
725368
725424
726252
726352
726488
726492
726744
727620
729692
729668
729704
732400
731408
731400
731628
731728
731684
731588
731764
730180
730184
736788
736788
736816
738068
739104
739040
740516
740764
749280
750448
749396
749424
749392
752296
752276
752208
752228
752292
752268
752324
756592
759884
760188
760168
760296
760256
760232
760184
760108
760156
760200
760300
760132
764300
764456
764496
765408
765308
765356
765384
765336
766176
767168
767180
766352
766432
766396
766304
766344
766316
766272
766436
767396
770828
773068
773108
773360
774712
777876
789172
790056
789148
789408
790432
790396
789420
789444
789364
790252
795292
796980
800776
800720
801180
801632
801648
801572
800348
800244
800232
800272
802872
804628
804696
805668
805624
805596
806360
808892
809068
808892
809100
806696
806744
807860
808020
807920
807908
807016
806936
807024
806968
806944
806956
806876
806892
806996
806896
806988
806980
812132
814464
814516
814656
815632
815656
814628
814708
814584
814664
815632
813564
814020
813904
816948
816828
816944
817024
816948
817048
817064
816992
816964
817456
817444
817492
817572
816616
816660
816728
816700
816740
816788
818264
818320
818352
818328
818432
818400
818460
818440
818432
818352
818416
818500
818396
818496
818484
818436
817848
817876
817912
820916
821092
822132
821136
821132
821384
820748
820728
820736
820940
820728
821736
822132
822852
822952
823012
823120
823032
822936
822912
822080
824672
823688
823772
823760
823708
823708
823792
823700
823800
827304
828152
827256
827308
827276
827272
827360
829848
829652
829688
830828
830768
830984
830932
830968
830072
831852
829460
829508
829400
829424
830372
829424
829724
830320
830248
830224
830464
830304
830220
830368
831360
831384
830428
830332
830360
830448
830492
830380
839864
840728
840792
840756
840600
840712
840668
840792
839620
839808
$

$ grep dynflow_executor$ monitor_ps.txt.new | awk '{ print $6 }'
380664
415092
435420
445008
463296
496364
521328
530568
531196
531324
531824
528656
540364
559972
562184
563964
564064
567488
567616
567508
571900
582044
587448
590516
591952
596880
597088
599916
600136
602348
604184
606260
606272
606288
606400
606464
610872
610952
611632
611752
611656
609648
612692
618284
618696
618668
618636
618612
619028
619156
619208
619264
619788
619804
619792
619808
621188
621076
621756
620728
620772
620652
626472
627028
627176
627336
628888
628960
630848
630936
630948
630884
630984
628996
629596
629560
630612
630592
629080
630676
632520
633208
633348
633328
633360
632280
632432
632432
632456
632776
635264
635236
635348
636256
636256
636708
636876
637024
640096
650472
652688
650996
651068
651432
651412
658388
658484
659540
659708
658620
660324
660304
660948
662452
661616
661516
661524
661632
661592
663340
662592
662484
662460
660392
661900
661808
661784
661744
660840
663888
663516
663688
663836
663808
667312
667280
666348
666472
666484
666504
666460
666444
666412
666616
667420
667360
667296
667460
666468
666580
666544
666556
666488
671236
672184
672224
672160
671180
671348
671392
671608
671944
671960
672116
671956
672048
672100
672024
672156
675444
672124
672004
671940
672980
673040
672984
672004
672020
675636
675484
675568
683920
684956
685504
685348
685256
685212
685044
685092
685072
685164
685240
685272
686220
688540
688420
685516
685640
685528
687584
686472
685588
685720
685680
686300
686072
686840
686840
686692
686732
686836
686776
686724
686892
686976
687636
688376
688932
688936
689052
688952
689004
688008
688264
687304
687348
687268
686344
686272
686412
686364
686392
686316
686296
687356
687544
687432
687496
687496
687596
687584
687460
687488
688404
693828
693004
693012
693144
693164
693156
693036
693208
693104
694252
693316
693404
693320
693332
693244
693316
693276
693284
693256
693412
694240
694264
693352
693996
692752
692748
692808
694352
694300
694292
695240
694256
694272
694228
694256
694284
$

In the original test (without GC tuning), 850 CVs were created & published & synced to Capsules & deleted, and RSS grew from 435380 kB to 839808 kB, i.e. by 404 MB.

In the test with tuned GC, 807 CVs were created & published etc., and RSS grew from 380664 kB to 694284 kB, i.e. by 313 MB.
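(For reference, the growth figures can be recomputed directly from the raw samples; this assumes the monitor files contain plain ps aux lines with RSS in the 6th column, as grepped above:)

grep dynflow_executor$ monitor_ps.txt.orig | awk 'NR==1 {first=$6} {last=$6} END {print "grew from", first, "to", last, "kB, i.e. by", last-first, "kB"}'
grep dynflow_executor$ monitor_ps.txt.new  | awk 'NR==1 {first=$6} {last=$6} END {print "grew from", first, "to", last, "kB, i.e. by", last-first, "kB"}'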


I can try to find the relationship/dependency between memory growth on one side and the number of Capsules, repo size, or sync duration on the other, if that helps - let me know if it would be valuable.

As already stated, if the number of Capsules to sync is zero (i.e. if I remove the Capsules from the Library LE), no CapsuleSync task runs and no memory growth is present.

Also, if I invoke the CapsuleSync task directly (i.e. not triggered by a CV task), no leak is present. That sounds like the leak is present only for a task triggered by another task?

Comment 5 Pavel Moravec 2017-01-22 14:30:01 UTC
(In reply to Pavel Moravec from comment #0)
> Additional info:
> I *think* the same leak is present when an
> Actions::Katello::Repository::CapsuleGenerateAndSync task is triggered (seen at a
> customer, haven't reproduced it yet) - we should check that as well

Confirmed it leads to a memory leak as well. "Minimalistic" reproducer:

- Have some Capsules assigned to Library LE

- to speed up the reproducer, force post-sync actions (so that even if a repo sync does not fetch new content, post-sync actions like capsule sync are still triggered)
  - In WebUI: Administer -> Settings -> katello -> force_post_sync_actions = true

- run the script below (replace the sat-6.2-tools repo with any enabled repo):

hmr="hammer -u admin -p redhat "

i=1
while true; do
	echo "$(date): running $i-th iteration"
	$hmr repository synchronize --name "Red Hat Satellite Tools 6.2 for RHEL 7 Server RPMs x86_64" --product "Red Hat Enterprise Linux Server" --organization=RedHat

	echo "$(date): finished sync, waiting for Capsule Sync to finish"
	while true; do
		sleep 60
		if [ $(hammer -u admin -p redhat task list --search "label = Actions::Katello::Repository::CapsuleGenerateAndSync AND state = running" | grep -c running) -eq 0 ]; then
			break
		fi
		echo "$(date): still sleeping.."
	done

	echo "$(date): memory utilization: $(ps aux | grep -v grep | grep dynflow_executor$)"
	i=$((i+1))
done

- check the output for the ps samples. In my case it was (for 4 Capsules):

Sat Jan 21 12:26:25 CET 2017: memory utilization: foreman  11588  2.1  4.5 1722828 550344 ?      Sl   11:23   1:23 dynflow_executor
Sat Jan 21 12:31:28 CET 2017: memory utilization: foreman  11588  2.1  4.5 1722828 551768 ?      Sl   11:23   1:28 dynflow_executor
Sat Jan 21 12:36:49 CET 2017: memory utilization: foreman  11588  2.1  4.6 1722828 564308 ?      Sl   11:23   1:33 dynflow_executor
Sat Jan 21 12:41:53 CET 2017: memory utilization: foreman  11588  2.0  4.7 1722828 573584 ?      Sl   11:23   1:38 dynflow_executor
Sat Jan 21 12:46:22 CET 2017: memory utilization: foreman  11588  2.0  4.7 1722828 578236 ?      Sl   11:23   1:43 dynflow_executor
Sat Jan 21 12:51:25 CET 2017: memory utilization: foreman  11588  2.0  4.7 1722828 581728 ?      Sl   11:23   1:48 dynflow_executor
Sat Jan 21 12:55:18 CET 2017: memory utilization: foreman  11588  2.0  4.9 1722828 597552 ?      Sl   11:23   1:53 dynflow_executor
Sat Jan 21 12:58:44 CET 2017: memory utilization: foreman  11588  2.0  4.9 1722828 601092 ?      Sl   11:23   1:58 dynflow_executor
Sat Jan 21 13:02:19 CET 2017: memory utilization: foreman  11588  2.0  5.0 1722828 607380 ?      Sl   11:23   2:03 dynflow_executor
Sat Jan 21 13:05:46 CET 2017: memory utilization: foreman  11588  2.0  5.0 1722828 610532 ?      Sl   11:23   2:08 dynflow_executor
Sat Jan 21 13:09:09 CET 2017: memory utilization: foreman  11588  2.0  5.0 1722828 610768 ?      Sl   11:23   2:13 dynflow_executor
Sat Jan 21 13:12:40 CET 2017: memory utilization: foreman  11588  2.1  5.0 1722828 610480 ?      Sl   11:23   2:18 dynflow_executor
Sat Jan 21 13:16:12 CET 2017: memory utilization: foreman  11588  2.1  5.0 1722828 610664 ?      Sl   11:23   2:23 dynflow_executor
Sat Jan 21 13:19:57 CET 2017: memory utilization: foreman  11588  2.1  5.1 1722964 619764 ?      Sl   11:23   2:28 dynflow_executor
Sat Jan 21 13:23:31 CET 2017: memory utilization: foreman  11588  2.1  5.1 1724072 620628 ?      Sl   11:23   2:33 dynflow_executor
Sat Jan 21 13:26:53 CET 2017: memory utilization: foreman  11588  2.1  5.1 1724072 620544 ?      Sl   11:23   2:38 dynflow_executor
Sat Jan 21 13:30:26 CET 2017: memory utilization: foreman  11588  2.1  5.1 1724072 620468 ?      Sl   11:23   2:43 dynflow_executor
Sat Jan 21 13:33:52 CET 2017: memory utilization: foreman  11588  2.1  5.1 1724072 625964 ?      Sl   11:23   2:48 dynflow_executor
Sat Jan 21 13:37:18 CET 2017: memory utilization: foreman  11588  2.1  5.1 1724072 627980 ?      Sl   11:23   2:53 dynflow_executor
Sat Jan 21 13:41:03 CET 2017: memory utilization: foreman  11588  2.1  5.2 1724072 637348 ?      Sl   11:23   2:58 dynflow_executor
Sat Jan 21 13:44:28 CET 2017: memory utilization: foreman  11588  2.1  5.3 1724072 645788 ?      Sl   11:23   3:03 dynflow_executor
Sat Jan 21 13:47:54 CET 2017: memory utilization: foreman  11588  2.1  5.3 1724072 649088 ?      Sl   11:23   3:09 dynflow_executor
Sat Jan 21 13:51:29 CET 2017: memory utilization: foreman  11588  2.1  5.3 1724072 648840 ?      Sl   11:23   3:14 dynflow_executor
Sat Jan 21 13:54:56 CET 2017: memory utilization: foreman  11588  2.1  5.3 1724072 650556 ?      Sl   11:23   3:19 dynflow_executor
Sat Jan 21 13:58:23 CET 2017: memory utilization: foreman  11588  2.1  5.3 1724072 652144 ?      Sl   11:23   3:24 dynflow_executor
Sat Jan 21 14:01:47 CET 2017: memory utilization: foreman  11588  2.1  5.3 1724072 654948 ?      Sl   11:23   3:29 dynflow_executor
Sat Jan 21 14:05:11 CET 2017: memory utilization: foreman  11588  2.2  5.3 1724072 654944 ?      Sl   11:23   3:34 dynflow_executor
Sat Jan 21 14:08:39 CET 2017: memory utilization: foreman  11588  2.2  5.3 1724072 654868 ?      Sl   11:23   3:39 dynflow_executor


- .. and for only 1 Capsule in Library LE:

Sat Jan 21 14:22:13 CET 2017: memory utilization: foreman  11588  2.1  5.3 1724072 654964 ?      Sl   11:23   3:47 dynflow_executor
Sat Jan 21 14:25:36 CET 2017: memory utilization: foreman  11588  2.0  5.3 1724072 654648 ?      Sl   11:23   3:49 dynflow_executor
Sat Jan 21 14:28:58 CET 2017: memory utilization: foreman  11588  2.0  5.3 1724072 654820 ?      Sl   11:23   3:52 dynflow_executor
Sat Jan 21 14:33:27 CET 2017: memory utilization: foreman  11588  2.0  5.3 1724072 654972 ?      Sl   11:23   3:54 dynflow_executor
Sat Jan 21 14:37:50 CET 2017: memory utilization: foreman  11588  2.0  5.3 1724072 654888 ?      Sl   11:23   3:56 dynflow_executor
Sat Jan 21 14:41:16 CET 2017: memory utilization: foreman  11588  2.0  5.3 1724072 655128 ?      Sl   11:23   3:58 dynflow_executor
Sat Jan 21 14:44:37 CET 2017: memory utilization: foreman  11588  1.9  5.3 1724072 655028 ?      Sl   11:23   4:00 dynflow_executor
Sat Jan 21 14:49:01 CET 2017: memory utilization: foreman  11588  1.9  5.3 1724072 655132 ?      Sl   11:23   4:03 dynflow_executor
Sat Jan 21 14:52:24 CET 2017: memory utilization: foreman  11588  1.9  5.3 1724072 655236 ?      Sl   11:23   4:05 dynflow_executor
Sat Jan 21 14:55:48 CET 2017: memory utilization: foreman  11588  1.9  5.3 1724072 654880 ?      Sl   11:23   4:07 dynflow_executor
Sat Jan 21 14:59:15 CET 2017: memory utilization: foreman  11588  1.9  5.3 1724072 654880 ?      Sl   11:23   4:09 dynflow_executor


- .. adding 3 Capsules back to Library (some time later):

Sun Jan 22 12:33:44 CET 2017: memory utilization: foreman  25712  2.2  6.1 1725808 745148 ?      Sl   Jan21  28:06 dynflow_executor
Sun Jan 22 12:37:15 CET 2017: memory utilization: foreman  25712  2.2  6.1 1725808 745256 ?      Sl   Jan21  28:11 dynflow_executor
Sun Jan 22 12:40:46 CET 2017: memory utilization: foreman  25712  2.2  6.1 1725808 745236 ?      Sl   Jan21  28:16 dynflow_executor
Sun Jan 22 12:44:18 CET 2017: memory utilization: foreman  25712  2.2  6.1 1725808 746932 ?      Sl   Jan21  28:21 dynflow_executor
Sun Jan 22 12:47:49 CET 2017: memory utilization: foreman  25712  2.2  6.1 1725808 747308 ?      Sl   Jan21  28:26 dynflow_executor
Sun Jan 22 12:51:18 CET 2017: memory utilization: foreman  25712  2.2  6.1 1725808 747540 ?      Sl   Jan21  28:31 dynflow_executor
Sun Jan 22 12:54:47 CET 2017: memory utilization: foreman  25712  2.2  6.1 1725808 747484 ?      Sl   Jan21  28:36 dynflow_executor



Another observation: the CapsuleGenerateAndSync task must actually be running, not only planned & cancelled. I.e. if I stop the Capsules, the task detects that in the planning phase and does not run - and no memory leak is observed.


Yet another observation: I was taking the RSS usage after each "sync loop". Often there was no memory increase in the latest iteration, sometimes there was a small one, sometimes a quite evident one. That suggests that either some concurrency race condition (that sometimes occurs and sometimes does not) is behind the leak, or something in the Ruby garbage collector (how do I enable GC logs? some verbose ones? that might shed some more light..).

Comment 6 Pavel Moravec 2017-01-22 21:20:31 UTC
(and again, manually triggering from foreman-rake console:

ForemanTasks.sync_task(::Actions::Katello::Repository::CapsuleGenerateAndSync, Katello::Repository.find(1)) while true

does not lead to memory consumption growth

)

Comment 7 Justin Sherrill 2017-01-24 21:03:22 UTC
Pavel, 

After examining the Ruby code related to this action, I believe that this has to do with memory fragmentation, and I don't know that there is much more we can do for 'easy wins'. Adding support for restarting the executor should solve the situation completely. This is what Pulp has done recently.

The GC tuning variables seemed to help customers greatly, going from ~16GB of memory used by the dynflow executor down to ~7GB. In my testing with 10 capsules and ~15 repos, this tuning also helped quite a bit in reducing overall memory consumption and the rate of growth.

I'm going to use this bz for adding the tuning variables and clone it to handle the fragmentation.

Comment 8 Mike McCune 2017-01-24 21:21:45 UTC
Customers reported similar success with limiting memory growth over time in production environments with the GC tunings.

IMO we should ship with these on by default.

Comment 9 Ivan Necas 2017-02-27 08:37:44 UTC
Could you provide a task export (foreman-rake foreman_tasks:export_tasks TASK_DAYS=1), together with the sampled ps aux output, so we can match the growth with the tasks - especially for the cases where it happens or doesn't happen?

Also, just to be sure: were the tests where you compare whether it happens or not always performed after the tasks service was restarted? The jumps in growth at the beginning are not that unexpected before the process warms up.
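(A sketch of collecting both pieces of data in one place; the sample filename here is arbitrary:)

foreman-rake foreman_tasks:export_tasks TASK_DAYS=1
# capture the current executor ps line alongside the export
ps aux | grep -v grep | grep dynflow_executor$ > dynflow_rss_sample.txt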

Comment 10 Satellite Program 2017-02-27 09:15:41 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/18060 has been resolved.

Comment 16 Ivan Necas 2017-03-13 08:03:23 UTC
I did various checks and tunings in the code to eliminate some unnecessary object creations, which seem to contribute the most to the continuous growth.

After a few of them, the memory growth seemed to stabilize: at the 150th iteration of the reproducer, the usage went from 480 MB through 460 MB down to 387 MB across the successive changes.
I still need to watch the behavior for a longer time and identify the changes that contributed the most to the improvement: some of them were quite large and might take a longer time to get upstream/downstream, while some are just tweaks to avoid unnecessary allocations.

Additionally, I've advised the garbage collector to do the cleanup more often: although it might help with the overall memory usage, this on its own didn't seem to have an influence on the growth itself.

Comment 21 Ivan Necas 2017-03-13 15:54:54 UTC
Created attachment 1262495 [details]
Dynflow patch to enforce garbage collection after action is run

An additional thing to try is to apply a patch that enforces garbage collection after each dynflow action is run. To apply it:

    cd /opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-0.8.*
    git init; git add -A .; git commit -m 'Original source'
    patch -p1 < ~/dynflow-gc.patch
    service foreman-tasks restart

This will turn on more aggressive garbage collection, advising the Ruby interpreter when it is a good time to clean up old data that participates in the memory growth.

Comment 22 Ivan Necas 2017-03-13 16:01:47 UTC
Created attachment 1262500 [details]
Script for exporting task data into csv

I've attached a script that can be used to export the task data into CSV.

Having that would be useful for further investigation and deeper analysis, to find correlations between the tasks and the memory growth. Also, if there are more recent data from the `ps` output, it would be useful to have them.

Comment 24 Ivan Necas 2017-03-21 10:50:42 UTC
@pmoravec I can't confirm the observed memory leak: after running the reproducer for a week, memory converged to 642MB, oscillating around that value. What could influence the results is other tasks run in the meantime. The measurements I did in between also don't suggest a memory leak is in place (no growth in managed objects). The only thing I can think of is the frequency of full garbage collection: the less frequent the full garbage collection, the higher the memory can grow (but it should still reach some upper bound). On the other hand, the more frequent the GC, the more time the Ruby interpreter spends just on GC, which could lead to performance degradation (tasks taking longer to process), which in turn could lead to a backlog of tasks accumulating.

I'm preparing a sampler to get more data about garbage collection and memory allocation from the customer. I'm also waiting for results on the memory consumption after running with the aggressive garbage collection that was turned on by the patch https://bugzilla.redhat.com/attachment.cgi?id=1262495&action=diff

Comment 25 Ivan Necas 2017-03-22 17:02:38 UTC
Created attachment 1265447 [details]
A simple profiler for dynflow actions

I'm still interested in the data about the memory behavior after applying
the aggressive garbage collection.

After sharing those, I would like to ask you to revert the patch
before we proceed with more data collection:

  cd /opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-0.8.*
  git checkout -- .

I'm attaching a simple profiler that collects information about Ruby garbage
collection with respect to the actions that were run, to be able to suggest
better tuning for the garbage collection.

Usage

1) copy the profiler to be loaded by satellite

   cp dynflow_profiler.rb /usr/share/foreman/config/initializers
   service foreman-tasks restart

2) leave the Satellite operating normally


3) grab the collected data to be shared with engineering:

   tar cvf dynflow_gc.tar /var/log/foreman/dynflow_gc.log*

4) remove the profiler after done collecting the data

   rm /usr/share/foreman/config/initializers/dynflow_profiler.rb
   service foreman-tasks restart

5) share the data with us

(more details about additional features of the profiler and how to use
it to extract some data are included inside the script)

Comment 33 Ivan Necas 2017-04-10 15:21:04 UTC
A note: there might be a relation between this and https://bugzilla.redhat.com/show_bug.cgi?id=1440235, where there might be some unreleased memory as part of processing Candlepin events, which could contribute to memory fragmentation. I'm asking for more information there to check whether we could be running into this situation with other users that experience high memory usage.

Comment 35 Lukas Pramuk 2017-06-13 07:47:04 UTC
VERIFIED.

@satellite-6.2.10-3.0.el7sat.noarch
tfm-rubygem-foreman-tasks-0.7.14.14-1.el7sat.noarch

using the following manual steps:

# grep -r _GC_ /etc
/etc/sysconfig/foreman-tasks:RUBY_GC_MALLOC_LIMIT=4000100
/etc/sysconfig/foreman-tasks:RUBY_GC_MALLOC_LIMIT_MAX=16000100
/etc/sysconfig/foreman-tasks:RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR=1.1
/etc/sysconfig/foreman-tasks:RUBY_GC_OLDMALLOC_LIMIT=16000100
/etc/sysconfig/foreman-tasks:RUBY_GC_OLDMALLOC_LIMIT_MAX=16000100

>>> RUBY_GC_* vars are present in the config file


# ps -efH | grep dynflow[_]executor
foreman  29059     1  0 Jun09 ?        00:05:11   dynflow_executor
foreman  29062     1  0 Jun09 ?        00:00:01   dynflow_executor_monitor

# ps eww 29059 29062
  PID TTY      STAT   TIME COMMAND
29059 ?        Sl     5:11 dynflow_executor MANPATH=/opt/theforeman/tfm/root/usr/share/man:/opt/rh/rh-ruby22/root/usr/share/man:/opt/rh/v8314/root/usr/share/man:/opt/rh/rh-ror41/root/usr/share/man: FOREMAN_LOG_DIR=/var/log/foreman SHELL=/bin/false LIBRARY_PATH=/opt/theforeman/tfm/root/usr/lib64dd:/opt/rh/v8314/root/usr/lib64 RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR=1.1 X_SCLS=rh-ruby22 v8314 rh-ror41 tfm  BUNDLER_EXT_HOME=/usr/share/foreman USER=foreman LD_LIBRARY_PATH=/opt/theforeman/tfm/root/usr/lib64:/opt/rh/rh-ruby22/root/usr/lib64:/opt/rh/v8314/root/usr/lib64:/opt/rh/rh-ror41/root/usr/lib64 FOREMAN_TASK_PARAMS=-p foreman FOREMAN_LOGGING_SQL=warn CPATH=/opt/theforeman/tfm/root/usr/include:/opt/rh/v8314/root/usr/include RAILS_RELATIVE_URL_ROOT=$FOREMAN_PREFIX PATH=/opt/theforeman/tfm/root/usr/bin:/opt/rh/rh-ruby22/root/usr/bin:/opt/rh/v8314/root/usr/bin:/opt/rh/rh-ror41/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin _=/opt/rh/rh-ruby22/root/usr/bin/ruby PWD=/usr/share/foreman LANG=en_US.UTF-8 RUBY_GC_OLDMALLOC_LIMIT=16000100 HOME=/usr/share/foreman SHLVL=4 RAILS_ENV=production RUBY_GC_OLDMALLOC_LIMIT_MAX=16000100 LOGNAME=foreman RUBY_GC_MALLOC_LIMIT_MAX=16000100 PYTHONPATH=/opt/rh/v8314/root/usr/lib/python2.7/site-packages GEM_PATH=/opt/theforeman/tfm/root/usr/share/gems:/opt/rh/rh-ror41/root/usr/share/gems:/usr/share/foreman/.gem/ruby:/opt/rh/rh-ruby22/root/usr/share/gems:/opt/rh/rh-ruby22/root/usr/local/share/gems/opt/rh/rh-ror41/root/usr/share/gems:/usr/share/foreman/.gem/ruby:/opt/rh/rh-ruby22/root/usr/share/gems:/opt/rh/rh-ruby22/root/usr/local/share/gems XDG_DATA_DIRS=/opt/rh/rh-ruby22/root/usr/share PKG_CONFIG_PATH=/opt/theforeman/tfm/root/usr/lib64/pkgconfig:/opt/rh/rh-ruby22/root/usr/lib64/pkgconfig:/opt/rh/v8314/root/usr/lib64/pkgconfig:/opt/rh/rh-ror41/root/usr/lib64/pkgconfig FOREMAN_LOGGING=warn RUBY_GC_MALLOC_LIMIT=4000100 FOREMAN_USER=foreman
29062 ?        Sl     0:01 dynflow_executor_monitor MANPATH=/opt/theforeman/tfm/root/usr/share/man:/opt/rh/rh-ruby22/root/usr/share/man:/opt/rh/v8314/root/usr/share/man:/opt/rh/rh-ror41/root/usr/share/man: FOREMAN_LOG_DIR=/var/log/foreman SHELL=/bin/false LIBRARY_PATH=/opt/theforeman/tfm/root/usr/lib64dd:/opt/rh/v8314/root/usr/lib64 RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR=1.1 X_SCLS=rh-ruby22 v8314 rh-ror41 tfm  BUNDLER_EXT_HOME=/usr/share/foreman USER=foreman LD_LIBRARY_PATH=/opt/theforeman/tfm/root/usr/lib64:/opt/rh/rh-ruby22/root/usr/lib64:/opt/rh/v8314/root/usr/lib64:/opt/rh/rh-ror41/root/usr/lib64 FOREMAN_TASK_PARAMS=-p foreman FOREMAN_LOGGING_SQL=warn CPATH=/opt/theforeman/tfm/root/usr/include:/opt/rh/v8314/root/usr/include RAILS_RELATIVE_URL_ROOT=$FOREMAN_PREFIX PATH=/opt/theforeman/tfm/root/usr/bin:/opt/rh/rh-ruby22/root/usr/bin:/opt/rh/v8314/root/usr/bin:/opt/rh/rh-ror41/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin _=/opt/rh/rh-ruby22/root/usr/bin/ruby PWD=/usr/share/foreman LANG=en_US.UTF-8 RUBY_GC_OLDMALLOC_LIMIT=16000100 HOME=/usr/share/foreman SHLVL=4 RAILS_ENV=production RUBY_GC_OLDMALLOC_LIMIT_MAX=16000100 LOGNAME=foreman RUBY_GC_MALLOC_LIMIT_MAX=16000100 PYTHONPATH=/opt/rh/v8314/root/usr/lib/python2.7/site-packages GEM_PATH=/opt/theforeman/tfm/root/usr/share/gems:/opt/rh/rh-ror41/root/usr/share/gems:/usr/share/foreman/.gem/ruby:/opt/rh/rh-ruby22/root/usr/share/gems:/opt/rh/rh-ruby22/root/usr/local/share/gems/opt/rh/rh-ror41/root/usr/share/gems:/usr/share/foreman/.gem/ruby:/opt/rh/rh-ruby22/root/usr/share/gems:/opt/rh/rh-ruby22/root/usr/local/share/gems XDG_DATA_DIRS=/opt/rh/rh-ruby22/root/usr/share PKG_CONFIG_PATH=/opt/theforeman/tfm/root/usr/lib64/pkgconfig:/opt/rh/rh-ruby22/root/usr/lib64/pkgconfig:/opt/rh/v8314/root/usr/lib64/pkgconfig:/opt/rh/rh-ror41/root/usr/lib64/pkgconfig FOREMAN_LOGGING=warn RUBY_GC_MALLOC_LIMIT=4000100 FOREMAN_USER=foreman

>>> dynflow_executor processes are running with RUBY_GC_* vars applied
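An equivalent spot check (a sketch; substitute the dynflow_executor PID from the ps output above) reads the variables straight from the process environment:

# environment entries in /proc/<pid>/environ are NUL-separated
tr '\0' '\n' < /proc/29059/environ | grep RUBY_GC_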

Comment 36 Pavel Moravec 2017-06-15 18:33:45 UTC
*** Bug 1412308 has been marked as a duplicate of this bug. ***

Comment 38 errata-xmlrpc 2017-06-20 17:22:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1553

