+++ This bug was initially created as a clone of Bug #1628638 +++
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Occasionally
Steps to Reproduce:
1. Set a memory threshold on the dynflow executor (EXECUTOR_MEMORY_LIMIT=2gb in /etc/sysconfig/foreman-tasks on Sat 6.3, /etc/sysconfig/dynflowd on 6.4); a sketch of these settings follows the steps.
2. Run some substantial load on the tasking system; continuous resyncing and publishing of CVs might be a good start.
3. Wait for the memory usage of the dynflow process to cross the limit; watch production.log for the 'Memory level exceeded' message.
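A minimal sketch of steps 1 and 3 (the file paths and log location are the ones mentioned in this bug; the values shown are only examples):
# /etc/sysconfig/dynflowd on Satellite 6.4 (/etc/sysconfig/foreman-tasks on 6.3)
EXECUTOR_MEMORY_LIMIT=2gb
# step 3: watch production.log for the threshold message
tail -f /var/log/foreman/production.log | grep 'Memory level exceeded'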
Actual results:
The process still keeps running, termination never finishes, and the tasks do not proceed anymore.
Expected results:
The process exits in a timely manner and a new one gets started.
Additional info:
I've written the reproducer based on customer observations and haven't reproduced it locally just yet; we do, however, know about places where we wait in the termination phase without any timeout.
--- Additional comment from on 2018-09-14T15:14:51Z
Since this issue was entered in Red Hat Bugzilla, the pm_ack has been
set to + automatically for the next planned release
--- Additional comment from on 2018-09-14T15:31:56Z
Increasing the severity.
--- Additional comment from on 2018-09-14T15:32:57Z
This was found in a Credit Suisse production environment (6.3.2); we will be sending a hotfix once the fix is available upstream. I am adding the 6.3.z? flag to indicate this.
--- Additional comment from on 2018-09-24T12:23:47Z
Created redmine issue https://projects.theforeman.org/issues/25021 from this bug
--- Additional comment from on 2018-10-11T16:22:43Z
Fixed upstream in https://github.com/Dynflow/dynflow/pull/297
--- Additional comment from on 2018-10-12T13:44:45Z
Proposing for 6.4.z as well, as it didn't get to 6.4.0
--- Additional comment from on 2018-10-12T13:49:01Z
Created attachment 1493335
hotfix against Satellite 6.3.4
scratch build from https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18757333, the source branch is here https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/tree/HOTFIX-1628638-6.3.4.
Instructions to apply:
yum install -y ./tfm-rubygem-dynflow-0.8.34-2.HOTFIXRHBZ1628638.fm1_15.el7sat.noarch.rpm
katello-service restart
--- Additional comment from on 2018-10-12T13:50:26Z
Since the actual issue is hard to reproduce, I recommend just verifying that the memory threshold works as expected (the dynflow_executor process is restarted after crossing the limit).
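For example, one rough way to check this (a sketch; the process and log names follow those used elsewhere in this bug):
# note the current executor PID
pgrep -f 'dynflow_executor$'
# wait for the limit to be crossed ...
grep 'Memory level exceeded' /var/log/foreman/production.log | tail -n 1
# ... then confirm that the PID changed, i.e. a new executor process was started
pgrep -f 'dynflow_executor$'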
--- Additional comment from on 2018-10-16T18:14:10Z
Hotfix verified:
The test box used 16 GB RAM and 4 vCPUs.
To hit a 2GB dynflow executor limit, you would have to generate a lot of load. To verify this bug more easily, reduce the executor memory limit to 600mb. Update the parameters in /etc/sysconfig/foreman-tasks as shown below and restart foreman-tasks:
EXECUTOR_MEMORY_LIMIT=600mb
EXECUTOR_MEMORY_MONITOR_DELAY=60
EXECUTOR_MEMORY_MONITOR_INTERVAL=60
Restart foreman-tasks (an example command is sketched below), then generate load:
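For example (one possible command; the hotfix instructions above used katello-service restart, which restarts all Satellite services):
systemctl restart foreman-tasks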
- Sync the rhel7 kickstart, rhel7 server, and satellite tools repos in a loop:
for i in `seq 1 10`;do
hammer -u admin -p changeme product synchronize --name "Red Hat Enterprise Linux Server" --organization-id 1
done
- 7 content views published at the same time (like the following). Each content view had rhel7 kickstart, rhel7 server and satellite tools repos:
for i in {0..10}; do hammer content-view publish --name="testcv1" --organization-id="1"; done
- watch production.log for `Memory`:
watch -n 1 "cat /var/log/foreman/production.log | grep 'Memory' | tail -n 10"
- If you need more load, you can also kick off mass content host registration or a lot of client activity using content-host-d.
After a few minutes, I saw the foreman-tasks process growing in size; it hit the limit and foreman-tasks restarted:
2018-10-16 13:49:24 [foreman-tasks] [E] Memory level exceeded, registered 696573952 bytes, which is greater than 629145600.0 limit.
At this time, I stopped all sync, publish, and other loads from the list above, so the currently running tasks went to a paused state.
After foreman-tasks restarted, the paused tasks (specifically the content view publish and repo sync tasks) continued automatically and ran to completion. I retriggered some of the load from the above-mentioned list and they all ran fine.
- I saw that the content view publish tasks failed with Required locks already taken by another foreman task error
- The above also happened for sync task.
- I cleared the tasks which held the lock and after that publish/sync/content host traffic all started going fine.
--- Additional comment from on 2018-10-16T18:16:38Z
> - I saw that the content view publish tasks failed with Required locks
> already taken by another foreman task error
> - The above also happened for sync task.
> - I cleared the tasks which held the lock and after that
> publish/sync/content host traffic all started going fine.
Please ignore the last 5 lines above from my previous comment. They were caused by my hammer commands running in infinite loops.
--- Additional comment from on 2018-10-16T18:27:01Z
(In reply to sthirugn from comment #9)
> Hotfix verified:
Hotfix verified in Satellite 6.3.4.
--- Additional comment from on 2018-10-31T17:43:14Z
The code fix for this is in Satellite 6.5 SNAP1; therefore, adding sat-6.5.0 flag and moving to ON_DEV.
--- Additional comment from on 2018-11-06T16:05:02Z
Note on the hotfix: the same hotfix delivered against 6.3.4 applies to 6.3.5 as well.
--- Additional comment from on 2018-11-21T10:45:43Z
Mass setting of ON_QA bugs to target milestone 6.5.0.
--- Additional comment from on 2018-11-30T10:13:43Z
Verification on 6.4 and 6.5 blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1654217
--- Additional comment from on 2019-01-03T12:49:30Z
Accidentally changed the status, putting back to ON_QA
--- Additional comment from on 2019-02-20T11:28:22Z
So I was able to get some of these:
2019-02-17T07:47:31 [E|dyn|] Memory level exceeded, registered 566837248 bytes, which is greater than 524288000.0 limit.
2019-02-17T07:52:20 [E|dyn|] Memory level exceeded, registered 542158848 bytes, which is greater than 524288000.0 limit.
2019-02-17T07:58:11 [E|dyn|] Memory level exceeded, registered 633757696 bytes, which is greater than 524288000.0 limit.
2019-02-17T08:02:04 [E|dyn|] Memory level exceeded, registered 598867968 bytes, which is greater than 524288000.0 limit.
With these settings:
# tail -n 3 /etc/sysconfig/dynflowd
EXECUTOR_MEMORY_LIMIT=500mb
EXECUTOR_MEMORY_MONITOR_DELAY=60
EXECUTOR_MEMORY_MONITOR_INTERVAL=60
I have been registering hosts in a loop, syncing and promoting (the x axis is a timestamp, the y axis is the RSS memory used by the dynflow executor):
while true; do echo "$( date +%s ) $( ps -eo rss,pid,user,command --sort -size | grep ' dynflow_executor$' )"; sleep 10; done >> /tmp/data
gnuplot -e "set key autotitle columnhead; set datafile separator ' '; set terminal dumb; plot '/tmp/data' using 1:2 notitle;"
[gnuplot dumb-terminal plot: dynflow_executor RSS (KB, y axis, 0-700000) against the timestamp (epoch seconds, x axis); RSS repeatedly climbs past the ~500 MB limit to roughly 650 MB and drops back as the executor is recycled]
I used the verification procedure from bug 1661291.
I did not notice more than one dynflow_executor process. Also, after resuming/cancelling some tasks, I was able to sync a RH repo and publish a CV again.
=> VERIFIED
--- Additional comment from on 2019-02-27T05:24:55Z
Moving this back to ASSIGNED as it is fairly easy to induce task failures during recycling events under high traffic (it takes ~10 min or less).
I used the 6.3.5 hotfix mentioned above to test high-traffic Satellite Host::Update calls. I set up 20 Docker containers running a loop. I need to get this to a customer who is seeing 20-30G of dynflow_executor memory usage on 6.3.5.
The containers run this loop:
for i in `seq 1 1000`;
do
subscription-manager repos --disable rhel-7-server-satellite-tools-6.3-rpms
subscription-manager repos --enable rhel-7-server-satellite-tools-6.3-rpms
katello-enabled-repos-upload -f
katello-package-upload -f
done
Started with 2 dynflow_executors and the following settings:
#Set the number of executors you want to run
EXECUTORS_COUNT=2
EXECUTOR_MEMORY_LIMIT=512mb
EXECUTOR_MEMORY_MONITOR_DELAY=60 #default: 2 hours
EXECUTOR_MEMORY_MONITOR_INTERVAL=60
After hitting the first Memory level threshold, we ended up with 3 dynflow_executor processes:
Every 2.0s: ps aux |grep dynflow_executor |grep -v monitor | grep -v grep Wed Feb 27 00:12:22 2019
foreman 2728 1.7 3.7 2852784 1235748 ? Sl Feb26 3:33 dynflow_executor
foreman 11896 1.1 0.0 3014820 48 ? Sl Feb26 3:13 dynflow_executor
foreman 11915 10.7 4.6 3708828 1508272 ? Sl Feb26 30:43 dynflow_executor
At this point, I could see some tasks that were Paused during recycling:
https://sat-r220-07.lab.eng.rdu2.redhat.com/foreman_tasks/dynflow/343faaef-bb0c-42e8-874b-374137f99b6e
After running for a few hours, the number of dynflow_executor processes gradually went up:
Every 2.0s: ps aux |grep dynflow_executor |grep -v monitor | grep -v grep Wed Feb 27 00:12:22 2019
foreman 2728 1.7 3.7 2852784 1235748 ? Sl Feb26 3:33 dynflow_executor
foreman 11896 1.1 0.0 3014820 48 ? Sl Feb26 3:13 dynflow_executor
foreman 11915 10.7 4.6 3708828 1508272 ? Sl Feb26 30:43 dynflow_executor
foreman 13314 1.9 2.2 5632552 749144 ? Sl Feb26 3:31 dynflow_executor
foreman 15433 2.9 3.7 2900676 1227756 ? Sl Feb26 6:28 dynflow_executor
foreman 19872 3.8 4.9 3315084 1624928 ? Sl Feb26 8:13 dynflow_executor
foreman 25447 1.5 2.1 5354732 684932 ? Sl Feb26 2:42 dynflow_executor
There were 57360 tasks in the database after a few hours of execution, but at this point there was an issue with Pulp and we had many failed tasks. This is secondary to the issues seen at the start of the test, where we saw failed tasks and a growing number of executors.
Recycling occurred during the test:
2019-02-26 20:30:28 [foreman-tasks] [E] Memory level exceeded, registered 543936512 bytes, which is greater than 536870912.0 limit.
2019-02-26 20:36:46 [foreman-tasks] [E] Memory level exceeded, registered 537161728 bytes, which is greater than 536870912.0 limit.
2019-02-26 20:49:54 [foreman-tasks] [E] Memory level exceeded, registered 542969856 bytes, which is greater than 536870912.0 limit.
2019-02-26 20:56:26 [foreman-tasks] [E] Memory level exceeded, registered 553697280 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:00:41 [foreman-tasks] [E] Memory level exceeded, registered 560275456 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:05:01 [foreman-tasks] [E] Memory level exceeded, registered 585371648 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:06:06 [foreman-tasks] [E] Memory level exceeded, registered 555577344 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:10:14 [foreman-tasks] [E] Memory level exceeded, registered 565657600 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:13:23 [foreman-tasks] [E] Memory level exceeded, registered 556138496 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:15:37 [foreman-tasks] [E] Memory level exceeded, registered 556552192 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:19:09 [foreman-tasks] [E] Memory level exceeded, registered 567934976 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:19:35 [foreman-tasks] [E] Memory level exceeded, registered 549634048 bytes, which is greater than 536870912.0 limit.
The test setup is:
satellite: https://sat-r220-07.lab.eng.rdu2.redhat.com/
(admin/changeme)
docker host: sat-r220-09.lab.eng.rdu2.redhat.com
/root/content-host-d/startup.sh
you can run the containers via: /root/content-host-d/launch-07.sh
--- Additional comment from on 2019-02-27T05:39:26Z
Follow-on:
If you would like me to demonstrate this *from scratch*, with 0 tasks in the DB and a freshly started Satellite, let me know and I can Bluejeans it tomorrow if anyone is interested.
--- Additional comment from on 2019-02-27T07:56:27Z
Hello Mike. This bug is for Satellite 6.5 and I see your Satellite is 6.3.5. I can surely re-test with your loop, but would first like to ensure this version mismatch is intentional.
--- Additional comment from on 2019-02-27T09:45:51Z
I agree there should be a clone of the bug where we would talk about the 6.3 variant.
I was looking at the reproducer, and even though I was not able to get into the state where multiple executors were running, I have seen strange behavior with the executors getting stuck.
I think the problem is setting EXECUTOR_MEMORY_MONITOR_DELAY too short: it seems to lead to the executors not starting properly. Setting it to something bigger (such as EXECUTOR_MEMORY_MONITOR_DELAY=480) seems to help with the issue.
For a production setup, I would definitely recommend a much higher number (such as the default of 2 hours).
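For illustration, a production-leaning sketch of these settings (the 7200 is just the 2-hour default expressed in seconds; the other values are examples taken from this bug, not recommendations):
# /etc/sysconfig/dynflowd - sketch, adjust per environment
EXECUTOR_MEMORY_LIMIT=2gb
EXECUTOR_MEMORY_MONITOR_DELAY=7200   # default: 2 hours; avoid very short delays
EXECUTOR_MEMORY_MONITOR_INTERVAL=60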
--- Additional comment from on 2019-02-27T14:44:35Z
I'll do the following:
* re-run my test with a longer MONITOR_DELAY
* Repeat the test on my 6.5 server as well so we can track this here accurately
* Will clone this to 6.3.z
--- Additional comment from on 2019-02-28T04:27:58Z
Hello Mike,
Thank you for your work on this bugzilla and for sharing your test on 6.3.5. I just want to ask whether there is any communication we can pass on to the customer we dealt with in case 02153654 regarding the fix/hotfix in Satellite 6.3.z.
Or shall I continue to share interim updates until the fix is available and the bugzilla moves to a closed state?
--- Additional comment from on 2019-02-28T05:16:06Z
Please, no communication to customers about my test results, as there is likely a flaw in the configuration I used for the test. I will provide updates to this BZ shortly after re-running the test with feedback from Ivan.
--- Additional comment from on 2019-02-28T05:24:29Z
Hello Mike,
Sure, I was not asking whether I can share your test results, but whether, in general, we can share anything on the current status.
I will give him an interim update saying that the engineering team is still working on it, without mentioning any of the internal communication. Thank you for your response.
--- Additional comment from on 2019-02-28T22:04:54Z
Moving this back to VERIFIED. After testing with a proper MONITOR_DELAY setting, I experienced none of the effects seen in Comment #18. The dynflow_executors recycled properly; there were a few tasks with 'abnormal termination', but they could be resumed without error.
Content View publishes that were running during a recycle resumed and completed without issues.
Memory usage remained stable. Thumbs up! This is a great improvement to Satellite 6's memory management.
I tested this on 6.3.5+ hotfixes as well as 6.5 Snap 17
--- Additional comment from on 2019-02-28T22:06:03Z
To all cases attached to this bug that have hotfixes for 6.3.5 and 6.4.2: Engineering continues to endorse their release and use in customer environments experiencing memory pressure from dynflow_executor.