Bug 1684687

Summary: The termination procedure after memory threshold exceeded can get stuck, waiting infinitely for some events to occur
Product: Red Hat Satellite
Reporter: Mike McCune <mmccune>
Component: Tasks Plugin
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED CURRENTRELEASE
QA Contact: Jan Hutař <jhutar>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: Unspecified
CC: andrew.schofield, aruzicka, gapatil, inecas, jhutar, kabbott, ktordeur, mmccune, pcreech, pmoravec, sthirugn
Target Milestone: Released
Keywords: Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1628638
Environment:
Last Closed: 2019-03-06 20:22:08 UTC
Type: ---

Description Mike McCune 2019-03-01 21:40:19 UTC
+++ This bug was initially created as a clone of Bug #1628638 +++

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
Occasionally

Steps to Reproduce:
1. Set a memory threshold on the dynflow executor (EXECUTOR_MEMORY_LIMIT=2gb in /etc/sysconfig/foreman-tasks in Sat 6.3, /etc/sysconfig/dynflowd in 6.4); see the settings sketch after these steps

2. Run some substantial load on the tasking system; continuous resyncing and publishing of CVs might be a good start
3. Wait for the memory usage of the dynflow process to cross the limit - watch production.log for the 'Memory level exceeded' message
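
A minimal sketch of the settings used in this reproducer (assuming Satellite 6.4, where the file is /etc/sysconfig/dynflowd; on 6.3 the same variables go into /etc/sysconfig/foreman-tasks). The short monitor delay/interval are only for testing; see the later comments about keeping EXECUTOR_MEMORY_MONITOR_DELAY high in production:

    # /etc/sysconfig/dynflowd -- memory watchdog settings for the reproducer
    EXECUTOR_MEMORY_LIMIT=2gb             # recycle the executor once its RSS crosses 2 GB
    EXECUTOR_MEMORY_MONITOR_DELAY=60      # seconds before the first check (default is 2 hours)
    EXECUTOR_MEMORY_MONITOR_INTERVAL=60   # seconds between checks

    # restart the tasking services to pick up the change
    katello-service restart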

Actual results:
The process keeps running, the termination never finishes, and the tasks do not proceed anymore

Expected results:
The process exits in a timely manner and a new one gets started

Additional info:
I've written the reproducer based on customer observations and haven't reproduced it locally just yet; we do, however, know about places in the termination phase where we wait without any timeout.

--- Additional comment from  on 2018-09-14T15:14:51Z 

Since this issue was entered in Red Hat Bugzilla, the pm_ack has been
set to + automatically for the next planned release

--- Additional comment from  on 2018-09-14T15:31:56Z 

Increasing the severity.

--- Additional comment from  on 2018-09-14T15:32:57Z 

This was found in a Credit Suisse production environment (6.3.2); we will be sending a hotfix once the fix is available upstream. I am adding the 6.3.z? flag to indicate this.

--- Additional comment from  on 2018-09-24T12:23:47Z 

Created redmine issue https://projects.theforeman.org/issues/25021 from this bug

--- Additional comment from  on 2018-10-11T16:22:43Z 

Fixed upstream in https://github.com/Dynflow/dynflow/pull/297

--- Additional comment from  on 2018-10-12T13:44:45Z 

Proposing for 6.4.z as well, as it didn't get to 6.4.0

--- Additional comment from  on 2018-10-12T13:49:01Z 

Created attachment 1493335 [details]
hotfix against Satellite 6.3.4

Scratch build from https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18757333; the source branch is at https://gitlab.sat.engineering.redhat.com/satellite6/dynflow/tree/HOTFIX-1628638-6.3.4.

Instructions to apply:

yum install -y ./tfm-rubygem-dynflow-0.8.34-2.HOTFIXRHBZ1628638.fm1_15.el7sat.noarch.rpm
katello-service restart
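
A quick sanity check after applying the hotfix (a sketch; the package NVR is taken from the attachment above):

    # confirm the hotfix build is the one installed
    rpm -q tfm-rubygem-dynflow
    # expected: tfm-rubygem-dynflow-0.8.34-2.HOTFIXRHBZ1628638.fm1_15.el7sat

    # confirm the executor is running again after the restart
    ps aux | grep dynflow_executor | grep -v monitor | grep -v grep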

--- Additional comment from  on 2018-10-12T13:50:26Z 

Since the actual issue is hard to reproduce, I recommend verifying just the memory threshold works as expected (restarting the dynflow_executor process after crossing the limit).
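
A sketch of that check, reusing the grep patterns from the comments below; the executor PID should change after the 'Memory level exceeded' message appears:

    # note the current executor PID
    ps aux | grep dynflow_executor | grep -v monitor | grep -v grep

    # wait for the threshold message in the log ...
    grep 'Memory level exceeded' /var/log/foreman/production.log | tail -n 5

    # ... then confirm the executor PID has changed (i.e. the process was recycled)
    ps aux | grep dynflow_executor | grep -v monitor | grep -v grep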

--- Additional comment from  on 2018-10-16T18:14:10Z 

Hotfix verified:
Test box used: 16 GB RAM, 4 vCPU

To hit a 2GB dynflow executor limit, you will have to generate a lot of load. To verify this bug more easily, reduce the executor memory limit to 600mb. Update the parameters in /etc/sysconfig/foreman-tasks as shown below and restart foreman-tasks:
 EXECUTOR_MEMORY_LIMIT=600mb
 EXECUTOR_MEMORY_MONITOR_DELAY=60
 EXECUTOR_MEMORY_MONITOR_INTERVAL=60

Restart foreman-tasks, then generate load (a consolidated sketch follows this list):
- Sync the rhel7 kickstart, rhel7 server, and satellite tools repos in a loop:
   for i in `seq 1 10`;do
     hammer -u admin -p changeme product synchronize --name "Red Hat Enterprise Linux Server" --organization-id 1
   done
- Publish 7 content views at the same time (like the following). Each content view had the rhel7 kickstart, rhel7 server, and satellite tools repos:
   for i in {0..10}; do hammer content-view publish --name="testcv1" --organization-id="1"; done
- watch production.log for `Memory`:
   watch -n 1 "cat /var/log/foreman/production.log | grep 'Memory' | tail -n 10"
- If you need more load, you can also kick off mass content host registration or a lot of client activity using content-host-d.
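
The consolidated sketch mentioned above, combining the same hammer commands into background loops while watching the log (credentials and names are the ones used on this test box):

    # sync and publish load in parallel
    for i in `seq 1 10`; do
      hammer -u admin -p changeme product synchronize --name "Red Hat Enterprise Linux Server" --organization-id 1
    done &
    for i in {0..10}; do
      hammer -u admin -p changeme content-view publish --name "testcv1" --organization-id 1
    done &

    # watch for the memory threshold being crossed
    watch -n 1 "grep 'Memory' /var/log/foreman/production.log | tail -n 10"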

After a few minutes, I saw the foreman-tasks (dynflow executor) memory growing; it hit the limit and foreman-tasks restarted:
2018-10-16 13:49:24  [foreman-tasks] [E] Memory level exceeded, registered 696573952 bytes, which is greater than 629145600.0 limit.

At this time, I stopped all sync, publish, and other loads from the above list, so the currently running tasks went to a paused state.

After foreman-tasks restarted, the paused tasks (specifically the content view publish and repo sync tasks) continued automatically and ran to completion.  I retriggered some of the load from the above-mentioned list and they all ran fine.
- I saw that the content view publish tasks failed with Required locks already taken by another foreman task error
- The above also happened for sync task.
- I cleared the tasks which held the lock and after that publish/sync/content host traffic all started going fine.

--- Additional comment from  on 2018-10-16T18:16:38Z 

> - I saw that the content view publish tasks failed with Required locks
> already taken by another foreman task error
> - The above also happened for sync task.
> - I cleared the tasks which held the lock and after that
> publish/sync/content host traffic all started going fine.

Please ignore the last 5 lines quoted above from my previous comment. They were caused by my hammer commands running in loops.

--- Additional comment from  on 2018-10-16T18:27:01Z 

(In reply to sthirugn from comment #9)
> Hotfix verified:

Hotfix verified in Satellite 6.3.4.

--- Additional comment from  on 2018-10-31T17:43:14Z 

The code fix for this is in Satellite 6.5 SNAP1; therefore, adding sat-6.5.0 flag and moving to ON_DEV.

--- Additional comment from  on 2018-11-06T16:05:02Z 

Note on hotfix: the same hotfix delivered against 6.3.4 applies against 6.3.5 as well.

--- Additional comment from  on 2018-11-21T10:45:43Z 

Mass setting of ON_QA bugs to target milestone 6.5.0.

--- Additional comment from  on 2018-11-30T10:13:43Z 

Verification on 6.4 and 6.5 blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1654217

--- Additional comment from  on 2019-01-03T12:49:30Z 

Accidentally changed the status, putting back to ON_QA

--- Additional comment from  on 2019-02-20T11:28:22Z 

So I was able to get some of these:

2019-02-17T07:47:31 [E|dyn|] Memory level exceeded, registered 566837248 bytes, which is greater than 524288000.0 limit.
2019-02-17T07:52:20 [E|dyn|] Memory level exceeded, registered 542158848 bytes, which is greater than 524288000.0 limit.
2019-02-17T07:58:11 [E|dyn|] Memory level exceeded, registered 633757696 bytes, which is greater than 524288000.0 limit.
2019-02-17T08:02:04 [E|dyn|] Memory level exceeded, registered 598867968 bytes, which is greater than 524288000.0 limit.

With these settings:

# tail -n 3 /etc/sysconfig/dynflowd 
EXECUTOR_MEMORY_LIMIT=500mb
EXECUTOR_MEMORY_MONITOR_DELAY=60
EXECUTOR_MEMORY_MONITOR_INTERVAL=60

I have been registering in a loop, syncing, and promoting (the "x" axis is a timestamp, "y" is the RSS memory used by the dynflow executor):

while true; do echo "$( date +%s ) $( ps -eo rss,pid,user,command --sort -size | grep ' dynflow_executor$' )"; sleep 10; done >> /tmp/data  # collected into /tmp/data for the gnuplot command below

gnuplot -e "set key autotitle columnhead; set datafile separator ' '; set terminal dumb; plot '/tmp/data' using 1:2 notitle;"

  700000 +-+------+---------+--------+---------+--------+---------+------+-+   
         +        +         +        +         +        +         +        +   
         |                                         AAA                     |   
  600000 +-+                                                  AAAA       +-+   
         |          AA                                                     |   
  500000 +-+    AAAAA           AAAAAA             A         A           +-+   
         |   AAAA A         AAAA                 AA          A             |   
         |               AAA                   AAA          A              |   
  400000 +-+            A                     A           AA         AA  +-+   
         |                                   A                       A     |   
         |             A                     A                      A      |   
  300000 +-+                              AAA           AAA       AAA    +-+   
         |             A                 AA                                |   
         |                              AA              A                  |   
  200000 +-+          A                 A                         A      +-+   
         |                             A               A          A        |   
  100000 +-+                           A                                 +-+   
         |                                                                 |   
         +        +         +        +A        +        +         +        +   
       0 +-+------+---------+--------+---------+--------+---------+------+-+   
         (x axis: Unix timestamps, all around 1.55039e+09)

I have used verification procedure from bug 1661291.

I have not noticed more than one dynflow_executor process. Also, after resuming/cancelling some tasks, I was able to sync a RH repo and publish a CV again.
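
A simple way to confirm there is still only a single executor (a sketch, reusing the ps pattern above; the count should stay at 1 across recycles):

    ps aux | grep ' dynflow_executor$' | grep -v grep | wc -l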

=> VERIFIED

--- Additional comment from  on 2019-02-27T05:24:55Z 

Moving this back to ASSIGNED as it is fairly easy to induce task failure during recycling events under high traffic (takes ~10min or less).

I used the 6.3.5 hotfix mentioned above to test high-traffic Satellite Host::Update calls. I set up 20 Docker containers running a loop. I need to get this to a customer who is seeing 20-30 GB of dynflow_executor memory usage on 6.3.5.

The containers run this loop:

    for i in `seq 1 1000`;
    do
        subscription-manager repos --disable rhel-7-server-satellite-tools-6.3-rpms
        subscription-manager repos --enable rhel-7-server-satellite-tools-6.3-rpms 
        katello-enabled-repos-upload -f
        katello-package-upload -f
    done
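
The container fleet itself came from the content-host-d scripts referenced at the end of this comment; a rough equivalent for starting 20 clients by hand would be something like the following (the image name and any registration options are hypothetical, not taken from this bug):

    # launch 20 client containers running the loop above (image name is hypothetical)
    for i in `seq 1 20`; do
      docker run -d --name client-$i content-host-d
    done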

Started with 2 dynflow_executors and the following settings:

#Set the number of executors you want to run
EXECUTORS_COUNT=2
EXECUTOR_MEMORY_LIMIT=512mb
EXECUTOR_MEMORY_MONITOR_DELAY=60 #default: 2 hours
EXECUTOR_MEMORY_MONITOR_INTERVAL=60

After hitting the first Memory level threshold, we ended up with 3 dynflow_executor processes:

Every 2.0s: ps aux |grep dynflow_executor |grep -v monitor | grep -v grep                                        Wed Feb 27 00:12:22 2019

foreman   2728  1.7  3.7 2852784 1235748 ?     Sl   Feb26   3:33 dynflow_executor
foreman  11896  1.1  0.0 3014820   48 ?        Sl   Feb26   3:13 dynflow_executor
foreman  11915 10.7  4.6 3708828 1508272 ?     Sl   Feb26  30:43 dynflow_executor

At this point, I could see some tasks that were Paused during recycling:

https://sat-r220-07.lab.eng.rdu2.redhat.com/foreman_tasks/dynflow/343faaef-bb0c-42e8-874b-374137f99b6e

After running for a few hours, the number of dynflow_executors gradually goes up:

Every 2.0s: ps aux |grep dynflow_executor |grep -v monitor | grep -v grep                                        Wed Feb 27 00:12:22 2019

foreman   2728  1.7  3.7 2852784 1235748 ?     Sl   Feb26   3:33 dynflow_executor
foreman  11896  1.1  0.0 3014820   48 ?        Sl   Feb26   3:13 dynflow_executor
foreman  11915 10.7  4.6 3708828 1508272 ?     Sl   Feb26  30:43 dynflow_executor
foreman  13314  1.9  2.2 5632552 749144 ?      Sl   Feb26   3:31 dynflow_executor
foreman  15433  2.9  3.7 2900676 1227756 ?     Sl   Feb26   6:28 dynflow_executor
foreman  19872  3.8  4.9 3315084 1624928 ?     Sl   Feb26   8:13 dynflow_executor
foreman  25447  1.5  2.1 5354732 684932 ?      Sl   Feb26   2:42 dynflow_executor

There were 57360 tasks in the database after a few hours of execution, but at this point there was an issue with Pulp and we had many failed tasks. This part is secondary to the issues seen at the start of the test, where we see failed tasks and a growing number of executors.
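
The task count can be checked directly in the foreman database, e.g. (a sketch; the foreman_tasks_tasks table and the 'error' result value are assumptions based on the foreman-tasks plugin schema):

    # total tasks and failed tasks recorded by foreman-tasks
    su - postgres -c "psql foreman -c \"SELECT count(*) FROM foreman_tasks_tasks;\""
    su - postgres -c "psql foreman -c \"SELECT count(*) FROM foreman_tasks_tasks WHERE result = 'error';\""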

Recycling occurred during the test:

2019-02-26 20:30:28  [foreman-tasks] [E] Memory level exceeded, registered 543936512 bytes, which is greater than 536870912.0 limit.
2019-02-26 20:36:46  [foreman-tasks] [E] Memory level exceeded, registered 537161728 bytes, which is greater than 536870912.0 limit.
2019-02-26 20:49:54  [foreman-tasks] [E] Memory level exceeded, registered 542969856 bytes, which is greater than 536870912.0 limit.
2019-02-26 20:56:26  [foreman-tasks] [E] Memory level exceeded, registered 553697280 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:00:41  [foreman-tasks] [E] Memory level exceeded, registered 560275456 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:05:01  [foreman-tasks] [E] Memory level exceeded, registered 585371648 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:06:06  [foreman-tasks] [E] Memory level exceeded, registered 555577344 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:10:14  [foreman-tasks] [E] Memory level exceeded, registered 565657600 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:13:23  [foreman-tasks] [E] Memory level exceeded, registered 556138496 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:15:37  [foreman-tasks] [E] Memory level exceeded, registered 556552192 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:19:09  [foreman-tasks] [E] Memory level exceeded, registered 567934976 bytes, which is greater than 536870912.0 limit.
2019-02-26 21:19:35  [foreman-tasks] [E] Memory level exceeded, registered 549634048 bytes, which is greater than 536870912.0 limit.

The test setup is:

satellite: https://sat-r220-07.lab.eng.rdu2.redhat.com/
(admin/changeme)
docker host: sat-r220-09.lab.eng.rdu2.redhat.com
/root/content-host-d/startup.sh 

you can run the containers via: /root/content-host-d/launch-07.sh

--- Additional comment from  on 2019-02-27T05:39:26Z 

Follow-on:

If you all would like me to demonstrate this *from scratch*, with 0 tasks in the DB and a freshly started Satellite, let me know and I can set up a Bluejeans call tomorrow if anyone is interested.

--- Additional comment from  on 2019-02-27T07:56:27Z 

Hello Mike. This bug is for Satellite 6.5 and I see your Satellite is 6.3.5. I can certainly re-test with your loop, but would first like to make sure this version mismatch is intentional.

--- Additional comment from  on 2019-02-27T09:45:51Z 

I agree there should be a clone of the bug where we would talk about the 6.3 variant.

I was looking at the reproducer, and even though I was not able to get the state where multiple executors were running, I've seen strange behavior with the executors getting stuck.

I think the problem is setting EXECUTOR_MEMORY_MONITOR_DELAY too short: it seems to prevent the executors from starting properly. Setting it to something larger (such as EXECUTOR_MEMORY_MONITOR_DELAY=480) seems to help with the issue.

For a production setup, I would definitely recommend a much higher number (such as the default of 2 hours).
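
In sysconfig terms, that would look roughly like this (a sketch; 7200 seconds corresponds to the 2-hour default mentioned above):

    # /etc/sysconfig/dynflowd -- production-leaning values
    EXECUTOR_MEMORY_LIMIT=2gb
    EXECUTOR_MEMORY_MONITOR_DELAY=7200    # ~2 hours before the first memory check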

--- Additional comment from  on 2019-02-27T14:44:35Z 

I'll do the following:

 * Re-run my test with a longer MONITOR_DELAY
 * Repeat the test on my 6.5 server as well so we can track this here accurately
 * Clone this to 6.3.z

--- Additional comment from  on 2019-02-28T04:27:58Z 

Hello Mike,

Thank you for your work on this bugzilla and for sharing your test on 6.3.5. I just wanted to ask whether there is anything we can pass on to the customer we dealt with in case 02153654 regarding the fix/hotfix in Satellite 6.3.z, or whether I should continue to share interim updates until the fix is available and the bugzilla moves to a closed state.

--- Additional comment from  on 2019-02-28T05:16:06Z 

Please do not communicate my test results to customers, as there is likely a flaw in the configuration I used for the test. I will provide updates to this BZ shortly after re-running the test with feedback from Ivan.

--- Additional comment from  on 2019-02-28T05:24:29Z 

Hello Mike,

Sure, I was not asking whether I can share your test results, but whether, in general, we can share anything on the current status.
I will give the customer an interim update saying that the engineering team is still working on it, without mentioning any of the internal communication. Thank you for your response.

--- Additional comment from  on 2019-02-28T22:04:54Z 

Moving this back to VERIFIED. After testing with the proper MONITOR_DELAY setting, I experienced none of the effects seen in Comment #18. The dynflow_executors recycled properly; there were a few tasks with 'abnormal termination', but they could be resumed without error.

Content View publishes that were running during a recycle resumed and completed without issues.

Memory usage remained stable. Thumbs up! This is a great improvement to Satellite 6's memory management.

I tested this on 6.3.5+ hotfixes as well as 6.5 Snap 17

--- Additional comment from  on 2019-02-28T22:06:03Z 

For all cases attached to this bug that have hotfixes for 6.3.5 and 6.4.2: Engineering continues to endorse their release and use in customer environments experiencing memory pressure from dynflow_executor.

Comment 1 Mike McCune 2019-03-06 20:22:08 UTC
This bug was closed in 6.4.2:

https://bugzilla.redhat.com/show_bug.cgi?id=1661291#c9

improper cloning to 6.4.3