Bug 1659037

Summary: running complex remote command via Ansible on 48k hosts causes load almost 300
Product: Red Hat Satellite    Reporter: Jan Hutař <jhutar>
Component: Ansible - Configuration Management    Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED WONTFIX    QA Contact: Jan Hutař <jhutar>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.5.0    CC: apuch, aruzicka
Target Milestone: Unspecified    Keywords: Performance, Triaged
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-02-03 16:30:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
output of `passenger-status --show=requests` as advised by pmoravec (flags: none)

Description Jan Hutař 2018-12-13 12:19:32 UTC
Created attachment 1514031 [details]
output of `passenger-status --show=requests` as advised by pmoravec

Description of problem:
Running a remote command via Ansible on 48k hosts causes a load of almost 300.
Satellite is a VM which has:
    20 CPUs, 48 GB RAM, 24 GB swap
The KVM host (no other VMs/load on it) has:
    64 CPUs, 256 GB RAM
    Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz


Version-Release number of selected component (if applicable):
satellite-6.5.0-5.beta.el7sat.noarch


How reproducible:
tried once, reproduced once


Steps to Reproduce:
1. Have a Satellite with 48k registered hosts with ReX SSH keys deployed
2. In the evening start this remote command:
        subscription-manager refresh
        yum repolist
        yum -y install katello-host-tools
        katello-package-upload --force
3. Check how it is going 9 hours later


Actual results:
After 4 hours, Satellite started swapping and used up almost all of its swap. At its peak, the load was 296.

I ran `killall ansible-playbook` (about 1200 of these were running), then `systemctl stop dynflowd` and `systemctl stop smart_proxy_dynflow_core`.

I was still unable to log in, because the Passenger queue was full:

# passenger-status
Version : 4.0.18
Date    : 2018-12-13 05:59:22 -0500
Instance: 23192
----------- General information -----------
Max pool size : 30
Processes     : 30
Requests in top-level queue : 0

----------- Application groups -----------
/usr/share/foreman#default:
  App root: /usr/share/foreman
  Requests in queue: 226
  * PID: 11592   Sessions: 1       Processed: 8439    Uptime: 8h 26m 25s
    CPU: 1%      Memory  : 478M    Last used: 19m 28s ago
  * PID: 16023   Sessions: 1       Processed: 8836    Uptime: 8h 13m 23s
    CPU: 1%      Memory  : 466M    Last used: 54m 5s ago
  * PID: 5118    Sessions: 1       Processed: 8133    Uptime: 8h 8m 7s
    CPU: 0%      Memory  : 474M    Last used: 27m 15s ago
  * PID: 24817   Sessions: 1       Processed: 8483    Uptime: 7h 18m 15s
    CPU: 1%      Memory  : 416M    Last used: 36m 36s ago
  * PID: 25050   Sessions: 1       Processed: 7952    Uptime: 7h 18m 14s
    CPU: 1%      Memory  : 473M    Last used: 27m 15s ago
  * PID: 5688    Sessions: 1       Processed: 5362    Uptime: 7h 14m 36s
    CPU: 0%      Memory  : 467M    Last used: 9m 51s ago
  * PID: 21617   Sessions: 1       Processed: 7543    Uptime: 6h 33m 5s
    CPU: 1%      Memory  : 475M    Last used: 26m 14s ago
  * PID: 15468   Sessions: 1       Processed: 6394    Uptime: 5h 55m 1s
    CPU: 1%      Memory  : 409M    Last used: 26m 15s ago
  * PID: 1827    Sessions: 1       Processed: 2948    Uptime: 5h 27m 39s
    CPU: 0%      Memory  : 395M    Last used: 26m 55s ago
  * PID: 17077   Sessions: 1       Processed: 6645    Uptime: 5h 5m 2s
    CPU: 1%      Memory  : 471M    Last used: 25m 14s ago
  * PID: 28042   Sessions: 1       Processed: 6178    Uptime: 4h 49m 40s
    CPU: 1%      Memory  : 479M    Last used: 25m 15s ago
  * PID: 12071   Sessions: 1       Processed: 4847    Uptime: 4h 24m 31s
    CPU: 1%      Memory  : 463M    Last used: 59m 57s ago
  * PID: 19001   Sessions: 1       Processed: 4668    Uptime: 4h 14m 45s
    CPU: 1%      Memory  : 454M    Last used: 50m 44s ago
  * PID: 23899   Sessions: 1       Processed: 6356    Uptime: 4h 9m 26s
    CPU: 1%      Memory  : 469M    Last used: 26m 14s ago
  * PID: 23494   Sessions: 1       Processed: 6015    Uptime: 3h 20m 41s
    CPU: 1%      Memory  : 462M    Last used: 27m 15s ago
  * PID: 30527   Sessions: 1       Processed: 5819    Uptime: 3h 11m 1s
    CPU: 1%      Memory  : 416M    Last used: 14m 23s ago
  * PID: 31541   Sessions: 1       Processed: 4141    Uptime: 1h 50m 7s
    CPU: 2%      Memory  : 454M    Last used: 27m 18s ago
  * PID: 5637    Sessions: 1       Processed: 4589    Uptime: 1h 46m 48s
    CPU: 2%      Memory  : 461M    Last used: 27m 15s ago
  * PID: 30119   Sessions: 1       Processed: 6190    Uptime: 1h 33m 11s
    CPU: 3%      Memory  : 465M    Last used: 27m 15s ago
  * PID: 9316    Sessions: 1       Processed: 2913    Uptime: 1h 26m 26s
    CPU: 1%      Memory  : 433M    Last used: 52m 49s ago
  * PID: 10003   Sessions: 1       Processed: 4297    Uptime: 1h 26m 21s
    CPU: 2%      Memory  : 459M    Last used: 27m 14s ago
  * PID: 10146   Sessions: 1       Processed: 4344    Uptime: 1h 26m 20s
    CPU: 2%      Memory  : 450M    Last used: 28m 21s ago
  * PID: 18244   Sessions: 1       Processed: 5157    Uptime: 1h 25m 0s
    CPU: 3%      Memory  : 458M    Last used: 15m 23s ago
  * PID: 21215   Sessions: 1       Processed: 4710    Uptime: 1h 24m 31s
    CPU: 2%      Memory  : 407M    Last used: 28m 42s ago
  * PID: 10210   Sessions: 1       Processed: 319     Uptime: 1h 19m 11s
    CPU: 0%      Memory  : 400M    Last used: 1m 23s ago
  * PID: 29089   Sessions: 1       Processed: 3611    Uptime: 1h 1m 42s
    CPU: 3%      Memory  : 403M    Last used: 26m 14s ago
  * PID: 11198   Sessions: 1       Processed: 3140    Uptime: 58m 40s
    CPU: 2%      Memory  : 472M    Last used: 28m 21s ago
  * PID: 24448   Sessions: 1       Processed: 2082    Uptime: 49m 43s
    CPU: 2%      Memory  : 430M    Last used: 28m 21s ago
  * PID: 26261   Sessions: 1       Processed: 1781    Uptime: 49m 27s
    CPU: 1%      Memory  : 423M    Last used: 27m 15s ago
  * PID: 4676    Sessions: 1       Processed: 165     Uptime: 29m 10s
    CPU: 0%      Memory  : 350M    Last used: 26m 14s ago

After about an hour, the Passenger queue got unstuck and I could log in again.


Expected results:
I agree these are somewhat extreme circumstances, but still, Satellite should perhaps handle this better.


Additional info:
The only three changes I made compared to the default configuration are:

/etc/httpd/conf.d/passenger.conf
   PassengerMaxPoolSize 30
   PassengerMaxRequestQueueSize 400
   PassengerStatThrottleRate 120

Comment 3 Adam Ruzicka 2018-12-13 12:30:20 UTC
Notes:
We run a single ansible-playbook process per host, so starting 48k ansible-playbook processes generates a huge load. We eventually need to start running one ansible-playbook process per group of hosts (say, 1k), as sketched below.
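
A minimal sketch of that batching idea, assuming a hypothetical hosts.txt (one host per line) and a hypothetical remote_command.yml playbook; this illustrates the general approach, not the actual Satellite implementation:

#!/bin/bash
# Hypothetical batching sketch: one ansible-playbook run per 1000 hosts
# instead of one run per host. hosts.txt and remote_command.yml are assumed names.
set -euo pipefail

split -l 1000 hosts.txt batch_            # split the host list into 1k-host chunks
for batch in batch_*; do
    limit=$(paste -sd, "$batch")          # comma-separated host pattern for --limit
    ansible-playbook remote_command.yml --limit "$limit"
done
rm -f batch_*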

Comment 4 Andrew Puch 2019-05-24 21:32:37 UTC
The upstream Foreman ticket said:
     This should improve dramatically now that we have support for ansible-runner



As a workaround for Satellite 6.5, a better idea may be to have Ansible deploy a cron job (e.g. via /etc/cron.d/) that runs the commands at a random time.

Use Ansible to push out the "sat-killer" tasks to run at a random time.

NOTE: I would make sure a no-op hour for Satellite maintenance was baked in, as the load was high.
 

#!/bin/bash
# sat-killer: the tasks to run from cron at a random time

 subscription-manager refresh
 yum repolist
 yum -y install katello-host-tools
 katello-package-upload --force



For comparison, the old Satellite 5 satellite-sync cron job used a random delay:

0 1 * * * perl -le 'sleep rand 9000' && satellite-sync --email >/dev/null \
2>/dev/null

This particular job will run randomly between 1:00 a.m. and 3:30 a.m. system time each night and redirect stdout and stderr from cron to prevent duplicating the more easily read message from satellite-sync. Options other than --email can also be included. Refer to Table 6.2, “Satellite Import/Sync Options” for the full list of options. Once you exit from the editor, the modified crontab is installed immediately. 
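
A rough Satellite 6 equivalent of that trick, assuming the sat-killer script above is installed on each host as /usr/local/bin/sat-killer.sh (a hypothetical path) and pushed out via Ansible; the random sleep spreads the 48k check-ins over up to 2.5 hours:

# /etc/cron.d/sat-killer -- hypothetical drop-in deployed by Ansible
# Same random-delay idea as the satellite-sync job above: sleep up to 9000 s
# so the hosts do not all hit the Satellite at the same moment.
0 1 * * * root perl -le 'sleep rand 9000' && /usr/local/bin/sat-killer.sh >/dev/null 2>&1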






https://github.com/taw00/howto/blob/master/howto-schedule-cron-jobs-to-run-at-random-intervals.md

https://access.redhat.com/solutions/3013801

Comment 5 Bryan Kearney 2020-01-15 21:01:09 UTC
The Satellite Team is attempting to provide an accurate backlog of bugzilla requests which we feel will be resolved in the next few releases. We do not believe this bugzilla will meet that criteria, and have plans to close it out in 1 month. This is not a reflection on the validity of the request, but a reflection of the many priorities for the product. If you have any concerns about this, feel free to contact Red Hat Technical Support or your account team. If we do not hear from you, we will close this bug out. Thank you.

Comment 6 Bryan Kearney 2020-02-03 16:30:27 UTC
Thank you for your interest in Satellite 6. We have evaluated this request, and while we recognize that it is a valid request, we do not expect this to be implemented in the product in the foreseeable future. This is due to other priorities for the product, and not a reflection on the request itself. We are therefore closing this out as WONTFIX. If you have any concerns about this, please do not reopen. Instead, feel free to contact Red Hat Technical Support. Thank you.