Bug 1659037 - running complex remote command via Ansible on 48k hosts causes load almost 300
Summary: running complex remote command via Ansible on 48k hosts causes load almost 300
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Ansible - Configuration Management
Version: 6.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Jan Hutař
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-13 12:19 UTC by Jan Hutař
Modified: 2020-02-03 16:30 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-03 16:30:27 UTC
Target Upstream Version:
Embargoed:


Attachments
output of `passenger-status --show=requests` as advised by pmoravec (334.37 KB, text/plain)
2018-12-13 12:19 UTC, Jan Hutař

Description Jan Hutař 2018-12-13 12:19:32 UTC
Created attachment 1514031 [details]
output of `passenger-status --show=requests` as advised by pmoravec

Description of problem:
Running a complex remote command via Ansible on 48k hosts causes a load of almost 300.
The Satellite is a VM which has:
    20 CPUs, 48 GB RAM, 24 GB swap
The KVM host (no other VM/load on it) has:
    64 CPUs, 256 GB RAM
    Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz


Version-Release number of selected component (if applicable):
satellite-6.5.0-5.beta.el7sat.noarch


How reproducible:
tried once, reproduced once


Steps to Reproduce:
1. Have a Satellite with 48k registered hosts and ReX SSH keys deployed
2. In the evening, start this remote command:
        subscription-manager refresh
        yum repolist
        yum -y install katello-host-tools
        katello-package-upload --force
3. Check how it is going 9 hours later
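
For reference, one way to kick off such a job from the CLI is via hammer remote execution. This is only a sketch; the job template name and search query below are placeholders to be adjusted for the environment:

    # invoke the Ansible remote execution job against all matching hosts
    hammer job-invocation create \
      --job-template "Run Command - Ansible Default" \
      --inputs command='subscription-manager refresh; yum repolist; yum -y install katello-host-tools; katello-package-upload --force' \
      --search-query "name ~ example"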


Actual results:
After 4 hours, the Satellite started swapping and used almost all of its swap. At maximum, it had a load of 296.

I ran `killall ansible-playbook` (about 1200 of these were running), then `systemctl stop dynflowd` and `systemctl stop smart_proxy_dynflow_core`.

I was still unable to log in, because the Passenger queue was full:

# passenger-status
Version : 4.0.18
Date    : 2018-12-13 05:59:22 -0500
Instance: 23192
----------- General information -----------
Max pool size : 30
Processes     : 30
Requests in top-level queue : 0

----------- Application groups -----------
/usr/share/foreman#default:
  App root: /usr/share/foreman
  Requests in queue: 226
  * PID: 11592   Sessions: 1       Processed: 8439    Uptime: 8h 26m 25s
    CPU: 1%      Memory  : 478M    Last used: 19m 28s 
  * PID: 16023   Sessions: 1       Processed: 8836    Uptime: 8h 13m 23s
    CPU: 1%      Memory  : 466M    Last used: 54m 5s a
  * PID: 5118    Sessions: 1       Processed: 8133    Uptime: 8h 8m 7s
    CPU: 0%      Memory  : 474M    Last used: 27m 15s ag
  * PID: 24817   Sessions: 1       Processed: 8483    Uptime: 7h 18m 15s
    CPU: 1%      Memory  : 416M    Last used: 36m 36s 
  * PID: 25050   Sessions: 1       Processed: 7952    Uptime: 7h 18m 14s
    CPU: 1%      Memory  : 473M    Last used: 27m 15s 
  * PID: 5688    Sessions: 1       Processed: 5362    Uptime: 7h 14m 36s
    CPU: 0%      Memory  : 467M    Last used: 9m 51s a
  * PID: 21617   Sessions: 1       Processed: 7543    Uptime: 6h 33m 5s
    CPU: 1%      Memory  : 475M    Last used: 26m 14s a
  * PID: 15468   Sessions: 1       Processed: 6394    Uptime: 5h 55m 1s
    CPU: 1%      Memory  : 409M    Last used: 26m 15s a
  * PID: 1827    Sessions: 1       Processed: 2948    Uptime: 5h 27m 39s
    CPU: 0%      Memory  : 395M    Last used: 26m 55s 
  * PID: 17077   Sessions: 1       Processed: 6645    Uptime: 5h 5m 2s
    CPU: 1%      Memory  : 471M    Last used: 25m 14s ag
  * PID: 28042   Sessions: 1       Processed: 6178    Uptime: 4h 49m 40s
    CPU: 1%      Memory  : 479M    Last used: 25m 15s 
  * PID: 12071   Sessions: 1       Processed: 4847    Uptime: 4h 24m 31s
    CPU: 1%      Memory  : 463M    Last used: 59m 57s 
  * PID: 19001   Sessions: 1       Processed: 4668    Uptime: 4h 14m 45s
    CPU: 1%      Memory  : 454M    Last used: 50m 44s 
  * PID: 23899   Sessions: 1       Processed: 6356    Uptime: 4h 9m 26s
    CPU: 1%      Memory  : 469M    Last used: 26m 14s a
  * PID: 23494   Sessions: 1       Processed: 6015    Uptime: 3h 20m 41s
    CPU: 1%      Memory  : 462M    Last used: 27m 15s 
  * PID: 30527   Sessions: 1       Processed: 5819    Uptime: 3h 11m 1s
    CPU: 1%      Memory  : 416M    Last used: 14m 23s a
  * PID: 31541   Sessions: 1       Processed: 4141    Uptime: 1h 50m 7s
    CPU: 2%      Memory  : 454M    Last used: 27m 18s a
  * PID: 5637    Sessions: 1       Processed: 4589    Uptime: 1h 46m 48s
    CPU: 2%      Memory  : 461M    Last used: 27m 15s 
  * PID: 30119   Sessions: 1       Processed: 6190    Uptime: 1h 33m 11s
    CPU: 3%      Memory  : 465M    Last used: 27m 15s 
  * PID: 9316    Sessions: 1       Processed: 2913    Uptime: 1h 26m 26s
    CPU: 1%      Memory  : 433M    Last used: 52m 49s 
  * PID: 10003   Sessions: 1       Processed: 4297    Uptime: 1h 26m 21s
    CPU: 2%      Memory  : 459M    Last used: 27m 14s 
  * PID: 10146   Sessions: 1       Processed: 4344    Uptime: 1h 26m 20s
    CPU: 2%      Memory  : 450M    Last used: 28m 21s 
  * PID: 18244   Sessions: 1       Processed: 5157    Uptime: 1h 25m 0s
    CPU: 3%      Memory  : 458M    Last used: 15m 23s a
  * PID: 21215   Sessions: 1       Processed: 4710    Uptime: 1h 24m 31s
    CPU: 2%      Memory  : 407M    Last used: 28m 42s 
  * PID: 10210   Sessions: 1       Processed: 319     Uptime: 1h 19m 11s
    CPU: 0%      Memory  : 400M    Last used: 1m 23s a
  * PID: 29089   Sessions: 1       Processed: 3611    Uptime: 1h 1m 42s
    CPU: 3%      Memory  : 403M    Last used: 26m 14s a
  * PID: 11198   Sessions: 1       Processed: 3140    Uptime: 58m 40s
    CPU: 2%      Memory  : 472M    Last used: 28m 21s ago
  * PID: 24448   Sessions: 1       Processed: 2082    Uptime: 49m 43s
    CPU: 2%      Memory  : 430M    Last used: 28m 21s ago
  * PID: 26261   Sessions: 1       Processed: 1781    Uptime: 49m 27s
    CPU: 1%      Memory  : 423M    Last used: 27m 15s ago
  * PID: 4676    Sessions: 1       Processed: 165     Uptime: 29m 10s
    CPU: 0%      Memory  : 350M    Last used: 26m 14s ago

After about an hour, the Passenger queue got unstuck and I could log in again.


Expected results:
I agree these are somewhat extreme circumstances, but still, Satellite should perhaps handle them better.


Additional info:
The only 3 changes I made compared to the default config are:

/etc/httpd/conf.d/passenger.conf
   PassengerMaxPoolSize 30
   PassengerMaxRequestQueueSize 400
   PassengerStatThrottleRate 120
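
A tweak like this only takes effect after Apache/Passenger is restarted; a minimal way to apply it (assuming the stock httpd unit) is:

   # restart Apache so Passenger picks up the new pool size and queue limit
   systemctl restart httpd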

Comment 3 Adam Ruzicka 2018-12-13 12:30:20 UTC
Notes:
We run a single ansible-playbook process per host, so starting 48k ansible-playbook processes generates a huge load. We eventually need to start running one ansible-playbook per group of hosts (let's say 1k); see the sketch below.
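
A rough illustration of that batching idea (not the actual foreman_ansible implementation; the inventory file and playbook names are made up): split the target hosts into chunks of 1000 and run one ansible-playbook process per chunk instead of one per host.

    # one playbook run per 1000-host chunk; a plain list of hostnames is a valid inventory
    split -l 1000 all_hosts.txt chunk_
    for chunk in chunk_*; do
        ansible-playbook -i "$chunk" run_command.yml
    done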

Comment 4 Andrew Puch 2019-05-24 21:32:37 UTC
The upstream Foreman ticket said:
     This should improve dramatically now that we have support for ansible-runner



As a workaround on Satellite 6.5, a better idea may be to have Ansible deploy a cron job that runs at a random time, i.e. use /etc/cron.d/.

Use Ansible to push out the "sat-killer" tasks so that they run at a random time (see the cron sketch after the script below).

NOTE: I would make sure a no-op hour for Satellite maintenance was baked in, as the load was high.
 

#!/bin/bash
# sat-killer: tasks to be run from cron at a random time

 subscription-manager refresh
 yum repolist
 yum -y install katello-host-tools
 katello-package-upload --force
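
A sketch of the /etc/cron.d/ entry that Ansible could push out (the path and script location are made up): the random sleep of up to 9000 s (~2.5 h) spreads the 48k clients out so they do not all hit the Satellite at once, mirroring the old sat-sync trick below.

 # /etc/cron.d/sat-killer (hypothetical), deployed to each client by Ansible
 0 1 * * * root perl -le 'sleep rand 9000' && /usr/local/bin/sat-killer.sh >/dev/null 2>&1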



For comparison, the old Satellite 5 sat-sync cron job:

0 1 * * * perl -le 'sleep rand 9000' && satellite-sync --email >/dev/null \
2>/dev/null

This particular job will run randomly between 1:00 a.m. and 3:30 a.m. system time each night and redirect stdout and stderr from cron to prevent duplicating the more easily read message from satellite-sync. Options other than --email can also be included. Refer to Table 6.2, “Satellite Import/Sync Options” for the full list of options. Once you exit from the editor, the modified crontab is installed immediately. 






https://github.com/taw00/howto/blob/master/howto-schedule-cron-jobs-to-run-at-random-intervals.md

https://access.redhat.com/solutions/3013801

Comment 5 Bryan Kearney 2020-01-15 21:01:09 UTC
The Satellite Team is attempting to provide an accurate backlog of bugzilla requests which we feel will be resolved in the next few releases. We do not believe this bugzilla will meet that criteria, and have plans to close it out in 1 month. This is not a reflection on the validity of the request, but a reflection of the many priorities for the product. If you have any concerns about this, feel free to contact Red Hat Technical Support or your account team. If we do not hear from you, we will close this bug out. Thank you.

Comment 6 Bryan Kearney 2020-02-03 16:30:27 UTC
Thank you for your interest in Satellite 6. We have evaluated this request, and while we recognize that it is a valid request, we do not expect this to be implemented in the product in the foreseeable future. This is due to other priorities for the product, and not a reflection on the request itself. We are therefore closing this out as WONTFIX. If you have any concerns about this, please do not reopen. Instead, feel free to contact Red Hat Technical Support. Thank you.

