Bug 1659037
Summary:          Running a complex remote command via Ansible on 48k hosts causes load of almost 300

Product:          Red Hat Satellite
Component:        Ansible - Configuration Management
Version:          6.5.0
Hardware:         Unspecified
OS:               Unspecified
Status:           CLOSED WONTFIX
Severity:         medium
Priority:         unspecified
Target Milestone: Unspecified
Target Release:   Unused
Reporter:         Jan Hutař <jhutar>
Assignee:         satellite6-bugs <satellite6-bugs>
QA Contact:       Jan Hutař <jhutar>
CC:               apuch, aruzicka
Keywords:         Performance, Triaged
Type:             Bug
Last Closed:      2020-02-03 16:30:27 UTC
Attachments:      attachment 1514031: output of `passenger-status --show=requests`
Notes:

We run a single ansible-playbook process per host, so starting 48k ansible-playbook processes generates a huge load. We eventually need to start running ansible-playbook per group of hosts (let's say 1k); a sketch of that idea follows the description below. The upstream Foreman ticket said: "This should improve dramatically now that we have support for ansible-runner."

For Sat 6.5, as a workaround, a better idea may be to use Ansible to deploy a cron job that runs at a random time. So maybe the better approach is /etc/cron.d/: use Ansible to push out the "sat-killer" tasks to run at a random time. NOTE: I would make sure a no-op hour for Satellite maintenance was baked in, as the load was high.

#!/bin/bash
# sat-killer: run the remote-command tasks from cron at a random time
subscription-manager refresh
yum repolist
yum -y install katello-host-tools
katello-package-upload --force

The old Sat 5 satellite-sync job did the same thing:

0 1 * * * perl -le 'sleep rand 9000' && satellite-sync --email >/dev/null 2>/dev/null

This particular job will run randomly between 1:00 a.m. and 3:30 a.m. system time each night and redirect stdout and stderr from cron to prevent duplicating the more easily read message from satellite-sync. Options other than --email can also be included; refer to Table 6.2, "Satellite Import/Sync Options" for the full list of options. Once you exit from the editor, the modified crontab is installed immediately.

https://github.com/taw00/howto/blob/master/howto-schedule-cron-jobs-to-run-at-random-intervals.md
https://access.redhat.com/solutions/3013801

The Satellite Team is attempting to provide an accurate backlog of Bugzilla requests which we feel will be resolved in the next few releases. We do not believe this Bugzilla will meet that criteria, and we plan to close it out in 1 month. This is not a reflection on the validity of the request, but a reflection of the many priorities for the product. If you have any concerns about this, feel free to contact Red Hat Technical Support or your account team. If we do not hear from you, we will close this bug out. Thank you.

Thank you for your interest in Satellite 6. We have evaluated this request, and while we recognize that it is a valid request, we do not expect this to be implemented in the product in the foreseeable future. This is due to other priorities for the product, and not a reflection on the request itself. We are therefore closing this out as WONTFIX. If you have any concerns about this, please do not reopen; instead, feel free to contact Red Hat Technical Support. Thank you.
Created attachment 1514031 [details]
Output of `passenger-status --show=requests` as advised by pmoravec

Description of problem:
Running a remote command via Ansible on 48k hosts causes a load of almost 300.

Satellite is a VM which has: 20 CPUs, 48 GB RAM, 24 GB swap.
The KVM host (no other VM/load on it) has: 64 CPUs, 256 GB RAM, Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz.

Version-Release number of selected component (if applicable):
satellite-6.5.0-5.beta.el7sat.noarch

How reproducible:
Tried once, reproduced once.

Steps to Reproduce:
1. Have a Satellite with 48k registered hosts with ReX ssh keys deployed
2. In the evening, start this remote command:

   subscription-manager refresh
   yum repolist
   yum -y install katello-host-tools
   katello-package-upload --force

3. Check how it is going 9 hours later

Actual results:
After 4 hours, Satellite started swapping and had used almost the full swap. At the peak it had a load of 296. I ran `killall ansible-playbook` (about 1200 of these were running), then `systemctl stop dynflowd` and `systemctl stop smart_proxy_dynflow_core`. I was still unable to log in, because the passenger queue was full:

# passenger-status
Version : 4.0.18
Date    : 2018-12-13 05:59:22 -0500
Instance: 23192

----------- General information -----------
Max pool size : 30
Processes     : 30
Requests in top-level queue : 0

----------- Application groups -----------
/usr/share/foreman#default:
  App root: /usr/share/foreman
  Requests in queue: 226
  * PID: 11592   Sessions: 1   Processed: 8439   Uptime: 8h 26m 25s   CPU: 1%   Memory: 478M   Last used: 19m 28s ago
  * PID: 16023   Sessions: 1   Processed: 8836   Uptime: 8h 13m 23s   CPU: 1%   Memory: 466M   Last used: 54m 5s ago
  * PID: 5118    Sessions: 1   Processed: 8133   Uptime: 8h 8m 7s     CPU: 0%   Memory: 474M   Last used: 27m 15s ago
  * PID: 24817   Sessions: 1   Processed: 8483   Uptime: 7h 18m 15s   CPU: 1%   Memory: 416M   Last used: 36m 36s ago
  * PID: 25050   Sessions: 1   Processed: 7952   Uptime: 7h 18m 14s   CPU: 1%   Memory: 473M   Last used: 27m 15s ago
  * PID: 5688    Sessions: 1   Processed: 5362   Uptime: 7h 14m 36s   CPU: 0%   Memory: 467M   Last used: 9m 51s ago
  * PID: 21617   Sessions: 1   Processed: 7543   Uptime: 6h 33m 5s    CPU: 1%   Memory: 475M   Last used: 26m 14s ago
  * PID: 15468   Sessions: 1   Processed: 6394   Uptime: 5h 55m 1s    CPU: 1%   Memory: 409M   Last used: 26m 15s ago
  * PID: 1827    Sessions: 1   Processed: 2948   Uptime: 5h 27m 39s   CPU: 0%   Memory: 395M   Last used: 26m 55s ago
  * PID: 17077   Sessions: 1   Processed: 6645   Uptime: 5h 5m 2s     CPU: 1%   Memory: 471M   Last used: 25m 14s ago
  * PID: 28042   Sessions: 1   Processed: 6178   Uptime: 4h 49m 40s   CPU: 1%   Memory: 479M   Last used: 25m 15s ago
  * PID: 12071   Sessions: 1   Processed: 4847   Uptime: 4h 24m 31s   CPU: 1%   Memory: 463M   Last used: 59m 57s ago
  * PID: 19001   Sessions: 1   Processed: 4668   Uptime: 4h 14m 45s   CPU: 1%   Memory: 454M   Last used: 50m 44s ago
  * PID: 23899   Sessions: 1   Processed: 6356   Uptime: 4h 9m 26s    CPU: 1%   Memory: 469M   Last used: 26m 14s ago
  * PID: 23494   Sessions: 1   Processed: 6015   Uptime: 3h 20m 41s   CPU: 1%   Memory: 462M   Last used: 27m 15s ago
  * PID: 30527   Sessions: 1   Processed: 5819   Uptime: 3h 11m 1s    CPU: 1%   Memory: 416M   Last used: 14m 23s ago
  * PID: 31541   Sessions: 1   Processed: 4141   Uptime: 1h 50m 7s    CPU: 2%   Memory: 454M   Last used: 27m 18s ago
  * PID: 5637    Sessions: 1   Processed: 4589   Uptime: 1h 46m 48s   CPU: 2%   Memory: 461M   Last used: 27m 15s ago
  * PID: 30119   Sessions: 1   Processed: 6190   Uptime: 1h 33m 11s   CPU: 3%   Memory: 465M   Last used: 27m 15s ago
  * PID: 9316    Sessions: 1   Processed: 2913   Uptime: 1h 26m 26s   CPU: 1%   Memory: 433M   Last used: 52m 49s ago
  * PID: 10003   Sessions: 1   Processed: 4297   Uptime: 1h 26m 21s   CPU: 2%   Memory: 459M   Last used: 27m 14s ago
  * PID: 10146   Sessions: 1   Processed: 4344   Uptime: 1h 26m 20s   CPU: 2%   Memory: 450M   Last used: 28m 21s ago
  * PID: 18244   Sessions: 1   Processed: 5157   Uptime: 1h 25m 0s    CPU: 3%   Memory: 458M   Last used: 15m 23s ago
  * PID: 21215   Sessions: 1   Processed: 4710   Uptime: 1h 24m 31s   CPU: 2%   Memory: 407M   Last used: 28m 42s ago
  * PID: 10210   Sessions: 1   Processed: 319    Uptime: 1h 19m 11s   CPU: 0%   Memory: 400M   Last used: 1m 23s ago
  * PID: 29089   Sessions: 1   Processed: 3611   Uptime: 1h 1m 42s    CPU: 3%   Memory: 403M   Last used: 26m 14s ago
  * PID: 11198   Sessions: 1   Processed: 3140   Uptime: 58m 40s      CPU: 2%   Memory: 472M   Last used: 28m 21s ago
  * PID: 24448   Sessions: 1   Processed: 2082   Uptime: 49m 43s      CPU: 2%   Memory: 430M   Last used: 28m 21s ago
  * PID: 26261   Sessions: 1   Processed: 1781   Uptime: 49m 27s      CPU: 1%   Memory: 423M   Last used: 27m 15s ago
  * PID: 4676    Sessions: 1   Processed: 165    Uptime: 29m 10s      CPU: 0%   Memory: 350M   Last used: 26m 14s ago

After about an hour, the passenger queue got unstuck and I could log in again.

Expected results:
I agree these are somewhat extreme circumstances, but still, maybe Satellite should handle it better.

Additional info:
The only 3 changes I made compared to the default config are in /etc/httpd/conf.d/passenger.conf:

PassengerMaxPoolSize 30
PassengerMaxRequestQueueSize 400
PassengerStatThrottleRate 120