Description of problem:
smart_proxy_dynflow_core.service on a capsule keeps failing when running a `yum -y install --advisory ...` remote execution (ReX) job on 6k hosts
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Have a Satellite with 4 capsules, each in its own "Location", with 2k hosts registered through each capsule
2. Run a remote execution job with:
Job category: Katello
Job template: Install Errata - Katello SSH Default
and select some errata; in our case the errata were applicable to 6k hosts
Actual results:
From time to time, smart_proxy_dynflow_core.service on the capsules fails
Expected results:
It should not fail
I was not able to find any errors beyond these:
Jan 31 06:49:00 <capsule_fqdn> systemd: smart_proxy_dynflow_core.service: main process exited, code=killed, status=6/ABRT
Jan 31 06:49:00 <capsule_fqdn> systemd: Unit smart_proxy_dynflow_core.service entered failed state.
Jan 31 06:49:00 <capsule_fqdn> systemd: smart_proxy_dynflow_core.service failed.
Note that the failure happened with these settings (only the "database" value is non-default):
# cat /etc/smart_proxy_dynflow_core/settings.yml
# Path to dynflow database, leave blank for in-memory non-persistent database
:database: "" # /var/lib/foreman-proxy/dynflow/dynflow.sqlite
# URL of the foreman, used for reporting back
# SSL settings for client authentication against foreman.
# Listen on address
# Listen on port
# :ssl_ca_file: ssl/ca.pem
# :ssl_private_key: ssl/localhost.pem
# :ssl_certificate: ssl/certs/localhost.pem
# File to log to, leave empty for logging to STDOUT
# :log_file: /var/log/foreman-proxy/smart_proxy_dynflow_core.log
# Log level, one of UNKNOWN, FATAL, ERROR, WARN, INFO, DEBUG
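For context, the empty `:database:` value above selects the in-memory, non-persistent database; the commented-out path shows the default on-disk location. Re-enabling the persistent database would look like this (path taken from the comment in the file itself):

```yaml
# settings.yml excerpt: persist the dynflow database on disk instead of in memory
:database: /var/lib/foreman-proxy/dynflow/dynflow.sqlite
```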
So far, I have not been able to reproduce the state where the proxy actually fails (testing with 2000 hosts per proxy). I have noticed the proxy starting to take a constant 100% of CPU after some time. Right now I'm not sure whether this is related to the DEBUG logging, and whether the debug logging affects the reproducibility of the failure. I will continue the investigation tomorrow.
I'm going to create a simulated environment that works without real hosts being present (simulating the command output being generated gradually), to observe the behaviour and to be able to tweak some of the parameters around gathering the command output, as well as tuning the database.
(In reply to Ivan Necas from comment #6)
> So far, I was not able to reproduce the state where the proxy would actually
> fail (testing with 2000 hosts per proxy). I have noticed proxy starting to
> take constantly 100% of CPU after some time: right now, I'm not sure, if
> it's actually related to the DEBUG logging: and if the debug logging doesn't
> affect the reproducibility of the failure. Will continue with investigation
Note that ":log_level: DEBUG" was there when it failed - is it default? The only non-default setting in "settings.yml" I know about is ':database: ""'.
It was non-default. Thanks for letting me know it failed before with debug logging as well. I still have not been able to reproduce this, but I've found one place that can cause the executor to sit at 100% CPU long-term when handling too many executions at once. I'm working on a solution now (it should be ready for initial testing tomorrow) and there is a chance it will also positively affect this behaviour.
After some time, we were finally able to find the reason for the smart_proxy_dynflow_core service crash:
[2017-02-03 06:30:33.712 #28115] ERROR -- Errno::EMFILE: Too many open files - accept(2)
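The descriptor exhaustion can be confirmed on a live capsule before raising any limits. A quick check (the process-name pattern is an assumption; the fallback to the current shell is only there so the commands run anywhere):

```shell
# PID of the smart_proxy_dynflow_core process (name pattern is an assumption)
pid=$(pgrep -f smart_proxy_dynflow_core | head -n1)
pid=${pid:-$$}   # fall back to the current shell so the commands always run

grep 'Max open files' "/proc/$pid/limits"   # effective limit for the process
ls "/proc/$pid/fd" | wc -l                  # descriptors currently in use
```

If the in-use count is close to the limit, Errno::EMFILE on accept(2) is the expected failure mode.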
Will increase the open files limit on the capsules:
# cat /etc/systemd/system/smart_proxy_dynflow_core.service.d/limits.conf
# systemctl daemon-reload
# katello-service restart
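The contents of the limits.conf drop-in are not shown in the comment above; a typical drop-in raising the limit would be (the LimitNOFILE value here is an example and should be sized to the number of hosts the capsule serves):

```ini
[Service]
LimitNOFILE=65536
```

After writing the file, the daemon-reload and service restart shown above apply the new limit.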
(In reply to Ivan Necas from comment #11)
> After some time, we've been finally able to find the reason for the
> smart_proxy_dynflow_core service to crash:
> [2017-02-03 06:30:33.712 #28115] ERROR -- Errno::EMFILE: Too many open files
> - accept(2)
> /opt/rh/rh-ruby22/root/usr/share/ruby/openssl/ssl.rb:286:in `accept'
Good finding - I guess the smart_proxy_dynflow_core service should/will be fixed to prevent crashing in this situation, am I correct?
Created redmine issue http://projects.theforeman.org/issues/18449 from this bug
Upstream bug assigned to firstname.lastname@example.org
Moving back to ASSIGNED since the original upstream fix resolved the issue only for EL7; a PR with the fix for EL6 is open in the upstream repo.
The additional fix was merged.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.