Bug 1417978
Summary: | smart_proxy_dynflow_core.service on capsule keeps failing when running `yum -y install --advisory ...` ReX on 6k hosts | |||
---|---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Jan Hutař <jhutar> | |
Component: | Remote Execution | Assignee: | Adam Ruzicka <aruzicka> | |
Status: | CLOSED ERRATA | QA Contact: | Katello QA List <katello-qa-list> | |
Severity: | urgent | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 6.2.6 | CC: | adprice, bbuckingham, bkearney, cduryee, egolov, inecas, jcallaha, jhutar, mmccune, pmoravec, psuriset, zhunting | |
Target Milestone: | Unspecified | Keywords: | Performance, Triaged | |
Target Release: | Unused | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | scale_lab | |||
Fixed In Version: | rubygem-smart_proxy_dynflow-0.1.3.1-1 rubygem-smart_proxy_dynflow_core-0.1.3.1-1.el7sat | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1446716 | Environment: | |
Last Closed: | 2017-06-20 17:22:11 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: |
Description
Jan Hutař
2017-01-31 14:24:41 UTC
Note that failure happened with this setting (only "database" value is non-default):

    # cat /etc/smart_proxy_dynflow_core/settings.yml
    ---
    # Path to dynflow database, leave blank for in-memory non-persistent database
    :database: ""   # /var/lib/foreman-proxy/dynflow/dynflow.sqlite
    :console_auth: false

    # URL of the foreman, used for reporting back
    :foreman_url: https://gprfc018.sbu.lab.eng.bos.redhat.com

    # SSL settings for client authentication against foreman.
    :foreman_ssl_ca: /etc/foreman-proxy/foreman_ssl_ca.pem
    :foreman_ssl_cert: /etc/foreman-proxy/foreman_ssl_cert.pem
    :foreman_ssl_key: /etc/foreman-proxy/foreman_ssl_key.pem

    # Listen on address
    :listen: 0.0.0.0

    # Listen on port
    :port: 8008

    :use_https: true
    :ssl_ca_file: /etc/foreman-proxy/ssl_ca.pem
    :ssl_certificate: /etc/foreman-proxy/ssl_cert.pem
    :ssl_private_key: /etc/foreman-proxy/ssl_key.pem

    # :ssl_ca_file: ssl/ca.pem
    # :ssl_private_key: ssl/localhost.pem
    # :ssl_certificate: ssl/certs/localhost.pem

    # File to log to, leave empty for logging to STDOUT
    # :log_file: /var/log/foreman-proxy/smart_proxy_dynflow_core.log

    # Log level, one of UNKNOWN, FATAL, ERROR, WARN, INFO, DEBUG
    :log_level: DEBUG

So far, I was not able to reproduce the state where the proxy would actually fail (testing with 2000 hosts per proxy). I have noticed proxy starting to take constantly 100% of CPU after some time: right now, I'm not sure, if it's actually related to the DEBUG logging: and if the debug logging doesn't affect the reproducibility of the failure.
Will continue with investigation tomorrow

I'm going to create some simulated environment, that will work without actually need for real hosts to be there (simulating the command output being generated gradually) to see the behaviour and to be able to tweak some of the parameters around gathering the command output + tuning the database

(In reply to Ivan Necas from comment #6)
> So far, I was not able to reproduce the state where the proxy would actually
> fail (testing with 2000 hosts per proxy). I have noticed proxy starting to
> take constantly 100% of CPU after some time: right now, I'm not sure, if
> it's actually related to the DEBUG logging: and if the debug logging doesn't
> affect the reproducibility of the failure. Will continue with investigation
> tomorrow

Note that ":log_level: DEBUG" was there when it failed - is it default? The only non-default setting in "settings.yml" I know about is ':database: ""'.

It was a non-default. Thanks for letting me know it failed before as well with debug logging. I still was not able to reproduce this, but I've found one place that can cause the executor to get under long-term 100% CPU when handling too many executions at once. I'm working on solution now (should have it ready for initial testing tomorrow) and there is a chance this will also positively affect this behaviour.
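The simulated environment described above (command output generated gradually, with no real hosts behind it) could look roughly like the sketch below. This is only an illustration under stated assumptions: `simulated_output` and `run_simulation` are hypothetical names, the polling loop is a stand-in, and none of it reflects how Dynflow or the remote execution plugin actually gathers output.

```python
import random


def simulated_output(total_lines, max_chunk=5, seed=0):
    """Yield the lines of a fake command run in small batches.

    Hypothetical helper: each yielded list stands for the output that
    became available since the previous poll of one simulated host.
    """
    rng = random.Random(seed)
    emitted = 0
    while emitted < total_lines:
        chunk = min(rng.randint(1, max_chunk), total_lines - emitted)
        yield [f"line {emitted + i + 1}" for i in range(chunk)]
        emitted += chunk


def run_simulation(host_count, lines_per_host):
    """Poll every simulated host round-robin until all output is collected.

    Returns the collected output per host and the number of polling rounds.
    """
    hosts = {
        f"host-{n}": simulated_output(lines_per_host, seed=n)
        for n in range(host_count)
    }
    collected = {name: [] for name in hosts}
    polls = 0
    while hosts:
        polls += 1
        for name in list(hosts):
            chunk = next(hosts[name], None)
            if chunk is None:
                # This host has finished; stop polling it.
                del hosts[name]
            else:
                collected[name].extend(chunk)
    return collected, polls


if __name__ == "__main__":
    output, rounds = run_simulation(host_count=2000, lines_per_host=50)
    print(f"{len(output)} hosts finished after {rounds} polling rounds")
```

Raising `host_count` or shrinking `max_chunk` increases the number of polling rounds, which is the kind of parameter such a simulation would let one vary without real hosts.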
After some time, we've been finally able to find the reason for the smart_proxy_dynflow_core service to crash:

    [2017-02-03 06:30:33.712 #28115] ERROR -- Errno::EMFILE: Too many open files - accept(2)
    /opt/rh/rh-ruby22/root/usr/share/ruby/openssl/ssl.rb:286:in `accept'

Thanks ivan

Will increase open files on capsules

    # cat /etc/systemd/system/smart_proxy_dynflow_core.service.d/limits.conf
    [Service]
    LimitNOFILE=640000
    # systemctl daemon-reload
    # katello-service restart

(In reply to Ivan Necas from comment #11)
> After some time, we've been finally able to find the reason for the
> smart_proxy_dynflow_core service to crash:
>
> [2017-02-03 06:30:33.712 #28115] ERROR -- Errno::EMFILE: Too many open files
> - accept(2)
> /opt/rh/rh-ruby22/root/usr/share/ruby/openssl/ssl.rb:286:in `accept'

Good finding - I guess smart_proxy_dynflow_core service should/will be fixed to prevent segfaulting in this situation, am I correct?

Created redmine issue http://projects.theforeman.org/issues/18449 from this bug

Upstream bug assigned to aruzicka

Moving back to ASSIGNED since the original upstream fix fixed the issue only for EL7, PR with fix for EL6 is opened in the upstream repo.

The additional fix was merged

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1553
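For readers unfamiliar with the EMFILE error quoted in this thread, the following is a minimal sketch of how a process hits "Too many open files" once it reaches its soft open-file limit, which is the limit that `LimitNOFILE` raises for a systemd service. It is not taken from smart_proxy_dynflow_core; `open_until_emfile` is a hypothetical helper, it uses Python's standard `resource` module on a Unix-like system, and the limit of 64 is an arbitrary value chosen to make the failure quick to trigger.

```python
import errno
import os
import resource


def open_until_emfile(soft_limit=64):
    """Lower the soft open-file limit, then open descriptors until refused.

    Returns (errno of the failure, number of descriptors opened first).
    The original limits are restored before returning.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Only the soft limit is changed, so it can be raised back afterwards.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    fds = []
    try:
        while True:
            fds.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return exc.errno, len(fds)
    finally:
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))


if __name__ == "__main__":
    code, opened = open_until_emfile()
    print(f"failed with {errno.errorcode[code]} after opening {opened} descriptors")
```

A service accepting one TLS connection per managed host consumes descriptors in the same way, so a higher limit postpones the error rather than removing the need to handle it.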