Bug 1417978 - smart_proxy_dynflow_core.service on capsule keeps failing when running `yum -y install --advisory ...` ReX on 6k hosts
Status: CLOSED ERRATA
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Remote Execution
Version: 6.2.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: 6.2.10
Target Release: Unused
Assigned To: Adam Ruzicka
QA Contact: Katello QA List
Whiteboard: scale_lab
Keywords: Performance, Triaged
Depends On:
Blocks:
Reported: 2017-01-31 09:24 EST by Jan Hutař
Modified: 2017-06-20 13:22 EDT

See Also:
Fixed In Version: rubygem-smart_proxy_dynflow-0.1.3.1-1 rubygem-smart_proxy_dynflow_core-0.1.3.1-1.el7sat
Clones: 1446716
Last Closed: 2017-06-20 13:22:11 EDT
Type: Bug

External Trackers
Tracker                | ID             | Priority | Status       | Summary                            | Last Updated
Foreman Issue Tracker  | 18449          | None     | None         | None                               | 2017-02-09 12:35 EST
Red Hat Product Errata | RHBA-2017:1553 | normal   | SHIPPED_LIVE | Satellite 6.2.10 Async Bug Release | 2017-06-20 17:19:07 EDT

Description Jan Hutař 2017-01-31 09:24:41 EST
Description of problem:
smart_proxy_dynflow_core.service on capsule keeps failing when running `yum -y install --advisory ...` ReX on 6k hosts


Version-Release number of selected component (if applicable):
satellite-capsule-6.2.6-2.0.el7sat.noarch


How reproducible:
always


Steps to Reproduce:
1. Have Satellite with 4 capsules, each in its own "Location" with 2k hosts
   registered through each capsule
2. Run remote execution job with:
     Job category: Katello
     Job template: Install Errata - Katello SSH Default
   and select some errata. In our case, the errata were applicable to 6k hosts


Actual results:
From time to time smart_proxy_dynflow_core.service on capsules fails


Expected results:
It should not fail


Additional info:
I was not able to find more errors than this:

Jan 31 06:49:00 <capsule_fqdn> systemd[1]: smart_proxy_dynflow_core.service: main process exited, code=killed, status=6/ABRT
Jan 31 06:49:00 <capsule_fqdn> systemd[1]: Unit smart_proxy_dynflow_core.service entered failed state.
Jan 31 06:49:00 <capsule_fqdn> systemd[1]: smart_proxy_dynflow_core.service failed.
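
The `status=6/ABRT` in the systemd lines above means the main process was terminated by signal 6, which is SIGABRT on Linux. As a minimal sketch (the regular expression and the `decode_exit` helper are illustrative assumptions, not part of systemd or Satellite), a line in this format can be decoded like this:

```python
import re
import signal

# Assumed pattern for the "main process exited" message shown in this report.
EXIT_PATTERN = re.compile(
    r"(?P<unit>\S+): main process exited, "
    r"code=(?P<code>\w+), status=(?P<status>\d+)/(?P<label>\w+)"
)


def decode_exit(line):
    """Return (unit, code, status, signal_name) or None if the line does not match.

    signal_name is only filled in when systemd reports code=killed,
    because only then is the status a signal number.
    """
    match = EXIT_PATTERN.search(line)
    if match is None:
        return None
    status = int(match.group("status"))
    signal_name = None
    if match.group("code") == "killed":
        signal_name = signal.Signals(status).name
    return match.group("unit"), match.group("code"), status, signal_name


if __name__ == "__main__":
    sample = (
        "Jan 31 06:49:00 <capsule_fqdn> systemd[1]: "
        "smart_proxy_dynflow_core.service: main process exited, "
        "code=killed, status=6/ABRT"
    )
    print(decode_exit(sample))
```

An abort signal usually points at the process terminating itself after an unrecoverable error, which is why the journal alone did not show the underlying cause here.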
Comment 2 Jan Hutař 2017-01-31 09:45:08 EST
Note that failure happened with this setting (only "database" value is non-default):

# cat /etc/smart_proxy_dynflow_core/settings.yml
---
# Path to dynflow database, leave blank for in-memory non-persistent database
:database: "" # /var/lib/foreman-proxy/dynflow/dynflow.sqlite
:console_auth: false

# URL of the foreman, used for reporting back
:foreman_url: https://gprfc018.sbu.lab.eng.bos.redhat.com

# SSL settings for client authentication against foreman.
:foreman_ssl_ca: /etc/foreman-proxy/foreman_ssl_ca.pem
:foreman_ssl_cert: /etc/foreman-proxy/foreman_ssl_cert.pem
:foreman_ssl_key: /etc/foreman-proxy/foreman_ssl_key.pem

# Listen on address
:listen: 0.0.0.0

# Listen on port
:port: 8008

:use_https: true
:ssl_ca_file: /etc/foreman-proxy/ssl_ca.pem
:ssl_certificate: /etc/foreman-proxy/ssl_cert.pem
:ssl_private_key: /etc/foreman-proxy/ssl_key.pem
# :ssl_ca_file: ssl/ca.pem
# :ssl_private_key: ssl/localhost.pem
# :ssl_certificate: ssl/certs/localhost.pem

# File to log to, leave empty for logging to STDOUT
# :log_file: /var/log/foreman-proxy/smart_proxy_dynflow_core.log

# Log level, one of UNKNOWN, FATAL, ERROR, WARN, INFO, DEBUG
:log_level: DEBUG
Comment 6 Ivan Necas 2017-02-01 18:18:48 EST
So far, I have not been able to reproduce the state where the proxy actually fails (testing with 2000 hosts per proxy). I have noticed the proxy starting to consume 100% of CPU constantly after some time. Right now, I'm not sure whether that is related to the DEBUG logging, or whether the debug logging affects the reproducibility of the failure. I will continue the investigation tomorrow.
Comment 7 Ivan Necas 2017-02-01 18:26:30 EST
I'm going to create a simulated environment that works without needing real hosts (simulating the command output being generated gradually), so that I can observe the behaviour and tweak some of the parameters around gathering the command output and tuning the database.
Comment 8 Jan Hutař 2017-02-02 07:30:32 EST
(In reply to Ivan Necas from comment #6)
> So far, I was not able to reproduce the state where the proxy would actually
> fail (testing with 2000 hosts per proxy). I have noticed proxy starting to
> take constantly 100% of CPU after some time: right now, I'm not sure, if
> it's actually related to the DEBUG logging: and if the debug logging doesn't
> affect the reproducibility of the failure. Will continue with investigation
> tomorrow

Note that ":log_level: DEBUG" was there when it failed - is it default? The only non-default setting in "settings.yml" I know about is ':database: ""'.
Comment 9 Ivan Necas 2017-02-02 10:40:02 EST
It was non-default. Thanks for letting me know that it failed with debug logging enabled as well. I still have not been able to reproduce this, but I've found one place that can push the executor to long-term 100% CPU when handling too many executions at once. I'm working on a solution now (it should be ready for initial testing tomorrow), and there is a chance it will also have a positive effect on this behaviour.
Comment 11 Ivan Necas 2017-02-03 06:41:53 EST
After some time, we were finally able to find the reason the smart_proxy_dynflow_core service crashes:

[2017-02-03 06:30:33.712 #28115] ERROR -- Errno::EMFILE: Too many open files - accept(2)
        /opt/rh/rh-ruby22/root/usr/share/ruby/openssl/ssl.rb:286:in `accept'
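
`Errno::EMFILE` is raised when a process reaches its per-process limit on open file descriptors (RLIMIT_NOFILE), so `accept(2)` can no longer create a descriptor for a new connection. The following sketch only illustrates that mechanism; it is not the dynflow code. It assumes Linux or another POSIX system, and the helper name `provoke_emfile` and the limit of 64 are arbitrary choices for the demonstration.

```python
import errno
import os
import resource


def provoke_emfile(soft_limit=64):
    """Lower the soft RLIMIT_NOFILE, then open descriptors until the kernel refuses.

    Returns (descriptors_opened, errno_of_failure). The original limit is
    restored and all descriptors are closed before returning.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never try to raise the limit; only lower it for the demonstration.
    if old_soft != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))

    held = []
    try:
        while True:
            # Each successful open consumes one descriptor, much like each
            # accepted connection does in a network service.
            held.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return len(held), exc.errno
    finally:
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))


if __name__ == "__main__":
    opened, err = provoke_emfile()
    print(opened, errno.errorcode[err])
```

A service that holds many descriptors at once reaches this point sooner, which is consistent with the failure showing up only on large runs.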
Comment 12 Pradeep Kumar Surisetty 2017-02-03 06:55:48 EST
Thanks, Ivan.

I will increase the open-files limit on the capsules:


# cat /etc/systemd/system/smart_proxy_dynflow_core.service.d/limits.conf
[Service]
LimitNOFILE=640000
# systemctl daemon-reload
# katello-service restart
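
One way to check that a raised `LimitNOFILE` actually applies to the running service is to compare the limit in `/proc/<pid>/limits` with the number of entries in `/proc/<pid>/fd`. A sketch, assuming Linux procfs; the helper names are hypothetical, and the PID of the service has to be looked up separately (the example inspects its own process):

```python
import os


def parse_max_open_files(limits_text):
    """Extract the (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Values are returned as strings because the kernel may print 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (descriptors_in_use, soft_limit, hard_limit) for a process.

    Reading another user's /proc/<pid>/fd normally requires root.
    """
    with open(f"/proc/{pid}/limits") as handle:
        soft, hard = parse_max_open_files(handle.read())
    in_use = len(os.listdir(f"/proc/{pid}/fd"))
    return in_use, soft, hard


if __name__ == "__main__":
    # Inspect the current process as a stand-in for the service PID.
    print(fd_usage(os.getpid()))
```

If the soft limit reported for the service still shows the old value after the restart, the drop-in file was probably not picked up.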
Comment 15 Pavel Moravec 2017-02-03 10:57:34 EST
(In reply to Ivan Necas from comment #11)
> After some time, we've been finally able to find the reason for the
> smart_proxy_dynflow_core service to crash:
> 
> [2017-02-03 06:30:33.712 #28115] ERROR -- Errno::EMFILE: Too many open files
> - accept(2)
>         /opt/rh/rh-ruby22/root/usr/share/ruby/openssl/ssl.rb:286:in `accept'

Good finding - I guess smart_proxy_dynflow_core service should/will be fixed to prevent segfaulting in this situation, am I correct?
Comment 18 Ivan Necas 2017-02-09 12:35:47 EST
Created redmine issue http://projects.theforeman.org/issues/18449 from this bug
Comment 20 pm-sat@redhat.com 2017-04-10 08:18:29 EDT
Upstream bug assigned to aruzicka@redhat.com
Comment 21 pm-sat@redhat.com 2017-04-10 08:18:34 EDT
Upstream bug assigned to aruzicka@redhat.com
Comment 22 Adam Ruzicka 2017-04-12 07:12:35 EDT
Moving back to ASSIGNED since the original upstream fix resolved the issue only for EL7. A PR with the fix for EL6 is open in the upstream repo.
Comment 23 Ivan Necas 2017-04-13 06:22:32 EDT
The additional fix was merged.
Comment 27 errata-xmlrpc 2017-06-20 13:22:11 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1553
