Bug 1417978

Summary: smart_proxy_dynflow_core.service on capsule keeps failing when running `yum -y install --advisory ...` ReX on 6k hosts
Product: Red Hat Satellite
Reporter: Jan Hutař <jhutar>
Component: Remote Execution
Assignee: Adam Ruzicka <aruzicka>
Status: CLOSED ERRATA
QA Contact: Katello QA List <katello-qa-list>
Severity: urgent
Priority: unspecified
Version: 6.2.6
CC: adprice, bbuckingham, bkearney, cduryee, egolov, inecas, jcallaha, jhutar, mmccune, pmoravec, psuriset, zhunting
Target Milestone: Unspecified
Keywords: Performance, Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Whiteboard: scale_lab
Fixed In Version: rubygem-smart_proxy_dynflow-0.1.3.1-1, rubygem-smart_proxy_dynflow_core-0.1.3.1-1.el7sat
Doc Type: If docs needed, set a value
Clone Of:
: 1446716 (view as bug list)
Last Closed: 2017-06-20 17:22:11 UTC
Type: Bug

Description Jan Hutař 2017-01-31 14:24:41 UTC
Description of problem:
smart_proxy_dynflow_core.service on capsule keeps failing when running `yum -y install --advisory ...` ReX on 6k hosts


Version-Release number of selected component (if applicable):
satellite-capsule-6.2.6-2.0.el7sat.noarch


How reproducible:
always


Steps to Reproduce:
1. Have a Satellite with 4 capsules, each in its own "Location", with 2k hosts
   registered through each capsule
2. Run a remote execution job with:
     Job category: Katello
     Job template: Install Errata - Katello SSH Default
   and select some errata. In our case the selected erratum was applicable on 6k hosts
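
For reference, a roughly equivalent invocation from the Satellite server CLI could look like the sketch below (the erratum ID and the host search query are placeholders, and the `errata` input name is assumed from the Katello SSH template):

# hammer job-invocation create \
    --job-template "Install Errata - Katello SSH Default" \
    --inputs errata="RHSA-2017:XXXX" \
    --search-query "location = Location1" \
    --async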


Actual results:
From time to time smart_proxy_dynflow_core.service on capsules fails


Expected results:
It should not fail


Additional info:
I was not able to find any errors other than these:

Jan 31 06:49:00 <capsule_fqdn> systemd[1]: smart_proxy_dynflow_core.service: main process exited, code=killed, status=6/ABRT
Jan 31 06:49:00 <capsule_fqdn> systemd[1]: Unit smart_proxy_dynflow_core.service entered failed state.
Jan 31 06:49:00 <capsule_fqdn> systemd[1]: smart_proxy_dynflow_core.service failed.
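
For anyone trying to pull more context out of a failed capsule, the journal and (if core dumps are being captured) coredumpctl can be queried along these lines; the time window below is just a placeholder matching the messages above:

# systemctl status smart_proxy_dynflow_core
# journalctl -u smart_proxy_dynflow_core --since "2017-01-31 06:00" --until "2017-01-31 07:00"
# coredumpctl list | grep smart_proxy_dynflow_core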

Comment 2 Jan Hutař 2017-01-31 14:45:08 UTC
Note that the failure happened with these settings (only the "database" value is non-default):

# cat /etc/smart_proxy_dynflow_core/settings.yml
---
# Path to dynflow database, leave blank for in-memory non-persistent database
:database: "" # /var/lib/foreman-proxy/dynflow/dynflow.sqlite
:console_auth: false

# URL of the foreman, used for reporting back
:foreman_url: https://gprfc018.sbu.lab.eng.bos.redhat.com

# SSL settings for client authentication against foreman.
:foreman_ssl_ca: /etc/foreman-proxy/foreman_ssl_ca.pem
:foreman_ssl_cert: /etc/foreman-proxy/foreman_ssl_cert.pem
:foreman_ssl_key: /etc/foreman-proxy/foreman_ssl_key.pem

# Listen on address
:listen: 0.0.0.0

# Listen on port
:port: 8008

:use_https: true
:ssl_ca_file: /etc/foreman-proxy/ssl_ca.pem
:ssl_certificate: /etc/foreman-proxy/ssl_cert.pem
:ssl_private_key: /etc/foreman-proxy/ssl_key.pem
# :ssl_ca_file: ssl/ca.pem
# :ssl_private_key: ssl/localhost.pem
# :ssl_certificate: ssl/certs/localhost.pem

# File to log to, leave empty for logging to STDOUT
# :log_file: /var/log/foreman-proxy/smart_proxy_dynflow_core.log

# Log level, one of UNKNOWN, FATAL, ERROR, WARN, INFO, DEBUG
:log_level: DEBUG

Comment 6 Ivan Necas 2017-02-01 23:18:48 UTC
So far, I have not been able to reproduce the state where the proxy actually fails (testing with 2000 hosts per proxy). I have noticed the proxy starting to constantly take 100% of CPU after some time; right now, I'm not sure whether that is actually related to the DEBUG logging, and whether the debug logging affects the reproducibility of the failure. I will continue the investigation tomorrow.

Comment 7 Ivan Necas 2017-02-01 23:26:30 UTC
I'm going to create a simulated environment that works without actually needing real hosts to be there (simulating the command output being generated gradually), so that I can observe the behaviour and tweak some of the parameters around gathering the command output and tuning the database.

Comment 8 Jan Hutař 2017-02-02 12:30:32 UTC
(In reply to Ivan Necas from comment #6)
> So far, I have not been able to reproduce the state where the proxy actually
> fails (testing with 2000 hosts per proxy). I have noticed the proxy starting
> to constantly take 100% of CPU after some time; right now, I'm not sure
> whether that is actually related to the DEBUG logging, and whether the debug
> logging affects the reproducibility of the failure. I will continue the
> investigation tomorrow.

Note that ":log_level: DEBUG" was there when it failed - is it default? The only non-default setting in "settings.yml" I know about is ':database: ""'.

Comment 9 Ivan Necas 2017-02-02 15:40:02 UTC
It was non-default. Thanks for letting me know it also failed with debug logging enabled. I still was not able to reproduce this, but I've found one place that can cause the executor to get stuck at 100% CPU long-term when handling too many executions at once. I'm working on a solution now (I should have it ready for initial testing tomorrow), and there is a chance it will also positively affect this behaviour.

Comment 11 Ivan Necas 2017-02-03 11:41:53 UTC
After some time, we've finally been able to find the reason for the smart_proxy_dynflow_core service crash:

[2017-02-03 06:30:33.712 #28115] ERROR -- Errno::EMFILE: Too many open files - accept(2)
        /opt/rh/rh-ruby22/root/usr/share/ruby/openssl/ssl.rb:286:in `accept'
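
A quick way to confirm descriptor exhaustion on the capsule while the job is running is to compare the process's open descriptor count against its effective limit, e.g. (sketch assuming a single smart_proxy_dynflow_core process):

# PID=$(pgrep -f smart_proxy_dynflow_core)
# ls /proc/$PID/fd | wc -l                  # currently open descriptors
# grep "Max open files" /proc/$PID/limits   # effective soft/hard limit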

Comment 12 Pradeep Kumar Surisetty 2017-02-03 11:55:48 UTC
Thanks, Ivan.

Will increase the open files limit on the capsules:


# cat /etc/systemd/system/smart_proxy_dynflow_core.service.d/limits.conf
[Service]
LimitNOFILE=640000
# systemctl daemon-reload
# katello-service restart
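
Assuming the drop-in above is picked up after the daemon-reload and restart, the effective limit can be double-checked with something like the following; both should report 640000 once the override is active:

# systemctl show smart_proxy_dynflow_core -p LimitNOFILE
# grep "Max open files" /proc/$(pgrep -f smart_proxy_dynflow_core)/limits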

Comment 15 Pavel Moravec 2017-02-03 15:57:34 UTC
(In reply to Ivan Necas from comment #11)
> After some time, we've finally been able to find the reason for the
> smart_proxy_dynflow_core service crash:
> 
> [2017-02-03 06:30:33.712 #28115] ERROR -- Errno::EMFILE: Too many open files
> - accept(2)
>         /opt/rh/rh-ruby22/root/usr/share/ruby/openssl/ssl.rb:286:in `accept'

Good finding - I guess the smart_proxy_dynflow_core service should/will be fixed to prevent crashing in this situation, am I correct?

Comment 18 Ivan Necas 2017-02-09 17:35:47 UTC
Created redmine issue http://projects.theforeman.org/issues/18449 from this bug

Comment 20 Satellite Program 2017-04-10 12:18:29 UTC
Upstream bug assigned to aruzicka

Comment 21 Satellite Program 2017-04-10 12:18:34 UTC
Upstream bug assigned to aruzicka

Comment 22 Adam Ruzicka 2017-04-12 11:12:35 UTC
Moving back to ASSIGNED since the original upstream fix addressed the issue only for EL7; a PR with a fix for EL6 has been opened in the upstream repo.

Comment 23 Ivan Necas 2017-04-13 10:22:32 UTC
The additional fix has been merged.

Comment 27 errata-xmlrpc 2017-06-20 17:22:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1553