Description of problem:
Registering 700 content hosts (in batches of 100 parallel registrations with about 3 idle minutes between each batch) causes problems for the httpd (checked by `service httpd status`), pulp, and pulp_auth (checked by `hammer ping`) services. Httpd logs a traceback with "pulp.server.webservices.middleware.exception:ERROR: (21021-78048) OSError: [Errno 24] Too many open files".

Version-Release number of selected component (if applicable):
satellite-6.2.0-8.2.beta.el7sat.noarch
pulp-server-2.8.1.3-1.el7sat.noarch
httpd-2.4.6-40.el7.x86_64

How reproducible:
1 of 1 (will try to reproduce again)

Steps to Reproduce:
1. Have Sat 6.2.0-8.2.beta with RHEL7 + RHEL7 RH Common + RHEL7 SCL synced
2. Attempt to register 1000 RHEL7 systems to that server in batches of 100 parallel registrations, with about 3 minutes of idle time between each batch

Actual results:
The first 7 batches (i.e. 700 registered systems) went fine, but in the 8th batch 59 registrations failed with "There was an issue with the backend service pulp: Pulp message bus connection issue.", the next batch was worse, and in the 3 batches after that all registrations failed this way.

Command `service httpd status` says:

# service httpd status
Redirecting to /bin/systemctl status httpd.service
● httpd.service - The Apache HTTP Server
   Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-04-20 08:17:24 EDT; 1h 50min ago
     Docs: man:httpd(8)
           man:apachectl(8)
  Process: 20352 ExecStop=/bin/kill -WINCH ${MAINPID} (code=exited, status=0/SUCCESS)
 Main PID: 20964 (httpd)
   Status: "Total requests: 0; Current requests/sec: 0; Current traffic: 0 B/sec"
   CGroup: /system.slice/httpd.service
           ├─13414 /usr/sbin/httpd -DFOREGROUND
           ├─13462 /usr/sbin/httpd -DFOREGROUND
           ├─15621 /usr/sbin/httpd -DFOREGROUND
           ├─17637 /usr/sbin/httpd -DFOREGROUND
           ├─17654 /usr/sbin/httpd -DFOREGROUND
           ├─17656 /usr/sbin/httpd -DFOREGROUND
           ├─17657 /usr/sbin/httpd -DFOREGROUND
           ├─17661 /usr/sbin/httpd -DFOREGROUND
           ├─17690 /usr/sbin/httpd -DFOREGROUND
           ├─17692 /usr/sbin/httpd -DFOREGROUND
           ├─17695 /usr/sbin/httpd -DFOREGROUND
           ├─17696 /usr/sbin/httpd -DFOREGROUND
           ├─17698 /usr/sbin/httpd -DFOREGROUND
           ├─17702 /usr/sbin/httpd -DFOREGROUND
           ├─17729 /usr/sbin/httpd -DFOREGROUND
           ├─17735 /usr/sbin/httpd -DFOREGROUND
           ├─17736 /usr/sbin/httpd -DFOREGROUND
           ├─17737 /usr/sbin/httpd -DFOREGROUND
           ├─17738 /usr/sbin/httpd -DFOREGROUND
           ├─17751 /usr/sbin/httpd -DFOREGROUND
           ├─20964 /usr/sbin/httpd -DFOREGROUND
           ├─21019 (wsgi:pulp) -DFOREGROUND
           ├─21020 (wsgi:pulp) -DFOREGROUND
           ├─21021 (wsgi:pulp) -DFOREGROUND
           ├─21022 (wsgi:pulp-cont -DFOREGROUND
           ├─21023 (wsgi:pulp-cont -DFOREGROUND
           ├─21024 (wsgi:pulp-cont -DFOREGROUND
           ├─21025 (wsgi:pulp_forg -DFOREGROUND
           ├─21026 (wsgi:pulp_forg -DFOREGROUND
           ├─21027 (wsgi:pulp_forg -DFOREGROUND
           ├─21028 PassengerWatchdog
           ├─21031 PassengerHelperAgent
           ├─21039 PassengerLoggingAgent
           ├─21763 Passenger RackApp: /usr/share/foreman
           ├─22662 Passenger RackApp: /etc/puppet/rack
           ├─23824 Passenger RackApp: /usr/share/foreman
           ├─23988 Passenger RackApp: /usr/share/foreman
           ├─24040 Passenger RackApp: /usr/share/foreman
           └─24119 Passenger RackApp: /usr/share/foreman

Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)   File "/usr/lib/python2.7/site-packages/pulp/server/managers/auth/cert/cert_generator.py... _make_priv_key
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)     p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)   File "/usr/lib64/python2.7/subprocess.py", line 703, in __init__
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)     errread, errwrite), to_close = self._get_handles(stdin, stdout, stderr)
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)   File "/usr/lib64/python2.7/subprocess.py", line 1123, in _get_handles
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)     c2pread, c2pwrite = self.pipe_cloexec()
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)   File "/usr/lib64/python2.7/subprocess.py", line 1168, in pipe_cloexec
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)     r, w = os.pipe()
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048) OSError: [Errno 24] Too many open files
Apr 20 10:00:32 <server> puppet-master[22662]: Compiled catalog for <server> in environment production in 0.01 seconds
Hint: Some lines were ellipsized, use -l to show in full.

Command `hammer ping` says:

# hammer ping
[Foreman] Password for admin:
candlepin:
    Status:          ok
    Server Response: Duration: 29ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 31ms
pulp:
    Status:          FAIL
    Server Response:
pulp_auth:
    Status:          FAIL
foreman_tasks:
    Status:          ok
    Server Response: Duration: 16ms

HW config of the system with Satellite:
16x Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
47 GB of RAM

Expected results:
Satellite should handle batches of 100 registrations (with some delay between batches). Satellite should be able to resume a normal state even after some overload.

Additional info:
Filed with psuriset.
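For illustration only, the batching described above can be driven by a wrapper along these lines. This is a minimal sketch, not the exact tooling used; the hosts.txt file, organization, and activation key names are assumptions:

ORG=Default_Organization
KEY=my-activation-key
BATCH=100
DELAY=180   # roughly 3 minutes of idle time between batches

i=0
while read -r host; do
    # kick off one registration in the background (-n keeps ssh from eating hosts.txt on stdin)
    ssh -n "$host" "subscription-manager register --org=$ORG --activationkey=$KEY" &
    i=$((i + 1))
    if [ $((i % BATCH)) -eq 0 ]; then
        wait             # let the current batch of 100 finish
        sleep "$DELAY"   # idle time before starting the next batch
    fi
done < hosts.txt
wait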
I have registered close to 2000 content hosts to a Satellite server, and was also able to do 100 concurrent registrations. With an increased Passenger queue I got around 200 concurrent registrations in 6.2 as well. With 700-800 hosts I started noticing issues. Concurrency also seems to be an issue.
Re-test (with default config): 8 batches of 100 concurrent registrations worked (1 of them failed in the setup phase, so that registration did not happen), then one batch had 42 pulp errors and the batches after that had 100 pulp errors each.
Created attachment 1155685 [details]
graphs

Adding a regression keyword. The graph suggests there is probably a file descriptor leak that is not present on Sat 6.1. It also shows a sudden drop in the number of open files at about 7:30. On the graphs, both Satellites (6.1 and 6.2) were under the same load of a stream of registrations flowing through their capsules.
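To make the leak easier to pin down in a future run, the open-fd counts of the individual workers can be sampled over time. A minimal sketch, assuming the pgrep pattern matches the httpd/wsgi:pulp workers and that a one-minute interval and the /tmp log path are acceptable:

# append a timestamped open-fd count per worker pid once a minute
while true; do
    ts=$(date +%F_%T)
    for pid in $(pgrep -f 'httpd|wsgi:pulp'); do
        echo "$ts $pid $(ls /proc/"$pid"/fd 2>/dev/null | wc -l)"
    done >> /tmp/fd-counts.log
    sleep 60
done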
*** Bug 1332784 has been marked as a duplicate of this bug. ***
If this happens again, please capture "lsof -p <pid>" output for the process that is failing. This will help with finding the leak.
(In reply to Chris Duryee from comment #9)
> If this happens again, please capture "lsof -p <pid>" output for the process
> that is failing. This will help with finding the leak.

Hello Chris, I'm a bit uncertain here: is it enough to capture this after it happens? Which pid should I capture this for? I plan to capture it for all currently running pids, is that OK?
Capturing after it happens works. Your suggestion of capturing for all pids is probably better:)
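A minimal sketch of grabbing lsof for all of the httpd workers in one go; the output directory and the pgrep pattern (which should also catch the renamed wsgi:pulp workers) are assumptions:

mkdir -p /tmp/lsof-dump
for pid in $(pgrep -f 'httpd|wsgi:pulp'); do
    # one file per pid, so the leaking worker can be compared against healthy ones
    lsof -p "$pid" > "/tmp/lsof-dump/lsof-$pid.txt"
done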
Jan, and anyone else who adds data to this issue, please add the specific versions of all qpid-related packages that are installed. Thanks!
I'm seeing this as well when the katello:upgrades:3.0:add_export_distributor rake task gets added as part of the upgrade. If a user has ~3000 repositories the default max # of open files limit gets hit. The wsgi:pulp process seems to be the one creating a ton of FIFO pipes. Note this is on RHEL 6.

libqpid-dispatch-0.4-13.el6sat.x86_64
pulp-admin-client-2.8.3.3-1.el6sat.noarch
pulp-client-1.0-1.noarch
pulp-docker-plugins-2.0.1.1-1.el6sat.noarch
pulp-katello-1.0.1-1.el6sat.noarch
pulp-puppet-plugins-2.8.3.3-1.el6sat.noarch
pulp-puppet-tools-2.8.3.3-1.el6sat.noarch
pulp-rpm-admin-extensions-2.8.3.5-1.el6sat.noarch
pulp-rpm-handlers-2.8.3.5-1.el6sat.noarch
pulp-rpm-plugins-2.8.3.5-1.el6sat.noarch
pulp-selinux-2.8.3.3-1.el6sat.noarch
pulp-server-2.8.3.3-1.el6sat.noarch
python-gofer-qpid-2.7.6-1.el6sat.noarch
python-isodate-0.5.0-4.pulp.el6_6sat.noarch
python-pulp-agent-lib-2.8.3.3-1.el6sat.noarch
python-pulp-bindings-2.8.3.3-1.el6sat.noarch
python-pulp-client-lib-2.8.3.3-1.el6sat.noarch
python-pulp-common-2.8.3.3-1.el6sat.noarch
python-pulp-docker-common-2.0.1.1-1.el6sat.noarch
python-pulp-oid_validation-2.8.3.3-1.el6sat.noarch
python-pulp-puppet-common-2.8.3.3-1.el6sat.noarch
python-pulp-repoauth-2.8.3.3-1.el6sat.noarch
python-pulp-rpm-common-2.8.3.5-1.el6sat.noarch
python-pulp-streamer-2.8.3.3-1.el6sat.noarch
python-qpid-0.30-9.el6sat.noarch
python-qpid-qmf-0.30-5.el6.x86_64
qpid-cpp-client-0.30-11.el6.x86_64
qpid-cpp-client-devel-0.30-11.el6.x86_64
qpid-cpp-server-0.30-11.el6.x86_64
qpid-cpp-server-linearstore-0.30-11.el6.x86_64
qpid-dispatch-router-0.4-13.el6sat.x86_64
qpid-java-client-0.30-3.el6.noarch
qpid-java-common-0.30-3.el6.noarch
qpid-proton-c-0.9-16.el6.x86_64
qpid-qmf-0.30-5.el6.x86_64
qpid-tools-0.30-4.el6.noarch
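To see how close each wsgi:pulp worker is to the limit, something like the following can be run while the rake task is going; the pgrep pattern is an assumption based on how the worker shows up in ps:

# per worker: number of FIFO pipes held and the soft open-files limit
for pid in $(pgrep -f 'wsgi:pulp'); do
    fifos=$(lsof -p "$pid" 2>/dev/null | awk '$5 == "FIFO"' | wc -l)
    limit=$(awk '/Max open files/ {print $4}' /proc/"$pid"/limits)
    echo "pid=$pid fifos=$fifos soft_limit=$limit"
done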
(In reply to Michael Hrivnak from comment #15)
> Jan, and anyone else who adds data to this issue, please add the specific
> versions of all qpid-related packages that are installed. Thanks!

# rpm -qa | grep qpid | sort
libqpid-dispatch-0.4-13.el7sat.x86_64
<host>-qpid-broker-1.0-1.noarch
<host>-qpid-client-cert-1.0-1.noarch
<host>-qpid-router-client-1.0-1.noarch
<host>-qpid-router-server-1.0-1.noarch
python-gofer-qpid-2.7.6-1.el7sat.noarch
python-qpid-0.30-9.el7sat.noarch
python-qpid-qmf-0.30-5.el7.x86_64
qpid-cpp-client-0.30-11.el7sat.x86_64
qpid-cpp-client-devel-0.30-11.el7sat.x86_64
qpid-cpp-server-0.30-11.el7sat.x86_64
qpid-cpp-server-linearstore-0.30-11.el7sat.x86_64
qpid-dispatch-router-0.4-13.el7sat.x86_64
qpid-proton-c-0.9-16.el7.x86_64
qpid-qmf-0.30-5.el7.x86_64
qpid-tools-0.30-4.el7.noarch
tfm-rubygem-qpid_messaging-0.30.0-7.el7sat.x86_64
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Urgent. Updating the external tracker on this bug.
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.
Not sure if it is related, and I have no clue whether it is OK or not, but I noticed that with ~550 clients there are lots of pulp connections to qpidd:

# qpid-stat -q --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 | grep pulp\.agent | wc -l
554

# qpid-stat -q --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 | grep pulp\.agent | tail -n 3
  pulp.agent.ffc25d50-f0e6-487c-9810-e18e81d4ee50   Y   0   0   0   0   0   0   13   1
  pulp.agent.ffc96572-749c-4a97-bb12-225be5da35cd   Y   0   0   0   0   0   0   17   1
  pulp.agent.ffeab4f4-ea47-4261-b310-d559a7459ab4   Y   0   0   0   0   0   0   11   1
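If it is useful, the same command can be wrapped in a trivial loop to see whether the queue count keeps growing over the course of a registration run (the 5-minute interval is arbitrary):

# log the pulp.agent queue count every 5 minutes
while true; do
    count=$(qpid-stat -q --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt \
            -b amqps://localhost:5671 | grep -c 'pulp\.agent')
    echo "$(date +%F_%T) pulp.agent queues: $count"
    sleep 300
done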
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.
The Pulp upstream bug status is at VERIFIED. Updating the external tracker on this bug.
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.
Please add verification steps for this bug to help QE verify.
This is resolved by using python-kombu 3.0.33-6 or newer. There's nothing to backport onto Pulp, so you can just update that dep.
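For anyone checking an existing box: the installed version can be compared against that with rpm and, assuming the newer build is already available in the attached repos, pulled in with yum:

# rpm -q python-kombu
# yum update python-kombu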
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
The Pulp upstream bug status is at CLOSED - COMPLETE. Updating the external tracker on this bug.
Verified in Satellite 6.2.9 Snap 2.

I used a script to create/register 1,000 content hosts to the Satellite, in batches of 100, with a 3 minute gap from host creation to teardown.
https://gist.github.com/JacobCallahan/27bb6bcf436aa8f07ed8be08c217eff4

-bash-4.2# python docker-flood.py rhel7 ibm-x3550m3-06.lab.eng.brq.redhat.com flood 7key 1000 --limit 100 --delay 180
Starting content host creation.
Created: 1
Created: 2
Created: 3
...
Created: 100
Done with container #1
Created: 101
...
Done with container #999
Done with container #1000
Finished content host creation.

-bash-4.1# lsof | cut -d ' ' -f 1 | sort | uniq -c | sort -n | tail
    108 mongod
    109 Passenger
    122 qpidd
    140 java
    141 pulp_stre
    157 sshd
    584 ruby
   2866 python
   3900 postmaste
   3906 httpd

-bash-4.1# hammer ping
candlepin:
    Status:          ok
    Server Response: Duration: 45ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 42ms
pulp:
    Status:          ok
    Server Response: Duration: 66ms
foreman_tasks:
    Status:          ok
    Server Response: Duration: 19ms

-bash-4.1# service httpd status
httpd (pid 3835) is running...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1191