Bug 1328984 - registering 800 content hosts kills pulp: pulp.server.webservices.middleware.exception Too many open files
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: jcallaha
URL:
Whiteboard:
Duplicates: 1332784
Depends On:
Blocks: 1426381
 
Reported: 2016-04-20 19:46 UTC by Jan Hutař
Modified: 2021-04-06 17:56 UTC
CC List: 20 users

Fixed In Version: python-kombu-3.0.33-8
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1426381
Environment:
Last Closed: 2017-05-01 13:52:30 UTC
Target Upstream Version:


Attachments
graphs (171.48 KB, image/png), 2016-05-10 11:49 UTC, Jan Hutař


Links
Pulp Redmine 2124 (Urgent, CLOSED - CURRENTRELEASE): Pulp status check leaves open Pipes; last updated 2016-08-19 14:00:23 UTC
Pulp Redmine 2653 (Normal, CLOSED - COMPLETE): Backport #2124, Pulp status check leaves open Pipes; last updated 2017-03-30 23:06:06 UTC
Red Hat Product Errata RHBA-2017:1191 (normal, SHIPPED_LIVE): Satellite 6.2.9 Async Bug Release; last updated 2017-05-01 17:49:42 UTC

Description Jan Hutař 2016-04-20 19:46:52 UTC
Description of problem:
Registering 700 content hosts (in batches of 100 parallel registrations with about 3 idle minutes between each batch) causes problems for the httpd (checked by `service httpd status`), pulp and pulp_auth (checked by `hammer ping`) services. Httpd logs a traceback ending with "pulp.server.webservices.middleware.exception:ERROR: (21021-78048) OSError: [Errno 24] Too many open files".


Version-Release number of selected component (if applicable):
satellite-6.2.0-8.2.beta.el7sat.noarch
pulp-server-2.8.1.3-1.el7sat.noarch
httpd-2.4.6-40.el7.x86_64


How reproducible:
1 of 1 (will try to reproduce again)


Steps to Reproduce:
1. Have Sat 6.2.0-8.2.beta with RHEL7 + RHEL7 RH Common + RHEL7 SCL synced
2. Attempt to register 1000 RHEL7 systems to that server in batches of 100
   parallel registrations, with about 3 minutes of idle time between each batch
   (a driver sketch is given below)
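
For illustration only, here is a minimal Python sketch of the registration driver described above (batches of 100 parallel registrations with a ~3 minute pause between each). The org and activation key are placeholders, and the real test drove separate container-based clients (see the docker-flood.py gist in comment 37); this only shows the batching pattern:

import subprocess
import time
from multiprocessing.dummy import Pool  # thread-backed Pool, works on Python 2 and 3

# Placeholder registration command -- substitute your own org and activation key.
REGISTER_CMD = ("subscription-manager register "
                "--org=Default_Organization --activationkey=<key> --force")

def register(i):
    # In the real test each registration ran in its own container (see comment 37).
    return subprocess.call(REGISTER_CMD, shell=True)

TOTAL, BATCH, DELAY = 1000, 100, 180
for batch in range(TOTAL // BATCH):
    pool = Pool(BATCH)
    results = pool.map(register, range(BATCH))
    pool.close()
    pool.join()
    print("batch %d: %d failures" % (batch + 1, sum(r != 0 for r in results)))
    time.sleep(DELAY)  # ~3 minutes idle between batches, as in the report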


Actual results:
The first 7 batches (i.e. 700 registered systems) went fine, but in the 8th batch 59 registrations failed with "There was an issue with the backend service pulp: Pulp message bus connection issue.". The next batch was worse, and in the 3 batches after that all registrations failed this way.

Command `service httpd status` says:
# service httpd status
Redirecting to /bin/systemctl status  httpd.service
● httpd.service - The Apache HTTP Server
   Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-04-20 08:17:24 EDT; 1h 50min ago
     Docs: man:httpd(8)
           man:apachectl(8)
  Process: 20352 ExecStop=/bin/kill -WINCH ${MAINPID} (code=exited, status=0/SUCCESS)
 Main PID: 20964 (httpd)
   Status: "Total requests: 0; Current requests/sec: 0; Current traffic:   0 B/sec"
   CGroup: /system.slice/httpd.service
           ├─13414 /usr/sbin/httpd -DFOREGROUND
           ├─13462 /usr/sbin/httpd -DFOREGROUND
           ├─15621 /usr/sbin/httpd -DFOREGROUND
           ├─17637 /usr/sbin/httpd -DFOREGROUND
           ├─17654 /usr/sbin/httpd -DFOREGROUND
           ├─17656 /usr/sbin/httpd -DFOREGROUND
           ├─17657 /usr/sbin/httpd -DFOREGROUND
           ├─17661 /usr/sbin/httpd -DFOREGROUND
           ├─17690 /usr/sbin/httpd -DFOREGROUND
           ├─17692 /usr/sbin/httpd -DFOREGROUND
           ├─17695 /usr/sbin/httpd -DFOREGROUND
           ├─17696 /usr/sbin/httpd -DFOREGROUND
           ├─17698 /usr/sbin/httpd -DFOREGROUND
           ├─17702 /usr/sbin/httpd -DFOREGROUND
           ├─17729 /usr/sbin/httpd -DFOREGROUND
           ├─17735 /usr/sbin/httpd -DFOREGROUND
           ├─17736 /usr/sbin/httpd -DFOREGROUND
           ├─17737 /usr/sbin/httpd -DFOREGROUND
           ├─17738 /usr/sbin/httpd -DFOREGROUND
           ├─17751 /usr/sbin/httpd -DFOREGROUND
           ├─20964 /usr/sbin/httpd -DFOREGROUND
           ├─21019 (wsgi:pulp)     -DFOREGROUND
           ├─21020 (wsgi:pulp)     -DFOREGROUND
           ├─21021 (wsgi:pulp)     -DFOREGROUND
           ├─21022 (wsgi:pulp-cont -DFOREGROUND
           ├─21023 (wsgi:pulp-cont -DFOREGROUND
           ├─21024 (wsgi:pulp-cont -DFOREGROUND
           ├─21025 (wsgi:pulp_forg -DFOREGROUND
           ├─21026 (wsgi:pulp_forg -DFOREGROUND
           ├─21027 (wsgi:pulp_forg -DFOREGROUND
           ├─21028 PassengerWatchdog
           ├─21031 PassengerHelperAgent
           ├─21039 PassengerLoggingAgent
           ├─21763 Passenger RackApp: /usr/share/foreman
           ├─22662 Passenger RackApp: /etc/puppet/rack
           ├─23824 Passenger RackApp: /usr/share/foreman
           ├─23988 Passenger RackApp: /usr/share/foreman
           ├─24040 Passenger RackApp: /usr/share/foreman
           └─24119 Passenger RackApp: /usr/share/foreman

Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)   File "/usr/lib/python2.7/site-packages/pulp/server/managers/auth/cert/cert_generator.py... _make_priv_key
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)     p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)   File "/usr/lib64/python2.7/subprocess.py", line 703, in __init__
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)     errread, errwrite), to_close = self._get_handles(stdin, stdout, stderr)
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)   File "/usr/lib64/python2.7/subprocess.py", line 1123, in _get_handles
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)     c2pread, c2pwrite = self.pipe_cloexec()
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)   File "/usr/lib64/python2.7/subprocess.py", line 1168, in pipe_cloexec
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048)     r, w = os.pipe()
Apr 20 09:46:08 <server> pulp[21021]: pulp.server.webservices.middleware.exception:ERROR: (21021-78048) OSError: [Errno 24] Too many open files
Apr 20 10:00:32 <server> puppet-master[22662]: Compiled catalog for <server> in environment production in 0.01 seconds
Hint: Some lines were ellipsized, use -l to show in full.
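
The traceback shows os.pipe() failing inside subprocess.Popen once the wsgi:pulp worker has exhausted its open-files limit; the cert generator is simply where the limit finally bites. Per the upstream tracker (Pulp issue 2124, "Pulp status check leaves open Pipes", linked above, ultimately fixed by updating python-kombu per comment 32), the descriptors were being leaked elsewhere. A minimal, illustrative Python 2.7 sketch of the leaky pattern versus the non-leaky one (not the actual Pulp code):

import subprocess

# Leaky pattern (illustration only): each Popen opens stdout/stderr pipes that
# are never read or closed, so the parent's descriptor count keeps growing
# until os.pipe() raises OSError: [Errno 24] Too many open files.
leaked = []
for _ in range(10):
    p = subprocess.Popen("openssl version", shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    leaked.append(p)  # pipes stay open as long as these objects are kept around

# Non-leaky variant: communicate() reads the output, waits for the child to
# exit, and closes both pipe descriptors before moving on.
p = subprocess.Popen("openssl version", shell=True,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()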

Command `hammer ping` says:
# hammer ping
[Foreman] Password for admin: 
candlepin:      
    Status:          ok
    Server Response: Duration: 29ms
candlepin_auth: 
    Status:          ok
    Server Response: Duration: 31ms
pulp:           
    Status:          FAIL
    Server Response:
pulp_auth:      
    Status: FAIL
foreman_tasks:  
    Status:          ok
    Server Response: Duration: 16ms

HW config of the system with Satellite:
  16x Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
  47 GB of RAM


Expected results:
Satellite should handle batches of 100 registrations (with some delay between batches).
Satellite should be able to return to a normal state even after some overload.


Additional info:
Filed with psuriset.

Comment 3 Pradeep Kumar Surisetty 2016-04-21 01:06:52 UTC
I have registered close to 2000 content hosts to a Satellite server, and was also able to do 100 concurrent registrations. With an increased Passenger queue, I got around 200 concurrent registrations as well.

In 6.2, with 700-800 registrations, I started noticing issues. Concurrency also seems to be an issue.

Comment 4 Jan Hutař 2016-04-21 04:16:04 UTC
Re-test (with default config): 8 batches of 100 concurrent registrations worked (1 of them failed in the setup phase, so the registration did not happen), then one batch had 42 pulp errors and the batches after that had 100 pulp errors.

Comment 5 Jan Hutař 2016-05-10 11:49:55 UTC
Created attachment 1155685 [details]
graphs

Adding a regression keyword. The graph shows that there is probably a file descriptor leak which is not present on Sat 6.1. It also shows a sudden drop in the number of open files at about 7:30.

On the graphs, both Satellites (6.1 and 6.2) were under the same load: a stream of registrations flowing through their capsules.

Comment 7 Pradeep Kumar Surisetty 2016-05-11 15:58:05 UTC
*** Bug 1332784 has been marked as a duplicate of this bug. ***

Comment 9 Chris Duryee 2016-07-15 13:08:10 UTC
If this happens again, please capture "lsof -p <pid>" output for the process that is failing. This will help with finding the leak.

Comment 10 Jan Hutař 2016-07-20 12:36:39 UTC
(In reply to Chris Duryee from comment #9)
> If this happens again, please capture "lsof -p <pid>" output for the process
> that is failing. This will help with finding the leak.

Hello Chris,
I'm a bit uncertain here: is it enough to capture this after it happens? Which pid should I capture this for? I plan to capture it for all currently running pids; is that OK?

Comment 11 Chris Duryee 2016-07-20 13:57:52 UTC
Capturing after it happens works. Your suggestion of capturing for all pids is probably better:)
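
In addition to lsof, a quick way to snapshot per-process descriptor counts is to read /proc directly. A hedged sketch (not part of the original exchange; needs root to see other users' processes, and the reporting threshold is arbitrary):

import os

# Print the open-descriptor count for every process whose /proc entry is readable.
for pid in sorted(int(d) for d in os.listdir('/proc') if d.isdigit()):
    try:
        nfds = len(os.listdir('/proc/%d/fd' % pid))
        with open('/proc/%d/cmdline' % pid) as f:
            name = f.read().replace('\0', ' ').strip() or '[kernel]'
    except EnvironmentError:
        continue  # process exited or permission denied
    if nfds > 100:  # arbitrary threshold; adjust as needed
        print('%6d  %-40s %d open fds' % (pid, name[:40], nfds))

Running this periodically (for example from cron) while the registrations are in flight should make the leaking process obvious.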

Comment 15 Michael Hrivnak 2016-07-29 14:15:02 UTC
Jan, and anyone else who adds data to this issue, please add the specific versions of all qpid-related packages that are installed. Thanks!

Comment 16 Justin Sherrill 2016-07-29 14:17:15 UTC
I'm seeing this as well when the katello:upgrades:3.0:add_export_distributor rake task runs as part of the upgrade. If a user has ~3000 repositories, the default maximum number of open files gets hit.

The wsgi:pulp process seems to be the one creating a ton of FIFO pipes.

Note this is on RHEL 6.

libqpid-dispatch-0.4-13.el6sat.x86_64
pulp-admin-client-2.8.3.3-1.el6sat.noarch
pulp-client-1.0-1.noarch
pulp-docker-plugins-2.0.1.1-1.el6sat.noarch
pulp-katello-1.0.1-1.el6sat.noarch
pulp-puppet-plugins-2.8.3.3-1.el6sat.noarch
pulp-puppet-tools-2.8.3.3-1.el6sat.noarch
pulp-rpm-admin-extensions-2.8.3.5-1.el6sat.noarch
pulp-rpm-handlers-2.8.3.5-1.el6sat.noarch
pulp-rpm-plugins-2.8.3.5-1.el6sat.noarch
pulp-selinux-2.8.3.3-1.el6sat.noarch
pulp-server-2.8.3.3-1.el6sat.noarch
python-gofer-qpid-2.7.6-1.el6sat.noarch
python-isodate-0.5.0-4.pulp.el6_6sat.noarch
python-pulp-agent-lib-2.8.3.3-1.el6sat.noarch
python-pulp-bindings-2.8.3.3-1.el6sat.noarch
python-pulp-client-lib-2.8.3.3-1.el6sat.noarch
python-pulp-common-2.8.3.3-1.el6sat.noarch
python-pulp-docker-common-2.0.1.1-1.el6sat.noarch
python-pulp-oid_validation-2.8.3.3-1.el6sat.noarch
python-pulp-puppet-common-2.8.3.3-1.el6sat.noarch
python-pulp-repoauth-2.8.3.3-1.el6sat.noarch
python-pulp-rpm-common-2.8.3.5-1.el6sat.noarch
python-pulp-streamer-2.8.3.3-1.el6sat.noarch
python-qpid-0.30-9.el6sat.noarch
python-qpid-qmf-0.30-5.el6.x86_64
qpid-cpp-client-0.30-11.el6.x86_64
qpid-cpp-client-devel-0.30-11.el6.x86_64
qpid-cpp-server-0.30-11.el6.x86_64
qpid-cpp-server-linearstore-0.30-11.el6.x86_64
qpid-dispatch-router-0.4-13.el6sat.x86_64
qpid-java-client-0.30-3.el6.noarch
qpid-java-common-0.30-3.el6.noarch
qpid-proton-c-0.9-16.el6.x86_64
qpid-qmf-0.30-5.el6.x86_64
qpid-tools-0.30-4.el6.noarch

Comment 17 Jan Hutař 2016-08-01 15:10:56 UTC
(In reply to Michael Hrivnak from comment #15)
> Jan, and anyone else who adds data to this issue, please add the specific
> versions of all qpid-related packages that are installed. Thanks!

# rpm -qa | grep qpid | sort
libqpid-dispatch-0.4-13.el7sat.x86_64
<host>-qpid-broker-1.0-1.noarch
<host>-qpid-client-cert-1.0-1.noarch
<host>-qpid-router-client-1.0-1.noarch
<host>-qpid-router-server-1.0-1.noarch
python-gofer-qpid-2.7.6-1.el7sat.noarch
python-qpid-0.30-9.el7sat.noarch
python-qpid-qmf-0.30-5.el7.x86_64
qpid-cpp-client-0.30-11.el7sat.x86_64
qpid-cpp-client-devel-0.30-11.el7sat.x86_64
qpid-cpp-server-0.30-11.el7sat.x86_64
qpid-cpp-server-linearstore-0.30-11.el7sat.x86_64
qpid-dispatch-router-0.4-13.el7sat.x86_64
qpid-proton-c-0.9-16.el7.x86_64
qpid-qmf-0.30-5.el7.x86_64
qpid-tools-0.30-4.el7.noarch
tfm-rubygem-qpid_messaging-0.30.0-7.el7sat.x86_64

Comment 18 pulp-infra@redhat.com 2016-08-02 18:30:19 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 19 pulp-infra@redhat.com 2016-08-02 18:30:22 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 20 pulp-infra@redhat.com 2016-08-02 19:30:28 UTC
The Pulp upstream bug priority is at Urgent. Updating the external tracker on this bug.

Comment 21 pulp-infra@redhat.com 2016-08-02 23:00:21 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 22 Jan Hutař 2016-08-03 19:13:00 UTC
Not sure if it is related, and I have no clue whether this is OK or not, but I noticed that with ~550 clients there are lots of pulp connections to qpidd:

# qpid-stat -q --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 | grep pulp\.agent | wc -l
554
# qpid-stat -q --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 | grep pulp\.agent | tail -n 3
  pulp.agent.ffc25d50-f0e6-487c-9810-e18e81d4ee50                                  Y                      0     0      0       0      0        0        13     1
  pulp.agent.ffc96572-749c-4a97-bb12-225be5da35cd                                  Y                      0     0      0       0      0        0        17     1
  pulp.agent.ffeab4f4-ea47-4261-b310-d559a7459ab4                                  Y                      0     0      0       0      0        0        11     1

Comment 23 pulp-infra@redhat.com 2016-08-05 16:00:23 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 24 pulp-infra@redhat.com 2016-08-15 14:00:28 UTC
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

Comment 25 pulp-infra@redhat.com 2016-08-15 17:00:30 UTC
The Pulp upstream bug status is at VERIFIED. Updating the external tracker on this bug.

Comment 26 pulp-infra@redhat.com 2016-08-19 14:00:24 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 29 Satellite Program 2017-02-23 21:08:28 UTC
Please add verification steps for this bug to help QE verify it.

Comment 32 Michael Hrivnak 2017-03-21 20:59:06 UTC
This is resolved by using python-kombu 3.0.33-6 or newer. There's nothing to backport onto Pulp, so you can just update that dep.
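
For reference, a minimal check (an assumption, not part of the original comment) that the installed kombu meets that minimum: the RPM release suffix (-6/-8) is only visible through `rpm -q python-kombu`, but the upstream version can be read from Python:

import kombu
print(kombu.__version__)  # expect 3.0.33; confirm the build with: rpm -q python-kombu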

Comment 34 pulp-infra@redhat.com 2017-03-30 02:06:40 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 35 pulp-infra@redhat.com 2017-03-30 02:06:45 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 36 pulp-infra@redhat.com 2017-03-30 23:06:08 UTC
The Pulp upstream bug status is at CLOSED - COMPLETE. Updating the external tracker on this bug.

Comment 37 jcallaha 2017-04-10 19:09:22 UTC
Verified in Satellite 6.2.9 Snap 2. 

I used a script to create/register 1,000 content hosts to the Satellite, in batches of 100, with a 3 minute gap from host creation to teardown. 

https://gist.github.com/JacobCallahan/27bb6bcf436aa8f07ed8be08c217eff4


-bash-4.2# python docker-flood.py rhel7 ibm-x3550m3-06.lab.eng.brq.redhat.com flood 7key 1000 --limit 100 --delay 180
Starting content host creation.
Created: 1
Created: 2
Created: 3
...
Created: 100
Done with container #1
Created: 101
...
Done with container #999
Done with container #1000
Finished content host creation.

-bash-4.1# lsof | cut -d ' ' -f 1 | sort | uniq -c | sort -n | tail
    108 mongod
    109 Passenger
    122 qpidd
    140 java
    141 pulp_stre
    157 sshd
    584 ruby
   2866 python
   3900 postmaste
   3906 httpd

-bash-4.1# hammer ping
candlepin:      
    Status:          ok
    Server Response: Duration: 45ms
candlepin_auth: 
    Status:          ok
    Server Response: Duration: 42ms
pulp:           
    Status:          ok
    Server Response: Duration: 66ms
foreman_tasks:  
    Status:          ok
    Server Response: Duration: 19ms

-bash-4.1# service httpd status
httpd (pid  3835) is running...

Comment 40 errata-xmlrpc 2017-05-01 13:52:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1191

