Bug 1122055
| Summary: | qpid fails to start: too many open files | ||
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Alex Krzos <akrzos> |
| Component: | Content Management | Assignee: | Jason Montleon <jmontleo> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Tazim Kolhar <tkolhar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 6.0.3 | CC: | bbuckingham, bkearney, cwelton, jross, kim.vdriet, mmccune, mmurray, perfbz, tkolhar, xdmoon, zkraus |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-08-12 13:57:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Since this issue was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release. This issue causes hammer and portions of the UI (e.g. the Products page) to become unresponsive, and requires manual intervention to restore functionality.
If qpidd is opening an excessive number of files because of pulp, this can occur on any system that accumulates enough pulp jrnl files in qpidd. Once the open-file limit is exceeded, qpidd remains running; however, as mentioned above, the portions of the UI and the hammer commands that rely on pulp time out and become completely unresponsive.
See the example below:
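One way to make a per-process file-descriptor raise persistent is a limits entry scoped to the daemon's user. This is a hedged sketch, assuming qpidd runs as a "qpidd" user and the service start path honors pam_limits (whether it does depends on the init script); the 102400 value comes from the workaround noted later in this report:

```
# /etc/security/limits.conf — sketch; assumes qpidd runs as user "qpidd"
# and that pam_limits is applied when the service starts.
qpidd  soft  nofile  102400
qpidd  hard  nofile  102400
```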
# ulimit -n 5418
# katello-service start
...
# ls -l /proc/55726/fd/ | wc -l
5416
[root@perfc-380g8-01 ~]# hammer ping
candlepin:
Status: ok
Server Response: Duration: 77ms
candlepin_auth:
Status: ok
Server Response: Duration: 52ms
pulp:
Status: ok
Server Response: Duration: 36ms
pulp_auth:
Status: FAIL
Server Response: Message: undefined method `resources' for nil:NilClass
elasticsearch:
Status: ok
Server Response: Duration: 54ms
katello_jobs:
Status: ok
Server Response: Duration: 66ms
# time hammer content-view publish --organization-id 1 --name cv-composite-265
[....................................... ] [18%]
^ The process appears frozen; in a separate terminal window:
# ls -l /proc/55726/fd/ | wc -l
5419
In the Dynflow console, the task appears stuck on 13: Actions::Pulp::Repository::CopyPackageGroup (suspended)
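The fd checks above (counting entries under /proc/PID/fd) can be scripted against the process's own soft limit. A hedged sketch using only /proc; PID defaults to the current shell, and on a Satellite host you would point it at qpidd instead (e.g. PID=$(pidof qpidd)):

```shell
# Report how close a process is to its open-file soft limit via /proc.
PID=${PID:-$$}
# Number of currently open file descriptors.
OPEN=$(ls "/proc/$PID/fd" | wc -l)
# Soft limit is the 4th field of the "Max open files" line in /proc/PID/limits.
LIMIT=$(awk '/^Max open files/ {print $4}' "/proc/$PID/limits")
echo "pid $PID: $OPEN open fds of soft limit $LIMIT"
```

When OPEN approaches LIMIT, qpidd is at risk of the journal-recovery failure shown in this report.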
Issue Recap: After discussion in the #messaging channel and some hand testing, Alex correctly identified the root cause as the OS limiting the number of file descriptors that Qpid can use to read in its journal files when it starts. In situations where qpidd manages a large number of queues (thousands) and those queues are durable, each queue receives its own journal file. When managing a huge number of queues, the OS by default does not give Qpid enough file descriptors to start correctly.
Why so many queues? The pulp.agent.<UUID> queues are created one per system that Pulp manages (Consumers). Pulp requires those queues to be durable so that updates to Consumers are not "lost", which would be a serious reliability problem. Given that reliability is required, scalability is more difficult to achieve.
The short-term workaround: raise the number of file descriptors that qpidd has access to as part of Satellite until the scalability goals are achieved. This can be done per process so that the rest of the system is not adjusted. Pulp needs to document this limitation, and I've created a Pulp issue that references this one to do that in the Pulp upstream docs [0].
The long-term fix: I've filed an upstream Qpid bug on this issue here [1]. Qpid could organize its journal files differently to avoid running out of file descriptors; see the recommendation in the Qpid upstream JIRA issue (QPID-5924) for more on that.
[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1122987
[1]: https://issues.apache.org/jira/browse/QPID-5924
See comments and proposed patch upstream at https://issues.apache.org/jira/browse/QPID-5924
Satellite 6 uses the qpid-cpp MRG product. This BZ [0] tracks the inclusion of the upstream fix into the MRG product.
[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1124906
VERIFIED:
# rpm -qa | grep foreman
foreman-libvirt-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.10-1.el6_6sat.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el6_6sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.3-1.el6_6sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.5-1.el6_6sat.noarch
foreman-postgresql-1.7.2.17-1.el6_6sat.noarch
foreman-debug-1.7.2.17-1.el6_6sat.noarch
foreman-1.7.2.17-1.el6_6sat.noarch
foreman-ovirt-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman-tasks-0.6.12.3-1.el6_6sat.noarch
foreman-proxy-1.7.2.4-1.el6_6sat.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-client-1.0-1.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-selinux-1.7.2.13-1.el6_6sat.noarch
rubygem-hammer_cli_foreman-0.1.4.9-1.el6_6sat.noarch
foreman-compute-1.7.2.17-1.el6_6sat.noarch
foreman-vmware-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman-redhat_access-0.1.0-1.el6_6sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el6_6sat.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-proxy-1.0-2.noarch
ruby193-rubygem-foreman_docker-1.2.0.9-1.el6_6sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.7-1.el6_6sat.noarch
foreman-gce-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.9-1.el6_6sat.noarch
steps:
1. Install Sat6, upload manifest, sync content, upload puppet modules
2. create, publish, promote content views
3. katello-service restart or service qpidd restart
# hammer subscription list --organization-id 1
[Foreman] Username: admin
[Foreman] Password for admin:
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
NAME | CONTRACT | ACCOUNT | SUPPORT | QUANTITY | CONSUMED | END DATE | ID | PRODUCT | QUANTITY | ATTACHED
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
CloudForms Employee Subscription | 10041814 | 477931 | Self-Support | 10 | 0 | 2022-01-01 | ff8080814cecd21a014cf9ac9a040099 | CloudForms Employee Subscription | 10 | 0
Red Hat Employee Subscription | 2596950 | 477931 | Self-Support | 10 | 0 | 2022-01-01 | ff8080814cecd21a014cf9ac994a0071 | Red Hat Employee Subscription | 10 | 0
prod | | | | Unlimited | 0 | 2045-04-19 | ff8080814cecd21a014cf9c9627200ce | prod | Unlimited | 0
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
# hammer puppet-module list
[Foreman] Username: admin
[Foreman] Password for admin:
-------------------------------------|--------|------------|--------
ID | NAME | AUTHOR | VERSION
-------------------------------------|--------|------------|--------
392162ec-b006-4b27-aaf0-34f16694281b | stdlib | puppetlabs | 4.6.0
-------------------------------------|--------|------------|--------
# hammer content-view info --name con_viewA --organization-id 1
[Foreman] Username: admin
[Foreman] Password for admin:
ID: 3
Name: con_viewA
Label: con_viewA
Composite:
Description:
Content Host Count: 0
Organization: Default Organization
Yum Repositories:
Docker Repositories:
Puppet Modules:
Lifecycle Environments:
1) ID: 2
Name: DEV
2) ID: 1
Name: Library
Versions:
1) ID: 3
Version: 1.0
Published: 2015/04/27 08:32:18
Components:
Activation Keys:
# service qpidd restart
Stopping Qpid AMQP daemon: [ OK ]
Starting Qpid AMQP daemon: [ OK ]
# hammer ping
[Foreman] Username: admin
[Foreman] Password for admin:
candlepin:
Status: ok
Server Response: Duration: 29ms
candlepin_auth:
Status: ok
Server Response: Duration: 206ms
pulp:
Status: ok
Server Response: Duration: 16ms
pulp_auth:
Status: ok
Server Response: Duration: 20ms
elasticsearch:
Status: ok
Server Response: Duration: 48ms
foreman_tasks:
Status: ok
Server Response: Duration: 1ms
This bug was slated for release with Satellite 6.1. It was fixed in Satellite 6.1.1, released on 12 August 2015.
Description of problem:
Published/promoted a large number of content views (> 100). qpid refuses to start; see Actual Results for the error message.
Version-Release number of selected component (if applicable):
Satellite 6.0.3 Beta
How reproducible:
Always on this environment
Steps to Reproduce:
1. Install Sat6, upload manifest, sync content, upload puppet modules
2. Create, publish, promote content views
3. katello-service restart or service qpidd restart
Actual results:
# service qpidd start
Starting Qpid AMQP daemon: Daemon startup failed: Queue pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119: recoverMessages() failed: jexception 0x0104 RecoveryManager::getFile() threw JERR__FILEIO: File read or write failure. (/var/lib/qpidd/qls/jrnl/pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119/818fa4b0-3319-4478-b2b0-d2195f90f695.jrnl) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004) [FAILED]
Expected results:
qpidd starts so my Satellite 6 server is usable
Additional info:
A workaround is available by raising ulimit -n and then starting qpidd (or katello-service) to bring everything back up:
# ulimit -n 102400
# service qpidd start
Starting Qpid AMQP daemon: [ OK ]
Looking at the /var/lib/qpidd/qls/jrnl/ directory, there are 2676 jrnl files, 2640 of which start with pulp.agent.
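The jrnl-file counts above can be reproduced with a short script. A sketch using the linearstore path from this report; JRNL_DIR is overridable so the same counts can be taken against any directory:

```shell
# Count qpid linearstore journal files, and how many belong to the
# per-consumer pulp.agent.* queues (default path taken from this report).
JRNL_DIR=${JRNL_DIR:-/var/lib/qpidd/qls/jrnl}
TOTAL=$(find "$JRNL_DIR" -name '*.jrnl' 2>/dev/null | wc -l)
AGENT=$(find "$JRNL_DIR" -path '*pulp.agent.*' -name '*.jrnl' 2>/dev/null | wc -l)
echo "jrnl files: $TOTAL total, $AGENT for pulp.agent.* queues"
```

TOTAL approximates the number of file descriptors qpidd needs just to recover its journals at startup, so ulimit -n must comfortably exceed it.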