Bug 1122055 - qpid fails to start: too many open files
Summary: qpid fails to start: too many open files
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Content Management
Version: 6.0.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: Unspecified
Assignee: Jason Montleon
QA Contact: Tazim Kolhar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-07-22 12:56 UTC by Alex Krzos
Modified: 2017-07-26 19:39 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-12 13:57:21 UTC
Target Upstream Version:
Embargoed:


Links:
  Apache JIRA QPID-5924 - https://issues.apache.org/jira/browse/QPID-5924

Description Alex Krzos 2014-07-22 12:56:22 UTC
Description of problem:
Published/promoted a large number of content views (> 100). qpidd refuses to start. See Actual Results for the error message.

Version-Release number of selected component (if applicable):
Satellite 6.0.3 Beta


How reproducible:
Always in this environment

Steps to Reproduce:
1. Install Sat6, upload manifest, sync content, upload puppet modules
2. Create, publish, and promote content views (a scripted sketch follows these steps)
3. katello-service restart or service qpidd restart
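
A scripted sketch of step 2 for generating many content views; the loop bound and the cv-$i names are hypothetical, promotion is omitted, and it assumes hammer content-view create accepts the same --organization-id/--name options as the publish command shown in comment 3. The views would still need content added (step 1) before publishing does real work.

for i in $(seq 1 150); do
  # create and immediately publish a throwaway content view
  hammer content-view create  --organization-id 1 --name "cv-$i"
  hammer content-view publish --organization-id 1 --name "cv-$i"
done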

Actual results:
# service qpidd start
Starting Qpid AMQP daemon: Daemon startup failed: Queue pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119: recoverMessages() failed: jexception 0x0104 RecoveryManager::getFile() threw JERR__FILEIO: File read or write failure. (/var/lib/qpidd/qls/jrnl/pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119/818fa4b0-3319-4478-b2b0-d2195f90f695.jrnl) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004)
                                                           [FAILED]

Expected results:
qpidd to start so my Satellite 6 server is useful

Additional info:
A workaround is available: raise the open-file limit with ulimit -n, then start qpidd (or katello-service) to bring everything back up correctly:
# ulimit -n 102400
# service qpidd start
Starting Qpid AMQP daemon:                                 [  OK  ]


Looking at the /var/lib/qpidd/qls/jrnl/ directory, there are 2676 jrnl files, and 2640 of them start with pulp.agent.
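
For reference, the counts above can be reproduced and compared against the current open-file limit with something like this (a sketch; the path is the one reported above):

# ls /var/lib/qpidd/qls/jrnl/ | wc -l                  # total entries under the journal directory
# ls /var/lib/qpidd/qls/jrnl/ | grep -c '^pulp.agent'  # entries belonging to pulp.agent.<UUID> queues
# ulimit -n                                            # current open-file limit in this shell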

Comment 1 RHEL Program Management 2014-07-22 13:23:46 UTC
Since this issue was entered in Red Hat Bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

Comment 3 Alex Krzos 2014-07-23 17:04:36 UTC
This issue causes hammer and portions of the UI (the Products page) to become unresponsive and requires manual intervention to restore functionality.

If qpidd is opening an excessive number of files due to pulp, then this can occur on any system where pulp "leaks" jrnl files into qpidd quickly enough. Once the open-file limit is exceeded, qpidd remains running; however, as mentioned above, the portions of the UI and the hammer commands that rely on pulp time out and become completely unresponsive.

See below example:


# ulimit -n 5418
# katello-service start
...
# ls -l /proc/55726/fd/ | wc -l
5416
[root@perfc-380g8-01 ~]# hammer ping
candlepin:
    Status:          ok
    Server Response: Duration: 77ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 52ms
pulp:
    Status:          ok
    Server Response: Duration: 36ms
pulp_auth:
    Status:          FAIL
    Server Response: Message: undefined method `resources' for nil:NilClass
elasticsearch:
    Status:          ok
    Server Response: Duration: 54ms
katello_jobs:
    Status:          ok
    Server Response: Duration: 66ms
# time hammer content-view publish --organization-id 1 --name cv-composite-265
[.......................................                                       ] [18%]


^ The process seems frozen. In a separate terminal window:

# ls -l /proc/55726/fd/ | wc -l
5419

In the Dynflow console, the task appears stuck on step 13: Actions::Pulp::Repository::CopyPackageGroup (suspended)
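
To confirm that a hang like this is really descriptor exhaustion in qpidd, the following can be checked while the task is suspended (a sketch; assumes a single qpidd process and uses pidof rather than the hard-coded PID from the transcript above):

# pid=$(pidof qpidd)
# ls /proc/$pid/fd | wc -l                  # file descriptors currently open by qpidd
# grep 'Max open files' /proc/$pid/limits   # the limit qpidd is actually running under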

Comment 4 Brian Bouterse 2014-07-24 14:16:30 UTC
Issue Recap:

After discussion in the #messaging channel and some hand testing, Alex correctly identified the root cause: the OS limits the number of file descriptors that Qpid can use to read in its journal files when it starts. In situations where qpidd manages a large number of durable queues (thousands), each queue receives its own journal file, and by default the OS does not give Qpid enough file descriptors to start correctly.

Why so many Queues?

The pulp.agent.<UUID> queues are created one-to-one for the systems that Pulp is managing (Consumers). Pulp requires those queues to be durable so that updates to Consumers are not "lost", which would be a serious reliability problem. Given that reliability is required, scalability is more difficult to achieve.
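
A quick way to see how many of these per-consumer queues a broker is carrying (a sketch; assumes the qpid-tools package is installed and the broker accepts the default local connection):

# qpid-stat -q | grep -c 'pulp.agent'    # count pulp.agent.<UUID> queues known to the broker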

The short term workaround:

Raise the number of file descriptors that qpidd has access to as part of Satellite until the scalability goals are achieved. This can be done per process, so the limits for the rest of the system are not adjusted.
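
A minimal sketch of that per-process adjustment on a SysV-init host, assuming the qpidd init script sources /etc/sysconfig/qpidd before launching the daemon (verify against the init script on your system; on systemd-based hosts the equivalent would be LimitNOFILE= in the qpidd service unit). The 102400 value is the one from the workaround in the description:

# echo 'ulimit -n 102400' >> /etc/sysconfig/qpidd   # raise the limit only for the qpidd service
# service qpidd restart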

Pulp needs to document this limitation, and I've created a Pulp issue that references this one so it gets documented in the Pulp upstream docs [0].

The long-term fix:
I've filed an upstream Qpid bug on this issue here [1]. Qpid could organize its journal files differently to avoid running out of 

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1122987
[1]: https://issues.apache.org/jira/browse/QPID-5924

Comment 5 Brian Bouterse 2014-07-24 14:21:01 UTC
finishing the sentence from my comment...

Qpid could organize its journal files differently to avoid running out of file descriptors. See the recommendation in the Qpid upstream JIRA issue (5924) for more on that.

Comment 6 Kim van der Riet 2014-07-30 14:19:07 UTC
See comments and proposed patch upstream at https://issues.apache.org/jira/browse/QPID-5924

Comment 7 Brian Bouterse 2014-07-30 15:26:36 UTC
Satellite 6 uses the qpid-cpp MRG product. This BZ [0] tracks the inclusion of the upstream fix into the MRG product.

[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1124906

Comment 11 Tazim Kolhar 2015-04-27 08:43:18 UTC
VERIFIED:

# rpm -qa | grep foreman
foreman-libvirt-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.10-1.el6_6sat.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el6_6sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.3-1.el6_6sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.5-1.el6_6sat.noarch
foreman-postgresql-1.7.2.17-1.el6_6sat.noarch
foreman-debug-1.7.2.17-1.el6_6sat.noarch
foreman-1.7.2.17-1.el6_6sat.noarch
foreman-ovirt-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman-tasks-0.6.12.3-1.el6_6sat.noarch
foreman-proxy-1.7.2.4-1.el6_6sat.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-client-1.0-1.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-selinux-1.7.2.13-1.el6_6sat.noarch
rubygem-hammer_cli_foreman-0.1.4.9-1.el6_6sat.noarch
foreman-compute-1.7.2.17-1.el6_6sat.noarch
foreman-vmware-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman-redhat_access-0.1.0-1.el6_6sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el6_6sat.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-proxy-1.0-2.noarch
ruby193-rubygem-foreman_docker-1.2.0.9-1.el6_6sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.7-1.el6_6sat.noarch
foreman-gce-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.9-1.el6_6sat.noarch

steps:
1. Install Sat6, upload manifest, sync content, upload puppet modules
2. create, publish, promote content views
3. katello-service restart or service qpidd restart

# hammer subscription list --organization-id 1
[Foreman] Username: admin
[Foreman] Password for admin: 
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
NAME                             | CONTRACT | ACCOUNT | SUPPORT      | QUANTITY  | CONSUMED | END DATE   | ID                               | PRODUCT                          | QUANTITY  | ATTACHED
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
CloudForms Employee Subscription | 10041814 | 477931  | Self-Support | 10        | 0        | 2022-01-01 | ff8080814cecd21a014cf9ac9a040099 | CloudForms Employee Subscription | 10        | 0       
Red Hat Employee Subscription    | 2596950  | 477931  | Self-Support | 10        | 0        | 2022-01-01 | ff8080814cecd21a014cf9ac994a0071 | Red Hat Employee Subscription    | 10        | 0       
prod                             |          |         |              | Unlimited | 0        | 2045-04-19 | ff8080814cecd21a014cf9c9627200ce | prod                             | Unlimited | 0       
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------

# hammer puppet-module list
[Foreman] Username: admin
[Foreman] Password for admin: 
-------------------------------------|--------|------------|--------
ID                                   | NAME   | AUTHOR     | VERSION
-------------------------------------|--------|------------|--------
392162ec-b006-4b27-aaf0-34f16694281b | stdlib | puppetlabs | 4.6.0  
-------------------------------------|--------|------------|--------

# hammer content-view info --name con_viewA --organization-id 1
[Foreman] Username: admin
[Foreman] Password for admin: 
ID:                     3
Name:                   con_viewA
Label:                  con_viewA
Composite:              
Description:            
Content Host Count:     0
Organization:           Default Organization
Yum Repositories:       

Docker Repositories:    

Puppet Modules:         

Lifecycle Environments: 
 1) ID:   2
    Name: DEV
 2) ID:   1
    Name: Library
Versions:               
 1) ID:        3
    Version:   1.0
    Published: 2015/04/27 08:32:18
Components:             

Activation Keys:

# service qpidd restart
Stopping Qpid AMQP daemon:                                 [  OK  ]
Starting Qpid AMQP daemon:                                 [  OK  ]

# hammer ping
[Foreman] Username: admin
[Foreman] Password for admin: 
candlepin:      
    Status:          ok
    Server Response: Duration: 29ms
candlepin_auth: 
    Status:          ok
    Server Response: Duration: 206ms
pulp:           
    Status:          ok
    Server Response: Duration: 16ms
pulp_auth:      
    Status:          ok
    Server Response: Duration: 20ms
elasticsearch:  
    Status:          ok
    Server Response: Duration: 48ms
foreman_tasks:  
    Status:          ok
    Server Response: Duration: 1ms

Comment 12 Bryan Kearney 2015-08-11 13:31:17 UTC
This bug is slated to be released with Satellite 6.1.

Comment 13 Bryan Kearney 2015-08-12 13:57:21 UTC
This bug was fixed in version 6.1.1 of Satellite, which was released on 12 August 2015.

