Bug 1122055
| Summary: | qpid fails to start: too many open files | ||
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Alex Krzos <akrzos> |
| Component: | Content Management | Assignee: | Jason Montleon <jmontleo> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Tazim Kolhar <tkolhar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 6.0.3 | CC: | bbuckingham, bkearney, cwelton, jross, kim.vdriet, mmccune, mmurray, perfbz, tkolhar, xdmoon, zkraus |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-08-12 13:57:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Since this issue was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release. This issue causes hammer and portions of the UI (e.g. the Products page) to become unresponsive, and requires manual intervention to restore functionality.
If qpidd is opening an excessive number of files because of pulp, this can occur on any system that accumulates enough pulp jrnl files in qpidd. Once the open-file limit is exceeded, qpidd remains running; however, as mentioned above, the portions of the UI and the hammer commands that rely on pulp time out and become completely unresponsive.
See the example below:
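One way to make a per-process file-descriptor raise persistent is a limits entry scoped to the daemon's user. This is a hedged sketch, assuming qpidd runs as a "qpidd" user and the service start path honors pam_limits (whether it does depends on the init script); the 102400 value comes from the workaround noted later in this report:

```
# /etc/security/limits.conf — sketch; assumes qpidd runs as user "qpidd"
# and that pam_limits is applied when the service starts.
qpidd  soft  nofile  102400
qpidd  hard  nofile  102400
```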
# ulimit -n 5418
# katello-service start
...
# ls -l /proc/55726/fd/ | wc -l
5416
[root@perfc-380g8-01 ~]# hammer ping
candlepin:
Status: ok
Server Response: Duration: 77ms
candlepin_auth:
Status: ok
Server Response: Duration: 52ms
pulp:
Status: ok
Server Response: Duration: 36ms
pulp_auth:
Status: FAIL
Server Response: Message: undefined method `resources' for nil:NilClass
elasticsearch:
Status: ok
Server Response: Duration: 54ms
katello_jobs:
Status: ok
Server Response: Duration: 66ms
# time hammer content-view publish --organization-id 1 --name cv-composite-265
[....................................... ] [18%]
^ The process appears frozen; in a separate terminal window:
# ls -l /proc/55726/fd/ | wc -l
5419
In the Dynflow console, the task appears stuck on 13: Actions::Pulp::Repository::CopyPackageGroup (suspended)
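The fd checks above (counting entries under /proc/PID/fd) can be scripted against the process's own soft limit. A hedged sketch using only /proc; PID defaults to the current shell, and on a Satellite host you would point it at qpidd instead (e.g. PID=$(pidof qpidd)):

```shell
# Report how close a process is to its open-file soft limit via /proc.
PID=${PID:-$$}
# Number of currently open file descriptors.
OPEN=$(ls "/proc/$PID/fd" | wc -l)
# Soft limit is the 4th field of the "Max open files" line in /proc/PID/limits.
LIMIT=$(awk '/^Max open files/ {print $4}' "/proc/$PID/limits")
echo "pid $PID: $OPEN open fds of soft limit $LIMIT"
```

When OPEN approaches LIMIT, qpidd is at risk of the journal-recovery failure shown in this report.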
Issue Recap: After discussion in the #messaging channel and some hand testing, Alex correctly identified the root cause as the OS limiting the number of file descriptors that Qpid can use to read in its journal files when it starts. In situations where qpidd manages a large number of queues (thousands) and those queues are durable, each queue receives its own journal file. When managing a huge number of queues, the OS by default does not give Qpid enough file descriptors to start correctly.
Why so many queues? The pulp.agent.<UUID> queues are created one per system that Pulp manages (Consumers). Pulp requires those queues to be durable so that updates to Consumers are not "lost", which would be a serious reliability problem. Given that reliability is required, scalability is more difficult to achieve.
The short-term workaround: raise the number of file descriptors that qpidd has access to as part of Satellite until the scalability goals are achieved. This can be done per process so that the rest of the system is not adjusted. Pulp needs to document this limitation, and I've created a Pulp issue that references this one to do that in the Pulp upstream docs [0].
The long-term fix: I've filed an upstream Qpid bug on this issue here [1]. Qpid could organize its journal files differently to avoid running out of file descriptors; see the recommendation in the Qpid upstream JIRA issue (QPID-5924) for more on that.
[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1122987
[1]: https://issues.apache.org/jira/browse/QPID-5924
See comments and proposed patch upstream at https://issues.apache.org/jira/browse/QPID-5924
Satellite 6 uses the qpid-cpp MRG product. This BZ [0] tracks the inclusion of the upstream fix into the MRG product.
[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1124906
VERIFIED:
# rpm -qa | grep foreman
foreman-libvirt-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.10-1.el6_6sat.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el6_6sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.3-1.el6_6sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.5-1.el6_6sat.noarch
foreman-postgresql-1.7.2.17-1.el6_6sat.noarch
foreman-debug-1.7.2.17-1.el6_6sat.noarch
foreman-1.7.2.17-1.el6_6sat.noarch
foreman-ovirt-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman-tasks-0.6.12.3-1.el6_6sat.noarch
foreman-proxy-1.7.2.4-1.el6_6sat.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-client-1.0-1.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-selinux-1.7.2.13-1.el6_6sat.noarch
rubygem-hammer_cli_foreman-0.1.4.9-1.el6_6sat.noarch
foreman-compute-1.7.2.17-1.el6_6sat.noarch
foreman-vmware-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman-redhat_access-0.1.0-1.el6_6sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el6_6sat.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-proxy-1.0-2.noarch
ruby193-rubygem-foreman_docker-1.2.0.9-1.el6_6sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.7-1.el6_6sat.noarch
foreman-gce-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.9-1.el6_6sat.noarch
steps:
1. Install Sat6, upload manifest, sync content, upload puppet modules
2. create, publish, promote content views
3. katello-service restart or service qpidd restart
# hammer subscription list --organization-id 1
[Foreman] Username: admin
[Foreman] Password for admin:
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
NAME | CONTRACT | ACCOUNT | SUPPORT | QUANTITY | CONSUMED | END DATE | ID | PRODUCT | QUANTITY | ATTACHED
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
CloudForms Employee Subscription | 10041814 | 477931 | Self-Support | 10 | 0 | 2022-01-01 | ff8080814cecd21a014cf9ac9a040099 | CloudForms Employee Subscription | 10 | 0
Red Hat Employee Subscription | 2596950 | 477931 | Self-Support | 10 | 0 | 2022-01-01 | ff8080814cecd21a014cf9ac994a0071 | Red Hat Employee Subscription | 10 | 0
prod | | | | Unlimited | 0 | 2045-04-19 | ff8080814cecd21a014cf9c9627200ce | prod | Unlimited | 0
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
# hammer puppet-module list
[Foreman] Username: admin
[Foreman] Password for admin:
-------------------------------------|--------|------------|--------
ID | NAME | AUTHOR | VERSION
-------------------------------------|--------|------------|--------
392162ec-b006-4b27-aaf0-34f16694281b | stdlib | puppetlabs | 4.6.0
-------------------------------------|--------|------------|--------
# hammer content-view info --name con_viewA --organization-id 1
[Foreman] Username: admin
[Foreman] Password for admin:
ID: 3
Name: con_viewA
Label: con_viewA
Composite:
Description:
Content Host Count: 0
Organization: Default Organization
Yum Repositories:
Docker Repositories:
Puppet Modules:
Lifecycle Environments:
1) ID: 2
Name: DEV
2) ID: 1
Name: Library
Versions:
1) ID: 3
Version: 1.0
Published: 2015/04/27 08:32:18
Components:
Activation Keys:
# service qpidd restart
Stopping Qpid AMQP daemon: [ OK ]
Starting Qpid AMQP daemon: [ OK ]
# hammer ping
[Foreman] Username: admin
[Foreman] Password for admin:
candlepin:
Status: ok
Server Response: Duration: 29ms
candlepin_auth:
Status: ok
Server Response: Duration: 206ms
pulp:
Status: ok
Server Response: Duration: 16ms
pulp_auth:
Status: ok
Server Response: Duration: 20ms
elasticsearch:
Status: ok
Server Response: Duration: 48ms
foreman_tasks:
Status: ok
Server Response: Duration: 1ms
This bug was slated for release with Satellite 6.1. It was fixed in Satellite 6.1.1, released on 12 August 2015.
Description of problem:
Published/promoted a large number of content views (> 100). qpid refuses to start; see Actual Results for the error message.
Version-Release number of selected component (if applicable):
Satellite 6.0.3 Beta
How reproducible:
Always on this environment
Steps to Reproduce:
1. Install Sat6, upload manifest, sync content, upload puppet modules
2. Create, publish, promote content views
3. katello-service restart or service qpidd restart
Actual results:
# service qpidd start
Starting Qpid AMQP daemon: Daemon startup failed: Queue pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119: recoverMessages() failed: jexception 0x0104 RecoveryManager::getFile() threw JERR__FILEIO: File read or write failure. (/var/lib/qpidd/qls/jrnl/pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119/818fa4b0-3319-4478-b2b0-d2195f90f695.jrnl) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004) [FAILED]
Expected results:
qpidd starts so my Satellite 6 server is usable
Additional info:
A workaround is available by raising ulimit -n and then starting qpidd (or katello-service) to bring everything back up:
# ulimit -n 102400
# service qpidd start
Starting Qpid AMQP daemon: [ OK ]
Looking at the /var/lib/qpidd/qls/jrnl/ directory, there are 2676 jrnl files, 2640 of which start with pulp.agent.
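The jrnl-file counts above can be reproduced with a short script. A sketch using the linearstore path from this report; JRNL_DIR is overridable so the same counts can be taken against any directory:

```shell
# Count qpid linearstore journal files, and how many belong to the
# per-consumer pulp.agent.* queues (default path taken from this report).
JRNL_DIR=${JRNL_DIR:-/var/lib/qpidd/qls/jrnl}
TOTAL=$(find "$JRNL_DIR" -name '*.jrnl' 2>/dev/null | wc -l)
AGENT=$(find "$JRNL_DIR" -path '*pulp.agent.*' -name '*.jrnl' 2>/dev/null | wc -l)
echo "jrnl files: $TOTAL total, $AGENT for pulp.agent.* queues"
```

TOTAL approximates the number of file descriptors qpidd needs just to recover its journals at startup, so ulimit -n must comfortably exceed it.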