Description of problem:
Published/promoted a large number of content views (> 100). qpidd refuses to start; see Actual Results for the error message.

Version-Release number of selected component (if applicable):
Satellite 6.0.3 Beta

How reproducible:
Always on this environment

Steps to Reproduce:
1. Install Sat6, upload manifest, sync content, upload puppet modules
2. Create, publish, promote content views
3. katello-service restart or service qpidd restart

Actual results:
# service qpidd start
Starting Qpid AMQP daemon: Daemon startup failed: Queue pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119: recoverMessages() failed: jexception 0x0104 RecoveryManager::getFile() threw JERR__FILEIO: File read or write failure. (/var/lib/qpidd/qls/jrnl/pulp.agent.5752dc04-7536-4e5c-b406-a0cd5d9c9119/818fa4b0-3319-4478-b2b0-d2195f90f695.jrnl) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004)
                                                           [FAILED]

Expected results:
qpidd starts, so my Satellite 6 server is usable.

Additional info:
A workaround is available: raise ulimit -n to allow more open files, then start qpidd (or run katello-service) to bring everything back up correctly:

# ulimit -n 102400
# service qpidd start
Starting Qpid AMQP daemon:                                 [ OK ]

Looking at the /var/lib/qpidd/qls/jrnl/ directory, there are 2676 jrnl files, 2640 of which start with pulp.agent.
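For reference, the workaround as a rough root-shell sketch. The 102400 value and the /var/lib/qpidd/qls/jrnl path come from this report; the find/wc line is just one way to gauge how many journal files qpidd will have to open, and the raised limit applies only to processes started from this shell, so it does not survive a reboot:

# Count the journal files qpidd must open during recovery
find /var/lib/qpidd/qls/jrnl -name '*.jrnl' | wc -l

# Raise the open-file limit for this shell, then start the services from it
# so qpidd inherits the higher limit
ulimit -n 102400
service qpidd start        # or: katello-service restart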
Since this issue was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.
This issue causes hammer and portions of the UI (e.g. the Products page) to become unresponsive, and manual intervention is required to restore functionality. If qpidd is opening excessive files due to pulp, this can occur on any system where pulp "leaks" enough jrnl files into qpidd. Once the allowed number of open files is exceeded, qpidd remains running; however, as mentioned above, the portions of the UI and the hammer commands that rely on pulp time out and become completely unresponsive. See the example below:

# ulimit -n 5418
# katello-service start
...
# ls -l /proc/55726/fd/ | wc -l
5416

[root@perfc-380g8-01 ~]# hammer ping
candlepin:
    Status:          ok
    Server Response: Duration: 77ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 52ms
pulp:
    Status:          ok
    Server Response: Duration: 36ms
pulp_auth:
    Status:          FAIL
    Server Response: Message: undefined method `resources' for nil:NilClass
elasticsearch:
    Status:          ok
    Server Response: Duration: 54ms
katello_jobs:
    Status:          ok
    Server Response: Duration: 66ms

# time hammer content-view publish --organization-id 1 --name cv-composite-265
[....................................... ] [18%]

^ The process seems frozen; in a separate terminal window:
# ls -l /proc/55726/fd/ | wc -l
5419

In the Dynflow console, the task appears stuck on 13: Actions::Pulp::Repository::CopyPackageGroup (suspended)
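A quick way to check whether the running qpidd is approaching its descriptor limit, generalizing the /proc/55726/fd check above (a sketch that assumes a single qpidd process; both /proc paths are standard Linux):

# Compare the daemon's open descriptors against its per-process limit
QPID_PID=$(pidof qpidd)
ls /proc/${QPID_PID}/fd | wc -l
grep 'Max open files' /proc/${QPID_PID}/limits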
Issue Recap:

After discussion in the #messaging channel and some hand testing, Alex correctly identified the root cause as the OS limiting the number of file descriptors that Qpid can use to read in its journal files when it starts. In situations where qpidd manages a large number of queues (thousands) and those queues are durable, each queue receives its own journal file. When managing a huge number of queues, the OS doesn't give Qpid enough file descriptors by default to start correctly.

Why so many queues?

The pulp.agent.<UUID> queues are created 1-1 for the systems that Pulp is managing (Consumers). Pulp requires those queues to be durable so that updates to Consumers are not "lost", which could be a serious reliability problem. Given that reliability is required, the scalability is more difficult to achieve.

The short-term workaround:

Raise the number of file descriptors that qpidd has access to as part of Satellite until the scalability goals are achieved. This can be done per process so that the rest of the system is not adjusted. Pulp needs to document this limitation, and I've created a Pulp issue that references this one to do that in the Pulp upstream docs [0].

The long-term fix:

I've filed an upstream Qpid bug on this issue here [1]. Qpid could organize its journal files differently to avoid running out of

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1122987
[1]: https://issues.apache.org/jira/browse/QPID-5924
finishing the sentence from my comment... Qpid could organize its journal files differently to avoid running out of file descriptors. See the recommendation in the Qpid upstream JIRA issue (5924) for more on that.
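Following up on the short-term workaround from the recap above: one possible way to make the higher per-process limit stick for the qpidd service on RHEL 6 is sketched below. It assumes the qpidd init script sources /etc/sysconfig/qpidd before launching the daemon; verify that on your system (or put the ulimit call in the init script itself) before relying on it:

# Raise the open-file limit only for qpidd, not system-wide
cat >> /etc/sysconfig/qpidd <<'EOF'
# qpidd opens one journal file per durable queue at startup; with
# thousands of pulp.agent.<UUID> queues the default limit is too low
ulimit -n 102400
EOF
service qpidd restart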
See comments and proposed patch upstream at https://issues.apache.org/jira/browse/QPID-5924
Satellite 6 uses the qpid-cpp MRG product. This BZ [0] tracks the inclusion of the upstream fix into the MRG product.

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1124906
VERIFIED:

# rpm -qa | grep foreman
foreman-libvirt-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.10-1.el6_6sat.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el6_6sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.3-1.el6_6sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.5-1.el6_6sat.noarch
foreman-postgresql-1.7.2.17-1.el6_6sat.noarch
foreman-debug-1.7.2.17-1.el6_6sat.noarch
foreman-1.7.2.17-1.el6_6sat.noarch
foreman-ovirt-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman-tasks-0.6.12.3-1.el6_6sat.noarch
foreman-proxy-1.7.2.4-1.el6_6sat.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-client-1.0-1.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-selinux-1.7.2.13-1.el6_6sat.noarch
rubygem-hammer_cli_foreman-0.1.4.9-1.el6_6sat.noarch
foreman-compute-1.7.2.17-1.el6_6sat.noarch
foreman-vmware-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman-redhat_access-0.1.0-1.el6_6sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el6_6sat.noarch
qe-sat6-rhel66.usersys.redhat.com-foreman-proxy-1.0-2.noarch
ruby193-rubygem-foreman_docker-1.2.0.9-1.el6_6sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.7-1.el6_6sat.noarch
foreman-gce-1.7.2.17-1.el6_6sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.9-1.el6_6sat.noarch

Steps:
1. Install Sat6, upload manifest, sync content, upload puppet modules
2. Create, publish, promote content views
3. katello-service restart or service qpidd restart

# hammer subscription list --organization-id 1
[Foreman] Username: admin
[Foreman] Password for admin:
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
NAME                             | CONTRACT | ACCOUNT | SUPPORT      | QUANTITY  | CONSUMED | END DATE   | ID                               | PRODUCT                          | QUANTITY  | ATTACHED
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------
CloudForms Employee Subscription | 10041814 | 477931  | Self-Support | 10        | 0        | 2022-01-01 | ff8080814cecd21a014cf9ac9a040099 | CloudForms Employee Subscription | 10        | 0
Red Hat Employee Subscription    | 2596950  | 477931  | Self-Support | 10        | 0        | 2022-01-01 | ff8080814cecd21a014cf9ac994a0071 | Red Hat Employee Subscription    | 10        | 0
prod                             |          |         |              | Unlimited | 0        | 2045-04-19 | ff8080814cecd21a014cf9c9627200ce | prod                             | Unlimited | 0
---------------------------------|----------|---------|--------------|-----------|----------|------------|----------------------------------|----------------------------------|-----------|---------

# hammer puppet-module list
[Foreman] Username: admin
[Foreman] Password for admin:
-------------------------------------|--------|------------|--------
ID                                   | NAME   | AUTHOR     | VERSION
-------------------------------------|--------|------------|--------
392162ec-b006-4b27-aaf0-34f16694281b | stdlib | puppetlabs | 4.6.0
-------------------------------------|--------|------------|--------

# hammer content-view info --name con_viewA --organization-id 1
[Foreman] Username: admin
[Foreman] Password for admin:
ID:                     3
Name:                   con_viewA
Label:                  con_viewA
Composite:
Description:
Content Host Count:     0
Organization:           Default Organization
Yum Repositories:
Docker Repositories:
Puppet Modules:
Lifecycle Environments:
 1) ID:   2
    Name: DEV
 2) ID:   1
    Name: Library
Versions:
 1) ID:        3
    Version:   1.0
    Published: 2015/04/27 08:32:18
Components:
Activation Keys:

# service qpidd restart
Stopping Qpid AMQP daemon:                                 [ OK ]
Starting Qpid AMQP daemon:                                 [ OK ]

# hammer ping
[Foreman] Username: admin
[Foreman] Password for admin:
candlepin:
    Status:          ok
    Server Response: Duration: 29ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 206ms
pulp:
    Status:          ok
    Server Response: Duration: 16ms
pulp_auth:
    Status:          ok
    Server Response: Duration: 20ms
elasticsearch:
    Status:          ok
    Server Response: Duration: 48ms
foreman_tasks:
    Status:          ok
    Server Response: Duration: 1ms
This bug is slated to be released with Satellite 6.1.
This bug was fixed in version 6.1.1 of Satellite, which was released on 12 August 2015.