Bug 1159281 - Mention request to increase ulimit of nofiles for larger deployments
Summary: Mention request to increase ulimit of nofiles for larger deployments
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Docs Install Guide
Version: 6.0.4
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Peter Ondrejka
QA Contact: Russell Dickenson
URL:
Whiteboard:
Depends On: 1159303
Blocks:
 
Reported: 2014-10-31 11:28 UTC by Pavel Moravec
Modified: 2019-09-26 16:28 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-13 13:30:24 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1136020 0 medium NEW Incomplete durable queues created : AIO error 2024-01-19 19:11:06 UTC
Red Hat Bugzilla 1169397 0 medium CLOSED gofer takes 100% CPU and does not reconnect after AMQP connection bounced 2021-08-30 12:31:59 UTC
Red Hat Bugzilla 1169416 0 high CLOSED gofer does not try to reconnect after network issue 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 1295583 0 None None None Never
Red Hat Knowledge Base (Solution) 1355703 0 None None None Never
Red Hat Knowledge Base (Solution) 1375253 0 None None None Never
Red Hat Knowledge Base (Solution) 1425893 0 None None None Never
Red Hat Knowledge Base (Solution) 1528243 0 None None None Never

Internal Links: 1136020 1169416

Description Pavel Moravec 2014-10-31 11:28:56 UTC
Document URL: 
https://access.redhat.com/documentation/en-US/Red_Hat_Satellite/6.0/html-single/Installation_Guide/index.html#Prerequisites3

Section Number and Name: 
1.5. Prerequisites

Describe the issue: 
For each content host, Pulp maintains one consumer. That consumer requires a durable queue in the qpid broker, which consumes one or two file descriptors to maintain the underlying journal file(s) (such as /var/lib/qpidd/qls/jrnl/pulp.agent.<uuid>/*.jrnl on RHEL6, similar on RHEL7).

When a Satellite or Capsule Server is expected to provision >500 content hosts, the qpid broker might reach the ulimit on open files, which defaults to 1024.

Suggestions for improvement: 
Add a notice/warning there along the lines of: "if you plan to provision many hundreds of content hosts from a Satellite Server or a Capsule, increase the ulimit on open files by 2*number_of_content_hosts".


Additional information: 
Not sure if the Install Guide is the right place for tuning Sat6; feel free to put it elsewhere.

It might also be useful to provide the particular commands/setup for increasing the ulimit - if you decide to do so, feel free to ask me for technical details.
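For example (an illustrative check only, assuming a single running qpidd process), the broker's current open-files limit and usage can be read from /proc:

cat /proc/$(pgrep -x qpidd)/limits | grep 'Max open files'
ls /proc/$(pgrep -x qpidd)/fd | wc -l

The first command shows the soft and hard limits applied to the running broker, the second counts the file descriptors it currently holds (run both as root).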

Comment 2 Justin Sherrill 2014-11-03 18:01:41 UTC
related to https://bugzilla.redhat.com/show_bug.cgi?id=1159303

Comment 3 Athene Chan 2014-11-05 22:24:01 UTC
Hi Pavel,

I am adding this to our prioritization list for review on when we can work on it. Thanks for the feedback! We'll update this ticket once we start on the work required for it.

Cheers,
Athene

Comment 4 Pavel Moravec 2014-11-21 18:57:18 UTC
Another - imho even more important notice from this field.

Background:
By default, the qpid broker is run with the --max-connections=500 option. That means the broker accepts at most 500 connections from clients. If more content hosts are needed, the broker will deny connections from them.

The suggestion:
In parallel with the ulimit, note there that for (more than a few) hundreds of content hosts, /etc/qpid/qpidd.conf needs to contain:

max-connections=XXX

where XXX is at least 500+number_of_content_hosts (???)
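For illustration only (the exact formula above still needs to be confirmed), a deployment of roughly 2000 content hosts would then need something like the following in /etc/qpid/qpidd.conf, followed by a restart of the qpidd service:

# illustrative value: 500 + 2000 content hosts
max-connections=2500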

Comment 5 Pavel Moravec 2014-11-21 19:01:55 UTC
Brian,
could you please review the calculations here? I.e. adding a new content host:
- how many file descriptors qpidd broker additionally needs? (for ulimit -n)
- how many TCP/AMQP connections are additionally established to the broker? (for max-connections broker option)

I guess there is 1 connection from goferd on the content host - or is there also some other process connecting to qpidd?

And I guess there can be up to 3 FDs (1 for the TCP connection and up to 2 for journal files).

But I have little experience with this so far so my numbers can be wrong.

Could you please provide the calculations?

Thanks.

Comment 6 Brian Bouterse 2014-11-21 21:26:42 UTC
Pavel,

I did some testing, but it would be good if someone else did this on an actual sat6 installation. Some thoughts:

- The only issue is the qpidd file descriptors

- Each content host is a Pulp consumer

- Each Pulp consumer has exactly 1 durable queue, which causes some number of file descriptors to be used during normal use, startup, and shutdown of Qpid

- AMQP connections run over TCP, so we can search the file descriptors for TCP connections

- File descriptors can be used internally by applications, so reading the code to determine the file descriptors in use isn't always a reliable method of figuring this out.

- Actually measuring the file descriptors and TCP connections of a Satellite system would provide this info more reliably.

- qpidd uses a max of 2 file descriptors per durable queue it manages, but it depends on what version you have. I recommend talking with Kim van der Riet, the developer who fixed the qpidd file descriptor issue [0] upstream. Irina would know if that fix is in the Satellite 6 MRG.


I really think measuring the way Satellite consumes these resources is the right thing to do. For instance, when I reported the issue, Kim thought it worked one way, but there was a bug and the file descriptor count was not as expected. There could be lots of things like this, but measuring the file descriptor count and connection count will be a reliable way to reason about these numbers.

We can list the file descriptors of the qpidd process with lsof. In this case my PID was 14542.

I can count total file descriptors with:   sudo lsof -p 14542 | wc -l
I can count TCP connections with:   sudo lsof -p 14542 | grep TCP | wc -l

Here are some numbers I get:

FD, TCP <- STATE
256, 2 <- No pulp service running, qpidd started.  2 listening sockets are listed as TCP connections. **
259, 5 <- start pulp_celerybeat, 3 TCP connections for celerybeat. This number should not change.
261, 7 <- start pulp_resource_manager, 2 TCP connections for this process. It will not change.
267, 13 <- start httpd. We have 6 WSGI processes and each one communicates with the message bus.
283, 29 <- start 8 workers. This indicates 2 file descriptors and 2 TCP connections per worker.
285, 31 <- each gofer consumer needs 2 file descriptors (durable queue) and 2 TCP connections.

** Note my install has a lot of durable queues already, so this number is probably already sized for 8 workers and 1 consumer. It would likely be smaller on a fresh install. Here is the listing [1] in case it helps. Hopefully this helps with the planning of Sat6 connections and file descriptors.

[0]: https://issues.apache.org/jira/browse/QPID-5924
[1]: http://ur1.ca/iuhey

Comment 7 Pavel Moravec 2014-11-30 12:54:46 UTC
(In reply to bbouters from comment #6)
> Pavel,
> 
> I did some testing, but it would be good if someone else did this on an
> actual sat6 installation. Some thoughts:

I did it now on a fresh install with a new content host. Adding one content host and running goferd there consumed 2 TCP connections and 4 FDs (that includes the FDs for the TCP sockets). I.e., the diff of "lsof -p $(pgrep qpidd)" output before and after the content host was added:

> qpidd   14484 qpidd  DEL       REG               0,10             379165 /[aio]
263a265,267
> qpidd   14484 qpidd  150u     IPv4             379163      0t0       TCP pmoravec-rhel7-sat6.gsslab.brq.redhat.com:amqps->10.34.84.212:34533 (ESTABLISHED)
> qpidd   14484 qpidd  151w      REG              253,0  2101248 272564602 /var/lib/qpidd/.qpidd/qls/jrnl/pulp.agent.a726580c-5f1e-4a79-9f11-de0adc52c1e9/c518ef68-2c60-4095-a50d-dc11f7ca42fa.jrnl
> qpidd   14484 qpidd  152u     IPv4             379166      0t0       TCP pmoravec-rhel7-sat6.gsslab.brq.redhat.com:amqps->10.34.84.212:34534 (ESTABLISHED)

So the recommendation should be (with some overhead):

For larger deployments with hundreds of content hosts, it is necessary to increase the limits on the number of file descriptors and TCP connections for the qpid broker. Assuming N is the number of content hosts:

- increase ulimit on open files / file descriptors:
  - on RHEL6, add to /etc/security/limits.conf:
  
qpidd	*	nofile	N*4+500

  - on RHEL7, add to /usr/lib/systemd/system/qpidd.service at the end of the "[Service]" section:

LimitNOFILE=N*4+500

- increase the number of allowed TCP connections by adding to /etc/qpid/qpidd.conf:

max-connections=N*2+100

A restart of the qpidd service is required to apply the change.
(please rewrite the text above so that it is clear one needs to do the calculation and not blindly copy & paste text like "max-connections=N*2+100")
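To make the calculation concrete (illustrative values only, assuming N=1000 content hosts; a "-" in limits.conf applies the value to both the soft and hard limit):

- on RHEL6, in /etc/security/limits.conf (1000*4 + 500 = 4500):

qpidd    -    nofile    4500

- on RHEL7, in the "[Service]" section of qpidd.service or a drop-in file (1000*4 + 500 = 4500):

LimitNOFILE=4500

- in /etc/qpid/qpidd.conf (1000*2 + 100 = 2100):

max-connections=2100

then run "systemctl daemon-reload" on RHEL7 after editing the unit file and restart the qpidd service.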


Assuming bug QPID-5924 to be resolved in soon version of downstream Satellite, I would not complicate docs by it.

Comment 10 Mike McCune 2015-03-17 03:51:08 UTC
We published this KCS:

https://access.redhat.com/solutions/1375253

We can expand it out, but it should suffice for now.

Comment 11 RHEL Program Management 2015-04-21 16:08:46 UTC
Since this issue was entered in Red Hat Bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

Comment 12 Pavel Moravec 2015-04-25 14:27:25 UTC
Yet another request (I will update the KCS once having some time):

For deployments with >1900 content hosts, qpidd hits the fs.aio-max-nr kernel limit. See bz1136020 for details.

This needs to be documented, as it took me almost a whole day to identify the real root cause behind the customer problem (see the just-linked support case for details - there was really _no_ evidence of a relation).

Comment 13 Pavel Moravec 2015-04-26 08:40:44 UTC
(In reply to Pavel Moravec from comment #12)
> Yet another request (I will update the KCS once having some time):

KCS 1425893 created for this.

Comment 14 Pavel Moravec 2015-05-04 17:34:33 UTC
(In reply to Pavel Moravec from comment #12)
> Yet another request (I will update the KCS once having some time):
> 
> For deployments with >1900 content hosts, qpidd hits fs.aio-max-nr kernel
> limit. See bz1136020 for details.
> 
> This needs to be documented, as it took me almost whole day to identify the
> real root cause behind the customer problem (see just linked support case
> for details - really _no_ evidence of a relation).

Just in case a precise calculation / formula is needed: one durable queue of the qpid broker consumes 33 AIO requests, i.e. when one durable queue is created, fs.aio-nr is increased by 33.
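For illustration only (the headroom below is an assumption), the current AIO usage and limit can be checked with sysctl and the limit raised at runtime, e.g. for 5000 content hosts (5000 * 33 = 165000, rounded up):

# check current AIO usage and limit
sysctl fs.aio-nr fs.aio-max-nr

# illustrative value; raise the limit at runtime
sysctl -w fs.aio-max-nr=200000

and persisted across reboots in /etc/sysctl.conf or a file under /etc/sysctl.d/.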

Comment 15 Pavel Moravec 2015-07-02 13:42:20 UTC
Never-ending story..

Please mention yet *another* tuning parameter. For >32k content hosts / >32k durable queues in the qpid broker, the issue from [1] can be hit. (I *think* the threshold depends on more factors than just the number of durable queues, as the issue is a race condition / a matter of concurrency, i.e. the faster the machine is, the higher the probability of hitting the issue. So I suggest either a lower threshold or slightly vague wording like "for deployments with many thousands / a few tens of thousands of content hosts, ..")

[1] https://access.redhat.com/solutions/1355703

Comment 16 Pavel Moravec 2015-07-12 15:53:32 UTC
Yet another tunable: for Sat6 / qpid on RHEL7 and >32k content hosts, increase vm.max_map_count. See https://access.redhat.com/solutions/1528243.
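For illustration only (the value below is an assumption; the linked solution gives the exact guidance), the tunable can be checked and raised the same way as the other sysctl settings:

sysctl vm.max_map_count

# illustrative value; persist it in /etc/sysctl.conf or /etc/sysctl.d/
sysctl -w vm.max_map_count=262144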

