Description of problem:
Apache on a 6.1 capsule has about 5k open files; on 6.2 it has 38k. It seems suspicious.

Version-Release number of selected component (if applicable):
Capsule61: capsule-installer-2.3.25-1.el7sat.noarch
Capsule62: satellite-capsule-6.2.0-9.0.beta.el7sat.noarch

How reproducible:
always

Steps to Reproduce:
1. Restart services with `katello-services restart`
2. # lsof | wc -l

Actual results:
Capsule61: 19624
Capsule62: 51776

Expected results:
We should be sure this is expected.
[root@capsule61 ~]# lsof | cut -d ' ' -f 1 | sort | uniq -c | sort -n | tail
    201 ruby
    207 gmain
    344 tuned
    348 qdrouterd
    447 Passenger
    888 qpidd
   1628 mongod
   2970 python
   4119 httpd
   6636 celery
[root@capsule62 ~]# lsof | cut -d ' ' -f 1 | sort | uniq -c | sort -n | tail
    257 ruby-time
    344 tuned
    348 qdrouterd
    453 pulp_stre
    517 ruby
    571 Passenger
    920 qpidd
   3549 mongod
   6171 celery
  36966 httpd
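As a side note, lsof can emit one row per task/thread (note truncated thread names like ruby-time and pulp_stre above), so raw `lsof | wc -l` totals can overstate the number of unique file descriptors. A quick way to count actual fds per httpd worker, as a sketch not taken from this BZ, is to read /proc directly:

# Count real file descriptors per httpd PID via /proc (threads share the fd table,
# so this avoids lsof's per-task duplication); sorted by fd count.
for pid in $(pgrep httpd); do
  echo "$pid $(ls /proc/"$pid"/fd 2>/dev/null | wc -l)"
done | sort -k2 -n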
Offending processes:

[root@satellite ~]# lsof | grep httpd | awk '{print $2}' | sort | uniq --count
    121 11642
    121 15371
    121 15700
    121 15752
    124 6627
   5412 6648
   5346 6649
   5456 6650
   2790 6651
   2772 6652
   2772 6653
   4300 6654
   4320 6655
   4300 6656
      8 6657
     26 6660
      6 6668
    121 6675
    121 6676
    121 6677
    121 6678
    121 6679
    121 6680
    121 6681
    121 6682
    121 7904
    121 7933
    121 7966

apache  6648  0.0  0.4 1089156 74440 ?  Sl  Jun02  0:42 (wsgi:pulp)      -DFOREGROUND
apache  6649  0.0  0.4 1089160 71748 ?  Sl  Jun02  0:41 (wsgi:pulp)      -DFOREGROUND
apache  6650  0.0  0.4 1089152 69776 ?  Sl  Jun02  0:41 (wsgi:pulp)      -DFOREGROUND
apache  6651  0.0  0.2  684948 34280 ?  Sl  Jun02  0:02 (wsgi:pulp-cont  -DFOREGROUND
apache  6652  0.0  0.1  684948 31604 ?  Sl  Jun02  0:02 (wsgi:pulp-cont  -DFOREGROUND
apache  6653  0.0  0.1  816020 32160 ?  Sl  Jun02  0:02 (wsgi:pulp-cont  -DFOREGROUND
apache  6654  0.0  0.3  797128 59072 ?  Sl  Jun02  0:14 (wsgi:pulp_forg  -DFOREGROUND
apache  6655  0.0  0.3  862664 59072 ?  Sl  Jun02  0:14 (wsgi:pulp_forg  -DFOREGROUND
apache  6656  0.0  0.3  797128 59072 ?  Sl  Jun02  0:14 (wsgi:pulp_forg  -DFOREGROUND

whereas in 6.1 there was only 1 wsgi process. The same version of mod_wsgi is installed. Will peek into the httpd configs.
Likely the culprit:

6.2:
[root@satellite httpd]# grep WSGIProcessGroup /etc/httpd/conf.d/*
/etc/httpd/conf.d/pulp.conf:WSGIProcessGroup pulp
/etc/httpd/conf.d/pulp.conf:    WSGIProcessGroup pulp
/etc/httpd/conf.d/pulp_content.conf:WSGIProcessGroup pulp-content
/etc/httpd/conf.d/pulp_content.conf:    WSGIProcessGroup pulp-content
/etc/httpd/conf.d/pulp_puppet.conf:WSGIProcessGroup pulp_forge

6.1:
[root@sat-perf-02 6.2_conf]# grep WSGIProcessGroup /etc/httpd/conf.d/*
/etc/httpd/conf.d/pulp.conf:WSGIProcessGroup pulp
/etc/httpd/conf.d/pulp.conf:    WSGIProcessGroup pulp
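For context: WSGIProcessGroup only routes requests into a daemon process group; the number of worker processes comes from the matching WSGIDaemonProcess directive in the same conf file. A minimal illustrative fragment of what a 3-process app definition could look like (directive options and script path are assumptions, not copied from the shipped pulp.conf):

# Hypothetical /etc/httpd/conf.d/pulp.conf fragment; "processes=3" is what
# multiplies the per-app open-file count.
WSGIDaemonProcess pulp user=apache group=apache processes=3 display-name=%{GROUP}
WSGIProcessGroup pulp
WSGIScriptAlias /pulp/api /usr/share/pulp/wsgi/webservices.wsgi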
After chatting with mhrivnak, these changes have actually been in pulp for a while, but katello/satellite hadn't pulled them in. Pulp is now configured to use multiple wsgi processes for each app. There are 3 main apps:

- pulp (the API)
- pulp_content (app that handles content fetching, facilitates lazy sync)
- pulp_forge (serves puppet content via the forge API)

In 6.1 all of these were served by a single process, but now they each have 3 processes (which is why there are 9 total). This allows for concurrent requests to pulp. We could decrease the number of pulp_forge wsgi processes to just 1, as satellite really isn't using this feature very much, but I (and the pulp team) would recommend keeping the others as is. Bumping that down to 1 is likely not required for 6.2 and could be pushed to another release. It would likely free up ~12K files.
Created redmine issue http://projects.theforeman.org/issues/15841 from this bug
The incoming fix will keep 7 WSGI processes rather than the current 9. We will lower the number of pulp_puppet processes from 3 to 1, as we rarely use that functionality. The main pulp and pulp_content wsgi processes are much more important and will likely lead to some performance improvements over 6.1.
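A sketch of what the change amounts to, assuming a mod_wsgi daemon-process setup like the one hinted at by the WSGIProcessGroup grep above (the directive options shown are illustrative, not copied from the shipped pulp_puppet.conf):

# Hypothetical /etc/httpd/conf.d/pulp_puppet.conf fragment
# Before: processes=3 (the three (wsgi:pulp_forg) workers in the ps output above)
# After:  processes=1
WSGIDaemonProcess pulp_forge user=apache group=apache processes=1 display-name=%{GROUP}
WSGIProcessGroup pulp_forge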
Failed.

Version Tested: Satellite-6.3 Snap X (where 'X' is the snap number, e.g. Snap1)

While logging in to Satellite via the non-admin UI, a UI error is seen, as shown in the screenshot.
Sorry for the previous message - I was testing on a full-blown Satellite, not just a capsule.
FYI, currently seeing the following on Capsule63:

[root@cloud-qe-04 ~]# lsof | wc -l
58629
[root@cloud-qe-04 ~]# lsof | cut -d ' ' -f 1 | sort | uniq -c | sort -n | tail
    403 ruby-time
    455 named
    462 pulp_stre
    668 Passenger
    702 qpidd
   1233 ruby
   2160 libvirtd
   4700 mongod
   6703 celery
  37671 httpd

Failed in Snap 20.
Result on 6.3 snap 22:

[~]# lsof | wc -l
47129

[~]# lsof | cut -d ' ' -f 1 | sort | uniq -c | sort -n | tail
    262 ruby-time
    330 qdrouterd
    356 tuned
    462 pulp_stre
    568 Passenger
    600 ruby
   1200 qpidd
   4050 mongod
   8072 celery
  28476 httpd

Since it's ~11k less than comment #22 but only ~3k less than the numbers provided in the description, I'm not sure whether this is enough. Could you guys provide some guidance about the threshold we are looking for on this BZ?
Renzo,

What you are seeing is expected. In 6.1, only one wsgi process was used to handle requests per python app for pulp. In 6.2, we increased that to 3 for each app, at least tripling the number of open files held by apache. It was realized that for one of the apps we really only needed 1 process, so the count was only reduced by about 20% in theory. It looks like you're seeing a bit more than that. But regardless, we are not expected to reduce it back down to 6.1 levels.
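A rough back-of-envelope check (my own arithmetic, not from the BZ): going from 9 to 7 wsgi processes is a 2/9 ≈ 22% cut, which lines up with the httpd lsof counts reported above:

# httpd lsof lines: ~36966 on 6.2 GA (9 wsgi processes) vs ~28476 on 6.3 snap 22 (7 processes)
echo "scale=4; (36966 - 28476) / 36966 * 100" | bc   # ~22.97% drop, close to 2/9 = ~22.2%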
Justin, thanks for the quick answer. Thus, because of comment #26, I am moving this BZ to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0336