Bug 1128200 - ps -p uses huge amount of syscalls and system time
Summary: ps -p uses huge amount of syscalls and system time
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: procps
Version: 5.7
Hardware: All
OS: Linux
unspecified
low
Target Milestone: rc
: ---
Assignee: Jaromír Cápík
QA Contact: BaseOS QE - Apps
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-08-08 14:44 UTC by Arthur Nascimento
Modified: 2018-12-06 17:41 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-08 18:00:53 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
dozens of ps processes taking up a lot of processor time (116.62 KB, image/png)
2014-08-08 14:44 UTC, Arthur Nascimento
no flags Details
sketch of a patch to fix this performance issue (2.95 KB, patch)
2014-08-08 14:47 UTC, Arthur Nascimento
no flags Details | Diff

Description Arthur Nascimento 2014-08-08 14:44:20 UTC
Created attachment 925191 [details]
dozens of ps processes taking up a lot of processor time

Description of problem:
In a client's production machine running RHEL 5.7 (and others running 5.8 and 5.9), a hundreds-of-cores server with thousands of scripts running and several calling ps, we see that ps is one of the biggest offenders that is causing the reported high system time and decreased responsiveness of the machine (attachment cliosv2.png).

Upon inspection, we see that it scans all processes in /proc in order to find the only one that is required by -p. Through strace, we see that it makes at least (when no output options are provided) seven syscalls for each process alive, which is a quadratic degradation in the number of ps processes (since each sees each other). strace shows in the order of 50000 syscalls on that machine for each ps execution (strace ps -p 1, for instance). This way, ps takes several seconds to finish every time it is called and wastes resources in our production environments.

This operation should always be O(1) for -p.

Version-Release number of selected component (if applicable):
I am certain that this occurs in RHEL 5.7 - 5.11
And on all procps of RHEL 5 up until the latest, which is procps-3.2.7-26.el5
If I am not mistaken, this is also an issue for RHEL 6 and RHEL 7.

How reproducible:
All of the time. Looking at the code, it is clear that ps has always behaved this way.


Steps to Reproduce:
Run ps


Actual results:
$ strace ps -p 1 2>&1 | wc -l
51168
$ time ps -p 1
  PID TTY          TIME CMD
    1 ?        18:00:50 init

real    0m4.240s
user    0m0.098s
sys     0m0.424s


Expected results:
$ strace ps -p 1 2>&1 | wc -l
111
$ time ps -p 1
  PID TTY          TIME CMD
    1 ?        18:00:50 init

real    0m0.006s
user    0m0.000s
sys     0m0.006s


Additional info:

All measures of syscalls and time are on mid system load on one of the machines of our cluster. We've experienced peaks of up to 12 seconds of execution time on ps.

The expected results section shows the output of a ps on which I applied a patch to fix this performance issue.

We know that on latest HEAD of development of procps there is a commit e751606f. That commit introduces a -q command line option to exercises the O(1) behaviour. But that is only for simple outputs and only if -q is used. Therefore, it is *not* what we need, since we cannot change all of existing code to support yet another non-portable command line option in order to circumvent a performance bug. We believe that the O(1) behaviour should have always been in -p, at the very least in the special case that no other selection options are provided, and also for all possible output options.

If possible, please provide an updated version of procps for RHEL 5 that fixes tnis.

Comment 1 Arthur Nascimento 2014-08-08 14:47:56 UTC
Created attachment 925192 [details]
sketch of a patch to fix this performance issue

This is a sketch of a patch against procps-3.2.7-26 that adds the expected O(1) to -p

Comment 2 Jaromír Cápík 2014-08-08 16:25:26 UTC
Hello Arthur.

The reason for introducing the optimized method under the -q switch has a good reason. The -p option is supposed to work in combination with other selection options which always require searching for the relevant information in all the running processes. The logical operator between the selection options is 'OR' and therefore such optimization would cause regressions. Even the qsort applied to the list might be considered unwanted and cause delay/load spike if you pass a longer list. That's why the fast method was introduced separately under the -q option and we don't sort the list of PIDs with -q. Other selection options are disabled in this case as they cannot work correctly with the optimized method.
Moreover, changing the -p behaviour could cause other unnoticed regressions and as the ps tool is used by zounds of users in their scripts, we're strongly discouraged to touch the -p option just for this reason. The very first implementation of the optimization was reusing the -p switch and I was not satisfied with the result and put it back.
So, it is not a performance bug, it is a long time present feature that comes from the design of selection options.
Unless you find really strong arguments to convince me, I could only offer you a backport of the -q switch.
But regardless of my comments, I don't think PM will approve this bug in such late phase of the RHEL5 maintenance.

Please, let me know.

Regards,
Jaromir.

Comment 3 Arthur Nascimento 2014-08-08 17:02:14 UTC
Hi Jaromir,

I understand what you say and if you'll look into the patch and the outputs you'll see that the behaviour is maintaned (at least for all cases I tried). The qsort+uniquefy hack was to avoid the regression in which -p 1,1 would return twice the resultas as -p 1, but that is not an issue anymore.

For example:
$ ps -p 1 -u $LOGNAME -l
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0     1     0  0  75   0 -  2592 ?      ?        18:02:37 init
5 S   150 14569 14567  0  75   0 - 22559 ?      ?        00:00:00 sshd
0 S   150 14570 14569  0  75   0 - 16519 wait   pts/35   00:00:00 bash
0 S   150 16974     1  0  77   0 - 15962 wait   ?        00:00:00 sh
1 S   150 17051 16974  0  77   0 - 15963 wait   ?        00:19:37 sh
0 S   150 17057 16974  0  78   0 - 14727 pipe_w ?        00:01:31 tee
5 S   150 25919 25910  0  76   0 - 22559 ?      ?        00:00:00 sshd
0 S   150 25920 25919  0  76   0 - 16519 -      pts/43   00:00:00 bash
0 S   150 28519     1  0  78   0 - 15962 wait   ?        00:00:00 sh
0 S   150 28520 28519  1  75   0 - 14734 pause  ?        02:30:08 mpstat
0 S   150 28521 28519  0  78   0 - 14727 pipe_w ?        00:01:08 tee
0 S   150 29941 17051  0  77   0 - 14727 -      ?        00:00:00 sleep
0 R   150 29983 14570  0  78   0 - 16402 -      pts/35   00:00:00 ps

With patch:
$ ./ps -p 1 -u $LOGNAME -l
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0     1     0  0  75   0 -  2592 ?      ?        18:02:36 init
5 S   150 14569 14567  0  75   0 - 22559 ?      ?        00:00:00 sshd
0 S   150 14570 14569  0  75   0 - 16519 wait   pts/35   00:00:00 bash
0 S   150 16974     1  0  77   0 - 15962 wait   ?        00:00:00 sh
1 S   150 17051 16974  0  78   0 - 15963 wait   ?        00:19:37 sh
0 S   150 17057 16974  0  78   0 - 14727 pipe_w ?        00:01:31 tee
5 S   150 25919 25910  0  76   0 - 22559 ?      ?        00:00:00 sshd
0 S   150 25920 25919  0  76   0 - 16519 -      pts/43   00:00:00 bash
0 S   150 28519     1  0  78   0 - 15962 wait   ?        00:00:00 sh
0 S   150 28520 28519  1  75   0 - 14734 pause  ?        02:30:07 mpstat
0 S   150 28521 28519  0  78   0 - 14727 pipe_w ?        00:01:08 tee
0 R   150 29816 14570  0  77   0 - 15886 -      pts/35   00:00:00 ps
0 S   150 29826 17051  0  79   0 - 14727 -      ?        00:00:00 sleep

As you can see, even with more selection arguments and output arguments, it still OR's the selection filters correctly. That is because, as you can see in the patch, I only added a path for the specific case when -p is the only selection filter. When there are other selection parameters, then ps clearly still needs to go through all processes in /proc, hitting the current path.

Is there a regression test suite for ps? I couldn't find one, but if you can, please pass the patch through it for me and let me know of any regressions - there shoudn't be any.

In any case, the patch is only a proof of concept. I know the "if (selection_method) {...}" is ugly and not ready for merge, or else I'd have sent a merge request to the gitorious repo. I'm only trying to show that an optimization as I have described is possible without a new command line option.

Thanks,
Arthur

Comment 4 Jaromír Cápík 2014-08-08 18:53:26 UTC
Hello Arthur.

(In reply to Arthur Nascimento from comment #3)
> As you can see, even with more selection arguments and output arguments, it
> still OR's the selection filters correctly. That is because, as you can see
> in the patch, I only added a path for the specific case when -p is the only
> selection filter. When there are other selection parameters, then ps clearly
> still needs to go through all processes in /proc, hitting the current path.
> 
> Is there a regression test suite for ps? I couldn't find one, but if you
> can, please pass the patch through it for me and let me know of any
> regressions - there shoudn't be any.
> 
> In any case, the patch is only a proof of concept. I know the "if
> (selection_method) {...}" is ugly and not ready for merge, or else I'd have
> sent a merge request to the gitorious repo. I'm only trying to show that an
> optimization as I have described is possible without a new command line
> option.

I know it is possible, I did exactly the same in my first patch:

+#define MAX_SEL_PID 20
+
 /***** just display */
 static void simple_spew(void){
   proc_t buf;
   PROCTAB* ptp;
-  ptp = openproc(needs_for_format | needs_for_sort | needs_for_select | needs_for_threads);
+
+  pid_t pidlist[MAX_SEL_PID+1];
+  int pidcnt;
+  int flags = needs_for_format | needs_for_sort | needs_for_select | needs_for_threads;
+
+  // optimized -p processing only if there's up to MAX_SEL_PID pids in the list and -p is the only selection switch
+  bzero(pidlist, sizeof(pidlist));     // zero-terminate list
+  if (selection_list && selection_list->typecode==SEL_PID && !selection_list->next && selection_list->n<=MAX_SEL_PID) {
+    flags |= PROC_PID;
+    pidcnt = selection_list->n;
+    while (pidcnt--) {
+      pidlist[pidcnt] = selection_list->u[pidcnt].pid;
+    }
+  }
+
+  ptp = openproc(flags, pidlist);
   if(!ptp) {
     fprintf(stderr, "Error: can not access /proc.\n");
     exit(1);


The above patch doesn't sort the sequence and completely ignores the fact, that things like 'deselect' exist. The problem with this approach is that it is hackish and therefore risky + enforces us to do the sort to avoid regressions. But, the default walkthrough gives us a sequence that is sorted directly by the kernel, not by us and nobody says it always have to be sorted that way. Therefore I don't consider this solution nice and safe.

Regarding the regressions. We do have regression tests only for reported bugs. The problem is, that regressions often appear in stuff that has never been reported before.

Regards,
Jaromir.

Comment 5 Arthur Nascimento 2014-08-08 20:05:32 UTC
Hi Jaromir,

As I said before, the sort is necessary to remove duplicates from the array passed to openproc. Otherwise something like ps -p 1,1 would give wrong results. I did that as a quick hack just to get the results and touch as little as I could of the code. If it were up to me, I would have rewritten a lot of code starting from proc/readproc.c to pass perhaps a fd_set (given that pid_t is usually typedef'd to int anyway) as second argument instead of a pid_t array, as well as the selection_node and everything that touches both (so parser.c would use FD_SET and selection.c would use FS_ISSET). Missing that, I'm sure we both agree that all arrays in selection_node should, at the very least, be sorted on the end of args parsing and bsearch'ed when searching - that sequential search in return_if_matched is more hazardous for performance than the ugly loops in the PoC patch I made.

But I digress. My point is not that that patch should be applied as is. My point is that it shows that ps can have -p with O(1) (for the special and very very common case where -p is the only selection option) with little effort, to which I believe we both agree so far. Moreover, given that it is possible and easy, I believe that it should be done on for RHEL 5 since it is weighting a lot on my production use case and also on everyone else's - even those who haven't noticed yet.

Rewrite it as you see fit, but the -q option is not useful for me because I cannot change the users' scripts. If I could, I would have changed some of the ps calls to [ -d /proc/xxx ] already or other variations on each case.

The support case that triggered this analysis is 01149670, if you haven't seen it yet. It is one of the many fronts we are trying to improve on that situation.

Thanks again,
Arthur

Comment 6 Arthur Nascimento 2014-08-08 23:41:55 UTC
Jaromir,

How about we meet half way? You backport the -q switch and also make ps smart enough to detect when a -p can be upgraded automatically to the quick path? The test is simple enough - it is basically what the set_selection_method() on my patch does (although your implementation has more restrictions than I believe are necessary). This way a "ps -p 1" will behave as "ps -q 1" and we'll both be happy.

Cheers,
Arthur

Comment 7 Jaromír Cápík 2014-08-12 20:42:32 UTC
(In reply to Arthur Nascimento from comment #5)
> Hi Jaromir,

Hi Arthur.


> As I said before, the sort is necessary to remove duplicates from the array
> passed to openproc.

Nope, the sort is also necessary to avoid regressions in the PID order,
that is expected to follow the PID order in /proc, you can get
when you do an unsorted listing with 'ls -f /proc'.
But the order of PIDs does not always have to be ascendant
as in fact the /proc content doesn't even have to be generated
by the kernel. Several years ago I saw a monitoring solution
invented by a smaller mobile operator in order to monitor clusters.
It was based on symlinking PIDs from indexed directories to custom
numbers in /proc. I don't remember much / don't know whether the order
was ascendant. Moreover plain ls sorts the directories alphabetically.
There's a risk that we could suddenly return different results,
and change a custom order to an ascendant one. That means you cannot
safely change the behavior without risking regressions. We would have
to invent a fast method for sorting the PIDs according to the /proc
content in order to keep the black box behaviour unchanged. We could
achieve that by sorting the pidlist to be ascendant and then perform
a bsearch of PIDs taken from /proc (with unchanged order) against
the sorted pidlist. That would produce a new pidlist sorted according
to the /proc order and that should still be much faster than reading
all the content and then deciding whether we like the PID or not.
What do you think? 
Other option would be to risk that with the ascendant order and wait
till someone starts screaming, but that option is not my favorite as
I'm responsible for the component sanity. Maybe we could solve that
by introducing an environmantal variable for switching the -p behaviour
to the optimized one. That could solve your problem too, right?
You would define the variable as system wide and the scripts
calling ps with -p could stay untouched.


> If it were up to me, I would have rewritten a
> lot of code starting from proc/readproc.c to pass perhaps a fd_set (given
> that pid_t is usually typedef'd to int anyway)

Ugh, how do you mean that? We're trying to get rid of mixing data types.
It's generally a bad idea to mix types due to the portability reasons.


> as second argument instead of
> a pid_t array, as well as the selection_node and everything that touches
> both (so parser.c would use FD_SET and selection.c would use FS_ISSET).
> Missing that, I'm sure we both agree that all arrays in selection_node
> should, at the very least, be sorted on the end of args parsing and
> bsearch'ed when searching - that sequential search in return_if_matched is
> more hazardous for performance than the ugly loops in the PoC patch I made.

The sort would have to be done only for -p, not -q and in that case
we could use it in the sorting mechanism I mentioned when writing about
preserving the /proc order. It would pre-generate the pidlist respecting
the /proc order and in case of mixing with other selection options
it would at least make the return_if_matched search faster.

 
> But I digress. My point is not that that patch should be applied as is. My
> point is that it shows that ps can have -p with O(1) (for the special and
> very very common case where -p is the only selection option) with little
> effort, to which I believe we both agree so far. Moreover, given that it is
> possible and easy, I believe that it should be done on for RHEL 5 since it
> is weighting a lot on my production use case and also on everyone else's -
> even those who haven't noticed yet.

The decision is not up to me.


> Rewrite it as you see fit, but the -q option is not useful for me because I
> cannot change the users' scripts.

I understand. The only option would be to write a wrapper for ps to convert
the -p switch to the -q switch. But it is also a dirty hack.


> If I could, I would have changed some of
> the ps calls to [ -d /proc/xxx ] already or other variations on each case.
> 
> The support case that triggered this analysis is 01149670, if you haven't
> seen it yet. It is one of the many fronts we are trying to improve on that
> situation.

Seen that a minute ago.

Please, let me know whether you like the /proc order respecting solution
I offered. The initial sort would be probably a bit longer than ascendant
sorting with qsort, but still much faster than the current -p solution.
That way we could introduce it upstream without worrying much.

Thanks and regards,
Jaromir.

Comment 8 Arthur Nascimento 2014-09-22 20:27:47 UTC
Hi Jaromir,

Sorry about the delay. Work has been heavy in the past few weeks.  I'll try to keep the conversation going on a steady pace from now on.

So if I understood correctly your proposal, it would be to add an environment variable that ps would take as a hint that the order of the results is irrelevant and therefore that it could use the quick way of generating possibly-out-of-order results. That way ps could take the pid list, store it sorted and traverse the /proc pids b-searching the list. I suppose that could help, although I still think it is unnecessary to go through all the /proc traversing. It solves half the syscall load, but the order of complexity is still on a very different level that what could be reached.

So allow me to return to a previous point in our conversation just to make sure I got things right. You said that ps output is expected to be sorted just like unsorted-ls (which GNU's coreutils uses readdir for that purpose, like procps does). But I coudln't find this rule of order written down anywhere in the POSIX spec (though I honestly didn't search for long and just read the ps page and a couple of others). Would you mind pointing it out for me? I just want to get my facts straight.

Because if the order is really important when ps is called without sort options in the command line, then I think we may move forward with your proposal (though see below my next paragraph). However, if it isn't, then the order of the results could be some other (and applications should have never depended on it to begin with).

Moreover, I believe that even if the order needs to be maintained then, in the specific case of traversing pids of /proc, the pids will always be sent sorted by the kernel, since for each opendir on /proc, the kernel generates all entries on the fly. The entries are sorted not by inode link order, since this is not an actual disk, but by the order of each subsystem queried by the kernel to answer the opendir/readdir. On the moment the kernel starts with the pids, they will come in pid order because that is how they are sorted and traversed in the rb-tree in-kernel. That is why I believe that we can dispense with the steps involving /proc traversal to get the same order of 'ls -f' - the numeric sort is already what the kernel provides for the specific case of pid directories.

If my suspicions are correct, then we are reaching a sweet spot that we should really take advantage of (and actually all of procps could). To test it I ran a diff a few thousand times on two heavy-load machines I have and I couldn't find a case when the kernel returns any pid of /proc in a different order than numeric.

diff <(ls -f /proc | grep '[[:digit:]]\+') <(ls /proc | grep '[[:digit:]]\+' | sort -n)

I'll try to consult with the kernel guys later this week and see if that is a behavior we could rely on.

Cheers,
Arthur

Comment 9 Chris Williams 2014-10-08 18:00:53 UTC
RHEL 5 is now in Production Phase 3.
https://access.redhat.com/support/policy/updates/errata/#Production_3_Phase
No further minor release are planned for this product.
This BZ is being closed as it does not meet phase 3 inclusion criteria.


Note You need to log in before you can comment on or make changes to this bug.