Bug 158277

Summary:	ps <pid> can sometimes not see <pid>
Product:	Red Hat Enterprise Linux 3
Component:	kernel
Version:	3.0
Hardware:	All
OS:	Linux
Status:	CLOSED WONTFIX
Severity:	low
Priority:	medium
Reporter:	Alan Tyson <alan.tyson>
Assignee:	Peter Martuccelli <peterm>
QA Contact:	Brian Brock <bbrock>
CC:	albert, kzak, petrides, rick.stern
Doc Type:	Bug Fix
Last Closed:	2005-05-31 19:52:39 UTC

Description Alan Tyson 2005-05-20 09:00:24 UTC

Description of problem:
This problem was first seen on EL 2.1.  The ps command, when a single pid is specified, can sometimes report nothing, wrongly suggesting that the pid has died.  This can happen on systems where there are a large number of short-lived processes.  This is a major problem for software that monitors the health of certain applications on the system, because the monitoring software can conclude that the application has died.  The chances of this problem happening on EL 3 and 4 are significantly lower.

The ps command, regardless of whether it is asked to look at a single process or 
all processes on the system, reads the details of every process on the 
system and then selects only the one(s) which qualify for the command.

ps uses the /proc/ filesystem to look at the process information using  
interfaces in libproc.  This involves an opendir of /proc and then readdir 
calls to get all of the entries in there.  For each process entry that readdir 
returns, additional information is collected from the "files" in the /proc/<pid>/ directory.  In the case of a ps command for a single pid, most of this information is collected and thrown away.
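
A rough shell sketch of the scan described above, not the libproc code itself.  The function name scan_for_pid and its optional PROC_ROOT argument are made up for illustration; it walks every numeric entry, reads its stat file, and keeps only the requested pid:

```shell
#!/bin/bash
# scan_for_pid WANTED [PROC_ROOT]   (hypothetical helper, illustration only)
# Walks every numeric entry under PROC_ROOT (default /proc), reads its stat
# file, prints only the line for WANTED, then prints how many entries were
# read. Real ps goes through libproc and reads more files per process.
scan_for_pid() {
    local wanted=$1 root=${2:-/proc} dir stat read_count=0
    for dir in "$root"/[0-9]*; do
        # A process can exit between the directory listing and this read.
        { IFS= read -r stat < "$dir/stat"; } 2>/dev/null || continue
        read_count=$(( read_count + 1 ))
        # Everything except the requested pid is read and then discarded.
        if [[ ${dir##*/} == "$wanted" ]]; then
            echo "$stat"
        fi
    done
    echo "entries read: $read_count"
}
```

Calling scan_for_pid with the shell's own pid on a busy system should show one matching line and an "entries read" count far larger than one, which is the wasted work in question.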

readdir buffers information from the /proc directory.  It calls getdents64(), 
which returns directory entries in blocks.  These blocks hold 20 entries each 
(once it is reading pid entries from the filesystem).  The /proc root directory 
lists the non-pid entries first, followed by the pid entries.  The pid 
directories are generated on the fly from the kernel's current task list each 
time the directory is read; proc_pid_readdir does this, returning up to 20 
"directory" entries per call.  Since it is not possible to return all pid 
directories in one call, the proc filesystem layer has to keep track of the 
current "file" pointer between calls.  It is this which is causing the problem.

On RHEL 2.1 the first call to getdents64() will return the first 20 pids beginning at the start of the system's task list (let's call these entry numbers 1..20).  The second call will return entry numbers 21..40 and so on.  The problem is that if a process dies, every process after it in the task list moves up one position, so a directory read that is in progress can skip an entry.

For example, let's say that pids 1..20 are the first 20 in the task list, pids 
21..40 are the following 20 pids and 41..50 are the last 10.  In the first read 
of the pids getdents64 will return pids 1..20.  If pid 18 dies after this, but 
before the second call to getdents64, then the first 20 entries become 
1..17,19,20,21, so the second call to getdents64 will return pids 22..41.  In 
other words, to the caller of getdents64() pid 21 has been skipped.  This 
scenario can take place even when a "ps 21" command is done.
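
The scenario can be replayed with a small bash simulation.  This is only a model of the behaviour described here, not kernel code: the function name read_with_offset is made up, the task list is a plain array, and the resume position is a numeric index into it.

```shell
#!/bin/bash
# read_with_offset CHUNK DYING_PID PID...   (hypothetical model, not kernel code)
# Reads the task list CHUNK entries at a time, resuming each read at a numeric
# offset. DYING_PID exits after the first read. Prints every pid the reader
# saw, space separated on one line.
read_with_offset() {
    local chunk=$1 dying=$2; shift 2
    local tasks=("$@") seen=() offset=0 first=1 p
    while (( offset < ${#tasks[@]} )); do
        # One getdents64-like call: up to CHUNK entries starting at offset.
        seen+=("${tasks[@]:offset:chunk}")
        offset=$(( offset + chunk ))
        if (( first )); then
            # After the first read DYING_PID exits and the list closes up.
            first=0
            local kept=()
            for p in "${tasks[@]}"; do
                [[ $p == "$dying" ]] || kept+=("$p")
            done
            tasks=("${kept[@]}")
        fi
    done
    echo "${seen[*]}"
}

# pids 1..50, blocks of 20, pid 18 exits between the first and second read.
out=" $(read_with_offset 20 18 {1..50}) "
[[ $out == *" 21 "* ]] && echo "pid 21 seen" || echo "pid 21 skipped"
# prints "pid 21 skipped"
```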

The routine get_pid_list() in fs/proc/base.c is where this is done.

I notice that in RHEL3 and above, the concept of a cursor was added to 
get_pid_list.  In these releases, get_pid_list() stores a pointer to the 
last pid entry in the private_data field of the file struct.  This allows the reading of the /proc directory to continue from where it left off without falling into the above problem, except in the case where that last pid itself has died.  In my above scenario, if pid 20 had died, then the same problem would happen, even with this enhanced code in get_pid_list().

My suggestion would be to enhance ps so that if a single pid is passed as an argument then it just gets information about that pid and does not search the whole /proc/ directory.

As a workaround, I have recommended that the ps command not be used to monitor applications and that the contents of /proc/<pid>/stat be checked instead.
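
A minimal sketch of such a check, assuming the monitor knows both the pid and the expected command name.  The helper name pid_has_name and its optional PROC_ROOT argument are illustrative only; the sketch relies on the stat file starting with "pid (comm) state":

```shell
#!/bin/bash
# pid_has_name PID NAME [PROC_ROOT]   (hypothetical helper, illustration only)
# Succeeds if PROC_ROOT/PID/stat can be read and its command field is
# "(NAME)". PROC_ROOT defaults to /proc.
pid_has_name() {
    local pid=$1 name=$2 root=${3:-/proc} stat
    # Open the one stat file directly; the /proc directory is never scanned.
    # The read fails if the process no longer exists.
    { IFS= read -r stat < "$root/$pid/stat"; } 2>/dev/null || return 1
    # The line begins with: pid (comm) state ...
    case $stat in
        "$pid ($name) "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use in a monitoring loop, in place of "ps <pid> | grep mysqld":
#   if ! pid_has_name "$p_pid" mysqld; then echo "mysqld not found"; fi
```

Because only one path is opened, the result does not depend on how the kernel enumerates the pid directories.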

Version-Release number of selected component (if applicable):
procps-2.0.17-13

How reproducible:
Sometimes

Steps to Reproduce:
On EL 2.1, run 10 copies of this script at the same time.  One of the ps commands will fail to find mysqld after a few hours:
  #!/bin/bash
  # Poll "ps <pid>" for mysqld twice a second and log every miss.
  PROC="mysqld"
  PID_FILE="mysql.pid"
  RETRY_INTERVAL=500000   # microseconds, as expected by usleep
  p_pid=$(cat "$PID_FILE")
  if [ -z "$p_pid" ]; then
      echo "$(date '+%b %e %T')" >> loop.sh.log
      echo "PID file empty!!!" >> loop.sh.log
      exit 1
  fi
  while :; do
      pid=$(ps "$p_pid" | grep "$PROC" | awk '{print $1}')

      if [ -z "$pid" ]; then
          # ps did not list the process: log the time and both ps views.
          echo "$(date '+%b %e %T')" >> loop.sh.log
          echo "mysqld not found!!!" >> loop.sh.log
          echo "p_pid output" >> loop.sh.log
          ps "$p_pid" >> loop.sh.log
          echo "full ps output" >> loop.sh.log
          ps -ef >> loop.sh.log
      fi
      usleep "$RETRY_INTERVAL"
  done

On EL3 and EL4 I was not able to duplicate the problem with the above script. However, I believe that it would just take more time before the problem was seen (and the "right" process died during one of the ps commands).

Additional info:

Comment 1 Karel Zak 2005-05-20 09:40:51 UTC

Yes, I agree that output from "strace -e trace=open ps <pid>" looks strange and
"ps" wastes time reading unnecessary files. But this problem must be
resolved upstream. I'm not sure about a fix in EL3 or EL4.

Comment 2 Albert Cahalan 2005-05-21 21:33:08 UTC

This is a serious kernel bug. It's also an "I told you so" bug; you can see my
arguments against the readdir cursor hack on the linux-kernel mailing list.

Somebody, Hugo I believe, had a tree-based /proc lookup that was 100% reliable.
Patch that into your kernel and you might be all set.

There was another problem, probably fixed by now:

Any remaining problems would be caused by glibc and the kernel disagreeing
on the size of a struct being used for directory reads. I do not know if this
has since been fixed. If strace reveals seeks on the directory reads, then
you are likely to have this problem also. Encoding the PID into the directory
offset would take care of this.
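
A small bash model of that last idea, under loud assumptions: the function name read_with_pid_cursor is made up, the task list is a plain array sorted by pid, and the resume position is the last pid returned rather than an index. It is not taken from any kernel patch.

```shell
#!/bin/bash
# read_with_pid_cursor CHUNK DYING_PID PID...   (hypothetical model)
# Reads the task list CHUNK entries at a time. Each read resumes at the first
# pid greater than the last pid returned, so the resume position survives
# entries disappearing. DYING_PID exits after the first read.
# Assumes the PID arguments are given in ascending order.
read_with_pid_cursor() {
    local chunk=$1 dying=$2; shift 2
    local tasks=("$@") seen=() last=0 first=1 n p
    while :; do
        # One read: up to CHUNK live pids greater than the last one returned.
        n=0
        for p in "${tasks[@]}"; do
            (( p > last )) || continue
            seen+=("$p"); last=$p
            (( ++n < chunk )) || break
        done
        (( n > 0 )) || break
        if (( first )); then
            # After the first read DYING_PID exits and the list closes up.
            first=0
            local kept=()
            for p in "${tasks[@]}"; do
                [[ $p == "$dying" ]] || kept+=("$p")
            done
            tasks=("${kept[@]}")
        fi
    done
    echo "${seen[*]}"
}

# pids 1..50, blocks of 20; pid 20, the last entry of the first block, exits.
seen=" $(read_with_pid_cursor 20 20 {1..50}) "
[[ $seen == *" 21 "* ]] && echo "pid 21 seen" || echo "pid 21 skipped"
# prints "pid 21 seen"
```

In this model the reader still reports pid 20 (it was alive when the first block was read) and misses nothing that stays alive for the whole scan.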

Comment 3 Rick Stern 2005-05-26 16:13:31 UTC

Karel - at least for now, is RH going to file this one in kernel.org?

Comment 4 Albert Cahalan 2005-05-27 04:47:42 UTC

(In reply to comment #2)

> Somebody, Hugo I believe, had a tree-based /proc lookup that was 100% reliable.
> Patch that into your kernel and you might be all set.

That is probably Hugh, not Hugo. I mean the guy who was posting small patch sets
a while back, with maybe a dozen patches or so.

Comment 5 Karel Zak 2005-05-27 07:31:40 UTC

Albert, I still don't understand why "ps <pid>" doesn't read /proc/<pid>/
directly. Why does it read files in other directories? It reads
/proc/#/stat|status|cmdline for all processes. Why?

Comment 6 Albert Cahalan 2005-05-27 16:31:29 UTC

That's simply a lack of optimization. Before you go trying to hack this,
please remember:

1. There is a -N option.
2. One may use "ps -p 42 -a -U root -p 1" and similar.
3. Consider threads.
4. There are likely to be other "interesting" issues.

Asking for just one PID isn't obscure, but I'm not sure it's all that
common either. So I don't know if this optimization is worth the code
complexity. It might be worthwhile.

NOTE NOTE NOTE!!!

Adding such an optimization does NOT fix the /proc problem. Regular ps
listings will still miss processes from time to time. I find this to be
seriously unacceptable. I was rather shocked when Linus accepted the
band-aid hack (saved /proc read cursor) instead of the reliable version.
Perhaps with a fresh patch and this new bug report you can get a real fix
into the kernel. Until then, all tools that read /proc to find processes
will be unreliable.

Comment 7 Ernie Petrides 2005-05-31 19:52:39 UTC

This issue needs to be resolved upstream first.

Comment 8 Albert Cahalan 2005-06-01 17:21:57 UTC

I find it somewhat odd that this was marked CLOSED/WONTFIX, because:

1. it makes many monitoring tools unreliable (not just ps)
2. a kernel patch is available
3. with real evidence of problems, pushing the patch upstream should be doable

While time has passed, backing out the /proc cursor hack should be doable.

Comment 9 Rick Stern 2006-05-15 20:15:06 UTC

Albert - do you have the kernel bugzilla number that was fixed?

Comment 10 Albert Cahalan 2006-05-16 02:07:06 UTC

As far as I know, very few kernel bug reports go via the bugzilla.

The current status regarding this bug:

1. glibc behavior is unknown (must NOT use lseek on /proc directory)
2. the 2.6 kernel has several band-aid hacks related to /proc readdir

So you WILL be seeing this bug, but less often than in the past.

Hugh's tree-based /proc lookup patch certainly no longer applies
to the current kernel source. There have been locking changes
since then, and a couple band-aid hacks were added.