Description of problem:
A 32-bit application that uses readdir to read the contents of /proc in order to report process-level activity fails because readdir returns an "Invalid argument" error when it encounters a directory (process) with a PID greater than 32768.

Version-Release number of selected component (if applicable):
$ ls -l /lib/libc.so.6
lrwxrwxrwx 1 root root 11 Jul 14 2010 /lib/libc.so.6 -> libc-2.5.so

How reproducible:
Increase the kernel.pid_max setting to 65536 and use a 32-bit application to read the contents of /proc while one or more processes have a PID above 32768.

Steps to Reproduce:
1. Change the kernel.pid_max setting from 32768 to 65536 by running su root -c "sysctl -w kernel.pid_max=65536"
2. Create enough processes that the PIDs start to exceed 32768
3. Using a 32-bit application, attempt to read the contents of /proc

Actual results:
Shortened results
Found 27712
Found 28539
Found 28540
Found 28542
Found 31147
Found 31170
Found 32162
Error - could not read all contents of /proc: Invalid argument

Expected results:
Shortened
Found 27712
Found 28539
Found 28540
Found 28542
Found 31147
Found 31170
Found 32162
Found 36268
Found 58715
Found 58716

Additional info:
I suspect that this is a kernel bug, since I have found that it works on one distribution using the same version of glibc. Also, this does not appear to be an issue on RHEL 6. It also does not fail when reading a pseudo /proc directory: I created a similar directory structure under /tmp/proc to simulate the top-level directory contents of /proc, and reading that works.
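For readers without the attachment, the kind of reader described above can be sketched as follows. This is a minimal illustration only, not the reporter's test program: the helper names is_pid_name and list_pids are hypothetical, and the output format is copied from the results quoted in the report. It walks a directory with readdir, prints every all-digit entry, and reports the errno left behind if readdir stops early.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if name is all decimal digits (a PID directory). */
static int is_pid_name(const char *name)
{
    if (*name == '\0')
        return 0;
    for (; *name != '\0'; name++) {
        if (!isdigit((unsigned char)*name))
            return 0;
    }
    return 1;
}

/*
 * Hypothetical helper: walk `path` with readdir() and print each PID-like entry.
 * Returns the number of PID entries found, or -1 with errno set if
 * opendir() or readdir() failed.
 */
static long list_pids(const char *path)
{
    DIR *dir;
    struct dirent *entry;
    long count = 0;
    int saved_errno;

    assert(path != NULL);

    dir = opendir(path);
    if (dir == NULL)
        return -1;

    for (;;) {
        /* readdir() returns NULL both at end of directory and on error,
         * so errno must be cleared first to tell the two cases apart. */
        errno = 0;
        entry = readdir(dir);
        if (entry == NULL)
            break;
        if (is_pid_name(entry->d_name)) {
            printf("Found %s\n", entry->d_name);
            count++;
        }
    }

    saved_errno = errno;
    closedir(dir);

    if (saved_errno != 0) {
        fprintf(stderr, "Error - could not read all contents of %s: %s\n",
                path, strerror(saved_errno));
        errno = saved_errno;
        return -1;
    }
    return count;
}
```

Calling list_pids("/proc") from a main function and building the result with gcc -m32 on an x86_64 host gives a 32-bit reader of the sort the report describes. Whether it hits the failure depends on the kernel version and on PIDs above 32768 actually existing.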
John, it would significantly help if you indicated what 32-bit application you are using to read the contents of /proc. I've tried a few on a Red Hat Enterprise Linux 5.5 VM and have not managed to trigger a failure yet.
Created attachment 574466
gzipped tar file containing 32-bit test program and source code for it

Trying to resubmit the test code/executable that shows the issue once the prerequisite conditions I described under "How reproducible" have been met.
I use my own 32-bit binary to read /proc. I had attached a zip file with the source and the binary I built to make this easy. The problem only shows up if you raise the kernel.pid_max setting to, say, 65536 (sysctl -w kernel.pid_max=65536) and create enough processes that PIDs greater than 32768 start to appear. Once this happens, the problem shows itself. Note that you must be on an x64 system for all of this.

From strace, the problem is related to getdents:

getdents(3, /* d_reclen == 0, problem here *//* 1 entries */, 32768) = 6484
getdents(3, /* 0 entries */, 32768) = 0
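To illustrate why a d_reclen of 0 in that trace is fatal to any consumer of the buffer, here is a hedged sketch of how getdents records are walked. It assumes the 32-bit record layout (4-byte d_ino, 4-byte d_off, then a 2-byte d_reclen), and the function name count_dirent_records is hypothetical. It is not taken from glibc, strace, or the kernel. Each record's d_reclen is the only way to find the next record, so a zero value leaves the walker unable to advance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed 32-bit record header: d_ino (4 bytes), d_off (4 bytes), d_reclen (2 bytes). */
#define RECLEN_OFFSET 8u   /* d_reclen follows d_ino and d_off */
#define MIN_RECLEN    10u  /* smallest header that still contains d_reclen */

/*
 * Hypothetical helper: count the records in buf[0..len).
 * Returns the record count, or -1 if a record's d_reclen is too small
 * (including 0) or runs past the end of the buffer.
 */
static int count_dirent_records(const unsigned char *buf, size_t len)
{
    size_t pos = 0;
    int records = 0;

    assert(buf != NULL);

    while (pos < len) {
        uint16_t reclen;

        if (len - pos < MIN_RECLEN)
            return -1;  /* truncated header */

        /* memcpy avoids unaligned reads from the raw byte buffer. */
        memcpy(&reclen, buf + pos + RECLEN_OFFSET, sizeof reclen);

        /* A zero (or undersized) d_reclen would never advance pos. */
        if (reclen < MIN_RECLEN || reclen > len - pos)
            return -1;

        pos += reclen;
        records++;
    }
    return records;
}
```

A reader that hits the -1 case has no safe way to locate the remaining entries in the buffer, which is consistent with the listing stopping partway through /proc.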
$ uname -r
2.6.18-194.el5
$ uname -m
x86_64
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
$ sysctl kernel.pid_max
kernel.pid_max = 65536
$ ps -eaf | tail -3
qa_inst 65279 8149 0 17:53 pts/36 00:00:00 ps -eaf
qa_inst 65280 8149 0 17:53 pts/36 00:00:00 tail -3

Here is the tail of the sample program's output:

Found 31170
Found 32162
Error - could not read all contents of /proc: Invalid argument
Thanks John. I've been able to reproduce the problem. As you hinted in the initial report, right now this appears to be a kernel problem. I'm still doing some analysis, but the signs are pointing in that direction.
You are welcome. The problem appears to be isolated to /proc. Before I found a system on which I could reproduce it, I had tried to simulate it by making a copy of /proc under /tmp (/tmp/proc/...) and creating directories to make it look like there were processes with PIDs > 32768, and I could not reproduce it that way. I have also been trying to see whether this is a generic issue or specific to certain architectures. So far, I have only been able to find a similar system running on s390x, and the problem does not appear to be there. I am trying to find a similar system on IA64 and PowerPC, but I have yet to do so. So as of now, I have only seen this on x64.
This was fixed in Red Hat Enterprise Linux 5.6, which was released with kernel 2.6.18-238.el5. Simple bisection shows that 2.6.18-219.el5 fails while 2.6.18-221.el5 works. Looking at the ChangeLogs, this change stands out as potentially fixing the problem, perhaps as a side effect of the RFE:

- [fs] proc: add file position and flags info in /proc (Jerome Marchand) [498081]

Regardless of precisely which change in the 220/221 kernel fixed the bug, the errata for the kernel update is here:
http://rhn.redhat.com/errata/RHSA-2011-0017.html