From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.8) Gecko/20050512 Firefox/1.0.4

Description of problem:
We have a number of 2-CPU servers with HTT enabled on the processors. They run various Red Hat kernels, versions 2.4.21 and above. On all of these servers we see a large number of "stopped processes", i.e. processes shown with status "T" in the output of the ps(1) command. If the servers run unattended for a long time, their total number of processes grows very large. This degrades the servers' performance and in some cases causes errors of the type "cannot fork".

Version-Release number of selected component (if applicable):
kernel-2.4.21 and above

How reproducible:
Always

Steps to Reproduce:
1. Run a multiprocessor machine with HTT enabled on the processors
2. Leave it running and executing tasks for several days
3. Execute ps -ax | grep T

Actual Results:
From day to day, the number of "stopped processes" (status "T" in the ps(1) output) keeps growing.

Expected Results:
Normally, there should be only a few processes with this status.

Additional info:
I suppose the problem is caused by the process scheduler running on an SMP machine with HTT enabled.
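A note on step 3: `ps -ax | grep T` matches any line that contains a capital T, including command names and arguments, so it can overcount. Below is a minimal sketch of a stricter filter that looks only at the state column. It assumes a procps-style ps that accepts `-o pid=,stat=,args=`; the function names `stopped_only` and `list_stopped` are hypothetical and not part of the original report.

```shell
# Keep only "PID STAT COMMAND..." lines (read from stdin) whose
# state field begins with T, i.e. stopped processes.
stopped_only() {
    awk '$2 ~ /^T/'
}

# List stopped processes on the running system. The trailing '='
# on each column name suppresses the header line.
list_stopped() {
    ps -eo pid=,stat=,args= | stopped_only
}
```

Running `list_stopped | awk 'END { print NR }'` from cron once a day would give a rough count that can be compared from day to day.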
Thanks for your report, Vlady. Could you please specify the exact (most recent) kernel version that exhibits this problem? Also, can you give us an idea of what sort of processes are stopped? And do they go away if they are killed?
Vlady - Could you also capture 'ps -ax', sysrq-t, and sysrq-m output from a system that is experiencing this problem?
Below is a small part of the "ps -ax | grep T" output from a server with the 2.4.21-27.0.4.ELsmp kernel and HTT enabled.

26337 ?  T  0:00 cut -b 7-
24464 ?  T  0:00 sed s/$/<NL>/g
23302 ?  T  0:00 /bin/sed s|/|.|g
 9958 ?  T  0:00 id -u
 8551 ?  T  0:00 netstat -tln
22400 ?  T  0:00 mkdir -p /some/dir
 1697 ?  T  0:00 netstat -tln
13750 ?  T  0:00 /bin/sed s|/dev/||
20597 ?  T  0:00 netstat -tln
25570 ?  T  0:00 /bin/sed s|/dev/||
 7115 ?  T  0:00 /bin/sh /bin/egrep -q (^|:)/usr/X11R6/bin($|:)
22075 ?  T  0:00 id -un
 1386 ?  T  0:00 /usr/bin/tty
 2272 ?  T  0:00 sh -c date +%Z 2> /dev/null
21237 ?  T  0:00 sh -c sysctl fs.file-max
28206 ?  T  0:00 /bin/sh -c /usr/lib/sa/sa2 -A
32587 ?  T  0:00 sh -c date +%Z 2> /dev/null
30494 ?  T  0:00 sh -c date +%Z 2> /dev/null
 4591 ?  T  0:00 netstat -tln
 9427 ?  T  0:00 /bin/sed s|/dev/||

Sorry, but I can't supply you with the SysRq results. All of our servers that experience the "stopped processes" problem are in production, and we don't want to experiment with their kernels! :(
Also, I don't have console access to our servers, so I can't even execute the sysrq + t or sysrq + m keyboard sequences.
Vlady - You can use sysrq-trigger remotely:

# enable sysrq-trigger
$ echo 1 > /proc/sys/kernel/sysrq
# sysrq-t
$ echo t > /proc/sysrq-trigger
# sysrq-m
$ echo m > /proc/sysrq-trigger

The sysrq info is really important - I can't make any suggestions unless I can see where these processes are blocking in the kernel.
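For convenience, the sysrq-trigger steps can be wrapped in one function that also saves the result. This is only a sketch under some assumptions: it must run as root, it relies on sysrq output going to the kernel log, and it reads that log with `dmesg`. The function name `capture_sysrq`, the default output file name, and the `PROC_ROOT` and `DMESG` overrides are hypothetical; the overrides exist only so the function can be dry-run against a scratch directory.

```shell
# Trigger sysrq-t (task list) and sysrq-m (memory info), then save
# the kernel log to a file. Run as root on the affected server.
capture_sysrq() {
    out=${1:-sysrq-dump.txt}      # hypothetical default file name
    proc=${PROC_ROOT:-/proc}      # override only for dry runs
    dmesg_cmd=${DMESG:-dmesg}     # override only for dry runs

    echo 1 > "$proc/sys/kernel/sysrq" || return 1   # enable sysrq
    echo t > "$proc/sysrq-trigger"    || return 1   # sysrq-t
    echo m > "$proc/sysrq-trigger"    || return 1   # sysrq-m
    $dmesg_cmd > "$out"                             # save kernel log
}
```

On a machine with many processes the sysrq-t listing may be longer than the kernel ring buffer that `dmesg` reads. In that case the system log (often /var/log/messages, depending on the syslog configuration) may hold a more complete copy.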
Hi Vlady,

Were you able to use the sysrq-trigger mechanism I mentioned above? I'll need the sysrq-t and sysrq-m info in order to see what's happening with the stopped processes. For now, I'm going to close this issue. Please re-open it if you are able to collect the info (and are still experiencing the problem).