Bug 159799

Summary: A lot of stopped processes on an SMP system with HTT enabled
Product: Red Hat Enterprise Linux 3 Reporter: Vlady <vlady>
Component: kernel Assignee: Don Howard <dhoward>
Status: CLOSED CANTFIX QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0 CC: petrides
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2005-11-07 21:31:33 UTC Type: ---

Description Vlady 2005-06-08 08:26:45 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.8) Gecko/20050512 Firefox/1.0.4

Description of problem:
We have a number of 2-CPU servers with HTT enabled on the processors.

They run various Red Hat kernels, version 2.4.21 and above.
On all of these servers we encounter a problem with a large number of "stopped processes", i.e. processes shown with status ``T'' in the output of the ps(1) command.

If the servers run unattended for a long time, the total number of processes grows very large. This affects the servers' performance and in some cases causes "cannot fork" errors.

Version-Release number of selected component (if applicable):
kernel-2.4.21 and above

How reproducible:
Always

Steps to Reproduce:
1. Run a multiprocessor machine with HTT enabled on the processors
2. Leave it running and executing tasks for several days
3. Execute ps -ax | grep T
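
Note that a plain grep for T also matches any line that merely contains a capital T, for example in a command's arguments. A minimal sketch that filters on the STAT column instead is given below; it assumes a procps-style ps that accepts -o with empty column headers, and the function names filter_stopped and count_stopped are made up for illustration:

```shell
#!/bin/sh
# Sketch: select stopped processes by the STAT column instead of grepping for "T".
# Input format assumed: "PID STAT COMMAND..." with no header line.

# Keep only lines whose second field (STAT) starts with T.
filter_stopped() {
    awk '$2 ~ /^T/'
}

# Count such lines; prints 0 when there are none.
count_stopped() {
    awk '$2 ~ /^T/ { n++ } END { print n + 0 }'
}

# Live listing; the trailing "=" on each column suppresses the header.
ps -e -o pid= -o stat= -o args= | filter_stopped
echo "stopped: $(ps -e -o pid= -o stat= -o args= | count_stopped)"
```

Running the count once a day (from cron, for instance) would show whether the number really grows over time.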
  

Actual Results:  From day to day the number of "stopped processes" (with status ``T'' in the ps(1) output) keeps growing.

Expected Results:  Normally, only a few processes should have this status.

Additional info:

I suppose the problem is caused by the process scheduler running on an SMP machine with HTT enabled.

Comment 1 Ernie Petrides 2005-06-09 00:18:19 UTC
Thanks for your report, Vlady.  Could you please specify the exact
(most recent) kernel version that exhibits this problem?  Also, can
you give us an idea of what sort of processes are stopped?  And do
they go away if they are killed?
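
A related check is whether a stopped process resumes when it is sent SIGCONT. The following is only a sketch, assuming a Linux /proc filesystem; proc_state and try_resume are hypothetical helper names, not existing tools:

```shell
#!/bin/sh
# Sketch: report the state of one process, then see whether SIGCONT changes it.

# Print the state letter from /proc/PID/status, or "gone" if the PID is absent.
proc_state() {
    if [ -r "/proc/$1/status" ]; then
        awk '/^State:/ { print $2 }' "/proc/$1/status"
    else
        echo gone
    fi
}

# Send SIGCONT, wait a second, and print the state before and after.
try_resume() {
    before=$(proc_state "$1")
    kill -CONT "$1" 2>/dev/null
    sleep 1
    echo "$1: $before -> $(proc_state "$1")"
}

# Usage: try_resume 26337   (a PID taken from the ps listing)
```

If the state stays T after SIGCONT, the same before/after comparison can be repeated with kill -KILL to see whether the process goes away.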

Comment 2 Don Howard 2005-06-09 00:34:05 UTC
Vlady -

Could you also capture 'ps -ax', sysrq-t, and sysrq-m output from a system that
is experiencing this problem?

Comment 3 Vlady 2005-06-09 08:08:46 UTC
Below is a small part of the output of "ps -ax | grep T" on a server with the
2.4.21-27.0.4.ELsmp kernel and HTT enabled.

26337 ?        T      0:00 cut -b 7-
24464 ?        T      0:00 sed s/$/<NL>/g
23302 ?        T      0:00 /bin/sed s|/|.|g
 9958 ?        T      0:00 id -u
 8551 ?        T      0:00 netstat -tln
22400 ?        T      0:00 mkdir -p /some/dir
 1697 ?        T      0:00 netstat -tln
13750 ?        T      0:00 /bin/sed s|/dev/||
20597 ?        T      0:00 netstat -tln
25570 ?        T      0:00 /bin/sed s|/dev/||
 7115 ?        T      0:00 /bin/sh /bin/egrep -q (^|:)/usr/X11R6/bin($|:)
22075 ?        T      0:00 id -un
 1386 ?        T      0:00 /usr/bin/tty
 2272 ?        T      0:00 sh -c date +%Z 2> /dev/null
21237 ?        T      0:00 sh -c sysctl fs.file-max
28206 ?        T      0:00 /bin/sh -c /usr/lib/sa/sa2 -A
32587 ?        T      0:00 sh -c date +%Z 2> /dev/null
30494 ?        T      0:00 sh -c date +%Z 2> /dev/null
 4591 ?        T      0:00 netstat -tln
 9427 ?        T      0:00 /bin/sed s|/dev/||


Sorry, but I can't supply you with the SysRq output. All our servers which
experience the "stopped processes" problem are in production, and we don't want
to experiment with their kernels! :(

Comment 4 Vlady 2005-06-09 08:24:55 UTC
Also, I don't have console access to our servers, so I can't even execute the
sysrq + t or sysrq + m keyboard sequences.

Comment 5 Don Howard 2005-06-09 17:40:27 UTC
Vlady -

You can use sysrq-trigger remotely:

# enable sysrq-trigger
$ echo 1 > /proc/sys/kernel/sysrq 

# sysrq-t
$ echo t > /proc/sysrq-trigger

# sysrq-m
$ echo m > /proc/sysrq-trigger


The sysrq info is really important - I can't make any suggestions unless I can
see where these processes are blocking in the kernel.
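
The sysrq dumps are written to the kernel log rather than to the terminal, so they have to be collected afterwards. A minimal sketch of doing the whole sequence in one step follows; capture_sysrq is a hypothetical helper, the output path is an arbitrary example, and the SYSRQ_CTL and SYSRQ_TRIGGER variables are assumptions added only so the function can be dry-run against ordinary files:

```shell
#!/bin/sh
# Sketch: trigger sysrq-t and sysrq-m, then save the kernel log to a file.
# Run as root on the affected host. By default the two paths below point at
# the real /proc entries; override them only for a dry run.
capture_sysrq() {
    out=$1
    ctl=${SYSRQ_CTL:-/proc/sys/kernel/sysrq}
    trig=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}

    old=$(cat "$ctl")            # remember the current sysrq setting
    echo 1 > "$ctl"              # enable sysrq
    echo t > "$trig"             # task list with kernel stack traces
    echo m > "$trig"             # memory usage summary
    # The kernel ring buffer should now hold both dumps.
    dmesg > "$out" 2>/dev/null || echo "warning: dmesg failed; check syslog instead" >&2
    echo "$old" > "$ctl"         # restore the previous setting
    echo "saved kernel log to $out"
}

# Usage (as root): capture_sysrq /tmp/sysrq-output.txt
```

On a system with many processes the task dump may be larger than the ring buffer that dmesg reads, in which case the copy that syslog keeps (typically /var/log/messages) is likely to be more complete.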

Comment 6 Don Howard 2005-11-07 21:31:33 UTC
Hi Vlady

Were you able to use the sysrq-trigger mechanism I mentioned above? I'll need
the sysrq-t & sysrq-m info in order to see what's happening with the stopped
processes.

For now, I'm going to close this issue.  Please re-open it if you are able to
collect the info (and are still experiencing the problem).