Bug 112028

Summary: x440 crashes under heavy load in X
Product: Red Hat Enterprise Linux 2.1
Reporter: john stultz <johnstul>
Component: XFree86
Assignee: X/OpenGL Maintenance List <xgl-maint>
Status: CLOSED WONTFIX
QA Contact: David Lawrence <dkl>
Severity: high
Priority: medium
Version: 2.1
CC: alan, jdennis, kmannth, lwoodman, mharris, tao, tburke, wendyh
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2005-05-12 05:31:44 UTC
Bug Blocks: 143573
Attachments:
XF86Config-4 file (flags: none)
XFree86.0.log file (flags: none)
/var/log/messages (flags: none)

Description john stultz 2003-12-13 02:01:22 UTC
From Bugzilla Helper: 
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux) 
 
Description of problem: 
RHEL2.1AS-U3-re1209.0 w/ the e.34 summit kernel 
 
When running the "pounder" test suite (formerly known as tools10) to 
create a heavy load under X, moving windows around will cause the 
machine to crash. It appears X dies as the console image hangs for a 
second and then the monitor clicks resolutions and goes black. The 
system is not pingable, nor does the keyboard respond.  
 
So far the crash is not seen when the test is left to run on its 
own. It only occurs while I'm moving windows around. So far I have 
been unable to reproduce the problem by moving/resizing windows 
around without "pounder" running.  
 
Version-Release number of selected component (if applicable): 
XFree86-4.1.0-50.EL and kernel-summit-2.4.9-e.34 
 
How reproducible: 
Always 
 
Steps to Reproduce: 
1. Boot e34 summit kernel on x440 
2. Log into X and run the "pounder" script 
3. While stress-test is running, move windows around 
     
 
Actual Results:  System hangs for a second, then the monitor clicks 
to black. 
 
Expected Results:  No problems.  
 
Additional info: 
 
I'm going to let the tests run over the weekend w/o moving windows 
to see if the problem can be triggered on its own. XF86 config info 
to follow soon.

Comment 1 john stultz 2003-12-13 02:03:09 UTC
Created attachment 96509 [details]
XF86Config-4 file

Comment 2 Mike A. Harris 2003-12-13 05:32:05 UTC
Can you provide your X server log file from the failed session,
your complete /var/log/messages, and the output of lsmod from
inside X prior to running the tests.

Thanks in advance.

Comment 3 john stultz 2003-12-15 18:37:48 UTC
Created attachment 96544 [details]
XFree86.0.log file

Sorry, I meant to attach these files as well, but Bugzilla stopped letting me
connect to it just after posting the config file.

Comment 4 keith mannth 2003-12-15 18:48:13 UTC
  We saw this while testing the previous Beta ISOs as well.  We
wanted to confirm it happened with these ISOs.  On the previous ISOs
we were able to trigger this without the workload.



Comment 5 john stultz 2003-12-15 18:50:47 UTC
Created attachment 96545 [details]
/var/log/messages

You can see the hang at Dec 12 08:59:55

Comment 6 john stultz 2003-12-15 18:59:09 UTC
[root@elm3b59 root]# lsmod 
Module                  Size  Used by    Not tainted 
ide-cd                 35264   0  (autoclean) 
cdrom                  35264   0  (autoclean) [ide-cd] 
soundcore               7940   0  (autoclean) 
autofs                 13828   0  (autoclean) (unused) 
eepro100               21968   1  
tg3                    50304   1  
usb-uhci               26948   0  (unused) 
usbcore                68800   1  [usb-uhci] 
ext3                   70944   2  
jbd                    55336   2  [ext3] 
aic7xxx               127200   3  
sd_mod                 13920   3  
scsi_mod              126876   2  [aic7xxx sd_mod] 
 

Comment 7 Tim Burke 2003-12-17 03:10:31 UTC
Was this same test scenario completed successfully on RHEL2.1 U2? 
Just trying to identify whether this is a regression or a new test
scenario which may have never succeeded.


Comment 9 Mike A. Harris 2003-12-17 05:00:33 UTC
Can you use:

    Option "noaccel"

in the video driver section of the config file, and rerun the test
suite, and try to reproduce the bug?  If the bug is not reproducible
with the option set, we can disable acceleration on this savage
chip, or on the combination of this chip being used on this piece
of hardware.  Video will be much slower for the common case, but
stability will be much higher.

If the above prevents the problem from occurring, and you would like
to retain some video acceleration, please comment out the noaccel
option, and try using the various XaaNo.... options documented in
the XF86Config manpage to determine which (if any) of the
acceleration primitives might be the catalyst.  If this can be
determined, and stability retained with only the problematic accel
primitives disabled, then we can disable just the problem ones,
and keep some level of acceleration.

I don't know if Red Hat has this specific hardware internally or
not for us to attempt to reproduce as well, but I will inquire.
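
A minimal sketch of the config change requested above, assuming the video driver section is the `Section "Device"` block of XF86Config-4. The helper name `add_device_option` and the sample identifiers (`Monitor0`, `Videocard0`) are hypothetical and are not taken from the attached config file; only the option names and the savage driver come from this report.

```python
def add_device_option(config_text: str, option: str) -> str:
    """Return config_text with `Option "<option>"` added to each Device section.

    XF86Config keywords are case-insensitive, so matching is done on a
    lowercased copy of each line. The option is written just before the
    EndSection that closes the Device section.
    """
    out = []
    in_device = False
    for line in config_text.splitlines():
        lowered = line.strip().lower()
        if lowered.startswith("section") and '"device"' in lowered:
            in_device = True
        elif lowered == "endsection" and in_device:
            out.append(f'    Option "{option}"')
            in_device = False
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical sample; the real file is attachment 96509.
SAMPLE_CONFIG = """\
Section "Monitor"
    Identifier "Monitor0"
EndSection

Section "Device"
    Identifier "Videocard0"
    Driver "savage"
EndSection
"""

if __name__ == "__main__":
    print(add_device_option(SAMPLE_CONFIG, "noaccel"))
```

The same helper can be called with one of the XaaNo... option names instead of `noaccel` when narrowing down individual acceleration primitives.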


Comment 12 Mike A. Harris 2003-12-17 05:32:44 UTC
Can you also please try disabling hyperthreading on this system to
see if that makes a difference?

Comment 13 Mike A. Harris 2003-12-17 06:08:05 UTC
Here are some tidbits from the attached kernel log which I thought
might be useful to highlight for anyone looking into this report:

I see the following from a previous boot:

Dec 12 07:05:46 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
init_module: No such device
Dec 12 07:05:46 elm3b59 insmod: Hint: insmod errors can be caused by
incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:05:46 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
insmod eth2 failed
Dec 12 07:05:46 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
init_module: No such device
Dec 12 07:05:46 elm3b59 insmod: Hint: insmod errors can be caused by
incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:05:46 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
insmod eth2 failed
Dec 12 07:06:07 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
init_module: No such device
Dec 12 07:06:07 elm3b59 insmod: Hint: insmod errors can be caused by
incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:06:07 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
insmod eth2 failed
Dec 12 07:06:08 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
init_module: No such device
Dec 12 07:06:08 elm3b59 insmod: Hint: insmod errors can be caused by
incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:06:08 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
insmod eth2 failed
Dec 12 07:06:09 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
init_module: No such device
Dec 12 07:06:09 elm3b59 insmod: Hint: insmod errors can be caused by
incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:06:09 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
insmod eth2 failed
Dec 12 07:06:09 elm3b59 insmod:
/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
init_module: No such device


Please make absolutely sure when doing any kernel or system testing,
that you are using the official Red Hat compiled kernel with no
3rd party or external kernel modules being used.  I note that the
newer kernel bootup doesn't appear to be using any 3rd party
kernel modules, but I just wanted to make sure that you are using
only Red Hat supplied kernel modules as we do not support systems
that are using 3rd party or self-built modules.



Not all of the CPUs have the same bogomips results, which seems
very odd:

Dec 12 08:34:02 elm3b59 kernel: Calibrating delay loop... 199.88 BogoMIPS
Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 1.58 BogoMIPS
Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 3.27 BogoMIPS

...

Dec 12 08:34:04 elm3b59 kernel: Total of 16 processors activated
(2802.83 BogoMIPS).

...


Possible kernel bug?

Dec 12 08:59:56 elm3b59 kernel: Local APIC address fee00000
Dec 12 08:59:56 elm3b59 kernel: init.c:148: bad pte
c0004ef0(00000000dffbf163).
Dec 12 08:59:56 elm3b59 kernel: init.c:148: bad pte
c0004ef0(00000000dffbf163).

...

Dec 12 08:59:56 elm3b59 kernel: init.c:148: bad pte
c0004ee8(00000000feb00173).
Dec 12 08:59:56 elm3b59 kernel: init.c:148: bad pte
c0004ee8(00000000f0c05173).
Dec 12 08:59:57 elm3b59 kernel: init.c:148: bad pte
c0004ee8(00000000f0c05173).

...

Dec 12 09:00:14 elm3b59 kernel: Keyboard timed out[1]
Dec 12 09:00:14 elm3b59 last message repeated 5 times



Adding some kernel folk to CC, to see if any of the above is
noteworthy.
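
For anyone repeating this check on a full /var/log/messages, the following is a rough sketch that pulls the per-CPU BogoMIPS figures out of the "Calibrating delay loop" lines and reports their spread. The function names are hypothetical, and the sample text is just the three lines quoted above; it assumes each log entry sits on a single line, as in the raw syslog file.

```python
import re
from typing import List

# Matches e.g. "Calibrating delay loop... 199.88 BogoMIPS"
BOGOMIPS_RE = re.compile(
    r"Calibrating delay loop\.\.\. ([0-9]+(?:\.[0-9]+)?) BogoMIPS"
)


def bogomips_values(log_text: str) -> List[float]:
    """Return every per-CPU BogoMIPS value found in log_text, in log order."""
    return [float(value) for value in BOGOMIPS_RE.findall(log_text)]


SAMPLE_LOG = """\
Dec 12 08:34:02 elm3b59 kernel: Calibrating delay loop... 199.88 BogoMIPS
Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 1.58 BogoMIPS
Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 3.27 BogoMIPS
"""

values = bogomips_values(SAMPLE_LOG)
lowest, highest = min(values), max(values)
print(f"{len(values)} CPUs calibrated, lowest {lowest}, highest {highest}")
```

A large gap between the lowest and highest value is only a hint that calibration went wrong on some CPUs; it does not by itself say which figure is the correct one.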

Comment 15 Tim Burke 2003-12-17 14:06:57 UTC
Can the IBM crew try some smaller x440 configurations, such as a
single CEC?  Trying to identify if it's caused by having too many CPUs,
etc.  BTW, we only support 8 CPUs in RHEL2.1; not 16. In general just
trying to simplify the equation here.  Does it reliably fail on
various x440 configurations?

Out of paranoia, please ensure that the irqbalance utility is disabled
as noted in bug #111783.  Trying to rule that out as the culprit.

Another good diagnostic step is to be running with all the latest U3
bits and reproduce the problem.  Then revert to the prior version of
XFree86 from U2.  That will help to isolate kernel vs XFree86 vs
whether it ever worked in the first place.


Comment 16 john stultz 2003-12-17 18:42:15 UTC
>Just trying to identify whether this is a regression or a new test 
>scenario which may have never succeeded. 
  
The test suite had been used on RHEL2.1 U2 without problems earlier. 
 
>Dec 12 07:06:09 elm3b59 insmod: 
>/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: 
>init_module: No such device 
 
>Please make absolutely sure when doing any kernel or system 
> testing, that you are using the official Red Hat compiled kernel 
> with no 3rd party or external kernel modules being used. 
 
We had to muck with the network settings a bit, but these are RH 
ISOs w/ the e.34 summit kernel we were told to test. 
 
>Not all of the CPUs have the same bogomips results, which seems 
>very odd: 
>Dec 12 08:34:02 elm3b59 kernel: Calibrating delay loop... 199.88  
>BogoMIPS 
>Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 1.58  
>BogoMIPS 
>Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 3.27 
>BogoMIPS 
 
and 
 
>Dec 12 09:00:14 elm3b59 kernel: Keyboard timed out[1] 
>Dec 12 09:00:14 elm3b59 last message repeated 5 times 
 
Are likely due to a known bug (bug #110170). 
 
 
Also the box is an 8cpu system w/ HT turned on, not a 16 cpu system.  
 
I'll start playing with X options and will report back shortly. 
 
 

Comment 17 Bob Johnson 2003-12-17 19:59:08 UTC
from RH QA

jay has run the X stress portion of the cert tests 4 times on our x440
and can't recreate the X issue there 

See Tim's notes in item #15 above.

Comment 18 john stultz 2003-12-17 20:04:52 UTC
Binary searched through all the XaaNo_ options and I think I've  
found a winner.   
  
Option "XaaNoSolidFillRect"  
  
Without this option the hang is very reproducible. With it, I have  
not been able to trigger the hang. I'm going to continue testing to  
be sure.  
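
The binary search described here can be sketched roughly as follows. This is an illustration only, not the procedure actually used: it assumes exactly one option is responsible, the option list is a subset of the XaaNo... names in the XF86Config manpage, and `hangs_with` is a hypothetical callback standing in for "edit the config, restart X, run pounder, move windows, and report whether the machine hung".

```python
from typing import Callable, List, Sequence

# Illustrative subset; see the XF86Config manpage for the full list.
XAA_OPTIONS: List[str] = [
    "XaaNoCPUToScreenColorExpandFill",
    "XaaNoImageWriteRect",
    "XaaNoMono8x8PatternFillRect",
    "XaaNoOffscreenPixmaps",
    "XaaNoPixmapCache",
    "XaaNoScreenToScreenCopy",
    "XaaNoSolidFillRect",
    "XaaNoSolidFillTrap",
]


def find_culprit(
    options: Sequence[str],
    hangs_with: Callable[[Sequence[str]], bool],
) -> str:
    """Find the single option whose presence in the config prevents the hang.

    hangs_with(enabled) must return True if the hang still reproduces when
    only the given XaaNo options are set.
    """
    candidates = list(options)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if hangs_with(first_half):
            # Hang persists, so the helpful option is in the other half.
            candidates = candidates[mid:]
        else:
            # Hang is gone, so the helpful option is in this half.
            candidates = first_half
    return candidates[0]


def simulated_hang(enabled: Sequence[str]) -> bool:
    """Stand-in for a real test run, modelled on the result reported above."""
    return "XaaNoSolidFillRect" not in enabled


print(find_culprit(XAA_OPTIONS, simulated_hang))
```

With n options this needs about log2(n) test runs instead of n, which matters when every run means restarting X and waiting for a hang.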

Comment 19 Tim Burke 2003-12-17 20:44:52 UTC
It would be helpful if you provided us the test script as we have been
unable to reproduce.


Comment 20 john stultz 2003-12-17 20:53:31 UTC
Tim: the test we are using is basically tools10, which Chris tells me 
you already have. If you don't, I'll have to pull out the non ibm 
bits (basically just hellhound) to send it to you.  

Comment 21 keith mannth 2003-12-17 21:06:55 UTC
  tools10 == (1) making a kernel in a loop  with (2) copying /usr to
/tmp with (3) copying a cd to /tmp with (4) hellhound and (5) copying
/ to a nfs mount over gigabit connection.   
  We just start the load and move some windows around and we see the
problem.  We don't see the problem without the load.  If we don't do
anything in X we don't see any problems even with the load.  
  Do you run your X tests with a system load?  
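
For readers without access to tools10, a very rough approximation of the load mix described above might look like the sketch below. Everything in it is an assumption: the commands, the paths (`/usr/src/linux`, `/mnt/cdrom`, `/mnt/nfs`) and the function names are hypothetical stand-ins, hellhound is left out because it is not generally available, and each workload runs once here rather than in a loop.

```python
import subprocess
from typing import Dict, List

# Hypothetical stand-ins for the tools10 workloads; adjust paths before use.
WORKLOADS: Dict[str, List[str]] = {
    "kernel-build": ["make", "-C", "/usr/src/linux", "bzImage"],
    "copy-usr": ["cp", "-a", "/usr", "/tmp/usr-copy"],
    "copy-cd": ["cp", "-a", "/mnt/cdrom", "/tmp/cd-copy"],
    "copy-root-to-nfs": ["cp", "-ax", "/", "/mnt/nfs/root-copy"],
}


def start_all(workloads: Dict[str, List[str]]) -> Dict[str, subprocess.Popen]:
    """Start every workload as its own process so that they run concurrently."""
    return {name: subprocess.Popen(cmd) for name, cmd in workloads.items()}


def wait_all(procs: Dict[str, subprocess.Popen]) -> Dict[str, int]:
    """Wait for every workload to finish and return its exit status by name."""
    return {name: proc.wait() for name, proc in procs.items()}
```

Calling `wait_all(start_all(WORKLOADS))` starts the whole mix at once; the window moving that triggers the hang still has to be done by hand while it runs.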

Comment 22 Mike A. Harris 2003-12-18 00:07:29 UTC
>We had to muck with the network settings a bit, but these are RH 
>ISOs w/ the e.34 summit kernel we were told to test. 

If "muck with network settings" involves 3rd party kernel modules
though, that is a non-starter.  ;o)


> Also the box is an 8cpu system w/ HT turned on, not a 16 cpu system.

We support a maximum of 8 virtual CPUs.  If the CPUs support
hyperthreading, and hyperthreading is enabled, then we support
4 real physical CPUs with hyperthreading enabled.  More than that
may or may not work.

Are you able to reproduce the problems with 4 CPUs + HT also?  This
is just as a test point to make sure it isn't caused by >8 CPUs.


>Binary searched through all the XaaNo_ options and I think I've  
>found a winner.   
>  
>Option "XaaNoSolidFillRect"  
>  
>Without this option the hang is very reproducible. With it, I have  
>not been able to trigger the hang. I'm going to continue testing to  
>be sure.

Ok, please bang on it as much as possible with this setting.  If it
turns out to be stable, I can modify the driver to automatically
disable SolidFillRect on this Savage PCI ID.

Thanks for the info.




Comment 23 john stultz 2003-12-18 02:58:35 UTC
The problem was easily reproduced on an 8 way (disabled 
hyperthreading in the BIOS). 
 
 

Comment 24 keith mannth 2003-12-18 17:28:55 UTC
>If "muck with network settings" involves 3rd party kernel modules
>though, that is a non-starter.  ;o)
  There are no 3rd party kernel modules involved with any of our
distro testing.  This weird module output involves a 100 pro card that
the system was installed with.  The card is no longer in the system. 




Comment 25 john stultz 2003-12-18 19:31:53 UTC
System ran pounder overnight without any issues. Did more window 
moving/resizing in the X server and didn't see any crashes.  
 
I'm feeling fairly confident that Option "XaaNoSolidFillRect" solves 
this issue. Although I'm curious why it has just appeared in this 
update. Were there any related Savage driver changes? 

Comment 26 Mike A. Harris 2003-12-18 19:43:16 UTC
There haven't been Savage driver changes for quite a long time,
at least several months prior to RHEL 3 GA.

Comment 27 john stultz 2003-12-18 23:33:32 UTC
Not sure it's very useful, but as an extra data point, I've not 
been able to reproduce the issue w/ the 2.4.23 kernel both with and 
without the XaaNoSolidFillRect.  
 
So I'm not sure if it's a kernel bug we avoid by using 
XaaNoSolidFillRect or an X bug we avoid by changing kernels. 
Thoughts? 

Comment 28 john stultz 2003-12-19 03:06:21 UTC
I just opened the same bug against RHEL 3.0 update 1 beta 1.  
 
See bug #112405. 
 

Comment 29 john stultz 2004-03-26 23:25:53 UTC
From initial testing, this problem seems to be present in the  
2.4.9-e.39summit kernel that came w/ RHEL2.1-U4-beta1. In addition, 
RHEL3-U2-beta1 also exhibited this problem (see bug #112405).  
 
I'll be trying to do further testing to see if enabling 
XaaNoSolidFillRect resolves the problem as well. 

Comment 30 Bob Johnson 2004-04-15 13:13:25 UTC
John, results of your last testing here ?
Have you tried with the latest kernels ? 

Comment 31 john stultz 2004-04-15 15:54:03 UTC
The issue was last seen w/ the RHEL 2.1 U4 partner beta. I have not 
yet tested the RHEL 2.1 U4 public beta.  

Comment 33 Bob Johnson 2004-08-11 19:53:29 UTC
IBM, please retest with U5, if still an issue open IT.
If turning this off in the X config works around it, we will probably defer fixing.
Could enter this as a KB issue.

Comment 34 john stultz 2004-08-14 01:44:03 UTC
This issue is still reproducible with RHAS 2.1 Update 5 Beta 1. 

Comment 35 Wendy Hung 2004-08-16 14:23:47 UTC
Bob, issuetracker 46728 has been opened.

Comment 36 Wendy Hung 2004-11-04 15:42:16 UTC
John, this issue is still open with 2.1 U6 beta right?

Comment 42 Mike A. Harris 2005-05-12 05:31:44 UTC
This issue was filed almost a year and a half ago now, and to date, Red Hat
X11 Engineering have not had any x440 hardware with which to investigate
this issue, and nobody has offered to provide the hardware that we're
aware of.

Since Red Hat Enterprise Linux 2.1 is now in security maintenance mode,
we will only be making security fixes available for XFree86 from this
point onward.  As such, I'm closing this bug WONTFIX.  If this issue
is a problem in Red Hat Enterprise Linux 3 or 4 however, please contact
Red Hat global support to file a new ticket in issue tracker.  We will
still require x440 hardware however before we can investigate the
issue.

Setting status to "WONTFIX".