Bug 112028
| Field | Value |
|---|---|
| Summary | x440 crashes under heavy load in X |
| Product | Red Hat Enterprise Linux 2.1 |
| Component | XFree86 |
| Version | 2.1 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | medium |
| Reporter | john stultz <johnstul> |
| Assignee | X/OpenGL Maintenance List <xgl-maint> |
| QA Contact | David Lawrence <dkl> |
| CC | alan, jdennis, kmannth, lwoodman, mharris, tao, tburke, wendyh |
| Doc Type | Bug Fix |
| Last Closed | 2005-05-12 05:31:44 UTC |
| Bug Blocks | 143573 |

Description
john stultz
2003-12-13 02:01:22 UTC
Created attachment 96509 [details]
XF86Config-4 file
Can you provide your X server log file from the failed session, your complete /var/log/messages, and the output of lsmod from inside X prior to running the tests? Thanks in advance.

Created attachment 96544 [details]
XFree86.0.log file
Sorry, I meant to attach these files as well, but Bugzilla stopped letting me
connect to it just after posting the config file.

We saw this while testing the previous beta ISOs as well. We wanted to confirm it happened with these ISOs. On the previous ISOs we were able to trigger this without the workload.

Created attachment 96545 [details]
/var/log/messages
You can see the hang at Dec 12 08:59:55

[root@elm3b59 root]# lsmod
Module Size Used by Not tainted
ide-cd 35264 0 (autoclean)
cdrom 35264 0 (autoclean) [ide-cd]
soundcore 7940 0 (autoclean)
autofs 13828 0 (autoclean) (unused)
eepro100 21968 1
tg3 50304 1
usb-uhci 26948 0 (unused)
usbcore 68800 1 [usb-uhci]
ext3 70944 2
jbd 55336 2 [ext3]
aic7xxx 127200 3
sd_mod 13920 3
scsi_mod 126876 2 [aic7xxx sd_mod]

Was this same test scenario completed successfully on RHEL2.1 U2? Just trying to identify whether this is a regression or a new test scenario which may have never succeeded.

Can you use: Option "noaccel" in the video driver section of the config file, rerun the test suite, and try to reproduce the bug? If the bug is not reproducible with the option set, we can disable acceleration on this Savage chip, or on the combination of this chip being used on this piece of hardware. Video will be much slower for the common case, but stability will be much higher.

If the above prevents the problem from occurring, and you would like to retain some video acceleration, please comment out the noaccel option, and try using the various XaaNo.... options documented in the XF86Config manpage to determine which (if any) of the acceleration primitives might be the catalyst. If this can be determined, and stability retained with only the problematic accel primitives disabled, then we can disable just the problem ones, and keep some level of acceleration.

I don't know if Red Hat has this specific hardware internally or not for us to attempt to reproduce as well, but I will inquire.

Can you also please try disabling hyperthreading on this system to see if that makes a difference?
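
For anyone following the noaccel / XaaNo... suggestion above, here is a rough Python sketch of what the edited "Device" section in XF86Config-4 might look like. The identifier "Videocard0" and the helper name render_device_section are placeholders for illustration, and the driver name "savage" is assumed from the chip discussed here; check them against the attached XF86Config-4 before using anything like this.

```python
def render_device_section(identifier, driver, options):
    """Return the text of an XF86Config "Device" section.

    `options` is a list of boolean option names (e.g. "noaccel");
    each one is emitted as a line of the form: Option "name"
    """
    lines = [
        'Section "Device"',
        f'    Identifier "{identifier}"',  # placeholder identifier (assumption)
        f'    Driver "{driver}"',          # assumed driver name for the Savage chip
    ]
    for name in options:
        lines.append(f'    Option "{name}"')
    lines.append("EndSection")
    return "\n".join(lines)


if __name__ == "__main__":
    # First test: all acceleration off.
    print(render_device_section("Videocard0", "savage", ["noaccel"]))
    print()
    # Follow-up test: acceleration on, with a single primitive disabled.
    print(render_device_section("Videocard0", "savage", ["XaaNoSolidFillRect"]))
```

The idea is to confirm first that noaccel avoids the hang, then swap it for individual XaaNo... options to narrow things down.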

Here are some tidbits from the attached kernel log which I thought might be useful to highlight for anyone looking into this report:

I see the following from a previous boot:
Dec 12 07:05:46 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: init_module: No such device
Dec 12 07:05:46 elm3b59 insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:05:46 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: insmod eth2 failed
Dec 12 07:05:46 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: init_module: No such device
Dec 12 07:05:46 elm3b59 insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:05:46 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: insmod eth2 failed
Dec 12 07:06:07 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: init_module: No such device
Dec 12 07:06:07 elm3b59 insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:06:07 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: insmod eth2 failed
Dec 12 07:06:08 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: init_module: No such device
Dec 12 07:06:08 elm3b59 insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:06:08 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: insmod eth2 failed
Dec 12 07:06:09 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: init_module: No such device
Dec 12 07:06:09 elm3b59 insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters
Dec 12 07:06:09 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: insmod eth2 failed
Dec 12 07:06:09 elm3b59 insmod: /lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o: init_module: No such device

Please make absolutely sure when doing any kernel or system testing, that you are using the official Red Hat compiled kernel with no 3rd party or external kernel modules being used. I note that the newer kernel bootup doesn't appear to be using any 3rd party kernel modules, but I just wanted to make sure that you are using only Red Hat supplied kernel modules as we do not support systems that are using 3rd party or self-built modules.

Not all of the CPUs have the same bogomips results, which seems very odd:
Dec 12 08:34:02 elm3b59 kernel: Calibrating delay loop... 199.88 BogoMIPS
Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 1.58 BogoMIPS
Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 3.27 BogoMIPS
...
Dec 12 08:34:04 elm3b59 kernel: Total of 16 processors activated (2802.83 BogoMIPS).
...

Possible kernel bug?
Dec 12 08:59:56 elm3b59 kernel: Local APIC address fee00000
Dec 12 08:59:56 elm3b59 kernel: init.c:148: bad pte c0004ef0(00000000dffbf163).
Dec 12 08:59:56 elm3b59 kernel: init.c:148: bad pte c0004ef0(00000000dffbf163).
...
Dec 12 08:59:56 elm3b59 kernel: init.c:148: bad pte c0004ee8(00000000feb00173).
Dec 12 08:59:56 elm3b59 kernel: init.c:148: bad pte c0004ee8(00000000f0c05173).
Dec 12 08:59:57 elm3b59 kernel: init.c:148: bad pte c0004ee8(00000000f0c05173).
...
Dec 12 09:00:14 elm3b59 kernel: Keyboard timed out[1]
Dec 12 09:00:14 elm3b59 last message repeated 5 times

Adding some kernel folk to CC, to see if any of the above is noteworthy.

Can the IBM crew try some smaller x440 configurations, such as a single CEC? Trying to identify if it's caused by having too many CPUs, etc. BTW, we only support 8 CPUs in RHEL2.1, not 16.
In general, just trying to simplify the equation here. Does it reliably fail on various x440 configurations?

Out of paranoia, please ensure that the irqbalance utility is disabled as noted in bug #111783. Trying to rule that out as the culprit.

Another good diagnostic step is to run with all the latest U3 bits and reproduce the problem, then revert to the prior version of XFree86 from U2. That will help to isolate kernel vs XFree86 vs whether it ever worked in the first place.

>Just trying to identify whether this is a regression or a new test
>scenario which may have never succeeded.
The test suite had been used on RHEL2.1 U2 without problems earlier.

>Dec 12 07:06:09 elm3b59 insmod:
>/lib/modules/2.4.9-e.33/kernel/drivers/addon/e100_2124k2/e100_2124k2.o:
>init_module: No such device
>Please make absolutely sure when doing any kernel or system
> testing, that you are using the official Red Hat compiled kernel
> with no 3rd party or external kernel modules being used.
We had to muck with the network settings a bit, but these are RH ISOs w/ the e.34 summit kernel we were told to test.

>Not all of the CPUs have the same bogomips results, which seems
>very odd:
>Dec 12 08:34:02 elm3b59 kernel: Calibrating delay loop... 199.88
>BogoMIPS
>Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 1.58
>BogoMIPS
>Dec 12 08:34:03 elm3b59 kernel: Calibrating delay loop... 3.27
>BogoMIPS
and
>Dec 12 09:00:14 elm3b59 kernel: Keyboard timed out[1]
>Dec 12 09:00:14 elm3b59 last message repeated 5 times
are likely due to a known bug (bug #110170). Also, the box is an 8-CPU system w/ HT turned on, not a 16-CPU system. I'll start playing with X options and will report back shortly.

From RH QA: Jay has run the X stress portion of the cert tests 4 times on our x440 and can't recreate the X issue there.

See Tim's notes in item #15 above.

Binary searched through all the XaaNo_ options and I think I've found a winner.
Option "XaaNoSolidFillRect"
Without this option the hang is very reproducible. With it, I have not been able to trigger the hang. I'm going to continue testing to be sure.

It would be helpful if you provided us the test script, as we have been unable to reproduce.

Tim: the test we are using is basically tools10, which Chris tells me you already have. If you don't, I'll have to pull out the non-IBM bits (basically just hellhound) to send it to you. tools10 == (1) making a kernel in a loop with (2) copying /usr to /tmp with (3) copying a CD to /tmp with (4) hellhound and (5) copying / to an NFS mount over a gigabit connection. We just start the load and move some windows around and we see the problem. We don't see the problem without the load. If we don't do anything in X we don't see any problems even with the load. Do you run your X tests with a system load?

>We had to muck with the network settings a bit, but these are RH
>ISOs w/ the e.34 summit kernel we were told to test.
If "muck with network settings" involves 3rd party kernel modules though, that is a non-starter. ;o)

> Also the box is an 8cpu system w/ HT turned on, not a 16 cpu system.
We support a maximum of 8 virtual CPUs. If the CPUs support hyperthreading, and hyperthreading is enabled, then we support 4 real physical CPUs with hyperthreading enabled. More than that may or may not work. Are you able to reproduce the problems with 4 CPUs + HT also? This is just as a test point to make sure it isn't caused by having more than 8 CPUs.

>Binary searched through all the XaaNo_ options and I think I've
>found a winner.
>
>Option "XaaNoSolidFillRect"
>
>Without this option the hang is very reproducible. With it, I have
>not been able to trigger the hang. I'm going to continue testing to
>be sure.
Ok, please bang on it as much as possible with this setting. If it turns out to be stable, I can modify the driver to automatically disable SolidFillRect on this Savage PCI ID. Thanks for the info.
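
The binary search over the XaaNo... options mentioned above can be sketched roughly as follows in Python. This is only an illustration, not the procedure actually used here: it assumes exactly one acceleration primitive is responsible, and that disabling any set of options containing it avoids the hang. The option list is a partial subset of the XaaNo... names from the XF86Config manpage, and simulated_run is a hypothetical stand-in for a person restarting X with the given options and running the load test.

```python
# Partial list of XaaNo... options, for illustration only (see the XF86Config manpage).
XAA_OPTIONS = [
    "XaaNoCPUToScreenColorExpandFill",
    "XaaNoImageWriteRect",
    "XaaNoMono8x8PatternFillRect",
    "XaaNoOffscreenPixmaps",
    "XaaNoPixmapCache",
    "XaaNoScreenToScreenCopy",
    "XaaNoSolidFillRect",
    "XaaNoSolidFillTrap",
]


def find_culprit(options, hangs_with_disabled):
    """Bisect `options` to find the one whose disabling avoids the hang.

    `hangs_with_disabled(subset)` must return True if the hang still
    occurs when every option in `subset` is set in the config.
    Assumes exactly one option in `options` is the culprit.
    """
    candidates = list(options)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if hangs_with_disabled(half):
            # Still hangs, so the culprit is not in `half`.
            candidates = candidates[len(half):]
        else:
            # Hang avoided, so the culprit is somewhere in `half`.
            candidates = half
    return candidates[0]


def simulated_run(disabled):
    """Hypothetical test run: hangs unless SolidFillRect is disabled."""
    return "XaaNoSolidFillRect" not in disabled


if __name__ == "__main__":
    print(find_culprit(XAA_OPTIONS, simulated_run))
```

With n candidate options this needs about log2(n) test runs instead of n, which matters when each run means restarting X and waiting for the load to trigger the hang.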

The problem was easily reproduced on an 8-way (hyperthreading disabled in the BIOS).

>If "muck with network settings" involves 3rd party kernel modules
>though, that is a non-starter. ;o)
There are no 3rd party kernel modules involved with any of our
distro testing. This weird module output involves a 100 pro card that
the system was installed with. The card is no longer in the system.

System ran pounder overnight without any issues. Did more window moving/resizing in the X server and didn't see any crashes. I'm feeling fairly confident that Option "XaaNoSolidFillRect" solves this issue, although I'm curious why it has just appeared in this update. Were there any related Savage driver changes?

There haven't been Savage driver changes for quite a long time, at least several months prior to RHEL 3 GA.

Not sure it's very useful, but as an extra data point, I've not been able to reproduce the issue w/ the 2.4.23 kernel, both with and without XaaNoSolidFillRect. So I'm not sure if it's a kernel bug we avoid by using XaaNoSolidFillRect or an X bug we avoid by changing kernels. Thoughts?

I just opened the same bug against RHEL 3.0 update 1 beta 1. See bug #112405.

From initial testing, this problem seems to be present in the 2.4.9-e.39summit kernel that came w/ RHEL2.1-U4-beta1. In addition, RHEL3-U2-beta1 also exhibited this problem (see bug #112405). I'll be trying to do further testing to see if enabling XaaNoSolidFillRect resolves the problem as well.

John, results of your last testing here? Have you tried with the latest kernels?

The issue was last seen w/ the RHEL 2.1 U4 partner beta. I have not yet tested the RHEL 2.1 U4 public beta.

IBM, please retest with U5; if it is still an issue, open an IT. If turning this off in xconfig works around it, we will probably defer fixing. Could enter this as a KB issue.

This issue is still reproducible with RHAS 2.1 Update 5 Beta 1.

Bob, issuetracker 46728 has been opened.

John, this issue is still open with 2.1 U6 beta, right?

This issue was filed almost a year and a half ago now, and to date, Red Hat X11 Engineering have not had any x440 hardware with which to investigate this issue, and nobody has offered to provide the hardware that we're aware of. Since Red Hat Enterprise Linux 2.1 is now in security maintenance mode, we will only be making security fixes available for XFree86 from this point onward.
As such, I'm closing this bug WONTFIX. If this issue is a problem in Red Hat Enterprise Linux 3 or 4 however, please contact Red Hat global support to file a new ticket in issue tracker. We will still require x440 hardware however before we can investigate the issue. Setting status to "WONTFIX".