Platform: Dell blade server PowerEdge 2850

The blade was either idle or running a bash script when it became unresponsive (hung). An FSB analyzer trace shows that the CPU tries to read video config space and never gets a reply. DRI is enabled by the default installation (entry in /etc/X11/XF86Config).

Workaround: Disabling DRI eliminates this issue at the price of some video performance. Edit the XF86Config file (found in /etc/X11) to remove the line load "dri" from the modules section.

The failure does not occur in low-resolution modes: 640x480 with 64k colors did not hang, but 1024x768 with millions of colors does.

This problem can be reproduced when all of the following conditions are true:
1. Dell system with on-board 7000m. (The problem does not occur with a 7000m PCI board on a regular system.)
2. DRI is enabled using PCIGART. (The problem does not occur with DRI disabled.)
3. Only a single display controller is used. (The problem does not occur if clone mode is enabled.)
4. Run eschertilerect100 many times. (It may not hang the first couple of times.)

A patch to fix the drm kernel driver has been created for RHEL3u2 linux-2.4.21-15.EL. See "Additional Information" for the patch. The problem is with the drm driver inside the linux-2.4.21-15.ELsmp kernel tree that comes with RHEL3u2. The problem was fixed in XFree86/Xorg some time ago.

Steps to reproduce:
1. Install RH 3.0 Update 3.
2. Verify DRI is on by checking the file /etc/X11/XF86Config for the line load "dri".
3. Verify the graphics are using 32-bit color ("Millions of colors").
4. Run: x11perf -eschertilerect100

This normally runs for about 1 minute and exits without errors, but in this case it sometimes fails after about 30 seconds with a server hang.
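The check in step 2 (is DRI still being loaded?) could be automated along the following lines. This is only a minimal sketch, not part of the original report: the function names config_loads_dri and line_loads_dri are hypothetical, and the sketch assumes that XF86Config keywords such as Load are case-insensitive and that a line starting with '#' is a comment. It scans configuration text already read into memory and does not open any file itself.

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive test: does s begin with prefix? */
static int starts_with_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Skip spaces and tabs only, so the scan never crosses a newline. */
static const char *skip_blanks(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Hypothetical helper: returns 1 if the line starting at `line` is an
 * uncommented  Load "dri"  directive, else 0.
 * Assumption: '#' starts a comment and the keyword is case-insensitive. */
int line_loads_dri(const char *line)
{
    line = skip_blanks(line);
    if (*line == '#')
        return 0;
    if (!starts_with_nocase(line, "load"))
        return 0;
    line += 4;
    if (*line != ' ' && *line != '\t')
        return 0;
    line = skip_blanks(line);
    return strncmp(line, "\"dri\"", 5) == 0;
}

/* Hypothetical helper: returns 1 if any line of the NUL-terminated
 * configuration text loads the dri module, else 0. */
int config_loads_dri(const char *text)
{
    while (*text != '\0') {
        const char *nl;

        if (line_loads_dri(text))
            return 1;
        nl = strchr(text, '\n');
        if (nl == NULL)
            break;
        text = nl + 1;
    }
    return 0;
}
```

A caller would read /etc/X11/XF86Config into a buffer and pass it to config_loads_dri; a result of 0 after applying the workaround suggests the Load line is gone or commented out.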
FIX: Patch to fix drm kernel driver in RHEL3u2 linux-2.4.21-15.EL

--- linux-2.4.21-15.EL/drivers/char/drm/radeon_cp.c.linux-2.4.21-15.EL_orig	2004-10-29 16:29:56.000000000 -0400
+++ linux-2.4.21-15.EL/drivers/char/drm/radeon_cp.c	2004-10-29 16:30:59.000000000 -0400
@@ -1347,7 +1347,7 @@
 
 	LOCK_TEST_WITH_RETURN( dev );
 
-	if ( copy_from_user( &stop, (drm_radeon_init_t *)arg, sizeof (stop) ) )
+	if ( copy_from_user( &stop, (drm_radeon_cp_stop_t *)arg, sizeof(stop) ) )
 		return -EFAULT;
 
 	/* Flush any pending CP commands. This ensures any outstanding
@@ -1592,7 +1592,7 @@
 
 	for ( i = d->granted_count ; i < d->request_count ; i++ ) {
 		buf = radeon_freelist_get( dev );
-		if ( !buf ) return -EAGAIN;
+		if ( !buf ) return -EBUSY;
 
 		buf->pid = current->pid;
Per Sue Denham, this came in too late to make RHEL3 Update 5; the fix will be deferred to Update 6.
Please attach patch rather than submitting inline.
U6 closed a couple of weeks ago. Moving from U6 to U7 proposed list.
Rod- Can you please provide the patch to RH as requested? Without it we are not making any progress, and we are still hoping to get this in for U6.
Created attachment 117917: DRM driver patch fixing video stress hang on Dell server. The patch is for the DRM driver inside the RHEL3 2.4.21 Linux kernel, which used an old version of the DRI code.
Has ATI tested this on RHEL3? The patch is in RHEL4 and upstream 2.6. It is not in RHEL3 or upstream 2.4.

The first part of the patch looks obviously correct. I'm wondering if the error return update will cause problems for existing callers. The 2.6 sources include:

	buf = radeon_freelist_get( dev );
	if ( !buf ) return DRM_ERR(EBUSY); /* NOTE: broken client */
Rod@ATI- Can you please provide answers to the question posted in comment #15?
Mustfix request is denied. No PM ACK on this mustfix request, and no feedback from ATI regarding the patch.
RHEL3 U6: Problem not reproducible. Confirmed by test at ATI (32b OS, Dell PE6800 server, 20051027).
RHEL4 U2: Problem not reproducible. Confirmed by test at ATI (32b OS, Dell PE6800 server, 20051027).

This item can be closed.
Defect still reproducible on PE6800 with RHEL3-U6 installed.
Steps to reproduce the defect:
1. Install RHEL3-U6 (kernel-2.4.21-37) 64-bit on PE6800.
2. In GUI mode, run the command "x11perf -eschertilerect100".
3. Run the above command a few times.
4. System hangs.
With reference to comment #22: ATI was not able to reproduce this defect because they were trying it with RHEL3-U6 x86. The defect is reproducible only on RHEL3-U6 x86_64. ATI confirms that they are now able to reproduce the defect on RHEL3-U6 x86_64 on PE6800 and are investigating the defect further.
Created attachment 121713: DRM driver patch fixing video stress hang on Dell server. This patch has been updated for RHEL3 U6.
Created attachment 121714: DRM driver patch fixing video stress hang on Dell server. This patch has been updated for RHEL3 U6.
Apologies for the double attachment; both patches are the same. This patch has been tested by Dell and was shown to prevent the system hang that occurs when running x11perf.

Regarding the question in #15 above: in the 2.6 kernel source this issue is corrected as shown:

	buf = radeon_freelist_get( dev );
	if ( !buf ) return DRM_ERR(EBUSY); /* NOTE: broken client */

This will not cause problems for existing callers, since the current behavior is a defect. If the function returns EAGAIN, the call will be retried over and over, potentially locking up the system. The correct return value here is EBUSY.
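To illustrate the EAGAIN versus EBUSY argument, here is a small user-space simulation. It is a sketch under stated assumptions, not the actual DRM or X server code: fake_get_buffers and request_with_retry are hypothetical names, the "driver" is modeled as a function whose free list is permanently empty, and the max_attempts cap exists only so the simulation terminates. A real caller that retries on EAGAIN without such a cap would keep spinning.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's buffer request. The free list is
 * modeled as permanently empty, so the call always fails with the given
 * error code (returned negated, kernel style). */
int fake_get_buffers(int exhausted_errno)
{
    return -exhausted_errno;
}

/* Hypothetical client-side wrapper that treats EAGAIN as "try again".
 * max_attempts is an artificial cap so that the simulation ends.
 * Returns the number of calls made; the last result goes to *result. */
int request_with_retry(int exhausted_errno, int max_attempts, int *result)
{
    int attempts = 0;
    int ret;

    assert(max_attempts > 0);

    do {
        ret = fake_get_buffers(exhausted_errno);
        attempts++;
    } while (ret == -EAGAIN && attempts < max_attempts);

    if (result != NULL)
        *result = ret;
    return attempts;
}
```

With the simulated driver returning EAGAIN, request_with_retry(EAGAIN, 1000, &r) uses all 1000 attempts and still ends with -EAGAIN. With EBUSY, the same call returns after a single attempt, so the caller sees the failure immediately instead of looping.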
Patch not applied in RHEL3-U7-Beta(kernel-2.4.21-38). Defect not Fixed in RHEL3-U7-Beta(kernel-2.4.21-38).
I see the hardware field has been changed to x86_64. Can you ensure that the code fix is also applied to 32-bit? Although the problem does not reproduce there today (with RHEL3-U6), it did occur with earlier releases.
The hardware field was updated in response to comments 24 and 25. The patch is queued for inclusion in U8, and applies to all platforms.
Raising priority to high based on Dell's U8 consideration.
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.1.EL).
A kernel has been released that contains a patch for this problem. Please verify if your problem is fixed with the latest available kernel from the RHEL3 public beta channel at rhn.redhat.com and post your results to this bugzilla.
Reverting to ON_QA.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0437.html