+++ This bug was initially created as a clone of Bug #236416 +++

Description of problem:
I just installed RHEL 5 client, and noticed that sometimes the X resolution is properly set, as I specified, to 1280x1024, but often, upon restart of the X server, it dumbs down the resolution to 800x600. I will attach two Xorg.0.log outputs showing how the VESA VBE DDC read is said to be successful, but in the dumbed-down case no actual data comes in that enables proper configuration of the monitor. This problem DOES NOT occur under RHEL 4.5 beta, nor does it occur using the third-party fglrx driver.

Version-Release number of selected component (if applicable):
X Window System Version 7.1.1

How reproducible:
Often. Note that I've got a VGA LCD attached via an adapter cable to the ATI Radeon X1300 Pro card. At first, I thought it might be because the adapter or the display or the card was flaky in reporting the VESA data. But since I NEVER got failures under RHEL 4.5, I began to suspect something amiss in the VESA DDC read.

Steps to Reproduce:
1. Start the X server

Actual results:
800x600 resolution, and no actual data from the VESA VBE DDC read showing up in Xorg.0.log

Expected results:
Proper detection of the monitor through proper data being returned by the VESA VBE DDC read, showing the monitor Manufacturer string and all the other relevant data, with the ultimate result of proper configuration at 1280x1024.

Additional info:

-- Additional comment from wdc on 2007-04-13 14:41 EST --

Created an attachment (id=152574)
Log output showing successful VBE DDC read

-- Additional comment from wdc on 2007-04-13 14:46 EST --

Created an attachment (id=152575)
Log output showing VBE DDC read with no data

Note that in this log file, the VESA VBE DDC read is declared successful, but the Manufacturer string and all the other relevant data needed to configure the monitor have NOT been obtained. This log file was obtained on the exact same hardware as the Xorg.0.log.good file.

My process was:
Start RHEL 5.
Notice X was configured correctly. Log in. Save the Xorg.0.log file as Xorg.0.log.good. Log out. (And the X server restarted as per apparent config defaults.) Notice that X was dumbed down to 800x600. Log in. Save the Xorg.0.log file as Xorg.0.log.bad.

-- Additional comment from jbaron on 2007-05-01 14:24 EST --

Hmmm, is it possible that the System->Preferences->Screen Resolution menu is set to 800x600, thereby overriding the system configuration on a per-user basis?

-- Additional comment from wdc on 2007-05-01 15:51 EST --

Well, the screen resolution problem occurs even when logged in as root. The resolution menu offered by the System Preferences, after the X server has decided to dumb itself down to 800x600, offers no higher resolution than 800x600.

When I first installed RHEL 5, I got 800x600, but I ran some tool and specified 1280x1024, and that's when I got to this state of affairs where it sometimes does and sometimes does not work. I regret that I did not take careful note of which tool I ran. It was probably "system-config-display".

I am using a Dell E196FP display on the Optiplex 745. I have now used system-config-display to set that explicitly as the monitor. Here's the odd thing: When the VESA data transfer is successful, the monitor clearly reports that its optimal resolution setting is 1280x1024x60, and the setup is correct. When the VESA data transfer is unsuccessful, the Xorg.0.log file reports that the 1280x1024x75 resolution is being tried multiple times, but that the 1280x1024x60 resolution is **NEVER** tried. I wonder why this is so. I also wonder why no 1024x768 resolutions are being tried. Even when the ACTUAL monitor I'm using is specified explicitly, no higher resolution than 800x600 is offered when the VESA DDC data transfer fails.

I see two questions to answer here:
1. Why does the VESA DDC transfer sometimes report success when no data is transferred?
2. Why does the X server never try 1024x768 resolutions, nor 1280x1024x60?
It tries a WHOLE LOT of them, as can be seen in the Xorg.0.log file.

----

Should I also ask you why you are asking me about user-level configuration settings, when the Xorg.0.log file already shows that a whole bunch of resolutions, never offered in those user-level configuration commands, are being tried and abandoned for reasons that have nothing to do with the user-level configuration settings, and everything to do with the perceived capabilities of the monitor? Or am I completely misreading the Xorg.0.log file here?

-- Additional comment from wdc on 2007-05-10 17:59 EST --

I am disappointed that 10 days have gone by and nobody has followed up. I guess nobody cares that the latest update to RHEL BROKE X server configuration. I REALLY would like some help with this. I've just taken RHEL 4.5, and either I've found a way to more consistently specify a broken configuration, or whatever you broke in RHEL 5 you've BACK PORTED to 4.5, because the RHEL 4.5 beta worked great, but the RHEL 4.5 that was released is ALSO BROKEN. Let's get hopping on understanding this problem and fixing it QUICKLY!

-- Additional comment from wdc on 2007-05-10 18:14 EST --

I've tested with another monitor, the Dell 2007WFP LCD, via the VGA connector. In this case, the VESA data seems to be correctly fetched by both RHEL 5 and RHEL 4.5, but the monitor VERY CAREFULLY configures itself to CHOP OFF the topmost 30 or so pixels. Dell has no vertical size control, so I get my choice of having the tool bar or the panel chopped away. This is unacceptable, and extremely frustrating. How can I help MIT customers adopt RHEL 4.5 and RHEL 5 when basic X display configuration has been so badly and obviously broken? If you folks don't see the test case, let's get someone back to me QUICKLY so both MIT and Red Hat see the same symptoms, and pool our collective understanding.
-- Additional comment from wdc on 2007-05-11 19:10 EST --

Created an attachment (id=154575)
Sysreport of target system running RHEL 5

-- Additional comment from wdc on 2007-05-11 19:28 EST --

In the interests of being helpful I have attached sysreport output of the relevant system. Probably our next step is to decide if we have one bug or two here. The overall symptom is that X is not properly configured. But that could be due to two separate issues:
1. Failure to get consistently good data from the VESA DDC transfer.
2. X chops off the topmost 50 pixels when the exact correct display is specified in the System->Administration->Display tool.

-- Additional comment from wdc on 2007-05-11 19:51 EST --

There's something else interesting going on. Yesterday the monitor would configure and chop off the top. Today I can't seem to establish an xorg.conf that will drive the monitor at that size any more. I either get 800x600, or I get a complaint that I'm driving the monitor too hard. I *THINK* it's because the xorg.conf I'm now playing with does not contain explicit resolution settings, and so it's trying to get them from the failed VESA DDC transfer.

-- Additional comment from alanm on 2007-05-16 15:40 EST --

RHEL problems will get attention if they are filed via your TAM. Since this works under RHEL 4.5 I'll mark this as a regression.

-- Additional comment from pm-rhel on 2007-05-16 15:46 EST --

This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

-- Additional comment from tao on 2007-05-16 16:02 EST --

Heya Alan,

Update: RHEL 4.5 works with the third-party driver, and is flaky with our driver. Hence, regression is not in question. Additionally, I have posted a query with the Dell partner contact.
Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from tao on 2007-05-16 16:13 EST --

I have confirmed with the customer: RHEL 4.5 does not work correctly with the vesa driver. fglrx was used.

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from wdc on 2007-05-16 16:34 EST --

MIT does not have a TAM. It has something to do with the business model of "Since 1860, companies have paid for the privilege of collaborating with MIT." Why does Red Hat insist on charging a premium price for the privilege of getting bugs taken seriously by the very community that helped create Linux in the first place... But I digress. Inasmuch as this is a basic problem that will affect MANY users of RHEL 5, it seems in Red Hat's best interest to resolve it quickly. The position expressed by "pm-rhel" seems quite wise.

-- Additional comment from tao on 2007-05-16 16:58 EST --

I wasn't aware that MIT doesn't have a TAM. RHEL bugs get addressed a lot faster if they are submitted via Issue Tracker. There are times when I've seen bugs submitted by BZ on RHEL problems that wind up hanging in limbo because there isn't an Issue Tracker ticket associated with them. Having two tools can be a problem, because support uses IT, engineering uses BZ, and product management makes their decisions using both.

This event sent from IssueTracker by alanm issue 121369

-- Additional comment from tao on 2007-05-17 12:53 EST --

**** Problem Description
Source : Service Request 1468792
Created by : WDC-RHN
Created on : 15-May-2007 20:58:38

<snip>
I just discovered one source of why RHEL 4.5 beta worked so well. I'd installed the ATI proprietary driver, and FORGOTTEN. Alas, when I went to restore the ghost image snapshot, I discovered that ghost does NOT restore RHEL 4 images. They won't boot. Tomorrow I'll re-install RHEL 4.5 beta.
What I *DO* know is that RHEL 4.5 will create a good xorg.conf file when the Dell 2007WFP is connected via DVI. Alas, that xorg.conf file does not seem to work properly when the monitor is plugged in via VGA. That xorg.conf file does not work AT ALL under RHEL 5.
</snip>

This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from rkhadgar on 2007-05-19 07:10 EST --

Updated info from customer, summarised, cropped, chipped and pasted:
-------------------------------------------------------------------------------
RHEL 4.5 correctly configures and works when connected to DVI. RHEL 4.5 connected to the VGA connector chops off the top 50 pixels. The "perfect" behavior under RHEL 4.5 beta that I originally reported early in this case was due to the use of the ATI proprietary driver. I am relieved to report that this means that I can take the RHEL 4.5 roll-out OFF hold here at MIT, because I now see that there is no show stopper, merely a performance issue.
-------------------------------------------------------------------------------
RHEL 5, however, clearly has breakage in the dynamic X configuration, such that a hardware configuration that works just fine under RHEL 4.5 does not work at all under RHEL 5. RHEL 5 dynamic xorg.conf configuration does not work for the ATI Radeon X1300 card. The VESA DDC transfer fails nearly all the time. RHEL 5 will not start X at all when connected to DVI. RHEL 5 configures and runs X at 800x600 when connected via the VGA connector. RHEL 5 will run X via the VGA connector at least as high as 1280x1024 with explicit Modelines.

-- Additional comment from tao on 2007-05-19 07:41 EST --

Attaching to IT an xorg.conf which works fine with the VGA connector. The same config fails with the DVI connector on the same monitor - Dell 2007WFP. Xorg log attached to IT for the same.
This event sent from IssueTracker by rkhadgar issue 121369
it_file 91403

-- Additional comment from wdc on 2007-05-21 17:17 EST --

Since I last posted to this bug on 16 May, I've done some more careful testing and I understand a LOT more about this situation.

Bottom line summary: The RHEL 4.5 X server is performing acceptably. The RHEL 5 X server suffers from a problem with the DDC fetch that ALSO affects Ubuntu 7.04, and perhaps SuSE SLED 10.1. I've searched the X.org bug tree and found two relevant bugs:
https://bugs.freedesktop.org/show_bug.cgi?id=6886
https://bugs.freedesktop.org/show_bug.cgi?id=10238
I've subscribed to the latter one, and we'll see if the X.org folks respond.

Detail: I needed to be told how to create a baseline xorg.conf file. Once I did that, I was able to carefully test RHEL 4.5 and RHEL 5. Along the way, I discovered that some of the extremely good performance I was getting under RHEL 4.5 beta was because I'd installed the ATI proprietary driver but FORGOT. (Oops.)

The detailed behavior I got while testing RHEL 4.5 is: On the Optiplex 745 with the ATI Radeon X1300 Pro, up to 1280x1024 works via VGA; up to 1400x1050 works via DVI. If your xorg.conf specifies 1400x1050, the VGA display will be too big for the screen. If your xorg.conf specifies 1600x1200, the VGA display will draw a blank, but the DVI display will know not to use that setting. This seems reasonable, albeit non-ideal, behavior to me.

Creating a baseline xorg.conf file under RHEL 5, I re-ran tests and determined: The X server will not run AT ALL when connected via DVI. When connected via VGA, the DDC transfer fails, forcing the X server to dumb down to 800x600. If one explicitly provides Modeline directives in the xorg.conf file, the X server can be driven at up to 1280x1024 when connected via the VGA port. Perhaps higher resolutions are possible, but so far I don't have a Modeline for better than that. When connected via DVI, the X server WILL NOT START AT ALL.
The monitor complains of being over-driven. DDC transfers under RHEL 5, with X server version 7.1.1, always fail, both on the VGA port and on the DVI port. Ubuntu 7.04 seems to suffer the same fate. There is a long-winded bug report about this at:
https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/89853
It is still unclear to me whether Red Hat, the Ubuntu community or X.org do or do not understand the root cause of this problem. Perhaps between the four of us we can converge on a useful fix.

-- Additional comment from tao on 2007-05-22 03:19 EST --

150 systems are held back from RHEL 5 deployment because of this issue.

**** Problem Description
Source : Service Request 1468792
Created by : WDC-RHN
Created on : 21-May-2007 14:35:26

<snip/>
Q: Would you provide the number of systems being held back because of this issue, as this will help me push up the priority.

A: This is a difficult number to compute, because we don't know if this problem affects all systems, or merely the ones that use the Dell 745 hardware. Let's say that the scope is limited to desktops. In that case our Satellite server says 571 systems are registered for RHEL 4 WS. If it's just the systems we'd replace with Dell Optiplex 745s, that generally runs to about 150 systems. As far as I know, the Dell Optiplex is the SINGLE MOST POPULAR enterprise desktop system in the world, so it's probably a good idea to get this working. Does this help?
Priority set to: 2

This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from tao on 2007-05-22 03:30 EST --

From tech-list:
http://post-office.corp.redhat.com/archives/tech-list/2007-May/msg00563.html

From: Adam Jackson <ajackson>
Reply-To: tech-list
To: tech-list
Subject: Re: ATI Radeon X1300 Pro card
Date: Mon, 21 May 2007 11:25:33 -0400 (20:55 IST)
Mailer: Evolution 2.10.0 (2.10.0-2.fc7)

On Sat, 2007-05-19 at 15:53 +0530, Ritesh Khadgaray wrote:
> Heya,
>
> Is anyone using a ATI Radeon X1300 Pro card with pci-id listing
> "1002:7183" ?
>
> I have a customer who has issue using the stated card on RHEL5 with
> DVI connector. With VGA connector, using explicit ModeLine option works
> fine .
>
> This card works fine on RHEL4.5 with DVI connector, and with top
> 50pixels chopped off with VGA connector .

Those are R500 cards. You're at the mercy of whatever the VESA BIOS implements. Thanks ATI.

- ajax

This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from cra on 2007-05-28 14:25 EST --

wdc and I dug into the X server sources, and produced the attached patch, with interesting results. Issues:

1. The initialization of the EDID buffer carefully memsets only 4 bytes to zero, because it uses the size of the pointer to the structure instead of the size of the structure itself. However, in our patch we use the constant 128, because that is the size of an EDID buffer (as described in the EDID documentation we found on the Web).

2. When the EDID transfer fails and gives us an EDID buffer full of zeros, xf86InterpretEDID in interpret_edid.c silently fails and returns NULL. We changed the code to report this error condition.

3. The EDID fetch from the BIOS is DEFINITELY flaky in a time-dependent way.
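Issue 1 above, clearing a buffer through a pointer with `sizeof`, is a classic C pitfall. A minimal sketch, with hypothetical function names (the real code is in the X server's DDC path, not shown here):

```c
#include <string.h>

#define EDID_BLOCK_SIZE 128  /* a base EDID block is always 128 bytes */

/* Buggy variant: sizeof(edid) is the size of the POINTER (4 bytes on
   i386, 8 on x86_64), so only the first few bytes are zeroed and the
   rest of the buffer keeps whatever garbage was there before. */
static void clear_edid_buggy(unsigned char *edid)
{
    memset(edid, 0, sizeof(edid));
}

/* Fixed variant: zero the whole 128-byte EDID block explicitly,
   as the attached patch does with the constant 128. */
static void clear_edid_fixed(unsigned char *edid)
{
    memset(edid, 0, EDID_BLOCK_SIZE);
}
```

On the 32-bit systems in question, `sizeof(edid)` is 4, which matches the "memsets only 4 bytes" observation above.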
We inserted a sleep(2) into vbeReadEDID in vbe.c, which seems to improve things somewhat, but running Xorg multiple times results in EDID fetches in various states of completion, with the buffer only being filled up to a certain point, followed by zeros. We copied the hex dump code from print_edid.c into vbe.c so that the EDID buffer could be viewed immediately after the BIOS fetch. Attached is the patch against xorg-x11-server-1.1.1-48.13.0.1, along with Xorg.0.log files from successive runs with this patch showing the EDID buffer in various states of fill.

-- Additional comment from cra on 2007-05-28 14:30 EST --

Created an attachment (id=155549)
Patch to debug EDID BIOS fetch

-- Additional comment from cra on 2007-05-28 14:31 EST --

Created an attachment (id=155550)
Xorg.0.log run 1 showing full EDID read

-- Additional comment from cra on 2007-05-28 14:32 EST --

Created an attachment (id=155551)
Xorg.0.log run 2 showing full EDID read

-- Additional comment from cra on 2007-05-28 14:32 EST --

Created an attachment (id=155552)
Xorg.0.log run 3 showing partial EDID read

-- Additional comment from cra on 2007-05-28 14:33 EST --

Created an attachment (id=155553)
Diff between Xorg.0.log run 2 and run 3

-- Additional comment from cra on 2007-05-28 14:38 EST --

If you remove our "sleep(2);" from vbe.c, the hex dump output from the EDID fetch from the BIOS pretty much always comes up all zeros.

-- Additional comment from wdc on 2007-05-29 18:04 EST --

Created an attachment (id=155646)
Log of successful DDC read, RHEL 4.5 with debug patch applied.

Today I built the X server under RHEL 4.5, applying the relevant portion of the debug patch that performs the hex dump of the EDID fetch. I ran Xorg several times. Always the result is the same: PERFECTLY RELIABLE fetch of the EDID data! I also looked at the differences in the int10 logic that seems to be doing the nuts and bolts of the EDID fetch. Although I might have missed something, I think they are substantially the same.
This causes me to conclude that what we have is a KERNEL bug, not an X server bug. Perhaps something is playing fast and loose with the real-mode emulation that serves the VBE? Since this problem seems also to affect Ubuntu 7.04 (although I can't get it to consistently fail), we're probably talking about a kernel bug introduced between 2.6.9 and 2.6.18. (The Ubuntu 7.04 Desktop install CD, which HAS the problem, uses 2.6.20-15.)

QUESTION: What further steps should I take to clarify that the fault lies in the kernel and not in X?

-- Additional comment from wdc on 2007-05-30 17:25 EST --

Today I did two things:

1. I experimented under Ubuntu 7.04 to try and learn more -- I got partial EDID transfers, but no clue how to control when the transfers were partial and when they were complete.

2. I found a package called "read-edid" that is alleged to use the vm86 code in a stand-alone mode to perform the problematic EDID fetch. See: http://john.fremlin.de/programs/linux/read-edid/ A Debian package was available for Ubuntu. Running the program ALWAYS gets a 100% good EDID fetch. Building the package from source under RHEL 5 and running it ALSO ALWAYS gets a 100% good EDID fetch.

So now the question is, "What is happening to make stand-alone get-edid successful but the X.org fetch unsuccessful?" Someone suggested that there may be a memory caching issue involved. get-edid is a small program, whereas X is rather large, so that's not so far-fetched an idea. My next task will be to read the get-edid code, and try to understand if it is doing the same thing the X server is doing. ANY insight from anyone else reading this bug report would be MOST welcome.

-- Additional comment from tao on 2007-05-30 17:53 EST --

Ping, from customer:

So now the question is, "What is happening to make stand-alone get-edid successful but the X.org fetch unsuccessful?"
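For anyone comparing the attached hex dumps, a complete EDID block can be distinguished from a truncated or all-zero one using two facts from the EDID specification: a base block is 128 bytes beginning with the fixed header 00 FF FF FF FF FF FF 00, and all 128 bytes must sum to zero modulo 256. A small, hypothetical checker (not part of any patch in this bug):

```c
#include <stddef.h>

/* Sanity-check a captured 128-byte EDID block: the fixed 8-byte header
   must match, and the byte sum of the whole block must be 0 mod 256.
   An all-zero or truncated transfer fails the header test at once. */
static int edid_block_valid(const unsigned char *edid)
{
    static const unsigned char header[8] =
        {0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00};
    unsigned int sum = 0;
    size_t i;

    for (i = 0; i < 8; i++)
        if (edid[i] != header[i])
            return 0;
    for (i = 0; i < 128; i++)
        sum += edid[i];          /* byte 127 is the checksum byte */
    return (sum & 0xFF) == 0;
}
```

A partial transfer that fills the header but trails off into zeros would typically still fail the checksum test, which is why the hex dumps are a reliable diagnostic here.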
This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from wdc on 2007-06-12 12:29 EST --

Created an attachment (id=156806)
Run of Xorg 6.8.6 under RHEL 5 -- EDID all zeros

-- Additional comment from wdc on 2007-06-12 12:36 EST --

Created an attachment (id=156807)
Run of Xorg 6.8.6 under RHEL 5 -- EDID partial transfer

I believe this Xorg.0.log output demonstrates we have a bug that WAS NOT introduced between Xorg 6.8.6 and Xorg 7.1.1. I tried to build Xorg 6.8.6 under RHEL 5 but hit a wall. I tried to install RHEL 4.5's 6.8.6 on RHEL 5 but made a mess. After cleaning up the mess well enough to get 7.1.1 running again, I tried a different tack to get Xorg 6.8.6 running just enough to do the EDID transfer. Since RHEL 4.5 was in another partition, I ran Xorg out of there. Additional arguments were needed. The command line that got me far enough was:

/rhel4/usr/X11R6/bin/Xorg -config /rhel4/etc/X11/xorg.conf -modulepath /rhel4/usr/X11R6/lib/modules/

The first new attachment, Xorg.0.log-rh5-6.1-a, is not sufficient. It only shows all zeros in the EDID transfer, and that could be caused by something else not working as we kludge the Xorg run between major Linux versions. The second new attachment, Xorg.0.log-rh5-6.1-b, IS sufficient, I believe, because it shows a PARTIAL EDID transfer. Xorg could not run far enough to be really usable. (It couldn't find font "fixed" because of how things are re-organized.) But I very strongly believe that it DID run far enough to do an EDID transfer, and to manifest EXACTLY THE SAME bug we are experiencing under 7.1.1 under RHEL 5: a timing-dependent flaky EDID transfer.

-- Additional comment from tao on 2007-06-12 14:33 EST --

**** Problem Description
Source : Service Request 1468792
Created by : WDC-RHN
Created on : 29-May-2007 18:06:12

Summary of additional work I did today: I built the X server under RHEL 4.5 with my debug code to do a hex dump of the EDID fetch. The fetch is ALWAYS 100% successful under RHEL 4.5.
I then compared the X code that did the int10 call out to the vm86 system to fetch the data, and I believe the code is pretty much equivalent, so it may be that we are facing a KERNEL bug that was introduced somewhere between 2.6.9 and 2.6.18.

This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from tao on 2007-06-12 14:35 EST --

Customer is looking for an update.

This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from tao on 2007-06-18 14:15 EST --

Customer is not happy that this bug-fix is scheduled for 5.2, as there is no viable option with RHEL 5 with the DVI connector. The posted workaround only works with the VGA connector. Additionally, the customer has posted information on bugzilla w.r.t. this bug.

---------------------------------------------------------------------------------
The email notification system is not working. I HAD NO CLUE that you'd responded to my issue. MIT is preparing to do a renewal of hardware with Dell Optiplex 745 systems running Red Hat Linux and Windows Vista. If we cannot demonstrate that the hardware works, then our renewal plan may be postponed for a year. If someone is working on this problem, I would VERY MUCH like to work with them. With the large number of hours I've spent working this issue, it would be nice to know I was not wasting my time and MIT's resources isolating a known fault. I might be able to help you arrive at a fix for the problem more quickly. Furthermore, since this seems to be a kernel issue, it might put Red Hat in a better light in the Linux community if Red Hat produces the fix that helps not only RHEL, but also Fedora and Ubuntu. Finally, inasmuch as we have ALREADY discussed how important this is, I am a bit disappointed to be told, "We have this in the queue for a release 3 months out and don't want to talk to you about it any more unless you produce a compelling reason why we should."
---------------------------------------------------------------------------------

> Added Note: Currently, vsync and hsync values hardcoded into xorg.conf are used
> as a workaround for this issue.

If you plug in to the DVI port, THIS WORKAROUND DOES NOT WORK AT ALL! So if Red Hat is still handing out this workaround, it's clear you guys don't really understand the problem! Would you PLEASE connect me up with the people who REALLY ARE working on the problem, so their time and my time will not be wasted, and so that we can get this fixed as quickly as possible!

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from marcobillpeter on 2007-06-19 04:01 EST --

Putting this to 5.1; it has the Regression flag set, which seems to be right. If not, I'll set the exception flag. Nevertheless, this is a critical issue for MIT, and Daniel Riek will meet with this client and hear an earful. This problem is part of a showstopper at MIT.

Thanks - marco

-- Additional comment from syeghiay on 2007-06-19 12:18 EST --

The xorg-x11-drv-vesa package is on the 5.1 approved component list. Set pm_ack. Beta Feature Freeze is Jun 27, when errata must be filed.

-- Additional comment from ajackson on 2007-06-25 13:59 EST --

Out of curiosity, does it work reliably when using a xen kernel, or on non-x86? The reason I ask is, vm86 is known to be unreliable when using xen, and is simply unavailable on other arches. So for everything other than bare-metal i386 kernels, we use an x86 real-mode emulator to execute VBE calls. The logs given appear to all be from non-xen machines. I would be thrilled to learn that the emulator is more reliable.

There is also an option to force use of the emulator, by saying:

Option "Int10Backend" "x86emu"

in the ServerLayout section of xorg.conf.

-- Additional comment from tao on 2007-06-25 15:04 EST --

Checking with the customer if this option helps. Customer is _not_ using xen.
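For concreteness, the option ajackson describes goes inside the ServerLayout section of /etc/X11/xorg.conf like this (the Identifier and Screen names are placeholders; only the Option line comes from his comment):

```
Section "ServerLayout"
	Identifier  "Default Layout"
	Screen      "Screen0"
	Option      "Int10Backend" "x86emu"
EndSection
```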
This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from wdc on 2007-06-27 21:33 EST --

It may indeed be that the emulator is more reliable. I've just added that line to the xorg.conf file, and run the X server a couple of times. Previously the EDID buffer would contain a random amount of data, with the rest all zeros. This time the EDID buffer was consistently full, and the data remained the same across multiple runs. This is good evidence that the problem is in the vm86old code. (We're guessing that the code for auditing has inappropriately messed up the registers, and plan to build a kernel to test that theory in a few days.)

The problem here, though, is that people will not be able to run X far enough to put in a fix. What options do you think should be pursued to help people get a default install of RHEL 5 and the other 2.6.18+ kernels to something that works from the get-go?

-- Additional comment from tao on 2007-07-03 11:17 EST --

For self-reference. Sorry for the spam.

**** Fact
Source : Service Request 1468792
Created by : RKHADGAR
Created on : 29-Jun-2007 08:21:44

> "Add this non-standard line to the xorg.conf file"?

* A temporary workaround would be to automate this process, by adding the below to the kickstart script in the post-install section:

system-config-display --noui
sed -i 's/Section \\"ServerLayout\\"/Section \\"ServerLayout\\"\\n\\tOption \\"Int10Backend\\" \\"x86emu\\"/g' xorg.conf

**** Problem Description
Source : Service Request 1468792
Created by : WDC-RHN
Created on : 29-Jun-2007 16:22:31

Thanks for replying quickly and trying to be of further help, but we need to work harder on a solution.

1. No, I did not explain our test sufficiently in my earlier terse note. I am physically in San Francisco, CA (on the opposite coast from my office in Boston), so I have NOT plugged in the monitor to the DVI connector.
I ran a simple test of starting the Xorg server remotely and examining the Xorg.0.log file for the EDID output to see if it was getting through. I believe there is a SECOND X server bug that will need to be investigated after the EDID bug in the kernel is fixed, because I have seen RHEL 5 get perfectly sensible, totally detailed VESA data, but then NOT USE IT to properly configure the monitor.

2. At this time, of the hundreds of customers at MIT that install RHEL, only a handful use kickstart. Your solution will be part of a total solution, but I fear people's habits will require us to do something that does not involve creating a special kickstart setup, and then convincing hundreds of customers to stop "using the CD like I could with Ubuntu" and do something totally new and different. I think the way forward is to agree upon a sensible sequence of steps that will result in RHEL 5 U2 distribution media incorporating a fix that will just work. (Ideally it would be RHEL 5 U1, but I think it is too late in the U1 development cycle to reasonably ask for that.)

This event sent from IssueTracker by rkhadgar issue 121369

-- Additional comment from wdc on 2007-07-10 18:46 EST --

Created an attachment (id=158912)
Patch to cut out the audit call in the int10 emulator.

Today we built a kernel with the attached patch that disables the code that called audit_syscall_exit. Although those nasty error messages about freeing multiple audit contexts came back, the EDID transfers were once again 100% successful. (Yes, I was careful to use an xorg.conf file with x86emu disabled. I tested a stock kernel build to confirm I had a good build process, and that the stock kernel tickled the bug.) So it seems that the way audit_syscall_exit is called is trashing the registers and making the EDID transfer flaky. This is probably appropriately classified as a regression, and probably needs to be fast-tracked to the original author so he or she can fix up the call.
We have a very reproducible test case and test setup for testing candidate kernel patches. (We didn't feel we understood things well enough to propose a change ourselves.)

-- Additional comment from wdc on 2007-07-10 18:58 EST --

I have a bug open at kernel.org where I asked for help looking at this. I'll mention there that this regression is the root cause. Would it be appropriate for Red Hat to weigh in and lobby for examination of that bug?
http://bugzilla.kernel.org/show_bug.cgi?id=8633

Now that we understand the root cause, and have a workaround, what next steps should we take? Ideally the kernel regression will eventually be remedied. Should we consider lobbying freedesktop.org to make x86emu the default int10 backend for x86, in addition to everything else?

There are additional bugs in the X server, once the EDID data is acquired with 100% fidelity:

1. Plugged into the VGA connector, 1400x1024 resolution will configure if requested, but it will chop off the topmost quarter inch and the leftmost inch of pixels. Modern Dell LCDs no longer support the ability to control the vertical or horizontal size, so this is an unpleasant state of affairs.

2. The EDID data provides a detailed modeline for 1680x1050 operation which is ignored.

I guess I should take these up with freedesktop.org. Do people think I should open a Red Hat bugzilla bug on these two issues?

Finally, there is the issue that the X server does not properly report the EDID transfer failure. I will take the freedesktop.org bug I have open about this and lobby for my patch to be considered as a remedy. Here too, I wonder if Red Hat weighing in on the bug would be useful?
https://bugs.freedesktop.org/show_bug.cgi?id=10238

Mr. Jackson et al., what do you advise as the best way forward?

-- Additional comment from ajackson on 2007-07-11 15:53 EST --

(In reply to comment #44)

> Now that we understand the root cause, and have a workaround, what next steps should we take?
> > Ideally the kernel regression will eventually be remedied.
> >
> > Should we consider lobbying freedesktop.org to make the x86emu as int10backend the default for x86
> > in addition to everything else?

We're already doing this for Fedora 7 and later, and I'm certainly telling everyone I can upstream that vm86 is insane. I wish I'd flipped this switch before FC6, so it would have been incorporated in EL5, but the fear that the emulator would prove to be a regression relative to EL4's behaviour was too high. (And justified, it turns out, since several x86emu bugs have been fixed since 5.0.)

In the meantime, I'm investigating a way to magically invoke the x86emu backend for DDC transfers if the vm86 method fails. It's slightly hairy due to namespace issues but I think it's doable. (Setting devel ack for 5.1, we should include this if I get it working.)

> There are additional bugs in the X server, once the EDID data is acquired with 100% fidelity:
>
> 1. Plugged into the VESA connector, 1400x1024 resolution will configure if requested, but it will chop
> off the topmost quarter inch and the leftmost inch of pixels. Modern Dell LCDs no longer support the
> ability to control the vertical or horizontal size so this is an unpleasant state of affairs.
>
> 2. The EDID data provides a detailed modeline for 1680x1050 operation which is ignored.

The X logs in this bz seem to all show the use of the vesa driver. The vesa bios interface is limited in terms of output setup capability. In particular, there are two sets of modes: the set that the monitor reports it can display, and the set that the bios reports it can configure. It's literally not possible to ask the bios to set up a mode outside its list, so the best we can do with the vesa driver - or any other driver that uses the vesa bios mode setting interface - is pick a "good" mode that happens to be in both lists. So regarding these two issues, assuming they're occurring with the vesa driver.
The first sounds like we're either picking a mode that's larger than the monitor - in which case, 5.1 includes a vesa driver update that should address this issue - or that the mode we're selecting is not being programmed properly by the video bios, in which case we're just out of luck. The second problem sounds like the 1680x1050 mode is advertised by the monitor but not by the bios, in which case we are again out of luck. If my assumptions are incorrect here, I would certainly like to see an X log of the failure case(s).

In general, these limitations mean that although the vesa driver is supported, it's not recommended for regular use, and we strongly prefer that people use native drivers wherever possible. The configuration infrastructure in EL5 should be smart enough to pick the correct native driver when one is available.

> Finally there is the issue that the X server does not properly report the EDID transfer failure. I will take
> the freedesktop.org bug I have open about this and lobby for my patch to be considered as a remedy.
> Here too, I wonder if Red Hat weighing in on the bug would be useful?
> https://bugs.freedesktop.org/show_bug.cgi?id=10238

That looks pretty good; I'll take it up upstream. Thanks!

-- Additional comment from wdc on 2007-07-11 16:21 EST --

Invoking x86emu if the DDC fails sounds hairy, scary, and a lot of work. Thanks for putting in the effort to make it right! Indeed the X resolution issues I am having are occurring with the VESA driver. Apparently the x.org ATI driver does not yet know about the R500 chip set that the x1300 and x1400 use. The reverse engineered driver will, I'm sure, eventually benefit this driver. It will be interesting to test the RHEL 5.1 X server to see which driver it picks. I'll attach Xorg.0.log output showing the 1680x1050 mode that the EDID fetch offers, and how it's not used.
I'm still not sure I'm totally up to speed on reading the log output, so I'd be grateful if you'd call my attention to the lines where the BIOS denies support for that mode. Is it in those long, detailed segments? Indeed I see a 1600x1200 go by, and a 1400x1050 go by, but indeed no 1680x1050.

-- Additional comment from wdc on 2007-07-11 16:27 EST --

Created an attachment (id=159000)
Log of proffered but unused 1680x1050 resolution

See lines 461 and 462:

(II) VESA(0): h_active: 1680 h_sync: 1728 h_sync_end 1760 h_blank_end 1840 h_border: 0
(II) VESA(0): v_active: 1050 v_sync: 1053 v_sync_end 1059 v_blanking:

and line 488:

(II) VESA(0): Modeline "1680x1050" 119.00 1680 1728 1760 1840 1050 1053 1059 1080 -hsync +vsync

Here the VESA transfer offers the mode. Why exactly isn't it being used?

-- Additional comment from wdc on 2007-07-11 17:32 EST --

I just had a thought! How will you detect a bad EDID transfer? The kernel bug causes the transfer to OFTEN come up all zeros, but sometimes it gets a partial transfer padded out with zeros. Does the EDID block have a checksum in it that you can compute and test? The current code just looks at the first few bytes for a version number and uses that to decide the transfer was good. If you can't detect a zero-padded partial transfer, then your additional work to use x86emu may be wasted.

-- Additional comment from benl on 2007-07-12 12:11 EST --

+ qa_ack for rhel-5.1.0 QA: we'll need some feedback from the customer on this one.
-- Additional comment from ajackson on 2007-07-26 13:42 EST --

(In reply to comment #47)
> Created an attachment (id=159000) [edit]
> Log of proffered but unused 1680x1050 resolution
>
> See lines 461 and 462:
>
> (II) VESA(0): h_active: 1680 h_sync: 1728 h_sync_end 1760 h_blank_end
> 1840 h_border: 0
> (II) VESA(0): v_active: 1050 v_sync: 1053 v_sync_end 1059 v_blanking:
>
> and line 488:
>
> (II) VESA(0): Modeline "1680x1050" 119.00 1680 1728 1760 1840 1050
> 1053 1059 1080 -hsync +vsync
>
> Here the VESA transfer offers the mode. Why exactly isn't it being used?

That's the EDID block's mode list. Remember, I can only set modes to things in the intersection of: in the VESA BIOS's mode list, and within the capabilities reported by EDID. So, yeah, 1680x1050 in the monitor, but not in the video BIOS, means no 1680x1050 for you.

(In reply to comment #48)
> How will you detect a bad EDID transfer? The kernel bug causes the transfer to OFTEN come up all zeros,
> but sometimes it gets a partial transfer padded out with zeros. Does the EDID block have a checksum in it
> that you can compute and test? The current code just looks at the first few bytes for a version number
> and uses that to decide the transfer was good.

Yes, there is a checksum. The last byte is set such that a cumulative sum of all bytes in the block, modulo 256, is 0. We do use this to reject bad EDID blocks. See DDC_checksum() in hw/xfree86/ddc/edid.c, and its caller in hw/xfree86/ddc/xf86DDC.c.

-- Additional comment from wdc on 2007-07-26 17:04 EST --

I've looked at the code in xf86DDC.c, but there's something that confuses me: How come I never saw a checksum error report in the log? Clearly I was getting bad EDID reads. What determines whether the code that's doing the EDID fetch is from hw/xfree86/vbe/vbe.c, where it can silently fail (unless you've taken my patch ;-) ) and where no checksum is computed in the readEDID routine, versus the code that's in hw/xfree86/ddc/edid.c?
Or are you saying that you plan to add checksum stuff like in ddc/... to vbe/...?

----

Thanks also for the clarification about the BIOS thing.

-- Additional comment from wdc on 2007-08-02 18:42 EST --

Andrew: I just installed the X server and VESA driver from the RHEL 5.1 beta. Alas, it does one thing that is admittedly more correct but less desirable to me: Previously, somehow the server would see that the display could handle 1600x1080, and even though no 1400x1024 mode was specifically offered, it would configure that mode. (This got us into trouble when connected to the analog VESA port, but worked just fine on the digital port.) Now, because there is no exact match, the display that used to be 1400x1024 is configured for 1280x1024. By the same token, that particular monitor offers 1600x1050, but not 1600x1200, so even though the vesa driver is improved and has a 1600x1200 mode, 1600x1050 is not configured because it is not an exact match.

Wasn't there partial match code being worked on? I thought it was already in place. Somebody is suffering with the latest Ubuntu because their card supports 1280x1024, but their display only supports 1280x800. That setup ends up finding no matching modes whatsoever.

I am concerned here that people will have gotten used to running 1400x1050 on these monitors under RHEL 4, but will now get the degraded resolution of 1280x1024 after "upgrading" to RHEL 5.1. I will attach the xorg.conf file and the Xorg.0.log files so that this all can be rigorously documented.

-- Additional comment from wdc on 2007-08-02 18:49 EST --

Created an attachment (id=160558)
xorg.conf file used for testing RHEL 5.1 beta X server

-- Additional comment from wdc on 2007-08-02 18:50 EST --

Created an attachment (id=160559)
Log of run of RHEL 5.0 debugging X server. It sets 1400x1050.

-- Additional comment from wdc on 2007-08-02 18:51 EST --

Created an attachment (id=160561)
Log of run of RHEL 5.1 X server and vesa driver.
Configs 1280x1024

-- Additional comment from wdc on 2007-08-10 13:52 EST --

Sorry to be a pest here. I expect there are many important issues being worked on as RHEL 5.1 beta testing proceeds. I am concerned that people are going to consider this an improper regression in behavior. If there were a plan of attack for addressing it, I might be able to help do the work.

-- Additional comment from ajackson on 2007-08-14 10:00 EST --

The patch looks something like: http://people.redhat.com/ajackson/omg-vbe-hax.patch Utterly untested atm; going to try to hit that today.

-- Additional comment from wdc on 2007-08-14 13:18 EST --

Although I've not bench-checked it carefully, the patch looks plausible. The issue that concerns me is not so much the EDID thing at the moment, but that the VESA update to the X server currently on track for dissemination as part of the RHEL 5.1 update does a worse job than the present one at finding the highest resolution, even when the EDID transfer is 100% successful. Andrew, should I open a different bug about that? What do you think is the way I can be most helpful in identifying the root cause and fixing the new regression?

-- Additional comment from ajackson on 2007-08-15 12:02 EST --

(My name's Adam, btw.)

(In reply to comment #58)
> Although I've not bench checked it carefully, the patch looks plausible.
>
> The issue that concerns me is not so much the EDID thing at the moment, but that the VESA update to the
> X server currently on track for dissemination as part of the RHEL 5.1 update does a worse job than the
> present one at finding the highest resolution even when the EDID transfer is 100% successful.

Yeah, that's intentional. The issue is that you _really_ want to try for strict intersection of modes between the monitor and the video BIOS in this case. There do exist monitors where the EDID list is literally all it can do.
Worse, there are monitors where, if (like your example) there's a VBIOS mode between the two largest EDID modes like so:

    VBIOS           EDID
    A: 1680x1050
    B: 1400x1050
    C: 1280x1024

and you attempt to set mode B, then the monitor will try to sync as though it's mode C and the rest will just be off the screen. Or go blank. Either one is unacceptable.

The other case we ran into was some laptop panels, which give you a mostly-nonconformant EDID block that just contains a mode for the panel size and nothing else, and of course no matching mode in the VBIOS. In that case, strict intersection of mode lists would mean the server just fails to start.

So the new heuristic is: Attempt strict intersection. If doing so produces a non-empty mode list, then use it. Otherwise, revalidate the VBIOS mode list against a range-based model of the EDID properties (using the sync ranges from EDID if available, otherwise synthesizing them from an assumed minimum size of 640x480@60 and a max of whatever the EDID block reports as maximum), in the hope that _something_ will survive validation and work.

This seems to be the least wrong thing to do. Nonconformant panels get a best effort, conformant panels get whatever the best intersection of BIOS and EDID modes is, and we don't go wrong trying to do something the monitor doesn't explicitly claim it's capable of doing. This does mean some setups that used to work at mode B (in the example above) now won't, but they'll still light up; in exchange, some panels that would fail to do the right thing in mode B now do _a_ right thing, even if that happens to be mode C. The vesa driver is intended to be a conservative fallback driver anyway, so the real solution to the mode B scenario is to use a native driver that doesn't use the VBIOS for output setup.

-- Additional comment from wdc on 2007-08-15 13:11 EST --

Thanks very much for taking the time to provide a detailed clarification.
In light of those details, I'd have to agree that the new behavior is the least wrong thing to do.

-- Additional comment from ajackson on 2007-08-23 13:37 EST --

After some technical review, I've concluded that the patch in comment #57 is a bad idea. The act of initializing an int10 context on a non-primary card has the side effect of posting the card. This will blow away any state set up by the driver prior to the VBE DDC call, which will almost certainly mean bad rendering at best, and failure to launch or a system hang at worst. There's a more invasive change one could do where you'd set up the shadow x86emu context _really_ early, and make sure to use the same maps for both vm86 and x86emu execution, but that seems like a ton of work for very little return. Particularly since we know newer kernels have a working vm86 syscall. Fixing the kernel definitely seems like the right thing here.
Nominating for 5.2. The short summary is the vm86 syscall seems to be stomping register state before returning to userland, which confuses (among other things) the VESA DDC code, so the vesa driver can't ask the monitor about capabilities. Relevant links from the above discussion: https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=158912 http://bugzilla.kernel.org/show_bug.cgi?id=8633 Note that current Rawhide kernels are fine here, due to some churn in the area.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Moved the Issue Tracker number to this bug from the original.
Re: Comment #1: Are you sure that there are Rawhide kernels that are ok? I'd be willing to test one, but I'm not convinced that the problem has actually been fixed. (Perhaps there have been conversations outside the bugzilla and other trackers that I've not been privy to.)
(In reply to comment #4)
> Re: Comment #1: Are you sure that there are Rawhide kernels that are ok?
> I'd be willing to test one, but I'm not convinced that the problem has actually been fixed.

Apologies, I'm not actually sure. I had discussed this on IRC with another engineer, who pointed out several changes in do_sys_vm86() that looked likely related. RHEL5 had:

    if (unlikely(current->audit_context))
        audit_syscall_exit(AUDITSC_RESULT(eax), eax);

Whereas recentish rawhide has:

    if (unlikely(current->audit_context))
        audit_syscall_exit(AUDITSC_RESULT(0), 0);

So the rawhide version is almost certainly touching less register state. Between that and not seeing any reports of this failure on rawhide, I probably jumped to conclusions. But yes, testing this on rawhide would be valuable.
William, Currently the patch that #if 0's the audit_syscall_exit code will not pass internal code review. So we will need to get a valid patch for RHEL5.2. Having the results of the rawhide kernel will get us one step closer to getting a fix into the kernel in the 5.2 timeframe. Please let us know as soon as you get those results. Thanks, Jeff
Created attachment 206251 [details]
Attempt to extract syscall exit fix from 2.6.20 kernel

Acknowledged. I'd planned on getting back to testing sooner, but got swamped here with some RHEL 5 customer documentation tasks. My friend Chuck and I have tried some testing, and here is where we've gotten to today.

One point of clarification: The patch with the #if 0 in it was explicitly not for adoption. It was illustrating the minimal scope of the broken code. A careful reading of our bug report said that we were unsure what the correct fix was, and wanted to assist with the testing of a fix. But, as you ultimately concluded, testing with a new kernel was the right next step.

Question about a kernel to test with: Up until now I'd let others do testing with beta components. I understand the term "rawhide" refers in a generic way to the bleeding edge beta. But that would be Fedora, not RHEL 5, right? Indeed, the 2.6.23 kernel at download.fedora.redhat.com:/pub/fedora/linux/development/i386/os/Packages/ will not install under RHEL 5 without an update to mkinitrd. The newer mkinitrd requires half a dozen lower level libraries to be updated before it will install. This seems like it is mutilating too much of the RHEL 5 environment.

Bottom line: Is there a "rawhide" kernel that I can just drop into RHEL 5, or did you indeed mean that I should install Fedora with a 2.6.18 kernel, re-run all my tests to confirm I've got a clean failure, and then try the Fedora 2.6.23 kernel and see if the failure goes away?

----

Trying a rawhide kernel may be moot, however, because of some other things we have learned today. Back on December 7, 2006, Jeremy Fitzhardinge checked in some changes to vm86.c that look 100% relevant to our problem.
Excerpt from the kernel.org ChangeLog-2.6.20 (http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.20):

commit 49d26b6eaa8e970c8cf6e299e6ccba2474191bf5
Author: Jeremy Fitzhardinge <jeremy>
Date: Thu Dec 7 02:14:03 2006 +0100

    [PATCH] i386: Update sys_vm86 to cope with changed pt_regs and %gs usage

    sys_vm86 uses a struct kernel_vm86_regs, which is identical to pt_regs, but
    adds an extra space for all the segment registers. Previously this structure
    was completely independent, so changes in pt_regs had to be reflected in
    kernel_vm86_regs. This changes just embeds pt_regs in kernel_vm86_regs, and
    makes the appropriate changes to vm86.c to deal with the new naming. Also,
    since %gs is dealt with differently in the kernel, this change adjusts vm86.c
    to reflect this. While making these changes, I also cleaned up some frankly
    bizarre code which was added when auditing was added to sys_vm86.

    Signed-off-by: Jeremy Fitzhardinge <jeremy>
    Signed-off-by: Andi Kleen <ak>
    Cc: Chuck Ebbert <76306.1226>
    Cc: Zachary Amsden <zach>
    Cc: Jan Beulich <jbeulich>
    Cc: Andi Kleen <ak>
    Cc: Al Viro <viro.org.uk>
    Cc: Jason Baron <jbaron>
    Cc: Chris Wright <chrisw>
    Signed-off-by: Andrew Morton <akpm>

----

Chuck and I made an attempt to isolate just the relevant change and test with it. The result was that the EDID transfer always failed. So either Fitzhardinge's amendment to audit_syscall_exit was insufficient, or we didn't take enough of the patch to get correct operation. Indeed our understanding of Fitzhardinge's work is poor, and it is most likely that we incorrectly extracted his cleanup of the audit code. Attached is that trial patch.

Chuck and I are evaluating what the right next step would be: To go down that Fedora testing path? To re-examine Fitzhardinge's patch and more fully understand his audit fix? To ask Andi Kleen at kernel.org to re-examine our bug report in the context of the Fitzhardinge patch he signed off on in December 2006? What do you recommend as a next step?
Owing to the relevance of this patch, is there value in involving Jason Baron of Red Hat, who was on the reviewer list of Fitzhardinge's patch? My concern is that we understand fully enough what is going on to clearly demonstrate that the problem is fixed, not just driven underground by code having landed in a different place. Therefore, I see value in getting the minimal correct chunk of Fitzhardinge's changes onto my test bed with minimal other changes.
William, Thanks for the detailed feedback. It looks like the patch set you made reference to:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=49d26b6eaa8e970c8cf6e299e6ccba2474191bf5

is part of the 2.6.20 baseline. So you should be able to use our RT Vanilla kernel to test with. Please note this is an _unofficial_ kernel to be used only for testing this issue. You can download the kernel from here:

http://people.redhat.com/jburke/kernel-rt-vanilla-2.6.21-39.el5rt.i686.rpm

The above kernel should just install on top of RHEL5. Thanks, Jeff
Thanks VERY much for that kernel. It made things much easier, and saved us a lot of work. GOOD NEWS: The kernel indeed ran without trouble on the RHEL 5 test system. GREAT NEWS: The problem with EDID transfers through vm86.c appears to be resolved. I ran the X server several times, and each time the EDID fetch was 100% complete. I strongly suspect that Fitzhardinge's patch remedied the problem. Any guess when 2.6.20 or later will make it to public RHEL 5?
William, That is good news. So we have a starting point from which to backport a fix.

Q. Any guess when 2.6.20 or later will make it to public RHEL 5?
A. We never do major kernel upgrades in a release. So 2.6.18 is the kernel for the life span of RHEL5. But we will do backports of fixes into the 2.6.18 RHEL kernel.

I think the next step is for me to isolate the portions of the patch set that actually fix the issue. Or come up with an alternative patch. I will get back to you when I have a test kernel. If you don't hear back from me in an acceptable time frame, please feel free to ping me on the issue. Thanks, Jeff
That sounds like the sensible way forward.
Hi William, what version of the RHEL5 kernel are you running? Since we weren't able to reproduce this yet in house, and there have been a number of audit changes, perhaps you can try to reproduce this on the latest RHEL5 beta kernel, linked below. Thanks.

http://people.redhat.com/dzickus/el5/51.el5/i686/kernel-2.6.18-51.el5.i686.rpm
Also, another variable here that might be interesting is whether turning off auditing works around this. You can disable audit by doing:

/sbin/chkconfig auditd off

and then rebooting. You can verify audit on/off by doing:

/sbin/auditctl -s

and verifying that enable=0. Thanks.
Jason, The kernel that we have been working with up to now was 2.6.18-6. I fetched your 2.6.18-51 kernel, and the EDID fetch fails with 100% repeatability for me. (But it *IS* nice to not have to wait for the SATA timeouts. I'm really looking forward to THAT fix showing up in the production kernel asap.) Turning off the audit daemon has no effect. The EDID fetch through vm86 still fails with 100% repeatability. Through kernel.org, we've been in touch with Jeremy Fitzhardinge in an attempt to craft the minimal effective patch. Alas, our second attempt has also been unsuccessful. My current plan is to show the patch I tried to Fitzhardinge and have him tell me how I got it wrong.
Grrr! Typo. The kernel that we have been working with is 2.6.18-8.
Status update: A couple of folks from kernel.org proposed a couple of variations on the minimal patch to correct the audit_syscall_exit code. Sadly, none of those candidate patches worked. I'm pursuing as a next step an integration of the whole pt_regs patch (commit 49d26b6eaa8e970c8cf6e299e6ccba2474191bf5) from kernel.org to see if it has a beneficial effect.
William, When you have time, can you please test the following kernel?

http://people.redhat.com/jburke/kernel-2.6.18-53.el5.bz254024.1.i686.rpm

Please post the test results. Thanks in advance, Jeff
I am sorry to report that, although the kernel seemed to run just fine, the EDID transfer still came back wrong.
William, I appreciate you running the test kernels. If you wouldn't mind, I have one additional test for you to try when you have time.

http://people.redhat.com/jburke/kernel-2.6.18-53.el5.bz254024.3.i686.rpm

Again, post your results when you can. Thanks in advance, Jeff
Alas, still no joy. The EDID transfer still came up all zeros.

Here's an observation I tripped over in other testing. I'd updated an RHEL 4 WS system to RHEL 5 Server. (I was seeing what would happen to apps like Open Office across such an update. It wasn't pretty. I opened a trouble ticket. But I digress...) That system had two kernels installed: the PAE and the vanilla 2.6.18-8 kernel. The vanilla kernel consistently had bad EDID transfers, but booting the PAE kernel on that setup had consistently good EDID transfers!

I'll also note in passing that the EDID transfers are so flaky that RHEL 5 x86_64 gets sufficiently confused that it won't start X at all with the xorg.conf it creates at install time. Let me say this again: I'm booting RHEL 5 i386 and RHEL 5 x86_64 out of different partitions of the same system. I did clean installs from RHEL 5.0 media. The i386 will actually start X, but the x86_64 won't until I hand-tool the config file. This one is probably due to some Xorg app behaving differently in 64 bits. Sheesh!
William, Sorry to rehash this at this point in time, but I would like to clear up some confusion. Your Comment #20 makes me think we have a disconnect. The patch you posted to the bz in Comment #7 is for i386; the bz is opened for i386. The kernel-rt-vanilla-2.6.21-39.el5rt.i686.rpm I pointed you at in Comment #8 was for i686. The git commit from Fitzhardinge, 49d26b6eaa8e970c8cf6e299e6ccba2474191bf5, was for i386. So I have been only looking at i386. Please correct any of the following statements below that are not true and add any additional data you think is relevant.

1.) 2.6.18-8.EL.i686 kernel will start X ... sometimes?
2.) 2.6.18-8.ELPAE.i686 kernel works without issue.
3.) 2.6.18-8.ELxen.i686 kernel unknown
4.) 2.6.18-8.EL.x86_64 kernel fails always.
5.) 2.6.18-8.ELxen.x86_64 kernel unknown.
6.) 2.6.18-53.EL.i686 ?
7.) 2.6.18-53.EL.x86_64 ?
8.) Regardless of kernel, the 3rd party EDID application always works.

Thanks in advance, Jeff
Don't panic. The i386 subdirectory is stuff that's built also for the i686 kernel. When I build my test kernel I build the i686 kernel, but it uses code from i386 for vm86.c. I've not run the stand-alone read-edid under multiple kernels. I stopped testing with it when I satisfied myself that it always succeeded, whereas X always had a bad EDID fetch under the 2.6.18-8.EL.i686 kernel.

For your case #1 above, I'd not say "will start X ... sometimes". X starts; it just has a flaky EDID read that one either can or cannot work around in xorg.conf. I suggest the most correct way to characterize the test is, "Full and correct EDID fetch of BIOS data from the video card." So I'd recast your list as follows:

1.) 2.6.18-8.EL.i686 kernel flaky EDID fetch under X. Fine under get-edid.
2.) 2.6.18-8.ELPAE.i686 kernel apparent 100% reliable EDID fetch under X.
3.) 2.6.18-8.ELxen.i686 kernel unknown
4.) 2.6.18-8.EL.x86_64 kernel flaky EDID fetch under X, but not rigorously tested. (X seems harder to configure under RHEL 5 and 4.5 on the x86_64 platform, but that may be unrelated to this bug.)
5.) 2.6.18-8.ELxen.x86_64 kernel unknown.
6.) 2.6.18-53.EL.i686 kernel flaky EDID fetch under X.
7.) 2.6.18-53.EL.x86_64 unknown.
In comment #16, I said I'd pursue an attempt to integrate the whole pt_regs patch. I finally got some time and attempted that today. Unfortunately, Fitzhardinge's whole patch is based on the 2.6.20 kernel's definition of struct pt_regs, which has the element:

    int xgs;

which is missing from the 2.6.18 kernel. Alas, I don't know enough about register saving and restoring to have what I would consider a useful suggestion going forward. Perhaps having that extra register in the pt_regs struct is allowing space to save a register that the audit code trashed in 2.6.18? Perhaps the fix to this bug is in some seemingly unrelated region of the 2.6.20 kernel?

Jeff: If there are specific things that you've been integrating into the candidate test kernels you've been sending me, I'd be happy to bench-check or review them for applicability to this problem.

There ARE a couple more radical positions we could take with this issue:
1. Discourage use of vm86.c for EDID transfers, and push for use of the emulator.
2. Encourage use of ATI drivers that use the native register set, and abandon the VESA compatibility layer as something that is these days getting insufficient attention.

It would be nice if we had a clearer sense of why 2.6.20 works whereas 2.6.18 does not. Is there a 2.6.20 withOUT Fitzhardinge's patch, built for RHEL5, that I could test? At least then we'd have a single delta across which to test.
William, Ubuntu bug 89853 (https://launchpad.net/ubuntu/+source/xorg/+bug/89853) got solved by an updated vesa driver rpm: xserver-xorg-video-vesa 1.3.0-1ubuntu5. I know in that bug folks could read EDID properly, and it might not be the same issue. Still, it might be worth testing out the new vesa driver once to see if it helps.

A few things are confusing:

- The RHEL 5 PAE kernel works fine. That means the vm86() implementation is not necessarily bad, as the code base is the same.
- read-edid works fine. So why is there an issue with X? I think I should print the register states before and after the vm86() call and see if there are any registers which have not been restored after the system call.
- Jeff mentioned that he backported Jeremy's upstream patch to RHEL5 and it did not help. That means the issue is probably somewhere else.

I am still trying to find a machine with an ATI Radeon 1300 card to see if I can reproduce the issue here.
Ubuntu runs with the 2.6.20 kernel these days. That kernel has the entirety of Fitzhardinge's patch, with the additional element in struct pt_regs. It may be that a new VESA driver fixes the problem. If I get some time, I'll try to compare the new and old VESA drivers. However, from what I currently know, I think the problem most likely went away in Ubuntu not because of a new VESA driver, but because of the 2.6.20 kernel and that additional element in struct pt_regs. Printing the register state before and after the vm86 call seems precisely the thing to test. Let me see if I can run that test.
(In reply to comment #1)
> Nominating for 5.2. The short summary is the vm86 syscall seems to be stomping
> register state before returning to userland, which confuses (among other things)
> the VESA DDC code, so the vesa driver can't ask the monitor about capabilities.

Adam, I am running gdb on read-edid and have captured the 16-bit register states for RHEL5 (-58) and an upstream kernel (2.6.24-rc4). Of course read-edid is successful in both cases. I just wanted to see if vm86() is really stomping over any of the 16-bit registers, which could potentially confuse the real-mode vesa code. As per the gdb output, it does not look like vm86() is corrupting any of the 16-bit registers. The only registers which seem to be touched are _null_es (in the case of rhel5) and _null_fs (in the case of 2.6.24-rc4). But this should not make a difference, as the 16-bit code is not going to load these segment selectors. Instead it would use es, ds, fs, gs (these are stored at the end of struct vm86_regs). And looking at the register states, the actual ds, fs, gs, es seem to be fine even after multiple calls to vm86(). I am going to attach two files, one for 2.6.24-rc4 and one for rhel5 (-58). These files contain the output from gdb after the call to vm86(). They were taken for the read-edid utility. Do you have more info regarding which registers are stomped by vm86() at what point in time?
Created attachment 289391 [details] rhel5(.58) gdb output for read-edid
Created attachment 289411 [details] 2.6.24-rc4 gdb output for read-edid
I believe that attempting to reproduce the problem by running the read-edid utility will not be helpful. We ALWAYS get a good EDID transfer with the small stand-alone read-edid utility. The failure only occurs when the big, messy X server is run. If you give me instructions on how you instrumented either the kernel or the app, I'll reproduce that effort with the X server on my "always fails" test setup.
I ran into an interesting problem the moment I upgraded my X server to the 48.26.el5 release. Now if I boot my system, X initializes at 800x600 resolution. If I restart the server, it initializes the display at 1024x768. In the first case there is no EDID transfer according to the Xorg logs, and in the second case there is a valid EDID transfer. This does not happen every time; I saw it only 4 times out of 10-12 reboots. It is not the same problem as reported, but it looks very similar. I have a Radeon 300 card. I am trying to find out how to run X under gdb and then debug how the vm86 calls are being made. The interesting thing is that this problem appeared only after I upgraded my X server. I am attaching the success and failure logs.
Created attachment 289806 [details] X logs when server initializes to 800x600 and no EDID data
Created attachment 289807 [details] X server logs when server initializes with 1024x768 with EDID data
Interesting. I have a few observations:

1. You're running the radeon driver now, not the vesa driver. But apparently the radeon driver also needs a successful EDID transfer to get the video modes.

2. This more or less confirms that the fix for the failed EDID transfer would not come from an updated vesa driver, since you're not using that driver any more.

3. Sorry that it's not failing hard for you. In the early days, mine didn't get a bad EDID transfer every time either. IMPORTANT: Check the hex dump of the EDID data. You may be getting more failures than you think. Look for that hex dump to sometimes be complete, and sometimes to have zeros in it starting at a random point in the block. THAT'S the manifestation I see. CONGRATULATIONS! You are in fact seeing the exact failure I see.

4. It looks like the radeon driver does a poor job of noticing a failed EDID transfer. I think it should actually flag the transfer as unsuccessful instead of silently reporting data or not. In the vesa driver, I fed a patch upstream to test the returned version number and print an error, rather than continuing as if the EDID transfer were successful even when the version number came back as garbage.
William, I went a little deeper into the radeon driver and found that it reads the EDID data from the monitor over I2C/DDC. So radeon is not using int10, and hence not vm86, at all on my system. That means the EDID transfer is flaky over I2C too, not just when we use int10 via vm86(). Well, that would be a different problem altogether. How did you switch drivers (vesa vs. fglrx)? I want to force the use of vesa here instead of radeon and see what it does on my system.
I didn't do anything to explicitly force the VESA driver. I think that, on install, the device ID was not recognized by any other driver, so VESA was set as the default. Presumably xorg.conf gives the device driver as "radeon" or "fglrx" now, and you could change that to "vesa".
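For anyone following along, forcing the vesa driver as suggested is a matter of editing the "Device" section of /etc/X11/xorg.conf. A minimal sketch (the Identifier string is a placeholder from a typical generated config; yours may differ):

```
Section "Device"
        Identifier "Videocard0"
        Driver     "vesa"
EndSection
```

Restart the X server after the change for the new driver to take effect.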
OK. Now I have the vesa driver installed, which uses real mode BIOS calls to get the EDID data. I am not seeing any issues on my system. I am also able to use gdb now and print the register states before and after the sys_vm86() calls while retrieving the EDID. I can't see any corruption happening. I have collected register states for the RHEL5 (.58) and 2.6.24-rc4 kernels and will attach the logs. William, I will prepare an Xorg rpm, send it to you, and also describe the procedure for running under gdb. Run Xorg under gdb and see if you can still reproduce the issue. If yes, it should be able to tell us where EDID retrieval fails and what the register states are.
Created attachment 290180 [details] Register states before and after vm86() calls for 58.el5 kernels
Created attachment 290181 [details] Register states before and after vm86() calls for 2.6.24-rc4 kernels
William, this is what I did to compile my Xorg server and debug it with gdb. You might want to do the same to gather some data on your machine.

- Download the Xorg source rpm from the following link: http://people.redhat.com/vgoyal/.xserver-edid-issue/xorg-x11-server-1.1.1-48.26.el5.src.rpm
- Install the rpm: rpm -ivh xorg-x11-server-1.1.1-48.26.el5.src.rpm
- Go to the /usr/src/redhat/SPECS/ dir and start building the rpm: rpmbuild -bc xorg-x11-server-1.1.1-48.26.el5.src.rpm
- Once it starts compiling (after applying patches, running configure, etc.), you can Ctrl-C the command and move the /usr/src/redhat/BUILD/xorg-server-1.1.1 dir to your working dir.
- cd xorg-server-1.1.1 and edit "configure" to get rid of the "-O2" flag. I want to get rid of optimization; otherwise it becomes difficult to work with gdb and the source code.
- Run ./configure
- Run make
- Run make install
- Now you have built and installed your own Xorg server.
- SSH to the target machine from a different machine and run Xorg under gdb: gdb --args /usr/bin/Xorg --dumbSched
- You can put a breakpoint at vbeDoEDID and follow the whole flow from there. First it makes vm86 calls to check whether DDC is supported. After that it reads the EDID data using vm86 calls.
- You can collect the register states before and after the vm86 calls. vm86() calls exit frequently for various reasons (signals, interrupts, etc.) and are restarted by 32-bit userspace. I suspect that in your environment it might be exiting for some reason and returning with bad EDID. You will have to track the control flow in the success and failure cases.
It's been a VERY long time since I ran the X server under gdb. I needed some clues, and although your procedure did not work for me, I was able to come up with one that did. Rather than building the X server the way you described, I used the X server I'd already built and had been testing with. I was able to run gdb on that X server by just doing

rpm -ivh xorg-x11-server-debuginfo-1.1.1-48.13.0.2.i386.rpm

which installed the debuginfo for the server I was already testing with. When I tried to run Xorg having started gdb as you instructed, I got the error:

Fatal server error: Unrecognized option: --dumbSched
Program exited with code 01.

but restarting gdb without that argument resulted in an X server starting up and allowing itself to be tested. I put the breakpoint in vbeDoEDID. Doing "p/x *ptr" didn't work for me, but "info registers" did. I hit "n" to get up to the call to xf86ExecX86int10(pVbe->pInt10); and did "info registers" before and after the call. I have attached the output of this effort for the 2.6.18 kernel that breaks EDID, and for the 2.6.21 kernel on which EDID is successful.

In both cases, the contents of the registers seem totally fine. But on reflection, that should be no surprise. The problem is NOT that the registers get corrupted inside the X server. The problem is that the registers get corrupted INSIDE THE KERNEL, when the implementation of the int10 code calls into the audit subsystem.

When you say that "it is reading EDID data from monitor over I2C/DDC", is it possible that that path eventually does an int10 call in the kernel to fetch the data?

I call your attention to the simple patch to the 2.6.18 kernel that disables the call to audit from within the int10 code in the kernel (it is in the original bug from which this bug is cloned): https://bugzilla.redhat.com/attachment.cgi?id=158912 Although we cannot ship a kernel with that change, when you make it, EDID transfers magically become reliable on my system.
Bottom line: we need to see what the register corruption is INSIDE THE KERNEL IMPLEMENTATION of the vm86 call. Testing the registers in X will not help. Any guess how we might instrument things inside the kernel without disturbing what we're already doing? Perhaps this is a job for Xen? Can one single-step a kernel from within an emulator? Then again, this might all simply be an MMU race condition that manifested in 2.6.18 and went back underground in 2.6.20 and later.
Created attachment 290211 [details] register dump before and after vm86 call in X server. 2.6.18 kernel
Created attachment 290212 [details] register dump before and after vm86 call in X server. 2.6.21 kernel
(In reply to comment #41)
> When you say that, "it is reading EDID data from monitor over I2C/DDC" is it possible that that stuff
> eventually does an int10 call in the kernel to fetch the data?

I don't think so. The I2C protocol reads the data; AFAIK, real mode BIOS services (int10) are not involved.

> Any guess how we might instrument things inside the kernel without harming what we're already
> doing? Perhaps this is a job for Xen? Can one single step a kernel from within an emulator?
>
> Then again, this all might simply be an MMU race condition which manifested in 2.6.18, and is
> back underground in 2.6.20 and later.

A few things:

- There are two kinds of register state: the 32-bit register state and the 16-bit register state. The 32-bit state belongs to the 32-bit user space (the actual Xorg server code running); the 16-bit state belongs to the real mode BIOS code being executed.
- The 32-bit user space sets the 16-bit registers (pInt->cpuRegs) and calls vm86(). This system call sets up various things and then starts executing the real mode code (creating the environment from the 16-bit registers first). If there is any interrupt, signal, etc., the real mode code is interrupted and control goes back to 32-bit user space. Before returning, the kernel also passes the new real mode register state back to user space.
- The 32-bit user space makes the vm86() call again with the new 16-bit register state, and this goes on until the real mode code has completed.
- If you do just "info registers", it gives you the 32-bit register state. We are mostly not interested in that; if something were wrong with it, user space would most probably have died.
- We are interested in the 16-bit registers. A pointer to them is stored in pInt->cpuRegs.
- There are different ways of printing this register state depending on which function you are in and where you have put your breakpoint.
- You can also put two more breakpoints.
One on xf86ExecX86int10() and the other on do_vm86().

- You can print the real mode registers as follows (if you are not in vm86_rep()): p/x *(struct vm86_struct*)pInt->cpuRegs
- If you have stepped into the function vm86_rep(), you can use: p/x *ptr
- I think inspecting the real mode registers from user space is easier. Even if the kernel has corrupted them, we should see that in user space after the call has returned. Remember, vm86() is called thousands of times to finish one EDID retrieval, so you can sample the registers in the first few invocations and again after the last one. If vm86() finishes early without completing the EDID call successfully, we should notice the difference in the register states.
- I know that by getting rid of the audit code you don't see the problem. I browsed through the audit code quickly and can't find anything that plays around with the 16-bit register state. Secondly, the same audit code is executed in the RHEL5 PAE kernel, which does not see the problem. So at this point it is difficult to conclude anything. Your observations of X under gdb should help, though.
Hmmmm... Putting a breakpoint in xf86ExecX86int10 tells me that said routine is called RATHER a lot. You said to put another on "do_vm86()", but gdb could not find that routine. So what I did instead (and I hope this is close enough to what we need to do) was to repeat the procedure above (break at vbeDoEDID, and "n" up to the call to xf86ExecX86int10(pVbe->pInt10);). Then I asked to print *(struct vm86_struct*)pVbe->pInt10->cpuRegs before and after the call, as follows:

(gdb) n
196         xf86ExecX86int10(pVbe->pInt10);
(gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
$1 = {regs = {ebx = 0, ecx = 486, edx = 0, esi = 0, edi = 8192, ebp = 0,
    eax = 79, __null_ds = 0, __null_es = -1069481984, __null_fs = 0,
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0,
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0,
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0,
  screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295,
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295,
      4294967295}}, int21_revectored = {__map = {4294967295, 4294967295,
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
(gdb) n
198         if ((pVbe->pInt10->ax & 0xff) != 0x4f) {
(gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
$2 = {regs = {ebx = 258, ecx = 0, edx = 0, esi = 0, edi = 0, ebp = 0,
    eax = 79, __null_ds = 0, __null_es = -1069481984, __null_fs = 0,
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0,
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0,
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0,
  screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295,
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295,
      4294967295}}, int21_revectored = {__map = {4294967295, 4294967295,
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
(gdb)

I confirmed that the EDID transfer was bad, and then repeated under the 2.6.21 kernel as follows:

(gdb) n
196         xf86ExecX86int10(pVbe->pInt10);
(gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
$1 = {regs = {ebx = 0, ecx = 486, edx = 0, esi = 0, edi = 8192, ebp = 0,
    eax = 79, __null_ds = 0, __null_es = 0, __null_fs = -1069481984,
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0,
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0,
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0,
  screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295,
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295,
      4294967295}}, int21_revectored = {__map = {4294967295, 4294967295,
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
(gdb) n
198         if ((pVbe->pInt10->ax & 0xff) != 0x4f) {
(gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
$2 = {regs = {ebx = 258, ecx = 0, edx = 0, esi = 0, edi = 0, ebp = 0,
    eax = 79, __null_ds = 0, __null_es = 0, __null_fs = -1069481984,
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0,
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0,
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0,
  screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295,
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295,
      4294967295}}, int21_revectored = {__map = {4294967295, 4294967295,
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
(gdb)

QUESTION: What do you make of these registers?
(In reply to comment #45)
> Hmmmm... Putting a breakpoint in xf86ExecX86int10 tells me that said routine is called RATHER a lot.
> You said to put another on "do_vm86()", but gdb could not find that routine.

You are using a version of Xorg compiled with -O2. I think the compiler might have optimized and inlined this function; that's why you can't find it.

> So what I did instead (and I hope this is close enough to what we need to do) was to repeat the
> procedure above (break at vbeDoEDID, and "n" up to the call to xf86ExecX86int10(pVbe->pInt10);).
> Then I asked to print *(struct vm86_struct*)pVbe->pInt10->cpuRegs
> before and after the call as follows:

This is close enough; at least it captures the register state after the call has completed.

> [register dumps for the 2.6.18 and 2.6.21 kernels snipped; see comment #45]
>
> QUESTION: What do you make of these registers?
As per your data, the 16-bit register states after the last call to vm86() are the same in both the failure (58.el5) and success cases. From this I would conclude that vm86() is not doing any register state corruption in the el5 kernels. There is also no error message from the real mode code; otherwise some basic checks on the eax register would have failed. The following is the code doing the checks:

    if ((pVbe->pInt10->ax & 0xff) != 0x4f) {
        xf86DrvMsgVerb(screen,X_INFO,3,"VESA VBE DDC invalid\n");
        goto error;
    }
    switch (pVbe->pInt10->ax & 0xff00) {
    case 0x0:
        xf86DrvMsgVerb(screen,X_INFO,3,"VESA VBE DDC read successfully\n");
        tmp = (unsigned char *)xnfalloc(128);
        memcpy(tmp,page,128);
        break;
    case 0x100:
        xf86DrvMsgVerb(screen,X_INFO,3,"VESA VBE DDC read failed\n");
        break;
    default:
        xf86DrvMsgVerb(screen,X_INFO,3,"VESA VBE DDC unkown failure %i\n",
                       pVbe->pInt10->ax & 0xff00);
        break;
    }
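Those checks mirror the standard VBE return convention: AL = 0x4F means the function is supported at all, and AH carries the completion status (0x00 success, 0x01 failed). A compact restatement as a standalone sketch (vbe_status is a hypothetical helper of mine, not anything in the X server source):

```c
/* Hypothetical decoder for the AX value returned by a VBE call.
 * Returns -1 if the function is unsupported (AL != 0x4F),
 * otherwise the AH status byte: 0 = success, 1 = failed,
 * anything else = unknown failure. */
static int vbe_status(unsigned int ax)
{
    if ((ax & 0xff) != 0x4f)
        return -1;              /* "VESA VBE DDC invalid" path above */
    return (ax >> 8) & 0xff;    /* 0x00 read ok, 0x01 read failed */
}
```

The subtlety in this bug is that AX can come back as a clean 0x004F ("read successfully") even when the buffer the BIOS was supposed to fill is partly or wholly zeros, which is why checking AX alone was not enough.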
William, I have backported Jeremy's patch to RHEL5 (58.el5). Can you please apply this patch, rebuild, and see if it works? If it is too much trouble, I will build a binary package and send it to you; please let me know. I have gone through Jeremy's patch and can't see anything that should help. The only other change I have made in this patch is to set fs and gs to zero before the "jmp resume_userspace". I am just trying to rule out the possibility that the audit code is playing with fs and gs and that this creates the problem.
Created attachment 290461 [details] Jeremy's vm86 patch backport for RHEL5-58
Your patch built successfully for me. Alas, the EDID transfer was bad with the resulting kernel. I think this may indeed mean that the fault lies not directly in the vm86 code, but in an unrelated area of the kernel that changed around 2.6.18-20 -- some sort of race condition when filling buffers with data fetched from real mode.
(In reply to comment #49)
> Your patch built successfully for me.
> Alas, the EDID transfer was bad with the resulting kernel.
>
> I think this may indeed mean that the fault lies not directly in the vm86 code, but in an unrelated
> area of the kernel that changed around 2.6.18-20 -- some sort of race condition when filling buffers
> with data fetched from real mode.

Hmm, looking at so many zeros in the EDID buffer, one possibility is that the real mode code gave up too early for some reason. I am looking at the Xorg code and its various exit/error paths, but no clue yet. At many exit paths it dumps the registers, but I believe you don't see any additional errors in your Xorg.0.log file.

One interesting observation on my system. I was comparing the final exits of the read-edid and Xorg vm86 loops:

- get-edid exits the vm86() loop when control returns to userspace because of an interrupt on vector 255.
- Xorg exits the vm86() loop upon encountering a "hlt" instruction in the real mode code.

That is confusing to me. I thought both programs would exit at the same point after successful completion, but they seem to be taking different paths.
The error checking throughout that code is not very good. When I first began working on this bug, the X server code ostensibly said that the EDID transfer was successful. Code that could have done basic checking of the contents of the buffer did nothing, and blithely continued to operate on garbage. I submitted an upstream patch to at least error out if the version number in the packet was bad. (That detects the case of a buffer filled with all zeros.) There may be a checksum embedded in that buffer that could/should be checked, but alas, I was not well enough schooled in the EDID block. Bottom line: the anomaly you found in the two returns may indeed be significant. Perhaps further bench checking will identify where additional error-checking code might have detected an abnormal return from real mode.
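For what it's worth, EDID 1.x does define both checks the comment wishes for: the block starts with a fixed 8-byte header (00 FF FF FF FF FF FF 00), and all 128 bytes, including the checksum byte at offset 127, must sum to 0 modulo 256. A hedged sketch of such validation (edid_block_ok is a hypothetical helper of mine, not anything in the X server) -- note that the header test alone would have caught the all-zero buffers seen in this bug:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical validator for a 128-byte EDID 1.x base block:
 * returns 1 if the block looks sane, 0 otherwise. */
static int edid_block_ok(const uint8_t block[128])
{
    static const uint8_t header[8] =
        { 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00 };
    uint8_t sum = 0;

    /* A zero-filled buffer (the failure mode in this bug) fails here. */
    if (memcmp(block, header, sizeof header) != 0)
        return 0;
    /* EDID 1.x: all 128 bytes must sum to 0 mod 256; byte 127 is the
     * checksum byte chosen by the vendor to make this hold. */
    for (int i = 0; i < 128; i++)
        sum += block[i];
    return sum == 0;
}
```

The checksum test additionally catches the partially-zeroed buffers (valid data up to a random point, zeros afterwards) that the hex dumps show.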
William, can you please apply the attached patch to the X server, recompile, install, and then capture the Xorg logs in both the failure and success cases? I think this might give us some idea of how the control is flowing in terms of how it returns to user space. I have generated the patches against xorg-server-1.1.1-48.26.el5. Thanks Vivek
Created attachment 290688 [details] xorg debug patch1
Created attachment 291734 [details] Log output with the additional debugging output patch.
Vivek, sorry it took so long for me to comply with your request. I have finally built an X server with your patch. (Note that I am running an ever so slightly older version: xorg-x11-server 1.1.1-48.13. Your patch went into that version just fine; I believe there are no differences substantial enough between my 48.13 and your 48.26 to justify blowing away my existing comfy build setup.) Attached is the Xorg.0.log output from running that server. What do you make of it?
Created attachment 292069 [details] Debugging output with later kernel and successful EDID transfer
OOPS! I see I didn't read your request carefully enough. Here is the Xorg.0.log output from the debugging X server run under the later kernel that gives the successful EDID transfer. I used emacs ediff to make a quick scan for differences. Alas, I can't really detect any, but perhaps your more informed eye will pick something up.
Could we do a check-in on this bug? I believe the summary of the situation is: with the Radeon X1300 card under the vesa X driver, the EDID fetch of video modes from the BIOS is flaky. Originally we thought the flakiness was due to incorrect stack discipline when calling the audit code from inside the vbe calls to vm86. We confirmed that the problem is not present in kernel 2.6.20 and later, and attempted back-porting of likely code from 2.6.20 into 2.6.18. Unfortunately, the delta from 2.6.20 directly related to the use of syscall audit from the vm86 calls made by vbe.c did not remedy the problem.

So the current situation is that we know there's a problem where the data that comes back from the EDID read gets corrupted -- some or all of the buffer contains zeros -- but we do not see what to back-port from 2.6.20 to end this corruption. The current recommended work-around is to modify /etc/X11/xorg.conf: in the "ServerLayout" section, add the line:

    Option "Int10Backend" "x86emu"

Perhaps all we can do now is close this bug as WONTFIX because the corruption is too subtle to identify and back-port. Can someone re-confirm that this is our shared understanding of the situation, and either suggest something else for me to test as a new candidate to back-port into the 2.6.18 kernel, or affirm that we should live with the situation as-is and close this bug out.
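For anyone landing on this bug later, the work-around would look something like this in /etc/X11/xorg.conf (the Identifier and Screen lines are placeholders from a typical generated config; only the Option line is the recommended change):

```
Section "ServerLayout"
        Identifier  "Default Layout"
        Screen      0 "Screen0" 0 0
        Option      "Int10Backend" "x86emu"
EndSection
```

With this in place the server uses the x86emu emulated int10 backend instead of the kernel's vm86 call, side-stepping the flaky code path entirely.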
William, sorry for not being able to get back to you for so long; I got stuck on other issues. I agree with your viewpoint. Initially it was thought that the sys_vm86() system call implementation was wrong in RHEL5 and was causing this, but at this point that does not look like the case. I have carefully reviewed the code, we have tried back-porting Jeremy's patch, and we have also tried comparing the register states before and after the sys_vm86() call. At this point we don't know what is causing the corruption in certain instances. I would suggest deferring this issue to 5.3 and living with the x86emu option until then.
Status update: neither I nor anyone I've talked to about this bug has an idea where to look for the root cause of the flaky EDID transfer in the 2.6.18 kernel. It's not present in 2.6.20, but none of the obvious back-ports remedied the problem. In the meantime, the radeon driver has made advances and now handles the X1300 and many other more recent ATI chips. I have tested that driver under the RHEL 5.3 beta and found it to do a good job. It services the hardware, so we don't need to use the vesa driver or the deprecated BIOS interface to the kernel that is responsible for this flaky EDID transfer. We have a serviceable work-around -- the emerging upstream x86emu int10 backend, which makes this code path irrelevant -- and no insight into how to further debug this problem. I recommend that you CLOSE this bug with status WONTFIX.
After speaking with one of the original reporters, we have decided to close this BZ as WONTFIX. The problem is very elusive and a fix does not appear to be obvious. With the release of RHEL 5.3 there will be a suitable and easy work-around for this issue (via x86emu and/or the included radeon driver).