254024 – sys_vm86 syscall in RHEL5 not reliable

Summary: sys_vm86 syscall in RHEL5 not reliable

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	i386
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Don Howard
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	296411 372911 391501 420521 422431 422441 422491 KernelPrio5.3

Reported:	2007-08-23 17:39 UTC by Adam Jackson
Modified:	2018-10-20 00:40 UTC
CC List:	14 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-15 16:44:53 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Attempt to extract syscall exit fix from 2.6.20 kernel (981 bytes, patch) 2007-09-25 23:20 UTC, wdc
rhel5(.58) gdb output for read-edid (3.91 KB, text/plain) 2007-12-14 20:27 UTC, Vivek Goyal
2.6.24-rc4 gdb output for read-edid (4.20 KB, text/plain) 2007-12-14 20:28 UTC, Vivek Goyal
X logs when server initializes to 800x600 and no EDID data (53.40 KB, text/plain) 2007-12-17 20:16 UTC, Vivek Goyal
X server logs when server initializes with 1024x768 with EDID data (59.03 KB, text/plain) 2007-12-17 20:17 UTC, Vivek Goyal
Register states before and after vm86() calls for 58.el5 kernels (4.80 KB, text/plain) 2007-12-20 19:18 UTC, Vivek Goyal
Register states before and after vm86() calls for 2.6.24-rc4 kernels (4.71 KB, text/plain) 2007-12-20 19:19 UTC, Vivek Goyal
register dump before and after vm86 call in X server. 2.6.18 kernel (6.28 KB, application/octet-stream) 2007-12-21 03:22 UTC, wdc
register dump before and after vm86 call in X server. 2.6.21 kernel (5.20 KB, application/octet-stream) 2007-12-21 03:23 UTC, wdc
Jeremy's vm86 patch backport for RHEL5-58 (10.37 KB, patch) 2007-12-27 22:45 UTC, Vivek Goyal
xorg debug patch1 (1.80 KB, patch) 2008-01-03 00:08 UTC, Vivek Goyal
Log output with the additional debugging output patch. (1.54 MB, application/octet-stream) 2008-01-15 19:57 UTC, wdc
Debugging output with later kernel and successful EDID transfer (1.36 MB, application/octet-stream) 2008-01-17 20:43 UTC, wdc

Description Adam Jackson 2007-08-23 17:39:15 UTC

+++ This bug was initially created as a clone of Bug #236416 +++

Description of problem:

I just installed RHEL 5 client, and noticed that sometimes the X resolution is
properly set, as I specified, to 1280x1024, but often, upon restart of the X
server, it dumbs down the resolution to 800x600.

I will attach two Xorg.0.log outputs showing how the VESA VBE DDC read is said
to be successful, but in the dumb-down case no actual data comes in that enables
proper configuration of the monitor.

This problem DOES NOT occur under RHEL 4.5 beta, nor does it occur using a third
party fglrx driver.

Version-Release number of selected component (if applicable):

X Window System Version 7.1.1

How reproducible:

Often.  Note that I've got a VGA LCD attached via an adapter cable to the ATI
Radeon X1300 Pro card. At first, I thought it might be because the adapter or
the display or the card was flaky in reporting the VESA data.  But when I NEVER
got failures under RHEL 4.5, I began to suspect something amiss in the VESA VBE
DDC read.

Steps to Reproduce:
1. Start the X server

Actual results:

800x600 resolution, and no actual data from the VESA VBE DDC read showing up in
Xorg.0.log

Expected results:

Proper detection of the monitor through proper data being returned by the VESA
VBE DDC read, showing the monitor Manufacture string and all the other relevant
data, and an ultimate result of proper configuration at 1280x1024.

Additional info:

-- Additional comment from wdc on 2007-04-13 14:41 EST --
Created an attachment (id=152574)
Log output showing successful VBE DDC read

-- Additional comment from wdc on 2007-04-13 14:46 EST --
Created an attachment (id=152575)
Log output showing VBE DDC read with no data

Note that in this log file, the VESA VBE DDC read is declared successful, but
the Manufacture string, and all the other relevant data needed to configure has
NOT been obtained.

This log file was obtained on the exact same hardware as the Xorg.0.log.good
file.  My process was:

Start RHEL 5.
Notice X was configured correctly.
Log in.
Save the Xorg.0.log file as Xorg.0.log.good
Log out.  (And the X server restarted as per apparent config defaults.)
Notice that X was dumbed down to 800x600.
Log in.
Save the Xorg.0.log file as Xorg.0.log.bad

-- Additional comment from jbaron on 2007-05-01 14:24 EST --
hmmm, is it possible that System->Preferences->Screen resolution menu is set to
800x600 thereby over-riding the system configuration on a per-user basis? 

-- Additional comment from wdc on 2007-05-01 15:51 EST --
Well, the screen resolution problem occurs even when logged in as root.

The resolution menu offered by the System Preferences, after the X server has
decided to dumb itself 
down to 800x600 offers no higher resolution than 800x600.

When I first installed RHEL 5, I got 800x600, but I ran some tool and specified
1280x1024, and that's 
when I got to this state of affairs where it sometimes does and sometimes does
not work.  I regret that I 
did not take careful note of which tool I ran. It was probably
"system-config-display".

I am using a Dell E196FP display on the Optiplex 745.
I have now used system-config-display to set that explicitly as the monitor.

Here's the odd thing:

When the VESA data transfer is successful, the monitor clearly reports that its
optimal resolution setting 
is 1280x1024x60, and the setup is correct.

When the VESA data transfer is unsuccessful, the Xorg.0.log file reports that the
1280x1024x75 resolution is being tried multiple times, but that the 1280x1024x60
resolution is **NEVER** tried.  I wonder why this is so. I also wonder why no
1024x768 resolutions are being tried.

Even when the ACTUAL monitor I'm using is specified explicitly, no higher
resolution than 800x600 is offered when the VESA DDC data transfer fails.

I see two questions to answer here:

1. Why does the VESA DDC transfer sometimes report success when no data is
transferred?

2. Why does the X server never try 1024x768 resolutions, nor 1280x1024x60?  It
tries a WHOLE LOT of other resolutions, as can be shown in the Xorg.0.log file.

----

Should I also ask you why you are asking me about user-level configuration
settings, when the Xorg.0.log file already shows that a whole bunch of
resolutions, never offered in those user-level configuration commands, are being
tried and abandoned for reasons that have nothing to do with the user-level
configuration settings, and everything to do with the perceived capabilities of
the monitor?

Or am I completely misreading the Xorg.0.log file here?

-- Additional comment from wdc on 2007-05-10 17:59 EST --
I am disappointed that 10 days have gone by and nobody has followed up.
I guess nobody cares that the latest update to RHEL BROKE X server configuration.

I REALLY would like some help with this.

I've just taken RHEL 4.5, and either I've found a way to more consistently
specify a broken configuration, 
or whatever you broke in RHEL 5 you've BACK PORTED to 4.5, because the RHEL 4.5
beta worked great, 
but the RHEL 4.5 that was released is ALSO BROKEN.

Let's get hopping on understanding this problem and fixing it QUICKLY!

-- Additional comment from wdc on 2007-05-10 18:14 EST --
I've tested with another monitor, the Dell 2007WFP LCD.  Via the VGA connector.
In this case, the VESA 
data seems to be correctly fetched by both RHEL 5 and RHEL 4.5, but the monitor
VERY CAREFULLY 
configures itself to CHOP OFF the topmost 30 or so pixels.  Dell has no vertical
size control so I get my 
choice of having the tool bar or the panel chopped away.  

This is unacceptable, and extremely frustrating.  How can I help MIT customers
adopt RHEL 4.5 and RHEL 5 when basic X display configuration has been so badly
and obviously broken?

Ok, you folks don't see the test case, let's get someone back to me QUICKLY so
both MIT and Red Hat see 
the same symptoms, and pool our collective understanding.

-- Additional comment from wdc on 2007-05-11 19:10 EST --
Created an attachment (id=154575)
Sysreport of target system running RHEL 5

-- Additional comment from wdc on 2007-05-11 19:28 EST --
In the interests of being helpful I have attached sysreport output of the
relevant system.
Probably our next step is to decide if we have one bug or two here. The overall
symptom is that
X is not properly configured.  But that could be due to two separate issues:

1. Failure to get consistently good data from the VESA DDC transfer.
2. X chops off the topmost 50 pixels when the exact correct display is specified
in the System->Administration->Display tool.

-- Additional comment from wdc on 2007-05-11 19:51 EST --
There's something else interesting going on.  Yesterday the monitor would
configure and chop off the top.
Today I can't seem to establish an xorg.conf that will drive the monitor at that
size any more.  I either get 
800x600, or I get a complaint that I'm driving the monitor too hard.  

I *THINK* it's because the xorg.conf I'm now playing with does not contain
explicit resolution settings, and so it's trying to get them from the failed
VESA DDC transfer.

-- Additional comment from alanm on 2007-05-16 15:40 EST --
RHEL problems will get attention if they are filed via your TAM. Since this
works under RHEL4.5 I'll mark this as a regression.

-- Additional comment from pm-rhel on 2007-05-16 15:46 EST --
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

-- Additional comment from tao on 2007-05-16 16:02 EST --
heya Alan,

 Update: RHEL4.5 works with the third party driver, and is flaky with our
driver. Hence, regression is not in question.

Additionally, I have posted a query with Dell Partner contact. 

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from tao on 2007-05-16 16:13 EST --
I have confirmed with the customer: RHEL 4.5 does not work correctly with the
vesa driver. fglrx was used.

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from wdc on 2007-05-16 16:34 EST --
MIT does not have a TAM.  It has something to do with the business model of,
Since 1860, companies have paid for the privilege of collaborating with MIT; Why
does Red Hat insist on charging a premium price for the privilege of getting
bugs taken seriously by the very community that helped create Linux in the first
place... But I digress.

Inasmuch as this is a basic problem that will affect MANY users of RHEL 5, it
seems in Red Hat's best 
interest to resolve it quickly.  The position expressed by "pm-rhel" seems quite
wise.

-- Additional comment from tao on 2007-05-16 16:58 EST --
I wasn't aware that MIT doesn't have a TAM. RHEL bugs get addressed a
lot faster if they are submitted via issue tracker. There are times when
I've seen bugs submitted by BZ on RHEL problems that wind up hanging in
limbo because there isn't an issue tracker ticket associated with it.
Having two tools can be a problem because support uses IT,  engineering
uses BZ and product management makes their decisions using both.

This event sent from IssueTracker by alanm 
 issue 121369

-- Additional comment from tao on 2007-05-17 12:53 EST --
**** Problem Description		
Source :		Service Request 1468792
Created by :		WDC-RHN
Created on :		15-May-2007 20:58:38

<snip>

I just discovered one source of why RHEL 4.5 beta worked so well. I'd
installed the ATI proprietary driver, and FORGOTTEN.  Alas, when I went to
restore the ghost image snapshot, I discovered that ghost does NOT restore
RHEL 4 images.  They won't boot.  Tomorrow I'll re-install RHEL 4.5
beta.

What I *DO* know is that RHEL 4.5 will create a good xorg.conf file when
the Dell2007WFP is connected via DVI.  Alas that xorg.conf file does not
seem to work properly when the monitor is plugged in via VESA.  That
xorg.conf file does not work AT ALL under RHEL 5. 

</snip>

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from rkhadgar on 2007-05-19 07:10 EST --
Updated info from customer, Summarised, Cropped, Chipped and Pasted
-------------------------------------------------------------------------------

 RHEL 4.5 correctly configures and works when connected to DVI. 

 RHEL 4.5 connected to VGA connector chops off the top 50 pixels. 

 The "Perfect" behavior under RHEL 4.5 beta that I originally reported early in
this case was due to the use of the ATI proprietary driver. 

I am relieved to report that this means that I can take RHEL 4.5 roll-out OFF
hold here at MIT because I now see that there is no show stopper, merely a
performance issue. 

-------------------------------------------------------------------------------

RHEL 5, however clearly has breakage in the dynamic X configuration, such that
hardware configuration that works just fine, under RHEL 4.5 does not work at all
under RHEL 5. 

 RHEL 5 dynamic xorg.conf configuration does not work for the ATI Radeon 1300
card.  The VESA DDC transfer fails nearly all the time.

 RHEL 5 will not start X at all when connected to DVI. 

 RHEL 5 configures and runs X at 800x600 when connected via VGA connector. 

 RHEL 5 will run X via VGA connector at resolutions at least as high as
1280x1024 with explicit Modelines. 

-- Additional comment from tao on 2007-05-19 07:41 EST --
Attaching to IT an xorg.conf which works fine with VGA connector. 

The same config fails with DVI connector on the same monitor - Dell 200FP.
Xorg log attached to IT for the same.

This event sent from IssueTracker by rkhadgar 
 issue 121369
it_file 91403

-- Additional comment from wdc on 2007-05-21 17:17 EST --
Since last I posted to this bug on 16 May, I've done some more careful testing
and I understand a LOT 
more about this situation.

Bottom Line summary:  The RHEL 4.5 X server is performing acceptably.  The RHEL
5 X server suffers 
from a problem with DDC fetch that ALSO affects Ubuntu 7.04, and perhaps SuSE
SLED 10.1. 
I've searched the x.org bug tree and found two relevant bugs: 
https://bugs.freedesktop.org/show_bug.cgi?id=6886
https://bugs.freedesktop.org/show_bug.cgi?id=10238
I've subscribed to the latter one, and we'll see if the X.org folks respond.

Detail:

I needed to be told how to create a baseline xorg.conf file. Once I did that, I
was able to carefully test 
RHEL 4.5 and RHEL 5.  Along the way, I discovered that some of the extremely
good performance I was 
getting under RHEL 4.5 beta was because I'd installed the ATI proprietary driver
but FORGOT.  (Oops.)

The detailed behavior I got while testing RHEL 4.5 is:

On the Optiplex 745 with the ATI Radeon X1300 Pro:
    up to 1280x1024 works via VESA.
    up to 1400x1050 works via DVI
    If your xorg.conf specifies 1400x1050 the VESA display will be too big for
the screen.
    If your xorg.conf specifies 1600x1200 the VESA display will draw a blank,
but the DVI display will know to not use that setting.

This seems reasonable, albeit non-ideal behavior to me.

Creating a baseline xorg.conf file under RHEL 5, I re-ran tests and determined:
    The X server will not run AT ALL when connected via DVI.
    When connected via VESA, the DDC transfer fails, forcing the X server to
dumb down to 800x600.
    If one explicitly provides Modeline directives in the xorg.conf file, the X
server can be driven at up to 
1280x1024 when connected via the VESA port. Perhaps higher resolutions are
possible, but so far
I don't have a Modeline for better than that.
    When connected via DVI, the X server WILL NOT START AT ALL.  The monitor
complains of being 
over-driven.

DDC transfers under RHEL 5, with the X server version 7.1.1 always fail, both on
the VESA port and on 
the DVI port.

Ubuntu 7.04 seems to suffer the same fate.  There is a long winded bug report
about this at:
https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/89853

It is still unclear to me whether Red Hat, the Ubuntu community or X.org do or
do not understand the 
root cause of this problem.  Perhaps between the four of us we can converge on a
useful fix.

-- Additional comment from tao on 2007-05-22 03:19 EST --
150 systems are held back from RHEL5 deployment because of this issue.

**** Problem Description		
Source :		Service Request 1468792
Created by :		WDC-RHN
Created on :		21-May-2007 14:35:26
<snip/>

Q: Would you provide the number of systems being held back because of this
issue, as this will help me push up the priority. 
A: This is a difficult number to compute, because we don't know if this
problem affects all systems, or merely the ones that use the Dell 745
hardware.  Let's say that the scope is limited to desktops.  In that case
our Satellite server says 571 systems are registered for RHEL 4 WS.  If
it's just the systems we'd replace with Dell Optiplex 745's that
generally runs to about 150 systems.  As far as I know, the Dell Optiplex
is the SINGLE MOST POPULAR Enterprise desktop system in the world, so
it's probably a good idea to get this working.  Does this help? 

Priority set to: 2

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from tao on 2007-05-22 03:30 EST --
From tech-list
http://post-office.corp.redhat.com/archives/tech-list/2007-May/msg00563.html

From: 	Adam Jackson <ajackson>
Reply-To: 	tech-list
To: 	tech-list
Subject: 	Re: ATI Radeon X1300 Pro card
Date: 	Mon, 21 May 2007 11:25:33 -0400 (20:55 IST)
Mailer: 	Evolution 2.10.0 (2.10.0-2.fc7) 	

On Sat, 2007-05-19 at 15:53 +0530, Ritesh Khadgaray wrote:
> Heya,
> 
>   Is anyone using a ATI Radeon X1300 Pro card with pci-id listing
> "1002:7183" ?
> 
>   I have a customer who has issue using the stated card on RHEL5 with
> DVI connector. With VGA connector, using explicit ModeLine option works
> fine .
> 
>  This card works fine on RHEL4.5 with DVI connector, and with top
> 50pixels chopped off with VGA connector .

Those are R500 cards.  You're at the mercy of whatever the VESA BIOS
implements.  Thanks ATI.

- ajax

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from cra on 2007-05-28 14:25 EST --
wdc and I dug into the X server sources, and produced the attached patch, with
interesting results.

Issues:

1.  The initialization of the EDID buffer carefully memset's 4 bytes to zero
because it uses the size of the pointer to the structure instead of the size of
the structure itself.  However, in our patch we use the constant 128, because
that is the size of an EDID buffer (as described in the EDID documentation we
found on the Web.)

2. When the EDID transfer fails and gives us an EDID buffer full of zeros,
xf86InterpretEDID in interpret_edid.c silently fails and returns NULL.  We
changed the code to report this error condition.

3. The EDID fetch from the BIOS is DEFINITELY flaky in a time-dependent way.
We inserted a sleep(2) into vbeReadEDID in vbe.c which seems to improve things
somewhat, but running Xorg multiple times results in EDID fetches in various
states of completion, with the buffer only being filled up to a certain point,
followed by zeros.  We copied the hex dump code from print_edid.c into vbe.c so
that the EDID buffer could be viewed immediately after the BIOS fetch.
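
A minimal sketch of issue 1, not the actual X server code: it shows how taking the size of a pointer clears only 4 bytes on i386 (8 on x86_64) instead of the 128-byte EDID block. The function names and the `EDID_BLOCK_SIZE` macro are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size of one EDID block, as stated in the comment above. */
#define EDID_BLOCK_SIZE 128

/* Hypothetical helper reproducing the mistake: the length passed to
 * memset is the size of the pointer, not of the buffer it points to. */
static size_t clear_edid_buggy(unsigned char *edid)
{
    size_t n = sizeof edid;   /* size of the pointer itself */

    assert(edid != NULL);
    memset(edid, 0, n);
    return n;                 /* number of bytes actually cleared */
}

/* Hypothetical corrected helper: clears the whole 128-byte block. */
static size_t clear_edid_fixed(unsigned char *edid)
{
    assert(edid != NULL);
    memset(edid, 0, EDID_BLOCK_SIZE);
    return EDID_BLOCK_SIZE;
}

/* Counts the bytes that are still non-zero after a clear. */
static size_t count_nonzero(const unsigned char *buf, size_t len)
{
    size_t i, count = 0;

    for (i = 0; i < len; i++)
        if (buf[i] != 0)
            count++;
    return count;
}
```

Filling a 128-byte buffer with 0xFF and calling `clear_edid_buggy` leaves all but the first pointer-sized run of bytes untouched, which is why stale data can survive a supposedly cleared buffer.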

Attached is the patch against xorg-x11-server-1.1.1-48.13.0.1 along with
Xorg.0.log files from successive runs with this patch showing the EDID buffer in
various states of fill.

-- Additional comment from cra on 2007-05-28 14:30 EST --
Created an attachment (id=155549)
Patch to debug EDID BIOS fetch

-- Additional comment from cra on 2007-05-28 14:31 EST --
Created an attachment (id=155550)
Xorg.0.log run 1 showing full EDID read

-- Additional comment from cra on 2007-05-28 14:32 EST --
Created an attachment (id=155551)
Xorg.0.log run 2 showing full EDID read

-- Additional comment from cra on 2007-05-28 14:32 EST --
Created an attachment (id=155552)
Xorg.0.log run 3 showing partial EDID read

-- Additional comment from cra on 2007-05-28 14:33 EST --
Created an attachment (id=155553)
Diff between Xorg.0.log run 2 and run 3

-- Additional comment from cra on 2007-05-28 14:38 EST --
If you remove our "sleep(2);" from vbe.c the hex dump output from the EDID fetch
from the BIOS  pretty much always comes up all zeros.
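
The fill states seen in these logs (all zeros, partially filled, complete) can be told apart mechanically. The sketch below is an illustration only and is not taken from the debug patch: it relies on two properties of a 128-byte EDID block, the fixed 8-byte header 00 FF FF FF FF FF FF 00 and the rule that all 128 bytes sum to zero modulo 256. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EDID_BLOCK_SIZE 128

/* Hypothetical classification of a fetched EDID block. */
enum edid_state {
    EDID_ALL_ZERO,      /* nothing was transferred */
    EDID_BAD_HEADER,    /* data present, but not an EDID header */
    EDID_BAD_CHECKSUM,  /* header fine, block truncated or corrupted */
    EDID_OK             /* header and checksum both valid */
};

/* Returns the index just past the last non-zero byte, i.e. how far
 * the buffer appears to have been filled before the trailing zeros. */
static size_t edid_filled_length(const unsigned char *edid)
{
    size_t n = EDID_BLOCK_SIZE;

    while (n > 0 && edid[n - 1] == 0)
        n--;
    return n;
}

static enum edid_state edid_classify(const unsigned char *edid)
{
    static const unsigned char header[8] = {
        0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00
    };
    unsigned char sum = 0;
    size_t i;

    assert(edid != NULL);

    if (edid_filled_length(edid) == 0)
        return EDID_ALL_ZERO;

    for (i = 0; i < sizeof header; i++)
        if (edid[i] != header[i])
            return EDID_BAD_HEADER;

    /* A complete block sums to 0 modulo 256. */
    for (i = 0; i < EDID_BLOCK_SIZE; i++)
        sum = (unsigned char)(sum + edid[i]);

    return sum == 0 ? EDID_OK : EDID_BAD_CHECKSUM;
}
```

A partial transfer followed by zeros normally fails the checksum test, so a caller could log `edid_filled_length` alongside the state rather than silently returning NULL.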

-- Additional comment from wdc on 2007-05-29 18:04 EST --
Created an attachment (id=155646)
Log of successful DDC read, RHEL 4.5 with debug patch applied.

Today I built the X server under RHEL 4.5, applying the relevant portion of the
debug patch that performs the hex dump of the EDID fetch.  I ran Xorg several
times.  Always the result is the same: PERFECTLY RELIABLE fetch of the EDID
data!

I also looked at the differences in the int10 logic that seems to be doing the
nuts and bolts of the EDID fetch.  Although I might have missed something, I
think they are substantially the same.  This causes me to conclude that what we
have is a KERNEL bug, not an X server bug.  Perhaps something is playing fast
and loose with the real mode emulation that serves the VBE?

Since this problem seems also to affect Ubuntu 7.04 (although I can't get it to
consistently fail), we're probably talking about a kernel bug introduced
between 2.6.9 and 2.6.18.  (The Ubuntu 7.04 Desktop install CD which HAS the
problem uses 2.6.20-15.)

QUESTION: What further steps should I take to clarify that the fault lies in
the kernel and not in X?

-- Additional comment from wdc on 2007-05-30 17:25 EST --
Today I did two things:

1. I experimented under Ubuntu 7.04 to try and learn more -- I got partial EDID
transfers, but no clue 
how to control when the transfers were partial and when they were complete.

2. I found a package called "read-edid" that alleged to use the VM86 code in a
stand-alone mode to 
perform the problematic EDID fetch.

See: http://john.fremlin.de/programs/linux/read-edid/

A debian package was available for Ubuntu.  Running the program ALWAYS gets a
100% good EDID 
fetch.

Building the package from source under RHEL 5, and running it ALSO ALWAYS gets a
100% good EDID 
fetch.

So now the question is, "What is happening to make stand-alone get-edid
successful but the X.org fetch unsuccessful?"

Someone suggested that there may be a memory caching issue involved.  get-edid
is a small program, whereas X is rather large, so that's not so far-fetched an
idea.

My next task will be to read the get-edid code, and try to understand if it is
doing the same thing the X 
server is doing.

ANY insight from anyone else reading this bug report would be MOST welcome.

-- Additional comment from tao on 2007-05-30 17:53 EST --
ping. from customer --

So now the question is, "What is happening to make stand-alone get-edid
successful but the X.org fetch unsuccessful?"

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from wdc on 2007-06-12 12:29 EST --
Created an attachment (id=156806)
Run of Xorg 6.8.6 under RH5 -- EDID all zeros

-- Additional comment from wdc on 2007-06-12 12:36 EST --
Created an attachment (id=156807)
Run of Xorg 6.8.6 under RH5 -- EDID partial transfer

I believe this Xorg.0.log output demonstrates we have a bug that WAS NOT
introduced between Xorg 6.8.6 and Xorg 7.1.1.

I tried to build Xorg 6.8.6 under RHEL 5 but hit a wall.
I tried to install RHEL 4.5's 6.8.6 on RHEL 5 but made a mess.
After cleaning up the mess well enough to get 7.1.1 running again I tried a
different tack to get Xorg 6.8.6 just running enough to do the EDID transfer.

Since RHEL 4.5 was in another partition, I ran Xorg out of there.  Additional
arguments were needed.
The command line that got me far enough was:

/rhel4/usr/X11R6/bin/Xorg -config /rhel4/etc/X11/xorg.conf -modulepath /rhel4/usr/X11R6/lib/modules/

The first new attachment, Xorg.0.log-rh5-6.1-a, is not sufficient.  It only shows
all zeros in the EDID transfer, and that could be caused by something else not
working as we kludge the Xorg run between major Linux versions.

The second new attachment, however, Xorg.0.log-rh5-6.1-b, IS sufficient, I
believe, because it shows a PARTIAL EDID transfer.

Xorg could not run enough to really run. (It couldn't find font "fixed" because
of how things are re-organized.) But I very strongly believe that it DID run far
enough to do an EDID transfer, and to manifest EXACTLY THE SAME bug we are
experiencing under 7.1.1 under RHEL 5: a timing-dependent flaky EDID transfer.

-- Additional comment from tao on 2007-06-12 14:33 EST --
**** Problem Description		
Source :		Service Request 1468792
Created by :		WDC-RHN
Created on :		29-May-2007 18:06:12
Summary of additional work I did today:
I built the X server under RHEL 4.5 with my debug code to do a hex dump of
the EDID fetch.  The fetch is ALWAYS 100% successful under RHEL 4.5.

I then compared the X code that did the int10 call out to the vm86 system
to fetch the data, and I believe the code is pretty much equivalent, so it
may be that we are facing a KERNEL bug that was introduced somewhere
between 2.6.9 and 2.6.18. 

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from tao on 2007-06-12 14:35 EST --
customer is looking for an update .

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from tao on 2007-06-18 14:15 EST --
Customer is not happy that this bug-fix is scheduled for 5.2, as there is
no viable option with RHEL5 with the DVI connector. The posted workaround
only works with the VGA connector.

Additionally, customer has posted information on bugzilla w.r.t. this bug.

---------------------------------------------------------------------------------

The Email notification system is not working.
I HAD NO CLUE that you'd responded to my issue.

MIT is preparing to do renewal of hardware with Dell Optiplex 745 systems
running Red Hat Linux and Windows Vista.  If we cannot demonstrate that
the hardware works, then our renewal plan may be postponed for a year.

If someone is working on this problem, I would VERY MUCH like to work with
them.
With the large number of hours I've spent working this issue, it would be
nice
to know I was not wasting my time and MIT's resources isolating a known
fault.

I might be able to help you arrive at a fix for the problem more quickly.

Furthermore, since this seems to be a kernel issue, it might put Red Hat
in a better light in the Linux community if Red Hat produces the fix that
helps not only RHEL, but also Fedora and Ubuntu.

Finally, inasmuch as we have ALREADY discussed how important this is, I am
a bit disappointed to be told, "We have this in the queue for a release 3
months out and don't want to talk to you about it any more unless you
produce compelling reason why we should." 

---------------------------------------------------------------------------------

>  Added Note: Currently, vsync and hsync values hardcoded into xorg.conf
> are used as a workaround for this issue.

If you plug in to the DVI port THIS WORK AROUND DOES NOT WORK AT ALL! 

So if Red Hat is still handing out this work-around, it's clear you guys
don't really understand the problem! 

Would you PLEASE connect me up with the people who REALLY ARE working on
the problem so their time, and my time will not be wasted, and so that we
can get this fixed as quickly as possible! 

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from marcobillpeter on 2007-06-19 04:01 EST --
putting this to 5.1

it has the flag regression set, which seems to be right. If not I'll set the
exception flag.

Nevertheless, this is a critical issue for MIT and Daniel Riek will meet with
this client and hear an earful. This problem is part of a showstopper at MIT

thanks - marco

-- Additional comment from syeghiay on 2007-06-19 12:18 EST --
The xorg-x11-drv-vesa package is on the 5.1 approved component list.
Set pm_ack.
Beta Feature Freeze is Jun 27 when errata must be filed.

-- Additional comment from ajackson on 2007-06-25 13:59 EST --
Out of curiosity, does it work reliably when using a xen kernel, or on non-x86?

The reason I ask is, vm86 is known to be unreliable when using xen, and is
simply unavailable on other arches.  So for everything other than baremetal i386
kernels, we use an x86 real-mode emulator to execute VBE calls.  The logs given
appear to all be from non-xen machines.  I would be thrilled to learn that the
emulator is more reliable.

There is also an option to force use of the emulator, by saying:

Option "Int10Backend" "x86emu"

in the ServerLayout section of xorg.conf.
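
For sites that want to add this option to many machines, the edit can be scripted. The following is a rough sketch only, assuming the section opens with the exact text `Section "ServerLayout"` on one line; the function name and return codes are hypothetical, and it works on an in-memory copy of the file rather than on xorg.conf itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies `conf` into `out`, adding the Int10Backend option on the line
 * right after the first line that opens the ServerLayout section.
 * Returns 1 if the option was added, 0 if no ServerLayout section was
 * found (conf is copied unchanged), -1 if `out` is too small. */
static int add_int10_option(const char *conf, char *out, size_t outsz)
{
    static const char marker[] = "Section \"ServerLayout\"";
    static const char option[] = "\tOption \"Int10Backend\" \"x86emu\"\n";
    const char *hit, *eol;
    size_t head, need;

    assert(conf != NULL && out != NULL);

    hit = strstr(conf, marker);
    if (hit == NULL) {
        if (strlen(conf) + 1 > outsz)
            return -1;
        strcpy(out, conf);
        return 0;
    }

    /* Everything up to and including the end of the marker line. */
    eol = strchr(hit, '\n');
    head = eol ? (size_t)(eol - conf) + 1 : strlen(conf);

    /* Room for the text, the option, a possible newline and the NUL. */
    need = strlen(conf) + strlen(option) + 2;
    if (need > outsz)
        return -1;

    memcpy(out, conf, head);
    out[head] = '\0';
    if (eol == NULL)
        strcat(out, "\n");
    strcat(out, option);
    strcat(out, conf + head);
    return 1;
}
```

A real tool would also need to skip files that already carry the option and tolerate different spacing or capitalisation in the section header; this sketch does neither.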

-- Additional comment from tao on 2007-06-25 15:04 EST --
Checking with customer if this option helps.
Customer is _not_ using xen.

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from wdc on 2007-06-27 21:33 EST --
It may indeed be that the emulator is more reliable.

I've just added that line to the xorg.conf file, and run the X server a couple
times.
Previously the EDID buffer would contain a random amount of data and the rest
would be all zeros.
This time the EDID buffer was consistently full and the data remained the same
across multiple runs.

This is good evidence that the problem is in the vm86old code. (We're guessing
that the code for auditing has inappropriately messed up the registers, and plan
to build a kernel to test that theory in a few days.)

The problem here though, is that people will not be able to run X far enough to
put in a fix.  What 
options do you think should be pursued to help people get a default install of
RHEL 5 and the other 
2.6.18+ kernels to get something that works from the get go?

-- Additional comment from tao on 2007-07-03 11:17 EST --
For self-reference. sorry for the spam.

**** Fact		
Source :		Service Request 1468792
Created by :		RKHADGAR
Created on :		29-Jun-2007 08:21:44

> "Add this non-standard line to the xorg.conf file"?
* A temporary workaround would be to automate this process, by adding the
below to kickstart script in post-install section.

system-config-display --noui

sed -i 's/Section "ServerLayout"/Section "ServerLayout"\n\tOption "Int10Backend" "x86emu"/g' xorg.conf

**** Problem Description		
Source :		Service Request 1468792
Created by :		WDC-RHN
Created on :		29-Jun-2007 16:22:31
Thanks for replying quickly and trying to be further help, but we need to
work harder on a solution.

1. No, I did not explain our test sufficiently in my earlier terse note. 
I am physically in San Francisco CA (at the opposite coast from Boston,
the other side of the country from my office in Boston), so I have NOT
plugged in the monitor to the DVI connector.  I ran a simple test of
starting the Xorg server remotely and examining the Xorg.0.log file for
the EDID output to see if it was getting through.

I believe there is a SECOND X server bug that will need to be investigated
after the EDID bug in the kernel is fixed, because I have seen RHEL 5 get
perfectly sensible, totally detailed VESA data, but then NOT USE IT to
properly configure the monitor.

2. At this time, of the hundreds of customers at MIT that install RHEL,
only a handful use kickstart.  Your solution will be part of a total
solution, but I fear people's habit will require us to do something that
does not involve creating a special kickstart setup, and then convincing
hundreds of customers to stop "Using the CD like I could with Ubuntu"
and do something totally new and different.

I think the way forward is to agree upon the sensible sequence of steps
that will result in RHEL 5 U2 distribution media incorporating a fix that
will just work.

(Ideally it would be RHEL 5 U1, but I think it is too late in the U1
development cycle to reasonably ask for that.)

This event sent from IssueTracker by rkhadgar 
 issue 121369

-- Additional comment from wdc on 2007-07-10 18:46 EST --
Created an attachment (id=158912)
Patch to cut out audit call in the int10 emulator.

Today we built a kernel with the attached patch that disables the code that
called audit_syscall_exit.
Although those nasty error messages about freeing multiple audit contexts came
back, the EDID transfers
were once again 100% successful.  (Yes, I was careful to use an xorg.conf file
with x86emu disabled.  I tested a stock kernel build to confirm I had a good
build process, and that the stock kernel tickled the bug.)

So it seems that the way audit_syscall_exit is called is trashing the registers
and making the EDID transfer flaky.  This is probably appropriately classified
as a regression and probably needs to be fast-tracked to the original author so
he or she can fix up the call.	We have a very reproducible test case and test
setup to test candidate kernel patches.  (We didn't feel we understood things
well enough to propose a change ourselves.)

-- Additional comment from wdc on 2007-07-10 18:58 EST --
I have a bug open at kernel.org where I asked for help looking at this.  I'll
mention there that this 
regression is the root cause.  Would it be appropriate for Red Hat to weigh in
and lobby for examination 
of that bug?
http://bugzilla.kernel.org/show_bug.cgi?id=8633

Now that we understand the root cause, and have a work-around, what next steps
should we take?

Ideally the kernel regression will eventually be remedied.

Should we consider lobbying freedesktop.org to make the x86emu as int10backend
the default for x86 
in addition to everything else?

There are additional bugs in the X server, once the EDID data is acquired with
100% fidelity:

1. Plugged into the VESA connector, 1400x1024 resolution will configure if
requested, but it will chop 
off the topmost quarter inch and the leftmost inch of pixels.  Modern Dell LCDs
no longer support the 
ability to control the vertical or horizontal size so this is an unpleasant
state of affairs.

2. The EDID data provides a detailed modeline for 1680x1050 operation which is
ignored.

I guess I should take these up with freedesktop.org.  Do people think I should
open a Red Hat bugzilla 
bug on these two issues?

Finally there is the issue that the X server does not properly report the EDID
transfer failure.  I will take 
the freedesktop.org bug I have open about this and lobby for my patch to be
considered as a remedy.  
Here too, I wonder if Red Hat weighing in on the bug would be useful?
https://bugs.freedesktop.org/show_bug.cgi?id=10238

Mr. Jackson et al., what do you advise as the best way forward?

-- Additional comment from ajackson on 2007-07-11 15:53 EST --
(In reply to comment #44)

> Now that we understand the root cause, and have a work-around, what next steps
should we take?
> 
> Ideally the kernel regression will eventually be remedied.
> 
> Should we consider lobbying freedesktop.org to make the x86emu as int10backend
the default for x86 
> in addition to everything else?

We're already doing this for Fedora 7 and later, and I'm certainly telling
everyone I can upstream that vm86 is insane.  I wish I'd flipped this switch
before FC6, so it would have been incorporated in EL5, but the fear that the
emulator would prove to be a regression relative to EL4's behaviour was too
high.  (And justified, it turns out, since several x86emu bugs have been fixed
since 5.0.)

In the meantime, I'm investigating a way to magically invoke the x86emu backend
for DDC transfers if the vm86 method fails.  It's slightly hairy due to
namespace issues but I think it's doable.  (Setting devel ack for 5.1, we should
include this if I get it working.)

> There are additional bugs in the X server, once the EDID data is acquired with
100% fidelity:
> 
> 1. Plugged into the VESA connector, 1400x1024 resolution will configure if
requested, but it will chop 
> off the topmost quarter inch and the leftmost inch of pixels.  Modern Dell
LCDs no longer support the 
> ability to control the vertical or horizontal size so this is an unpleasant
state of affairs.
> 
> 2. The EDID data provides a detailed modeline for 1680x1050 operation which is
ignored.

The X logs in this bz seem to all show the use of the vesa driver.  The vesa
bios interface is limited in terms of output setup capability.  In particular,
there are two sets of modes: the set that the monitor reports it can display,
and the set that the bios reports it can configure.  It's literally not possible
to ask the bios to set up a mode outside its list, so the best we can do with
the vesa driver - or any other driver that uses the vesa bios mode setting
interface - is pick a "good" mode that happens to be in both lists.

So, regarding these two issues, assuming they're occurring with the vesa driver:
The first sounds like we're either picking a mode that's larger than the monitor
- in which case, 5.1 includes a vesa driver update that should address this
issue - or that the mode we're selecting is not being programmed properly by the
video bios, in which case we're just out of luck.  The second problem sounds
like the 1680x1050 mode is advertised by the monitor but not by the bios, in
which case we are again out of luck.

If my assumptions are incorrect here, I would certainly like to see an X log of
the failure case(s).

In general, these limitations mean that although the vesa driver is supported,
it's not recommended for regular use, and we strongly prefer that people use
native drivers wherever possible.  The configuration infrastructure in EL5
should be smart enough to pick the correct native driver when one is available.

> Finally there is the issue that the X server does not properly report the EDID
transfer failure.  I will take 
> the freedesktop.org bug I have open about this and lobby for my patch to be
considered as a remedy.  
> Here too, I wonder if Red Hat weighing in on the bug would be useful?
> https://bugs.freedesktop.org/show_bug.cgi?id=10238

That looks pretty good; I'll take it up upstream.  Thanks!

-- Additional comment from wdc on 2007-07-11 16:21 EST --
Invoking x86emu if the DDC fails sounds hairy, scary and a lot of work.  Thanks
for putting in the effort 
to make it right!

Indeed the X resolution issues I am having are occurring with the VESA driver. 
Apparently the x.org ATI 
driver does not yet know about the R500 chip set that the x1300 and x1400 use. 
The reverse 
engineered driver will, I'm sure, eventually benefit this driver.  It will be
interesting to test the RHEL 5.1 
X server to see which driver it picks.

I'll attach Xorg.0.log output showing the 1680x1050 mode that the EDID fetch
offers, and how it's not 
used.  I'm still not sure I'm totally up to speed on reading the log output, so
I'd be grateful if you'd call 
my attention to the lines where the BIOS denies support for that mode.  Is it in
those long, detailed 
segments?  Indeed I see a 1600x1200 go by, and a 1400x1050 go by, but indeed no
1680x1050.

-- Additional comment from wdc on 2007-07-11 16:27 EST --
Created an attachment (id=159000)
Log of proffered but unused 1680x1050 resolution

See lines 461 and 462:

  (II) VESA(0): h_active: 1680	h_sync: 1728  h_sync_end 1760 h_blank_end 1840 h_border: 0
  (II) VESA(0): v_active: 1050	v_sync: 1053  v_sync_end 1059 v_blanking:  

and line 488:

  (II) VESA(0): Modeline "1680x1050"  119.00  1680 1728 1760 1840  1050 1053 1059 1080 -hsync +vsync

Here the VESA transfer offers the mode.  Why exactly isn't it being used?
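
For reference, the timing figures in that modeline can be sanity-checked with simple arithmetic. The following is a minimal C sketch, assuming the usual X modeline layout (pixel clock in MHz, followed by the horizontal and vertical timings, each ending in its total). The struct and function names are hypothetical and are not taken from the X server source.

```c
#include <stddef.h>

/* Hypothetical container for the numeric fields of an X modeline. */
struct modeline {
    double clock_mhz;                          /* pixel clock in MHz */
    int hdisp, hsync_start, hsync_end, htotal; /* horizontal timings, in pixels */
    int vdisp, vsync_start, vsync_end, vtotal; /* vertical timings, in lines */
};

/* Horizontal scan rate in kHz: pixel clock divided by pixels per line. */
double hsync_khz(const struct modeline *m)
{
    return m->clock_mhz * 1000.0 / m->htotal;
}

/* Vertical refresh in Hz: pixel clock divided by pixels per frame. */
double vrefresh_hz(const struct modeline *m)
{
    return m->clock_mhz * 1e6 / ((double)m->htotal * m->vtotal);
}
```

Filling the struct with the logged values (119.00, 1680 1728 1760 1840, 1050 1053 1059 1080) gives roughly 64.7 kHz horizontal and about 59.9 Hz vertical, so the mode the monitor offers is an ordinary 60 Hz-class mode. Whether it gets used is a separate question from whether it is well formed.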

-- Additional comment from wdc on 2007-07-11 17:32 EST --
I just had a thought!

How will you detect a bad EDID transfer?  The kernel bug causes the transfer to
OFTEN come up all zeros, 
but sometimes it gets a partial transfer padded out with zeros.  Does the EDID
block have a checksum in it 
that you can compute and test?  The current code just looks at the first few
bytes for a version number 
and uses that to decide the transfer was good.

If you can't detect a zero-padded partial transfer, then your additional work to
use x86emu may be 
wasted.

-- Additional comment from benl on 2007-07-12 12:11 EST --
+ qa_ack for rhel-5.1.0
QA: we'll need some feedback from the customer on this one.

-- Additional comment from ajackson on 2007-07-26 13:42 EST --
(In reply to comment #47)
> Created an attachment (id=159000) [edit]
> Log of proffered but unused 1680x1050 resolution
> 
> See lines 461 and 462:
>  
>   (II) VESA(0): h_active: 1680	h_sync: 1728  h_sync_end 1760 h_blank_end 1840 h_border: 0
>   (II) VESA(0): v_active: 1050	v_sync: 1053  v_sync_end 1059 v_blanking:  
>  
> and line 488:
>  
>   (II) VESA(0): Modeline "1680x1050"  119.00  1680 1728 1760 1840  1050 1053 1059 1080 -hsync +vsync
>  
> Here the VESA transfer offers the mode.  Why exactly isn't it being used?

That's the EDID block's mode list.  Remember, I can only set modes to things in
the intersection of: in the VESA BIOS's mode list, and within the capabilities
reported by EDID.

So, yeah, 1680x1050 in the monitor, but not in the video BIOS, means no
1680x1050 for you.

(In reply to comment #48)
> How will you detect a bad EDID transfer?  The kernel bug causes the transfer
to OFTEN come up all zeros, 
> but sometimes it gets a partial transfer padded out with zeros.  Does the EDID
block have a checksum in it 
> that you can compute and test?  The current code just looks at the first few
bytes for a version number 
> and uses that to decide the transfer was good.

Yes, there is a checksum.  The last byte is set such that a cumulative sum of
all bytes in the block, modulo 256, is 0.

We do use this to reject bad EDID blocks.  See DDC_checksum() in
hw/xfree86/ddc/edid.c, and its caller in hw/xfree86/ddc/xf86DDC.c.
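
The checksum rule described above can be sketched in a few lines of C. This is an illustration only, not the X server's DDC_checksum(); the function names are hypothetical. It also checks the fixed 8-byte EDID header (00 FF FF FF FF FF FF 00), because a block of all zeros sums to 0 and would pass the checksum test on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EDID_BLOCK_SIZE 128

/* Returns 1 if all 128 bytes of the block sum to 0 modulo 256. */
int edid_checksum_ok(const uint8_t *block)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < EDID_BLOCK_SIZE; i++)
        sum = (uint8_t)(sum + block[i]); /* uint8_t wraps modulo 256 */
    return sum == 0;
}

/* Returns 1 if the block starts with the fixed EDID header pattern. */
int edid_header_ok(const uint8_t *block)
{
    static const uint8_t header[8] = {
        0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00
    };
    return memcmp(block, header, sizeof header) == 0;
}

/* A block is only plausible if both the header and the checksum are right. */
int edid_block_plausible(const uint8_t *block)
{
    return edid_header_ok(block) && edid_checksum_ok(block);
}
```

A partial transfer padded out with zeros would normally fail the checksum, since the final byte no longer balances the sum. An all-zero transfer passes the checksum but fails the header check.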

-- Additional comment from wdc on 2007-07-26 17:04 EST --
I've looked at the code in xf86DDC.c, but there's something that confuses me:

How come I never saw a checksum error report in the log?  Clearly I was getting
bad EDID reads.

What determines if the code that's doing the EDID fetch is from
hw/xfree86/vbe/vbe.c where it can 
silently fail (unless you've taken my patch ;-)  ) and where no checksum is
computed in the readEDID 
routine,
versus the code that's in hw/xfree86/ddc/edid.c?

Or are you saying that you plan to add checksum stuff like in ddc/... to vbe/...?

----

Thanks also for the clarification about the BIOS thing.

-- Additional comment from wdc on 2007-08-02 18:42 EST --
Andrew:

I just installed the X server and VESA driver from the RHEL 5.1 beta.
Alas, it does one thing that is admittedly more correct but less desirable to me:

Previously, somehow the server would see that the display could handle
1600x1080, and even though 
no 1400x1024 mode was specifically offered, it would configure that mode.  (This
got us into trouble 
when connected to the analog VESA port, but worked just fine on the digital port.)

Now, because there is no exact match, the display that used to be 1400x1024 is
configured for 
1280x1024.

By the same token, that particular monitor offers 1600x1050, but not 1600x1200,
so even though the 
vesa driver is improved and has a 1600x1200 mode, 1600x1050 is not configured
because it is not an 
exact match.

Wasn't there partial match code being worked on?  I thought it was already in place.
Somebody is suffering with the latest Ubuntu because their card supports
1280x1024, but their display 
only supports 1280x800.  That run ends up finding no matching modes whatsoever.

I am concerned here that people will have gotten used to running 1400x1050 on
these monitors under 
RHEL 4, but will now get degraded resolution of 1280x1024 after "upgrading" to
RHEL 5.1.

I will attach the xorg.conf file and the Xorg.0.log files so that this all can
be rigorously documented.

-- Additional comment from wdc on 2007-08-02 18:49 EST --
Created an attachment (id=160558)
xorg.conf file used for testing RHEL 5.1 beta X server

-- Additional comment from wdc on 2007-08-02 18:50 EST --
Created an attachment (id=160559)
Log of run of RHEL 5.0 debugging X server.  It sets 1400x1050.

-- Additional comment from wdc on 2007-08-02 18:51 EST --
Created an attachment (id=160561)
Log of run  of RHEL 5.1 X server and vesa driver.  Configs 1280x1024

-- Additional comment from wdc on 2007-08-10 13:52 EST --
Sorry to be a pest here.  I expect there are many important issues being worked
as RHEL 5.1 beta testing 
proceeds.  I am concerned that people are going to consider this an improper
regression in behavior.
If there were a plan of attack in addressing it, I might be able to help do the
work.

-- Additional comment from ajackson on 2007-08-14 10:00 EST --
The patch looks something like:

http://people.redhat.com/ajackson/omg-vbe-hax.patch

Utterly untested atm; going to try to hit that today.

-- Additional comment from wdc on 2007-08-14 13:18 EST --
Although I've not bench checked it carefully, the patch looks plausible.

The issue that concerns me is not so much the EDID thing at the moment, but that
the VESA update to the 
X server currently on track for dissemination as part of the RHEL 5.1 update
does a worse job than the 
present one at finding the highest resolution even when the EDID transfer is
100% successful.

Andrew, should I open a different bug about that?  What do you think is the way
I can be most helpful in 
identifying the root cause and fixing the new regression?

-- Additional comment from ajackson on 2007-08-15 12:02 EST --
(My name's Adam, btw.)

(In reply to comment #58)
> Although I've not bench checked it carefully, the patch looks plausible.
> 
> The issue that concerns me is not so much the EDID thing at the moment, but
that the VESA update to the 
> X server currently on track for dissemination as part of the RHEL 5.1 update
does a worse job than the 
> present one at finding the highest resolution even when the EDID transfer is
100% successful.

Yeah, that's intentional.

The issue is that you _really_ want to try for strict intersection of modes
between the monitor and the video BIOS in this case.  There do exist monitors
where the EDID list is literally all it can do.  Worse, there are monitors where
if (like your example) there's a VBIOS mode between the two largest EDID modes
like so:

   VBIOS         EDID
A:               1680x1050
B: 1400x1050
C:               1280x1024

and you attempt to set mode B, then the monitor will try to sync as though it's
mode C and the rest will just be off the screen.  Or go blank.  Either one is
unacceptable.

The other case we ran into was some laptop panels, which give you a
mostly-nonconformant EDID block that just contains a mode for the panel size and
nothing else, and of course no matching mode in the VBIOS.  In that case, strict
intersection of mode lists would mean the server just fails to start.

So the new heuristic is: Attempt strict intersection.  If doing so produces a
non-empty mode list, then use it.  Otherwise, revalidate the VBIOS mode list
against a range-based model of the EDID properties (using the sync ranges from
EDID if available, otherwise synthesizing them from an assumed minimum size of
640x480@60 and a max of whatever the EDID block reports as maximum), in the hope
that _something_ will survive validation and work.

This seems to be the least wrong thing to do.  Nonconformant panels get a best
effort, conformant panels get whatever the best intersection of BIOS and EDID
modes is, and we don't go wrong trying to do something the monitor doesn't
explicitly claim it's capable of doing.  This does mean some setups that used to
work at mode B (in the example above) now won't, but they'll still light up; in
exchange, some panels that would fail to do the right thing in mode B now do _a_
right thing, even if that happens to be mode C.  The vesa driver is intended to
be a conservative fallback driver anyway, so the real solution to the mode B
scenario is to use a native driver that doesn't use the VBIOS for output setup.
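
The heuristic described above can be sketched roughly as follows. This is a simplified C illustration, not the vesa driver's code: all names are hypothetical, and the range-based fallback here compares mode sizes only, whereas the real revalidation also uses sync ranges.

```c
#include <stddef.h>

struct mode {
    int width;
    int height;
};

/* Copy every VBIOS mode that also appears in the EDID list into out.
 * out must have room for nv entries. Returns the number copied. */
size_t strict_intersection(const struct mode *vbios, size_t nv,
                           const struct mode *edid, size_t ne,
                           struct mode *out)
{
    size_t n = 0;
    for (size_t i = 0; i < nv; i++) {
        for (size_t j = 0; j < ne; j++) {
            if (vbios[i].width == edid[j].width &&
                vbios[i].height == edid[j].height) {
                out[n++] = vbios[i];
                break;
            }
        }
    }
    return n;
}

/* Fallback: keep VBIOS modes between an assumed 640x480 minimum and the
 * maximum size reported by EDID (size-only simplification). */
size_t range_fallback(const struct mode *vbios, size_t nv,
                      struct mode edid_max, struct mode *out)
{
    size_t n = 0;
    for (size_t i = 0; i < nv; i++) {
        if (vbios[i].width >= 640 && vbios[i].height >= 480 &&
            vbios[i].width <= edid_max.width &&
            vbios[i].height <= edid_max.height)
            out[n++] = vbios[i];
    }
    return n;
}

/* Try strict intersection first; fall back only if it yields nothing. */
size_t select_modes(const struct mode *vbios, size_t nv,
                    const struct mode *edid, size_t ne,
                    struct mode edid_max, struct mode *out)
{
    size_t n = strict_intersection(vbios, nv, edid, ne, out);
    if (n > 0)
        return n;
    return range_fallback(vbios, nv, edid_max, out);
}
```

With the A/B/C example above (VBIOS offers 1400x1050 and 1280x1024, EDID offers 1680x1050 and 1280x1024), the strict intersection is non-empty, so only 1280x1024 survives and mode B is never tried. The fallback path only runs for cases like the laptop panel, where the intersection is empty.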

-- Additional comment from wdc on 2007-08-15 13:11 EST --
Thanks very much for taking the time to provide a detailed clarification.
In light of those details, I'd have to agree that the new behavior is the least
wrong thing to do.

-- Additional comment from ajackson on 2007-08-23 13:37 EST --
After some technical review, I've concluded that the patch in comment #57 is a
bad idea.  The act of initializing an int10 context on a non-primary card has
the side effect of posting the card.  This will blow away any state set up by
the driver prior to the VBE DDC call, which will almost certainly mean bad
rendering at best, and failure to launch or system hang at worst.

There's a more invasive change one could do where you'd set up the shadow x86emu
context _really_ early, and make sure to use the same maps for both vm86 and
x86emu execution, but that seems like a ton of work for very little return. 
Particularly since we know newer kernels have a working vm86 syscall.  Fixing
the kernel definitely seems like the right thing here.

Comment 1 Adam Jackson 2007-08-23 17:43:51 UTC

Nominating for 5.2.  The short summary is the vm86 syscall seems to be stomping
register state before returning to userland, which confuses (among other things)
the VESA DDC code, so the vesa driver can't ask the monitor about capabilities.
 Relevant links from the above discussion:

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=158912
http://bugzilla.kernel.org/show_bug.cgi?id=8633

Note that current Rawhide kernels are fine here, due to some churn in the area.

Comment 2 RHEL Program Management 2007-08-23 17:50:28 UTC

This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 3 Adam Jackson 2007-08-23 17:57:17 UTC

Moved the Issue Tracker number to this bug from the original.

Comment 4 wdc 2007-08-23 18:23:06 UTC

Re: Comment #1:  Are you sure that there are Rawhide kernels that are ok?
I'd be willing to test one, but I'm not convinced that the problem has actually been fixed.
(Perhaps there have been conversations outside the bugzilla and other trackers that I've not been privy to.)

Comment 5 Adam Jackson 2007-08-29 18:03:24 UTC

(In reply to comment #4)
> Re: Comment #1:  Are you sure that there are Rawhide kernels that are ok?
> I'd be willing to test one, but I'm not convinced that the problem has
actually been fixed.

Apologies, I'm not actually sure.  I had discussed this on IRC with another
engineer, who pointed out several changes in do_sys_vm86() that looked likely
related.

RHEL5 had:
 	if (unlikely(current->audit_context))
 		audit_syscall_exit(AUDITSC_RESULT(eax), eax);

Whereas recentish rawhide has:
        if (unlikely(current->audit_context))
                audit_syscall_exit(AUDITSC_RESULT(0), 0);

So the rawhide version is almost certainly touching less register state. 
Between that and not seeing any reports of this failure on rawhide I probably
jumped to conclusions.

But yes, testing this on rawhide would be valuable.

Comment 6 Jeff Burke 2007-09-25 19:01:25 UTC

William,
     Currently the patch that #if 0s out the audit_syscall_exit code will not pass
internal code review. So we will need to get a valid patch for RHEL5.2. Having
the results of the rawhide kernel will get us one step closer in getting a fix
into the kernel in the 5.2 timeframe.

     Please let us know as soon as you get those results.

Thanks,
Jeff

Comment 7 wdc 2007-09-25 23:20:15 UTC

Created attachment 206251 [details]
Attempt to extract syscall exit fix from 2.6.20 kernel

Acknowledged.  I'd planned on getting back to testing sooner, but got swamped
here with some RHEL 5
customer documentation tasks.	My friend Chuck and I have tried some testing,
and here is where we've gotten to today.

One point of clarification:  The patch with the #if 0 in it was explicitly not
for adoption.  It was illustrating the minimal scope of the broken code.  A
careful reading of our bug report said that we were unsure what the correct fix
was, and wanted to assist with the testing of a fix.  But, as you ultimately
concluded, testing with a new kernel was the right next step.

Question about a kernel to test with:  Up until now I'd let others do testing
with beta components.  I understand the term "rawhide" refers in a generic way
to the bleeding edge beta.  But that would be Fedora not RHEL 5, right? 
Indeed, the 2.6.23 kernel at
download.fedora.redhat.com:/pub/fedora/linux/development/i386/os/Packages/ will
not install under RHEL 5 without an update to mkinitrd.
The newer mkinitrd requires half a dozen lower level libraries to be updated
before it will install.
This seems like it is mutilating too much of the RHEL 5 environment.

Bottom line:  Is there a "rawhide" kernel that I can just drop into RHEL 5, or
did you indeed mean that I should install Fedora with a 2.6.18 kernel, re-run
all my tests to confirm I've got a clean failure, and then try the Fedora
2.6.23 kernel and see if the failure goes away?

----

Trying a rawhide kernel may be moot, however because of some other things we
have learned today.

Back on December 7, 2006, Jeremy Fitzhardinge checked in some changes to vm86.c
that look 100% relevant to our problem.  Excerpt from the kernel.org
ChangeLog-2.6.20:
(http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.20)

commit 49d26b6eaa8e970c8cf6e299e6ccba2474191bf5
Author: Jeremy Fitzhardinge <jeremy>
Date:	Thu Dec 7 02:14:03 2006 +0100

    [PATCH] i386: Update sys_vm86 to cope with changed pt_regs and %gs usage

    sys_vm86 uses a struct kernel_vm86_regs, which is identical to pt_regs, but
    adds an extra space for all the segment registers.  Previously this structure
    was completely independent, so changes in pt_regs had to be reflected in
    kernel_vm86_regs.  This changes just embeds pt_regs in kernel_vm86_regs, and
    makes the appropriate changes to vm86.c to deal with the new naming.

    Also, since %gs is dealt with differently in the kernel, this change adjusts
    vm86.c to reflect this.

    While making these changes, I also cleaned up some frankly bizarre code which
    was added when auditing was added to sys_vm86.
   
    Signed-off-by: Jeremy Fitzhardinge <jeremy>
    Signed-off-by: Andi Kleen <ak>
    Cc: Chuck Ebbert <76306.1226>
    Cc: Zachary Amsden <zach>
    Cc: Jan Beulich <jbeulich>
    Cc: Andi Kleen <ak>
    Cc: Al Viro <viro.org.uk>
    Cc: Jason Baron <jbaron>
    Cc: Chris Wright <chrisw>
    Signed-off-by: Andrew Morton <akpm>

----

Chuck and I made an attempt to isolate just the relevant change and test with
it.
The result was that the EDID transfer always failed.  So either Fitzhardinge's
amendment to audit_syscall_exit was insufficient, or we didn't take enough of
the patch to get correct operation.
Indeed our understanding of Fitzhardinge's work is poor, and it is most likely
that we incorrectly extracted his cleanup of the audit code. Attached is that
trial patch.

Chuck and I are evaluating what the right next step would be:
To do that Fedora testing path?
To re-examine Fitzhardinge's patch and more fully understand his audit fix.
To ask Andi Kleen at kernel.org to re-examine our bug report in the context of
the Fitzhardinge patch he signed off on in December 2006?

What do you recommend as a next step?  Owing to the relevance of this patch, is
there value to involving Jason Baron of Red Hat who was on the reviewer list of
Fitzhardinge's patch?

My concern is that we fully enough understand what is going on to clearly
demonstrate that the problem is fixed, not just driven underground by code
having landed in a different place.  Therefore, I see value in getting the
minimal correct chunk of Fitzhardinge's changes onto my test bed with minimal
other changes.

Comment 8 Jeff Burke 2007-09-26 00:38:06 UTC

William,
   Thanks for the detailed feedback. It looks like the patch set you made
reference to:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=49d26b6eaa8e970c8cf6e299e6ccba2474191bf5
Is part of the 2.6.20 baseline. So you should be able to use our RT Vanilla
kernel to test with. Please note this is an _unofficial_ kernel to be used only
for testing this issue. You can download the kernel from here:
http://people.redhat.com/jburke/kernel-rt-vanilla-2.6.21-39.el5rt.i686.rpm

   The above kernel should just install on top of RHEL5. 

Thanks Jeff

Comment 9 wdc 2007-09-26 18:52:59 UTC

Thanks VERY much for that kernel.  It made things much easier, and saved us a
lot of work.

GOOD NEWS:  The kernel indeed ran without trouble on the RHEL 5 test system.
GREAT NEWS:  The problem with EDID transfers through vm86.c appears to be
resolved.  I ran the X server several times, and each time the EDID fetch was
100% complete.

I strongly suspect that Fitzhardinge's patch remedied the problem.

Any guess when 2.6.20 or later will make it to public RHEL 5?

Comment 10 Jeff Burke 2007-09-26 19:17:55 UTC

William,
   That is good news. So we have a starting point for backporting a fix. 

   Q. Any guess when 2.6.20 or later will make it to public RHEL 5?

   A. We never do major kernel upgrades in a release. So 2.6.18 is the kernel
for the life span of RHEL5. But we will do backports for fixes into the 2.6.18
RHEL kernel.

   I think the next step is for me to isolate the portions of the patch set that
 actually fix the issue. Or come up with an alternative patch. 

   I will get back to you when I have a test kernel. If you don't hear back from
me in an acceptable time frame please feel free to ping me on the issue.

Thanks,
Jeff

Comment 11 wdc 2007-09-26 20:47:48 UTC

That sounds like the sensible way forward.

Comment 12 Jason Baron 2007-09-26 21:33:11 UTC

hi William,

what version of rhel5 kernel are you running? Since we weren't able to reproduce
this yet in house, and there have been a number of audit changes, perhaps you
can try to reproduce this on the latest rhel5 beta kernel link below. thanks.

http://people.redhat.com/dzickus/el5/51.el5/i686/kernel-2.6.18-51.el5.i686.rpm

Comment 13 Jason Baron 2007-09-27 13:59:33 UTC

Also, another variable here that might be interesting is whether turning off
auditing works around this. You can disable audit by doing: /sbin/chkconfig
auditd off and then rebooting. You can verify audit on/off by doing:
/sbin/auditctl -s and verifying that enabled=0. thanks.

Comment 14 wdc 2007-10-01 22:16:42 UTC

Jason,

The kernel that we have been working with up to now was 2.6.18-6.

I fetched your 2.6.18-51 kernel, and the EDID fetch fails with 100% repeatability for me.
(But it *IS* nice to not have to wait for the SATA timeouts.  I'm really looking forward to THAT fix
showing up in the production kernel asap.)

Turning off the audit daemon has no effect.  The EDID fetch through vm86 still fails
with 100% repeatability.

Through kernel.org, we've been in touch with Jeremy Fitzhardinge in an attempt to craft the minimal 
effective patch.  Alas our second attempt has also been unsuccessful.  My current plan is to show the 
patch I tried to Fitzhardinge and have him tell me how I got it wrong.

Comment 15 wdc 2007-10-01 22:23:34 UTC

Grrr!  Typo.
The kernel that we have been working with is 2.6.18-8.

Comment 16 wdc 2007-10-05 00:04:53 UTC

Status update:  A couple folks from kernel.org proposed a couple variations on the minimal patch to 
correct the audit_syscall_exit code.  Sadly, none of those candidate patches worked.  I'm pursuing as a 
next step an integration of the whole pt_regs patch (commit 
49d26b6eaa8e970c8cf6e299e6ccba2474191bf5) from kernel.org to see if it has a beneficial effect.

Comment 17 Jeff Burke 2007-10-19 13:32:27 UTC

William,
     When you have time can you please test the following kernel.
http://people.redhat.com/jburke/kernel-2.6.18-53.el5.bz254024.1.i686.rpm
please post the test results. 

Thanks in advance,
Jeff

Comment 18 wdc 2007-10-19 23:06:59 UTC

I am sorry to report that, although the kernel seemed to run just fine, the EDID transfer still came back 
wrong.

Comment 19 Jeff Burke 2007-10-26 22:22:48 UTC

William,
   I appreciate you running the test kernels. If you wouldn't mind I have one
additional test for you to try when you have time.
http://people.redhat.com/jburke/kernel-2.6.18-53.el5.bz254024.3.i686.rpm
Again post your results when you can.

Thanks in advance,
Jeff

Comment 20 wdc 2007-10-26 23:59:29 UTC

Alas, still no joy.  The EDID transfer still came up all zeros.

Here's an observation I tripped over in other testing.
I'd updated an RHEL 4 WS system to RHEL 5 Server. (I was seeing what would happen to apps like Open 
Office across such an update.  It wasn't pretty.  I opened a trouble ticket.  But I digress...)

That system had two kernels installed: the PAE and the vanilla 2.6.18-8 kernel.

The vanilla kernel consistently had bad EDID transfers.
but booting the PAE kernel on that setup had consistently good EDID transfers!

I'll also note in passing that the EDID transfers are so flaky that RHEL 5 x86_64 gets sufficiently 
confused that it won't start X at all with the xorg.conf it creates at install time.  Let me say this again:
I'm booting RHEL 5 i386 and RHEL 5 x86_64 out of different partitions of the same system.  I did
clean installs from RHEL 5.0 media.  But the i386 will actually start X, but the x86_64 won't until I hand 
tool the config file.  This one is probably due to some Xorg app behaving differently in 64 bits.
Sheesh!

Comment 21 Jeff Burke 2007-11-06 19:54:04 UTC

William,
    Sorry to rehash this at this point in time but I would like to clear up some
confusion. Your Comment #20 makes me think we have a disconnect. The patch you
posted to the bz in Comment #7 is for i386 the bz is opened for i386. The
kernel-rt-vanilla-2.6.21-39.el5rt.i686.rpm I pointed you at in Comment #8 was
for i686. The git commit from Fitzhardinge's
49d26b6eaa8e970c8cf6e299e6ccba2474191bf5 was for i386. So I have been only
looking at i386. 

    Please correct any of the following statements below that are not true and
add any additional data you think is relevant.

1.) 2.6.18-8.EL.i686 kernel will start X ... sometimes?
2.) 2.6.18-8.ELPAE.i686 kernel works without issue.
3.) 2.6.18-8.ELxen.i686 kernel unknown
4.) 2.6.18-8.EL.x86_64 kernel fails always.
5.) 2.6.18-8.ELxen.x86_64 kernel unknown.
6.) 2.6.18-53.EL.i686 ?
7.) 2.6.18-53.EL.x86_64 ?
8.) Regardless of kernel, the 3rd party EDID application always works.

Thanks in advance,
Jeff

Comment 22 wdc 2007-11-06 21:54:37 UTC

Don't panic.  The i386 subdirectory is stuff that's built also for the i686 kernel.
When I build my test kernel I build the i686 kernel, but it uses code from i386 for vm86.c.

I've not run the stand alone read-edid under multiple kernels.  I stopped testing with it when I satisfied 
myself that it always succeeded whereas X always had a bad EDID fetch under the 2.6.18-8.EL.i686 
kernel.

Your case #1 above, I'd not say "will start X ... sometimes".  X starts, it just has a flaky EDID read that 
one either can or cannot work around in xorg.conf.  I suggest the most correct way to characterize the 
test is,
"Full and correct EDID fetch of BIOS data from the video card."  

So I'd recast your list as follows:

1.) 2.6.18-8.EL.i686 kernel flaky EDID fetch under X. Fine under get-edid.
2.) 2.6.18-8.ELPAE.i686 kernel apparent 100% reliable EDID fetch under X.
3.) 2.6.18-8.ELxen.i686 kernel unknown
4.) 2.6.18-8.EL.x86_64 kernel flaky EDID fetch under X, but not rigorously tested.  (X seems harder
to configure under RHEL 5 and 4.5 on the x86_64 platform but that may be unrelated to this bug.)
5.) 2.6.18-8.ELxen.x86_64 kernel unknown.
6.) 2.6.18-53.EL.i686 kernel flaky EDID fetch under X.  
7.) 2.6.18-53.EL.x86_64 unknown.

Comment 23 wdc 2007-11-07 21:00:01 UTC

In comment #16, I said I'd pursue an attempt to integrate the whole pt_regs patch.
I finally got some time and attempted that today.

Unfortunately, Fitzhardinge's whole patch is based on the 2.6.20 kernel's definition of struct pt_regs,
which has the element:
      int xgs;
which is missing from the 2.6.18 kernel.

Alas, I don't know enough about register saving and restoring to have what I would consider a useful 
suggestion going forward.

Perhaps having that extra register in the pt struct is allowing space to save a register that the audit
code trashed in 2.6.18?

Perhaps the fix to this bug is in some seemingly unrelated region of the 2.6.20 kernel?

Jeff:  If there are specific things that you've been integrating into the candidate test kernels you've been 
sending me, I'd be happy to bench check or review them for applicability to this problem.

There ARE a couple more radical positions we could take with this issue:

1. Discourage use of vm86.c for EDID transfers, and push for use of the emulator.
2. Encourage use of ATI drivers that use the native register set and abandon the VESA compatibility 
layer as something that is these days getting insufficient attention.

It would be nice if we had a clearer sense of why 2.6.20 works whereas 2.6.18 does not.
Is there a 2.6.20 withOUT Fitzhardinge's patch, built for RHEL5, that I could test?  At least then we'd have 
a single delta across which to test.

Comment 24 Vivek Goyal 2007-12-12 19:56:50 UTC

William,

Ubuntu bug 89853 (https://launchpad.net/ubuntu/+source/xorg/+bug/89853) got
solved by an updated vesa driver rpm.

xserver-xorg-video-vesa 1.3.0-1ubuntu5

I know in this bug folks could read EDID properly and it might not be the same
issue. Still, it might be worth testing the new vesa driver once to see if it helps.

A few things are confusing.

- RHEL 5 PAE kernel works fine. That means the vm86() implementation is not
  necessarily bad, as the code base is the same.

- read-edid works fine. Why is there an issue with X? I think I should print
register states before and after the vm86() call and see if there are any registers
which have not been restored after the system call.

- Jeff mentioned that he backported Jeremy's upstream patch to RHEL5 and it did
not help. That means the issue is probably somewhere else.

I am still trying to find out a machine with ATI Radeon 1300 card to see if I
can reproduce the issue here.

Comment 25 wdc 2007-12-12 20:24:05 UTC

Ubuntu runs with the 2.6.20 kernel these days.  That kernel has the entirety of
Fitzhardinge's patch, with the additional element in the struct pt_regs.

It may be that a new VESA driver fixes the problem.  If I get some time, I'll
try and compare the new and old VESA driver.  However, from what I currently
know, I think the problem most likely went away in Ubuntu, not because of a new
VESA driver, but because of the 2.6.20 kernel, and that additional element in
struct pt_regs.

Printing the register state before and after the vm86 call seems precisely the
thing to test.  Let me see if i can run that test.

Comment 26 Vivek Goyal 2007-12-14 20:25:47 UTC

(In reply to comment #1)
> Nominating for 5.2.  The short summary is the vm86 syscall seems to be stomping
> register state before returning to userland, which confuses (among other things)
> the VESA DDC code, so the vesa driver can't ask the monitor about capabilities.

Adam,

I am running gdb on read-edid and captured the 16bit register states for RHEL5
(-58) and upstream kernel (2.6.24-rc4). Of course read-edid is successful in
both the cases. I just wanted to see if vm86() is really stomping over any of
the 16 bit registers which can potentially confuse the realmode vesa code. 

As per the gdb output, it does not look like vm86() is corrupting any
of the 16bit registers. The only registers which seem to be touched are
_null_es (in case of rhel5) and _null_fs(in case of 2.6.24-rc4). But this should
not make a difference as 16 bit code is not going to load these segment
selectors. Instead it would use es, ds, fs, gs (these are stored in the end in
struct vm86_regs.) And looking at register states, actual ds, fs, gs, es seem to
be fine even after multiple calls to vm86().

I am going to attach two files. One for 2.6.24-rc4 and one for rhel5(-58). These
files contain the output from gdb after the call to vm86(). These have been
taken for the read-edid utility.

Do you have more info regarding what registers are stomped by vm86() at what
point of time?

Comment 27 Vivek Goyal 2007-12-14 20:27:44 UTC

Created attachment 289391 [details]
rhel5(.58) gdb output for read-edid

Comment 28 Vivek Goyal 2007-12-14 20:28:57 UTC

Created attachment 289411 [details]
2.6.24-rc4 gdb output for read-edid

Comment 29 wdc 2007-12-14 21:24:15 UTC

I believe that attempting to re-produce the problem when running the read-edid utility
will not be helpful.  We ALWAYS get good EDID transfer with the small stand-alone read-edid
utility.  The failure only occurs when the big, messy X server is run.

If you give me instructions how you instrumented either the kernel or the app, I'll reproduce
that effort with the X server on my "always fails"  test setup.

Comment 30 Vivek Goyal 2007-12-17 20:14:27 UTC

I ran into an interesting problem the moment I upgraded my xserver to 48.26.el5
release. Now if I boot my system, X initializes at 800x600 resolution. If I
restart the server it initializes the display at 1024x768. In the first case there
is no EDID transfer according to the Xorg logs, and in the second case there is a valid
EDID transfer.

This problem does not always happen; I could see it only 4 times out of 10-12
reboots. This is not the same problem as reported, but it looks very similar. I
have got Radeon 300 card.

I am trying to find a way regarding how to run X under gdb and then debug how
vm86 calls are being made. Interesting thing is that this problem happened only
after I upgraded my xserver.

I am attaching success and failure log.

Comment 31 Vivek Goyal 2007-12-17 20:16:35 UTC

Created attachment 289806 [details]
X logs when server initializes to 800x600 and no EDID data

Comment 32 Vivek Goyal 2007-12-17 20:17:15 UTC

Created attachment 289807 [details]
X server logs when server initializes with 1024x768 with EDID data

Comment 33 wdc 2007-12-17 22:45:26 UTC

Interesting.  I have a couple observations: 

1. You're running the Radeon driver now, not the VESA driver.  But apparently the Radeon driver now
needs a successful EDID transfer to get the video modes.

2. This sort of confirms that the fixup for the failed EDID transfer would not be from an updated VESA 
driver, since you're not using that driver any more.

3. Sorry that it's not failing hard for you.  In the early days, mine didn't get a bad EDID transfer every 
time either.

IMPORTANT:  Check the hex dump of the EDID data.  You may be getting more failures than you think.
Look for that hex dump to sometimes be complete, and sometimes to have zeros in it starting at a
random point in the block.  THAT'S the manifestation I see.

CONGRATULATIONS!  You are in fact seeing the exact failure I see.

4. It looks like the Radeon driver does a poor job of noticing a failed EDID transfer.  I think it should 
actually flag the transfer as unsuccessful instead of magically reporting data or not.  In the VESA driver,
I fed a patch upstream to test the return value of the version number, and print an error rather than just 
continuing as if the EDID transfer was successful even if the version number came back with garbage.
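
The symptom described above (a block that is complete on some runs and zero-filled from an arbitrary
offset on others) can be checked mechanically. The following is only a minimal Python sketch, not code
from any X driver. The function names and the min_tail threshold are hypothetical, and it assumes a
128-byte base EDID block with the version byte at offset 18, as in the EDID 1.x layout.

```python
EDID_BLOCK_SIZE = 128      # size of the base EDID block in bytes
EDID_VERSION_OFFSET = 18   # EDID 1.x stores the version number at this offset


def zero_tail_start(block: bytes) -> int:
    """Return the offset at which the trailing run of zero bytes begins.

    Returns len(block) when the last byte is non-zero (no zero tail at all).
    """
    end = len(block)
    while end > 0 and block[end - 1] == 0:
        end -= 1
    return end


def looks_truncated(block: bytes, min_tail: int = 8) -> bool:
    """Heuristic: flag a wrong size, a bad version byte, or a long zero tail.

    min_tail is an arbitrary illustrative threshold; a valid block can
    legitimately end in a couple of zero bytes.
    """
    if len(block) != EDID_BLOCK_SIZE:
        return True
    if block[EDID_VERSION_OFFSET] != 1:
        return True
    return len(block) - zero_tail_start(block) >= min_tail
```

Feeding the hex dump from Xorg.0.log through a check like this on every start would give a count of
bad transfers rather than relying on which resolution the server happened to pick.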

Comment 34 Vivek Goyal 2007-12-19 19:32:30 UTC

William,

I went a little deeper into the RADEON driver and found out that it is reading EDID data
from the monitor over I2C/DDC. So RADEON is not making use of int10, and hence vm86,
at all on my system. That means the EDID transfer is flaky over I2C too, and not just
when we are using int10 via vm86(). Well, that would be a different problem
altogether.

How did you switch drivers (vesa vs fglrx)? I want to forcibly use vesa here
instead of radeon and see what it does on my system.

Comment 36 wdc 2007-12-19 21:43:24 UTC

I didn't do anything to explicitly force the VESA driver.
I think that, on install, the device ID was not recognized by any other driver, so VESA was set as the 
default.

Presumably xorg.conf gives the device driver as "radeon" or "fglrx" now, and you could change that to 
"vesa".

Comment 37 Vivek Goyal 2007-12-20 19:17:25 UTC

ok. Now I got vesa driver installed which uses real mode bios calls to get EDID
data. I am not seeing any issues on my system. 

I am also able to use gdb now and print register states before and after sys_vm86()
calls while retrieving EDID. I can't see any corruption happening. I have
collected register states for RHEL5 (.58) kernels and 2.6.24-rc4 kernels. I will
be attaching the logs.

William, I will prepare an Xorg rpm and send it to you, along with the procedure
for running gdb. Run Xorg under gdb and see if you can still reproduce the
issue. If yes, then it should be able to tell us where it fails while
retrieving EDID and what the register states are.

Comment 38 Vivek Goyal 2007-12-20 19:18:44 UTC

Created attachment 290180 [details]
Register states before and after vm86() calls for 58.el5 kernels

Comment 39 Vivek Goyal 2007-12-20 19:19:27 UTC

Created attachment 290181 [details]
Register states before and after vm86() calls for 2.6.24-rc4 kernels

Comment 40 Vivek Goyal 2007-12-20 22:00:03 UTC

William,

This is what I did to compile and run my Xorg server and debug with gdb. You
might want to do the same to gather some data on your machine.

- Download Xorg source rpm from following link.
http://people.redhat.com/vgoyal/.xserver-edid-issue/xorg-x11-server-1.1.1-48.26.el5.src.rpm

- Install rpm
rpm -ivh xorg-x11-server-1.1.1-48.26.el5.src.rpm

- Go to /usr/src/redhat/SPECS/ dir and start building the rpm.

rpmbuild -bc xorg-x11-server-1.1.1-48.26.el5.src.rpm

- Once it starts compiling (after applying patches, doing configure etc), you can
Ctrl-C the command and move /usr/src/redhat/BUILD/xorg-server-1.1.1 dir to
your working dir.

- cd xorg-server-1.1.1 and edit "configure" to get rid of "-O2" flag. I want to
get rid of optimization otherwise it becomes difficult to work with gdb and
source code.

- Run ./configure

- Run make

- Run make install

- Now you have built your own Xorg server and installed it.

- SSH to target machine from a different machine and run Xorg under gdb
gdb --args /usr/bin/Xorg --dumbSched

- You can put breakpoint at vbeDoEDID and follow the whole flow from there.
First it does vm86 calls for checking if DDC is supported or not. After that
it reads the EDID data using vm86 calls.

- You can collect register states before and after vm86 calls. vm86() calls
exit frequently for various reasons like signals, interrupts etc, and are
restarted by 32bit userspace. I suspect in your env it might be exiting for some
reason and returning with bad EDID. You will have to track the control flow in
case of success and failure.

Comment 41 wdc 2007-12-21 03:21:41 UTC

It's been a VERY long time since I ran the X server under gdb. I needed some clues, and although your
procedure did not work for me, I was able to come up with one that would.

Rather than building the X server the way you told me, I used the X server I'd built, and had been testing
with. I was able to run gdb on that X server by just doing
rpm -ivh xorg-x11-server-debuginfo-1.1.1-48.13.0.2.i386.rpm
which was the debuginfo necessary for the server I was already testing with.

When I tried to run xorg having started gdb as you instructed, I got the error:
Fatal server error:
Unrecognized option: --dumbSched
Program exited with code 01.
but restarting gdb without that argument resulted in an X server starting up and allowing itself
to be tested.

I put the breakpoint in vbeDoEDID

doing "p/x *ptr" didn't work for me, but "info registers" did.
I hit "n" to get up to the call to xf86ExecX86int10(pVbe->pInt10);
I did "info registers" before and after the call.
I have attached the output of this effort for the 2.6.18 kernel that breaks EDID, and the
2.6.21 kernel for which EDID is successful.

In both cases, the contents of the registers seems totally fine.

But in thinking this through, that should be no surprise. The problem is NOT that the registers
get corrupted inside the X server. The problem is that the registers get corrupted INSIDE THE KERNEL
when making the call to the audit subsystem inside the implementation of the int10 code.

When you say that, "it is reading EDID data from monitor over I2C/DDC" is it possible that that stuff
eventually does an int10 call in the kernel to fetch the data?

I call your attention to the simple patch to the 2.6.18 kernel that disables the call to audit from within
the int10 code in the kernel (which is in the original bug from which this bug is cloned):
https://bugzilla.redhat.com/attachment.cgi?id=158912
Although we cannot ship a kernel with that change, when you make that change, EDID transfers
magically become reliable on my system.

Bottom line:
We need to see what the register corruption is INSIDE THE KERNEL IMPLEMENTATION of the vm86 call.

Testing the registers in X will not help.

Any guess how we might instrument things inside the kernel without harming what we're already
doing? Perhaps this is a job for Xen? Can one single step a kernel from within an emulator?

Then again, this all might simply be an MMU race condition which manifested in 2.6.18, and is
back underground in 2.6.20 and later.

Comment 42 wdc 2007-12-21 03:22:45 UTC

Created attachment 290211 [details]
register dump before and after vm86 call in X server.  2.6.18 kernel

Comment 43 wdc 2007-12-21 03:23:21 UTC

Created attachment 290212 [details]
register dump before and after vm86 call in X server.  2.6.21 kernel

Comment 44 Vivek Goyal 2007-12-21 18:18:55 UTC

(In reply to comment #41)
> 
> When you say that, "it is reading EDID data from monitor over I2C/DDC" is it possible that that stuff
> eventually does an int10 call in the kernel to fetch the data?

I don't think so. The data is read over the I2C protocol. AFAIK, real mode
BIOS services (int10) are not involved.

> 
> Any guess how we might instrument things inside the kernel without harming what we're already
> doing?  Perhaps this is a job for Xen?  Can one single step a kernel from within an emulator?
> 
> Then again, this all might simply be an MMU race condition which manifested in 2.6.18, and is
> back underground in 2.6.20 and later.

A few things.

- There are two kinds of register state. One is 32bit register state and the other
is 16bit register state. 32bit register state belongs to the 32bit user space
(actual Xorg server code running) and 16bit register state belongs to the real
mode BIOS code executing.

- 32bit user space, sets the 16bit registers (pInt->cpuRegs) and calls vm86().
This system call sets up various things and then starts executing real mode
code. (Remember before doing that it creates the env using 16bit registers). If
there is any interrupt, signal etc, real mode code is interrupted and control
goes back to 32bit user space. Before returning to user space, kernel also
passes the new state of real mode registers back to user space. 

- 32bit user space makes the vm86() call again with new 16bit register state and
it goes on till the real mode code has completed.

- If you do just "info registers", then it will give you 32bit register state.
Primarily we are not interested in that as if something was wrong with it, user
space would have most probably died.

- We are interested in 16bit registers. A pointer to 16bit registers is stored
  in pInt->cpuRegs.

- There are different ways of printing this register state depending on which
function you are in and where you have put your break point.

- You can also put two more break points. One on xf86ExecX86int10() and other on
do_vm86().

- You can print real mode registers by using following. (If you are not in
vm86_rep())
   p/x *(struct vm86_struct*)pInt->cpuRegs

- If you have stepped into the function vm86_rep() then you can use
  p/x *ptr

- I think inspecting real mode registers in user space is easier. Even if kernel
has corrupted it, we should see it in user space after the call has returned.
Remember, vm86() is called thousands of times to finish one EDID retrieval. So
you can sample registers in first few invocations and then after the last
invocation. If vm86() finished earlier without completing EDID call
successfully, we should notice the difference in register states.

- I know that by getting rid of audit code, you don't see the problem. I browsed
through the audit code quickly and can't find anything which plays around with
16bit register state. Secondly, same audit code is executed in RHEL5 PAE kernel
which does not see the problem. So at this point of time it is difficult to
conclude anything. Your observations of X using gdb should help though.
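
Since vm86() is re-entered thousands of times per EDID retrieval, keeping every register dump is
impractical. The sampling idea above (the first few invocations plus the last one) could be sketched
like this in Python. This is an illustration only: the function name is hypothetical, and each
snapshot is assumed to be a dict of register name to value collected by whatever gdb scripting is used.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Snapshot = Dict[str, int]


def sample_first_and_last(
    snapshots: Iterable[Snapshot], first_n: int = 3
) -> Tuple[List[Snapshot], Optional[Snapshot]]:
    """Keep the first `first_n` snapshots and the final one from a stream.

    Works on any iterable (including a generator), so the full sequence of
    thousands of dumps never has to be held in memory.
    """
    head: List[Snapshot] = []
    last: Optional[Snapshot] = None
    for snap in snapshots:
        if len(head) < first_n:
            head.append(snap)
        last = snap
    return head, last
```

Comparing the `last` snapshot of a failing run against the `last` snapshot of a good run is the
check that matters; the `head` samples mainly confirm that both runs started from the same state.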

Comment 45 wdc 2007-12-21 22:49:26 UTC

Hmmmm...  Putting a breakpoint in xf86ExecX86int10 tells me that said routine is called RATHER a lot.
You said put another on "do_vm86()" but gdb could not find that routine.

So what I did instead (and I hope this is close enough to what we need to do) was to repeat the 
procedure above: break at vbeDoEDID, and "n" up to the call to xf86ExecX86int10(pVbe->pInt10);

Then I asked to print: *(struct vm86_struct*)pVbe->pInt10->cpuRegs
before and after the call as follows:

(gdb) n
196         xf86ExecX86int10(pVbe->pInt10);
(gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
$1 = {regs = {ebx = 0, ecx = 486, edx = 0, esi = 0, edi = 8192, ebp = 0, 
    eax = 79, __null_ds = 0, __null_es = -1069481984, __null_fs = 0, 
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0, 
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0, 
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0, 
  screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295, 
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 
      4294967295}}, int21_revectored = {__map = {4294967295, 4294967295, 
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
(gdb) n
198         if ((pVbe->pInt10->ax & 0xff) != 0x4f) {
(gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
$2 = {regs = {ebx = 258, ecx = 0, edx = 0, esi = 0, edi = 0, ebp = 0, 
    eax = 79, __null_ds = 0, __null_es = -1069481984, __null_fs = 0, 
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0, 
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0, 
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0, 
  screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295, 
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 
      4294967295}}, int21_revectored = {__map = {4294967295, 4294967295, 
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
(gdb) 

I confirmed that the EDID transfer was bad, and then repeated under the 2.6.21 kernel as follows:

(gdb) n
196         xf86ExecX86int10(pVbe->pInt10);
(gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
$1 = {regs = {ebx = 0, ecx = 486, edx = 0, esi = 0, edi = 8192, ebp = 0, 
    eax = 79, __null_ds = 0, __null_es = 0, __null_fs = -1069481984, 
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0, 
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0, 
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0, 
  screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295, 
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 
      4294967295}}, int21_revectored = {__map = {4294967295, 4294967295, 
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
(gdb) n
198         if ((pVbe->pInt10->ax & 0xff) != 0x4f) {
(gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
$2 = {regs = {ebx = 258, ecx = 0, edx = 0, esi = 0, edi = 0, ebp = 0, 
    eax = 79, __null_ds = 0, __null_es = 0, __null_fs = -1069481984, 
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0, 
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0, 
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0, 
  screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295, 
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 
      4294967295}}, int21_revectored = {__map = {4294967295, 4294967295, 
      4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
(gdb) 

QUESTION:  What do you make of these registers?

Comment 46 Vivek Goyal 2007-12-26 21:14:15 UTC

(In reply to comment #45)
> Hmmmm...  Putting a breakpoint in xf86ExecX86int10 tells me that said routine is called RATHER a lot.
> You said put another on "do_vm86()" but gdb could not find that routine.

You are using a version of Xorg compiled with -O2. I think the compiler might have
optimized and inlined this function, which is why you can't find it.

> 
> So what I did instead (and I hope this is close enough to what we need to do) was to repeat the 
> procedure above (break at vbeDoEDID, and "n" up to the call to xf86ExecX86int10(pVbe->pInt10);
> 
> Then I asked to print: *(struct vm86_struct*)pVbe->pInt10->cpuRegs
> before and after the call as follows:

This is close enough, at least for the register state after the call has completed.

> 
> (gdb) n
> 196         xf86ExecX86int10(pVbe->pInt10);
> (gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
> $1 = {regs = {ebx = 0, ecx = 486, edx = 0, esi = 0, edi = 8192, ebp = 0, 
>     eax = 79, __null_ds = 0, __null_es = -1069481984, __null_fs = 0, 
>     __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0, 
>     eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0, 
>     ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0, 
>   screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295, 
>       4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 
>       4294967295}}, int21_revectored = {__map = {4294967295, 4294967295, 
>       4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
> (gdb) n
> 198         if ((pVbe->pInt10->ax & 0xff) != 0x4f) {
> (gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
> $2 = {regs = {ebx = 258, ecx = 0, edx = 0, esi = 0, edi = 0, ebp = 0, 
>     eax = 79, __null_ds = 0, __null_es = -1069481984, __null_fs = 0, 
>     __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0, 
>     eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0, 
>     ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0, 
>   screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295, 
>       4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 
>       4294967295}}, int21_revectored = {__map = {4294967295, 4294967295, 
>       4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
> (gdb) 
> 
> I confirmed that the EDID transfer was bad, and then repeated under the 2.6.21
kernel as follows:
> 
> (gdb) n
> 196         xf86ExecX86int10(pVbe->pInt10);
> (gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
> $1 = {regs = {ebx = 0, ecx = 486, edx = 0, esi = 0, edi = 8192, ebp = 0, 
>     eax = 79, __null_ds = 0, __null_es = 0, __null_fs = -1069481984, 
>     __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0, 
>     eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0, 
>     ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0, 
>   screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295, 
>       4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 
>       4294967295}}, int21_revectored = {__map = {4294967295, 4294967295, 
>       4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
> (gdb) n
> 198         if ((pVbe->pInt10->ax & 0xff) != 0x4f) {
> (gdb) p *(struct vm86_struct*)pVbe->pInt10->cpuRegs
> $2 = {regs = {ebx = 258, ecx = 0, edx = 0, esi = 0, edi = 0, ebp = 0, 
>     eax = 79, __null_ds = 0, __null_es = 0, __null_fs = -1069481984, 
>     __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0, 
>     eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0, 
>     ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0, 
>   screen_bitmap = 0, cpu_type = 5, int_revectored = {__map = {4294967295, 
>       4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 
>       4294967295}}, int21_revectored = {__map = {4294967295, 4294967295, 
>       4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}}
> (gdb) 
> 
> QUESTION:  What do you make of these registers.

As per your data, 16bit register states after the last call to vm86() are same
both in case of failure (58.el5) and success. With this I would think that
vm86() is not doing any register state corruption in el5 kernels.
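
For anyone repeating this comparison, the post-call dumps quoted above can be diffed mechanically
instead of by eye. Below is a rough Python sketch; the helper names are hypothetical and the parsing
assumes gdb's default "name = value" struct print format. It ignores the __null_* padding fields by
default, since those are the only fields that differ between the two kernels.

```python
import re

FIELD_RE = re.compile(r"(\w+) = (-?\d+)")


def parse_regs(gdb_text):
    """Extract name/value pairs from the `regs = {...}` part of a gdb struct print."""
    start = gdb_text.index("regs = {") + len("regs = {")
    end = gdb_text.index("}", start)  # closing brace of the regs sub-struct
    return {name: int(value) for name, value in FIELD_RE.findall(gdb_text[start:end])}


def diff_regs(a, b, ignore_prefix="__null_"):
    """Return {name: (a_value, b_value)} for every field that differs.

    Fields whose name starts with ignore_prefix are skipped; pass None to
    compare everything.
    """
    out = {}
    for name in sorted(set(a) | set(b)):
        if ignore_prefix and name.startswith(ignore_prefix):
            continue
        if a.get(name) != b.get(name):
            out[name] = (a.get(name), b.get(name))
    return out


# Post-call ($2) dumps copied from the comment quoted above, regs portion only.
RHEL5_AFTER = """$2 = {regs = {ebx = 258, ecx = 0, edx = 0, esi = 0, edi = 0, ebp = 0,
    eax = 79, __null_ds = 0, __null_es = -1069481984, __null_fs = 0,
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0,
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0,
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0,"""

UPSTREAM_AFTER = """$2 = {regs = {ebx = 258, ecx = 0, edx = 0, esi = 0, edi = 0, ebp = 0,
    eax = 79, __null_ds = 0, __null_es = 0, __null_fs = -1069481984,
    __null_gs = 0, orig_eax = -1, eip = 1536, cs = 0, __csh = 0,
    eflags = 209410, esp = 4096, ss = 256, __ssh = 0, es = 0, __esh = 0,
    ds = 64, __dsh = 0, fs = 0, __fsh = 0, gs = 0, __gsh = 0}, flags = 0,"""

if __name__ == "__main__":
    rhel5 = parse_regs(RHEL5_AFTER)
    upstream = parse_regs(UPSTREAM_AFTER)
    print(diff_regs(rhel5, upstream))                      # {}
    print(diff_regs(rhel5, upstream, ignore_prefix=None))  # only __null_es and __null_fs
```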

There is also no error message from the real mode code; otherwise some of the basic checks
on the eax register would have failed. The following code does the checks.

    if ((pVbe->pInt10->ax & 0xff) != 0x4f) {
        xf86DrvMsgVerb(screen,X_INFO,3,"VESA VBE DDC invalid\n");
        goto error;
    }
    switch (pVbe->pInt10->ax & 0xff00) {
    case 0x0:
        xf86DrvMsgVerb(screen,X_INFO,3,"VESA VBE DDC read successfully\n");
        tmp = (unsigned char *)xnfalloc(128);
        memcpy(tmp,page,128);
        break;
    case 0x100:
        xf86DrvMsgVerb(screen,X_INFO,3,"VESA VBE DDC read failed\n");
        break;
    default:
        xf86DrvMsgVerb(screen,X_INFO,3,"VESA VBE DDC unkown failure %i\n",
                       pVbe->pInt10->ax & 0xff00);
        break;
    }
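
To make the link to the register dumps explicit: eax = 79 is 0x4F, so the low byte matches and the high
byte is zero, which is the "read successfully" branch. A small Python mirror of the same checks is
sketched below (the function name is hypothetical; the status values are taken from the C code above).

```python
def decode_vbe_status(ax: int) -> str:
    """Classify the 16-bit AX value returned by the VBE DDC call."""
    if (ax & 0xFF) != 0x4F:
        return "invalid"              # low byte is not 0x4F
    high = ax & 0xFF00
    if high == 0x0000:
        return "read successfully"
    if high == 0x0100:
        return "read failed"
    return "unknown failure %i" % high


if __name__ == "__main__":
    print(decode_vbe_status(79))      # read successfully
```

So on both kernels the BIOS call itself reports success; the status word gives no hint that the
returned 128-byte page may be partly zero.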

Comment 47 Vivek Goyal 2007-12-27 22:43:49 UTC

William,

I have backported Jeremy's patch for RHEL5 (58.el5). Can you please apply
this patch, rebuild, and see if it works fine? If it is too much trouble, I
will build a binary package and send it to you. Please let me know.

I have gone through Jeremy's patch and can't see anything which would help. The
only other change which I have made in this patch is to set fs and gs to zero
before "jmp resume_userspace". I am just trying to cover the possibility of
audit code playing with fs and gs and that creating some problem.

Comment 48 Vivek Goyal 2007-12-27 22:45:13 UTC

Created attachment 290461 [details]
Jermey's vm86 patch backport for RHEL5-58

Comment 49 wdc 2007-12-28 00:27:39 UTC

Your patch successfully built for me.
Alas, the EDID transfer was bad with the resulting kernel.

I think this may indeed mean that the fault lies not directly in the vm86 code, but in an un-related
area of the kernel that changed at around 2.6.18-20 -- some sort of race condition when filling buffers
with data fetched from real mode.

Comment 50 Vivek Goyal 2007-12-31 20:39:18 UTC

(In reply to comment #49)
> Your patch successfully built for me.
> Alas, the EDID transfer was bad with the resulting kernel.
> 
> I think this may indeed mean that the fault lies not directly in the vm86 code, but in an un-related
> area of the kernel that changed at around 2.6.18-20 -- some sort of race condition when filling buffers
> with data fetched from real mode.
> 

Hmm.., Looking at so many zeros in EDID buffer, one of the possibilities is that
real mode gave up too early for some reason. I am looking at Xorg code and
various exit/error paths but no clue yet. At many exit paths they dump the
registers but I think you don't see any additional errors in your Xorg.0.log file.

One interesting observation on my system. I was comparing the final exit of
the read-edid and Xorg vm86 calls.

- get-edid exits the vm86() loop when control returns to userspace because of
  an interrupt with vector 255.

- Xorg exits the vm86() loop upon encountering a "hlt" instruction in real mode
code. 

That's kind of confusing to me. I thought that both the software will exit at
the same point after successful completion. But they seem to be taking different
paths. ...
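
For reference when reading such traces: the value vm86() returns to user space packs the exit reason
into the low byte and an argument (for example the interrupt vector) into the bits above it. The Python
sketch below decodes that value. The constants are as I recall them from the kernel's asm/vm86.h, so
treat them as an assumption and check them against your headers; the function name is hypothetical.

```python
# Exit reasons, assumed to match asm/vm86.h (verify against the kernel headers).
VM86_TYPES = {
    0: "VM86_SIGNAL",     # returned because a signal is pending
    1: "VM86_UNKNOWN",    # unhandled fault, e.g. a privileged instruction such as hlt
    2: "VM86_INTx",       # int instruction; the argument is the vector
    3: "VM86_STI",        # virtual interrupts were re-enabled
    4: "VM86_PICRETURN",  # pending PIC request
    6: "VM86_TRAP",       # debugger trap
}


def decode_vm86_retval(retval: int):
    """Split a vm86() return value into (type_name, argument)."""
    vm86_type = retval & 0xFF   # same as the VM86_TYPE() macro
    vm86_arg = retval >> 8      # same as the VM86_ARG() macro
    return VM86_TYPES.get(vm86_type, "type %d" % vm86_type), vm86_arg


if __name__ == "__main__":
    print(decode_vm86_retval((255 << 8) | 2))  # ('VM86_INTx', 255)
    print(decode_vm86_retval(1))               # ('VM86_UNKNOWN', 0)
```

Under that reading, the get-edid exit would show up as VM86_INTx with argument 255 and the Xorg exit
on "hlt" as VM86_UNKNOWN, which is consistent with the two tools simply using different conventions
for "real mode code is done" rather than with one of them failing.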

Comment 51 wdc 2008-01-01 00:24:54 UTC

The error checking through all that code is not very good.
When I first began working this bug, the X server code ostensibly said that the EDID transfer
was successful.  Code that could have done basic checking of the contents of the buffer did nothing,
and blithely continued to operate on garbage.  I submitted an upstream patch to at least
error out if the version number of the packet was bad.  (That would detect the case of a buffer
filled with all zeros.)

There may be a checksum embedded in that buffer that could/should be checked, but alas, I was not 
well enough schooled in the EDID block.

Bottom line:  The anomaly you find in the two returns may indeed be significant.  Perhaps further bench 
checking will identify where additional error checking code might have detected an abnormal return 
from real mode.
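
On the checksum question: the base EDID block does carry one. The 128 bytes must sum to zero modulo
256, and the block starts with the fixed 8-byte header 00 FF FF FF FF FF FF 00. A minimal Python sketch
of such a check follows; it is not the patch that went upstream, and the function name is hypothetical.

```python
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])


def edid_block_errors(block: bytes) -> list:
    """Return a list of problems found in a base EDID block (empty if none)."""
    if len(block) != 128:
        return ["wrong length %d" % len(block)]
    errors = []
    if block[:8] != EDID_HEADER:
        errors.append("bad header")
    # All 128 bytes, including the final checksum byte, must sum to 0 mod 256.
    if sum(block) % 256 != 0:
        errors.append("bad checksum")
    return errors
```

Note that a buffer of all zeros passes the checksum test on its own (0 mod 256 is 0), so the header
check is what catches that case. A block that is valid up to some offset and zero afterwards will
normally fail the checksum instead.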

Comment 52 Vivek Goyal 2008-01-03 00:07:02 UTC

William,

Can you please apply the attached patch to xserver, recompile, install and then
capture the Xorg logs both in failure and success case.

I think this might give us some idea how the control is flowing in terms of how
it returns to user space.

I have generated the patches for xorg-x11-server-1.1.1-48.26.el5

Thanks
Vivek

Comment 53 Vivek Goyal 2008-01-03 00:08:40 UTC

Created attachment 290688 [details]
xorg debug patch1

Comment 55 wdc 2008-01-15 19:57:55 UTC

Created attachment 291734 [details]
Log output with the additional debugging output patch.

Comment 56 wdc 2008-01-15 19:58:38 UTC

Vivek,

Sorry it took so long for me to comply with your request.
Finally I have built an X server with your patch.
(Note that I am running an ever so slightly older version:
xorg-x11-server 1.1.1-48.13.)

Your patch went into that version just fine. I believe that there
are no differences substantial enough between my version 48.13
and your 48.26 for me to blow away my existing comfy build setup.

Attached is the Xorg.0.log output from running that server.

What do you make of it?

Comment 57 wdc 2008-01-17 20:43:06 UTC

Created attachment 292069 [details]
Debugging output with later kernel and successful EDID transfer

Comment 58 wdc 2008-01-17 20:44:58 UTC

OOPS!  I see I didn't read your request carefully enough.
Here is Xorg.0.log output from the debugging X server run under the later kernel
that gives the successful EDID transfer.

I used emacs ediff to make a quick scan for differences.  Alas, I can't really
detect any, but perhaps your more informed eye will pick something up.

Comment 59 wdc 2008-03-21 16:59:36 UTC

Could we do a check-in on this bug?

I believe the summary of the situation is:

    With the Radeon X1300 card under the VESA X driver, the EDID fetch of video modes from the BIOS is flaky.
    Originally we thought the flakiness was due to incorrect stack discipline when calling auditd from inside the vbe calls to vm86.
    We confirmed that the problem is not present in kernel rev 2.6.20 and later, and attempted back-porting of likely code from 2.6.20 into 2.6.18.
    Unfortunately the delta from 2.6.20 directly related to use of syscall_audit from the vm86 calls inside vbe.c did not remedy the problem.

So the current situation is that we know there's a problem where the data that comes back from the 
EDID read gets corrupted -- some or all of the buffer contains 0s.  But we do not see what to back port 
from 2.6.20 to end this corruption.

The current recommended work-around is to modify /etc/X11/xorg.conf:
In the "ServerLayout" section add a line:
    Option "Int10Backend" "x86emu"

Perhaps all we can do now is to close this bug with WONTFIX because the corruption is too subtle
to identify and back-port.

Can someone re-confirm that this is our shared understanding of the situation, and either:
    suggest something else for me to test to see if we've got a new candidate to back port into the 2.6.18 kernel, or
    affirm that we should live with the situation as-is and close this bug out.

Comment 61 Vivek Goyal 2008-03-28 20:26:11 UTC

William,

Sorry for not being able to get back to you for so long. Got stuck in other issues.

I agree with your viewpoint. Initially it was thought that the sys_vm86() system
call implementation in RHEL5 was wrong and that this was causing the problem. But
at this point that does not look like the case. I have carefully reviewed the
code, we have tried back-porting Jeremy's patch, and we have also tried comparing
the register states before and after the sys_vm86() call.

At this point in time, we don't know what is causing the corruption in certain
instances. I would suggest deferring this issue to 5.3 and living with the
x86emu option until then.

Comment 65 wdc 2009-01-15 04:06:48 UTC

Status update:

Neither I nor anyone I've talked to about this bug has an idea where to look for the flaky EDID transfer root cause in the 2.6.18 kernel.  It's not present in 2.6.20, but none of the obvious back ports remedied the problem.

In the meantime, the radeon driver has made advances and now handles the X1300 hardware and many other more recent ATI chips.  I have tested this driver under RHEL 5.3 beta and found it to do a good job. It services the hardware, so we don't need to use the VESA driver or the deprecated BIOS interface to the kernel that is responsible for this flaky EDID transfer.

We have a serviceable work-around (x86-emu), an emerging upstream driver that makes this code path irrelevant, and no insight into how to further debug this problem. I recommend that you CLOSE this bug with status WONTFIX.

Comment 66 David Mair 2009-01-15 16:44:53 UTC

After speaking with one of the original reporters we have decided to close this BZ out as "WONTFIX".  The problem is very elusive and a fix does not appear to be obvious.  With the release of RHEL5.3 there will be a suitable and easy work-around for this issue (either via x86-emu and/or the included radeon driver).
