Bug 601790

Summary: RHEL6 DVD Install regression between nightly 20100603 and 20100607
Product: Red Hat Enterprise Linux 6
Reporter: Zachary Amsden <zamsden>
Component: xorg-x11-drv-nouveau
Assignee: Ben Skeggs <bskeggs>
Status: CLOSED CURRENTRELEASE
QA Contact: desktop-bugs <desktop-bugs>
Severity: urgent
Docs Contact:
Priority: low
Version: 6.0
CC: jbastian, notting, syeghiay, vbenes
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: xorg-x11-drv-nouveau-0.0.16-8.20100423git13c1043.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-11-10 21:56:24 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                                            Flags
dmesg output                                           none
dmesg from failed X                                    none
/tmp/X.log from failed X server                        none
/tmp/X.log from nouveau.noaccel=1 (working X server)   none
dmesg 0603 (working)                                   none
X.log 0603                                             none

Description Zachary Amsden 2010-06-08 15:52:15 UTC
Description of problem:

I have two images:

RHEL6.0-20100603.n.0-Server-x86_64-DVD1.iso
RHEL6.0-20100607.n.0-Server-x86_64-DVD1.iso

The first runs the installer fine on my physical hardware; the second gives a black screen when X11 is supposed to start.

Version-Release number of selected component (if applicable):

VGA card is nVidia G86 Quadro NVS 290 (PCI ID 10de:042f), kernel driver: nouveau

How reproducible: 100%


Steps to Reproduce:
1.  Try the install DVD...

Actual results:

Unable to install RHEL6

Expected results:

Able to install RHEL6

Additional info:

I can attach any hardware profile info as needed, please let me know.

Comment 2 RHEL Program Management 2010-06-08 16:13:20 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Ben Skeggs 2010-06-09 04:38:45 UTC
Would it be possible for you to recover the output of "dmesg" and /var/log/Xorg.0.log over ssh so there's a clue as to what may have gone wrong?

Comment 4 Zachary Amsden 2010-06-09 10:24:01 UTC
I actually tried.  No attempt to switch consoles was successful.

It's possible with a hacked initrd that I could get network access, but even so, it's difficult to figure how to activate it specifically at that time...

I think the best bet here is binary search; 20100603 and 20100607 are close enough to have good search potential.
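
A minimal sketch of that binary search, assuming an ordered list of nightly builds where the first is known good, the last is known bad, and the failure persists once introduced. The function name `first_bad` and the `is_good` callback (which stands in for "boot this DVD and see whether X starts") are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence


def first_bad(builds: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest failing build.

    Assumes builds[0] is good, builds[-1] is bad, and every build after
    the first bad one is also bad.
    """
    lo, hi = 0, len(builds) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(builds[mid]):
            lo = mid  # regression was introduced after mid
        else:
            hi = mid  # regression was introduced at or before mid
    return builds[hi]
```

Each call to `is_good` costs one DVD download and boot, so the number of test boots grows with the logarithm of the number of nightlies in the range rather than with the range itself.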

Comment 5 Ben Skeggs 2010-06-09 22:58:50 UTC
Well you see, I've booted the relevant packages on a few of my own machines and they work fine.  So I need more info to know what could possibly have gone wrong.  If you plug into a wired network can you get access?

Comment 6 Zachary Amsden 2010-06-09 23:09:51 UTC
I am plugged into a wired network, but this happens so early in the install, I don't think it's active.  Being unable to change consoles after the crash / hang doesn't help either.  I'm downloading the 20100605 DVD now to see if I can pinpoint the regression a bit closer, but I have things I can try:

1) interrupt install early and get network going (although, an irq lockout seems to be happening as keyboard doesn't work)
2) force memory to be detected as lower than usual, resulting in text mode install, then boot with a full install and active networking.

Did any suspicious changes go in recently to X or the drivers?  Because the regression is 100% boolean pass / fail just by switching between these two DVDs.

Comment 7 Ben Skeggs 2010-06-09 23:21:42 UTC
There's one recent nouveau change that went into kernel -32, but I can't think of any way it'd have caused this particular problem.

Comment 8 Zachary Amsden 2010-06-09 23:27:57 UTC
okay, was able to sneak in a command (dhclient eth0) before the crash; saw the X cursor for sure before all went blank.  Keyboard leds still responsive.

However, seeing as nothing is set up on the box to accept incoming connections, it's still useless even with networking.  It does respond to pings however.

Switching to console 2, I can get beeps when hitting ctrl-G; perhaps I can blind type some commands.  So far I haven't been able to successfully ssh out; is ssh available there in the installer shell?

So at least console switching works, but VGA restore to text mode is definitely broken.

Comment 9 Ben Skeggs 2010-06-09 23:54:21 UTC
How about if you boot with nouveau.noaccel=1 in your boot options?
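
One way to confirm that the option actually reached the kernel is to inspect the kernel command line, which Linux exposes in /proc/cmdline. The following is a minimal sketch; the helper names `parse_cmdline` and `noaccel_requested` are hypothetical, and the parser deliberately ignores quoted values containing spaces. It only shows that the parameter was passed at boot, not that the driver honoured it.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {parameter: value}.

    Bare flags such as "quiet" map to an empty string. Quoted values
    containing spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def noaccel_requested(cmdline: str) -> bool:
    """True if nouveau.noaccel=1 appears on the given command line."""
    return parse_cmdline(cmdline).get("nouveau.noaccel") == "1"
```

On a running system the input would come from `open("/proc/cmdline").read()`.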

Comment 10 Zachary Amsden 2010-06-10 00:03:33 UTC
okay, this sucks, I spent 30 minutes typing commands blindly, was able to get ssh to connect to port 22 (watching packet dump), but for some reason the client disconnects - no useful information on server, it must be a client side failure - and of course I can't see the error message.

Comment 11 Zachary Amsden 2010-06-10 00:06:37 UTC
nouveau.noaccel=1 works

Comment 12 Ben Skeggs 2010-06-10 00:17:50 UTC
Okay, this is really weird then, there have been no nouveau changes *at all* that affect that area recently.  Can I see your dmesg output from that regardless?

Comment 13 Ben Skeggs 2010-06-10 00:21:41 UTC
Oh, and what kernel and xorg-x11-drv-nouveau versions are on each of the DVD images?

Comment 14 Zachary Amsden 2010-06-10 00:50:12 UTC
Created attachment 422740 [details]
dmesg output

Comment 15 Zachary Amsden 2010-06-10 01:04:41 UTC
kernel-2.6.32-33.el6.x86_64.rpm
xorg-x11-drivers-7.3-13.2.el6.x86_64.rpm
xorg-x11-drv-nouveau-0.0.16-6.20100423git13c1043.el6.x86_64.rpm
xorg-x11-server-Xorg-1.7.7-4.el6.x86_64.rpm

strange to see a git tag in the nouveau package version.. did an old version get pulled in somehow?

Comment 16 Ben Skeggs 2010-06-10 01:11:40 UTC
Nope, nouveau doesn't technically have a "release" as of yet upstream, so there are no version numbers; I used the git tag of the commit the package was based on instead.
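
To illustrate how such a snapshot package name breaks down, here is a rough sketch that splits an RPM filename into name, version, release and architecture, and then pulls the snapshot date and commit out of a release string shaped like `6.20100423git13c1043.el6`. The regular expressions and the function name `parse_rpm` are assumptions made for this example, not an official RPM parser.

```python
import re

# name-version-release.arch.rpm; version and release contain no hyphens.
RPM_RE = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)"
    r"\.(?P<arch>[^.]+)\.rpm$"
)
# Release of the form <build>.<YYYYMMDD>git<commit>.<dist>
SNAPSHOT_RE = re.compile(
    r"^(?P<build>\d+)\.(?P<date>\d{8})git(?P<commit>[0-9a-f]+)\.(?P<dist>.+)$"
)


def parse_rpm(filename: str) -> dict:
    """Return the filename's fields, plus snapshot fields when present."""
    match = RPM_RE.match(filename)
    if match is None:
        raise ValueError(f"not an RPM filename: {filename!r}")
    fields = match.groupdict()
    snapshot = SNAPSHOT_RE.match(fields["release"])
    if snapshot is not None:
        fields.update(snapshot.groupdict())
    return fields
```

For the nouveau package listed in comment 15 this yields version `0.0.16`, snapshot date `20100423` and commit `13c1043`; a package such as the kernel one has no snapshot fields.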

How about the working install image?

One thing I thought of too, if you do an install, do you see the problem still on the installed system?  It may be easier to track down the exact cause that way.

Comment 17 Zachary Amsden 2010-06-10 01:18:06 UTC
I can't do an install; the installer now wants to wipe out my drive, which I can't accept.  See bug 602497.

BTW, I figured out what went wrong with scp / ssh: it asks "Are you sure you want to connect" the first time, so I must answer "yes" before the password.

I could blind scp off stuff from /tmp or /var/log in the failed case if that would help...

Comment 18 Ben Skeggs 2010-06-10 01:29:25 UTC
Yes, that could be *very* helpful potentially.  If the GPU hung or something, it *should* have reported the hang to the driver at least.  /var/log/Xorg.0.log (maybe Xorg.0.log.old too if it exists) and /var/log/messages are the most useful.

Comment 19 Zachary Amsden 2010-06-10 01:30:49 UTC
believe it or not, that worked.  Taking a diff of /tmp/X.log, I see this in the failed version

(II) AIGLX error: dlopen of /usr/lib64/dri/nouveau_dri.so failed (/usr/lib64/dri/nouveau_dri.so: cannot open shared object file: No such file or directory)
(II) AIGLX: reverting to software rendering


Is it just a missing file?
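
For reference, the kind of comparison described above can be sketched with Python's standard `difflib`. The function name `lines_only_in_failed` is hypothetical; it takes the two logs as lists of lines and returns the lines that appear only in the failing one.

```python
import difflib
from typing import List, Sequence


def lines_only_in_failed(working: Sequence[str], failed: Sequence[str]) -> List[str]:
    """Return lines that ndiff reports as present only in the failed log."""
    return [
        line[2:]  # strip the two-character "+ " marker added by ndiff
        for line in difflib.ndiff(list(working), list(failed))
        if line.startswith("+ ")
    ]
```

With files on disk, the inputs would be `open(path).readlines()` for each log. A line showing up only in the failing log is a lead to follow up, not proof that it caused the failure.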

Comment 20 Ben Skeggs 2010-06-10 01:38:19 UTC
Nope, that's not an error at all.  It's just saying there's no 3D driver available is all.

What did you have to do to make it work?

Comment 21 Zachary Amsden 2010-06-10 01:43:52 UTC
Created attachment 422741 [details]
dmesg from failed X

Comment 22 Zachary Amsden 2010-06-10 01:44:23 UTC
Created attachment 422742 [details]
/tmp/X.log from failed X server

Comment 23 Zachary Amsden 2010-06-10 01:45:15 UTC
Created attachment 422743 [details]
/tmp/X.log from nouveau.noaccel=1 (working X server)

Comment 24 Zachary Amsden 2010-06-10 01:46:35 UTC
Oh, I meant the blind switch to VT-2 and scp trick worked...

Comment 25 Zachary Amsden 2010-06-10 01:54:35 UTC
FWIW, the failure case uses GPU Channel 2, opens and closes it twice, and after the second close, it immediately fails.  What it would have done next is:

(==) NOUVEAU(0): DPMS enabled

BTW, the card has two ports; if I'm reading it right, the output looks like it fails upon initializing the second screen (I have no display connected to it, however).

Comment 26 Ben Skeggs 2010-06-10 02:05:04 UTC
Hmm, can I see your X log from the working install DVD too please?  I'm not really sure why the channel gets closed and reopened in a single X invocation, but, that gives me something to test against nouveau to see if that case actually works.

Comment 27 Zachary Amsden 2010-06-10 20:21:17 UTC
okay, major change discovered.. the X server went from 1.7.6 (working) to 1.7.7 (fails)

uploading dmesg and X.log from 0603 DVD

Comment 28 Zachary Amsden 2010-06-10 20:22:15 UTC
Created attachment 423035 [details]
dmesg 0603 (working)

Comment 29 Zachary Amsden 2010-06-10 20:22:50 UTC
Created attachment 423037 [details]
X.log 0603

Comment 30 Ben Skeggs 2010-06-10 23:30:13 UTC
*** Bug 602760 has been marked as a duplicate of this bug. ***

Comment 31 Ben Skeggs 2010-06-11 04:15:03 UTC
Okay.  At some point between those two nightlies, something's changed to cause an X server regeneration to happen.

I've tracked down a bug in the X server that nouveau somehow triggers, which appears to be the likely candidate for what you're seeing.

Comment 32 Zachary Amsden 2010-06-11 04:29:37 UTC
Let me know if there's anything else I can do to help debug / test this.

I've found a workaround for the install issue and went ahead with a new install from 20100607 with the VESA driver.

Comment 33 Ben Skeggs 2010-06-11 04:39:49 UTC
Thanks, I'll push a new server build if/when I get the acks for this bug.  Once it makes it into a nightly it'd be great if you could ack/nack that it's fixed :)

Comment 34 Ben Skeggs 2010-06-11 12:04:34 UTC
Just an update, the server fix was a side-effect, and shouldn't have happened.  The correct fix will be in xorg-x11-drv-nouveau.

Comment 35 Vladimir Benes 2010-06-11 12:38:31 UTC
we have the same HW as reproducer has.. Will test soon if we can reproduce

Comment 36 Vladimir Benes 2010-06-11 12:38:54 UTC
s/reproducer/reporter :)

Comment 37 Ben Skeggs 2010-06-11 13:07:17 UTC
I've built xorg-x11-drv-nouveau-0.0.16-8.20100423git13c1043.el6 now with what I
hope is the fix for this.  Once it appears in a nightly, it'd be great if you
could ack/nack the fix :)

Comment 38 Zachary Amsden 2010-06-11 19:41:39 UTC
It's far easier for me to install the package; downloading a full DVD takes about 4 hours here.  I've had no choice but to do a full install unfortunately, which resulted in me finding this bug, but now that I'm installed, I'd gladly just switch X drivers.

BTW, the framebuffer seemed to work fine with the nouveau kernel driver; I got fancy graphics on the install of a spinning wheel, whee!  But this didn't jibe well with the install of X with basic video: the VESA driver complained that the nouveau driver had taken its resources, so X11 refused to start.

I fixed that, but I don't know if normal users would be able to:

mv /lib/modules/tab/and/search/to/find/the/path/of/nouveau /lib/nouveau-old.ko
dracut -o nouveau -f /boot/initramfs-xxx $(uname -r)

Feel free to cc me on the bugfix, I used to hack video drivers for fun in another life.

Comment 39 Jeff Bastian 2010-06-11 20:32:39 UTC
I can reproduce this problem too on a system with a Quadro FX 570 card.  I can download the ISO images in about an hour so I'll test this next week when it appears in a nightly build.

Comment 41 Jeff Bastian 2010-06-23 13:29:52 UTC
I just tested RHEL 6.0 20100622.n.0 nightly (which includes nouveau 0.0.16-8.20100423git13c1043.el6) and the Anaconda GUI started correctly.

Comment 42 Vladimir Benes 2010-07-01 14:26:13 UTC
I've just tested using 6.0 20100630.n.0 nightly on NVS 290 (Dell Precision T5400). Everything works as expected 
-> VERIFIED

Comment 44 releng-rhel@redhat.com 2010-11-10 21:56:24 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.