Bug 452136

Summary: RHEL5.2 - rx2660 will not install in graphics mode
Product: Red Hat Enterprise Linux 5 Reporter: Alan Matsuoka <alanm>
Component: xorg-x11-drv-ati Assignee: Adam Jackson <ajax>
Status: CLOSED NOTABUG QA Contact: desktop-bugs <desktop-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.2 CC: tao
Target Milestone: rc Keywords: Regression
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2008-06-19 18:38:18 UTC Type: ---
Bug Blocks: 391501    
Attachments:
Description                                            Flags
sosreport-rx2660.3611111111-825161-284597.tar.bz2      none
in house sysreport                                     none
Xorg.0.log from in house system                        none
hp-merlion-01-softlockup.txt                           none
sosreport-hp-merlion-01.186429-296757-f529cd.tar.bz2   none

Description Alan Matsuoka 2008-06-19 15:26:51 UTC
Description of problem:  The customer has an rx2660 and is trying to install
RHEL5.2 in graphics mode.  Graphics will not come up and the console appears to
hang.  Installing in text mode via the MP (er, ILO2) does work properly.

Once installed, the problems persist.  Sometimes the graphics console comes up,
other times X is running but there is no output on the screen.  The
/var/log/Xorg.0.log simply ends with the message 'Backtrace', but no backtrace
follows.

WTEC reproduced what appears to be the same problem.  Installation in graphics
mode worked properly; however, once the system was installed, the Xserver would
not start reliably.  In these cases the following was observed:

- the signal light on the monitor would go amber, indicating no sync from the
VGA card

- the Xserver would be running in an apparent loop:

(EE) RADEON(0): Idle timed out, resetting engine...
(**) RADEON(0): DC flush timeout: ffffffff
(**) RADEON(0): EngineRestore (32/32)
(**) RADEON(0): Idle timed out: 127 entries, stat=0xffffffff

- If killed, the Xserver could not be restarted.  X would show:

(EE) No devices detected.

- dmesg would sometimes show soft lockups:

BUG: soft lockup - CPU#0 stuck for 10s! [X:7588]
Modules linked in: ipt_MASQUERADE iptable_nat ip_nat bridge autofs4 hidp rfcomm
l2cap bluetooth sunrpc ip_conntrack_ftp ip_conntrack_netbios_ns ipt_REJECT
xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp
ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api vfat fat
dm_multipath button parport_pc lp parport joydev sr_mod cdrom shpchp sg tg3
dm_snapshot dm_zero dm_mirror dm_mod usb_storage cciss mptspi scsi_transport_spi
mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd
ohci_hcd ehci_hcd

Pid: 7588, CPU 0, comm:   X
psr : 0000141008526010 ifs : 8000000000000001 ip  : [<a0000001002ca002>]    Not tainted
ip is at __ia64_inb+0x82/0xc0
unat: 0000000000000000 pfs : 0000000000000388 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000555699
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001004dc4c0 b6  : a0000001002c9f80 b7  : a000000100201b80
f6  : 000000000000000000000 f7 : 000000000000000000000
f8  : 000000000000000000000 f9 : 000000000000000000000
f10 : 000000000000000000000 f11 : 000000000000000000000
r1  : a000000100be0270 r2  : c003fffffc000000 r3  : 0000000000000001
r8  : 00000000000000ff r9  : a000000100a3fd38 r10 : 00000000000000f2
r11 : 0000000000000fff r12 : e0000100762dfe20 r13 : e0000100762d8000
r14 : c003fffffc0f23c8 r15 : 00000000000f2000 r16 : a000000100a3fd30
r17 : 00000000000003c8 r18 : a000000100a3fd30 r19 : 0000000000000000
r20 : a000000100a3fd30 r21 : 0000000000ffffff r22 : e00001007586d0b0
r23 : e00001007586d120 r24 : a0000001002d8480 r25 : a0000001009de128
r26 : e0000100f20f36a0 r27 : a000000100201b80 r28 : 0000000000000100
r29 : a0000001002bb780 r30 : 0000000000000004 r31 : 0000000000000692

Call Trace:
[<a000000100013ae0>] show_stack+0x40/0xa0
sp=e0000100762dfa80 bsp=e0000100762d9518
[<a0000001000143e0>] show_regs+0x840/0x880
sp=e0000100762dfc50 bsp=e0000100762d94c0
[<a0000001000e8510>] softlockup_tick+0x2b0/0x320
sp=e0000100762dfc50 bsp=e0000100762d9478
[<a000000100093cf0>] run_local_timers+0x30/0x60
sp=e0000100762dfc50 bsp=e0000100762d9458
[<a000000100093da0>] update_process_times+0x80/0x100
sp=e0000100762dfc50 bsp=e0000100762d9420
[<a0000001000376a0>] timer_interrupt+0x180/0x360
sp=e0000100762dfc50 bsp=e0000100762d93d8
[<a0000001000e8bb0>] handle_IRQ_event+0x90/0x120
sp=e0000100762dfc50 bsp=e0000100762d9398
[<a0000001000e8d70>] __do_IRQ+0x130/0x420
sp=e0000100762dfc50 bsp=e0000100762d9350
[<a000000100011750>] ia64_handle_irq+0xf0/0x1a0
sp=e0000100762dfc50 bsp=e0000100762d9320
[<a00000010000c020>] __ia64_leave_kernel+0x0/0x280
sp=e0000100762dfc50 bsp=e0000100762d9320
[<a0000001002ca000>] __ia64_inb+0x80/0xc0
sp=e0000100762dfe20 bsp=e0000100762d9318
[<a0000001004dc4c0>] ia64_pci_legacy_read+0x100/0x140
sp=e0000100762dfe20 bsp=e0000100762d92e0
[<a0000001002d8530>] pci_read_legacy_io+0xb0/0xe0
sp=e0000100762dfe20 bsp=e0000100762d92a8
[<a000000100201cd0>] read+0x150/0x240
sp=e0000100762dfe20 bsp=e0000100762d9268
[<a000000100164880>] vfs_read+0x200/0x3a0
sp=e0000100762dfe20 bsp=e0000100762d9218
[<a000000100164f50>] sys_read+0x70/0xe0
sp=e0000100762dfe20 bsp=e0000100762d9198
[<a00000010000bdb0>] __ia64_trace_syscall+0xd0/0x110
sp=e0000100762dfe30 bsp=e0000100762d9198
[<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
sp=e0000100762e0000 bsp=e0000100762d9198

- lspci -nvv (from sosreport) shows errors

lspci -nvv:

00:01.0 ff00: 103c:1303 (rev ff) (prog-if ff)
        !!! Unknown header type 7f

00:01.1 0780: 103c:1302 (rev ff) (prog-if ff)
        !!! Unknown header type 7f

00:01.2 0700: 103c:1048 (rev ff) (prog-if ff)
        !!! Unknown header type 7f

00:02.0 0c03: 1033:0035 (rev ff) (prog-if ff)
        !!! Unknown header type 7f

00:02.1 0c03: 1033:0035 (rev ff) (prog-if ff)
        !!! Unknown header type 7f

00:02.2 0c03: 1033:00e0 (rev ff) (prog-if ff)
        !!! Unknown header type 7f


00:03.0 0300: 1002:515e (rev ff) (prog-if ff)
        !!! Unknown header type 7f


- At times, the serial console would become unresponsive
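
The lspci output above has a recognizable shape: every function reports rev ff,
prog-if ff and header type 7f, which is consistent with configuration-space
reads returning all ones (lspci masks the header type byte with 0x7f, so a raw
0xff prints as 7f).  The following is a minimal Python sketch, not part of the
original report, for flagging such devices in a saved `lspci -nvv` capture.  The
function name find_unreadable_devices and the regular expressions are
illustrative assumptions, not an existing tool.

```python
import re

# Device line as printed by `lspci -n`, e.g.
# "00:03.0 0300: 1002:515e (rev ff) (prog-if ff)".
# The optional leading group allows a PCI domain prefix such as "0000:".
DEVICE_RE = re.compile(
    r"^(?P<slot>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"[0-9a-f]{4}:\s+(?P<id>[0-9a-f]{4}:[0-9a-f]{4})"
)
UNKNOWN_HEADER_RE = re.compile(r"!!! Unknown header type ([0-9a-f]{2})")


def find_unreadable_devices(lspci_text):
    """Return (slot, vendor:device) pairs whose header type printed as 7f.

    Hypothetical helper: it only pattern-matches saved lspci text and does
    not touch the hardware.
    """
    suspects = []
    current = None
    for line in lspci_text.splitlines():
        device = DEVICE_RE.match(line)
        if device:
            current = (device.group("slot"), device.group("id"))
            continue
        header = UNKNOWN_HEADER_RE.search(line)
        if header and current is not None and header.group(1) == "7f":
            suspects.append(current)
            current = None  # report each device at most once
    return suspects
```

Run against the lspci section of the sosreport, this would list the Radeon at
00:03.0 (1002:515e) together with the other functions on bus 00 shown above.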

How reproducible:

Somewhat random.  Once it occurs it becomes more consistent.
The frequency is higher if the kernel does NOT direct output to a serial console.


Steps to Reproduce:

- ensure primary console is VGA (through conconfig in EFI)
- install RHEL5.2 on rx2660 via graphics head

Actual results:

Some graphics hangs, blackouts.

Expected results:

Works.

Additional info:

We have a system in Atlanta that shows this problem.   Am attempting to gather
kdump information.

Customer in Mexico experiencing problem.  

sosreport attached is from system in Atlanta lab.

On Mon, 2008-06-16 at 21:39 +0000, Red Hat Issue Tracker wrote:
> I'll see if I can track down one of our rx2660's here so I can hand an
> engineer a system that reproduces the problem. In the meantime, how
> critical is this issue to the customer? These days one typically
> doesn't associate Itaniums with desk-side systems that are used
> graphically...

The criticality is unclear.  This is a new customer to HP,
and the first experience they had with the machine was the
problems during installation.  As a result, they are left
with very bad impressions of the hardware and of the
Red Hat OS.

You may need to reboot several times.  Concurrent access
to the MP console may also be necessary, but it is not
clear.   In my testing after a cold boot, I had 6 or 7
boots without an issue, and then 5 or 6 with the problem
followed by 2 or 3 with no problem.

I think this needs at least a medium priority since:

- there are no release notes on this issue and no warning to
 avoid graphical installation or usage

- the kernel soft-lockups could lead to a real panic

- the serial console has also become inoperable during the
 soft-lockups, reducing the amount of control the user has
 over the system

My guess is that this is somehow related to the ILO2
functionality that is new with this machine.   The lspci
output seems to illustrate that.

Rick

Rick,

I seem to have duplicated the problem our customer sees. Here are a couple of
notes from my testing:

- The issue seems to occur far more regularly when the remote console is open
and being used; however, it did occur once without the console open as well.
- It also seems to occur more regularly with the Xen kernel; however, I was able
to trigger it with the bare-metal kernel as well (Xorg.0.log and softlockup
message attached from that session).
- The softlockups do cause the system as a whole to become extremely slow until
it begins printing the errors to the X log, at which point it seems to return to
some semblance of normalcy.

I'll ask Doug (the Integrity onsite engineer) if he's ever come across anything
like this before, and escalate to SEG at the same time as well.

-David

Problem Summary: Graphics often don't work on HP rx2660 systems. Usually
accompanied by soft lockups (possibly unrelated?) and always by the RADEON
errors in the Xorg.0.log.
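
Since the RADEON errors are the one symptom that always appears, a saved
Xorg.0.log can be checked for them mechanically.  The following is a rough
Python sketch, not part of the original report, that counts the messages quoted
in the description.  The signature names and the function count_signatures are
made up for illustration, and the patterns cover only the exact messages seen
in this bug.

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted in this bug; the dictionary keys
# are illustrative labels, not driver terminology.
SIGNATURES = {
    "idle_reset": re.compile(
        r"\(EE\) RADEON\(\d+\): Idle timed out, resetting engine"),
    "dc_flush_timeout": re.compile(
        r"RADEON\(\d+\): DC flush timeout: ffffffff"),
    "all_ones_status": re.compile(
        r"RADEON\(\d+\): Idle timed out: \d+ entries, stat=0xffffffff"),
    "no_devices": re.compile(r"\(EE\) No devices detected"),
}


def count_signatures(log_text):
    """Count how many log lines match each known failure signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return dict(counts)
```

A log from a healthy start should yield an empty result; repeated idle_reset
and all_ones_status hits would correspond to the loop described above.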

Supporting Materials: sosreports, X log, traceback messages.

Reproducer: You can reproduce the problem simply by starting/stopping (init 3/5)
X on a system. You don't even have to be looking at a monitor - I wasn't.

hp-merlion-01 in RHTS will reproduce the problem. hp-rx2660-03 in RHTS is the
same model system and should as well.
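
Because the failure is intermittent, the init 3/5 reproducer is easier to use
when scripted.  The following is a minimal Python sketch, not part of the
original report, of such a loop.  The function name toggle_runlevels, the
60-second settle time and the injectable run/sleep parameters are all
assumptions for illustration; it would need to run as root on the test system.

```python
import subprocess
import time


def toggle_runlevels(cycles, settle_seconds=60,
                     run=subprocess.check_call, sleep=time.sleep):
    """Switch between runlevel 3 (X stopped) and runlevel 5 (X started).

    Returns the list of commands issued.  `run` and `sleep` are injectable
    so the loop can be exercised without changing runlevels; the settle
    time is a guess, not a value from the report.
    """
    issued = []
    for _ in range(cycles):
        for level in ("3", "5"):
            command = ["init", level]
            run(command)           # stop or start X via the runlevel change
            issued.append(command)
            sleep(settle_seconds)  # give X time to come up or shut down
    return issued
```

After each cycle, Xorg.0.log and dmesg can be inspected for the RADEON errors
and soft lockups; the report suggests several cycles may be needed before the
problem shows up.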

Requested Action from SEG: Fix/escalate.

Comment 1 Alan Matsuoka 2008-06-19 15:26:52 UTC
Created attachment 309851 [details]
sosreport-rx2660.3611111111-825161-284597.tar.bz2

Comment 2 Alan Matsuoka 2008-06-19 15:29:02 UTC
Created attachment 309853 [details]
in house sysreport

Comment 3 Alan Matsuoka 2008-06-19 15:29:59 UTC
Created attachment 309855 [details]
Xorg.0.log from in house system

Comment 4 Alan Matsuoka 2008-06-19 15:30:40 UTC
Created attachment 309856 [details]
hp-merlion-01-softlockup.txt

Comment 5 Alan Matsuoka 2008-06-19 15:31:50 UTC
Created attachment 309857 [details]
sosreport-hp-merlion-01.186429-296757-f529cd.tar.bz2

Comment 6 RHEL Program Management 2008-06-19 15:58:51 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 7 Doug Chapman 2008-06-19 18:38:18 UTC
It appears that this is due to a bad BIOS firmware rev on the VGA controller.  I
re-flashed the card on the system here at Red Hat and it now works fine.  I am
working with HP support so they can get the customer system fixed as well.

Closing as NOTABUG.