Bug 451745

Summary: a check for a buggy HP SAL caused problems booting as a guest in a virtual machine
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: ia64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Luming Yu <luyu>
Assignee: Luming Yu <luyu>
QA Contact: Martin Jenner <mjenner>
CC: achiang, alex_williamson, cward, dchapman, dzickus, gbeshers, peterm, tony.luck
Keywords: OtherQA
Doc Type: Bug Fix
Last Closed: 2009-01-20 19:58:31 UTC
Attachments (description, flags):
a back port (flags: none)
move SAL_CACHE_FLUSH check later (flags: none)
a back port (flags: none)
a back port to fix ia64_xen boot hang (flags: none)

Description Luming Yu 2008-06-17 02:21:16 UTC
Description of problem:
A check for a buggy HP SAL caused problems booting as a guest in a virtual machine.

Upstream patch:
http://tinyurl.com/5n8el5


Comment 1 Luming Yu 2008-06-18 14:26:14 UTC
Created attachment 309738 [details]
a back port

Comment 2 Luming Yu 2008-06-19 08:34:05 UTC
With this back port patch applied, the 2.6.18-94.el5 kernel starts to boot and
then hangs in check_sal_cache_flush, in the loop waiting for a timer IPI (just
after the platform_send_ipi call).

I tested upstream 2.6.26-rc6, which boots fine.

I noticed that 2.6.26-rc6 was configured as _DIG while 2.6.18-94.el5 was
configured as _GENERIC. I then changed 2.6.18-94.el5 to _DIG and re-tested; it
still hangs.

So taking this back port patch alone does not look promising. We probably need
other patches as well, but I have no clue which ones yet.

Comment 3 Luming Yu 2008-06-19 08:36:41 UTC
Adding Doug since this is an HP-related issue.

Comment 4 Tony Luck 2008-06-19 17:24:48 UTC
Upstream both DIG and GENERIC kernels boot on my Tiger and HP test boxes.

Does RHEL5 call the check_sal_cache_flush() function earlier than mainline 
(specifically has machvec been set up before the call)?  If not then I could 
see that there would be a problem with using platform_send_ipi() in this patch. 
Otherwise I'm a bit puzzled why this isn't working.


Comment 5 Alex Chiang 2008-06-19 17:58:01 UTC
Hi Luming,

A few questions --

1. Is your hang in the virtual guest only? Or does it occur on bare metal too?

2. What virtualization technology are we talking about here? xen? kvm?

3. Where exactly is the kernel hanging? Does it hang *before* the call to
SAL_CACHE_FLUSH or afterwards?

Thanks.

Comment 6 Doug Chapman 2008-06-19 18:24:19 UTC
(In reply to comment #4)
> 
> Does RHEL5 call the check_sal_cache_flush() function earlier than mainline 
> (specifically has machvec been set up before the call)?  If not then I could 
> see that there would be a problem with using platform_send_ipi() in this patch. 
> Otherwise I'm a bit puzzled why this isn't working.
> 

Tony,

Yes, it does appear that RHEL5 is calling check_sal_cache_flush() earlier than
upstream.

In RHEL5 it is called via setup_arch()->ia64_sal_init();
check_sal_cache_flush() is the last line in ia64_sal_init().

Upstream, it is called via setup_arch() directly, just a few lines later than
ia64_sal_init().

I will try moving check_sal_cache_flush() along with Luming's patch to see if
that resolves the issue.




Comment 7 Luming Yu 2008-06-20 01:48:17 UTC
So I will try this patch: fa1d19e5d9a94120f31e5783ab44758f46892d94

[IA64] move SAL_CACHE_FLUSH check later in boot

The check to see if the firmware drops interrupts during a
SAL_CACHE_FLUSH is done too early in the boot. SAL_CACHE_FLUSH expects
to be able to make PAL calls in virtual mode; on some cell-based
machines a fault occurs, causing an MCA. This patch moves the check
after mmu_context_init so the TLB and VHPT are properly set up.

Signed-off-by: Troy Heber <troy.heber>
Signed-off-by: Tony Luck <tony.luck>

Comment 8 Luming Yu 2008-06-20 02:39:37 UTC
Created attachment 309893 [details]
move SAL_CACHE_FLUSH check later

This back port patch fixes the boot hang.
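
The ordering difference described in comments 6 and 7 can be pictured with a small toy model. This is only a sketch and not kernel code: the step names mirror the functions named in this bug, and the model assumes that the only thing that matters is whether check_sal_cache_flush runs before or after mmu_context_init. The real prerequisite may be something else that is set up between the two points.

```python
# Toy model of the two call orders discussed in this bug. Not kernel code.

def boot(order):
    """Walk the boot steps in order and report what the check would see."""
    mmu_ready = False  # assumption: stands in for "TLB and VHPT are set up"
    result = None
    for step in order:
        if step == "mmu_context_init":
            mmu_ready = True
        elif step == "check_sal_cache_flush":
            # The check sends itself a timer IPI and waits for it. In this
            # model the IPI only arrives once the prerequisite is in place.
            result = "ok" if mmu_ready else "hang"
    return result

# RHEL5 2.6.18-94.el5: the check is the last line of ia64_sal_init().
RHEL5_ORDER = ["ia64_sal_init", "check_sal_cache_flush", "mmu_context_init"]

# With the "move SAL_CACHE_FLUSH check later" patch: the check runs after
# mmu_context_init.
MOVED_ORDER = ["ia64_sal_init", "mmu_context_init", "check_sal_cache_flush"]

print(boot(RHEL5_ORDER))  # hang
print(boot(MOVED_ORDER))  # ok
```

The model only captures why moving the call can change the outcome without touching the check itself; it says nothing about which piece of early setup the check actually depends on.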

Comment 9 Alex Chiang 2008-06-25 20:56:01 UTC
Hi Luming,

You will probably want this patch as well, so we don't break sn2:

2826f8c0f4c97b7db33e2a680f184d828eb7a785
[IA64] Fix boot failure on ia64/sn2

Thanks.


Comment 10 Luming Yu 2008-06-26 07:48:25 UTC
Created attachment 310312 [details]
a back port

A back port of the upstream patch described in the comment above.

Comment 11 RHEL Program Management 2008-06-26 16:47:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 13 Luming Yu 2008-07-24 07:49:07 UTC
This patch series is causing issues on ia64-xen (doesn't boot due to unsupported
ipi 0xef). 

Comment 14 Luming Yu 2008-07-24 08:12:53 UTC
Checking the code in ia64_send_ipi (arch/ia64/kernel/irq_64.c,
2.6.18-94.el5) shows that IA64_TIMER_VECTOR is actually NOT supported by the
current RHEL 5 xen code. The patch "[RHEL 5.3 PATCH 1/2]  bz 451745:  Update
check_sal_cache_flush to use platform_send_ipi" would cause the ia64 xen kernel
to hang at boot, because the patch assumes that platform_send_ipi works for
IA64_TIMER_VECTOR and has a loop that only exits when the IA64_TIMER IPI arrives.

I'm not familiar with the xen upstream status, and don't know whether this is
still a problem in xen upstream.

Alex, would you please help check whether xen upstream fixes the problem?
Please provide a pointer to the upstream fix if you want me to back port it.

--Luming

Comment 15 Doug Chapman 2008-07-24 16:06:16 UTC
Luming,

I just started digging into the original patch:

http://tinyurl.com/5n8el5

It appears this is to fix just the HP rx5670.  We don't support that system past
RHEL4, so we would never run xen on it.  Is there a reason you are posting this?
Was it a request from someone at HP?

My suggestion is we do not include this in RHEL5.

- Doug


Comment 16 Doug Chapman 2008-07-24 18:48:19 UTC
OK, we know how to fix this now.  I worked with Alex Williamson and Alex Chiang
back at HP, and we have this patch, which has now been submitted upstream:

http://xen.markmail.org/message/2xwp64qu3e7k4545

This needs to be included to make this work under xen.



Comment 17 Luming Yu 2008-07-25 08:14:42 UTC
Created attachment 312632 [details]
a back port to fix ia64_xen boot hang

Comment 18 Don Zickus 2008-08-22 19:49:52 UTC
in kernel-2.6.18-105.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 19 Don Zickus 2008-09-03 03:40:18 UTC
in kernel-2.6.18-107.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 22 Chris Ward 2008-11-14 14:04:55 UTC
~~~ Attention Partners! ~~~

Please test this URGENT / HIGH priority bug at your earliest convenience to ensure it makes it into the upcoming RHEL 5.3 release. The fix should be present in the Partner Snapshot #2 (kernel*-122), available NOW at ftp://partners.redhat.com. As we are approaching the end of the RHEL 5.3 test cycle, it is critical that you report back testing results as soon as possible. 

If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla Keywords field to indicate this. If you find that this issue has not been properly fixed, set the bug status to ASSIGNED with a comment describing the issues you encountered.

All NEW issues encountered (not part of this bug fix) should have a new bug created with the proper keywords and flags set to trigger a review for their inclusion in the upcoming RHEL 5.3 or other future release. Post a link in this bugzilla pointing to the new issue to ensure it is not overlooked.

For any additional questions, speak with your Partner Manager.

Comment 23 Chris Ward 2008-11-18 18:13:19 UTC
~~ Snapshot 3 is now available ~~ 

Snapshot 3 is now available for Partner Testing, which should contain a fix that resolves this bug. ISO's available as usual at ftp://partners.redhat.com. Your testing feedback is vital! Please let us know if you encounter any NEW issues (file a new bug) or if you have VERIFIED the fix is present and functioning as expected (add PartnerVerified Keyword).

Ping your Partner Manager with any additional questions. Thanks!

Comment 24 Luming Yu 2008-11-24 07:15:32 UTC
Confirmed patch is in the -123 kernel.

Comment 26 errata-xmlrpc 2009-01-20 19:58:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html