Bug 451745
Summary: | a check for a buggy HP SAL caused problems booting as a guest in a virtual machine | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Luming Yu <luyu>
Component: | kernel | Assignee: | Luming Yu <luyu>
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner>
Severity: | high | Priority: | high
Version: | 5.3 | CC: | achiang, alex_williamson, cward, dchapman, dzickus, gbeshers, peterm, tony.luck
Hardware: | ia64 | OS: | Linux
Keywords: | OtherQA | Doc Type: | Bug Fix
Last Closed: | 2009-01-20 19:58:31 UTC | |

Description
Luming Yu
2008-06-17 02:21:16 UTC
Created attachment 309738 [details]
a back port
With this back port patch applied, the 2.6.18-94.el5 kernel boots and then hangs in check_sal_cache_flush, in the loop waiting for a timer IPI (just after the platform_send_ipi call). I tested upstream 2.6.26-rc6, and it boots just fine. I noticed that 2.6.26-rc6 is configured as _DIG while 2.6.18-94.el5 is configured as _GENERIC. I then changed 2.6.18-94.el5 to _DIG and re-tested; it still hangs. So taking this back port patch alone does not look promising. We probably need other patches as well, but I have no clue which ones yet.

Adding Doug since this is an HP-related issue.

Upstream, both DIG and GENERIC kernels boot on my Tiger and HP test boxes. Does RHEL5 call the check_sal_flush() function earlier than mainline (specifically, has machvec been set up before the call)? If not, then I could see that there would be a problem with using platform_send_ipi() in this patch. Otherwise I'm a bit puzzled why this isn't working.

Hi Luming,

A few questions:

1. Is your hang in the virtual guest only? Or does it occur on bare metal too?
2. What virtualization technology are we talking about here? xen? kvm?
3. Where exactly is the kernel hanging? Does it hang *before* the call to SAL_CACHE_FLUSH or afterwards?

Thanks.

(In reply to comment #4)
> Does RHEL5 call the check_sal_flush() function earlier than mainline
> (specifically has machvec been set up before the call)? If not then I could
> see that there would be a problem with using platform_send_ipi() in this patch.
> Otherwise I'm a bit puzzled why this isn't working.

Tony,

Yes, it does appear that RHEL5 is calling check_sal_cache_flush() earlier than upstream. In RHEL5 it is called via setup_arch()->ia64_sal_init(); check_sal_cache_flush() is the last line in ia64_sal_init(). Upstream, it is called from setup_arch() directly, just a few lines later than ia64_sal_init(). I will try moving check_sal_cache_flush() along with Luming's patch to see if that resolves the issue.

So I will try this patch:

fa1d19e5d9a94120f31e5783ab44758f46892d94 [IA64] move SAL_CACHE_FLUSH check later in boot

The check to see if the firmware drops interrupts during a SAL_CACHE_FLUSH is done too early in the boot. SAL_CACHE_FLUSH expects to be able to make PAL calls in virtual mode; on some cell based machines a fault occurs, causing an MCA. This patch moves the check after mmu_context_init so the TLB and VHPT are properly set up.

Signed-off-by: Troy Heber <troy.heber>
Signed-off-by: Tony Luck <tony.luck>

Created attachment 309893 [details]
move SAL_CACHE_FLUSH check later
This back port patch fixes the boot hang
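
The two call sites discussed above can be compared with a small model. This is only a sketch of the ordering, not kernel code: the function names mirror the ones in the comments, but their bodies are placeholders, and treating `mmu_context_init` as the sole prerequisite is an assumption taken from the upstream commit message.

```python
def setup_arch(check_inside_sal_init):
    """Toy model of the boot ordering discussed in this report.

    check_inside_sal_init=True  models the RHEL5 layout described above
                                (check is the last line of ia64_sal_init).
    check_inside_sal_init=False models the upstream layout after the move
                                (check runs after mmu_context_init).
    Returns whether mmu_context_init had already run when the check ran.
    """
    state = {"mmu_context_ready": False, "check_saw_mmu_context": None}

    def check_sal_cache_flush():
        # Placeholder body: only record what was initialised at call time.
        state["check_saw_mmu_context"] = state["mmu_context_ready"]

    def ia64_sal_init():
        if check_inside_sal_init:
            check_sal_cache_flush()

    def mmu_context_init():
        state["mmu_context_ready"] = True

    # Simplified setup_arch(): every other boot step is omitted.
    ia64_sal_init()
    mmu_context_init()
    if not check_inside_sal_init:
        check_sal_cache_flush()
    return state["check_saw_mmu_context"]


if __name__ == "__main__":
    print("RHEL5 layout, prerequisite ready:", setup_arch(True))
    print("Upstream layout, prerequisite ready:", setup_arch(False))
```

In this model the RHEL5 layout runs the check before the prerequisite is in place, while the moved check runs after it, which is the difference the second back port removes.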
Hi Luming,

You will probably want this patch as well, so we don't break sn2:

2826f8c0f4c97b7db33e2a680f184d828eb7a785 [IA64] Fix boot failure on ia64/sn2

Thanks.

Created attachment 310312 [details]
a back port
A back port of the upstream patch described in the comment above.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

This patch series is causing issues on ia64-xen (it doesn't boot due to unsupported ipi 0xef).

Checking the code in ia64_send_ipi (arch/ia64/kernel/irq_64.c, 2.6.18-94.el5) shows that IA64_TIMER_VECTOR is actually NOT supported by the current RHEL 5 xen code. The patch "[RHEL 5.3 PATCH 1/2] bz 451745: Update check_sal_cache_flush to use platform_send_ipi" would cause the ia64 xen kernel to hang at boot, because it assumes that platform_send_ipi with IA64_TIMER_VECTOR works and has a loop that breaks only when the IA64_TIMER IPI arrives. I'm not familiar with the xen upstream status and don't know whether this is still a problem in xen upstream. Alex, would you please help check whether xen upstream fixes the problem? Please provide a pointer to the upstream fix if you want me to back port it. --Luming

Luming, I just started digging into the original patch: http://tinyurl.com/5n8el5 It appears this is to fix just the HP rx5670. We don't support that system past RHEL4, so we would never run xen on it. Is there a reason you are posting this? Was it a request from someone at HP? My suggestion is that we do not include this in RHEL5. - Doug

OK, we know how to fix this now. I worked with Alex Williamson and Alex Chiang back at HP and we have this patch, which has now been submitted upstream: http://xen.markmail.org/message/2xwp64qu3e7k4545 This needs to be included to make this work under xen.

Created attachment 312632 [details]
a back port to fix ia64_xen boot hang
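
The xen hang described above comes down to a wait loop whose exit condition depends on an IPI that the platform never delivers. The following is a minimal model of that situation, not the kernel or xen code: `make_platform`, `wait_for_timer_ipi`, and `max_polls` are hypothetical names, the vector value comes from the "unsupported ipi 0xef" message in this report, and the assumption that an unsupported vector is simply never marked pending is mine. The poll bound exists only so the model terminates.

```python
# Vector number taken from the "unsupported ipi 0xef" message in this report.
IA64_TIMER_VECTOR = 0xEF


def make_platform(supported_vectors):
    """Return (send_ipi, is_pending) for a toy platform (hypothetical helper)."""
    pending = set()

    def send_ipi(vector):
        # Assumption: an unsupported vector is never marked pending.
        if vector in supported_vectors:
            pending.add(vector)

    def is_pending(vector):
        return vector in pending

    return send_ipi, is_pending


def wait_for_timer_ipi(send_ipi, is_pending, max_polls=1000):
    """Send the timer IPI, then poll until it shows up as pending.

    Returns the number of polls that came back empty before the IPI was seen,
    or None if it never arrived within max_polls. The bound is a model-only
    safeguard; the report describes a loop that waits for the IPI to arrive.
    """
    send_ipi(IA64_TIMER_VECTOR)
    for polls in range(max_polls):
        if is_pending(IA64_TIMER_VECTOR):
            return polls
    return None


if __name__ == "__main__":
    bare_metal = make_platform({IA64_TIMER_VECTOR})
    xen_guest = make_platform(set())
    print("bare metal:", wait_for_timer_ipi(*bare_metal))
    print("xen guest:", wait_for_timer_ipi(*xen_guest))
```

On the platform that delivers the vector the wait ends on the first poll; on the one that does not, the model gives up at the bound, which corresponds to the boot hang seen on ia64 xen before the additional back port.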
in kernel-2.6.18-105.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

in kernel-2.6.18-107.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

~~~ Attention Partners! ~~~

Please test this URGENT / HIGH priority bug at your earliest convenience to ensure it makes it into the upcoming RHEL 5.3 release. The fix should be present in the Partner Snapshot #2 (kernel*-122), available NOW at ftp://partners.redhat.com. As we are approaching the end of the RHEL 5.3 test cycle, it is critical that you report back testing results as soon as possible.

If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla Keywords field to indicate this. If you find that this issue has not been properly fixed, set the bug status to ASSIGNED with a comment describing the issues you encountered.

All NEW issues encountered (not part of this bug fix) should have a new bug created with the proper keywords and flags set to trigger a review for their inclusion in the upcoming RHEL 5.3 or other future release. Post a link in this bugzilla pointing to the new issue to ensure it is not overlooked.

For any additional questions, speak with your Partner Manager.

~~ Snapshot 3 is now available ~~

Snapshot 3 is now available for Partner Testing, which should contain a fix that resolves this bug. ISOs are available as usual at ftp://partners.redhat.com. Your testing feedback is vital! Please let us know if you encounter any NEW issues (file a new bug) or if you have VERIFIED the fix is present and functioning as expected (add PartnerVerified Keyword). Ping your Partner Manager with any additional questions. Thanks!

Confirmed patch is in the -123 kernel.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html