Bug 476184
Summary: | RHEL5.3 pv guests crash randomly on reboot orders. | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Gurhan Ozen <gozen>
Component: | kernel-xen | Assignee: | Rik van Riel <riel>
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner>
Severity: | urgent | Priority: | high
Version: | 5.3 | CC: | bburns, dzickus, jburke, prarit, syeghiay, xen-maint
Target Milestone: | rc | Keywords: | Regression
Hardware: | All | OS: | Linux
Doc Type: | Bug Fix | Last Closed: | 2009-01-20 20:04:11 UTC
Description
Gurhan Ozen
2008-12-12 10:16:21 UTC
Is this a regression from 5.2?

(In reply to comment #2)
> Is this a regression from 5.2?

As far as I know, yes it is. I didn't hit this with 5.2. However, I don't know if we ran any 5.2 tests on this particular box.

Gurhan, could I trick you into bisecting the 5.3 development kernels to see when the problem started? I don't think we changed any Xen cpufreq code between 5.2 and 5.3, but maybe some related code changes broke stuff.

Yes you can. Let me know what you'd like me to do.

To begin, I would like to know the kernel version that started breaking :) This is most easily achieved by picking a kernel somewhere halfway in between 5.2 and 5.3. If that one is good, pick the halfway point between that and 5.3, etc. until you find the broken kernel. It shouldn't be more than a handful of installs and reboots. I just hope brew hasn't thrown out too many of the intermediate kernels :(

Rik, I have bad news for you. I installed a 5.2 pv guest, and tried all these kernels on it:

    # rpm -qa | grep kernel-xen
    kernel-xen-2.6.18-121.el5
    kernel-xen-2.6.18-124.el5
    kernel-xen-2.6.18-92.el5
    kernel-xen-2.6.18-105.el5
    kernel-xen-2.6.18-120.el5
    kernel-xen-2.6.18-122.el5
    kernel-xen-2.6.18-125.el5
    kernel-xen-2.6.18-115.el5
    kernel-xen-2.6.18-123.el5
    kernel-xen-2.6.18-126.el5

Yes, including up to -126, which had crashed in the bug report. It doesn't crash on the 5.2 guest. However, the 5.3 guest crashes. Just to make sure that there wasn't something funky with the guest itself, I installed another 5.3 pv guest, and was able to reproduce the issue with the new guest as well. So the 5.3 distro crashes, but 5.2 doesn't, even with the 5.3 kernel. Any tips to zoom in on what might be causing this?

<jarod> riel: the change from built-in powernow-k8 to modular might be a good area to look at closer...
<riel> jarod: *nod*
<riel> except ... 5.2 userspace with 5.3 kernel works fine
<jarod> 5.2 userspace wouldn't ever load powernow-k8
<riel> good point
<jarod> (if my memory serves correctly)
<riel> I wonder why it's trying to do ACPI-anything at all in a xenU
<riel> there should not be any ACPI thing visible
<jarod> so w/5.2 userspace, setting DRIVER=powernow-k8 (iirc) in the config file should cause it to get loaded
<riel> gozen_: could you try ^^^ ? :)
<jarod> and I suspect the problem would probably appear again
<jarod> I'm pretty sure the only relevant change in cpuspeed from 5.2 to 5.3 was the logic in the initscript to modprobe powernow-k8 when needed
<riel> sounds fair

Ok, so I did the same thing for the 5.3 installation and tried with these kernels:

    # rpm -q kernel-xen
    kernel-xen-2.6.18-126.el5
    kernel-xen-2.6.18-92.el5
    kernel-xen-2.6.18-105.el5
    kernel-xen-2.6.18-115.el5
    kernel-xen-2.6.18-120.el5
    kernel-xen-2.6.18-122.el5
    kernel-xen-2.6.18-125.el5
    kernel-xen-2.6.18-124.el5

This issue seems to have started with the 2.6.18-125 kernel; anything before -125 is fine. To be sure, I rebooted the -124 kernel 270 times.

riel, trying jarod's suggestion, I was able to crash 5.2 userspace too! Add DRIVER=powernow-k8 in /etc/sysconfig/cpuspeed and this problem happens in 5.2 userspace as well.

With some gdbing on the 126 debuginfo package, the oops is pinpointed to the memcmp in find_psb_table:

    (gdb) list *0x1a3c
    0x1a3c is in powernowk8_cpu_init (arch/i386/kernel/cpu/cpufreq/powernow-k8.c:701).
    696         for (i = 0xc0000; i < 0xffff0; i += 0x10) {
    697                 /* Scan BIOS looking for the signature. */
    698                 /* It can not be at ffff0 - it is too big. */
    699
    700                 psb = phys_to_virt(i);
    701                 if (memcmp(psb, PSB_ID_STRING, PSB_ID_STRING_LEN) != 0)
    702                         continue;
    703
    704                 dprintk("found PSB header at 0x%p\n", psb);
    705

Created attachment 327006: proposed patch to avoid the issue
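Editorial sketch: the halving procedure requested earlier in this thread can be written as a small binary search over the kernel-xen release numbers that were tested. This is only an illustration. The `crashes()` predicate is a hypothetical stand-in for "install the build, reboot it repeatedly, and see whether it oopses", hard-coded here to the outcome reported above (-125 and later crash). The search also assumes that once a build is bad, every later build is bad too.

```c
#include <assert.h>
#include <stddef.h>

/* kernel-xen release numbers (2.6.18-N.el5) tried in this report, sorted. */
static const int builds[] = { 92, 105, 115, 120, 121, 122, 123, 124, 125, 126 };

/*
 * Hypothetical test predicate: stands in for installing the build and
 * rebooting the guest many times. Hard-coded to the reported result.
 */
static int crashes(int release)
{
    return release >= 125;
}

/*
 * Return the index of the first bad build.
 * Assumes b[0] is good, b[n-1] is bad, and badness is monotonic.
 */
static size_t first_bad(const int *b, size_t n, int (*is_bad)(int))
{
    size_t good = 0;
    size_t bad = n - 1;

    assert(n >= 2);
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;  /* halfway point */

        if (is_bad(b[mid]))
            bad = mid;   /* regression is at mid or earlier */
        else
            good = mid;  /* regression is after mid */
    }
    return bad;
}
```

With the ten builds listed, `first_bad(builds, 10, crashes)` probes -121, -123, -124 and -125, so four install-and-test rounds are enough. Because the crash is intermittent, each "good" verdict in practice needs many reboots before it can be trusted.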
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

Oh boy, the patch linux-2.6-i386-Add-check-for-dmi_data-in-powernow_k8-driver.patch from July 2008 removes roughly the same code that my workaround patch adds. Prarit, why did you remove those lines of code?

<prarit> riel: clalance & I are chatting about it now (along with dzickus, gozen, and jarod).
<prarit> clalance & I don't think your suggested patch is correct.
<riel> prarit: ok, then I'll reassign the bug to you
<prarit> riel: sure :)
<riel> prarit: your patch removed that check initially :)
<riel> what I don't know is why it took until -125 to show up as a regression
<prarit> riel: yeah...
<riel> prarit: what would your proposed fix be?
<riel> or ... what is wrong about the patch I proposed? :)
<prarit> riel: In theory on PV guests, there is no dmi data. So all calls to get anything from dmi_data should be NULL, right?
<prarit> Therefore the powernow-k8 driver should have failed to load because of this piece of code:
<prarit> if (preregister_acpi_perf == 1 && cpu_family == CPU_OPTERON) {
<prarit>         char * dmi_data = dmi_get_system_info(DMI_BIOS_VENDOR);
<prarit>         printk("%s: dmi_data = %s\n", __FUNCTION__, dmi_data);
<prarit>         if (dmi_data && !strncmp(dmi_data, "Hewlett-Packard", 15)) {
<prarit> #ifdef CONFIG_XEN
<prarit>                 /* Disable cpufreq for HP AMD Opteron systems */
<prarit>                 printk("%s: This BIOS is %s .... disabling cpufreq "
<prarit>                        "support\n", __FUNCTION__, dmi_data);
<prarit>                 return -EPERM;
<prarit> #else
<prarit> But the code is continuing to execute.
<riel> where is that code?
<prarit> arch/i386/kernel/cpu/cpufreq/powernow-k8.c
<riel> what function or line?
<riel> oh found it, in powernowk8_init()
<prarit> powerno...
<prarit> :)
<prarit> Sorry for lag riel -- we're chatting on this end.
<riel> can you try "dmidecode" on gozen's test guest?
<riel> just to be sure
<riel> prarit: oh wait - I see
<riel> prarit: if !dmi_data, that return -EPERM is never taken :)
<riel> prarit: and we fall through to the next code
<prarit> .... clalance has a good issue: How did this EVER work?
<prarit> Because this seems to have just started failing...
<riel> yeah, pure luck
<riel> apparently gozen_ sometimes needs to reboot the guest quite a few times before it hits
<riel> at least we now know the culprit
<riel> and the fix - reinstate the Xen test your patch removed
<prarit> riel: Maybe I'm being dense ;) -- I agree that the code is incorrect, but ... I don't see what is left to chance that this sometimes occurs and sometimes succeeds.
<riel> it sure explains why there's no obvious culprit to be found in the -125 changes
<riel> prarit: I'm not sure either - we have 1 hour to find out
<riel> prarit: or we could spend that hour verifying that reinstating that !is_initial_xendomain() test fixes things
<prarit> I'm worried there is some random corruption going on :/ It should always work or always fail.
<prarit> It seems like doing that would be a band-aid ... /me is nervous
<riel> BIOS-provided physical RAM map:
<riel> Xen: 0000000000000000 - 000000001fc00000 (usable)
<riel> no BIOS area in a domU e820 map
<riel> so what is at the BIOS addresses is rather random
<prarit> riel: We think we know what's going on -- will drop you an email with patch in 5 mins.
<riel> prarit: ok sweet
<prarit> riel: Basically it's a "what you just said" patch

in kernel-2.6.18-127.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
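Editorial sketch: the fall-through discussed in the chat can be modelled in plain userspace C. This is not the driver code and not the shipped patch. `struct env`, `init_as_quoted()`, `init_with_xen_guard()` and the `-ENODEV` return value are hypothetical names and choices made for illustration. Only the `dmi_data && !strncmp(...)` condition and the idea of an `is_initial_xendomain()`-style test come from the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of what the driver can see at init time. */
struct env {
    const char *bios_vendor; /* models dmi_get_system_info(DMI_BIOS_VENDOR); NULL in a PV guest */
    int is_xen;              /* running under a Xen kernel */
    int is_dom0;             /* models is_initial_xendomain() */
};

enum { PROCEED_TO_BIOS_SCAN = 0 };

/*
 * The check as quoted in the chat: it bails out only when DMI data
 * exists AND names the vendor. A NULL vendor string skips the branch.
 */
static int init_as_quoted(const struct env *e)
{
    assert(e != NULL);
    if (e->bios_vendor && !strncmp(e->bios_vendor, "Hewlett-Packard", 15))
        return -EPERM;
    return PROCEED_TO_BIOS_SCAN; /* PV guest ends up here */
}

/*
 * Sketch of the direction agreed in the chat: refuse to continue in a
 * Xen guest that is not dom0, before any BIOS address is touched.
 * The error code is an assumption, not taken from the real patch.
 */
static int init_with_xen_guard(const struct env *e)
{
    assert(e != NULL);
    if (e->is_xen && !e->is_dom0)
        return -ENODEV;
    return init_as_quoted(e);
}
```

For a PV guest (`bios_vendor == NULL`, `is_xen == 1`, `is_dom0 == 0`) the quoted check returns `PROCEED_TO_BIOS_SCAN`, which in the real driver leads to the `memcmp` scan over 0xc0000-0xffff0. As noted in the chat, a domU e820 map has no BIOS area, so what sits at those addresses is rather random. The guarded variant returns an error before that scan is reached.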