Description of problem:
I didn't hit this every time, but one in several tries is likely to hit it. Installed 5.3 x86_64 dom0/pv guest. When virsh reboot $guest is issued, the guest sometimes crashes with the following backtrace:

Checking for hardware changes [ OK ]
Unable to handle kernel paging request at ffff8800000ce000 RIP:
 [<ffffffff8020bbb1>] memcmp+0x8/0x22
PGD f5f067 PUD f60067 PMD f61067 PTE 0
Oops: 0000 [1] SMP
last sysfs file: /class/net/eth0/address
CPU 0
Modules linked in: powernow_k8 freq_table dm_multipath scsi_dh scsi_mod parport_pc lp parport xennet pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 817, comm: modprobe Not tainted 2.6.18-126.el5xen #1
RIP: e030:[<ffffffff8020bbb1>] [<ffffffff8020bbb1>] memcmp+0x8/0x22
RSP: e02b:ffff88001d365bf0 EFLAGS: 00010206
RAX: 0000000000000041 RBX: 0000000000000000 RCX: 000000000000000a
RDX: 000000000000000a RSI: ffffffff881760fd RDI: ffff8800000ce000
RBP: ffff88001dd394c0 R08: 0000000000000001 R09: ffff880000098e00
R10: 0000000000000003 R11: 0000000000000000 R12: ffff8800000ce000
R13: 0000000000000000 R14: ffff880000098e00 R15: 00000000fffffff4
FS: 00002ae0b8bc7240(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process modprobe (pid: 817, threadinfo ffff88001d364000, task ffff88001f64b820)
Stack: ffffffff88174a3c 000000001d365c78 0000000000000003 0000000000000001
 ffff88001fdb78c0 0000000000000001 ffff88000001fa70 0000000000000001
 0000000000000000 ffff88000001fa68
Call Trace:
 [<ffffffff88174a3c>] :powernow_k8:powernowk8_cpu_init+0x55c/0xdec
 [<ffffffff802855c8>] __wake_up_common+0x3e/0x68
 [<ffffffff8028816d>] __cond_resched+0x1c/0x44
 [<ffffffff80263a0d>] _spin_lock_irq+0x9/0x14
 [<ffffffff80262099>] wait_for_completion+0xa1/0xaa
 [<ffffffff80263a0d>] _spin_lock_irq+0x9/0x14
 [<ffffffff8026349f>] __down_write_nested+0x35/0x9a
 [<ffffffff804043f3>] cpufreq_add_dev+0x174/0x57f
 [<ffffffff8021a69c>] vsnprintf+0x559/0x59e
 [<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff80217548>] release_console_sem+0x1b1/0x205
 [<ffffffff8028b9f5>] vprintk+0x308/0x329
 [<ffffffff80261ead>] thread_return+0x96/0x113
 [<ffffffff80286bc9>] task_rq_lock+0x3f/0x71
 [<ffffffff8028830a>] set_cpus_allowed+0xb2/0xbf
 [<ffffffff8028ba68>] printk+0x52/0xc6
 [<ffffffff8039fb09>] sysdev_driver_register+0x61/0xbd
 [<ffffffff80403423>] cpufreq_register_driver+0xb9/0x194
 [<ffffffff802a01a7>] sys_init_module+0xaf/0x1e8
 [<ffffffff8025f106>] system_call+0x86/0x8b
 [<ffffffff8025f080>] system_call+0x0/0x8b

Code: 0f b6 17 29 c2 89 d0 75 10 48 ff c7 48 ff c6 48 ff c9 48 85
RIP [<ffffffff8020bbb1>] memcmp+0x8/0x22
 RSP <ffff88001d365bf0>
CR2: ffff8800000ce000
 <0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
# rpm -qa | grep xen
xen-devel-3.0.3-79.el5
xen-devel-3.0.3-79.el5
xen-libs-3.0.3-79.el5
xen-3.0.3-79.el5
xen-debuginfo-3.0.3-79.el5
kernel-xen-2.6.18-126.el5
xen-debuginfo-3.0.3-79.el5
kernel-xen-devel-2.6.18-126.el5
xen-libs-3.0.3-79.el5

How reproducible:
Reliably. As noted above, roughly one in every several reboot commands is likely to hit this.

Additional info:
It's a 32-cpu (don't know how many cores) system:

processor : 31
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : Quad-Core AMD Opteron(tm) Processor 8356
stepping : 3
cpu MHz : 2300.080
cache size : 512 KB
physical id : 31
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext fxsr_opt lm 3dnowext 3dnow constant_tsc pni monitor cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips : 5752.70
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc [6] [7] [8]

132 GB physical memory.
Is this a regression from 5.2?
(In reply to comment #2)
> Is this a regression from 5.2?

As far as I know, yes it is. I didn't hit this with 5.2. However, I don't know if we ran any 5.2 tests on this particular box.
Gurhan, could I trick you into bisecting the 5.3 development kernels to see when the problem started? I don't think we changed any Xen cpufreq code between 5.2 and 5.3, but maybe some related code changes broke stuff.
Yes you can. Let me know what you'd like me to do.
To begin, I would like to know the kernel version that started breaking :) This is most easily achieved by picking a kernel somewhere halfway in-between 5.2 and 5.3. If that one is good, pick the halfway point between that and 5.3, etc. until you find the broken kernel. It shouldn't be more than a handful of installs and reboots. I just hope brew hasn't thrown out too many of the intermediate kernels :(
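The halving procedure described above can be sketched in C. This is only an illustration, not a tool used for this bug: `first_bad_build`, `is_bad_fn` and `example_is_bad` are hypothetical names, the crash predicate is a stand-in for "install the kernel and reboot the guest many times", and the sketch assumes the oldest build is good, the newest is bad, and every build after the first bad one is also bad.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns nonzero if the given build reproduces the crash. */
typedef int (*is_bad_fn)(int build);

/*
 * Return the index of the first bad build in builds[0..n-1].
 * Assumes builds[] is sorted oldest to newest, builds[0] is good,
 * builds[n-1] is bad, and once a build is bad all later ones are too.
 */
size_t first_bad_build(const int *builds, size_t n, is_bad_fn is_bad)
{
    assert(n >= 2);

    size_t good = 0;      /* index of a build known to be good */
    size_t bad = n - 1;   /* index of a build known to be bad */

    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(builds[mid]))
            bad = mid;    /* first bad build is at mid or earlier */
        else
            good = mid;   /* first bad build is after mid */
    }
    return bad;
}

/* Illustration only: pretend every build numbered 125 or later crashes. */
int example_is_bad(int build)
{
    return build >= 125;
}
```

Each step halves the remaining range, so ten candidate builds need at most four test installs. Because this crash is intermittent, a "good" verdict for a build is only as reliable as the number of reboots behind it.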
Rik, I have bad news for you. I installed a 5.2 pv guest and tried all these kernels on it:

# rpm -qa | grep kernel-xen
kernel-xen-2.6.18-121.el5
kernel-xen-2.6.18-124.el5
kernel-xen-2.6.18-92.el5
kernel-xen-2.6.18-105.el5
kernel-xen-2.6.18-120.el5
kernel-xen-2.6.18-122.el5
kernel-xen-2.6.18-125.el5
kernel-xen-2.6.18-115.el5
kernel-xen-2.6.18-123.el5
kernel-xen-2.6.18-126.el5

Yes, including up to -126, which crashed in the bug report. It doesn't crash on the 5.2 guest. However, the 5.3 guest crashes. Just to make sure that there wasn't something funky with the guest itself, I installed another 5.3 pv guest and was able to reproduce the issue with the new guest as well.

So the 5.3 distro crashes, but 5.2 doesn't, even with the 5.3 kernel. Any tips to zoom in on what might be causing this?
<jarod> riel: the change from built-in powernow-k8 to modular might be a good area to look at closer...
<riel> jarod: *nod*
<riel> except ... 5.2 userspace with 5.3 kernel works fine
<jarod> 5.2 userspace wouldn't ever load powernow-k8
<riel> good point
<jarod> (if my memory serves correctly)
<riel> I wonder why it's trying to do ACPI-anything at all in a xenU
<riel> there should not be any ACPI thing visible
<jarod> so w/5.2 userspace, setting DRIVER=powernow-k8 (iirc) in the config file should cause it to get loaded
<riel> gozen_: could you try ^^^ ? :)
<jarod> and I suspect the problem would probably appear again
<jarod> I'm pretty sure the only relevant change in cpuspeed from 5.2 to 5.3 was the logic in the initscript to modprobe powernow-k8 when needed
<riel> sounds fair
Ok, so I did the same thing for the 5.3 installation and tried with these kernels:

# rpm -q kernel-xen
kernel-xen-2.6.18-126.el5
kernel-xen-2.6.18-92.el5
kernel-xen-2.6.18-105.el5
kernel-xen-2.6.18-115.el5
kernel-xen-2.6.18-120.el5
kernel-xen-2.6.18-122.el5
kernel-xen-2.6.18-125.el5
kernel-xen-2.6.18-124.el5

This issue seems to have started with the 2.6.18-125 kernel; anything before -125 is fine. To be sure, I rebooted the -124 kernel 270 times.
riel, trying jarod's suggestion, I was able to crash 5.2 userspace too! Add DRIVER=powernow-k8 to /etc/sysconfig/cpuspeed and this problem happens in 5.2 userspace as well.
With some gdbing on the 126 debuginfo package, the oops is pinpointed to the memcmp in find_psb_table:

(gdb) list *0x1a3c
0x1a3c is in powernowk8_cpu_init (arch/i386/kernel/cpu/cpufreq/powernow-k8.c:701).
696             for (i = 0xc0000; i < 0xffff0; i += 0x10) {
697                     /* Scan BIOS looking for the signature. */
698                     /* It can not be at ffff0 - it is too big. */
699
700                     psb = phys_to_virt(i);
701                     if (memcmp(psb, PSB_ID_STRING, PSB_ID_STRING_LEN) != 0)
702                             continue;
703
704                     dprintk("found PSB header at 0x%p\n", psb);
705
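The shape of that scan can be shown as a small user-space sketch in C. It is not the driver code: `find_signature` is a hypothetical helper that walks an ordinary buffer instead of the physical range 0xc0000 to 0xffff0, and the signature value "AMDK7PNOW!" with length 10 is assumed from the upstream powernow-k8 header (the oops above is at least consistent with it: RDX is 0xa and RAX is 0x41, ASCII 'A').

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIG     "AMDK7PNOW!"  /* assumed value of PSB_ID_STRING */
#define SIG_LEN 10            /* assumed value of PSB_ID_STRING_LEN */
#define STEP    0x10          /* the driver advances 16 bytes per probe */

/*
 * Scan buf[0..len-1] in 16-byte steps for the signature.
 * Returns the offset of the first match, or -1 if none is found.
 * Every probe stays inside the caller's buffer; the driver loop instead
 * relies on the whole legacy BIOS range being mapped and readable.
 */
long find_signature(const unsigned char *buf, size_t len)
{
    size_t off;

    assert(buf != NULL);
    for (off = 0; off + SIG_LEN <= len; off += STEP) {
        if (memcmp(buf + off, SIG, SIG_LEN) == 0)
            return (long)off;
    }
    return -1;
}
```

The faulting address ffff8800000ce000 is the page for physical address 0xce000, which lies inside the scanned range, and the oops shows PTE 0 for it: the memcmp faulted as soon as the scan reached a page with no mapping in the guest.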
Created attachment 327006 proposed patch to avoid the issue
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Oh boy, the patch linux-2.6-i386-Add-check-for-dmi_data-in-powernow_k8-driver.patch from July 2008 removes roughly the same code that my workaround patch adds. Prarit, why did you remove those lines of code?
<prarit> riel: clalance & I are chatting about it now (along with dzickus, gozen, and jarod).
<prarit> clalance & I don't think your suggested patch is correct.
<riel> prarit: ok, then I'll reassign the bug to you
<prarit> riel: sure :)
<riel> prarit: your patch removed that check initially :)
<riel> what I don't know is why it took until -125 to show up as a regression
<prarit> riel: yeah...
<riel> prarit: what would your proposed fix be?
<riel> or ... what is wrong about the patch I proposed? :)
<prarit> riel: In theory on PV guests, there is no dmi data. So all calls to get anything from dmi_data should be NULL, right?
<prarit> Therefore the powernow-k8 driver should have failed to load because of this piece of code:
<prarit> if (preregister_acpi_perf == 1 && cpu_family == CPU_OPTERON) {
<prarit>         char * dmi_data = dmi_get_system_info(DMI_BIOS_VENDOR);
<prarit>         printk("%s: dmi_data = %s\n", __FUNCTION__, dmi_data);
<prarit>         if (dmi_data && !strncmp(dmi_data, "Hewlett-Packard", 15)) {
<prarit> #ifdef CONFIG_XEN
<prarit>                 /* Disable cpufreq for HP AMD Opteron systems */
<prarit>                 printk("%s: This BIOS is %s .... disabling cpufreq "
<prarit>                        "support\n", __FUNCTION__, dmi_data);
<prarit>                 return -EPERM;
<prarit> #else
<prarit> But the code is continuing to execute.
<riel> where is that code?
<prarit> arch/i386/kernel/cpu/cpufreq/powernow-k8.c
<riel> what function or line?
<riel> oh found it, in powernowk8_init()
<prarit> powerno...
<prarit> :)
<prarit> Sorry for lag riel -- we're chatting on this end.
<riel> can you try "dmidecode" on gozen's test guest?
<riel> just to be sure
<riel> prarit: oh wait - I see
<riel> prarit: if !dmi_data, that return -EPERM is never taken :)
<riel> prarit: and we fall through to the next code
<prarit> .... clalance has a good issue: How did this EVER work?
<prarit> Because this seems to have just started failing...
<riel> yeah, pure luck
<riel> apparently gozen_ sometimes needs to reboot the guest quite a few times before it hits
<riel> at least we now know the culprit
<riel> and the fix - reinstate the Xen test your patch removed
<prarit> riel: Maybe I'm being dense ;) -- I agree that the code is incorrect, but ... I don't see what is left to chance that this sometimes occurs and sometimes succeeds.
<riel> it sure explains why there's no obvious culprit to be found in the -125 changes
<riel> prarit: I'm not sure either - we have 1 hour to find out
<riel> prarit: or we could spend that hour verifying that reinstating that !is_initial_xendomain() test fixes things
<prarit> I'm worried there is some random corruption going on :/ It should always work or always fail.
<prarit> It seems like doing that would be a band-aid ... /me is nervous
<riel> BIOS-provided physical RAM map:
<riel> Xen: 0000000000000000 - 000000001fc00000 (usable)
<riel> no BIOS area in a domU e820 map
<riel> so what is at the BIOS addresses is rather random
<prarit> riel: We think we know what's going on -- will drop you an email with patch in 5 mins.
<riel> prarit: ok sweet
<prarit> riel: Basically it's a "what you just said" patch
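The fall-through discussed above can be reduced to a small C sketch. This is not the actual patch: `init_check_buggy` and `init_check_guarded` are hypothetical names, the `is_initial_domain` flag stands in for the kernel's is_initial_xendomain() test, the other conditions in the quoted code are left out, and returning -ENODEV from the guard is an assumption made only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Shape of the check quoted in the chat: a NULL vendor string (no DMI data,
 * as on a PV guest) skips the branch entirely, so init carries on.
 */
int init_check_buggy(const char *dmi_vendor)
{
    if (dmi_vendor && !strncmp(dmi_vendor, "Hewlett-Packard", 15))
        return -EPERM;
    return 0;   /* falls through to the BIOS scan even with no DMI data */
}

/*
 * Guarded shape: refuse to continue unless running in the initial domain,
 * before looking at any DMI data. The error code here is an assumption.
 */
int init_check_guarded(const char *dmi_vendor, int is_initial_domain)
{
    if (!is_initial_domain)
        return -ENODEV;
    if (dmi_vendor && !strncmp(dmi_vendor, "Hewlett-Packard", 15))
        return -EPERM;
    return 0;
}
```

With no DMI data the first function returns 0 and the caller proceeds to the scan; the second stops a non-initial domain before any BIOS addresses are touched.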
in kernel-2.6.18-127.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html