804347 – Crash early in xen dom0 boot with 3.2.10-3 kernel

Summary: Crash early in xen dom0 boot with 3.2.10-3 kernel

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	16
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	806245 807401
Depends On:
Blocks:

Reported:	2012-03-18 00:26 UTC by Michael Young
Modified:	2012-04-25 04:24 UTC
CC List:	13 users
Fixed In Version:	kernel-3.3.0-8.fc17
Clone Of:
Clones:	806245
Environment:
Last Closed:	2012-04-01 00:27:17 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
x86: add io_apic_ops to allow interception (4.41 KB, patch) 2012-03-26 14:56 UTC, Konrad Rzeszutek Wilk
x86/apic_ops: Replace apic_ops with x86_apic_ops (8.26 KB, patch) 2012-03-26 14:57 UTC, Konrad Rzeszutek Wilk
xen/x86: Implement x86_apic_ops (2.55 KB, patch) 2012-03-26 14:57 UTC, Konrad Rzeszutek Wilk
Xeon - xm dmesg (7.42 KB, text/plain) 2012-04-02 12:12 UTC, Volnei
E5700 - xm dmesg (4.30 KB, text/plain) 2012-04-02 12:13 UTC, Volnei

Description Michael Young 2012-03-18 00:26:20 UTC

I get the following backtrace when booting 3.2.10-3.fc16.x86_64 (also 3.2.10-1 and 3.2.9-4, but not 3.2.9-2) as dom0 under Xen. Some experimentation shows that the kernel crashes if x86-ioapic-add-register-checks-for-bogus-io-apic-entries.patch is applied, but boots normally without it.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: [<ffffffff8134e51f>] xen_irq_init+0x1f/0xb0
PGD 0
Oops: 0002 [#1] SMP
CPU 0
Modules linked in:

Pid: 1, comm: swapper/0 Not tainted 3.2.10-3.fc16.x86_64 #1 Dell Inc. Inspiron 1525                  /0U990C
RIP: e030:[<ffffffff8134e51f>]  [<ffffffff8134e51f>] xen_irq_init+0x1f/0xb0
RSP: e02b: ffff8800d42cbb70  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00000000ffffffef RCX: 0000000000000001
RDX: 0000000000000040 RSI: 00000000ffffffef RDI: 0000000000000001
RBP: ffff8800d42cbb80 R08: ffff8800d6400000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffef
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000010
FS:  0000000000000000(0000) GS:ffff8800df5fe000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0:000000008005003b
CR2: 0000000000000040 CR3: 0000000001a05000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/0 (pid: 1, threadinfo ffff8800d42ca000, task ffff8800d42d0000)
Stack:
 00000000ffffffef 0000000000000010 ffff8800d42cbbe0 ffffffff8134f157
 ffffffff8100a9b2 ffffffff8182ffd1 00000000000000a0 00000000829e7384
 0000000000000002 0000000000000010 00000000ffffffff 0000000000000000
Call Trace:
 [<ffffffff8134f157>] xen_bind_pirq_gsi_to_irq+0x87/0x230
 [<ffffffff8100a9b2>] ? check_events+0x12/0x20
 [<ffffffff814bab42>] xen_register_pirq+0x82/0xe0
 [<ffffffff814bac1a>] xen_register_gsi.part.2+0x4a/0xd0
 [<ffffffff814bacc0>] acpi_register_gsi_xen+0x20/0x30
 [<ffffffff8103036f>] acpi_register_gsi+0xf/0x20
 [<ffffffff8131abdb>] acpi_pci_irq_enable+0x12e/0x202
 [<ffffffff814bc849>] pcibios_enable_device+0x39/0x40
 [<ffffffff812dc7ab>] do_pci_enable_device+0x4b/0x70
 [<ffffffff812dc878>] __pci_enable_device_flags+0xa8/0xf0
 [<ffffffff812dc8d3>] pci_enable_device+0x13/0x20
 [<ffffffff812d7cf8>] pci_enable_bridges+0x48/0x90
 [<ffffffff81b1942e>] pci_assign_unassigned_resources+0x1f0/0x224
 [<ffffffff813914c7>] ? put_device+0x17/0x20
 [<ffffffff81165d0b>] ? kfree+0x3b/0x150
 [<ffffffff812dfb8a>] ? pci_get_subsys+0x8a/0xc0
 [<ffffffff81b2aa76>] ? pcibios_allocate_bus_resources+0x8d/0x8d
 [<ffffffff81b2aae8>] pcibios_assign_resources+0x72/0x76
 [<ffffffff81b271f3>] ? parse_pmtmr+0x56/0x56
 [<ffffffff81002042>] do_one_initcall+0x42/0x180
 [<ffffffff81aebce7>] kernel_init+0xde/0x158
 [<ffffffff81066537>] ? schedule_tail+0x27/0xb0
 [<ffffffff815eec34>] kernel_thread_helper+0x4/0x10
 [<ffffffff815ecce3>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff815e4f7c>] ? retint_restore_args+0x5/0x6
 [<ffffffff815eec30>] ? gs_change+0x13/0x13
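For anyone triaging a similar oops, here is a minimal sketch of how the lines above can be read mechanically. It assumes the usual x86 conventions: each trace entry is `[<address>] symbol+offset/size`, a leading `?` marks an entry the unwinder is not sure about, and the `Oops:` code is the page-fault error code (bit 0 = page present, bit 1 = write, bit 2 = user mode). The names `Frame`, `parse_trace_line` and `decode_oops_code` are hypothetical helpers, not part of any kernel tooling.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    address: int     # return address printed inside [<...>]
    symbol: str      # function name, may contain dots (e.g. foo.part.2)
    offset: int      # offset of the address inside the function
    size: int        # total size of the function
    reliable: bool   # False when the entry is prefixed with '?'


# Matches e.g. " [<ffffffff8134f157>] xen_bind_pirq_gsi_to_irq+0x87/0x230"
_FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<q>\?\s+)?"
    r"(?P<sym>[\w.]+)\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_trace_line(line: str) -> Optional[Frame]:
    """Parse one call-trace line; return None if it is not a trace entry."""
    m = _FRAME_RE.search(line)
    if m is None:
        return None
    return Frame(
        address=int(m.group("addr"), 16),
        symbol=m.group("sym"),
        offset=int(m.group("off"), 16),
        size=int(m.group("size"), 16),
        reliable=m.group("q") is None,
    )


def decode_oops_code(code: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "present": bool(code & 0x1),  # 0 means the page was not mapped
        "write": bool(code & 0x2),    # 1 means the access was a write
        "user": bool(code & 0x4),     # 0 means the fault came from kernel mode
    }
```

Read this way, `Oops: 0002` with CR2 = 0x40 is a kernel-mode write to an unmapped address 0x40, which is what a store through a NULL structure pointer at a small field offset would look like.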

Comment 1 Michael Young 2012-03-18 23:56:56 UTC

I did a bit of hacking to get the kernel warning without the crash. The context for the warning is
[    0.000000] ACPI: PM-Timer IO Port: 0x1008
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[    0.000000] BIOS bug: APIC version is 0 for CPU 0/0x0, fixing up to 0x10
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[    0.000000] ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
[    0.000000] I/O APIC 0xfec00000 regs return all ones, skipping!
[    0.000000] IOAPIC[0]: apic_id 2, version 255, address 0xfec00000, GSI 0-255
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.000000] SMP: Allowing 2 CPUs, 0 hotplug CPUs
[    0.000000] nr_irqs_gsi: 272
I suspect the crash occurs a bit later, when the pirqs are mapped:
[    0.000000] NR_IRQS:16640 nr_irqs:512 16
[    0.000000] xen: sci override: global_irq=9 trigger=0 polarity=0
[    0.000000] xen: registering gsi 9 triggering 0 polarity 0
[    0.000000] xen: --> pirq=9 -> irq=9 (gsi=9)
[    0.000000] xen: acpi sci 9
[    0.000000] xen: --> pirq=1 -> irq=1 (gsi=1)
[    0.000000] xen: --> pirq=2 -> irq=2 (gsi=2)
[    0.000000] xen: --> pirq=3 -> irq=3 (gsi=3)
[    0.000000] xen: --> pirq=4 -> irq=4 (gsi=4)
[    0.000000] xen: --> pirq=5 -> irq=5 (gsi=5)
[    0.000000] xen: --> pirq=6 -> irq=6 (gsi=6)
[    0.000000] xen: --> pirq=7 -> irq=7 (gsi=7)
[    0.000000] xen: --> pirq=8 -> irq=8 (gsi=8)
[    0.000000] xen_map_pirq_gsi: returning irq 9 for gsi 9
[    0.000000] xen: --> pirq=9 -> irq=9 (gsi=9)
[    0.000000] xen: --> pirq=10 -> irq=10 (gsi=10)
[    0.000000] xen: --> pirq=11 -> irq=11 (gsi=11)
[    0.000000] xen: --> pirq=12 -> irq=12 (gsi=12)
[    0.000000] xen: --> pirq=13 -> irq=13 (gsi=13)
[    0.000000] xen: --> pirq=14 -> irq=14 (gsi=14)
[    0.000000] xen: --> pirq=15 -> irq=15 (gsi=15)
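As a rough illustration of why the log shows "version 255" and "GSI 0-255" right after the "regs return all ones" warning, here is a sketch that decodes an I/O APIC version register that reads back as 0xFFFFFFFF. It assumes the standard register layout (version in bits 0-7, maximum redirection entry in bits 16-23) and assumes the check in the patch amounts to "registers 0, 1 and 2 all read as all ones". The function names and the `LEGACY_IRQS` constant are hypothetical and only model the behaviour; this is not the kernel code.

```python
ALL_ONES = 0xFFFFFFFF

# Assumption: 16 legacy IRQs are added on top of the GSI count,
# which would match the "nr_irqs_gsi: 272" line in the log above.
LEGACY_IRQS = 16


def decode_ioapic_version_reg(reg01: int) -> dict:
    """Decode the I/O APIC version register (register index 1)."""
    version = reg01 & 0xFF
    # The field holds the index of the last redirection entry,
    # so the number of entries is that value plus one.
    entries = ((reg01 >> 16) & 0xFF) + 1
    return {
        "version": version,
        "entries": entries,
        "gsi_range": (0, entries - 1),  # with gsi_base 0, as in the log
    }


def looks_bogus(reg00: int, reg01: int, reg02: int) -> bool:
    """Model of the check: every register reading back as all ones."""
    return reg00 == ALL_ONES and reg01 == ALL_ONES and reg02 == ALL_ONES


if __name__ == "__main__":
    info = decode_ioapic_version_reg(ALL_ONES)
    print(info)
    print("nr_irqs_gsi would be", info["entries"] + LEGACY_IRQS)
```

Under these assumptions an all-ones read gives version 255 and 256 entries, so the numbers in the log are what one would expect when the register reads do not reach a real I/O APIC.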

Comment 2 Michael Young 2012-03-19 11:55:09 UTC

I tried a scratch build without x86-ioapic-add-register-checks-for-bogus-io-apic-entries.patch, and it boots successfully as dom0.

Comment 3 Josh Boyer 2012-03-19 13:31:41 UTC

Adding Konrad to CC.  This patch is already queued for upstream, so I've emailed Suresh as well.

Comment 4 Konrad Rzeszutek Wilk 2012-03-19 19:43:24 UTC

Thanks. Responded on the email thread. Will instrument it a bit to see if my theory holds true.

Comment 5 Volnei 2012-03-22 13:34:36 UTC

Hi,

Unfortunately, the problem continues to happen.
Looking at the changelog of the new kernel, I didn't see any reference to the patch cited by Michael Young.

kernel-3.3.0-4.fc16.x86_64
xen-4.1.2-6.fc16.x86_64


Is there any estimate of when this problem will be solved?

Thanks a lot

Comment 6 Konrad Rzeszutek Wilk 2012-03-22 15:38:26 UTC

Waiting on Ingo to ack these patches: https://lkml.org/lkml/2012/3/21/632

Comment 7 Konrad Rzeszutek Wilk 2012-03-22 15:39:39 UTC

There is also the quick-n-dirty hack: https://lkml.org/lkml/2012/3/20/349

Comment 8 Dave Jones 2012-03-22 16:45:49 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 9 Dave Jones 2012-03-22 16:50:22 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 10 Michael Young 2012-03-22 16:55:22 UTC

It crashes the same way with kernel-3.3.0-4.fc16 (as expected).

Comment 11 Dave Jones 2012-03-22 17:00:12 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 12 Michael Young 2012-03-23 00:45:27 UTC

I have done another scratch build (http://koji.fedoraproject.org/koji/taskinfo?taskID=3923761) with the patches from https://lkml.org/lkml/2012/3/21/632, and it boots successfully as a dom0.

Comment 13 Konrad Rzeszutek Wilk 2012-03-23 20:44:57 UTC

*** Bug 806245 has been marked as a duplicate of this bug. ***

Comment 14 Konrad Rzeszutek Wilk 2012-03-26 14:56:55 UTC

Created attachment 572774 [details]
x86: add io_apic_ops to allow interception

Comment 15 Konrad Rzeszutek Wilk 2012-03-26 14:57:25 UTC

Created attachment 572775 [details]
x86/apic_ops: Replace apic_ops with x86_apic_ops

Comment 16 Konrad Rzeszutek Wilk 2012-03-26 14:57:54 UTC

Created attachment 572776 [details]
xen/x86: Implement x86_apic_ops

Comment 17 Josh Boyer 2012-03-27 17:49:09 UTC

I've committed these patches to F16 now.  Should be in the next update.

Comment 18 Gerry Reno 2012-03-27 18:18:26 UTC

*** Bug 807401 has been marked as a duplicate of this bug. ***

Comment 19 Fedora Update System 2012-03-29 23:09:09 UTC

kernel-3.3.0-8.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.3.0-8.fc16

Comment 20 Fedora Update System 2012-03-29 23:11:57 UTC

kernel-3.3.0-8.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.3.0-8.fc17

Comment 21 Fedora Update System 2012-03-30 03:00:52 UTC

Package kernel-3.3.0-8.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.3.0-8.fc17'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-4888/kernel-3.3.0-8.fc17
then log in and leave karma (feedback).
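After updating and rebooting, a quick check that the running kernel is at least the 3.3.0-8 build named in this report might look like the sketch below. It is deliberately simplified: it only understands release strings of the form `X.Y.Z-N.<dist>...` and is not a substitute for real RPM version comparison. The names `kernel_tuple` and `at_least_fixed_build` are hypothetical.

```python
import platform
import re
from typing import Optional, Tuple

# Build this report lists as carrying the fix (3.3.0-8 on fc16 and fc17).
FIXED_BUILD = (3, 3, 0, 8)

# Matches e.g. "3.3.0-8.fc16.x86_64"; anything else is left unparsed.
_RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)\.")


def kernel_tuple(release: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch, build) or None if the format differs."""
    m = _RELEASE_RE.match(release)
    if m is None:
        return None
    return tuple(int(part) for part in m.groups())


def at_least_fixed_build(release: str) -> Optional[bool]:
    """True/False when the release string parses, None when it does not."""
    parsed = kernel_tuple(release)
    if parsed is None:
        return None
    return parsed >= FIXED_BUILD


if __name__ == "__main__":
    running = platform.release()
    print(running, "->", at_least_fixed_build(running))
```

A `False` result only means the build is older than 3.3.0-8; it does not by itself mean the kernel will crash, since builds from before the I/O APIC patch was added (for example 3.2.9-2) also boot as dom0.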

Comment 22 Gerry Reno 2012-03-30 20:16:14 UTC

Did kernel-PAE also get submitted?

On F16 I just did a 'yum --enablerepo=updates-testing list kernel-PAE' but do not see any newer kernel-PAE.

Comment 23 Josh Boyer 2012-03-30 20:51:45 UTC

(In reply to comment #22)
> Did kernel-PAE also get submitted?
> 
> On F16 I just did a 'yum --enablerepo=updates-testing list kernel-PAE' but do
> not see any newer kernel-PAE.

They are all built together, so yes.  Maybe your mirror is a bit stale.  It can be found here:

http://dl.fedoraproject.org/pub/fedora/linux/updates/testing/17/i386/kernel-PAE-3.3.0-8.fc17.i686.rpm

I don't believe the F16 version has been pushed yet.

Comment 24 Fedora Update System 2012-04-01 00:27:17 UTC

kernel-3.3.0-8.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 25 Volnei 2012-04-02 12:12:13 UTC

Created attachment 574485 [details]
Xeon - xm dmesg

Comment 26 Volnei 2012-04-02 12:13:08 UTC

Created attachment 574486 [details]
E5700 - xm dmesg

Comment 27 Volnei 2012-04-02 12:13:23 UTC

Hi,

thanks a lot. Great job!
I have attached my dmesg and xm dmesg output for your review.

Intel Xeon and E5700.

Comment 28 Fedora Update System 2012-04-12 02:58:35 UTC

kernel-3.3.0-8.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 29 Bill McGonigle 2012-04-24 06:39:05 UTC

This is still a problem with current Fedora 15 (still on maintenance).  It looks like Suresh's patch didn't get reverted until 3.2.15, which we don't have.  I can confirm that 2.6.43.2-6.fc15.x86_64 in testing boots OK, though I don't know if the decision to jump to 3.3 on f15 has been made yet.

Comment 30 Michael Young 2012-04-24 07:48:11 UTC

I have the impression that the F15 3.3 kernels are ready but they don't get enough testing before they are obsoleted by the next update. If you want to give the 2.6.43.2-6.fc15 kernel positive karma you can do so at
https://admin.fedoraproject.org/updates/FEDORA-2012-6406/kernel-2.6.43.2-6.fc15
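For readers puzzled by a "3.3 kernel" carrying the version 2.6.43.2: as I understand the Fedora 15 numbering, upstream 3.N.y kernels were packaged there as 2.6.(40+N).y. A small sketch of that mapping follows; it assumes the convention holds exactly as described, and `f15_to_upstream` is a hypothetical helper name.

```python
def f15_to_upstream(version: str) -> str:
    """Map a Fedora 15 style 2.6.(40+N).y version to upstream 3.N.y.

    Accepts a bare version ("2.6.43.2") or a version-release string
    ("2.6.43.2-6.fc15"); the release part is ignored.
    """
    base = version.split("-", 1)[0]
    parts = base.split(".")
    if len(parts) < 3 or parts[0] != "2" or parts[1] != "6":
        raise ValueError(f"not a 2.6.x version: {version!r}")
    minor = int(parts[2]) - 40
    if minor < 0:
        raise ValueError(f"predates the renumbering: {version!r}")
    # Keep any stable-release component (the ".y") unchanged.
    return ".".join(["3", str(minor)] + parts[3:])


if __name__ == "__main__":
    print(f15_to_upstream("2.6.43.2-6.fc15"))
```

Under this assumption 2.6.43.2-6.fc15 corresponds to upstream 3.3.2.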

Comment 31 Josh Boyer 2012-04-24 11:40:24 UTC

(In reply to comment #30)
> I have the impression that the F15 3.3 kernels are ready but they don't get
> enough testing before they are obsoleted by the next update. If you want to
> give the 2.6.43.2-6.fc15 kernel positive karma you can do so at
> https://admin.fedoraproject.org/updates/FEDORA-2012-6406/kernel-2.6.43.2-6.fc15

Yes, that's exactly what is happening.  It's rather frustrating.

Comment 32 Bill McGonigle 2012-04-25 04:24:51 UTC

Thanks for the explanation, guys.

> billmcgonigle - 2012-04-25 04:20:47
> stable for 24 hours on a Xen server. Fixes Xen crash on boot regression currently shipping in F15!
> bodhi - 2012-04-25 04:20:48
> Critical path update approved
> bodhi - 2012-04-25 04:20:50
> This update has reached the stable karma threshold and will be pushed to the stable updates repository

oh, so this one's gonna be my fault. ;)
