Bug 906225 - kdump doesn't work on UEFI platforms
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 18
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Lingzhu Xiang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-01-31 08:25 UTC by Lingzhu Xiang
Modified: 2013-07-04 20:56 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-11 01:22:56 UTC
Type: Bug
Embargoed:


Attachments
3.7.0-0.rc2.git4.2.fc19.x86_64.log (8.50 KB, text/plain)
2013-02-06 10:00 UTC, Lingzhu Xiang

Comment 1 WANG Chao 2013-02-05 09:52:27 UTC
Hi, can you try appending '--console-serial' to KEXEC_ARGS in /etc/sysconfig/kdump and also appending earlyprintk=serial to the kernel cmdline? Let's see where kexec gets stuck.

Thanks
Chao
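
For reference, the two settings suggested above could be applied along the lines of the sketch below. It assumes the Fedora layout in which /etc/sysconfig/kdump is a shell-style file with KEXEC_ARGS="..." and KDUMP_COMMANDLINE_APPEND="..." lines. Putting earlyprintk into KDUMP_COMMANDLINE_APPEND is only one possible reading of "kernel cmdline". The helper name append_to_var is hypothetical and is not part of kexec-tools. The example edits a scratch copy, not the live file.

```shell
#!/bin/sh
# append_to_var FILE VAR TOKEN
# Hypothetical helper: append TOKEN inside the quotes of a VAR="..." line.
# Does nothing if TOKEN already appears on that line.
append_to_var() {
    file=$1
    var=$2
    token=$3

    # Already present: leave the file untouched.
    if grep -q -e "^${var}=\".*${token}" "$file"; then
        return 0
    fi

    tmp=$(mktemp) || return 1
    # First expression handles an empty value (VAR=""); "t" skips the second
    # expression when the first one matched. The second expression appends
    # TOKEN after the existing value, separated by a space.
    sed -e "s|^${var}=\"\"|${var}=\"${token}\"|" -e t \
        -e "s|^\(${var}=\"[^\"]*\)\"|\1 ${token}\"|" "$file" > "$tmp" || return 1
    # Write back through the original file so its permissions are preserved.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example against a scratch copy (the real file would be /etc/sysconfig/kdump).
cfg=$(mktemp)
printf '%s\n' 'KEXEC_ARGS=""' \
    'KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices"' > "$cfg"
append_to_var "$cfg" KEXEC_ARGS --console-serial
append_to_var "$cfg" KDUMP_COMMANDLINE_APPEND earlyprintk=serial
cat "$cfg"
rm -f "$cfg"
```

After changing the real file, kdump.service has to be restarted so that the crash kernel is reloaded with the new arguments.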

Comment 2 Lingzhu Xiang 2013-02-05 10:06:25 UTC
Here we go:

[   38.411424] SysRq : Trigger a crash
[   38.414941] BUG: unable to handle kernel NULL pointer dereference at           (null)
[   38.422785] IP: [<ffffffff8139f286>] sysrq_handle_crash+0x16/0x20
[   38.428885] PGD 472bf0067 PUD 472b95067 PMD 0 
[   38.433377] Oops: 0002 [#1] SMP 
[   38.436638] Modules linked in: vfat fat cdc_ether iTCO_wdt usbnet coretemp iTCO_vendor_support mii bnx2 ioatdma shpchp i2c_i801 i7core_edac dca lpc_ich mfd_core edac_core kvm_intel kvm crc32c_intel microcode serio_raw mgag200 i2c_algo_bit drm_kms_helper ttm mptsas drm scsi_transport_sas mptscsih i2c_core mptbase
[   38.464820] CPU 1 
[   38.466661] Pid: 752, comm: bash Not tainted 3.7.4-204.fc18.x86_64 #1 IBM System x3550 M3 -[7944I21]-/69Y4438     
[   38.477169] RIP: 0010:[<ffffffff8139f286>]  [<ffffffff8139f286>] sysrq_handle_crash+0x16/0x20
[   38.485690] RSP: 0018:ffff88047304be38  EFLAGS: 00010092
[   38.490990] RAX: 000000000000000f RBX: ffffffff81c84f60 RCX: 000000000000000d
[   38.498106] RDX: 000000000000005a RSI: 0000000000000046 RDI: 0000000000000063
[   38.505223] RBP: ffff88047304be38 R08: ffffffff81e40460 R09: 0000000000000534
[   38.512339] R10: 0000000000000002 R11: 0000000000000533 R12: 0000000000000063
[   38.519454] R13: 0000000000000286 R14: 0000000000000000 R15: 000000000000000a
[   38.526570] FS:  00007fea5b683740(0000) GS:ffff880277c20000(0000) knlGS:0000000000000000
[   38.534641] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   38.540373] CR2: 0000000000000000 CR3: 0000000472d3d000 CR4: 00000000000007e0
[   38.547490] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   38.554607] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   38.561723] Process bash (pid: 752, threadinfo ffff88047304a000, task ffff880473989720)
[   38.569705] Stack:
[   38.571717]  ffff88047304be78 ffffffff8139f9a7 ffff880473989720 0000000000000002
[   38.579171]  ffff880273248200 00007fea5b68a000 0000000000000002 ffff88047304bf50
[   38.586626]  ffff88047304bea8 ffffffff8139fa5a ffff880273248200 00007fea5b68a000
[   38.594081] Call Trace:
[   38.596527]  [<ffffffff8139f9a7>] __handle_sysrq+0x127/0x190
[   38.602175]  [<ffffffff8139fa5a>] write_sysrq_trigger+0x4a/0x50
[   38.608084]  [<ffffffff811f9398>] proc_reg_write+0x78/0xb0
[   38.613560]  [<ffffffff8119510c>] vfs_write+0xac/0x180
[   38.618690]  [<ffffffff81195452>] sys_write+0x52/0xa0
[   38.623732]  [<ffffffff81639a1e>] ? do_page_fault+0xe/0x10
[   38.629209]  [<ffffffff8163e059>] system_call_fastpath+0x16/0x1b
[   38.635202] Code: ef e8 9f f7 ff ff eb c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 66 66 66 66 90 55 c7 05 44 0d aa 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 66 66 66 66 90 55 31 c0 48 89 e5 
[   38.655330] RIP  [<ffffffff8139f286>] sysrq_handle_crash+0x16/0x20
[   38.661515]  RSP <ffff88047304be38>
[   38.664997] CR2: 0000000000000000
I'm in purgatory
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 3.7.4-204.fc18.x86_64 (mockbuild.fedoraproject.org) (gcc version 4.7.2 20121109 (Red Hat 4.7.2-8) (GCC) ) #1 SMP Wed Jan 23 16:44:29 UTC 2013
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.7.4-204.fc18.x86_64 root=/dev/mapper/fedora_ibm--x3550m3--02-root ro rd.md=0 rd.dm=0 rd.lvm.lv=fedora_ibm-x3550m3-02/swap rd.lvm.lv=fedora_ibm-x3550m3-02/root rd.luks=0 vconsole.keymap=us debug console=ttyS0,115200 LANG=en_US.UTF-8  earlyprintk=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off memmap=exactmap memmap=368K@64K memmap=130676K@786432K acpi_rsdp=0x7f7fe014 elfcorehdr=917108K memmap=4K#432K memmap=4K#636K memmap=1024K#2087804K memmap=128K#2088828K
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000006bfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000006c000-0x000000000006cfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000000006d000-0x000000000009efff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009f000-0x000000000009ffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007d456fff] usable
[    0.000000] BIOS-e820: [mem 0x000000007d457000-0x000000007d480fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000007d481000-0x000000007d836fff] usable
[    0.000000] BIOS-e820: [mem 0x000000007d837000-0x000000007d8e6fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000007d8e7000-0x000000007f5eefff] usable
[    0.000000] BIOS-e820: [mem 0x000000007f5ef000-0x000000007f6defff] reserved
[    0.000000] BIOS-e820: [mem 0x000000007f6df000-0x000000007f7defff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000007f7df000-0x000000007f7fefff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000007f7ff000-0x000000007f7fffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000080000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000ff800000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000047fffffff] usable
[    0.000000] bootconsole [earlyser0] enabled
[    0.000000] e820: last_pfn = 0x480000 max_arch_pfn = 0x400000000
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] e820: user-defined physical RAM map:
[    0.000000] user: [mem 0x0000000000010000-0x000000000006bfff] usable
[    0.000000] user: [mem 0x000000000006c000-0x000000000006cfff] ACPI data
[    0.000000] user: [mem 0x000000000009f000-0x000000000009ffff] ACPI data
[    0.000000] user: [mem 0x0000000030000000-0x0000000037f9cfff] usable
[    0.000000] user: [mem 0x000000007f6df000-0x000000007f7fefff] ACPI data
[    0.000000] DMI 2.5 present.
[    0.000000] DMI: IBM System x3550 M3 -[7944I21]-/69Y4438     , BIOS -[D6E148BUS-1.08]- 06/25/2010
[    0.000000] e820: update [mem 0x00000000-0x0000ffff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0x37f9d max_arch_pfn = 0x400000000
[    0.000000] MTRR default type: uncachable
[    0.000000] MTRR fixed ranges enabled:
[    0.000000]   00000-9FFFF write-back
[    0.000000]   A0000-FFFFF uncachable
[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 base 0000000000 mask FF80000000 write-back
[    0.000000]   1 base 0100000000 mask FF00000000 write-back
[    0.000000]   2 base 0200000000 mask FE00000000 write-back
[    0.000000]   3 base 0400000000 mask FC00000000 write-back
[    0.000000]   4 base 0094000000 mask FFFF000000 write-combining
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000]   8 disabled
[    0.000000]   9 disabled
[    0.000000] x86 PAT enabled: cpu 0, old 0x7010600070106, new 0x7010600070106
[    0.000000] initial memory mapped: [mem 0x00000000-0x1fffffff]
[    0.000000] Base memory trampoline at [ffff880000066000] 66000 size 24576
[    0.000000] Using GB pages for direct mapping
[    0.000000] init_memory_mapping: [mem 0x00000000-0x37f9cfff]
[    0.000000]  [mem 0x00000000-0x37dfffff] page 2M
[    0.000000]  [mem 0x37e00000-0x37f9cfff] page 4k
[    0.000000] kernel direct mapping tables up to 0x37f9cfff @ [mem 0x00063000-0x00065fff]
[    0.000000] RAMDISK: [mem 0x37802000-0x37f89fff]
[    0.000000] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[    0.000000] IP: [<ffffffff812882f5>] security_capable+0x15/0x20
[    0.000000] PGD 0 
[    0.000000] Oops: 0000 [#1] SMP 
[    0.000000] Modules linked in:
[    0.000000] CPU 0 
[    0.000000] Pid: 0, comm: swapper Not tainted 3.7.4-204.fc18.x86_64 #1 IBM System x3550 M3 -[7944I21]-/69Y4438     
[    0.000000] RIP: 0010:[<ffffffff812882f5>]  [<ffffffff812882f5>] security_capable+0x15/0x20
[    0.000000] RSP: 0000:ffffffff81c01df8  EFLAGS: 00010046
[    0.000000] RAX: 0000000000000000 RBX: ffffffff81c13420 RCX: 0000000000000001
[    0.000000] RDX: 0000000000000025 RSI: ffffffff81c29be0 RDI: ffffffff81c31be0
[    0.000000] RBP: ffffffff81c01df8 R08: ffffffff81d87290 R09: 000000000000003d
[    0.000000] R10: 303837337830206d R11: 656d5b203a4b5349 R12: ffff880037f8989c
[    0.000000] R13: 0000000037f8a000 R14: 0000000037f9d000 R15: 0000000000000000
[    0.000000] FS:  0000000000000000(0000) GS:ffffffff81ce6000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: 0000000000000030 CR3: 0000000030c0b000 CR4: 00000000000000b0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.000000] Process swapper (pid: 0, threadinfo ffffffff81c00000, task ffffffff81c13420)
[    0.000000] Stack:
[    0.000000]  ffffffff81c01e18 ffffffff8106a8ad 0000000037802000 0000000007ff9000
[    0.000000]  ffffffff81c01e28 ffffffff8106a8f7 ffffffff81c01e48 ffffffff81d2c58f
[    0.000000]  0000000000000000 ffffffff81d74ad8 ffffffff81c01e58 ffffffff81d2deeb
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff8106a8ad>] ns_capable+0x2d/0x60
[    0.000000]  [<ffffffff8106a8f7>] capable+0x17/0x20
[    0.000000]  [<ffffffff81d2c58f>] acpi_os_get_root_pointer+0x1c/0x75
[    0.000000]  [<ffffffff81d2deeb>] acpi_initialize_tables+0x45/0x59
[    0.000000]  [<ffffffff81d2c0bf>] acpi_table_init+0x1b/0x99
[    0.000000]  [<ffffffff81d0902a>] acpi_boot_table_init+0x1e/0x87
[    0.000000]  [<ffffffff81d01795>] setup_arch+0xb39/0xc86
[    0.000000]  [<ffffffff81cfb947>] start_kernel+0xd4/0x3d4
[    0.000000]  [<ffffffff81cfb356>] x86_64_start_reservations+0x131/0x135
[    0.000000]  [<ffffffff81cfb45a>] x86_64_start_kernel+0x100/0x10f
[    0.000000] Code: cb 00 55 48 89 e5 ff 50 28 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 e8 1b 59 3b 00 48 8b 05 a4 68 cb 00 55 b9 01 00 00 00 48 89 e5 <ff> 50 30 5d c3 66 0f 1f 44 00 00 e8 fb 58 3b 00 48 8b 05 84 68 
[    0.000000] RIP  [<ffffffff812882f5>] security_capable+0x15/0x20
[    0.000000]  RSP <ffffffff81c01df8>
[    0.000000] CR2: 0000000000000030
[    0.000000] ---[ end trace 790b8d13ee19e9b1 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
PANIC: early exception 0d rip 10:ffffffff81043ff6 error 77b cr2 30
[    0.000000] Pid: 0, comm: swapper Tainted: G      D      3.7.4-204.fc18.x86_64 #1
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff81043ff6>] ? native_irq_enable+0x6/0x10
[    0.000000]  [<ffffffff81cfb189>] early_idt_handler+0x69/0x9c
[    0.000000]  [<ffffffff81043ff6>] ? native_irq_enable+0x6/0x10
[    0.000000]  [<ffffffff8162b8e1>] ? panic+0x18f/0x1d0
[    0.000000]  [<ffffffff810648d4>] do_exit+0x894/0x8b0
[    0.000000]  [<ffffffff8162b983>] ? printk+0x61/0x63
[    0.000000]  [<ffffffff81636c0d>] oops_end+0x9d/0xe0
[    0.000000]  [<ffffffff8162b271>] no_context+0x253/0x27e
[    0.000000]  [<ffffffff81d10b34>] ? __early_set_fixmap+0x99/0xa0
[    0.000000]  [<ffffffff8162b45b>] __bad_area_nosemaphore+0x1bf/0x1de
[    0.000000]  [<ffffffff8162b48d>] bad_area_nosemaphore+0x13/0x15
[    0.000000]  [<ffffffff816398ce>] __do_page_fault+0x39e/0x4e0
[    0.000000]  [<ffffffff812f1ca0>] ? sprintf+0x40/0x50
[    0.000000]  [<ffffffff8105f0ac>] ? print_time.part.5+0x6c/0x90
[    0.000000]  [<ffffffff8105f5d7>] ? print_prefix+0x77/0xc0
[    0.000000]  [<ffffffff81639a1e>] do_page_fault+0xe/0x10
[    0.000000]  [<ffffffff81636058>] page_fault+0x28/0x30
[    0.000000]  [<ffffffff812882f5>] ? security_capable+0x15/0x20
[    0.000000]  [<ffffffff8106a8ad>] ns_capable+0x2d/0x60
[    0.000000]  [<ffffffff8106a8f7>] capable+0x17/0x20
[    0.000000]  [<ffffffff81d2c58f>] acpi_os_get_root_pointer+0x1c/0x75
[    0.000000]  [<ffffffff81d2deeb>] acpi_initialize_tables+0x45/0x59
[    0.000000]  [<ffffffff81d2c0bf>] acpi_table_init+0x1b/0x99
[    0.000000]  [<ffffffff81d0902a>] acpi_boot_table_init+0x1e/0x87
[    0.000000]  [<ffffffff81d01795>] setup_arch+0xb39/0xc86
[    0.000000]  [<ffffffff81cfb947>] start_kernel+0xd4/0x3d4
[    0.000000]  [<ffffffff81cfb356>] x86_64_start_reservations+0x131/0x135
[    0.000000]  [<ffffffff81cfb45a>] x86_64_start_kernel+0x100/0x10f
[    0.000000] RIP 0x0

Comment 3 WANG Chao 2013-02-05 10:37:02 UTC
Dave, I know you've worked on acpi_rsdp. Can you take a look at this?

Comment 4 Dave Young 2013-02-06 02:00:47 UTC
It worked when I was working on the acpi_rsdp stuff, so some other kernel change might have caused this...

Will take a look.

Comment 5 Lingzhu Xiang 2013-02-06 10:00:51 UTC
Created attachment 693838
3.7.0-0.rc2.git4.2.fc19.x86_64.log

Bisecting with koji kernels.

Working:
3.5.0-0.rc0.git3.1.fc18

"Kernel panic - not syncing: Cannot find space for the kernel page tables" (log attached):
3.5.0-0.rc0.git5.1.fc18
...
3.7.0-0.rc2.git4.2.fc19

"BUG: unable to handle kernel NULL pointer dereference at 0000000000000030":
3.7.0-0.rc3.git0.1.fc19

Comment 6 Lingzhu Xiang 2013-02-17 08:56:41 UTC
(Reposting comment #1 with private info stripped)

Description of problem:

Kernel hangs hard after crash. panic=3 doesn't make it reboot. No vmcore is saved.

Version-Release number of selected component (if applicable):
kernel-3.7.4-204.fc18
kernel-debug-3.6.10-5.fc18
kexec-tools-2.0.3-64.fc18

How reproducible:
Always.
Reproduced on Dell XPS8500 and IBM x3550m3 (both UEFI with SecureBoot off)
Reproduced with 3.7.4-204 and 3.6.10-5.

Steps to Reproduce:
1. Add crashkernel=128M to the kernel command line; reboot
2. yum install kexec-tools; systemctl restart kdump.service; reboot
3. echo c >/proc/sysrq-trigger
  
Actual results:
[ 6192.841756] SysRq : Trigger a crash
[ 6192.845275] BUG: unable to handle kernel NULL pointer dereference at           (null)
...
[ 6193.085649] RIP  [<ffffffff8139f286>] sysrq_handle_crash+0x16/0x20
[ 6193.091833]  RSP <ffff880473f5be38>
[ 6193.095315] CR2: 0000000000000000
(Hung; did not boot the crash kernel or reboot)

Expected results:
It boots the crash kernel, saves the vmcore, and reboots.
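
Before triggering the crash in step 3, it can help to confirm that the memory reservation and the crash kernel are actually in place, so that a hang is not confused with kdump simply not being armed. The following is a minimal sketch, not part of kexec-tools. The function names has_crashkernel and crash_kernel_loaded are hypothetical, and the optional path arguments exist only so the checks can be pointed at test files instead of /proc/cmdline and /sys/kernel/kexec_crash_loaded.

```shell
#!/bin/sh
# has_crashkernel [CMDLINE_FILE]
# Succeeds if the kernel command line carries a crashkernel= parameter.
has_crashkernel() {
    grep -Eq '(^|[[:space:]])crashkernel=[^[:space:]]+' "${1:-/proc/cmdline}"
}

# crash_kernel_loaded [FLAG_FILE]
# Succeeds if the kernel reports that a kexec crash kernel is loaded
# (the sysfs flag reads "1").
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}

# Report the state of the running system.
if has_crashkernel && crash_kernel_loaded; then
    echo "kdump looks armed: crashkernel= is set and a crash kernel is loaded"
else
    echo "kdump is not armed: check crashkernel= and kdump.service"
fi
```

If both checks pass and the machine still hangs after "echo c >/proc/sysrq-trigger", the problem lies in the crash kernel boot itself, which is the case reported here.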

Comment 7 Dave Young 2013-02-17 09:14:26 UTC
Without secure boot patch kdump works fine. Looks like the security subsystem is not initialized yet at the early point.

Comment 8 Josh Boyer 2013-02-18 16:39:11 UTC
(In reply to comment #7)
> Without secure boot patch kdump works fine. Looks like the security
> subsystem is not initialized yet at the early point.

Hm.  Yes, seems so.  The ACPI setup calls are done as part of the setup_arch call, which is called well before security_init in init/main.c:start_kernel.

We'll have to figure out what to do here, but relying on capabilities at this point is probably not going to work.

Comment 9 Josh Boyer 2013-02-19 14:47:57 UTC
Please test this scratch build when it completes and let me know if the oops is resolved:

http://koji.fedoraproject.org/koji/taskinfo?taskID=5033039

Comment 10 Lingzhu Xiang 2013-02-20 03:32:03 UTC
(In reply to comment #9)
> Please test this scratch build when it completes and let me know if the oops
> is resolved:
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=5033039

No oops happened with this build.

Comment 11 Josh Boyer 2013-02-20 13:34:36 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > Please test this scratch build when it completes and let me know if the oops
> > is resolved:
> > 
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=5033039
> 
> No oops happened with this build.

Great, thank you for testing.  This is fixed in today's rawhide kernel, and I've committed the changes to the F18 branch.  It should be fixed with the next kernel update.

Comment 12 Fedora Update System 2013-02-25 19:33:29 UTC
kernel-3.7.9-205.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/kernel-3.7.9-205.fc18

Comment 13 Fedora Update System 2013-02-27 02:28:37 UTC
kernel-3.7.9-205.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 14 Lingzhu Xiang 2013-03-06 07:36:42 UTC
Patch was reverted during 3.8 rebase.

http://pkgs.fedoraproject.org/cgit/kernel.git/commit/?h=f18&id=d3a4ba3dbfb0c4b5db0d2669b15373f06d842cef

Does rawhide include this patch?

Comment 15 Josh Boyer 2013-03-06 13:47:13 UTC
(In reply to comment #14)
> Patch was reverted during 3.8 rebase.
> 
> http://pkgs.fedoraproject.org/cgit/kernel.git/commit/?h=f18&id=d3a4ba3dbfb0c4b5db0d2669b15373f06d842cef

That was my mistake.  I told Dave to grab the wrong version of the patchset.  I'll fix it.

> Does rawhide include this patch?

Yes.  It's in devel-pekey-secure-boot-20130227.patch

Comment 16 Josh Boyer 2013-03-06 14:01:21 UTC
Fixed in git (again).

Comment 17 Lingzhu Xiang 2013-03-07 03:55:29 UTC
Verified in 3.8.2-204.fc18.x86_64

Comment 18 Fedora Update System 2013-03-08 18:43:56 UTC
kernel-3.8.2-206.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/kernel-3.8.2-206.fc18

Comment 19 Fedora Update System 2013-03-11 01:22:58 UTC
kernel-3.8.2-206.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

