Bug 1318596 - unable to handle kernel NULL pointer dereference at (null) in _find_next_bit
Summary: unable to handle kernel NULL pointer dereference at (null) in _find_next_bit
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs
TreeView+ depends on / blocked
 
Reported: 2016-03-17 10:25 UTC by Richard W.M. Jones
Modified: 2016-03-18 14:00 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-18 13:59:50 UTC
Type: Bug


Attachments (Terms of Use)
reproducer.sh (905 bytes, text/plain)
2016-03-17 18:28 UTC, Richard W.M. Jones
no flags Details

Description Richard W.M. Jones 2016-03-17 10:25:18 UTC
Description of problem:

Rawhide kernel doesn't boot on qemu.
The error is:

[    0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[    0.000000] ... MAX_LOCKDEP_SUBCLASSES:  8
[    0.000000] ... MAX_LOCK_DEPTH:          48
[    0.000000] ... MAX_LOCKDEP_KEYS:        8191
[    0.000000] ... CLASSHASH_SIZE:          4096
[    0.000000] ... MAX_LOCKDEP_ENTRIES:     32768
[    0.000000] ... MAX_LOCKDEP_CHAINS:      65536
[    0.000000] ... CHAINHASH_SIZE:          32768
[    0.000000]  memory used by lock dependency info: 8671 kB
[    0.000000]  per task-struct memory footprint: 2688 bytes
[    0.000000] tsc: Unable to calibrate against PIT
[    0.000000] tsc: No reference (HPET/PMTIMER) available
[    0.000000] tsc: Marking TSC unstable due to could not calculate TSC khz
[    0.010000] BUG: unable to handle kernel NULL pointer dereference at           (null)
[    0.010000] IP: [<ffffffff814698b5>] _find_next_bit.part.0+0x15/0x70
[    0.010000] PGD 0 
[    0.010000] Oops: 0000 [#1] SMP 
[    0.010000] Modules linked in:
[    0.010000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0-0.rc0.git5.1.fc25.x86_64 #1
[    0.010000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-2.fc24 04/01/2014
[    0.010000] task: ffffffff81e0d580 ti: ffffffff81e00000 task.ti: ffffffff81e00000
[    0.010000] RIP: 0010:[<ffffffff814698b5>]  [<ffffffff814698b5>] _find_next_bit.part.0+0x15/0x70
[    0.010000] RSP: 0000:ffffffff81e03e40  EFLAGS: 00000246
[    0.010000] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    0.010000] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[    0.010000] RBP: ffffffff81e03e50 R08: ffffffffffffffff R09: 0000000000000000
[    0.010000] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    0.010000] R13: ffffffff82248960 R14: ffffffff822562e0 R15: 0000000000000000
[    0.010000] FS:  0000000000000000(0000) GS:ffff88001ee00000(0000) knlGS:0000000000000000
[    0.010000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.010000] CR2: 0000000000000000 CR3: 0000000001e06000 CR4: 00000000000006b0
[    0.010000] Stack:
[    0.010000]  ffffffff81e03e50 ffffffff81469928 ffffffff81e03e70 ffffffff81453d56
[    0.010000]  0000000000000000 ffff88001f3fa780 ffffffff81e03e80 ffffffff81040495
[    0.010000]  ffffffff81e03f40 ffffffff8100285a ffffffff810eefb3 ffffffff00000000
[    0.010000] Call Trace:
[    0.010000]  [<ffffffff81469928>] ? find_next_bit+0x18/0x20
[    0.010000]  [<ffffffff81453d56>] cpumask_any_but+0x26/0x50
[    0.010000]  [<ffffffff81040495>] calibrate_delay_is_known+0x45/0x80
[    0.010000]  [<ffffffff8100285a>] calibrate_delay+0xfa/0x690
[    0.010000]  [<ffffffff810eefb3>] ? sched_clock_init+0x13/0x90
[    0.010000]  [<ffffffff81111525>] ? trace_hardirqs_on_caller+0xf5/0x1b0
[    0.010000]  [<ffffffff81146743>] ? ktime_get+0xb3/0x150
[    0.010000]  [<ffffffff810eefb3>] ? sched_clock_init+0x13/0x90
[    0.010000]  [<ffffffff8219cf7a>] start_kernel+0x3db/0x4a3
[    0.010000]  [<ffffffff8219c120>] ? early_idt_handler_array+0x120/0x120
[    0.010000]  [<ffffffff8219c32c>] x86_64_start_reservations+0x2a/0x2c
[    0.010000]  [<ffffffff8219c47a>] x86_64_start_kernel+0x14c/0x16f
[    0.010000] Code: d0 48 0f 47 c2 5d c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 c8 48 89 d1 49 c7 c0 ff ff ff ff 48 c1 e9 06 49 89 c1 55 <4c> 33 0c cf 89 d1 48 83 e2 c0 49 d3 e0 48 89 e5 4c 89 c1 4c 21 
[    0.010000] RIP  [<ffffffff814698b5>] _find_next_bit.part.0+0x15/0x70
[    0.010000]  RSP <ffffffff81e03e40>
[    0.010000] CR2: 0000000000000000
[    0.010000] ---[ end trace 1f7208cba1f1b54f ]---

Version-Release number of selected component (if applicable):

kernel 4.6.0-0.rc0.git5.1.fc25.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Boot kernel in qemu.

Additional info:

Full boot log can be found near the end of this file:

https://kojipkgs.fedoraproject.org//work/tasks/3957/13373957/build.log

Comment 1 Josh Boyer 2016-03-17 12:41:13 UTC
Is this plain QEMU or KVM?  I tested this kernel on a KVM guest before I sent it to koji.

Comment 2 Richard W.M. Jones 2016-03-17 15:58:23 UTC
Plain QEMU (software emulation / TCG).  I wonder if it could be a
bug in TCG?

Comment 3 Richard W.M. Jones 2016-03-17 16:19:03 UTC
Here's the disassembly with the apparently failing instruction
marked with asterisks:

ffffffff814698a0 <_find_next_bit.part.0>:
ffffffff814698a0:       48 89 c8                mov    %rcx,%rax
ffffffff814698a3:       48 89 d1                mov    %rdx,%rcx
ffffffff814698a6:       49 c7 c0 ff ff ff ff    mov    $0xffffffffffffffff,%r8
ffffffff814698ad:       48 c1 e9 06             shr    $0x6,%rcx
ffffffff814698b1:       49 89 c1                mov    %rax,%r9
ffffffff814698b4:       55                      push   %rbp
ffffffff814698b5:       4c 33 0c cf             xor    (%rdi,%rcx,8),%r9 ******
ffffffff814698b9:       89 d1                   mov    %edx,%ecx
ffffffff814698bb:       48 83 e2 c0             and    $0xffffffffffffffc0,%rdx

Doesn't appear to be doing anything very strange that would
cause TCG to fail, so I'm guessing it's not a decoding failure
or a brand new instruction.

Comment 4 Richard W.M. Jones 2016-03-17 16:28:55 UTC
It looks as if the failing path is:

cpumask_any_but
-> calls for_each_cpu
-> calls cpumask_next
-> calls find_next_bit

This is a single CPU virtual machine.

I'm a bit lost after that, but the first parameter of find_next_bit
appears to be NULL for some reason.

Comment 5 Josh Boyer 2016-03-17 17:43:19 UTC
I'm guessing this is probably because of commit c25323c07345a843a56a294047b130dfd9250fad, where the topology_core_cpumask that was added to calibrate_delay_is_known is interacting badly with your emulated machine.

It would be helpful if you could distill the log down to a command we can use to run qemu in a similar setup.

Comment 6 Richard W.M. Jones 2016-03-17 18:28:12 UTC
Created attachment 1137468 [details]
reproducer.sh

A reproducer is attached.  It just requires the kernel + qemu.

The test is very timing sensitive.  I found that it only reproduced
about 1 in 10 times.  It seems more likely to reproduce if the host
machine is busy.  I did a bunch of git pulls and kernel compiles at the
same time, and that seems to make it reproduce more reliably.

Comment 7 Richard W.M. Jones 2016-03-17 18:29:19 UTC
I should note that it's expected that the kernel will panic
because there is no initramfs nor root filesystem.  If you hit
that panic, then you *didn't* reproduce the bug.  You only
reproduce the bug if the kernel crashes with the stack trace
shown in comment 0.

Comment 8 Richard W.M. Jones 2016-03-17 22:39:41 UTC
Finally worked out the right incantation to run the script
over and over again until you hit the failure:

while ./reproducer.sh >& /tmp/log ; ! grep -sq calibrate_delay_is_known /tmp/log; do echo -n .; done

(that's all on a single line)

Comment 9 Richard W.M. Jones 2016-03-18 11:28:02 UTC
Looks like we have a fix upstream:
https://lkml.org/lkml/2016/3/18/74

Comment 10 Josh Boyer 2016-03-18 13:59:50 UTC
Patch included in the rc8.git8.1.fc25 build.  Thanks for the report and testing.

Comment 11 Josh Boyer 2016-03-18 14:00:32 UTC
(In reply to Josh Boyer from comment #10)
> Patch included in the rc8.git8.1.fc25 build.  Thanks for the report and
> testing.

Er, rc0.git8.1.fc25 obviously.  Sigh.


Note You need to log in before you can comment on or make changes to this bug.