Description of problem:
Rawhide kernel doesn't boot on qemu. The error is:

[ 0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[ 0.000000] ... MAX_LOCKDEP_SUBCLASSES: 8
[ 0.000000] ... MAX_LOCK_DEPTH: 48
[ 0.000000] ... MAX_LOCKDEP_KEYS: 8191
[ 0.000000] ... CLASSHASH_SIZE: 4096
[ 0.000000] ... MAX_LOCKDEP_ENTRIES: 32768
[ 0.000000] ... MAX_LOCKDEP_CHAINS: 65536
[ 0.000000] ... CHAINHASH_SIZE: 32768
[ 0.000000] memory used by lock dependency info: 8671 kB
[ 0.000000] per task-struct memory footprint: 2688 bytes
[ 0.000000] tsc: Unable to calibrate against PIT
[ 0.000000] tsc: No reference (HPET/PMTIMER) available
[ 0.000000] tsc: Marking TSC unstable due to could not calculate TSC khz
[ 0.010000] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 0.010000] IP: [<ffffffff814698b5>] _find_next_bit.part.0+0x15/0x70
[ 0.010000] PGD 0
[ 0.010000] Oops: 0000 [#1] SMP
[ 0.010000] Modules linked in:
[ 0.010000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0-0.rc0.git5.1.fc25.x86_64 #1
[ 0.010000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-2.fc24 04/01/2014
[ 0.010000] task: ffffffff81e0d580 ti: ffffffff81e00000 task.ti: ffffffff81e00000
[ 0.010000] RIP: 0010:[<ffffffff814698b5>] [<ffffffff814698b5>] _find_next_bit.part.0+0x15/0x70
[ 0.010000] RSP: 0000:ffffffff81e03e40 EFLAGS: 00000246
[ 0.010000] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 0.010000] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[ 0.010000] RBP: ffffffff81e03e50 R08: ffffffffffffffff R09: 0000000000000000
[ 0.010000] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 0.010000] R13: ffffffff82248960 R14: ffffffff822562e0 R15: 0000000000000000
[ 0.010000] FS: 0000000000000000(0000) GS:ffff88001ee00000(0000) knlGS:0000000000000000
[ 0.010000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.010000] CR2: 0000000000000000 CR3: 0000000001e06000 CR4: 00000000000006b0
[ 0.010000] Stack:
[ 0.010000] ffffffff81e03e50 ffffffff81469928 ffffffff81e03e70 ffffffff81453d56
[ 0.010000] 0000000000000000 ffff88001f3fa780 ffffffff81e03e80 ffffffff81040495
[ 0.010000] ffffffff81e03f40 ffffffff8100285a ffffffff810eefb3 ffffffff00000000
[ 0.010000] Call Trace:
[ 0.010000] [<ffffffff81469928>] ? find_next_bit+0x18/0x20
[ 0.010000] [<ffffffff81453d56>] cpumask_any_but+0x26/0x50
[ 0.010000] [<ffffffff81040495>] calibrate_delay_is_known+0x45/0x80
[ 0.010000] [<ffffffff8100285a>] calibrate_delay+0xfa/0x690
[ 0.010000] [<ffffffff810eefb3>] ? sched_clock_init+0x13/0x90
[ 0.010000] [<ffffffff81111525>] ? trace_hardirqs_on_caller+0xf5/0x1b0
[ 0.010000] [<ffffffff81146743>] ? ktime_get+0xb3/0x150
[ 0.010000] [<ffffffff810eefb3>] ? sched_clock_init+0x13/0x90
[ 0.010000] [<ffffffff8219cf7a>] start_kernel+0x3db/0x4a3
[ 0.010000] [<ffffffff8219c120>] ? early_idt_handler_array+0x120/0x120
[ 0.010000] [<ffffffff8219c32c>] x86_64_start_reservations+0x2a/0x2c
[ 0.010000] [<ffffffff8219c47a>] x86_64_start_kernel+0x14c/0x16f
[ 0.010000] Code: d0 48 0f 47 c2 5d c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 c8 48 89 d1 49 c7 c0 ff ff ff ff 48 c1 e9 06 49 89 c1 55 <4c> 33 0c cf 89 d1 48 83 e2 c0 49 d3 e0 48 89 e5 4c 89 c1 4c 21
[ 0.010000] RIP [<ffffffff814698b5>] _find_next_bit.part.0+0x15/0x70
[ 0.010000] RSP <ffffffff81e03e40>
[ 0.010000] CR2: 0000000000000000
[ 0.010000] ---[ end trace 1f7208cba1f1b54f ]---

Version-Release number of selected component (if applicable):
kernel 4.6.0-0.rc0.git5.1.fc25.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot kernel in qemu.

Additional info:
Full boot log can be found near the end of this file:
https://kojipkgs.fedoraproject.org//work/tasks/3957/13373957/build.log
Is this plain QEMU or KVM? I tested this kernel on a KVM guest before I sent it to koji.
Plain QEMU (software emulation / TCG). I wonder if it could be a bug in TCG?
Here's the disassembly with the apparently failing instruction marked with asterisks:

ffffffff814698a0 <_find_next_bit.part.0>:
ffffffff814698a0: 48 89 c8              mov %rcx,%rax
ffffffff814698a3: 48 89 d1              mov %rdx,%rcx
ffffffff814698a6: 49 c7 c0 ff ff ff ff  mov $0xffffffffffffffff,%r8
ffffffff814698ad: 48 c1 e9 06           shr $0x6,%rcx
ffffffff814698b1: 49 89 c1              mov %rax,%r9
ffffffff814698b4: 55                    push %rbp
ffffffff814698b5: 4c 33 0c cf           xor (%rdi,%rcx,8),%r9    ******
ffffffff814698b9: 89 d1                 mov %edx,%ecx
ffffffff814698bb: 48 83 e2 c0           and $0xffffffffffffffc0,%rdx

Doesn't appear to be doing anything very strange that would cause TCG to fail, so I'm guessing it's not a decoding failure or a brand new instruction.
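As a rough illustration of why that instruction faults, here is a small user-space C sketch of the address arithmetic it performs. The helper names are hypothetical and are not kernel functions. The register roles (%rdi as the bitmap base, %rdx as the starting bit offset) are my reading of the disassembly and of the register dump in comment 0, so treat them as assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index of the 64-bit word that holds a given bit,
 * mirroring "shr $0x6,%rcx" in the disassembly. */
uint64_t word_index(uint64_t bit_offset)
{
    return bit_offset >> 6;
}

/* Hypothetical helper: effective address of a base + index*scale memory
 * operand such as "(%rdi,%rcx,8)". */
uint64_t effective_address(uint64_t base, uint64_t index, unsigned scale)
{
    /* x86 only encodes scale factors of 1, 2, 4 and 8. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale;
}
```

With RDI = 0 and RDX = 0 as shown in the oops, effective_address(0, word_index(0), 8) evaluates to 0, which agrees with CR2 = 0000000000000000. So the instruction itself looks ordinary; it is simply reading through a NULL base pointer.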
It looks as if the failing path is:

cpumask_any_but
 -> calls for_each_cpu
 -> calls cpumask_next
 -> calls find_next_bit

This is a single CPU virtual machine. I'm a bit lost after that, but the first parameter of find_next_bit appears to be NULL for some reason.
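To make that path concrete, here is a simplified user-space model in C. It is only a sketch, assuming cpumask_any_but walks the mask through find_next_bit as described above. The model_* names and MODEL_FAULT are made up for illustration, and the NULL check exists only in this model to mark the point where the real code would dereference the pointer.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
/* Hypothetical sentinel: marks where the real kernel would oops. */
#define MODEL_FAULT ULONG_MAX

/* Model of find_next_bit(): index of the first set bit at or after
 * `start`, or `nbits` if there is none. */
unsigned long model_find_next_bit(const unsigned long *bitmap,
                                  unsigned long nbits, unsigned long start)
{
    /* Model precondition: callers here never start past the end. */
    assert(start <= nbits);
    if (bitmap == NULL)
        return MODEL_FAULT; /* the kernel has no such check and faults */
    for (unsigned long i = start; i < nbits; i++) {
        if (bitmap[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))
            return i;
    }
    return nbits;
}

/* Model of cpumask_any_but(): first CPU set in `mask` other than `cpu`,
 * or a value >= nr_cpus if there is no such CPU. */
unsigned long model_cpumask_any_but(const unsigned long *mask,
                                    unsigned long nr_cpus, unsigned long cpu)
{
    unsigned long i = model_find_next_bit(mask, nr_cpus, 0);

    while (i < nr_cpus && i == cpu)
        i = model_find_next_bit(mask, nr_cpus, i + 1);
    return i;
}
```

On a single-CPU guest with a valid mask, the model returns nr_cpus (there is no other CPU to pick), which is harmless. With a NULL mask it returns MODEL_FAULT on the very first lookup, which corresponds to the crash in _find_next_bit in comment 0.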
I'm guessing this is probably because of commit c25323c07345a843a56a294047b130dfd9250fad, where the topology_core_cpumask that was added to calibrate_delay_is_known is interacting badly with your emulated machine. It would be helpful if you could distill the log down to a command we can use to run qemu in a similar setup.
Created attachment 1137468
reproducer.sh

A reproducer is attached. It just requires the kernel + qemu. The test is very timing sensitive. I found that it only reproduced about 1 in 10 times. It seems more likely to reproduce if the host machine is busy. I did a bunch of git pulls and kernel compiles at the same time, and that seemed to make it reproduce more reliably.
I should note that it's expected that the kernel will panic because there is no initramfs nor root filesystem. If you hit that panic, then you *didn't* reproduce the bug. You only reproduce the bug if the kernel crashes with the stack trace shown in comment 0.
Finally worked out the right incantation to run the script over and over again until you hit the failure:

while ./reproducer.sh >& /tmp/log ; ! grep -sq calibrate_delay_is_known /tmp/log; do echo -n .; done

(that's all on a single line)
Looks like we have a fix upstream: https://lkml.org/lkml/2016/3/18/74
Patch included in the rc8.git8.1.fc25 build. Thanks for the report and testing.
(In reply to Josh Boyer from comment #10)
> Patch included in the rc8.git8.1.fc25 build. Thanks for the report and
> testing.

Er, rc0.git8.1.fc25 obviously. Sigh.