RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1377920 - Guest fails reboot and causes kernel-panic
Summary: Guest fails reboot and causes kernel-panic
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: x86_64
OS: Linux
urgent
urgent
Target Milestone: rc
: ---
Assignee: Dr. David Alan Gilbert
QA Contact: Virtualization Bugs
URL:
Whiteboard:
: 1357765 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-09-21 03:06 UTC by Pei Zhang
Modified: 2018-05-13 23:40 UTC (History)
16 users (show)

Fixed In Version: qemu-kvm-rhev-2.6.0-27.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-07 21:36:10 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1577680 1 None None None 2021-12-10 16:08:35 UTC
Red Hat Product Errata RHBA-2016:2673 0 normal SHIPPED_LIVE qemu-kvm-rhev bug fix and enhancement update 2016-11-08 01:06:13 UTC

Internal Links: 1577680

Description Pei Zhang 2016-09-21 03:06:15 UTC
Description of problem:
Reboot guest will cause guest kernel-panic.

Version-Release number of selected component (if applicable):
Host:
3.10.0-508.rt56.413.el7.x86_64
qemu-kvm-rhev-2.6.0-26.el7.x86_64

Guest:
3.10.0-508.rt56.413.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot guest
# /usr/libexec/qemu-kvm -name rhel7.3 \
-cpu IvyBridge -m 4G \
-smp 8,sockets=1,cores=8,threads=1 \
-drive file=/home/pezhang/rhel7.3.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0 \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \

2. Reboot guest, fail and kernel panic.
Rebooting.
[    0.000000] console [ttyS0] enabled
[    0.000000] allocated 67108864 bytes of page_cgroup
[    0.000000] please try 'cgroup_disable=memory' option if you don't want memory cgroups
[    0.000000] tsc: Detected 2299.996 MHz processor
[    0.208161] Calibrating delay loop (skipped) preset value.. 4599.99 BogoMIPS (lpj=2299996)
[    0.208162] pid_max: default: 32768 minimum: 301
[    0.208197] Security Framework initialized
[    0.208205] SELinux:  Initializing.
[    0.208224] SELinux:  Starting in permissive mode
[    0.208929] Dentry cache hash table entries: 524288 (order: 11, 8388608 bytes)
[    0.212585] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.213525] Mount-cache hash table entries: 4096
[    0.213802] Initializing cgroup subsys memory
[    0.213810] Initializing cgroup subsys devices
[    0.213811] Initializing cgroup subsys freezer
[    0.213812] Initializing cgroup subsys net_cls
[    0.213813] Initializing cgroup subsys blkio
[    0.213814] Initializing cgroup subsys perf_event
[    0.213815] Initializing cgroup subsys hugetlb
[    0.213816] Initializing cgroup subsys pids
[    0.213817] Initializing cgroup subsys net_prio
[    0.213922] CPU: Physical Processor ID: 0
[    0.213923] CPU: Processor Core ID: 0
[    0.214699] mce: CPU supports 10 MCE banks
[    0.214738] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.214738] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.214739] tlb_flushall_shift: 2
[    0.215006] Freeing SMP alternatives: 24k freed
[    0.217266] ACPI: Core revision 20130517
[    0.218067] ACPI: All ACPI Tables successfully acquired
[    0.218098] ftrace: allocating 25317 entries in 99 pages
[    0.245401] smpboot: Max logical packages: 1
[    0.245402] smpboot: APIC(0) Converting physical 0 to logical package 0
[    0.245594] Enabling x2apic
[    0.245599] Enabled x2apic
[    0.245841] Switched APIC routing to physical x2apic.
[    0.246886] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.246889] smpboot: CPU0: Intel Xeon E3-12xx v2 (Ivy Bridge) (fam: 06, model: 3a, stepping: 09)
[    0.246898] TSC deadline timer enabled
[    0.246909] Performance Events: unsupported p6 CPU model 58 no PMU driver, software events only.
[    0.247249] KVM setup paravirtual spinlock
[    0.258408] NMI watchdog: Disabling watchdog on nohz_full cores by default
[   10.541911] smpboot: Booting Node   0, Processors  #1
[   10.541911] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1
[   20.824343]  #2
[   20.824344] smpboot: do_boot_cpu failed(-1) to wakeup CPU#2
[   31.106817]  #3
[   31.106818] smpboot: do_boot_cpu failed(-1) to wakeup CPU#3
[   41.389860]  #4
[   41.389860] smpboot: do_boot_cpu failed(-1) to wakeup CPU#4
[   51.672115]  #5
[   51.672115] smpboot: do_boot_cpu failed(-1) to wakeup CPU#5
[   61.954779]  #6
[   61.954779] smpboot: do_boot_cpu failed(-1) to wakeup CPU#6
[   61.958918]  #7 OK
[   72.237682] smpboot: do_boot_cpu failed(-1) to wakeup CPU#7
[   72.237704] Brought up 1 CPUs
[   72.237705] smpboot: Total of 1 processors activated (4599.99 BogoMIPS)
[   72.238064] ------------[ cut here ]------------
[   72.238065] kernel BUG at kernel/sched/core.c:8400!
[   72.238067] invalid opcode: 0000 [#1] PREEMPT SMP 
[   72.238067] Modules linked in:
[   72.238069] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-508.rt56.413.el7.x86_64 #1
[   72.238070] Hardware name: Red Hat KVM, BIOS 1.9.1-5.el7 04/01/2014
[   72.238071] task: ffff880136700000 ti: ffff880136708000 task.ti: ffff880136708000
[   72.238074] RIP: 0010:[<ffffffff81b58797>]  [<ffffffff81b58797>] sched_init_smp+0x4ac/0x4e3
[   72.238075] RSP: 0000:ffff88013670bbc0  EFLAGS: 00010282
[   72.238075] RAX: 00000000ffffffea RBX: 0000000000000000 RCX: 0000000000000008
[   72.238076] RDX: 0000000000000000 RSI: 00000000013c013a RDI: 0000000000000282
[   72.238076] RBP: ffff88013670be80 R08: 0000000000000000 R09: 0000000000000000
[   72.238077] R10: ffffffff81270e73 R11: ffff88013b00da00 R12: 000000000000000a
[   72.238077] R13: 000000000000000a R14: 000000000000000a R15: 0000000000000001
[   72.238078] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[   72.238079] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   72.238080] CR2: 00000000ffffffff CR3: 00000000019b9000 CR4: 00000000001406f0
[   72.238083] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   72.238083] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   72.238084] Stack:
[   72.238085]  0000000000000000 0000000000000000 00000000000000e0 ffff880136708000
[   72.238086]  ffff88013670bfd8 0000000000000282 ffff88013670bd88 ffffffff8110e780
[   72.238087]  ffff88013670bc20 ffffffff816864c8 0000000000000282 ffff880136700000
[   72.238088] Call Trace:
[   72.238092]  [<ffffffff8110e780>] ? cpu_stop_park+0x40/0x40
[   72.238095]  [<ffffffff816864c8>] ? preempt_schedule.part.88+0x42/0x5a
[   72.238097]  [<ffffffff81686509>] ? preempt_schedule+0x29/0x40
[   72.238099]  [<ffffffff81687c65>] ? _raw_spin_unlock_irqrestore+0x75/0x80
[   72.238101]  [<ffffffff810ef686>] ? rt_mutex_adjust_prio+0x36/0x40
[   72.238102]  [<ffffffff816869cf>] ? rt_spin_lock_slowunlock+0x3f/0x60
[   72.238104]  [<ffffffff816876c8>] ? __rt_spin_unlock+0x28/0x30
[   72.238106]  [<ffffffff810b0258>] ? lg_global_unlock+0x48/0x60
[   72.238108]  [<ffffffff8110e72f>] ? queue_stop_cpus_work.isra.4+0x8f/0xa0
[   72.238109]  [<ffffffff8110e780>] ? cpu_stop_park+0x40/0x40
[   72.238111]  [<ffffffff8110e780>] ? cpu_stop_park+0x40/0x40
[   72.238113]  [<ffffffff8110e910>] ? __stop_cpus+0x70/0xa0
[   72.238114]  [<ffffffff8110e780>] ? cpu_stop_park+0x40/0x40
[   72.238116]  [<ffffffff8168794e>] ? _mutex_unlock+0xe/0x10
[   72.238118]  [<ffffffff8110ec9f>] ? stop_cpus+0x3f/0x50
[   72.238120]  [<ffffffff81039490>] ? mtrr_restore+0xb0/0xb0
[   72.238122]  [<ffffffff8110ee41>] ? __stop_machine+0x101/0x150
[   72.238124]  [<ffffffff81039490>] ? mtrr_restore+0xb0/0xb0
[   72.238125]  [<ffffffff8168794e>] ? _mutex_unlock+0xe/0x10
[   72.238127]  [<ffffffff8107e236>] ? put_online_cpus+0x56/0x80
[   72.238129]  [<ffffffff8110eec1>] ? stop_machine+0x31/0x40
[   72.238131]  [<ffffffff81039eb3>] ? mtrr_aps_init+0x83/0xa0
[   72.238133]  [<ffffffff81b45292>] ? native_smp_cpus_done+0x10a/0x111
[   72.238135]  [<ffffffff81b35197>] kernel_init_freeable+0x13c/0x262
[   72.238140]  [<ffffffff81671be0>] ? rest_init+0x90/0x90
[   72.238142]  [<ffffffff81671bee>] kernel_init+0xe/0xf0
[   72.238144]  [<ffffffff816900d8>] ret_from_fork+0x58/0x90
[   72.238145]  [<ffffffff81671be0>] ? rest_init+0x90/0x90
[   72.238157] Code: 81 e8 1e a5 b1 ff 48 c7 c7 e0 64 9f 81 e8 12 a5 b1 ff 48 8d b5 50 fd ff ff 65 48 8b 3c 25 c0 d6 00 00 e8 ed 4c 56 ff 85 c0 79 02 <0f> 0b e8 72 b7 56 ff e8 3d 42 57 ff e8 91 04 00 00 48 8b 45 d0 
[   72.238159] RIP  [<ffffffff81b58797>] sched_init_smp+0x4ac/0x4e3
[   72.238159]  RSP <ffff88013670bbc0>
[   72.379033] ---[ end trace 0000000000000001 ]---
[   72.379035] Kernel panic - not syncing: Fatal exception


Actual results:
Guest fails reboot.

Expected results:
Guest should reboot well.

Additional info:
1. This is a regression bug
qemu-kvm-rhev-2.6.0-18.el7.x86_64   work
qemu-kvm-rhev-2.6.0-22.el7.x86_64   work
qemu-kvm-rhev-2.6.0-24.el7.x86_64   work
qemu-kvm-rhev-2.6.0-25.el7.x86_64   work
qemu-kvm-rhev-2.6.0-25.el7.x86_64   doesn't work

Comment 1 Pei Zhang 2016-09-21 03:30:17 UTC
Additional info(continued):
1. (fix typo in above)
qemu-kvm-rhev-2.6.0-26.el7.x86_64   doesn't work

2. Also test with rhel kernel(non-rt), hit same issue
Host:
3.10.0-509.el7.x86_64
qemu-kvm-rhev-2.6.0-26.el7.x86_64

Guest:
3.10.0-509.el7.x86_64

console info
[   10.365787] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1

[   20.637017] smpboot: do_boot_cpu failed(-1) to wakeup CPU#2
[   30.907909] smpboot: do_boot_cpu failed(-1) to wakeup CPU#3
[   42.063551] smpboot: do_boot_cpu failed(-1) to wakeup CPU#1
[   52.318500] smpboot: do_boot_cpu failed(-1) to wakeup CPU#2
[   62.576511] smpboot: do_boot_cpu failed(-1) to wakeup CPU#3

Comment 2 Pei Zhang 2016-09-21 07:20:10 UTC
Additional info(continued)
3. rhel6.8GA hit this regression issue, windows doesn't.

rhel6.8 GA      doesn't work
windows 7       work

Comment 6 Dr. David Alan Gilbert 2016-09-21 09:34:55 UTC
OK, I can reproduce this bug - only happens on my rhel/Xeon hosts, not on my fedora/laptop even with the same qemu.

Comment 7 Dr. David Alan Gilbert 2016-09-21 12:38:54 UTC
If I add back the kvm_put_apic in kvm_arch_put_registers it seems to work again; so I'm guessing that means that something somewhere is changing the APIC state after the point we reset and before that point and we're losing the change.

Comment 8 Dr. David Alan Gilbert 2016-09-21 17:51:01 UTC
Note if I turn on
   apic=verbose debug ignore_loglevel

in the guest, I see it's stalled at 'calibrating APIC timer...'

If I make the code in kvm_arch_put_registers:

  kvm_put_apic
like it was before it apparently works (not checked the migrate)

but if I make it:
  kvm_get_apic
  kvm_put_apic

it fails just like the put isn't there; so again it sounds like some inconsistent state that the put is restoring;  so I do:

  x86_cpu_dump_local_apic_state
  kvm_get_apic
  x86_cpu_dump_local_apic_state
  kvm_put_apic

and expect to see the state that's different that's being lost without the put or with the get/put pair - nothing, no differences in the dump.

Hmm.
I add to the dump the dump of some other fields it hasn't got and the only ones that I've found different are initial_count_load_time and count_shift; however, if I preserve those around the get, it still fails - suggesting it's something else. Hmm.

Comment 9 Dr. David Alan Gilbert 2016-09-22 12:13:18 UTC
Debugging with Paolo;  this works with a host -490 kernel, breaks with a -491

Comment 10 Paolo Bonzini 2016-09-22 12:46:28 UTC
... but it's still a QEMU bug.

Comment 16 Miroslav Rezanina 2016-09-23 13:58:39 UTC
Fix included in qemu-kvm-rhev-2.6.0-27.el7

Comment 18 Pei Zhang 2016-09-24 02:08:36 UTC
==Verification==
Versions:
Host:
3.10.0-510.el7.x86_64
qemu-kvm-rhev-2.6.0-27.el7.x86_64

Guest:
Same as Comment14.

Steps:
Same as Comment14.

Results show:
Guest          Basic(boot/reboot/shutdown)    Migration
RHEL7.3            PASS                          PASS
RHEL6.8GA          PASS                          PASS
RHEL7.2GA          PASS                          PASS
Windows7           PASS                          PASS
Windows8           PASS                          PASS
Windows8.1         PASS                          PASS
Windows10          PASS                           \


So this bug has been fixed well, thank you.

Comment 19 Pei Zhang 2016-09-24 02:26:50 UTC
==Verification(update)==
Results show:
Guest          Basic(boot/reboot/shutdown)    Migration
...
Windows10          PASS                           PASS


Note: In this testing, win10 migration works. 
I will update this testing in bug[1]. 
[1]Bug 1357765 - windows10 guest hangs after migration

Comment 20 Pei Zhang 2016-09-24 02:27:52 UTC
Set this bug as 'VERIFIED' as Comment 18 and Comment 19.

Comment 21 Amit Shah 2016-09-27 11:25:31 UTC
*** Bug 1357765 has been marked as a duplicate of this bug. ***

Comment 23 errata-xmlrpc 2016-11-07 21:36:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html


Note You need to log in before you can comment on or make changes to this bug.