Thanks for taking the time to enter a bug report with us. We're not able to guarantee the timeliness or suitability of a resolution for issues entered here because this is not a mechanism for requesting support.
If this issue is critical or in any way time sensitive, please raise a ticket through your regular Red Hat support channels to make certain it receives the proper attention and prioritization that will result in a timely resolution.
For information on how to contact the Red Hat production support team, please visit: https://www.redhat.com/support/process/production/#howto
Hi fanlulin,
(1) Does this bug happen only with a Windows XP guest? Are other Windows guests affected?
(2) The qemu-kvm version you used is not a RHEL6 build. Could you try the RHEL6 qemu-kvm version?
Thanks,
Qunfang
Description of problem:
On an AMD server, I hit a kernel crash after 2-3 days of running qemu-kvm with a Windows XP guest. It has now crashed 5 times in the last 2 weeks.

PID: 7405  TASK: ffff8826ad06caa0  CPU: 35  COMMAND: "qemu-system-x86"
 #0 [ffff88103c6c7b40] machine_kexec at ffffffff81038f3b
 #1 [ffff88103c6c7ba0] crash_kexec at ffffffff810c5d82
 #2 [ffff88103c6c7c70] panic at ffffffff8152751a
 #3 [ffff88103c6c7cf0] watchdog_overflow_callback at ffffffff810e696d
 #4 [ffff88103c6c7d10] __perf_event_overflow at ffffffff8111c847
 #5 [ffff88103c6c7d90] perf_event_overflow at ffffffff8111ce14
 #6 [ffff88103c6c7da0] x86_pmu_handle_irq at ffffffff8101e7e8
 #7 [ffff88103c6c7e90] perf_event_nmi_handler at ffffffff8152bd69
 #8 [ffff88103c6c7ea0] notifier_call_chain at ffffffff8152d825
 #9 [ffff88103c6c7ee0] atomic_notifier_call_chain at ffffffff8152d88a
#10 [ffff88103c6c7ef0] notify_die at ffffffff810a153e
#11 [ffff88103c6c7f20] do_nmi at ffffffff8152b4eb
#12 [ffff88103c6c7f50] nmi at ffffffff8152adb0
    [exception RIP: flush_tlb_others_ipi+282]
    RIP: ffffffff8104fa5a  RSP: ffff8824ff603778  RFLAGS: 00000046
    RAX: 0000000000000000  RBX: 00000000000006c0  RCX: 0000000000000039
    RDX: 0000000000000000  RSI: 0000000000000040  RDI: ffffffff81e27058
    RBP: ffff8824ff6037b8   R8: 0000000000000001   R9: 0000000000000040
    R10: 0000000000013560  R11: 0000000000000007  R12: ffffffff81e27058
    R13: 0000000000000003  R14: ffffffff81e27050  R15: ffff8824ff7610c8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#13 [ffff8824ff603778] flush_tlb_others_ipi at ffffffff8104fa5a
#14 [ffff8824ff6037c0] native_flush_tlb_others at ffffffff8104fae6
#15 [ffff8824ff6037f0] flush_tlb_page at ffffffff8104fc0e
#16 [ffff8824ff603820] do_wp_page at ffffffff811499f7
#17 [ffff8824ff6038c0] handle_pte_fault at ffffffff8114a81d
#18 [ffff8824ff6039a0] handle_mm_fault at ffffffff8114b27a
#19 [ffff8824ff603a10] __do_page_fault at ffffffff8104a8d8
#20 [ffff8824ff603b30] do_page_fault at ffffffff8152d76e
#21 [ffff8824ff603b60] page_fault at ffffffff8152ab25
    [exception RIP: copy_user_generic_string+50]
    RIP: ffffffff8128d432  RSP: ffff8824ff603c10  RFLAGS: 00010097
    RAX: ffff8824ff602000  RBX: ffff883e26abc758  RCX: 0000000000000004
    RDX: 0000000000000004  RSI: ffff8824ff603c50  RDI: 00007f88cbecb300
    RBP: ffff8824ff603c38   R8: 00000000ffffffff   R9: ffff8824ff603ac4
    R10: 0000000000000000  R11: 0000000000000007  R12: ffff8834e1fcc000
    R13: ffff8824ff603c50  R14: 0000000000000004  R15: ffff8826ad06caa0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#22 [ffff8824ff603c10] kvm_write_guest_cached at ffffffffa02f96ab [kvm]
#23 [ffff8824ff603c40] kvm_lapic_sync_to_vapic at ffffffffa0325a8f [kvm]
#24 [ffff8824ff603c80] kvm_arch_vcpu_ioctl_run at ffffffffa0312715 [kvm]
#25 [ffff8824ff603dc0] kvm_vcpu_ioctl at ffffffffa02f9b04 [kvm]
#26 [ffff8824ff603e60] vfs_ioctl at ffffffff8119db32
#27 [ffff8824ff603ea0] do_vfs_ioctl at ffffffff8119dffa
#28 [ffff8824ff603f30] sys_ioctl at ffffffff8119e251
#29 [ffff8824ff603f80] system_call_fastpath at ffffffff8100b072
    RIP: 00007f89451bf7b7  RSP: 00007f893bffea68  RFLAGS: 00010206
    RAX: 0000000000000010  RBX: ffffffff8100b072  RCX: 0000000000000009
    RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 000000000000001c
    RBP: 00007f894b940890   R8: 0000000000000000   R9: 0000000000000007
    R10: fffffffffffffdd7  R11: 0000000000000246  R12: 0000000000000001
    R13: 00007f894b940980  R14: 00007f8948fae940  R15: 00007f8948388000
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

PID: 7406  TASK: ffff8826ad06d500  CPU: 56  COMMAND: "qemu-system-x86"
 #0 [ffff88383c407e90] crash_nmi_callback at ffffffff8102fee6
 #1 [ffff88383c407ea0] notifier_call_chain at ffffffff8152d825
 #2 [ffff88383c407ee0] atomic_notifier_call_chain at ffffffff8152d88a
 #3 [ffff88383c407ef0] notify_die at ffffffff810a153e
 #4 [ffff88383c407f20] do_nmi at ffffffff8152b4eb
 #5 [ffff88383c407f50] nmi at ffffffff8152adb0
    [exception RIP: _spin_lock+30]
    RIP: ffffffff8152a61e  RSP: ffff8824c190d818  RFLAGS: 00000097
    RAX: 0000000000000589  RBX: ffff8825d04ec700  RCX: 0000000000000001
    RDX: 0000000000000588  RSI: ffff88401c4cb7b0  RDI: ffffea00d363eca8
    RBP: ffff8824c190d818   R8: 00003ffffffff000   R9: 0000000000001000
    R10: 0000000000013560  R11: 0000000000000007  R12: ffffea00cd4082e8
    R13: ffffea00d2a66760  R14: 8000003c2f8b4065  R15: ffff883c65b15658
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff8824c190d818] _spin_lock at ffffffff8152a61e
 #7 [ffff8824c190d820] do_wp_page at ffffffff81149926
 #8 [ffff8824c190d8c0] handle_pte_fault at ffffffff8114a81d
 #9 [ffff8824c190d9a0] handle_mm_fault at ffffffff8114b27a
#10 [ffff8824c190da10] __do_page_fault at ffffffff8104a8d8
#11 [ffff8824c190db30] do_page_fault at ffffffff8152d76e
#12 [ffff8824c190db60] page_fault at ffffffff8152ab25
    [exception RIP: copy_user_generic_string+50]
    RIP: ffffffff8128d432  RSP: ffff8824c190dc10  RFLAGS: 00010097
    RAX: ffff8824c190c000  RBX: ffff8835b799a698  RCX: 0000000000000004
    RDX: 0000000000000004  RSI: ffff8824c190dc50  RDI: 00007f88cbecb380
    RBP: ffff8824c190dc38   R8: ffff8824c190db18   R9: ffff8824c190dac4
    R10: 0000000000000000  R11: 0000000000000007  R12: ffff8834e1fcc000
    R13: ffff8824c190dc50  R14: 0000000000000004  R15: ffff8826ad06d500
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#13 [ffff8824c190dc10] kvm_write_guest_cached at ffffffffa02f96ab [kvm]
#14 [ffff8824c190dc40] kvm_lapic_sync_to_vapic at ffffffffa0325a8f [kvm]
#15 [ffff8824c190dc80] kvm_arch_vcpu_ioctl_run at ffffffffa0312715 [kvm]
#16 [ffff8824c190ddc0] kvm_vcpu_ioctl at ffffffffa02f9b04 [kvm]
#17 [ffff8824c190de60] vfs_ioctl at ffffffff8119db32
#18 [ffff8824c190dea0] do_vfs_ioctl at ffffffff8119dffa
#19 [ffff8824c190df30] sys_ioctl at ffffffff8119e251
#20 [ffff8824c190df80] system_call_fastpath at ffffffff8100b072
    RIP: 00007f89451bf7b7  RSP: 00007f893b5fda68  RFLAGS: 00010206
    RAX: 0000000000000010  RBX: ffffffff8100b072  RCX: 0000000000000009
    RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 000000000000001d
    RBP: 00007f894b96c1d0   R8: 0000000000000000   R9: 0000000000000007
    R10: fffffffffffffdd7  R11: 0000000000000246  R12: 0000000000000001
    R13: 00007f894b96c2c0  R14: 00007f8948fae940  R15: 00007f8948380000
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

processor       : 63
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 1
model name      : AMD Opteron(TM) Processor 6272
stepping        : 2
cpu MHz         : 2100.089
cache size      : 2048 KB
physical id     : 3
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 111
initial apicid  : 111
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core cpb npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bogomips        : 4199.79
TLB size        : 1536 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual

Version-Release number of selected component (if applicable):
kernel: 2.6.32-431.5.1.el6.x86_64
qemu-kvm: 1.4.2

How reproducible:
Sometimes.

Steps to Reproduce:
1.
2.
3.

Actual results:
Kernel crash.

Expected results:
No crash.

Additional info:
After spending several days reading the kernel/KVM code and the kdump, I think the problem is in KVM. vcpu_enter_guest() disables IRQs, and a page fault from copy_to_user on that CPU then calls flush_tlb_others_ipi() to send an IPI to the other CPUs. At the same time, another CPU is in the same page-fault path with IRQs disabled, so it never handles the IPI, and that causes a deadlock.

CPU(A)                                    CPU(B)
vcpu_enter_guest                          vcpu_enter_guest
disable_irq                               disable_irq
kvm_lapic_sync_to_vapic                   kvm_lapic_sync_to_vapic
kvm_write_guest_cached                    kvm_write_guest_cached
page_fault                                page_fault
do_wp_page (holds a lock)                 do_wp_page (tries to get the lock)
flush_tlb_others_ipi                      waiting...
waiting for CPU B to handle the IPI...

I also found this commit:
http://git.kernel.org/cgit/virt/kvm/kvm.git/commit/?id=b463a6f744a263fccd7da14db1afdc880371a280
It seems to resolve this issue. I am not sure the analysis is right. Any suggestions?
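
The CPU(A)/CPU(B) sequence above can be checked with a small wait-for-graph model. This is only a sketch of the reasoning in this report, not kernel code: the Cpu class, the lock name "the_lock", and the find_deadlock helper are all hypothetical names made up for illustration. The one modelling assumption is that a CPU with IRQs disabled cannot acknowledge a TLB-flush IPI until it makes progress, while a CPU with IRQs enabled can acknowledge it even while spinning on a lock.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Cpu:
    """Hypothetical model of one CPU's state at the time of the hang."""
    name: str
    irqs_enabled: bool
    holds: Set[str] = field(default_factory=set)   # locks currently held
    wants_lock: Optional[str] = None                # lock this CPU spins on
    waits_ipi_ack_from: List[str] = field(default_factory=list)


def blocked_on(cpu: Cpu, cpus: Dict[str, Cpu]) -> Set[str]:
    """Names of the CPUs that must make progress before `cpu` can."""
    deps = set()
    # Spinning on a lock: blocked on whoever holds it.
    if cpu.wants_lock is not None:
        for other in cpus.values():
            if other is not cpu and cpu.wants_lock in other.holds:
                deps.add(other.name)
    # Waiting for an IPI acknowledgement: blocked only if the target
    # has IRQs disabled (assumption: otherwise it handles the IPI at once).
    for target in cpu.waits_ipi_ack_from:
        if not cpus[target].irqs_enabled:
            deps.add(target)
    return deps


def find_deadlock(cpus: Dict[str, Cpu]) -> Optional[List[str]]:
    """Return one cycle in the wait-for graph, or None if there is none."""
    graph = {name: blocked_on(cpu, cpus) for name, cpu in cpus.items()}

    def visit(node: str, path: List[str]) -> Optional[List[str]]:
        if node in path:
            return path[path.index(node):]
        for nxt in sorted(graph[node]):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        return None

    for name in sorted(graph):
        cycle = visit(name, [])
        if cycle:
            return cycle
    return None


def scenario(irqs_enabled: bool) -> Dict[str, Cpu]:
    """CPU A holds the lock and waits for B's IPI ack; B spins on the lock."""
    return {
        "A": Cpu("A", irqs_enabled, holds={"the_lock"},
                 waits_ipi_ack_from=["B"]),
        "B": Cpu("B", irqs_enabled, wants_lock="the_lock"),
    }


if __name__ == "__main__":
    print("IRQs disabled:", find_deadlock(scenario(False)))
    print("IRQs enabled: ", find_deadlock(scenario(True)))
```

Under these assumptions the model reports a cycle between A and B only when both CPUs take the page fault with IRQs disabled, which matches the two backtraces (CPU 35 in flush_tlb_others_ipi, CPU 56 in _spin_lock). It says nothing about whether the linked commit is the right fix.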