Bug 706335 - Node failover in virtual cluster nodes causes physical host to dump core via kvm kernel module
Summary: Node failover in virtual cluster nodes causes physical host to dump core via kvm kernel module
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-05-20 07:14 UTC by Sirius Rayner-Karlsson
Modified: 2011-06-18 06:43 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-18 06:43:25 UTC
Target Upstream Version:


Attachments

Description Sirius Rayner-Karlsson 2011-05-20 07:14:00 UTC
* Description of problem:

I have two cluster nodes running RHEL 5.6. They are VMs on a physical host running RHEL 6.0 with all the latest errata applied as of May 19th. I am trying to reproduce an issue a customer is seeing: the primary cluster node is made to fail so that it gets fenced (via fence_xvm in the cluster nodes, talking to fence_virtd on the physical host) and it restarts as expected; then the second cluster node/VM fails, and that crashes the physical host.

The details from the first vmcore on 2.6.32-71.29.1.el6.x86_64:

      KERNEL: vmlinux                           
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Thu May 19 13:16:37 2011
      UPTIME: 00:06:40
LOAD AVERAGE: 31.77, 22.40, 9.90
       TASKS: 448
    NODENAME: savage.ell.ite
     RELEASE: 2.6.32-71.29.1.el6.x86_64
     VERSION: #1 SMP Thu Apr 21 16:08:55 EDT 2011
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 8 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 8165
     COMMAND: "qemu-kvm"
        TASK: ffff8802239914e0  [THREAD_INFO: ffff880223a26000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

The log entry from the vmcore looks like this:

BUG: unable to handle kernel paging request at fffff16c81202958
IP: [<ffffffffa032c780>] __mmu_unsync_walk+0x90/0x230 [kvm]
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0 
Modules linked in: tun ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables kspiceusb(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc ipv6 ext3 jbd dm_mirror dm_region_hash dm_log kvm_intel kvm i2c_i801 iTCO_wdt iTCO_vendor_support sg tg3 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i82975x_edac edac_core ext4 mbcache jbd2 raid1 sd_mod crc_t10dif sr_mod cdrom ahci firewire_ohci firewire_core crc_itu_t ata_generic pata_acpi ata_piix usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core dm_mod [last unloaded: microcode]

Modules linked in: tun ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables kspiceusb(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc ipv6 ext3 jbd dm_mirror dm_region_hash dm_log kvm_intel kvm i2c_i801 iTCO_wdt iTCO_vendor_support sg tg3 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i82975x_edac edac_core ext4 mbcache jbd2 raid1 sd_mod crc_t10dif sr_mod cdrom ahci firewire_ohci firewire_core crc_itu_t ata_generic pata_acpi ata_piix usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core dm_mod [last unloaded: microcode]
Pid: 8165, comm: qemu-kvm Not tainted 2.6.32-71.29.1.el6.x86_64 #1 SD39V10
RIP: 0010:[<ffffffffa032c780>]  [<ffffffffa032c780>] __mmu_unsync_walk+0x90/0x230 [kvm]
RSP: 0018:ffff880223a27868  EFLAGS: 00010216
RAX: 0000076c81202948 RBX: ffff8801abbe9b48 RCX: 0000010f80292a78
RDX: 00000000000001ff RSI: 0000000000000200 RDI: ffff8801abbe9b48
RBP: ffff880223a278b8 R08: ffffea00077c52f8 R09: 0000000000000000
R10: ffff880028401c40 R11: 0000000000000000 R12: ffff8801abbe9ad8
R13: ffffea0000000000 R14: 000ffffffffff000 R15: ffff880223a278d8
FS:  00007fb9ac46e700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: fffff16c81202958 CR3: 00000002238e0000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-kvm (pid: 8165, threadinfo ffff880223a26000, task ffff8802239914e0)
Stack:
 ffff880223a278e8 ffffffff8104b5fd ffff880223a278a8 00000000a03158c2
<0> 0000000000000008 ffff8801abbe9ad8 0000000000000000 ffff880223a279e8
<0> ffff88021da88000 ffff8801abbe9ad8 ffff880223a27a48 ffffffffa032d8c6
Call Trace:
 [<ffffffff8104b5fd>] ? get_user_pages_fast+0xdd/0x1c0
 [<ffffffffa032d8c6>] mmu_zap_unsync_children+0xa6/0x200 [kvm]
 [<ffffffffa032ec49>] ? paging64_prefetch_page+0x99/0x100 [kvm]
 [<ffffffff81157ad4>] ? kmem_cache_alloc_node_notrace+0x104/0x130
 [<ffffffff8111efef>] ? free_hot_page+0x2f/0x60
 [<ffffffffa032d513>] kvm_mmu_zap_page+0x43/0x350 [kvm]
 [<ffffffffa032dd6c>] __kvm_mmu_free_some_pages+0x2c/0x60 [kvm]
 [<ffffffffa0333756>] paging64_page_fault+0x4b6/0x4c0 [kvm]
 [<ffffffffa0325905>] ? emulator_write_emulated+0x75/0x90 [kvm]
 [<ffffffffa03267d7>] ? emulate_instruction+0x307/0x380 [kvm]
 [<ffffffffa032fe8f>] kvm_mmu_page_fault+0x1f/0xa0 [kvm]
 [<ffffffffa0375e60>] handle_exception+0x2c0/0x380 [kvm_intel]
 [<ffffffffa0375905>] vmx_handle_exit+0xc5/0x280 [kvm_intel]
 [<ffffffffa0328e12>] kvm_arch_vcpu_ioctl_run+0x392/0xdd0 [kvm]
 [<ffffffff8104f70c>] ? enqueue_task+0x5c/0x70
 [<ffffffffa03141f2>] kvm_vcpu_ioctl+0x522/0x670 [kvm]
 [<ffffffff810a5522>] ? do_futex+0x682/0xb00
 [<ffffffff8105c592>] ? default_wake_function+0x12/0x20
 [<ffffffff8117fdf2>] vfs_ioctl+0x22/0xa0
 [<ffffffff811802ba>] do_vfs_ioctl+0x3aa/0x580
 [<ffffffff81180511>] sys_ioctl+0x81/0xa0
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Code: 3b 05 2d db 01 00 74 3b 48 3b 05 2c db 01 00 74 32 84 c0 78 2e 4c 21 f0 48 c1 e8 0c 48 8d 0c c5 00 00 00 00 48 c1 e0 06 48 29 c8 <4a> 8b 4c 28 10 44 8b 49 64 45 85 c9 0f 85 de 00 00 00 80 79 60 
RIP  [<ffffffffa032c780>] __mmu_unsync_walk+0x90/0x230 [kvm]
 RSP <ffff880223a27868>
CR2: fffff16c81202958
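As an aside on the numbers above: my reading of the Code: bytes (not stated in the report) is that the trapping instruction at __mmu_unsync_walk+0x90 decodes as `mov 0x10(%rax,%r13,1),%rcx`, so the faulting address should equal RAX + R13 + 0x10. A quick sanity check with the values copied from the register dump above:

```python
# Values copied from the register dump in the first oops.
cr2 = 0xfffff16c81202958  # faulting address (CR2)
rax = 0x0000076c81202948  # RAX at the time of the fault
r13 = 0xffffea0000000000  # R13 (plausibly the mem_map base)

# If the trapping instruction is mov 0x10(%rax,%r13,1),%rcx,
# the effective address must be RAX + R13 + 0x10 (mod 2^64):
assert (rax + r13 + 0x10) & 0xFFFFFFFFFFFFFFFF == cr2
```

The arithmetic checks out, which means the pointer computation itself was self-consistent and the bad value was already in RAX before the dereference; that is consistent with the memory-corruption suspicion raised later in this bug.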

Once I noticed the problem, I tried to reproduce it, and in two attempts I got two crashes generating vmcores, so it is reproducible. I have not yet tested RHEL 6.1 as it has not synced to my local Satellite; once it has, I will try to reproduce again. I have two vmcores available for 2.6.32-71.29.1.el6.x86_64, one 204 MB and one 495 MB, both partial (kdump.conf is pristine, no changes made, so whatever the RHEL 6 default is). Tell me where you want the vmcores and I will upload them there.

 * Version-Release number of selected component (if applicable):

2.6.32-71.29.1.el6.x86_64

 * How reproducible:

Very. In three attempts, three vmcores. I can describe the environment in which the problem manifests, but I am not certain exactly how much of it is required to trigger this. I have sosreports from the two VMs and from the physical host, which I will attach. If anything is unclear, let me know and I will explain/elaborate.


 * Steps to Reproduce:
   1. Start the physical system; once it is up, start the two VMs and let their cluster form.
   2. Make the master cluster node, vm1, fail. It gets fenced and restarts.
   3. The other cluster node, vm2, takes over the service and becomes primary. When vm1 comes back up and rejoins the cluster, vm2 shortly fails (I don't know why) and gets fenced/restarted. At this point, the oops/panic occurs on the physical host.
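For context, the fencing path in the steps above relies on fence_virtd on the host answering fence_xvm requests from the guests. A minimal host-side configuration of the kind this setup implies would look roughly like the sketch below; the interface, multicast address, key path, and module path are assumptions, not taken from the report:

```
# /etc/fence_virt.conf (sketch; values assumed)
fence_virtd {
        listener = "multicast";
        backend = "libvirt";
        module_path = "/usr/lib64/fence-virt";
}

listeners {
        multicast {
                key_file = "/etc/cluster/fence_xvm.key";
                address = "225.0.0.12";
                interface = "br0";
                family = "ipv4";
        }
}

backends {
        libvirt {
                uri = "qemu:///system";
        }
}
```

With something like that in place, `fence_xvm -o list` from a guest should enumerate the domains, and the cluster fences a node roughly as `fence_xvm -o reboot -H vm1` would by hand.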
  
 * Actual results:

Kernel Oops and Panic

 * Expected results:

No oops or panic.

 * Additional info:

Attaching sosreports of the two VMs and the physical host. Two vmcores available, let me know where you want them and I’ll upload them.

Comment 3 Sirius Rayner-Karlsson 2011-05-20 07:41:19 UTC
The two later vmcores have logs and backtraces like so:

[ /var/crash/127.0.0.1-2011-05-19-15:46:42 ]
br0: port 2(vnet0) entering forwarding state
kvm: 5387: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
kvm: 5387: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffdb5eb8
kvm: 5387: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x530079
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa034bb8a>] gfn_to_rmap+0x2a/0x80 [kvm]
PGD 21f0ce067 PUD 225bb4067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1 
Modules linked in: tun ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables kspiceusb(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc ipv6 ext3 jbd dm_mirror dm_region_hash dm_log kvm_intel kvm i82975x_edac edac_core i2c_i801 sg iTCO_wdt iTCO_vendor_support tg3 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 raid1 sr_mod cdrom firewire_ohci firewire_core crc_itu_t sd_mod crc_t10dif ata_generic pata_acpi ata_piix ahci usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core dm_mod [last unloaded: microcode]

Modules linked in: tun ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables kspiceusb(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc ipv6 ext3 jbd dm_mirror dm_region_hash dm_log kvm_intel kvm i82975x_edac edac_core i2c_i801 sg iTCO_wdt iTCO_vendor_support tg3 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 raid1 sr_mod cdrom firewire_ohci firewire_core crc_itu_t sd_mod crc_t10dif ata_generic pata_acpi ata_piix ahci usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core dm_mod [last unloaded: microcode]
Pid: 5408, comm: qemu-kvm Not tainted 2.6.32-71.29.1.el6.x86_64 #1 SD39V10
RIP: 0010:[<ffffffffa034bb8a>]  [<ffffffffa034bb8a>] gfn_to_rmap+0x2a/0x80 [kvm]
RSP: 0018:ffff88021f05ba08  EFLAGS: 00010246
RAX: 0000000000000000 RBX: fffffffffffff001 RCX: ffff88021d5dc900
RDX: 0000000000000021 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88021f05ba18 R08: 0000000000000021 R09: 0000000000000000
R10: ffff880028401f80 R11: 0000000000000000 R12: 0000000000000001
R13: 000000d4b005254f R14: ffff880225b60b88 R15: 00000000000001ff
FS:  00007f18aa413700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000225163000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-kvm (pid: 5408, threadinfo ffff88021f05a000, task ffff880224f84af0)
Stack:
 ffff880224091ff8 ffff8801bc39c000 ffff88021f05ba48 ffffffffa034bc7f
<0> ffff880225b60b88 ffff8801bc39c000 ffff880224091ff8 0000000000000000
<0> ffff88021f05ba88 ffffffffa034c554 ffff88021f05ba88 ffff88021f09c040
Call Trace:
 [<ffffffffa034bc7f>] rmap_remove+0x9f/0x1c0 [kvm]
 [<ffffffffa034c554>] kvm_mmu_zap_page+0x84/0x350 [kvm]
 [<ffffffffa034cd6c>] __kvm_mmu_free_some_pages+0x2c/0x60 [kvm]
 [<ffffffffa0352756>] paging64_page_fault+0x4b6/0x4c0 [kvm]
 [<ffffffffa034ee8f>] kvm_mmu_page_fault+0x1f/0xa0 [kvm]
 [<ffffffffa0394e60>] handle_exception+0x2c0/0x380 [kvm_intel]
 [<ffffffffa0394905>] vmx_handle_exit+0xc5/0x280 [kvm_intel]
 [<ffffffffa0347e12>] kvm_arch_vcpu_ioctl_run+0x392/0xdd0 [kvm]
 [<ffffffff8104f70c>] ? enqueue_task+0x5c/0x70
 [<ffffffffa03331f2>] kvm_vcpu_ioctl+0x522/0x670 [kvm]
 [<ffffffff810a5522>] ? do_futex+0x682/0xb00
 [<ffffffff8105c592>] ? default_wake_function+0x12/0x20
 [<ffffffff8117fdf2>] vfs_ioctl+0x22/0xa0
 [<ffffffff811802ba>] do_vfs_ioctl+0x3aa/0x580
 [<ffffffff81180511>] sys_ioctl+0x81/0xa0
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Code: 00 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 0f 1f 44 00 00 41 89 d4 48 89 f3 e8 7f 87 fe ff 41 83 fc 01 48 89 c6 75 19 <48> 2b 18 48 c1 e3 03 48 03 58 18 48 89 d8 4c 8b 64 24 08 48 8b 
RIP  [<ffffffffa034bb8a>] gfn_to_rmap+0x2a/0x80 [kvm]
 RSP <ffff88021f05ba08>
CR2: 0000000000000000
crash> bt
PID: 5408   TASK: ffff880224f84af0  CPU: 1   COMMAND: "qemu-kvm"
 #0 [ffff88021f05b6d0] machine_kexec at ffffffff8103697b
 #1 [ffff88021f05b730] crash_kexec at ffffffff810b9128
 #2 [ffff88021f05b800] oops_end at ffffffff814ccc00
 #3 [ffff88021f05b830] no_context at ffffffff8104656b
 #4 [ffff88021f05b880] __bad_area_nosemaphore at ffffffff810467f5
 #5 [ffff88021f05b8d0] bad_area at ffffffff8104691e
 #6 [ffff88021f05b900] do_page_fault at ffffffff814ce770
 #7 [ffff88021f05b950] page_fault at ffffffff814cbf75
    [exception RIP: gfn_to_rmap+42]
    RIP: ffffffffa034bb8a  RSP: ffff88021f05ba08  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: fffffffffffff001  RCX: ffff88021d5dc900
    RDX: 0000000000000021  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88021f05ba18   R8: 0000000000000021   R9: 0000000000000000
    R10: ffff880028401f80  R11: 0000000000000000  R12: 0000000000000001
    R13: 000000d4b005254f  R14: ffff880225b60b88  R15: 00000000000001ff
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff88021f05ba20] rmap_remove at ffffffffa034bc7f
 #9 [ffff88021f05ba50] kvm_mmu_zap_page at ffffffffa034c554
#10 [ffff88021f05ba90] __kvm_mmu_free_some_pages at ffffffffa034cd6c
#11 [ffff88021f05bab0] paging64_page_fault at ffffffffa0352756
#12 [ffff88021f05bc10] kvm_mmu_page_fault at ffffffffa034ee8f
#13 [ffff88021f05bc40] handle_exception at ffffffffa0394e60
#14 [ffff88021f05bc90] vmx_handle_exit at ffffffffa0394905
#15 [ffff88021f05bcd0] kvm_arch_vcpu_ioctl_run at ffffffffa0347e12
#16 [ffff88021f05bdb0] kvm_vcpu_ioctl at ffffffffa03331f2
#17 [ffff88021f05be60] vfs_ioctl at ffffffff8117fdf2
#18 [ffff88021f05bea0] do_vfs_ioctl at ffffffff811802ba
#19 [ffff88021f05bf30] sys_ioctl at ffffffff81180511
#20 [ffff88021f05bf80] system_call_fastpath at ffffffff81013172
    RIP: 0000003d948d99a7  RSP: 00007f18aa412c18  RFLAGS: 00000246
    RAX: 0000000000000010  RBX: ffffffff81013172  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 000000000000000b
    RBP: 0000000000000001   R8: 000000000085c760   R9: 0000000004000001
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000001dac490
    R13: 00000000c0c0c0c1  R14: 0000000000000000  R15: 0000000001df4540
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash> 

[ /var/crash/127.0.0.1-2011-05-20-09:10:34 ]
br0: port 2(vnet0) entering forwarding state
br0: port 3(vnet1) entering forwarding state
kvm: 15336: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
kvm: 15336: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffdb5ec4
kvm: 15336: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x530079
kvm: 15367: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
kvm: 15367: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffdb5ec4
kvm: 15367: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x530079
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa034bb8a>] gfn_to_rmap+0x2a/0x80 [kvm]
PGD 1b1ea8067 PUD 223e2f067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1 
Modules linked in: tun iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables kspiceusb(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc ipv6 ext3 jbd dm_mirror dm_region_hash dm_log kvm_intel kvm i82975x_edac edac_core sg i2c_i801 iTCO_wdt iTCO_vendor_support tg3 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 raid1 sr_mod cdrom firewire_ohci firewire_core crc_itu_t sd_mod crc_t10dif ata_generic pata_acpi ata_piix ahci usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core dm_mod [last unloaded: microcode]

Modules linked in: tun iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables kspiceusb(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc ipv6 ext3 jbd dm_mirror dm_region_hash dm_log kvm_intel kvm i82975x_edac edac_core sg i2c_i801 iTCO_wdt iTCO_vendor_support tg3 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 raid1 sr_mod cdrom firewire_ohci firewire_core crc_itu_t sd_mod crc_t10dif ata_generic pata_acpi ata_piix ahci usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core dm_mod [last unloaded: microcode]
Pid: 15359, comm: qemu-kvm Not tainted 2.6.32-71.29.1.el6.x86_64 #1 SD39V10
RIP: 0010:[<ffffffffa034bb8a>]  [<ffffffffa034bb8a>] gfn_to_rmap+0x2a/0x80 [kvm]
RSP: 0018:ffff88021fe53af8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: fffffffffffff001 RCX: ffff880224d9b900
RDX: 0000000000000021 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88021fe53b08 R08: 0000000000000021 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 000000d4b005254f R14: ffff880225593138 R15: ffff88021fe53b90
FS:  00007feb59b28700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000223e36000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-kvm (pid: 15359, threadinfo ffff88021fe52000, task ffff880222353520)
Stack:
 ffff8801b1d40ff8 ffff880223e28000 ffff88021fe53b38 ffffffffa034bc7f
<0> 0000000000000ff8 ffff880225593138 ffff880223e58040 fffffffffffff001
<0> ffff88021fe53bc8 ffffffffa034f9b1 ffff880225593138 ffff8802241b96c8
Call Trace:
 [<ffffffffa034bc7f>] rmap_remove+0x9f/0x1c0 [kvm]
 [<ffffffffa034f9b1>] paging64_sync_page+0xb1/0x1b0 [kvm]
 [<ffffffff81259349>] ? free_cpumask_var+0x9/0x10
 [<ffffffffa03356ec>] ? make_all_cpus_request+0xec/0x120 [kvm]
 [<ffffffffa034cca8>] kvm_sync_page+0x78/0x110 [kvm]
 [<ffffffffa034f1a3>] kvm_mmu_get_page+0x193/0x540 [kvm]
 [<ffffffffa0351fb7>] kvm_mmu_load+0x2b7/0x2f0 [kvm]
 [<ffffffffa0394905>] ? vmx_handle_exit+0xc5/0x280 [kvm_intel]
 [<ffffffffa0348541>] kvm_arch_vcpu_ioctl_run+0xac1/0xdd0 [kvm]
 [<ffffffffa03331f2>] kvm_vcpu_ioctl+0x522/0x670 [kvm]
 [<ffffffff810a5522>] ? do_futex+0x682/0xb00
 [<ffffffff8105c592>] ? default_wake_function+0x12/0x20
 [<ffffffff8117fdf2>] vfs_ioctl+0x22/0xa0
 [<ffffffff811802ba>] do_vfs_ioctl+0x3aa/0x580
 [<ffffffff81180511>] sys_ioctl+0x81/0xa0
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Code: 00 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 0f 1f 44 00 00 41 89 d4 48 89 f3 e8 7f 87 fe ff 41 83 fc 01 48 89 c6 75 19 <48> 2b 18 48 c1 e3 03 48 03 58 18 48 89 d8 4c 8b 64 24 08 48 8b 
RIP  [<ffffffffa034bb8a>] gfn_to_rmap+0x2a/0x80 [kvm]
 RSP <ffff88021fe53af8>
CR2: 0000000000000000
crash> bt
PID: 15359  TASK: ffff880222353520  CPU: 1   COMMAND: "qemu-kvm"
 #0 [ffff88021fe537c0] machine_kexec at ffffffff8103697b
 #1 [ffff88021fe53820] crash_kexec at ffffffff810b9128
 #2 [ffff88021fe538f0] oops_end at ffffffff814ccc00
 #3 [ffff88021fe53920] no_context at ffffffff8104656b
 #4 [ffff88021fe53970] __bad_area_nosemaphore at ffffffff810467f5
 #5 [ffff88021fe539c0] bad_area at ffffffff8104691e
 #6 [ffff88021fe539f0] do_page_fault at ffffffff814ce770
 #7 [ffff88021fe53a40] page_fault at ffffffff814cbf75
    [exception RIP: gfn_to_rmap+42]
    RIP: ffffffffa034bb8a  RSP: ffff88021fe53af8  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: fffffffffffff001  RCX: ffff880224d9b900
    RDX: 0000000000000021  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88021fe53b08   R8: 0000000000000021   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000001
    R13: 000000d4b005254f  R14: ffff880225593138  R15: ffff88021fe53b90
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff88021fe53b10] rmap_remove at ffffffffa034bc7f
 #9 [ffff88021fe53b40] paging64_sync_page at ffffffffa034f9b1
#10 [ffff88021fe53bd0] kvm_sync_page at ffffffffa034cca8
#11 [ffff88021fe53c00] kvm_mmu_get_page at ffffffffa034f1a3
#12 [ffff88021fe53c60] kvm_mmu_load at ffffffffa0351fb7
#13 [ffff88021fe53cd0] kvm_arch_vcpu_ioctl_run at ffffffffa0348541
#14 [ffff88021fe53db0] kvm_vcpu_ioctl at ffffffffa03331f2
#15 [ffff88021fe53e60] vfs_ioctl at ffffffff8117fdf2
#16 [ffff88021fe53ea0] do_vfs_ioctl at ffffffff811802ba
#17 [ffff88021fe53f30] sys_ioctl at ffffffff81180511
#18 [ffff88021fe53f80] system_call_fastpath at ffffffff81013172
    RIP: 0000003d948d99a7  RSP: 00007feb59b27c18  RFLAGS: 00000246
    RAX: 0000000000000010  RBX: ffffffff81013172  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 000000000000000b
    RBP: 0000000000000001   R8: 000000000085c760   R9: 0000000004000001
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000001019490
    R13: 00000000c0c0c0c1  R14: 0000000000000000  R15: 0000000001061540
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash>

Comment 5 Chris Wright 2011-05-20 16:55:45 UTC
Any chance to attempt to reproduce on 6.1 yet?

Comment 6 Sirius Rayner-Karlsson 2011-05-21 13:42:15 UTC
This morning, I saw that my Satellite had synced in RHEL 6.1, so I updated, rebooted, and started the two VMs that I use to reproduce the issue. It crashed immediately.

[root@savage 127.0.0.1-2011-05-21-10:20:18]# pwd
/var/crash/127.0.0.1-2011-05-21-10:20:18
[root@savage 127.0.0.1-2011-05-21-10:20:18]# ls -l
total 192256
-rw-------. 1 root root 196860338 May 21 10:20 vmcore
lrwxrwxrwx. 1 root root        61 May 21 10:52 vmlinux -> /usr/lib/debug/lib/modules/2.6.32-131.0.15.el6.x86_64/vmlinux
[root@savage 127.0.0.1-2011-05-21-10:20:18]# uname -a
Linux savage.ell.ite 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

From crash:


      KERNEL: vmlinux                           
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Sat May 21 10:19:02 2011
      UPTIME: 00:25:41
LOAD AVERAGE: 0.77, 0.34, 0.50
       TASKS: 264
    NODENAME: savage.ell.ite
     RELEASE: 2.6.32-131.0.15.el6.x86_64
     VERSION: #1 SMP Tue May 10 15:42:40 EDT 2011
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 8 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 8626
     COMMAND: "qemu-kvm"
        TASK: ffff880223d6f4c0  [THREAD_INFO: ffff880222d8c000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)
crash> log
...
br0: port 2(vnet0) entering forwarding state
br0: port 3(vnet1) entering forwarding state
kvm: 8196: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
kvm: 8196: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffdb5f52
kvm: 8196: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x530079
kvm: 8228: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
kvm: 8228: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffdb5f52
kvm: 8228: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x530079
[drm] nouveau 0000:01:00.0: Setting dpms mode 1 on vga encoder (output 0)
br0: port 2(vnet0) entering disabled state
device vnet0 left promiscuous mode
br0: port 2(vnet0) entering disabled state
device vnet0 entered promiscuous mode
br0: port 2(vnet0) entering learning state
vnet0: no IPv6 routers present
br0: port 2(vnet0) entering forwarding state
kvm: 8604: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
kvm: 8604: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffdb5f52
kvm: 8604: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x530079
BUG: unable to handle kernel paging request at 0000188681202610
IP: [<ffffffffa037673d>] kvm_mmu_zap_page+0x12d/0x360 [kvm]
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1 
Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables kspiceusb(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc ipv6 ext3 jbd dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun kvm_intel kvm tg3 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc shpchp i82975x_edac edac_core ext4 mbcache jbd2 raid1 sd_mod crc_t10dif sr_mod cdrom firewire_ohci firewire_core crc_itu_t ahci ata_generic pata_acpi ata_piix usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mod [last unloaded: scsi_wait_scan]

Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables kspiceusb(U) sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc ipv6 ext3 jbd dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun kvm_intel kvm tg3 microcode i2c_i801 sg iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc shpchp i82975x_edac edac_core ext4 mbcache jbd2 raid1 sd_mod crc_t10dif sr_mod cdrom firewire_ohci firewire_core crc_itu_t ahci ata_generic pata_acpi ata_piix usb_storage nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mod [last unloaded: scsi_wait_scan]
Pid: 8626, comm: qemu-kvm Not tainted 2.6.32-131.0.15.el6.x86_64 #1 SD39V10
RIP: 0010:[<ffffffffa037673d>]  [<ffffffffa037673d>] kvm_mmu_zap_page+0x12d/0x360 [kvm]
RSP: 0018:ffff880222d8da38  EFLAGS: 00010206
RAX: 00002e8681202600 RBX: ffff880223c99d98 RCX: ffff88021fbc8000
RDX: ffffea0000000000 RSI: ffff880223eafff8 RDI: ffff880223d4d3f8
RBP: ffff880222d8da68 R08: ffffea00077c53d8 R09: 0000000000000000
R10: ffff880028401f80 R11: 0000000000000000 R12: ffff88021fbc8000
R13: ffff880223eafff8 R14: 0000000000000000 R15: 00000000000001ff
FS:  00007fb0c8365700(0000) GS:ffff880028280000(0000) knlGS:ffffffff80426000
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000188681202610 CR3: 0000000223359000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-kvm (pid: 8626, threadinfo ffff880222d8c000, task ffff880223d6f4c0)
Stack:
 ffff880222d8da68 ffff880225678040 0000000000000141 0000000000000001
<0> 0000003b43a5f030 000000008000030e ffff880222d8da88 ffffffffa0376ec4
<0> 0000003b43a5f030 ffff880225678040 ffff880222d8dbe8 ffffffffa037ca7b
Call Trace:
 [<ffffffffa0376ec4>] __kvm_mmu_free_some_pages+0x34/0x60 [kvm]
 [<ffffffffa037ca7b>] paging64_page_fault+0x49b/0x4b0 [kvm]
 [<ffffffffa036e5c5>] ? emulator_write_emulated+0x75/0x90 [kvm]
 [<ffffffffa0370eb7>] ? emulate_instruction+0x307/0x380 [kvm]
 [<ffffffffa037919f>] kvm_mmu_page_fault+0x1f/0xa0 [kvm]
 [<ffffffffa03bffb0>] handle_exception+0x2c0/0x380 [kvm_intel]
 [<ffffffff8100be6e>] ? reschedule_interrupt+0xe/0x20
 [<ffffffffa03bfa11>] vmx_handle_exit+0xc1/0x280 [kvm_intel]
 [<ffffffffa0371e4a>] kvm_arch_vcpu_ioctl_run+0x3da/0xec0 [kvm]
 [<ffffffffa035c332>] kvm_vcpu_ioctl+0x522/0x670 [kvm]
 [<ffffffff8105dc72>] ? default_wake_function+0x12/0x20
 [<ffffffff811876f6>] ? pollwake+0x56/0x60
 [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
 [<ffffffff81184ee2>] vfs_ioctl+0x22/0xa0
 [<ffffffff811853aa>] do_vfs_ioctl+0x3aa/0x580
 [<ffffffff81185601>] sys_ioctl+0x81/0xa0
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Code: 00 f0 ff ff ff ff 0f 00 4c 89 ee 48 21 d0 48 c1 e8 0c 48 8d 14 c5 00 00 00 00 48 c1 e0 06 48 29 d0 48 ba 00 00 00 00 00 ea ff ff <48> 8b 7c 10 10 e8 19 f3 ff ff 48 8b 05 3a d2 01 00 e9 48 ff ff 
RIP  [<ffffffffa037673d>] kvm_mmu_zap_page+0x12d/0x360 [kvm]
 RSP <ffff880222d8da38>
CR2: 0000188681202610
crash> bt
PID: 8626   TASK: ffff880223d6f4c0  CPU: 1   COMMAND: "qemu-kvm"
 #0 [ffff880222d8d600] machine_kexec at ffffffff810310db
 #1 [ffff880222d8d660] crash_kexec at ffffffff810b63b2
 #2 [ffff880222d8d730] oops_end at ffffffff814dec50
 #3 [ffff880222d8d760] no_context at ffffffff81040cdb
 #4 [ffff880222d8d7b0] __bad_area_nosemaphore at ffffffff81040f65
 #5 [ffff880222d8d800] bad_area at ffffffff8104108e
 #6 [ffff880222d8d830] __do_page_fault at ffffffff810417b3
 #7 [ffff880222d8d950] do_page_fault at ffffffff814e0c3e
 #8 [ffff880222d8d980] page_fault at ffffffff814ddfe5
    [exception RIP: kvm_mmu_zap_page+301]
    RIP: ffffffffa037673d  RSP: ffff880222d8da38  RFLAGS: 00010206
    RAX: 00002e8681202600  RBX: ffff880223c99d98  RCX: ffff88021fbc8000
    RDX: ffffea0000000000  RSI: ffff880223eafff8  RDI: ffff880223d4d3f8
    RBP: ffff880222d8da68   R8: ffffea00077c53d8   R9: 0000000000000000
    R10: ffff880028401f80  R11: 0000000000000000  R12: ffff88021fbc8000
    R13: ffff880223eafff8  R14: 0000000000000000  R15: 00000000000001ff
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880222d8da70] __kvm_mmu_free_some_pages at ffffffffa0376ec4 [kvm]
#10 [ffff880222d8da90] paging64_page_fault at ffffffffa037ca7b [kvm]
#11 [ffff880222d8dbf0] kvm_mmu_page_fault at ffffffffa037919f [kvm]
#12 [ffff880222d8dc20] handle_exception at ffffffffa03bffb0 [kvm_intel]
#13 [ffff880222d8dc70] vmx_handle_exit at ffffffffa03bfa11 [kvm_intel]
#14 [ffff880222d8dcb0] kvm_arch_vcpu_ioctl_run at ffffffffa0371e4a [kvm]
#15 [ffff880222d8ddb0] kvm_vcpu_ioctl at ffffffffa035c332 [kvm]
#16 [ffff880222d8de60] vfs_ioctl at ffffffff81184ee2
#17 [ffff880222d8dea0] do_vfs_ioctl at ffffffff811853aa
#18 [ffff880222d8df30] sys_ioctl at ffffffff81185601
#19 [ffff880222d8df80] system_call_fastpath at ffffffff8100b172
    RIP: 00000039b88de9a7  RSP: 00007fb0c8364ad8  RFLAGS: 00000246
    RAX: 0000000000000010  RBX: ffffffff8100b172  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 000000000000000b
    RBP: 0000000000000001   R8: 000000000000000b   R9: 00000000000021b2
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000002de8e50
    R13: 00000000c0c0c0c1  R14: 0000000000000000  R15: 0000000002f63630
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash> 

So this still happens on RHEL 6.1.

[root@savage 127.0.0.1-2011-05-21-10:20:18]# rpm -q kernel qemu-kvm libvirt
kernel-2.6.32-71.24.1.el6.x86_64
kernel-2.6.32-71.29.1.el6.x86_64
kernel-2.6.32-131.0.15.el6.x86_64
qemu-kvm-0.12.1.2-2.160.el6.x86_64
libvirt-0.8.7-18.el6.x86_64

Comment 7 Marcelo Tosatti 2011-05-25 18:15:22 UTC
Anders,

The oops messages indicate probable memory corruption. Please boot the kernel with the slub_debug=ZFPU option to verify.
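On RHEL 6 (grub 0.97), that option is appended to the kernel line in /boot/grub/grub.conf; a sketch of the resulting stanza for the kernel from comment 6 (the root device and other boot arguments are assumed, only slub_debug=ZFPU is the addition):

```
title Red Hat Enterprise Linux (2.6.32-131.0.15.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-131.0.15.el6.x86_64 ro root=/dev/mapper/... slub_debug=ZFPU
        initrd /initramfs-2.6.32-131.0.15.el6.x86_64.img
```

After rebooting, `cat /proc/cmdline` should show the option. The flags enable red zoning (Z), sanity checks (F), object poisoning (P), and allocation/free tracking (U), so slab corruption is trapped closer to its source.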

Comment 8 Sirius Rayner-Karlsson 2011-05-26 10:05:07 UTC
Hi there Marcelo,

I think I have to apologise here. The physical system this has been running on has suffered a failure that I can only narrow down to either the motherboard or the CPU having ceased to function. All other components have been verified in another system and are still working fine.

With the reproducer system dead due to hardware failure, my personal view in light of your update is that the problem was manifesting because the hardware was failing.

I have a replacement system on order, but it will take 1-2 weeks before I have all the parts and the system is installed to the point where I can again attempt to reproduce with the two VMs. Unless your view is that this is certainly hardware related, I suggest leaving this NEEDINFO on me until I have rebuilt the replacement system and attempted to reproduce on it. If it still occurs, I will follow your instructions. If I cannot reproduce, I will confirm this was hardware and close the issue.

Thanks for your patience,

/Anders

Comment 9 Sirius Rayner-Karlsson 2011-06-01 10:43:37 UTC
Hello again,

I have gotten the replacement system installed and set up so that I can start the VMs that reproduced the initial problem on the previous system. There is no problem this time, so what I observed on the older system must have been "hardware pre-failure issues" (for want of a better description).

My apologies for the wild goose chase, gentlemen. Please close this off as a hardware failure.

Thanks!

/Anders

Comment 10 Sirius Rayner-Karlsson 2011-06-18 06:43:25 UTC
I'll close this off as hardware related (a CPU failure by all accounts) and not reproducible on the new replacement system.

