Bug 865380
| Field | Value |
|---|---|
| Summary | Kernel oops/crash when running perf on a SandyBridge host |
| Product | Red Hat Enterprise Linux 6 |
| Component | kernel |
| kernel sub component | Perf |
| Version | 6.4 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Keywords | Regression |
| Target Milestone | rc |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | virt |
| Fixed In Version | kernel-2.6.32-338.el6 |
| Doc Type | Bug Fix |
| Type | Bug |
| Reporter | Qunfang Zhang <qzhang> |
| Assignee | Jiri Olsa <jolsa> |
| QA Contact | Zhang Kexin <kzhang> |
| CC | abaron, acme, areis, bazulay, chayang, dzickus, flang, gleb, iheim, jolsa, juzhang, knoel, kzhang, lveyde, mazhang, michen, shu, sluo, tburke, xfu |
| Last Closed | 2013-02-21 06:49:57 UTC |
| Bug Blocks | 1300182 |
Description
Qunfang Zhang
2012-10-11 10:03:22 UTC
Host cpu info:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
stepping : 7
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 6784.20
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

Created attachment 625475 [details]
Host crash log
Retest on the SandyBridge host with kernel-2.6.32-328.el6.x86_64:
1. "-M rhel6.4.0 -cpu SandyBridge", more than 10 times ==> Passed.
2. "-M rhel6.3.0" (default -cpu) ==> Reproduced on the third attempt.
3. "-M rhel6.3.0 -cpu SandyBridge" ==> Reproduced on the third attempt.

Retest on an old Conroe host with kernel-2.6.32-328.el6.x86_64:
1. "-M rhel6.3.0" ==> Passed after 20 attempts.

Can you attach guest dmesg with -cpu SandyBridge? Also, can you make sure that the crash does not happen when you run "perf record -e cycles firefox" (without any guest running on the machine)?

(In reply to comment #6)
> Can you attach guest dmesg with -cpu SandyBridge? Also can you make sure
> that the crash does not happen when you run "perf record -e cycles firefox"
> (without any guest running on the machine).

Hi Gleb,
The guest dmesg with -cpu SandyBridge will be uploaded. The crash also happens when I run "perf record -e cycles firefox" (without any guest running). The attachment will be uploaded as well.

Created attachment 627958 [details]
dmesg of guest with -cpu SandyBridge
Created attachment 627959 [details]
host crash log when running perf on host without guest running
(In reply to comment #7)
> The crash also happens when I run "perf record -e cycles firefox" (without
> any guest running). Attachment will be uploaded as well.

Thank you. This problem is unrelated to KVM: KVM creates a PMU counter just like perf does. Assigning back to kernel.

(In reply to comment #9)
> Created attachment 627959 [details]
> host crash log when running perf on host without guest running

Hi Qunfang,
The stack trace shows guests running. The panic itself happens within qemu-kvm. Can you re-run your test without qemu running?

Thanks,
Don

Hi Qunfang,
You don't happen to have a test setup, do you? I tried setting up a guest and failed miserably on 6.4. I ran 'perf record -e cycles grep -ri blah /*' on the host SandyBridge box with the -328.el6 kernel successfully. I was trying to see whether qemu caused issues.

Cheers,
Don

(In reply to comment #11)
> (In reply to comment #9)
> > Created attachment 627959 [details]
> > host crash log when running perf on host without guest running
> >
> > Hi Qunfang,
> >
> > Show stack trace shows guests running. The panic itself happens within
> > qemu-kvm. Can you re-run your test without qemu running?
> >
> > Thanks,
> > Don

Hi Don,
As far as I remember, I tested without a guest running. The host is currently being used by someone else. I will re-run the test once I get the host back. Thanks.

(In reply to comment #13)
> Hi Qunfang,
>
> You don't happen to have a test setup do you? I tried setting up a guest
> and failed miserably on 6.4.
>

You mean you need the setup for starting a guest, just to have qemu running? If that's the case, you should be able to use virt-manager. Install qemu-kvm, libvirt-daemon and virt-manager, start the libvirt service, and then invoke virt-manager. The rest should be fairly simple.

Hi Ademar,
I tried, but various 6.4 bugs got in my way, so I gave up. Dave Allan helped me through some. I was also trying to do this remotely, and virt-manager is not working over my ssh connection. I am not sure why port forwarding is not working. I will try running the virt install QE test to set me up for 6.3 and then migrate over to the 6.4 qemu tools.

Cheers,
Don

Hi,
So I finally had help setting up a RHEV guest. However, it uses qemu-kvm-rhev and friends instead of qemu-kvm. As a result, I haven't been able to duplicate this problem. I installed RHEL-6.3, added some rhev pkgs, and then installed a -328.el6 kernel on the host. I used RHEV to install a 6.3 guest and rebooted the guest multiple times without any warning. Another co-worker only sees this problem on RHEV with a -324.el6 kernel; I can't duplicate that either. Kinda stuck. Is there a machine I can play with to investigate the issue more?

Cheers,
Don

Created attachment 632035 [details]
kernel module source
I was able to reproduce the crash (or maybe a different but related one) without any guest running by using the attached module. Compile it and load it in a loop like this:
while true; do insmod perfev.ko; done
After a couple of iterations the kernel crashes with:
create event ffffffffa0b69000
counter=539 enabled=4027 running=4027
counter=10000743 enabled=10156430 running=3381
release event ffffffffa0b69000
exit ffffffffa0b69000
BUG: unable to handle kernel paging request at ffffffffa0b69000
IP: [<ffffffffa0b69000>] 0xffffffffa0b69000
PGD 1a87067 PUD 1a8b063 PMD 46e29c067 PTE 0
Oops: 0010 [#1] SMP
last sysfs file: /sys/devices/virtual/misc/autofs/uevent
CPU 13
Modules linked in: netconsole configfs autofs4 fuse nfsd exportfs nfs nfs_acl auth_rpcgss fscache lockd sunrpc kvm_intel kvm ipv6 ext2 dm_crypt snd_pcsp snd_pcm snd_page_alloc cdc_ether i2c_i801 i7core_edac snd_timer usbnet iTCO_wdt serio_raw i2c_core shpchp mii edac_core ioatdma dca iTCO_vendor_support snd soundcore ext3 mbcache jbd btrfs(T) libcrc32c lzo_compress lzo_decompress zlib_deflate dm_mod sr_mod sd_mod crc_t10dif cdrom usb_storage mptsas mptscsih mptbase bnx2 scsi_transport_sas [last unloaded: scsi_wait_scan]
Pid: 3625, comm: bash Tainted: G W --------------- T 2.6.32 #10 IBM IBM System x -[7870B3G]-/49Y5178
RIP: 0010:[<ffffffffa0b69000>] [<ffffffffa0b69000>] 0xffffffffa0b69000
RSP: 0018:ffff880287547cd0 EFLAGS: 00010086
RAX: ffffffffa0b69000 RBX: ffff88046e153c00 RCX: ffff880287547f58
RDX: ffff880287547dd8 RSI: 0000000000000001 RDI: ffff88046e153c00
RBP: ffff880287547d68 R08: ffff880287547f58 R09: ffff880287547dd8
R10: ffff88046ee3be00 R11: 0000000000000007 R12: 0000000000000001
R13: 0000000000000000 R14: ffff880287547f58 R15: 0000000000000000
FS: 00007f979819a700(0000) GS:ffff880287540000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0b69000 CR3: 000000046ed7f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 3625, threadinfo ffff88046ee3a000, task ffff88046eae0aa0)
Stack:
ffffffff81112540 0000000000000000 0000000000000000 0000000000000000
<d> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
<d> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
<NMI>
[<ffffffff81112540>] ? __perf_event_overflow+0xb0/0x2a0
[<ffffffff81112b54>] perf_event_overflow+0x14/0x20
[<ffffffff8101eb26>] intel_pmu_handle_irq+0x336/0x550
[<ffffffff8150e976>] ? kprobe_exceptions_notify+0x16/0x450
[<ffffffff8150d4d9>] perf_event_nmi_handler+0x39/0xb0
[<ffffffff8150efc6>] notifier_call_chain+0x56/0x80
[<ffffffff8150f02a>] atomic_notifier_call_chain+0x1a/0x20
[<ffffffff81098bde>] notify_die+0x2e/0x30
[<ffffffff8150cc5b>] do_nmi+0x1bb/0x340
[<ffffffff8150c510>] nmi+0x20/0x30
[<ffffffff8127edf8>] ? strnlen_user+0x78/0x90
<<EOE>>
[<ffffffff811847df>] copy_strings+0x7f/0x240
[<ffffffff811854e2>] do_execve+0x1e2/0x2c0
[<ffffffff810095ca>] sys_execve+0x4a/0x80
[<ffffffff8100b48a>] stub_execve+0x6a/0xc0
Code: Bad RIP value.
RIP [<ffffffffa0b69000>] 0xffffffffa0b69000
RSP <ffff880287547cd0>
CR2: ffffffffa0b69000
---[ end trace a252b2f0a2ddb53e ]---
So it looks like the event callback is called after the event is released by perf_event_release_kernel() and the module is unloaded.
The -279 kernel runs this insmod loop without any problem.
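The reasoning above can be checked mechanically against the log. The following is only a minimal sketch in Python: it assumes the address printed by the module on its "create event" / "exit" lines is an address inside the module (the attachment's source is not reproduced here, so that reading is an assumption), and the helper names `summarize` and `jumped_into_unloaded_module` are hypothetical. It compares that address with the faulting IP and CR2 taken from the oops.

```python
import re

# Excerpt of the log quoted above: module output followed by the oops header.
LOG = """\
create event ffffffffa0b69000
release event ffffffffa0b69000
exit ffffffffa0b69000
BUG: unable to handle kernel paging request at ffffffffa0b69000
IP: [<ffffffffa0b69000>] 0xffffffffa0b69000
CR2: ffffffffa0b69000
"""


def summarize(log):
    """Extract the module-printed address, the faulting IP and CR2 as ints."""
    def first(pattern):
        m = re.search(pattern, log, re.MULTILINE)
        return int(m.group(1), 16) if m else None

    return {
        # Assumption: the "exit" line prints an address inside the module.
        "module_addr": first(r"^exit ([0-9a-f]{16})$"),
        "fault_ip": first(r"^IP: \[<([0-9a-f]{16})>\]"),
        "cr2": first(r"^CR2: ([0-9a-f]{16})$"),
    }


def jumped_into_unloaded_module(log):
    """True when the CPU faulted while fetching code at the module's address."""
    s = summarize(log)
    return (s["module_addr"] is not None
            and s["module_addr"] == s["fault_ip"] == s["cr2"])


if __name__ == "__main__":
    print(jumped_into_unloaded_module(LOG))
```

All three values match in this log, which is consistent with the NMI handler calling into module memory that was already unmapped (note also "PTE 0" and "Code: Bad RIP value." in the oops).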
Created attachment 632179 [details]
fix
Fixes the leftover from:
[kernel] perf: Change and simplify ctx::is_active semantics
commit c103845d8ab9c98a97dee342ac86a496937a7a26
Author: Jiri Olsa <jolsa>
Date: Fri Oct 5 13:54:54 2012 -0400
It fixes the issue in my tests.
*** Bug 869216 has been marked as a duplicate of this bug. ***

Patch(es) available on kernel-2.6.32-338.el6

*** Bug 871329 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2013-0496.html

*** Bug 890962 has been marked as a duplicate of this bug. ***