Bug 742383 - panic on perf probe manipulation during perf record
Summary: panic on perf probe manipulation during perf record
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-09-29 21:49 UTC by Frank Ch. Eigler
Modified: 2011-11-08 00:07 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-11-08 00:07:09 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
dmesg (54.04 KB, text/plain)
2011-09-29 21:49 UTC, Frank Ch. Eigler
no flags Details

Description Frank Ch. Eigler 2011-09-29 21:49:29 UTC
Created attachment 525651 [details]
dmesg

kernel 3.1.0-0.rc6.git0.0.fc17.x86_64, running on 4cpu KVM

Testing kprobe / module loading stability, I ran into a reproducible
insta-crash using perf probe:

# cd /lib/modules/`uname -r`/kernel/fs
# find . -name '*.ko' | while read nm
do readelf -s $nm | grep 'FUNC.* 6 ' | grep -v _module  | awk '{print "'$nm' " $8}'
done | while read mod fn; do
perf probe -m $mod --add $fn 
done      

This creates perf-probes (kprobes) on (mostly) offline modules' init functions.
Now we start recording:

# perf record -e probe:\* -aR sh
sh-4.2# perf probe --del probe:\*

Inside the subshell, we drop all the kprobes.  Now we exit:

sh-4.2# exit

Bang.  Panic text attached.

Comment 1 Masami Hiramatsu 2011-09-30 07:59:14 UTC
I've ensured that this occurs also on the upstream kernel. And when I got the log on VirtualBox, it said that the BUG was a kernel pagefault.

[  635.796100] BUG: unable to handle kernel paging request at ffff88984fd000ac
[  635.797084] IP: [<ffffffff810b9f76>] perf_trace_add+0x5a/0x7d
[  635.797084] PGD 0 
[  635.797084] Oops: 0000 [#1] SMP 
[  635.797084] CPU 1 
[  635.797084] Modules linked in: virtio_net pcspkr i2c_piix4 i2c_core [last unloaded: scsi_wait_scan]
[  635.797084] 
[  635.797084] Pid: 0, comm: kworker/0:0 Not tainted 3.1.0-rc4-tip+ #20 innotek GmbH VirtualBox
[  635.797084] RIP: 0010:[<ffffffff810b9f76>]  [<ffffffff810b9f76>] perf_trace_add+0x5a/0x7d
[  635.797084] RSP: 0018:ffff88004fd03cc8  EFLAGS: 00010086
[  635.797084] RAX: ffff88984fd000ac RBX: ffff88004c13b000 RCX: 0000000000000000

Here is the disassembled code:
----
        list = this_cpu_ptr(pcpu_list);
ffffffff810b9f6d:       65 48 03 04 25 f0 c4    add    %gs:0xc4f0,%rax
ffffffff810b9f74:       00 00
 * list-traversal primitive must be guarded by rcu_read_lock().
 */
static inline void hlist_add_head_rcu(struct hlist_node *n,
                                        struct hlist_head *h)
{
        struct hlist_node *first = h->first;
ffffffff810b9f76:       48 8b 10                mov    (%rax),%rdx
----

So, it seems that perf was trying to add removed event and failed.
Hmm, I need to dig this:
- Does that also occur on not .init functions?
- Does that also occur on online module functions?

Comment 2 Masami Hiramatsu 2011-10-03 02:30:21 UTC
OK, I think this should be fixed in kprobe-tracer side not to free a tracepoint if it is in use, because the suggested reproducing method is not a normal operation and should be rejected with -EBUSY.

Thank you for reporting :)

Comment 3 Masami Hiramatsu 2011-10-04 10:50:40 UTC
I've sent a bugfix patch (and 3 trivial typo fixes) here.
http://thread.gmane.org/gmane.linux.kernel/1199036

It prevents removing running kprobe-events to fix this bug.

Thank you,

Comment 4 Josh Boyer 2011-11-08 00:07:09 UTC
The fixes went upstream as
02ca1521ad404cf566e0075848f80d064c0a0503

and

44a56040a0037a845d5fa218dffde464579f0cab

Both were CC'd to stable, so they should make their way back into f15/f16 shortly.


Note You need to log in before you can comment on or make changes to this bug.