Bug 742383

Summary: panic on perf probe manipulation during perf record
Product: [Fedora] Fedora Reporter: Frank Ch. Eigler <fche>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, masami.hiramatsu.pt
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-11-08 00:07:09 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
dmesg none

Description Frank Ch. Eigler 2011-09-29 21:49:29 UTC
Created attachment 525651 [details]
dmesg

kernel 3.1.0-0.rc6.git0.0.fc17.x86_64, running on 4cpu KVM

Testing kprobe / module loading stability, I ran into a reproducible
insta-crash using perf probe:

# cd /lib/modules/`uname -r`/kernel/fs
# find . -name '*.ko' | while read nm
do readelf -s $nm | grep 'FUNC.* 6 ' | grep -v _module  | awk '{print "'$nm' " $8}'
done | while read mod fn; do
perf probe -m $mod --add $fn 
done      

This creates perf-probes (kprobes) on (mostly) offline modules' init functions.
Now we start recording:

# perf record -e probe:\* -aR sh
sh-4.2# perf probe --del probe:\*

Inside the subshell, we drop all the kprobes.  Now we exit:

sh-4.2# exit

Bang.  Panic text attached.

Comment 1 Masami Hiramatsu 2011-09-30 07:59:14 UTC
I've ensured that this occurs also on the upstream kernel. And when I got the log on VirtualBox, it said that the BUG was a kernel pagefault.

[  635.796100] BUG: unable to handle kernel paging request at ffff88984fd000ac
[  635.797084] IP: [<ffffffff810b9f76>] perf_trace_add+0x5a/0x7d
[  635.797084] PGD 0 
[  635.797084] Oops: 0000 [#1] SMP 
[  635.797084] CPU 1 
[  635.797084] Modules linked in: virtio_net pcspkr i2c_piix4 i2c_core [last unloaded: scsi_wait_scan]
[  635.797084] 
[  635.797084] Pid: 0, comm: kworker/0:0 Not tainted 3.1.0-rc4-tip+ #20 innotek GmbH VirtualBox
[  635.797084] RIP: 0010:[<ffffffff810b9f76>]  [<ffffffff810b9f76>] perf_trace_add+0x5a/0x7d
[  635.797084] RSP: 0018:ffff88004fd03cc8  EFLAGS: 00010086
[  635.797084] RAX: ffff88984fd000ac RBX: ffff88004c13b000 RCX: 0000000000000000

Here is the disassembled code:
----
        list = this_cpu_ptr(pcpu_list);
ffffffff810b9f6d:       65 48 03 04 25 f0 c4    add    %gs:0xc4f0,%rax
ffffffff810b9f74:       00 00
 * list-traversal primitive must be guarded by rcu_read_lock().
 */
static inline void hlist_add_head_rcu(struct hlist_node *n,
                                        struct hlist_head *h)
{
        struct hlist_node *first = h->first;
ffffffff810b9f76:       48 8b 10                mov    (%rax),%rdx
----

So, it seems that perf was trying to add removed event and failed.
Hmm, I need to dig this:
- Does that also occur on not .init functions?
- Does that also occur on online module functions?

Comment 2 Masami Hiramatsu 2011-10-03 02:30:21 UTC
OK, I think this should be fixed in kprobe-tracer side not to free a tracepoint if it is in use, because the suggested reproducing method is not a normal operation and should be rejected with -EBUSY.

Thank you for reporting :)

Comment 3 Masami Hiramatsu 2011-10-04 10:50:40 UTC
I've sent a bugfix patch (and 3 trivial typo fixes) here.
http://thread.gmane.org/gmane.linux.kernel/1199036

It prevents removing running kprobe-events to fix this bug.

Thank you,

Comment 4 Josh Boyer 2011-11-08 00:07:09 UTC
The fixes went upstream as
02ca1521ad404cf566e0075848f80d064c0a0503

and

44a56040a0037a845d5fa218dffde464579f0cab

Both were CC'd to stable, so they should make their way back into f15/f16 shortly.