Bug 84032 - starting profiling sometimes crashes the kernel
Summary: starting profiling sometimes crashes the kernel
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: oprofile
Version: 9
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: William Cohen
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 79578 CambridgeBlocker
 
Reported: 2003-02-11 04:17 UTC by Ulrich Drepper
Modified: 2007-04-18 16:51 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2003-10-07 02:07:59 UTC


Attachments
script to crash kernel via oprofile (852 bytes, text/plain)
2003-02-14 05:49 UTC, Ulrich Drepper
S script revised to use "rm -rf /var/lib/oprofile/samples/*" (854 bytes, text/plain)
2003-02-14 14:58 UTC, William Cohen

Description Ulrich Drepper 2003-02-11 04:17:37 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030203

Description of problem:
One combination of event counters crashes my UP P4 HT system sooner or later.
Until 2.4.20-2.40 every use of MEMORY_CANCEL seemed to be fatal.  With
2.4.20-2.41 I see the problem only after some preparation.

I'm using oprofile-0.4-40 but earlier versions had the problem, too.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
Run the following steps:

rm -f /root/.oprofile/daemonrc
rm -f /var/lib/oprofile/lock
rm -f /var/lib/oprofile/samples/*
opcontrol --init
opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED \
    --ctr3-count=600000 --ctr3-unit-mask=15
opcontrol --start

<Run some test program.  Mine was threaded and created 1,000,000 threads in sequence.>

opcontrol --stop
oprofpp -c 3 -l <SOMEDSO>

opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED \
    --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL \
    --ctr2-count=240000 --ctr2-unit-mask=8
opcontrol --start
^^^^ Sometimes this opcontrol call crashes the kernel

<Run the program again>

opcontrol --stop
rm -f /root/.oprofile/daemonrc
rm -f /var/lib/oprofile/lock
rm -f /var/lib/oprofile/samples/*
opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED \
    --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL \
    --ctr2-count=240000 --ctr2-unit-mask=8
opcontrol --start

^^^^ I never managed to get past this last opcontrol call.
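
For reference, the steps above could be collected into a script along the following lines. This is only a sketch and not the attached script: the `run` helper, the `DRY_RUN` switch, and the `TEST_PROG` and `DSO` placeholders are assumptions added for illustration. By default it only prints the commands; set `DRY_RUN=0` (as root, on a machine you can afford to crash) to execute them.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Prints commands unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
TEST_PROG=${TEST_PROG:-./threadtest}   # hypothetical test program
DSO=${DSO:-/path/to/some.dso}          # hypothetical DSO for oprofpp
VMLINUX="/boot/vmlinux-$(uname -r)"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove daemon state left over from an earlier run.
clean_state() {
    run rm -f /root/.oprofile/daemonrc
    run rm -f /var/lib/oprofile/lock
    run rm -f /var/lib/oprofile/samples/*
}

# Counter 3 only: this configuration worked.
setup_ctr3() {
    run opcontrol --setup --vmlinux="$VMLINUX" \
        --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15
}

# Counters 3 and 2: starting with this configuration triggered the crash.
setup_ctr3_ctr2() {
    run opcontrol --setup --vmlinux="$VMLINUX" \
        --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15 \
        --ctr2-event=MEMORY_CANCEL --ctr2-count=240000 --ctr2-unit-mask=8
}

reproduce() {
    clean_state
    run opcontrol --init
    setup_ctr3
    run opcontrol --start
    run "$TEST_PROG"
    run opcontrol --stop
    run oprofpp -c 3 -l "$DSO"

    setup_ctr3_ctr2
    run opcontrol --start      # sometimes crashed here
    run "$TEST_PROG"
    run opcontrol --stop

    clean_state
    setup_ctr3_ctr2
    run opcontrol --start      # crashed here in the report
}

reproduce
```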


Actual Results:  Kernel bug:

kernel BUG at ../../../drivers/oprofile/cpu_buffer.c:95!
invalid operand: 0000
oprofile nfs lockd sunrpc parport_pc lp parport autofs e100 ipt_REJECT iptable_
CPU:    1
EIP:    0060:[<d0933554>]    Not tainted
EFLAGS: 00010046

EIP is at oprofile_add_sample [oprofile] 0xc4 (2.4.20-2.41smp)
eax: 00000000   ebx: d0938700   ecx: 00000080   edx: 00000002
esi: 0806d880   edi: 00000002   ebp: 00000000   esp: caaf1f68
ds: 0068   es: 0068   ss: 0068
Process awk (pid: 26649, stackpage=caaf1000)
Stack: 00000000 00000002 00000048 d0935d61 d09396f8 00000004 cb333620 00000001
       caaf1fc4 0808bac8 08089eb6 bffff628 d0934ebd 00000001 d0939d88 caaf1fc4
       c010a682 caaf1fc4 00000001 0808bac4 c0109a4a caaf1fc4 00000000 0808bac4
Call Trace:   [<d0935d61>] p4_check_ctrs [oprofile] 0xc1 (0xcaaf1f74))
[<d09396f8>] counter_config [oprofile] 0x38 (0xcaaf1f78))
[<d0934ebd>] nmi_callback [oprofile] 0x2d (0xcaaf1f98))
[<d0939d88>] cpu_msrs [oprofile] 0x5e8 (0xcaaf1fa0))
[<c010a682>] do_nmi [kernel] 0x22 (0xcaaf1fa8))
[<c0109a4a>] nmi [kernel] 0x1e (0xcaaf1fb8))


Expected Results:  Profiling works.

Additional info:

I can provide you a version of the test program.  It's in my home dir on devserv.

Comment 1 William Cohen 2003-02-11 15:23:53 UTC
Does the last opcontrol --setup / opcontrol --start sequence work if the previous
opcontrol --setup / --start calls are not run?


Comment 2 William Cohen 2003-02-11 23:46:37 UTC
Using oprofile-0.4-41, I encounter a different problem. Removing
/var/lib/oprofile/lock and starting with a new setup yields

[root@dhcp59-189 SPECS]# opcontrol --start
Failed to open profile device: Device or resource busy
Couldn't start oprofiled.
Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages

It appears that the old daemon is still accessing one of the special files, so
the new daemon cannot open it (probably /dev/oprofile/buffer).

Can you reproduce the problem with "opcontrol --shutdown" instead of "opcontrol
--stop" and without the "rm -f ..."?

-Will
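
A quick way to tell whether an old daemon is still around before deleting the lock file might look like the sketch below. It assumes that /var/lib/oprofile/lock contains the PID of the running oprofiled; the `lock_status` helper and the `LOCK_FILE` variable are names made up for this example.

```shell
#!/bin/sh
# Report whether the process recorded in a lock file is still alive.
# Prints "running <pid>", "stale", or "absent".
# Assumption: the lock file holds the daemon's PID as a decimal number.
lock_status() {
    lock=$1
    if [ ! -f "$lock" ]; then
        echo absent
        return 0
    fi
    pid=$(cat "$lock")
    case "$pid" in
        ''|*[!0-9]*)
            # Empty or non-numeric content: treat as stale.
            echo stale
            return 0
            ;;
    esac
    # kill -0 only probes for existence; fall back to /proc on Linux
    # in case the signal is refused for permission reasons.
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        echo "running $pid"
    else
        echo stale
    fi
}

lock_status "${LOCK_FILE:-/var/lib/oprofile/lock}"
```

If this reports "running", removing the lock file by hand leaves the old daemon in place, which would be consistent with the "Device or resource busy" message above.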

Comment 3 William Cohen 2003-02-13 21:36:10 UTC
Uli, do you have a watchdog timer set up on this machine?

Comment 4 Ulrich Drepper 2003-02-13 21:39:35 UTC
> Uli, do you have a watchdog timer set up on this machine?

I usually have nmi_watchdog defined, yes.
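
One way to confirm the watchdog setting on a given boot is to look for the nmi_watchdog parameter on the kernel command line. The sketch below assumes /proc/cmdline is readable; the `nmi_watchdog_setting` helper is a name invented for this example.

```shell
#!/bin/sh
# Print the value of nmi_watchdog= from a kernel command line string,
# or "unset" if the parameter is not present.
# The body runs in a subshell so that "set -f" does not leak out.
nmi_watchdog_setting() (
    set -f                      # no glob expansion while splitting words
    result=unset
    for word in $1; do
        case "$word" in
            nmi_watchdog=*) result=${word#nmi_watchdog=} ;;
        esac
    done
    echo "$result"
)

nmi_watchdog_setting "$(cat /proc/cmdline 2>/dev/null)"
```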

Comment 5 Ulrich Drepper 2003-02-14 05:49:00 UTC
Created attachment 90077 [details]
script to crash kernel via oprofile

Executing this script crashes my UP P4 HT machine reliably (100%) during the
last opcontrol --start.

Comment 6 William Cohen 2003-02-14 14:58:04 UTC
Created attachment 90087 [details]
S script revised to use "rm -rf /var/lib/oprofile/samples/*"

I tried the attached script S (S2 merely does a better job cleaning out
/var/lib/oprofile/samples/*). It did not crash the machine, but I did get the
following output:

[root@dhcp59-189 root]# ./S
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
Stopping profiling.
Profiler running.
Stopping profiling.
Daemon not running
Failed to open profile device: Device or resource busy
Couldn't start oprofiled.
Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages

What it is doing is not right, but it isn't crashing. Here is the configuration
information. This is a Dell Precision 430 running a freshly installed
GinGin-re2011.nightly. What differences are there between the machine that the
script crashes on and the machine I am using to try to replicate the problem?
Here are the details about the machine and software:

rpm packages:
oprofile-0.4-41
kernel-smp-2.4.20-2.44



The processor is

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 15
model		: 2
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 7
cpu MHz 	: 2657.857
cache size	: 512 KB
physical id	: 0
siblings	: 2
fdiv_bug	: no
hlt_bug 	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips	: 5308.41

Comment 7 William Cohen 2003-06-03 14:26:49 UTC
Uli, is this bug still occurring?

Comment 8 William Cohen 2003-08-26 16:37:39 UTC
Uli, is this bug still occurring with the current RHL9 kernels?

Comment 11 Ulrich Drepper 2003-10-07 02:07:59 UTC
Sorry for the delay.  I cannot reproduce the problem anymore.  Assumed to be
fixed.  Closing as such.

