Bug 84032 - starting profiling sometimes crashes the kernel
Product: Red Hat Linux
Classification: Retired
Component: oprofile
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: William Cohen
Blocks: 79578 CambridgeBlocker
Reported: 2003-02-10 23:17 EST by Ulrich Drepper
Modified: 2007-04-18 12:51 EDT (History)
Doc Type: Bug Fix
Last Closed: 2003-10-06 22:07:59 EDT

Attachments
script to crash kernel via oprofile (852 bytes, text/plain)
2003-02-14 00:49 EST, Ulrich Drepper
Script S revised to use "rm -rf /var/lib/oprofile/samples/*" (854 bytes, text/plain)
2003-02-14 09:58 EST, William Cohen

Description Ulrich Drepper 2003-02-10 23:17:37 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030203

Description of problem:
One combination of event counters crashes my UP P4 HT system sooner or later.
Until 2.4.20-2.40 every use of MEMORY_CANCEL seemed to be fatal.  With
2.4.20-2.41 I see the problem only after some preparation.

I'm using oprofile-0.4-40 but earlier versions had the problem, too.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Run the following steps:

rm -f /root/.oprofile/daemonrc
rm -f /var/lib/oprofile/lock
rm -f /var/lib/oprofile/samples/*
opcontrol --init
opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED \
    --ctr3-count=600000 --ctr3-unit-mask=15
opcontrol --start

<Run some test program.  Mine was threaded and created 1,000,000 threads in sequence.>

opcontrol --stop
oprofpp -c 3 -l <SOMEDSO>

opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED \
    --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL \
    --ctr2-count=240000 --ctr2-unit-mask=8
opcontrol --start
^^^^ Sometimes this opcontrol call crashes the kernel

<Run the program again>

opcontrol --stop
rm -f /root/.oprofile/daemonrc
rm -f /var/lib/oprofile/lock
rm -f /var/lib/oprofile/samples/*
opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED \
    --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL \
    --ctr2-count=240000 --ctr2-unit-mask=8
opcontrol --start

^^^^ I never managed to get past this last opcontrol call.
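
For reference, the sequence above can be collected into one script. This is a minimal sketch, not the attached script: the `DRY_RUN` and `TEST_PROGRAM` variables and all function names are illustrative assumptions, the `oprofpp` step is left out because it needs a specific DSO, and only `opcontrol` options that appear in the steps above are used. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the reproduction sequence from the steps above.
# DRY_RUN, TEST_PROGRAM and the function names are hypothetical helpers,
# not part of oprofile.

DRY_RUN=${DRY_RUN:-1}                  # 1: only print commands, 0: run them
VMLINUX="/boot/vmlinux-$(uname -r)"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Remove the daemon configuration, the lock file and old samples.
clean_state() {
    run rm -f /root/.oprofile/daemonrc
    run rm -f /var/lib/oprofile/lock
    run rm -f /var/lib/oprofile/samples/*
}

# First setup: counter 3 only.
setup_one_counter() {
    run opcontrol --setup --vmlinux="$VMLINUX" \
        --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15
}

# Second setup: counter 3 plus MEMORY_CANCEL on counter 2.
setup_two_counters() {
    run opcontrol --setup --vmlinux="$VMLINUX" \
        --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15 \
        --ctr2-event=MEMORY_CANCEL --ctr2-count=240000 --ctr2-unit-mask=8
}

# Start profiling, run the workload, stop profiling.
profile_once() {
    run opcontrol --start
    run ${TEST_PROGRAM:-true}          # placeholder workload
    run opcontrol --stop
}

reproduce() {
    clean_state
    run opcontrol --init
    setup_one_counter
    profile_once
    setup_two_counters
    profile_once                       # this --start sometimes crashed
    clean_state
    setup_two_counters
    run opcontrol --start              # the --start reported as always fatal
}
```

Calling `reproduce` with `DRY_RUN` left at 1 only lists the commands. Setting `DRY_RUN=0` and `TEST_PROGRAM` to the threaded workload, then calling `reproduce` as root, runs them for real.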

Actual Results:  Kernel bug:

kernel BUG at ../../../drivers/oprofile/cpu_buffer.c:95!
invalid operand: 0000
oprofile nfs lockd sunrpc parport_pc lp parport autofs e100 ipt_REJECT iptable_
CPU:    1
EIP:    0060:[<d0933554>]    Not tainted
EFLAGS: 00010046

EIP is at oprofile_add_sample [oprofile] 0xc4 (2.4.20-2.41smp)
eax: 00000000   ebx: d0938700   ecx: 00000080   edx: 00000002
esi: 0806d880   edi: 00000002   ebp: 00000000   esp: caaf1f68
ds: 0068   es: 0068   ss: 0068
Process awk (pid: 26649, stackpage=caaf1000)
Stack: 00000000 00000002 00000048 d0935d61 d09396f8 00000004 cb333620 00000001
       caaf1fc4 0808bac8 08089eb6 bffff628 d0934ebd 00000001 d0939d88 caaf1fc4
       c010a682 caaf1fc4 00000001 0808bac4 c0109a4a caaf1fc4 00000000 0808bac4
Call Trace:   [<d0935d61>] p4_check_ctrs [oprofile] 0xc1 (0xcaaf1f74))
[<d09396f8>] counter_config [oprofile] 0x38 (0xcaaf1f78))
[<d0934ebd>] nmi_callback [oprofile] 0x2d (0xcaaf1f98))
[<d0939d88>] cpu_msrs [oprofile] 0x5e8 (0xcaaf1fa0))
[<c010a682>] do_nmi [kernel] 0x22 (0xcaaf1fa8))
[<c0109a4a>] nmi [kernel] 0x1e (0xcaaf1fb8))

Expected Results:  Profiling works.

Additional info:

I can provide you a version of the test program.  It's in my home dir on devserv.
Comment 1 William Cohen 2003-02-11 10:23:53 EST
Does the last opcontrol --setup / opcontrol --start work if the previous
opcontrol --setup / --start calls are not run?
Comment 2 William Cohen 2003-02-11 18:46:37 EST
Using oprofile-0.4-41, I encounter a different problem. Removing
/var/lib/oprofile/lock and starting with a new setup yields

[root@dhcp59-189 SPECS]# opcontrol --start
Failed to open profile device: Device or resource busy
Couldn't start oprofiled.
Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages

It appears that the old daemon is still accessing one of the special files, so
the new daemon cannot access that file (probably /dev/oprofile/buffer).

Can you reproduce the problem with "opcontrol --shutdown" instead of "opcontrol
--stop" and without the "rm -f ..."?
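
One way to check for this situation before removing the lock file is sketched below. It assumes that the lock file holds the PID of the running oprofiled, which is not confirmed in this report, and that the check runs as root; `lock_is_stale` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report whether a lock file refers to a process that is gone.
# Assumption: the lock file contains the daemon's PID.

lock_is_stale() {
    lock=$1
    [ -f "$lock" ] || return 0         # no lock file, nothing is held
    pid=$(cat "$lock")
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # As a non-root user it also fails for live processes owned by others.
    ! kill -0 "$pid" 2>/dev/null
}

if lock_is_stale /var/lib/oprofile/lock; then
    echo "no live daemon behind the lock file"
else
    echo "daemon still running; the profile device may be busy"
fi
```

If the daemon is still alive, removing the lock file by hand leaves it holding the device, which would match the "Device or resource busy" message above.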

Comment 3 William Cohen 2003-02-13 16:36:10 EST
Uli, do you have a watchdog timer set up on this machine?
Comment 4 Ulrich Drepper 2003-02-13 16:39:35 EST
> Uli, do you have a watchdog timer set up on this machine?

I usually have nmi_watchdog defined, yes.
Comment 5 Ulrich Drepper 2003-02-14 00:49:00 EST
Created attachment 90077 [details]
script to crash kernel via oprofile

Executing this script crashes my UP P4 HT machine reliably (100%) during the
last opcontrol --start.
Comment 6 William Cohen 2003-02-14 09:58:04 EST
Created attachment 90087 [details]
Script S revised to use "rm -rf /var/lib/oprofile/samples/*"

I tried the attached script S (S2 merely does a better job of cleaning out
/var/lib/oprofile/samples/*). The kernel did not crash. I did get the
following output:

[root@dhcp59-189 root]# ./S
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
Stopping profiling.
Profiler running.
Stopping profiling.
Daemon not running
Failed to open profile device: Device or resource busy
Couldn't start oprofiled.
Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages

What it is doing is not right, but it isn't crashing. Here is the configuration
information. This is a Dell Precision 430 running a freshly installed
GinGin-re2011.nightly. What differences are there between the machine that the
script crashes on and the machine I am using to try to replicate the problem?
Here are the details about the machine and software:

rpm packages:

The processor is

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 15
model		: 2
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 7
cpu MHz 	: 2657.857
cache size	: 512 KB
physical id	: 0
siblings	: 2
fdiv_bug	: no
hlt_bug 	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips	: 5308.41
Comment 7 William Cohen 2003-06-03 10:26:49 EDT
Uli, is this bug still occurring?
Comment 8 William Cohen 2003-08-26 12:37:39 EDT
Uli, is this bug still occurring with the current RHL9 kernels?
Comment 11 Ulrich Drepper 2003-10-06 22:07:59 EDT
Sorry for the delay.  I cannot reproduce the problem anymore.  Assumed to be
fixed.  Closing as such.
