Bug 84032
| Summary: | starting profiling sometimes crashes the kernel | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Linux | Reporter: | Ulrich Drepper <drepper> |
| Component: | oprofile | Assignee: | William Cohen <wcohen> |
| Status: | CLOSED CURRENTRELEASE | Severity: | high |
| Priority: | medium | Version: | 9 |
| Hardware: | i686 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2003-10-07 02:07:59 UTC |
| Bug Blocks: | 79578, 100643 | | |
Does the last opcontrol --setup/opcontrol --start work if the previous opcontrol --setup/--start calls are not run?

Using oprofile-0.4-41, I encounter a different problem. Removing /var/lib/oprofile/lock and starting with a new setup yields:
[root@dhcp59-189 SPECS]# opcontrol --start
Failed to open profile device: Device or resource busy
Couldn't start oprofiled.
Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages
It appears that the old daemon accesses one of the special files, and the new daemon cannot access that file (probably /dev/oprofile/buffer). Can you reproduce the problem with "opcontrol --shutdown" instead of "opcontrol --stop" and without the "rm -f ..."?
-Will

Uli, do you have a watchdog timer set up on this machine?

> Uli, do you have a watchdog timer set up on this machine?
I usually have nmi_watchdog defined, yes.
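
For anyone comparing machines, the watchdog answer above can be checked directly. The following is only a sketch: it assumes the watchdog is enabled through an nmi_watchdog= option on the kernel command line, and the function name and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Print the value of nmi_watchdog= from a kernel command line, if present.
# The optional argument (a file to read instead of /proc/cmdline) is an
# illustrative assumption, mainly useful for testing.
nmi_watchdog_setting() {
    cmdline_file="${1:-/proc/cmdline}"
    [ -r "$cmdline_file" ] || return 0
    # One option per line, keep only the value of the last nmi_watchdog=...
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^nmi_watchdog=//p' | tail -n 1
}

setting="$(nmi_watchdog_setting)"
if [ -n "$setting" ]; then
    echo "nmi_watchdog=$setting"
else
    echo "nmi_watchdog not set on the kernel command line"
fi
```

Running it on both the crashing and the non-crashing machine shows whether they differ in this respect.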
Created attachment 90077
script to crash kernel via oprofile
Executing this script crashes my UP P4 HT machine reliably (100%) during the
last opcontrol --start.
Created attachment 90087
S script revised to use "rm -rf /var/lib/oprofile/samples/*"
I tried the attached script S (S2 merely does a better job of cleaning out
/var/lib/oprofile/samples/*). The kernel did not crash the machine. I did get
the following output:
[root@dhcp59-189 root]# ./S
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
Stopping profiling.
Profiler running.
Stopping profiling.
Daemon not running
Failed to open profile device: Device or resource busy
Couldn't start oprofiled.
Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages
What it is doing is not right, but it isn't crashing. Here is the configuration
information. This is a Dell Precision 430 running a freshly installed
GinGin-re2011.nightly. What differences are there between the machine that the
script crashes on and the machine I am using to try to replicate the problem?
Here are the details about the machine and software:
rpm packages:
oprofile-0.4-41
kernel-smp-2.4.20-2.44
The processor is
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.66GHz
stepping : 7
cpu MHz : 2657.857
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 5308.41
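
To make the comparison between the two machines easier, the relevant cpuinfo fields can be pulled out with a small helper. This is a rough sketch, not part of either attachment: the function name cpu_summary and its optional file argument are illustrative assumptions, and it only looks at the fields shown in the listing above.

```shell
#!/bin/sh
# Summarize the cpuinfo fields that matter when comparing two test machines:
# number of logical CPUs, model name, siblings, and whether the ht flag is set.
cpu_summary() {
    cpuinfo_file="${1:-/proc/cpuinfo}"
    [ -r "$cpuinfo_file" ] || return 1
    awk -F': *' '
        {
            key = $1
            sub(/[ \t]+$/, "", key)      # drop the padding before the colon
        }
        key == "processor"                { cpus++ }
        key == "model name" && model == "" { model = $2 }
        key == "siblings"   && sib == ""   { sib = $2 }
        key == "flags"      && ht == ""    {
            # Pad with spaces so "ht" only matches as a whole flag
            ht = ((" " $2 " ") ~ / ht /) ? "yes" : "no"
        }
        END {
            printf "logical cpus: %d\n", cpus
            printf "model name: %s\n", (model == "" ? "n/a" : model)
            printf "siblings: %s\n", (sib == "" ? "n/a" : sib)
            printf "ht flag: %s\n", (ht == "" ? "n/a" : ht)
        }' "$cpuinfo_file"
}

cpu_summary || echo "cannot read cpuinfo"
```

Together with the output of rpm -q oprofile kernel-smp and uname -r from each machine, this gives a short list of values to diff.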
Uli, is this bug still occurring?

Uli, is this bug still occurring with the current RHL9 kernels?

Sorry for the delay. I cannot reproduce the problem anymore.

Assumed to be fixed. Closing as such.
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030203

Description of problem:
One combination of event counters crashes my UP P4 HT system sooner or later. Until 2.4.20-2.40 every use of MEMORY_CANCEL seemed to be fatal. With 2.4.20-2.41 I see the problem only after some preparation. I'm using oprofile-0.4-40 but earlier versions had the problem, too.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
Run the following steps:
rm -f /root/.oprofile/daemonrc
rm -f /var/lib/oprofile/lock
rm -f /var/lib/oprofile/samples/*
opcontrol --init
opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15
opcontrol --start
<Run some test program. Mine was threaded and created 1,000,000 threads in sequence.>
opcontrol --stop
oprofpp -c 3 -l <SOMEDSO>
opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL --ctr2-count=240000 --ctr2-unit-mask=8
opcontrol --start
^^^^ Sometimes this opcontrol call crashes the kernel
<Run the program again>
opcontrol --stop
rm -f /root/.oprofile/daemonrc
rm -f /var/lib/oprofile/lock
rm -f /var/lib/oprofile/samples/*
opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL --ctr2-count=240000 --ctr2-unit-mask=8
opcontrol --start
^^^^ I never managed to get past this last opcontrol call.

Actual Results:
Kernel bug:
kernel BUG at ../../../drivers/oprofile/cpu_buffer.c:95!
invalid operand: 0000
oprofile nfs lockd sunrpc parport_pc lp parport autofs e100 ipt_REJECT iptable_
CPU:    1
EIP:    0060:[<d0933554>]    Not tainted
EFLAGS: 00010046
EIP is at oprofile_add_sample [oprofile] 0xc4 (2.4.20-2.41smp)
eax: 00000000   ebx: d0938700   ecx: 00000080   edx: 00000002
esi: 0806d880   edi: 00000002   ebp: 00000000   esp: caaf1f68
ds: 0068   es: 0068   ss: 0068
Process awk (pid: 26649, stackpage=caaf1000)
Stack: 00000000 00000002 00000048 d0935d61 d09396f8 00000004 cb333620 00000001
       caaf1fc4 0808bac8 08089eb6 bffff628 d0934ebd 00000001 d0939d88 caaf1fc4
       c010a682 caaf1fc4 00000001 0808bac4 c0109a4a caaf1fc4 00000000 0808bac4
Call Trace:
[<d0935d61>] p4_check_ctrs [oprofile] 0xc1 (0xcaaf1f74))
[<d09396f8>] counter_config [oprofile] 0x38 (0xcaaf1f78))
[<d0934ebd>] nmi_callback [oprofile] 0x2d (0xcaaf1f98))
[<d0939d88>] cpu_msrs [oprofile] 0x5e8 (0xcaaf1fa0))
[<c010a682>] do_nmi [kernel] 0x22 (0xcaaf1fa8))
[<c0109a4a>] nmi [kernel] 0x1e (0xcaaf1fb8))

Expected Results:
Profiling works.

Additional info:
I can provide you a version of the test program. It's in my home dir on devserv.
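
The steps to reproduce can be collected into one script along these lines. This is a sketch reconstructed from the description, not the content of either attachment. The RUN, TEST_PROG, DSO and STOP_FLAG variables are hypothetical knobs added here; the test program and the DSO are left as placeholders because the report does not name them. By default the script only prints the commands. Running it with RUN=1 needs root and, per the report, may crash the kernel.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps from the report.
# Hypothetical knobs (not from the report): RUN, TEST_PROG, DSO, STOP_FLAG.
RUN="${RUN:-0}"                            # 0 = only print commands, 1 = execute
TEST_PROG="${TEST_PROG:-./test-program}"   # placeholder for the unnamed test program
DSO="${DSO:-SOMEDSO}"                      # placeholder for the DSO given to oprofpp
STOP_FLAG="${STOP_FLAG:---stop}"           # --shutdown tries the variant asked about

CTR3="--ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15"
CTR2="--ctr2-event=MEMORY_CANCEL --ctr2-count=240000 --ctr2-unit-mask=8"
VMLINUX="--vmlinux=/boot/vmlinux-$(uname -r)"

# Print a command, and execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "$RUN" = "1" ]; then "$@"; fi
}

clean_state() {
    run rm -f /root/.oprofile/daemonrc
    run rm -f /var/lib/oprofile/lock
    run rm -f /var/lib/oprofile/samples/*
}

reproduce() {
    clean_state
    run opcontrol --init
    # CTR3/CTR2 are unquoted on purpose so each option becomes its own argument.
    run opcontrol --setup "$VMLINUX" $CTR3
    run opcontrol --start
    run "$TEST_PROG"
    run opcontrol "$STOP_FLAG"
    run oprofpp -c 3 -l "$DSO"
    run opcontrol --setup "$VMLINUX" $CTR3 $CTR2
    run opcontrol --start        # reported to crash the kernel sometimes
    run "$TEST_PROG"
    run opcontrol "$STOP_FLAG"
    clean_state
    run opcontrol --setup "$VMLINUX" $CTR3 $CTR2
    run opcontrol --start        # reported to crash every time
}

reproduce
```

Setting STOP_FLAG=--shutdown swaps "opcontrol --stop" for "opcontrol --shutdown", which is one half of the variant asked about in the comments; the rm -f cleanup is still performed.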