From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030203

Description of problem:
One combination of event counters crashes my UP P4 HT system sooner or later. Up to 2.4.20-2.40, every use of MEMORY_CANCEL seemed to be fatal. With 2.4.20-2.41 I see the problem only after some preparation. I'm using oprofile-0.4-40, but earlier versions had the problem, too.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
Run the following steps:

    rm -f /root/.oprofile/daemonrc
    rm -f /var/lib/oprofile/lock
    rm -f /var/lib/oprofile/samples/*
    opcontrol --init
    opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15
    opcontrol --start
    <Run some test program. Mine was threaded and created 1,000,000 threads in sequence.>
    opcontrol --stop
    oprofpp -c 3 -l <SOMEDSO>
    opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL --ctr2-count=240000 --ctr2-unit-mask=8
    opcontrol --start
    ^^^^ Sometimes this opcontrol call crashes the kernel
    <Run the program again>
    opcontrol --stop
    rm -f /root/.oprofile/daemonrc
    rm -f /var/lib/oprofile/lock
    rm -f /var/lib/oprofile/samples/*
    opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL --ctr2-count=240000 --ctr2-unit-mask=8
    opcontrol --start
    ^^^^ I never managed to get past this last opcontrol call.

Actual Results:
Kernel bug:
kernel BUG at ../../../drivers/oprofile/cpu_buffer.c:95!
invalid operand: 0000
oprofile nfs lockd sunrpc parport_pc lp parport autofs e100 ipt_REJECT iptable_
CPU:    1
EIP:    0060:[<d0933554>]    Not tainted
EFLAGS: 00010046
EIP is at oprofile_add_sample [oprofile] 0xc4 (2.4.20-2.41smp)
eax: 00000000   ebx: d0938700   ecx: 00000080   edx: 00000002
esi: 0806d880   edi: 00000002   ebp: 00000000   esp: caaf1f68
ds: 0068   es: 0068   ss: 0068
Process awk (pid: 26649, stackpage=caaf1000)
Stack: 00000000 00000002 00000048 d0935d61 d09396f8 00000004 cb333620 00000001
       caaf1fc4 0808bac8 08089eb6 bffff628 d0934ebd 00000001 d0939d88 caaf1fc4
       c010a682 caaf1fc4 00000001 0808bac4 c0109a4a caaf1fc4 00000000 0808bac4
Call Trace:
[<d0935d61>] p4_check_ctrs [oprofile] 0xc1 (0xcaaf1f74))
[<d09396f8>] counter_config [oprofile] 0x38 (0xcaaf1f78))
[<d0934ebd>] nmi_callback [oprofile] 0x2d (0xcaaf1f98))
[<d0939d88>] cpu_msrs [oprofile] 0x5e8 (0xcaaf1fa0))
[<c010a682>] do_nmi [kernel] 0x22 (0xcaaf1fa8))
[<c0109a4a>] nmi [kernel] 0x1e (0xcaaf1fb8))

Expected Results:
Profiling works.

Additional info:
I can provide you a version of the test program. It's in my home dir on devserv.
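For anyone trying to replicate this, the steps in the report can be collected into one script. This is a minimal sketch, not the reporter's own script: the function names, the DRY_RUN switch, and the TEST_PROGRAM placeholder are assumptions added for illustration, while the opcontrol options are copied from the steps in the report. The oprofpp step is left out because it only reads samples and needs a DSO name the report does not give. With DRY_RUN=1 (the default here) the commands are printed, not executed.

```shell
#!/bin/sh
# Sketch of the reproduction sequence from the report.
# DRY_RUN, TEST_PROGRAM and all function names are illustrative assumptions.
DRY_RUN=${DRY_RUN:-1}
TEST_PROGRAM=${TEST_PROGRAM:-./thread-test}   # hypothetical test program

VMLINUX="/boot/vmlinux-$(uname -r)"
CTR3="--ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15"
CTR2="--ctr2-event=MEMORY_CANCEL --ctr2-count=240000 --ctr2-unit-mask=8"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove the daemon configuration, the lock file and old samples.
clean_state() {
    run rm -f /root/.oprofile/daemonrc
    run rm -f /var/lib/oprofile/lock
    run sh -c 'rm -f /var/lib/oprofile/samples/*'
}

# One setup/start/test/stop cycle; the arguments are the counter options.
profile_once() {
    run opcontrol --setup --vmlinux="$VMLINUX" "$@"
    run opcontrol --start
    run "$TEST_PROGRAM"
    run opcontrol --stop
}

reproduce() {
    clean_state
    run opcontrol --init
    # $CTR3 and $CTR2 are left unquoted on purpose so they split into options.
    profile_once $CTR3          # pass 1: INSTR_RETIRED only
    profile_once $CTR3 $CTR2    # pass 2: --start sometimes crashed here
    clean_state
    profile_once $CTR3 $CTR2    # pass 3: --start reportedly never returned
}
```

Calling `reproduce` after loading the script prints the full command sequence; setting DRY_RUN=0 as root would actually run it, which on an affected kernel may take the machine down.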
Does the last opcontrol --setup/--start work if the previous opcontrol --setup/--start calls are not run?
Using oprofile-0.4-41, I encounter a different problem. Removing /var/lib/oprofile/lock and starting with a new setup yields:

    [root@dhcp59-189 SPECS]# opcontrol --start
    Failed to open profile device: Device or resource busy
    Couldn't start oprofiled.
    Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages

It appears that the old daemon is still accessing one of the special files, so the new daemon cannot open it (probably /dev/oprofile/buffer).

Can you reproduce the problem with "opcontrol --shutdown" instead of "opcontrol --stop" and without the "rm -f ..."?

-Will
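The suggested variant can be sketched as follows: shut the daemon down completely, confirm that no old oprofiled process is left, and only then start again. This is an illustration under stated assumptions: the OPCONTROL override and both function names are hypothetical helpers, and the check relies on pgrep being installed. Only "opcontrol --shutdown" and "opcontrol --start" come from the discussion itself.

```shell
#!/bin/sh
# OPCONTROL is an illustrative override so the sequence can be dry-run
# (for example OPCONTROL=echo); it is not an oprofile feature.
OPCONTROL=${OPCONTROL:-opcontrol}

# Succeeds if a process named exactly "oprofiled" exists.
# Assumes pgrep (from procps) is available.
daemon_running() {
    pgrep -x oprofiled >/dev/null 2>&1
}

# Shut the daemon down fully, then start only if no old daemon remains.
restart_profiler() {
    $OPCONTROL --shutdown
    if daemon_running; then
        echo "oprofiled is still running; not starting a second daemon" >&2
        return 1
    fi
    $OPCONTROL --start
}
```

If restart_profiler refuses to start, that would be consistent with the theory that a leftover daemon is what makes the profile device busy, though it does not prove it.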
Uli, do you have a watchdog timer set up on this machine?
> Uli, do you have a watchdog timer set up on this machine?

I usually have nmi_watchdog defined, yes.
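Since the crash happens in the NMI path (nmi_callback in the trace), it may help to record the exact watchdog setting on each test machine. A minimal sketch, assuming the watchdog is enabled through an nmi_watchdog= option on the kernel command line; the function name is made up for this example.

```shell
#!/bin/sh
# Print the nmi_watchdog value found in a kernel command line, or "unset".
# nmi_watchdog_setting is a hypothetical helper name.
nmi_watchdog_setting() {
    # $1 is deliberately unquoted so the command line splits into words.
    for arg in $1; do
        case "$arg" in
            nmi_watchdog=*)
                echo "${arg#nmi_watchdog=}"
                return 0
                ;;
        esac
    done
    echo unset
}

# Usage on a live system:
#   nmi_watchdog_setting "$(cat /proc/cmdline)"
```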
Created attachment 90077 [details]
script to crash kernel via oprofile

Executing this script crashes my UP P4 HT machine reliably (100%) during the last opcontrol --start.
Created attachment 90087 [details]
S script revised to use "rm -rf /var/lib/oprofile/samples/*"

I tried the attached script S (S2 merely does a better job of cleaning out /var/lib/oprofile/samples/*). The kernel did not crash the machine. I did get the following output:

    [root@dhcp59-189 root]# ./S
    Using log file /var/lib/oprofile/oprofiled.log
    Daemon started.
    Profiler running.
    Stopping profiling.
    Profiler running.
    Stopping profiling.
    Daemon not running
    Failed to open profile device: Device or resource busy
    Couldn't start oprofiled.
    Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages

What it is doing is not right, but it isn't crashing.

Here is the configuration information. This is a Dell Precision 430 running a freshly installed GinGin-re2011.nightly. What differences are there between the machine that the script crashes on and the machine I am using to try to replicate the problem? Here are the details about the machine and software:

rpm packages:
    oprofile-0.4-41
    kernel-smp-2.4.20-2.44

The processor is:

    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 2
    model name      : Intel(R) Xeon(TM) CPU 2.66GHz
    stepping        : 7
    cpu MHz         : 2657.857
    cache size      : 512 KB
    physical id     : 0
    siblings        : 2
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 2
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
    bogomips        : 5308.41
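To compare the two machines, the same few fields could be pulled from /proc/cpuinfo on each. The following is a rough sketch, not an established tool: cpu_summary is a hypothetical helper that reads cpuinfo-style text on standard input and reports the first model name, the siblings count, and whether the ht flag is present.

```shell
#!/bin/sh
# Summarize cpuinfo-style "key : value" text read from standard input.
# cpu_summary is an illustrative helper name; only the first CPU entry
# for each field is reported.
cpu_summary() {
    awk -F': *' '
        $1 ~ /^model name[ \t]*$/ && !m { print "model: " $2; m = 1 }
        $1 ~ /^siblings[ \t]*$/   && !s { print "siblings: " $2; s = 1 }
        $1 ~ /^flags[ \t]*$/      && !f {
            # Pad with spaces so "ht" only matches as a whole flag.
            print "ht: " (((" " $2 " ") ~ / ht /) ? "yes" : "no")
            f = 1
        }
    '
}

# Usage on a live system, together with the package versions:
#   cpu_summary < /proc/cpuinfo
#   rpm -q oprofile kernel-smp
```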
Uli, is this bug still occurring?
Uli, is this bug still occurring with the current RHL9 kernels?
Sorry for the delay. I cannot reproduce the problem anymore. Assumed to be fixed. Closing as such.