I am collecting profiling samples for a high-speed (10 GbE) network running netperf. After a fairly short period of time, oprofiled crashes with:

oprofiled[30260]: segfault at 00002ac467eee058 rip 0000000000409060 rsp 00007fff42ce9410 error 4

or

oprofiled[32510]: segfault at 0000003d85fd6ac4 rip 00000000004094e7 rsp 00007fff2280de90 error 4

I am using all four monitoring registers on an AMD64 CPU running the recently released 2.6.18-92 kernel and the oprofile package 0.9.3-16. My configuration is:

ahecox@x4600 ~ $ sudo opcontrol --status
Daemon running: pid 1842
Event 0: INSTRUCTION_CACHE_MISSES:1000000:0:1:1
Event 1: DATA_CACHE_MISSES:1000000:0:1:1
Event 2: L2_CACHE_MISS:1000000:7:1:1
Event 3: CPU_CLK_UNHALTED:1000000:0:1:1
Separate options: none
vmlinux file: /usr/lib/debug/lib/modules/2.6.18-92.el5/vmlinux
Image filter: none
Call-graph depth: 0

I have since reduced my sampling rate by one order of magnitude and have not yet experienced the issue.
It would be good to get a backtrace showing where the crash occurs. Install the oprofile-debuginfo rpm:

yum install oprofile-debuginfo --enablerepo=rhel-debuginfo

Start the daemon without starting profiling, find the process, and attach gdb to it:

# opcontrol --start-daemon
# pgrep oprofile
25689
# gdb
(gdb) attach 25689
...
(gdb) c

Leave gdb running and start the actual profiling in another window with:

# opcontrol --start

(you may need to continue gdb). When oprofiled crashes:

(gdb) where

That should give us some idea of where the crash is occurring.
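If keeping an interactive gdb session open is inconvenient, roughly the same information can be captured non-interactively. This is just a sketch, assuming a gdb new enough to support -p/-batch/-ex and that the daemon shows up under the process name oprofiled:

# Attach to the daemon, resume it, and print a backtrace once it faults
gdb -p $(pgrep oprofiled) -batch -ex continue -ex where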
Which version of oprofile?

rpm -q oprofile
The machine has a number of processor cores (16). Given that sampling rate, this might be filling up the buffer used to move data between kernel and user space rather quickly. It would be useful to check whether using a larger buffer eliminates the problem. Something like the following command would instruct oprofile to use a larger buffer:

sudo opcontrol --setup \
  --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux \
  -e INSTRUCTION_CACHE_MISSES:1000000:0:1:1 \
  -e DATA_CACHE_MISSES:1000000:0:1:1 \
  -e L2_CACHE_MISS:1000000:7:1:1 \
  -e CPU_CLK_UNHALTED:1000000:0:1:1 \
  --buffer-size=524288 --buffer-watershed=131072

If the problem continues, try a larger buffer: "--buffer-size=1048576 --buffer-watershed=262144".

-Will
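As a sanity check, the values the kernel actually picked up can be read back from the oprofilefs files after the --setup; a sketch, assuming oprofilefs is mounted at its usual /dev/oprofile location:

# Buffer parameters currently in effect in the kernel driver
cat /dev/oprofile/buffer_size
cat /dev/oprofile/buffer_watershed
cat /dev/oprofile/cpu_buffer_size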
Could we check whether this is due to running out of buffer space, as suggested in comment #5?
Did some more investigation on this. Got the oprofile debuginfo to see what rip 0x409060 relates to, and used nm and sort to find out which function it falls in. It looks like it is in odb_sync:

0000000000409050 T odb_sync
0000000000409080 T odb_close

Disassembly of the rpm-installed /usr/bin/oprofiled (** marks the faulting instruction):

  409050:  53                    push   %rbx
  409051:  48 8b 1f              mov    (%rdi),%rbx
  409054:  48 85 db              test   %rbx,%rbx
  409057:  74 1f                 je     409078 <fflush@plt+0x72a8>
  409059:  48 8b 43 10           mov    0x10(%rbx),%rax
  40905d:  48 89 df              mov    %rbx,%rdi
**409060:  8b 30                 mov    (%rax),%esi
  409062:  e8 99 ff ff ff        callq  409000 <fflush@plt+0x7230>
  409067:  48 8b 7b 28           mov    0x28(%rbx),%rdi
  40906b:  89 c6                 mov    %eax,%esi
  40906d:  ba 01 00 00 00        mov    $0x1,%edx
  409072:  5b                    pop    %rbx
  409073:  e9 b8 8a ff ff        jmpq   401b30 <msync@plt>
  409078:  5b                    pop    %rbx
  409079:  c3                    retq

Looked at the equivalent in a local rpmbuild to better understand it (** marks the faulting instruction):

0000000000409070 <odb_sync>:
  409070:  53                    push   %rbx
  409071:  48 8b 1f              mov    (%rdi),%rbx
  409074:  48 85 db              test   %rbx,%rbx
  409077:  74 1f                 je     409098 <odb_sync+0x28>
  409079:  48 8b 43 10           mov    0x10(%rbx),%rax
  40907d:  48 89 df              mov    %rbx,%rdi
**409080:  8b 30                 mov    (%rax),%esi
  409082:  e8 99 ff ff ff        callq  409020 <tables_size>
  409087:  48 8b 7b 28           mov    0x28(%rbx),%rdi
  40908b:  89 c6                 mov    %eax,%esi
  40908d:  ba 01 00 00 00        mov    $0x1,%edx
  409092:  5b                    pop    %rbx
  409093:  e9 98 8a ff ff        jmpq   401b30 <msync@plt>
  409098:  5b                    pop    %rbx
  409099:  c3                    retq

The following is the odb_sync() function from ./libdb/db_manage.c:

void odb_sync(odb_t const * odb)
{
	odb_data_t * data = odb->data;
	size_t size;

	if (!data)
		return;

	size = tables_size(data, data->descr->size);
	msync(data->base_memory, size, MS_ASYNC);
}

The faulting instruction is fetching data->descr->size: %rbx holds data, the load from 0x10(%rbx) puts data->descr into %rax, and the mov (%rax),%esi that faults dereferences it. So either the value of data or of data->descr is getting corrupted.
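For reference, the symbol lookup and disassembly above can be reproduced with something like the following; a sketch, assuming the usual RHEL debuginfo layout (the grep patterns are just for illustration):

# Symbols sorted by address; rip 0x409060 lands between odb_sync and odb_close
nm /usr/lib/debug/usr/bin/oprofiled.debug | sort | grep -w -A1 odb_sync

# Disassemble the installed binary around the faulting address
objdump -d /usr/bin/oprofiled | grep -A16 '409050:'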
Could not reproduce; this has not been a problem recently.