Bug 448583 - oprofiled segfault
oprofiled segfault
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: oprofile (Show other bugs)
5.2
x86_64 Linux
medium Severity medium
: rc
: ---
Assigned To: William Cohen
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2008-05-27 14:05 EDT by Andrew Hecox
Modified: 2013-03-11 13:44 EDT (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-11 13:44:29 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Andrew Hecox 2008-05-27 14:05:00 EDT
I am collecting profiling samples for a high-speed (10 GbE) network running
netperf. After a fairly short period of time, oprofiled crashes with: 

oprofiled[30260]: segfault at 00002ac467eee058 rip 0000000000409060 rsp
00007fff42ce9410 error 4

or

oprofiled[32510]: segfault at 0000003d85fd6ac4 rip 00000000004094e7 rsp
00007fff2280de90 error 4


I am using all 4 monitoring registers on an AMD64 CPU running the recently
released 2.6.18-92 and the oprofile package 0.9.3-16. My configuration is:

ahecox@x4600 ~ $ sudo opcontrol --status
Daemon running: pid 1842
Event 0: INSTRUCTION_CACHE_MISSES:1000000:0:1:1
Event 1: DATA_CACHE_MISSES:1000000:0:1:1
Event 2: L2_CACHE_MISS:1000000:7:1:1
Event 3: CPU_CLK_UNHALTED:1000000:0:1:1
Separate options: none
vmlinux file: /usr/lib/debug/lib/modules/2.6.18-92.el5/vmlinux
Image filter: none
Call-graph depth: 0

I've reduced my sampling rate via one order of magnitude and have not yet
experienced the issue.
Comment 2 William Cohen 2008-06-05 11:06:42 EDT
It would be good to get a trace back showing where the crash is occuring.

Install the oprofile-debuginfo rpm:

yum install oprofile-debuginfo --enablerepo=rhel-debuginfo

Start the daemon without starting profiling, find the process and attach gdb to it:

# opcontrol --start-daemon
# pgrep oprofile
25689
# gdb
(gdb) attach 25689
...
(gdb) c

leave gdb running and start the actual profiling in another window with:

# opcontrol --start
(may need to continue gdb)

when oprofiled crashes:

(gdb) where

That should give us some idea where the crash is occurring.
Comment 3 William Cohen 2008-06-10 10:07:04 EDT
Which version of oprofile? rpm -q oprofile
Comment 5 William Cohen 2008-07-10 10:39:22 EDT
The machine has a number of processor cores (16). Given that sampling rate this
might be filling up the buffer to move data between kernel and user space rather
quickly. It would be useful to check to see if using a larger buffer eliminates
the problem. Something like the following command would instruct oprofile to use
a larger buffer:

sudo opcontrol --setup \
--vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux \
-e INSTRUCTION_CACHE_MISSES:1000000:0:1:1 \
-e DATA_CACHE_MISSES:1000000:0:1:1 \
-e L2_CACHE_MISS:1000000:7:1:1 \
-e CPU_CLK_UNHALTED:1000000:0:1:1 \
--buffer-size=524288 --buffer-watershed=131072

If problem continues try larger buffer "--buffer-size=1048576
--buffer-watershed=262144"

-Will
Comment 6 William Cohen 2008-08-27 17:56:13 EDT
Could we check that this is due to running out of buffer space as in comment #5.
Comment 7 William Cohen 2008-09-02 21:01:09 EDT
Did some more investigation on this.


Got the oprofile debuginfo to see what rip:0x0409060 relates to. Used
nm and sort to find out which function in. Look like in opb_sync:

0000000000409050 T odb_sync
0000000000409080 T odb_close

Disassembly of the rpm installed /usr/bin/oprfiled (** marks problem
instruction):

  409050:	53                   	push   %rbx
  409051:	48 8b 1f             	mov    (%rdi),%rbx
  409054:	48 85 db             	test   %rbx,%rbx
  409057:	74 1f                	je     409078 <fflush@plt+0x72a8>
  409059:	48 8b 43 10          	mov    0x10(%rbx),%rax
  40905d:	48 89 df             	mov    %rbx,%rdi
**409060:	8b 30                	mov    (%rax),%esi
  409062:	e8 99 ff ff ff       	callq  409000 <fflush@plt+0x7230>
  409067:	48 8b 7b 28          	mov    0x28(%rbx),%rdi
  40906b:	89 c6                	mov    %eax,%esi
  40906d:	ba 01 00 00 00       	mov    $0x1,%edx
  409072:	5b                   	pop    %rbx
  409073:	e9 b8 8a ff ff       	jmpq   401b30 <msync@plt>
  409078:	5b                   	pop    %rbx
  409079:	c3                   	retq  

Looked for the equivalent in the local rpmbuild to better understand
(** marks problem instruction):

0000000000409070 <odb_sync>:
  409070:	53                   	push   %rbx
  409071:	48 8b 1f             	mov    (%rdi),%rbx
  409074:	48 85 db             	test   %rbx,%rbx
  409077:	74 1f                	je     409098 <odb_sync+0x28>
  409079:	48 8b 43 10          	mov    0x10(%rbx),%rax
  40907d:	48 89 df             	mov    %rbx,%rdi
**409080:	8b 30                	mov    (%rax),%esi
  409082:	e8 99 ff ff ff       	callq  409020 <tables_size>
  409087:	48 8b 7b 28          	mov    0x28(%rbx),%rdi
  40908b:	89 c6                	mov    %eax,%esi
  40908d:	ba 01 00 00 00       	mov    $0x1,%edx
  409092:	5b                   	pop    %rbx
  409093:	e9 98 8a ff ff       	jmpq   401b30 <msync@plt>
  409098:	5b                   	pop    %rbx
  409099:	c3                   	retq 

The following is the odb_sync() function from ./libdb/db_manage.c:

void odb_sync(odb_t const * odb)
{
	odb_data_t * data = odb->data;
	size_t size;

	if (!data)
		return;

	size = tables_size(data, data->descr->size);
	msync(data->base_memory, size, MS_ASYNC);
}

Either the value of data data or data->descr getting corrupted.
Comment 18 Frank Ch. Eigler 2013-03-11 13:44:29 EDT
Could not reproduce; not recently problematic.

Note You need to log in before you can comment on or make changes to this bug.