448583 – oprofiled segfault

Bug 448583 - oprofiled segfault

Summary: oprofiled segfault

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	oprofile
Sub Component:
Version:	5.2
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	William Cohen
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-05-27 18:05 UTC by Andrew Hecox
Modified:	2013-03-11 17:44 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-03-11 17:44:29 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Andrew Hecox 2008-05-27 18:05:00 UTC

I am collecting profiling samples for a high-speed (10 GbE) network running
netperf. After a fairly short period of time, oprofiled crashes with: 

oprofiled[30260]: segfault at 00002ac467eee058 rip 0000000000409060 rsp
00007fff42ce9410 error 4

or

oprofiled[32510]: segfault at 0000003d85fd6ac4 rip 00000000004094e7 rsp
00007fff2280de90 error 4


I am using all 4 monitoring registers on an AMD64 CPU running the recently
released 2.6.18-92 and the oprofile package 0.9.3-16. My configuration is:

ahecox@x4600 ~ $ sudo opcontrol --status
Daemon running: pid 1842
Event 0: INSTRUCTION_CACHE_MISSES:1000000:0:1:1
Event 1: DATA_CACHE_MISSES:1000000:0:1:1
Event 2: L2_CACHE_MISS:1000000:7:1:1
Event 3: CPU_CLK_UNHALTED:1000000:0:1:1
Separate options: none
vmlinux file: /usr/lib/debug/lib/modules/2.6.18-92.el5/vmlinux
Image filter: none
Call-graph depth: 0

I've reduced my sampling rate via one order of magnitude and have not yet
experienced the issue.

Comment 2 William Cohen 2008-06-05 15:06:42 UTC

It would be good to get a trace back showing where the crash is occuring.

Install the oprofile-debuginfo rpm:

yum install oprofile-debuginfo --enablerepo=rhel-debuginfo

Start the daemon without starting profiling, find the process and attach gdb to it:

# opcontrol --start-daemon
# pgrep oprofile
25689
# gdb
(gdb) attach 25689
...
(gdb) c

leave gdb running and start the actual profiling in another window with:

# opcontrol --start
(may need to continue gdb)

when oprofiled crashes:

(gdb) where

That should give us some idea where the crash is occurring.

Comment 3 William Cohen 2008-06-10 14:07:04 UTC

Which version of oprofile? rpm -q oprofile

Comment 5 William Cohen 2008-07-10 14:39:22 UTC

The machine has a number of processor cores (16). Given that sampling rate this
might be filling up the buffer to move data between kernel and user space rather
quickly. It would be useful to check to see if using a larger buffer eliminates
the problem. Something like the following command would instruct oprofile to use
a larger buffer:

sudo opcontrol --setup \
--vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux \
-e INSTRUCTION_CACHE_MISSES:1000000:0:1:1 \
-e DATA_CACHE_MISSES:1000000:0:1:1 \
-e L2_CACHE_MISS:1000000:7:1:1 \
-e CPU_CLK_UNHALTED:1000000:0:1:1 \
--buffer-size=524288 --buffer-watershed=131072

If problem continues try larger buffer "--buffer-size=1048576
--buffer-watershed=262144"

-Will

Comment 6 William Cohen 2008-08-27 21:56:13 UTC

Could we check that this is due to running out of buffer space as in comment #5.

Comment 7 William Cohen 2008-09-03 01:01:09 UTC

Did some more investigation on this.


Got the oprofile debuginfo to see what rip:0x0409060 relates to. Used
nm and sort to find out which function in. Look like in opb_sync:

0000000000409050 T odb_sync
0000000000409080 T odb_close

Disassembly of the rpm installed /usr/bin/oprfiled (** marks problem
instruction):

  409050:	53                   	push   %rbx
  409051:	48 8b 1f             	mov    (%rdi),%rbx
  409054:	48 85 db             	test   %rbx,%rbx
  409057:	74 1f                	je     409078 <fflush@plt+0x72a8>
  409059:	48 8b 43 10          	mov    0x10(%rbx),%rax
  40905d:	48 89 df             	mov    %rbx,%rdi
**409060:	8b 30                	mov    (%rax),%esi
  409062:	e8 99 ff ff ff       	callq  409000 <fflush@plt+0x7230>
  409067:	48 8b 7b 28          	mov    0x28(%rbx),%rdi
  40906b:	89 c6                	mov    %eax,%esi
  40906d:	ba 01 00 00 00       	mov    $0x1,%edx
  409072:	5b                   	pop    %rbx
  409073:	e9 b8 8a ff ff       	jmpq   401b30 <msync@plt>
  409078:	5b                   	pop    %rbx
  409079:	c3                   	retq  

Looked for the equivalent in the local rpmbuild to better understand
(** marks problem instruction):

0000000000409070 <odb_sync>:
  409070:	53                   	push   %rbx
  409071:	48 8b 1f             	mov    (%rdi),%rbx
  409074:	48 85 db             	test   %rbx,%rbx
  409077:	74 1f                	je     409098 <odb_sync+0x28>
  409079:	48 8b 43 10          	mov    0x10(%rbx),%rax
  40907d:	48 89 df             	mov    %rbx,%rdi
**409080:	8b 30                	mov    (%rax),%esi
  409082:	e8 99 ff ff ff       	callq  409020 <tables_size>
  409087:	48 8b 7b 28          	mov    0x28(%rbx),%rdi
  40908b:	89 c6                	mov    %eax,%esi
  40908d:	ba 01 00 00 00       	mov    $0x1,%edx
  409092:	5b                   	pop    %rbx
  409093:	e9 98 8a ff ff       	jmpq   401b30 <msync@plt>
  409098:	5b                   	pop    %rbx
  409099:	c3                   	retq 

The following is the odb_sync() function from ./libdb/db_manage.c:

void odb_sync(odb_t const * odb)
{
	odb_data_t * data = odb->data;
	size_t size;

	if (!data)
		return;

	size = tables_size(data, data->descr->size);
	msync(data->base_memory, size, MS_ASYNC);
}

Either the value of data data or data->descr getting corrupted.

Comment 18 Frank Ch. Eigler 2013-03-11 17:44:29 UTC

Could not reproduce; not recently problematic.

Note You need to log in before you can comment on or make changes to this bug.