Bug 735109 - gatherd: segfault in libmetricUnixProcess.so
Summary: gatherd: segfault in libmetricUnixProcess.so
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: sblim-gather
Version: 6.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Assignee: Vitezslav Crhonek
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-09-01 14:09 UTC by Milos Malik
Modified: 2016-08-08 13:16 UTC (History)

Fixed In Version: sblim-gather-2.2.3-2.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-06 11:57:00 UTC


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2011:1593 | normal | SHIPPED_LIVE | sblim-gather bug fix update | 2011-12-06 00:38:46 UTC

Description Milos Malik 2011-09-01 14:09:37 UTC
Description of problem:


Version-Release number of selected component (if applicable):
sblim-gather-2.2.3-1.el6

How reproducible:
always

Steps to Reproduce:
# service gatherer status
gatherd is stopped
reposd is stopped
# service gatherer start
Starting gatherd:                                          [  OK  ]
Starting reposd:                                           [  OK  ]
# service gatherer status
gatherd (pid 16358) is running...
reposd (pid 16363) is running...
# gatherctl
help
	h		print this help message
	s		status
	i		init
	t		terminate
	b		start sampling
	e		stop sampling
	l plugin	load plugin
	u plugin	unload plugin
	v plugin	view/list metrics for plugin
	q		quit
	k		kill daemon
	d		start daemon
	c		local trace
s
Status initialized and sampling, 8 plugins and 20 metrics. 
s
Daemon not reachable.
q
# service gatherer status
gatherd dead but subsys locked
reposd (pid 16363) is running...
# 

Actual results:
gatherd[14802]: segfault at 0 ip 0052d5e2 sp b6e66200 error 6 in libmetricUnixProcess.so[52b000+3000]
gatherd[16059]: segfault at 0 ip 005b55e2 sp b6d97200 error 6 in libmetricUnixProcess.so[5b3000+3000]
gatherd[16370]: segfault at 0 ip 001645e2 sp b6e48200 error 6 in libmetricUnixProcess.so[162000+3000]

Expected results:
* no segfaults
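
The kernel lines under "Actual results" can be decoded to locate the crash inside the plugin. The following is a minimal sketch, not part of sblim-gather: the helper name decode_segfault and the regular expression are illustrative assumptions, and the error-code interpretation assumes the usual x86 page-fault bits (bit 0 = page present, bit 1 = write access, bit 2 = user mode). Subtracting the library's load base from the instruction pointer gives an offset that does not depend on where the library was mapped.

```python
import re

# Matches kernel segfault lines in the form quoted in this report.
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: return decoded fields, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    err = int(m.group("err"))
    return {
        "process": m.group("proc"),
        "pid": int(m.group("pid")),
        "library": m.group("lib"),
        "fault_address": int(m.group("addr"), 16),
        "offset": ip - base,               # position of the faulting instruction in the library
        "page_present": bool(err & 1),     # assumed x86 bit 0
        "write": bool(err & 2),            # assumed x86 bit 1
        "user_mode": bool(err & 4),        # assumed x86 bit 2
    }


info = decode_segfault(
    "gatherd[14802]: segfault at 0 ip 0052d5e2 sp b6e66200 error 6 "
    "in libmetricUnixProcess.so[52b000+3000]"
)
print(hex(info["offset"]), info["write"], info["user_mode"])
```

Applied to the three lines above, the offset is 0x25e2 every time, which suggests the same instruction faults on each run. With "segfault at 0" and error 6, this reads as a user-mode write through a NULL pointer. Given matching debuginfo, the offset could be handed to a tool such as addr2line to find the source line.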

Comment 4 Karel Volný 2011-09-08 16:14:32 UTC
I need to know the exact condition that triggers the segfault. At first it seemed that I could not reproduce it:

.live.[root@s390x-6s-v1 tps]# service gatherer status
gatherd is stopped
reposd is stopped
.live.[root@s390x-6s-v1 tps]# service gatherer start
Starting gatherd: [  OK  ]
Starting reposd: [  OK  ]
.live.[root@s390x-6s-v1 tps]# service gatherer status
gatherd (pid 51111) is running...
reposd (pid 51121) is running...
.live.[root@s390x-6s-v1 tps]# gatherctl
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
s
Status initialized and sampling, 12 plugins and 24 metrics. 
q
.live.[root@s390x-6s-v1 tps]# service gatherer status
gatherd (pid 51111) is running...
reposd (pid 51121) is running...
.live.[root@s390x-6s-v1 tps]# rpm -q sblim-gather
sblim-gather-2.2.3-1.el6.s390x



But then, after a while, the daemon suddenly died:

.live.[root@s390x-6s-v1 tps]# service gatherer status
gatherd dead but subsys locked
reposd (pid 51121) is running...


How long exactly do I have to wait, or what action should I take, to be sure that the new version doesn't crash?

I tried to run the checks periodically and it seems the daemon dies when the system clock hits a new minute (0 seconds) - for example:

Sep  8 12:10:00 x86-64-6s-m1 kernel: gatherd[13777]: segfault at 0 ip 00007f385b0622ba sp 00007f385a453cc0 error 6 in libmetricUnixProcess.so[7f385b060000+3000]

Is that the trigger? If so, is it enough to wait 61 seconds (in the worst case)?

Comment 5 Vitezslav Crhonek 2011-09-12 11:06:25 UTC
(In reply to comment #4)
> 
> now how long exactly I have to wait, or what action should I take, to be sure
> that the new version doesn't crash?
> 
> I tried to run the checks periodically and it seems the daemon dies when the
> system clock hits a new minute (0 seconds) - for example:
> 
> Sep  8 12:10:00 x86-64-6s-m1 kernel: gatherd[13777]: segfault at 0 ip
> 00007f385b0622ba sp 00007f385a453cc0 error 6 in
> libmetricUnixProcess.so[7f385b060000+3000]
> 
> - is that the reason, is it enough to wait 61 seconds (in the worst case) then?

The sampling function (metricRetrCPUTime) that caused the segfault is called by the daemon every 60 seconds, so waiting 61 seconds should be enough (or longer if you want to see it survive more iterations).
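
The waiting rule above could be automated roughly as follows. This is only a sketch under stated assumptions: the names stays_alive, status_shows_gatherd_running and gatherd_is_running are hypothetical, the 60-second interval comes from the comment above, and the status parsing assumes the "gatherd (pid N) is running..." output format shown in the transcripts in this report.

```python
import subprocess
import time


def status_shows_gatherd_running(status_output):
    """Return True if the status text reports gatherd as running."""
    return any(
        line.startswith("gatherd") and "is running" in line
        for line in status_output.splitlines()
    )


def gatherd_is_running():
    """Query the init script, as done by hand in the transcripts above."""
    result = subprocess.run(
        ["service", "gatherer", "status"], capture_output=True, text=True
    )
    return status_shows_gatherd_running(result.stdout)


def stays_alive(is_running, sample_interval=60, iterations=1, margin=1,
                poll=1.0, sleep=time.sleep, clock=time.monotonic):
    """Poll is_running() until `iterations` sampling intervals plus a margin pass.

    Returns False as soon as the daemon is seen dead, True otherwise.
    """
    deadline = clock() + sample_interval * iterations + margin
    while True:
        if not is_running():
            return False
        if clock() >= deadline:
            return True
        sleep(poll)
```

Calling stays_alive(gatherd_is_running) right after "service gatherer start" waits 61 seconds with the defaults; passing iterations=3, for example, covers three sampling rounds instead of one.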

Comment 7 errata-xmlrpc 2011-12-06 11:57:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1593.html

