Description of problem:

Version-Release number of selected component (if applicable):
sblim-gather-2.2.3-1.el6

How reproducible:
always

Steps to Reproduce:

# service gatherer status
gatherd is stopped
reposd is stopped
# service gatherer start
Starting gatherd:                                          [  OK  ]
Starting reposd:                                           [  OK  ]
# service gatherer status
gatherd (pid 16358) is running...
reposd (pid 16363) is running...
# gatherctl
help
        h               print this help message
        s               status
        i               init
        t               terminate
        b               start sampling
        e               stop sampling
        l plugin        load plugin
        u plugin        unload plugin
        v plugin        view/list metrics for plugin
        q               quit
        k               kill daemon
        d               start daemon
        c               local trace
s
Status initialized and sampling, 8 plugins and 20 metrics.
s
Daemon not reachable.
q
# service gatherer status
gatherd dead but subsys locked
reposd (pid 16363) is running...
#

Actual results:

gatherd[14802]: segfault at 0 ip 0052d5e2 sp b6e66200 error 6 in libmetricUnixProcess.so[52b000+3000]
gatherd[16059]: segfault at 0 ip 005b55e2 sp b6d97200 error 6 in libmetricUnixProcess.so[5b3000+3000]
gatherd[16370]: segfault at 0 ip 001645e2 sp b6e48200 error 6 in libmetricUnixProcess.so[162000+3000]

Expected results:
* no segfaults
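For reference, a minimal non-interactive sketch of the same reproduction, assuming the crash shows up within a couple of minutes of startup and that the kernel log is readable via dmesg (the sleep length is an arbitrary choice, not part of the original report):

#!/bin/bash
# Sketch: start the gatherer, wait a while, then check whether gatherd
# survived and whether the kernel logged a segfault for it.
service gatherer start
sleep 120   # assumed to be long enough for the crash to occur
service gatherer status
dmesg | grep 'gatherd\[.*\]: segfault' || echo "no gatherd segfault logged"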
I have to know the exact condition which triggers the segfault - at first it seemed I could not reproduce it:

.live.[root@s390x-6s-v1 tps]# service gatherer status
gatherd is stopped
reposd is stopped
.live.[root@s390x-6s-v1 tps]# service gatherer start
Starting gatherd:                                          [  OK  ]
Starting reposd:                                           [  OK  ]
.live.[root@s390x-6s-v1 tps]# service gatherer status
gatherd (pid 51111) is running...
reposd (pid 51121) is running...
.live.[root@s390x-6s-v1 tps]# gatherctl
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
s
Status initialized and sampling, 12 plugins and 24 metrics.
q
.live.[root@s390x-6s-v1 tps]# service gatherer status
gatherd (pid 51111) is running...
reposd (pid 51121) is running...
.live.[root@s390x-6s-v1 tps]# rpm -q sblim-gather
sblim-gather-2.2.3-1.el6.s390x

but then, after a while, it suddenly became dead:

.live.[root@s390x-6s-v1 tps]# service gatherer status
gatherd dead but subsys locked
reposd (pid 51121) is running...

Now, how long exactly do I have to wait, or what action should I take, to be sure that the new version doesn't crash?

I tried to run the checks periodically, and it seems the daemon dies when the system clock hits a new minute (0 seconds) - for example:

Sep 8 12:10:00 x86-64-6s-m1 kernel: gatherd[13777]: segfault at 0 ip 00007f385b0622ba sp 00007f385a453cc0 error 6 in libmetricUnixProcess.so[7f385b060000+3000]

Is that the reason? Is it enough to wait 61 seconds (in the worst case), then?
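To pin down the exact second at which gatherd dies (and check it against the minute boundary), a polling loop along these lines could be used; the one-second interval and the date format are my own choices, not part of the original session:

#!/bin/bash
# Sketch: poll once per second and print the wall-clock time at which
# gatherd stops running, to compare it against the :00 minute boundary.
service gatherer start
while service gatherer status | grep -q 'gatherd .* is running'; do
    sleep 1
done
echo "gatherd died around $(date '+%H:%M:%S')"
dmesg | grep 'gatherd\[.*\]: segfault'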
(In reply to comment #4)
> Now, how long exactly do I have to wait, or what action should I take, to
> be sure that the new version doesn't crash?
>
> I tried to run the checks periodically, and it seems the daemon dies when
> the system clock hits a new minute (0 seconds) - for example:
>
> Sep 8 12:10:00 x86-64-6s-m1 kernel: gatherd[13777]: segfault at 0 ip
> 00007f385b0622ba sp 00007f385a453cc0 error 6 in
> libmetricUnixProcess.so[7f385b060000+3000]
>
> Is that the reason? Is it enough to wait 61 seconds (in the worst case),
> then?

The sampling function (metricRetrCPUTime) that caused the segfault is called by the daemon every 60 seconds, so it should be okay to wait 61 seconds (or more, if you want to see it survive additional iterations).
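Based on that, a hedged verification sketch (the iteration count is arbitrary): let the daemon run through several 60-second sampling cycles and confirm gatherd is still up after each one.

#!/bin/bash
# Sketch: verify that a fixed build survives several sampling iterations.
# ITERATIONS is an arbitrary choice; 61 seconds covers one full cycle.
ITERATIONS=5
service gatherer start
for i in $(seq 1 "$ITERATIONS"); do
    sleep 61
    if ! service gatherer status | grep -q 'gatherd .* is running'; then
        echo "gatherd died during iteration $i"
        exit 1
    fi
done
echo "gatherd survived $ITERATIONS sampling cycles"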
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1593.html