Bug 1506423 - Nagios regularly crashes with SIGSEGV after couple of weeks of starting.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: nagios
Version: el6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Stephen John Smoogen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-10-26 01:40 UTC by Baybars
Modified: 2019-02-02 00:39 UTC

Fixed In Version: nagios-4.4.3-1.fc28 nagios-4.4.3-1.fc29 nagios-4.4.3-1.el6 nagios-4.4.3-1.el7
Last Closed: 2019-01-30 01:32:02 UTC


Description Baybars 2017-10-26 01:40:41 UTC
Description of problem:

Nagios regularly crashes with SIGSEGV a couple of weeks after starting. This began after EPEL released nagios-4.3.2-5.el6.

Version-Release number of selected component (if applicable):

nagios-4.3.2-5.el6.i686

How reproducible:

We haven't been able to trigger the bug on demand, but have a core dump and backtrace.


Actual results:

Backtrace is:

(gdb) bt
#0  0x001e1a9f in __strlen_ia32 () from /lib/libc.so.6
#1  0x001ac14f in vfprintf () from /lib/libc.so.6
#2  0x00264c32 in __vasprintf_chk () from /lib/libc.so.6
#3  0x00264b66 in __asprintf_chk () from /lib/libc.so.6
#4  0x0808112c in asprintf (mac=0xbfaccd3c) at /usr/include/bits/stdio2.h:158
#5  add_macrox_environment_vars_r (mac=0xbfaccd3c) at ../common/macros.c:3305
#6  macros_to_kvv (mac=0xbfaccd3c) at ../common/macros.c:3251
#7  0x0805e35a in wproc_run_job (job=0xc6b57e0, mac=<value optimized out>) at workers.c:1036
#8  0x080634d7 in run_async_service_check (svc=0xc4f0ea8, check_options=0, latency=0, scheduled_check=1, reschedule_check=1, time_is_valid=0xbfacd0d0, preferred_time=0xbfacd0d8)
    at checks.c:306
#9  0x08063891 in run_scheduled_service_check (svc=0xc4f0ea8, check_options=0, latency=0) at checks.c:90
#10 0x080746c1 in handle_timed_event (event=0x9a55718) at events.c:1171
#11 0x08078023 in event_execution_loop () at events.c:1110
#12 0x08058c88 in main (argc=3, argv=0xbfacd514) at nagios.c:814
(gdb)
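
For readers less used to gdb output, the backtrace above can be summarised mechanically. The following is a minimal sketch and not part of the original report. It parses `bt` frames and picks the innermost frame whose source file is not under `/usr/include` (an assumed heuristic for "first frame in Nagios's own code"). The names `parse_backtrace` and `first_app_frame` are hypothetical; the sample is an excerpt of frames #0, #4, #5 and #8 copied from the backtrace.

```python
import re

# One gdb frame: "#N  [0xADDR in ]func (args)[ at file:line | from lib]".
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+"
    r"(?:0x[0-9a-fA-F]+ in )?"
    r"(?P<func>\S+) \((?P<args>.*)\)"
    r"(?: at (?P<file>\S+):(?P<line>\d+)| from (?P<lib>\S+))?$"
)


def parse_backtrace(text):
    """Return one dict per frame, re-joining frames gdb wrapped onto two lines."""
    joined = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            joined.append(line)
        elif joined and line and not line.startswith("(gdb)"):
            joined[-1] += " " + line  # continuation such as "at checks.c:306"
    frames = []
    for line in joined:
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.groupdict())
    return frames


def first_app_frame(frames):
    """Innermost frame with source info that is not a system header (heuristic)."""
    for frame in frames:
        if frame["file"] and not frame["file"].startswith("/usr/include/"):
            return frame
    return None


# Excerpt of the backtrace from this report (frames #0, #4, #5, #8).
SAMPLE = """\
#0  0x001e1a9f in __strlen_ia32 () from /lib/libc.so.6
#4  0x0808112c in asprintf (mac=0xbfaccd3c) at /usr/include/bits/stdio2.h:158
#5  add_macrox_environment_vars_r (mac=0xbfaccd3c) at ../common/macros.c:3305
#8  0x080634d7 in run_async_service_check (svc=0xc4f0ea8, check_options=0, latency=0, scheduled_check=1, reschedule_check=1, time_is_valid=0xbfacd0d0, preferred_time=0xbfacd0d8)
    at checks.c:306
"""

frame = first_app_frame(parse_backtrace(SAMPLE))
print(frame["func"], frame["file"], frame["line"])
# prints: add_macrox_environment_vars_r ../common/macros.c 3305
```

The result points at `add_macrox_environment_vars_r` in `macros.c:3305`, with the frames above it all inside libc or an inlined `asprintf` wrapper. That suggests the crash happens while a string is being formatted there, but the backtrace alone does not say why the pointer passed to `strlen` was invalid.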



Comment 1 Stephen John Smoogen 2017-10-26 13:42:44 UTC
Thanks. I will see if this is enough for upstream to find a fix. Could you try the nagios build in epel-test to see whether they fixed it in the meantime? If they did, could you give it +1 karma so I know it works?

Comment 2 Stephen John Smoogen 2017-11-20 20:57:42 UTC
I have not seen anything from upstream on this, and I have not yet been able to replicate it on my EL6 nagios system. Did the updates fix it for you?

Comment 3 Baybars 2017-11-20 22:00:02 UTC
Hi Stephen, we had to restart nagios (on the 15th of Nov.) after the updates from the test repo were applied, so we haven't got past the ~two week mark. Will update once we know more. Thanks for looking into it!

Comment 4 Baybars 2017-11-21 23:35:22 UTC
This morning we noticed we weren't getting nagios notifications anymore, so we checked the nagios log file. It was unable to run any checks, logging:

[1511269200] Unable to run check for service 'sssd' on host 'letter2'
[1511269200] Unable to run check for service 'crond-procs' on host 'silk1'
[1511269200] Unable to run check for service 'syslog-ng-procs' on host 'syslog1'
[1511269200] Unable to run check for service 'memory' on host 'marathon1'
[1511269200] Unable to run check for service 'munin-asyncd' on host 'mars'
[1511269200] Unable to run check for service 'munin-asyncd' on host 'thm-tsta-vm2'
[1511269200] Unable to run check for service 'disk-space-free' on host 'milton2'
...etc.
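
The bracketed number in each log line is a Unix epoch timestamp. As a small sketch (not part of the original comment), the line shape below is inferred only from the excerpt above, and `parse_failed_check` is a hypothetical helper name:

```python
import re
from datetime import datetime, timezone

# Assumed line shape, based only on the excerpt in this comment:
# [<epoch seconds>] Unable to run check for service '<svc>' on host '<host>'
FAILED_CHECK_RE = re.compile(
    r"^\[(?P<epoch>\d+)\] Unable to run check for service "
    r"'(?P<service>[^']+)' on host '(?P<host>[^']+)'$"
)


def parse_failed_check(line):
    """Return (UTC datetime, service, host), or None if the line does not match."""
    match = FAILED_CHECK_RE.match(line.strip())
    if match is None:
        return None
    when = datetime.fromtimestamp(int(match["epoch"]), tz=timezone.utc)
    return when, match["service"], match["host"]


when, service, host = parse_failed_check(
    "[1511269200] Unable to run check for service 'sssd' on host 'letter2'"
)
print(when.isoformat(), service, host)
# prints: 2017-11-21T13:00:00+00:00 sssd letter2
```

All seven lines quoted above carry the same timestamp, so the failures shown were logged within the same second.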

There weren't any OOM messages in the kernel log, but the munin graphs for the nagios host show memory and swap usage ramping up considerably after the epel-test version of nagios was installed. Unfortunately I was unable to get a pstack to help the case.
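
Where no graphing is available, the same trend can be sampled directly from `/proc`. This is a minimal, Linux-only sketch, not what the reporter used: `parse_status_kb` and `read_memory_kb` are hypothetical helper names, and the sample text is made up for illustration. `VmSwap` may be missing on some older kernels, in which case the key is simply absent from the result.

```python
import re


def parse_status_kb(status_text, fields=("VmRSS", "VmSwap")):
    """Extract selected '<Field>:  <n> kB' values from /proc/<pid>/status text."""
    values = {}
    for line in status_text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+) kB$", line)
        if match and match.group(1) in fields:
            values[match.group(1)] = int(match.group(2))
    return values


def read_memory_kb(pid):
    """Read resident and swapped-out memory (in kB) for a live process."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_status_kb(handle.read())


# Hypothetical sample text, not taken from the reporter's system.
SAMPLE_STATUS = "Name:\tnagios\nVmRSS:\t   40960 kB\nVmSwap:\t       0 kB\n"
print(parse_status_kb(SAMPLE_STATUS))
# prints: {'VmRSS': 40960, 'VmSwap': 0}
```

Calling `read_memory_kb` with the Nagios PID at regular intervals (for example from cron) and logging the result would show whether resident memory keeps growing between restarts.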

Comment 5 Stephen John Smoogen 2017-11-21 23:55:58 UTC
Hmmm, I am not sure what could be causing that. How many checks and hosts are being monitored? Our couple of hundred hosts inside Fedora run in 40 MB of process space. If you can get more info I would appreciate it.

Comment 6 Baybars 2017-11-30 04:22:43 UTC
Hi Stephen, 

Sorry, I have only just seen your response.

We had another crash but the backtrace looks the same. The numbers are:

# Active Host / Service Checks:	544 / 12304
# Passive Host / Service Checks:	0 / 1948

We run a livestatus broker module in nagios (it worked without issues in version 3.x). I have disabled it on our test system to see whether that has an effect.

Comment 7 Baybars 2017-12-04 01:53:21 UTC
Disabling livestatus did not seem to help; we are still seeing a memory leak.

Comment 8 Bryan Heden 2017-12-04 12:06:11 UTC
I can confirm various other reports of a memory leak upstream. We haven't found the cause or fix yet, but we have been able to reproduce it internally.

Any other data that you can supply may be helpful in producing a fix. Are you able to set debug_verbosity to 2 and debug_level to -1 by chance? I would only suggest this if you have the disk space to spare.
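
In `nagios.cfg` terms, the settings requested above would look roughly like the fragment below. The two directive names and values come from the comment; the note about `max_debug_file_size` is an addition and should be checked against the documentation for the installed version.

```cfg
# Log every debug category (-1) at the most detailed verbosity (2).
debug_level=-1
debug_verbosity=2
# The debug log grows quickly at these settings. The max_debug_file_size
# directive (a size in bytes) limits how large the file may get.
```

Nagios needs to be restarted for changes to the main configuration file to take effect.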

Comment 9 AJ Zmudosky 2018-01-03 19:38:01 UTC
We're also seeing both of these behaviors (segfault and memory leak) on an instance with 451 hosts and 8286 active service checks. The only additional item is pnp4nagios processing perfdata.

After the update to Nagios 4, we began experiencing segfaults after about 2 weeks of Nagios running. (This bug had not yet been filed when I implemented, as a temporary measure, a script that restarts Nagios if it logs a segfault.)
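
The commenter's restart script is not shown, so the following is only an illustrative sketch of that kind of stopgap, not their implementation. It assumes the segfault appears as a kernel-style log message and that the host uses an EL6-style `service` command; the helper names and the message shape in the comments are assumptions.

```python
import re
import subprocess

# Assumed kernel-log shape for a crashed process, for example:
#   "nagios[1234]: segfault at 0 ip ... error 4 in libc-2.12.so[...]"
# This pattern is an assumption, not taken from the commenter's logs.
SEGFAULT_RE = re.compile(r"\bnagios\[\d+\]: segfault at\b")


def segfault_logged(lines):
    """True if any log line looks like a Nagios segfault report."""
    return any(SEGFAULT_RE.search(line) for line in lines)


def restart_nagios():
    """Restart via the init script (assumption: EL6-style 'service' command)."""
    return subprocess.run(["service", "nagios", "restart"], check=False).returncode


def watchdog(lines, restart=restart_nagios):
    """Restart Nagios once if a segfault was logged; report whether it did."""
    if segfault_logged(lines):
        restart()
        return True
    return False
```

A cron job could pass the newest lines of the system log to `watchdog`. Passing a different `restart` callable makes the logic testable without touching the running service. A workaround like this hides the crash rather than fixing it, so keeping the core dumps is still worthwhile.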

Since the update from 4.3.2-5.el6 to 4.3.4-4.el6 (installed on 2017-12-07), we have also begun to see the check failures and the memory leak (evidenced by exhaustion of virtual memory, with Nagios using over 3 GB of memory after about one week of runtime).

We'll be rebooting the system with the latest kernel today and monitoring from there. Is there any particular information that would be helpful in diagnosing this?

Comment 10 Bryan Heden 2018-01-03 19:55:14 UTC
Are you using the neb module for pnp? If so, which version are you using?

I think the debug log from after a segfault, with debugging turned on to log everything, would be rather useful.

Comment 11 Fedora Update System 2018-11-30 19:58:32 UTC
nagios-4.4.2-3.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-0346a55d0f

Comment 12 Fedora Update System 2018-11-30 20:52:09 UTC
nagios-4.4.2-3.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-42555731d2

Comment 13 Fedora Update System 2018-11-30 21:03:44 UTC
nagios-4.4.2-3.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-70fe6a4d75

Comment 14 Fedora Update System 2018-11-30 21:38:06 UTC
nagios-4.4.2-3.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-61fe7c6e70

Comment 15 Fedora Update System 2018-12-01 01:38:38 UTC
nagios-4.4.2-3.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-70fe6a4d75

Comment 16 Fedora Update System 2018-12-01 01:55:08 UTC
nagios-4.4.2-3.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-0346a55d0f

Comment 17 Fedora Update System 2018-12-01 02:03:53 UTC
nagios-4.4.2-3.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-61fe7c6e70

Comment 18 Fedora Update System 2018-12-01 02:43:47 UTC
nagios-4.4.2-3.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-42555731d2

Comment 19 Fedora Update System 2019-01-17 00:14:46 UTC
nagios-4.4.3-1.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-d661b588d2

Comment 20 Fedora Update System 2019-01-17 00:25:26 UTC
nagios-4.4.3-1.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-17b388679b

Comment 21 Fedora Update System 2019-01-17 00:43:06 UTC
nagios-4.4.3-1.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-376ecc221c

Comment 22 Fedora Update System 2019-01-17 00:55:23 UTC
nagios-4.4.3-1.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2019-0b44528ff1

Comment 23 Fedora Update System 2019-01-18 01:00:29 UTC
nagios-4.4.3-1.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-d661b588d2

Comment 24 Fedora Update System 2019-01-18 01:31:51 UTC
nagios-4.4.3-1.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2019-17b388679b

Comment 25 Fedora Update System 2019-01-18 03:04:59 UTC
nagios-4.4.3-1.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-0b44528ff1

Comment 26 Fedora Update System 2019-01-18 03:36:18 UTC
nagios-4.4.3-1.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-376ecc221c

Comment 27 Fedora Update System 2019-01-30 01:32:02 UTC
nagios-4.4.3-1.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

Comment 28 Fedora Update System 2019-01-30 02:06:44 UTC
nagios-4.4.3-1.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 29 Fedora Update System 2019-02-02 00:36:25 UTC
nagios-4.4.3-1.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.

Comment 30 Fedora Update System 2019-02-02 00:39:26 UTC
nagios-4.4.3-1.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.

