Bug 1758935 - check-mk-livestatus-1.4.0p31-2 crashed after "Get hosts" query
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: check-mk
Version: epel7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Orphan Owner
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-10-06 22:40 UTC by TJ Yang
Modified: 2024-07-09 02:56 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2024-07-09 02:56:31 UTC
Type: Bug
Embargoed:


Links
System: Red Hat Bugzilla
ID: 1585168
Private: 0
Priority: unspecified
Status: CLOSED
Summary: check-mk-livestatus-1.4.0p31-1.el7.x86_64 causes nagios crash short after start
Last Updated: 2024-07-09 02:23:56 UTC

Description TJ Yang 2019-10-06 22:40:19 UTC
Description of problem:

The following command crashes the nagios-4.4.3 server:

echo 'GET hosts' | unixcat /var/spool/nagios/cmd/livestatus
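
For hosts where unixcat is not available, the same query can be sent with a few lines of Python. This is only a minimal sketch: it assumes the socket path from this report, and the helper name send_query is made up for illustration. It writes the query, half-closes the socket so Livestatus sees the end of the request, and reads until the server closes the connection.

```python
import socket

# Socket path taken from this report; adjust for other installations.
LIVESTATUS_SOCKET = "/var/spool/nagios/cmd/livestatus"


def send_query(query, path=LIVESTATUS_SOCKET, timeout=10.0):
    """Send one LQL query over the Unix socket and return the raw reply.

    Hypothetical helper, roughly equivalent to: echo "$query" | unixcat "$path"
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(query.encode("utf-8"))
        # Half-close the write side so the server knows the query is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling send_query("GET hosts\n") should reproduce the crash on an affected server in the same way as the unixcat command.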

Version-Release number of selected component (if applicable):

# OS release on the nagios server
[nagios@nagios03 ~]$ cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[nagios@nagios03 ~]$

# check-mk installed
[root@nagios03 ~]# rpm -qa |grep check-mk
check-mk-1.4.0p31-2.el7.x86_64
check-mk-livestatus-1.4.0p31-2.el7.x86_64
[root@nagios03 ~]#

# Nagios installed.

[root@nagios03 ~]# rpm -qa |grep nagios-4
nagios-4.4.3-1.el7.x86_64
[root@nagios03 ~]#

How reproducible:


Steps to Reproduce:
1. Stop the running nagios server (systemctl stop nagios).
2. Make sure check-mk-livestatus is configured.

[root@nagios03 ~]# grep  livestatus /etc/nagios/nagios.cfg
# /var/log/nagios/livestatus.log
/var/spool/nagios/cmd/livestatus idle_timeout=12000 num_client_threads=20 debug=1 query_timeout=0
[root@nagios03 ~]#

3. Run "nagios /etc/nagios/nagios.cfg" in one vt100 terminal to start nagios in the foreground (not in daemon mode).

wproc: Registry request: name=Core Worker 13657;pid=13657
wproc: Registry request: name=Core Worker 13654;pid=13654
wproc: Registry request: name=Core Worker 13656;pid=13656
Event broker module '/usr/lib64/check_mk/livestatus.o' initialized successfully.
2019-10-06 18:28:05 [6] updating log file index
2019-10-06 18:28:05 [6] updating log file index


4. In another vt100 window, run the following command to send the query.

echo 'GET hosts' | unixcat /var/spool/nagios/cmd/livestatus

5. Observe the result shown below.

Actual results:

<snipped>
wproc: Registry request: name=Core Worker 13654;pid=13654
wproc: Registry request: name=Core Worker 13656;pid=13656
Event broker module '/usr/lib64/check_mk/livestatus.o' initialized successfully.
2019-10-06 18:28:05 [6] updating log file index
2019-10-06 18:28:05 [6] updating log file index
Successfully launched command file worker with pid 13668
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_S_construct null not valid
Aborted
[nagios@nagios03 ~]$


Expected results:

The nagios server process is not crashed by the check-mk-livestatus module.

Additional info:

A VM with exactly the same OS and check-mk versions (nagios02t) does not crash, but nagios02t only has a few test hosts.

This issue looks exactly like https://bugzilla.redhat.com/show_bug.cgi?id=1585168

However, I don't have a "check_mk_objects.cfg" file. I have "/etc/nagios/conf.d/check_mk_templates.cfg" instead.

Comment 1 TJ Yang 2019-10-06 22:46:17 UTC
livestatus.log does not show why the livestatus module crashed after it received the GET hosts request.

[nagios@nagios03 ~]$ tail /var/log/nagios/livestatus.log
2019-10-06 12:01:04 [main] flushing log file index
2019-10-06 14:24:12 [main] socket thread has terminated
2019-10-06 14:24:12 [main] flushing log file index
2019-10-06 14:30:23 [client 2] request: GET hosts
2019-10-06 14:32:05 [main] socket thread has terminated
2019-10-06 14:32:05 [main] flushing log file index
2019-10-06 14:33:07 [client 1] request: GET hosts
2019-10-06 14:48:17 [main] socket thread has terminated
2019-10-06 14:48:17 [main] flushing log file index
2019-10-06 18:28:13 [client 1] request: GET hosts
[nagios@nagios03 ~]$

Comment 2 TJ Yang 2019-10-07 02:05:46 UTC
* I manually compiled different versions of livestatus.o, from 1.2.8 up to the latest 1.6.

[nagios@nagios03 check_mk]$ ls -lrt
total 175616
-rwxr-xr-x 1 root root  2806352 Jun  1  2018 livestatus.o.old.p31
-rwxrwxr-x 1 root root 31962768 Oct  6 20:12 livestatus-p31.o
-rwxrwxr-x 1 root root 31981976 Oct  6 20:20 livestatus-p37.o
-rwxrwxr-x 1 root root 41076080 Oct  6 20:42 livestatus-1.6-p2.o
-rwxrwxr-x 1 root root 39231344 Oct  6 21:04 livestatusp-1.5-p21.o
-rwxrwxr-x 1 root root 31953192 Oct  6 21:10 livestatus-1.4-p30.o
-rwxrwxr-x 1 root root   802544 Oct  6 21:34 livestatus-1.2.8.o
lrwxrwxrwx 1 root root       18 Oct  6 21:38 livestatus.o.1.2.8 -> livestatus-1.2.8.o
lrwxrwxrwx 1 root root       20 Oct  6 21:41 livestatus.o -> livestatus-1.4-p30.o
[nagios@nagios03 check_mk]$

* I changed debug=1 to debug=9 and got the additional debug output below.

[nagios@nagios03 ~]$ cat  /var/log/nagios/livestatus.log
2019-10-06 21:55:01 [client 2] accepted client connection on fd 37
2019-10-06 21:55:01 [client 2] request: GET hosts
2019-10-06 21:55:01 [client 2] column hosts.groups is unrestricted
2019-10-06 21:55:01 [client 2] using full table scan
[nagios@nagios03 ~]$

* None of the versions I tried survives "GET hosts" on a production nagios that has 3k hosts.

 'GET statehist' was OK. It looks like I have a host defined with an empty value, which triggers the std::logic_error whose what() message is shown in the description.
 
  But I don't know how to debug further, since this command reports no warnings or errors: "nagios -v /etc/nagios/nagios.cfg"
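
Since "nagios -v" reports nothing, one thing that could be tried is a separate scan of the object configuration for host directives that have a name but no value. The sketch below is only that, a sketch: the empty-value theory is unconfirmed, the function name find_empty_directives is hypothetical, and it only understands the plain "define host { ... }" syntax with "#" and ";" comments. A NULL could equally come from a directive that is missing altogether, which this scan would not find.

```python
import re

# Opening of an object definition, e.g. "define host {" or "define host{".
DEFINE_RE = re.compile(r"^define\s+(\w+)\s*\{")


def find_empty_directives(text, object_type="host"):
    """Return (line_number, directive) pairs for directives without a value.

    Only definitions of `object_type` are inspected; other object types
    and everything outside a definition are skipped.
    """
    findings = []
    inside = False
    for number, raw in enumerate(text.splitlines(), start=1):
        # Drop ';' inline comments, then surrounding whitespace.
        line = raw.split(";", 1)[0].strip()
        if not line or line.startswith("#"):
            continue
        match = DEFINE_RE.match(line)
        if match:
            inside = match.group(1) == object_type
            continue
        if line.startswith("}"):
            inside = False
            continue
        if inside:
            parts = line.split(None, 1)
            if len(parts) == 1:
                # A directive name with nothing after it.
                findings.append((number, parts[0]))
    return findings
```

The function takes the contents of one config file as a string, so it can be run over every file that nagios.cfg pulls in through cfg_file and cfg_dir.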

Comment 3 TJ Yang 2019-10-07 02:13:44 UTC
Correction: only livestatus-1.2.8.o was able to withstand the 'GET hosts' LQL query without aborting.

Comment 4 TJ Yang 2019-10-07 02:32:44 UTC
Also asking upstream for help: https://lists.mathias-kettner.de/pipermail/checkmk-en/2019-October/028889.html

Comment 5 TJ Yang 2019-10-07 03:27:39 UTC
I downgraded livestatus further, to 1.2.6, so that the adagios web GUI can display comment and downtime records.

See details at https://github.com/opinkerfi/adagios/issues/643

Comment 6 TJ Yang 2019-10-07 10:28:15 UTC
Another comment: the real solution is to locate which line of C++ code aborts on the NULL value and fix that code, as suggested in
https://bugzilla.redhat.com/show_bug.cgi?id=1758935
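
Short of a debugger, one way to narrow down which attribute is NULL might be to request one column of the hosts table at a time and note which query aborts nagios. The sketch below only generates the queries. The helper name per_column_queries is made up, and it assumes the standard LQL "Columns:", "Limit:" and "Filter:" headers and the "columns" meta table behave in these builds as they do in stock Livestatus. Every query that hits the bad column takes nagios down, so this belongs on a test copy, with a restart after each crash.

```python
# Query that should list the column names of the hosts table
# (assumes the stock Livestatus "columns" meta table).
COLUMN_LIST_QUERY = "GET columns\nFilter: table = hosts\nColumns: name\n"


def per_column_queries(columns, limit=None):
    """Build one 'GET hosts' query per column name.

    `limit` optionally caps the number of rows per query. Leave it unset
    to walk all hosts, since the bad value may sit on any one of them.
    """
    queries = []
    for column in columns:
        lines = ["GET hosts", "Columns: " + column]
        if limit is not None:
            lines.append("Limit: %d" % limit)
        queries.append("\n".join(lines) + "\n")
    return queries
```

Each generated string can be piped to unixcat in the same way as the original 'GET hosts' query. The column whose query kills nagios then points at the accessor in the C++ code that builds a std::string from a NULL pointer.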

Comment 8 Troy Dawson 2024-07-09 02:56:31 UTC
EPEL 7 entered end-of-life (EOL) status on 2024-06-30.

EPEL 7 is no longer maintained, which means that it
will not receive any further security or bug fix updates.
As a result we are closing this bug.

