1758935 – check-mk-livestatus-1.4.0p31-2 crashed after "Get hosts" query

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora EPEL
Classification:	Fedora
Component:	check-mk
Sub Component:
Version:	epel7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Orphan Owner
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-10-06 22:40 UTC by TJ Yang
Modified:	2024-07-09 02:56 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2024-07-09 02:56:31 UTC
Type:	Bug
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1585168	0	unspecified	CLOSED	check-mk-livestatus-1.4.0p31-1.el7.x86_64 causes nagios crash short after start	2024-07-09 02:23:56 UTC

Description TJ Yang 2019-10-06 22:40:19 UTC

Description of problem:

The following command crashes the nagios-4.4.3 server:

echo 'GET hosts' | unixcat /var/spool/nagios/cmd/livestatus

Version-Release number of selected component (if applicable):

# nagios
[nagios@nagios03 ~]$ cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[nagios@nagios03 ~]$

# check-mk installed
[root@nagios03 ~]# rpm -qa |grep check-mk
check-mk-1.4.0p31-2.el7.x86_64
check-mk-livestatus-1.4.0p31-2.el7.x86_64
[root@nagios03 ~]#

# Nagios installed.

[root@nagios03 ~]# rpm -qa |grep nagios-4
nagios-4.4.3-1.el7.x86_64
[root@nagios03 ~]#

How reproducible:


Steps to Reproduce:
1. Stop the running nagios server (systemctl stop nagios).
2. Make sure check-mk-livestatus is configured:

[root@nagios03 ~]# grep  livestatus /etc/nagios/nagios.cfg
# /var/log/nagios/livestatus.log
/var/spool/nagios/cmd/livestatus idle_timeout=12000 num_client_threads=20 debug=1 query_timeout=0
[root@nagios03 ~]#

3. Run "nagios /etc/nagios/nagios.cfg" in one vt100 terminal to start nagios in the foreground (not in daemon mode).

wproc: Registry request: name=Core Worker 13657;pid=13657
wproc: Registry request: name=Core Worker 13654;pid=13654
wproc: Registry request: name=Core Worker 13656;pid=13656
Event broker module '/usr/lib64/check_mk/livestatus.o' initialized successfully.
2019-10-06 18:28:05 [6] updating log file index
2019-10-06 18:28:05 [6] updating log file index


4. In another vt100 window, run the following command to query livestatus:

echo 'GET hosts' | unixcat /var/spool/nagios/cmd/livestatus

5. Observe the following result in the first terminal.

Actual results:

<snipped>
wproc: Registry request: name=Core Worker 13654;pid=13654
wproc: Registry request: name=Core Worker 13656;pid=13656
Event broker module '/usr/lib64/check_mk/livestatus.o' initialized successfully.
2019-10-06 18:28:05 [6] updating log file index
2019-10-06 18:28:05 [6] updating log file index
Successfully launched command file worker with pid 13668
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_S_construct null not valid
Aborted
[nagios@nagios03 ~]$


Expected results:

The nagios server process is not crashed by the check-mk-livestatus module.

Additional info:

A VM with exactly the same OS and check-mk versions (nagios02t) does not crash, but nagios02t has only a few test hosts.

This issue looks exactly like https://bugzilla.redhat.com/show_bug.cgi?id=1585168

but I don't have a "check_mk_objects.cfg" file. I have "/etc/nagios/conf.d/check_mk_templates.cfg" instead.

Comment 1 TJ Yang 2019-10-06 22:46:17 UTC

livestatus.log does not show why the livestatus module crashed after it received the GET hosts request.

[nagios@nagios03 ~]$ tail /var/log/nagios/livestatus.log
2019-10-06 12:01:04 [main] flushing log file index
2019-10-06 14:24:12 [main] socket thread has terminated
2019-10-06 14:24:12 [main] flushing log file index
2019-10-06 14:30:23 [client 2] request: GET hosts
2019-10-06 14:32:05 [main] socket thread has terminated
2019-10-06 14:32:05 [main] flushing log file index
2019-10-06 14:33:07 [client 1] request: GET hosts
2019-10-06 14:48:17 [main] socket thread has terminated
2019-10-06 14:48:17 [main] flushing log file index
2019-10-06 18:28:13 [client 1] request: GET hosts
[nagios@nagios03 ~]$

Comment 2 TJ Yang 2019-10-07 02:05:46 UTC

* I manually compiled different versions of livestatus.o, from 1.2.8 up to the latest 1.6.

[nagios@nagios03 check_mk]$ ls -lrt
total 175616
-rwxr-xr-x 1 root root  2806352 Jun  1  2018 livestatus.o.old.p31
-rwxrwxr-x 1 root root 31962768 Oct  6 20:12 livestatus-p31.o
-rwxrwxr-x 1 root root 31981976 Oct  6 20:20 livestatus-p37.o
-rwxrwxr-x 1 root root 41076080 Oct  6 20:42 livestatus-1.6-p2.o
-rwxrwxr-x 1 root root 39231344 Oct  6 21:04 livestatusp-1.5-p21.o
-rwxrwxr-x 1 root root 31953192 Oct  6 21:10 livestatus-1.4-p30.o
-rwxrwxr-x 1 root root   802544 Oct  6 21:34 livestatus-1.2.8.o
lrwxrwxrwx 1 root root       18 Oct  6 21:38 livestatus.o.1.2.8 -> livestatus-1.2.8.o
lrwxrwxrwx 1 root root       20 Oct  6 21:41 livestatus.o -> livestatus-1.4-p30.o
[nagios@nagios03 check_mk]$

* I changed debug=1 to debug=9 and got more debug info, shown below.

[nagios@nagios03 ~]$ cat  /var/log/nagios/livestatus.log
2019-10-06 21:55:01 [client 2] accepted client connection on fd 37
2019-10-06 21:55:01 [client 2] request: GET hosts
2019-10-06 21:55:01 [client 2] column hosts.groups is unrestricted
2019-10-06 21:55:01 [client 2] using full table scan
[nagios@nagios03 ~]$

* None of the versions I tried survives "GET hosts" against a production nagios instance that has 3k hosts.

 'GET statehist' was OK. It looks like I have a host defined with an empty value, which triggers the exception reported in the what() message.
 
  But I don't know how to debug further, since this command reports no warnings or errors:  "nagios -v /etc/nagios/nagios.cfg"

Comment 3 TJ Yang 2019-10-07 02:13:44 UTC

Correction: only livestatus-1.2.8.o was able to withstand the 'GET hosts' LQL query without aborting.

Comment 4 TJ Yang 2019-10-07 02:32:44 UTC

I also asked upstream for help:  https://lists.mathias-kettner.de/pipermail/checkmk-en/2019-October/028889.html

Comment 5 TJ Yang 2019-10-07 03:27:39 UTC

I downgraded livestatus further, to 1.2.6, so that the adagios web GUI can display comment and downtime records.

See details at https://github.com/opinkerfi/adagios/issues/643

Comment 6 TJ Yang 2019-10-07 10:28:15 UTC

Another comment: the real solution is to locate the line of C++ code that aborts on the NULL value and fix it, as suggested in
https://bugzilla.redhat.com/show_bug.cgi?id=1758935

Comment 7 TJ Yang 2019-10-07 10:59:27 UTC

Correction to the URL mentioned above.
Wrong: https://bugzilla.redhat.com/show_bug.cgi?id=1758935
Correct: https://stackoverflow.com/questions/21068758/basic-string-s-construct-null-not-valid
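
For illustration, the kind of guard that the linked answer points toward might look like the sketch below. It is not taken from the livestatus sources: HostRecord, safe_string, render_row and self_check are hypothetical names, and the assumption is that some host field reaches the std::string constructor as a null pointer. Constructing std::string from a null const char* is undefined behaviour, and libstdc++ reports it by throwing std::logic_error, which matches the "basic_string::_S_construct null not valid" abort above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a monitoring-core host record. The field names
// are illustrative and are not taken from the Nagios or livestatus headers.
struct HostRecord {
    const char *name;
    const char *alias;  // assumed to be null when the config leaves it unset
};

// Convert a possibly-null C string to std::string.
// A null pointer becomes an empty string instead of reaching the
// std::string(const char*) constructor, where it would be undefined behaviour.
std::string safe_string(const char *s) {
    return s != nullptr ? std::string(s) : std::string();
}

// Build a simple "name;alias" output row using the guarded conversion.
std::string render_row(const HostRecord &h) {
    return safe_string(h.name) + ";" + safe_string(h.alias);
}

// Small self-check that exercises both the non-null and the null path.
void self_check() {
    HostRecord complete{"web01", "Web server"};
    HostRecord partial{"db01", nullptr};
    assert(render_row(complete) == "web01;Web server");
    assert(render_row(partial) == "db01;");
}
```

Whether an empty string is the right substitute for a missing value depends on the column. The point is only that the null check has to happen before the std::string is constructed.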

Comment 8 Troy Dawson 2024-07-09 02:56:31 UTC

EPEL 7 entered end-of-life (EOL) status on 2024-06-30.

EPEL 7 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.
