1504814 – /etc/init.d/nagios stop hangs forever

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora EPEL
Classification:	Fedora
Component:	nagios
Sub Component:
Version:	el6
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Stephen John Smoogen
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-10-20 17:05 UTC by Dimitri Maziuk
Modified:	2017-12-10 23:28 UTC
CC List:	15 users
Fixed In Version:	nagios-4.3.4-7.el6
Clone Of:
Environment:
Last Closed:	2017-12-10 23:28:03 UTC
Type:	Bug
Embargoed:

Description Dimitri Maziuk 2017-10-20 17:05:06 UTC

Description of problem:

service nagios stop

hangs at or after

```
+ remove_commandfile
+ '[' -p /var/spool/nagios/cmd/nagios.cmd ']'
+ mv -f /var/spool/nagios/cmd/nagios.cmd /var/spool/nagios/cmd/nagios.cmd~
+ echo -n ''
+ dd if=/var/spool/nagios/cmd/nagios.cmd~ count=0
```

(dd redirects stderr to /dev/null so I'm not sure what happens after that)

Version-Release number of selected component (if applicable):

nagios-4.3.2-5.el6.x86_64

How reproducible:

always

Steps to Reproduce:
1. service nagios stop

Actual results:

hangs forever

Expected results:

stop nagios service

Additional info:

Looking at the comment in the script, it looks like the cure is worse than the disease: the dd trick is supposed to prevent a deadlock that may happen, but instead it makes one always happen.

Comment 1 Dimitri Maziuk 2017-10-20 17:05:55 UTC

... it *looks* like the cure is worse ...

Comment 2 Stephen John Smoogen 2017-10-20 22:50:54 UTC

I have not been able to replicate this on my el6 systems. Could you let me know whether you have selinux enabled and, if so, whether you have the nagios-selinux rpm installed? After that, I would edit the file, comment out the /dev/null part, and see whether dd says anything.

We can go from there.

Comment 3 Dimitri Maziuk 2017-10-20 23:19:36 UTC

No, no selinux here. There is nagiosgraph and a whole lot of custom monitoring scripts (some of which seem to have b0rked on this update), but nothing obvious to explain this. Without the redirect:
```
[root@hereford init.d]# ./nagios stop
Stopping nagios:. done.
0+0 records in
0+0 records out
0 bytes (0 B) copied, 7.2405e-05 s, 0.0 kB/s
```
and it's hanging:
```
[root@hereford ~]# ps auxw | grep nagios
root      5875  0.0  0.0 106512  1668 pts/2    S+   18:07   0:00 /bin/sh ./nagios stop
root      5887  0.0  0.0 106512   860 pts/2    S+   18:07   0:00 /bin/sh ./nagios stop
root      5929  0.0  0.0 103324   824 pts/0    S+   18:10   0:00 grep nagios
```

I do have two more notes after looking:

1. does this look right:
```
(dd if=$NagiosCommandFile~ count=0 2>/dev/null & echo -n "" >$NagiosCommandFile~)
```
-- I mean the single ampersand before echo.

2. Stopping nagios should probably print "done" only after it has removed the command file, cleaned up, etc.
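
On the single ampersand asked about in note 1: in shell, `a & b` starts `a` in the background and runs `b` straight away, whereas `a && b` runs `b` only after `a` has finished successfully. The following is a minimal sketch of that difference, using `sleep` as a stand-in for the blocking `dd`; it is an illustration only and is not taken from the init script.

```shell
#!/bin/bash
# "a & b": a is started in the background, b runs immediately.
start=$SECONDS
( sleep 2 & echo "single ampersand: echo ran without waiting for sleep" )
single=$(( SECONDS - start ))

# "a && b": b runs only after a has finished successfully.
start=$SECONDS
( sleep 2 && echo "double ampersand: echo ran after sleep finished" )
double=$(( SECONDS - start ))

echo "elapsed: single=${single}s double=${double}s"
```

The first subshell should return almost immediately, while the second takes at least as long as the `sleep`.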

Comment 4 Dimitri Maziuk 2017-10-20 23:23:45 UTC

I tried this:
dd if=/var/spool/nagios/cmd/nagios.cmd~ count=0 of=/dev/null
just to make sure it's not hanging trying to write to the tty, and it's still sitting there. So it looks to me like it's waiting on the pipe -- no idea why.
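
One general FIFO property that may be relevant here: opening a named pipe read-only blocks until some process opens it for writing, so `dd` can stall before it reads anything. The sketch below demonstrates only that property on a throwaway FIFO. The path and the use of `timeout` are assumptions made for the demo, and it does not claim to reproduce the exact condition on the affected host.

```shell
#!/bin/bash
dir=$(mktemp -d)
fifo="$dir/demo.cmd~"          # throwaway FIFO, not the real command file
mkfifo "$fifo"

# Case 1: no writer ever opens the FIFO, so dd blocks in open().
# timeout(1) gives up after 1 second and reports exit status 124.
timeout 1 dd if="$fifo" count=0 of=/dev/null 2>/dev/null
no_writer_status=$?

# Case 2: a background writer opens the FIFO, so dd's open() completes.
( : > "$fifo" ) &
timeout 5 dd if="$fifo" count=0 of=/dev/null 2>/dev/null
with_writer_status=$?
wait

echo "no writer: $no_writer_status, with writer: $with_writer_status"
rm -rf "$dir"
```

A status of 124 in the first case means `timeout` had to kill `dd` while it was still waiting.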

Comment 5 Dimitri Maziuk 2017-10-20 23:27:39 UTC

PPS: sending `dd` a USR1:
```
dd: opening `/var/spool/nagios/cmd/nagios.cmd~': Interrupted system call
+ rm -f /var/spool/nagios/cmd/nagios.cmd /var/spool/nagios/cmd/nagios.cmd~
+ rm -f /var/log/nagios/status.dat /var/run/nagios/nagios.pid /var/lock/subsys/nagios
```

Comment 6 Stephen John Smoogen 2017-10-20 23:48:55 UTC

OK, the & is correct: it puts the dd process in the background. The echo "" > $NagiosCommandFile~ should then cause the background task and any other items attached to the pipe to fail and close.

Could you see if the following script also locks?
```
#!/bin/bash
rm -f /tmp/test.pipe
mkfifo /tmp/test.pipe
ls -1 /tmp > /tmp/test.pipe &
mv /tmp/test.pipe /tmp/test.pipe~
## The () make this a subshell. The & is intentional as we want to bg this.
( dd if=/tmp/test.pipe~ count=0 2>/dev/null &
  echo -n "" > /tmp/test.pipe~ )
jobs

rm -f /tmp/test.pipe~
### end of script
```

Comment 7 Stephen John Smoogen 2017-10-21 00:00:44 UTC

OK, the dd output says that something is blocking the dd subprocess from being able to read the pipe file, but the only causes I can think of would be selinux or a filesystem error. What kind of filesystem is /var/spool on this box, and could you run

ls -lZ /var/spool/nagios/cmd/nagios.cmd*

while it is running so I can see how the file is set up?

Thanks.

Comment 8 Dimitri Maziuk 2017-10-23 17:45:53 UTC

```
# ls -lZ /var/spool/nagios/cmd/nagios.cmd*
prw-rw---- nagios nagios ?                                /var/spool/nagios/cmd/nagios.cmd
```

It's ext4 on spinning rust with about 4 of 200GB used.

The script hangs about 15% of the time. Can't say I see a rhyme or reason to it.
```
[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
^C./testpipe.sh: line 8: /tmp/test.pipe~: Interrupted system call

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
^C./testpipe.sh: line 8: /tmp/test.pipe~: Interrupted system call

[root@hereford ~]# ./testpipe.sh
^C./testpipe.sh: line 8: /tmp/test.pipe~: Interrupted system call
```
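
To put a number on an intermittent hang like this without pressing ^C by hand, the runs can be wrapped in `timeout` and the timeouts counted. This is a minimal sketch: the helper name `count_hangs` and the run count and time limit in the usage line are hypothetical and are not part of the original test.

```shell
#!/bin/bash
# count_hangs CMD RUNS LIMIT
# Runs CMD RUNS times and prints how many runs were still going after LIMIT seconds.
count_hangs() {
    local cmd=$1 runs=$2 limit=$3 hangs=0 i
    for (( i = 0; i < runs; i++ )); do
        # timeout(1) exits with status 124 when it had to kill the command.
        timeout "$limit" bash -c "$cmd" >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            hangs=$(( hangs + 1 ))
        fi
    done
    echo "$hangs"
}
```

For example, `count_hangs ./testpipe.sh 100 5` would print the number of runs out of 100 that did not finish within 5 seconds.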

Comment 9 Stephen John Smoogen 2017-10-23 22:03:33 UTC

OK, that is weird. I haven't been able to replicate it yet, but I fear that nagios is going to hang whether or not the line is in the service script, because something is holding on to the named pipe and isn't allowing an interrupt.

I am going to get some input from more experienced engineers on what should be done here.

Comment 10 Dimitri Maziuk 2017-10-23 22:22:38 UTC

This is an old dual-core pentium D with 4GB RAM, but about the only other thing that's running there is salt-minion. There's maybe a GB of RAM used and an occasional 0.2% iowait or sys load... can't really say the hardware is strained. Other than that it's a stock centos 6.latest x86_64, it's been running nagios happily for years until this update, nagiosgraph is about the only non-stock thing on it that I can think of.

Comment 11 Stephen John Smoogen 2017-10-24 19:41:56 UTC

The consensus I have gotten so far is that it working for me is a fluke, so I will raise it with upstream and ask them to remove this from their code.

https://github.com/NagiosEnterprises/nagioscore/issues/443

I am commenting this out and will put in an updated build.

Comment 12 Dimitri Maziuk 2017-10-24 19:49:38 UTC

LOL

It is a weird one though, I managed to do a couple of /etc/init.d/nagios stops without the hang, too, while playing with it.

Comment 13 Peter Meier 2017-11-16 16:49:30 UTC

I'm observing/affected by the same behavior.

Comment 14 Peter Meier 2017-11-16 16:55:25 UTC

Interestingly enough, it worked fine once I restarted the server. I have now added the fix from upstream. As I regularly restart nagios due to config changes, I assume I will see it again quite soon if it still happens.

Comment 15 Stephen John Smoogen 2017-11-20 20:59:34 UTC

OK, I have a probable fix for this. The problem is that the echo also needs to be backgrounded so that everything inside the () goes completely into the background. I have tried it on my test boxes and it doesn't hang, but I was not able to trigger the hang very often to begin with. A -5 build with this change will appear in epel-testing soon.
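
The thread does not show the patch itself, so the following is only a sketch of what "backgrounding the echo as well" could look like, run against a throwaway FIFO. The temporary path, the `timeout` guards and the `bash -c` wrapper are assumptions added for this demo and are not claimed to be in the shipped init script.

```shell
#!/bin/bash
dir=$(mktemp -d)
NagiosCommandFile="$dir/nagios.cmd"   # hypothetical path for the demo
mkfifo "$NagiosCommandFile~"

start=$SECONDS
# Both the reader and the writer are backgrounded, so the subshell
# returns at once even if one side is slow to open the FIFO.
# timeout(1) is an extra guard added for this sketch only.
( timeout 5 dd if="$NagiosCommandFile~" count=0 2>/dev/null &
  timeout 5 bash -c 'echo -n "" > "$1"' _ "$NagiosCommandFile~" & )
elapsed=$(( SECONDS - start ))

sleep 1   # give the two background processes time to meet on the FIFO
rm -f "$NagiosCommandFile" "$NagiosCommandFile~"
rmdir "$dir"
echo "caller resumed after ${elapsed}s"
```

The point of the design is that the calling script never sits in a blocking `open()` itself; whether the two background processes pair up is no longer something the caller waits on.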

Comment 16 R P Herrold 2017-11-20 21:20:45 UTC

If accessing through ssh, adding
   -t
sometimes helps, as does, when applicable, adding a
   < /dev/null

to keep it from hanging while looking for stdin.
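
As a rough sketch of the two suggestions combined, the wrapper below passes `-t` to ssh and redirects the remote command's stdin from /dev/null. The function name, the overridable `SSH` variable and the host name in the usage line are hypothetical and exist only for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper. SSH can be overridden (e.g. SSH=echo) for a dry run.
remote_nagios_stop() {
    local host=$1
    # -t asks ssh to allocate a pseudo-terminal on the remote side.
    # The "< /dev/null" sits inside the quotes, so it applies to the
    # remote command and gives it an empty stdin.
    "${SSH:-ssh}" -t "$host" 'service nagios stop < /dev/null'
}
```

Calling `SSH=echo remote_nagios_stop monitor.example.com` prints the arguments that would be passed to ssh without connecting anywhere.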

Comment 17 Dimitri Maziuk 2017-11-20 21:26:44 UTC

ssh would probably be the most common: restart after reconfig, but you could fire it from e.g. salt state or something too...

Comment 18 Fedora Update System 2017-11-20 23:08:09 UTC

nagios-4.3.4-5.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-723d26389e

Comment 19 Fedora Update System 2017-11-21 19:28:05 UTC

nagios-4.3.4-5.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-723d26389e

Comment 20 Fedora Update System 2017-11-24 13:16:15 UTC

nagios-4.3.4-6.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-ecb67df0a6

Comment 21 Fedora Update System 2017-11-25 05:23:13 UTC

nagios-4.3.4-7.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-ecb67df0a6

Comment 22 Fedora Update System 2017-12-10 23:28:03 UTC

nagios-4.3.4-7.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.
