Bug 1504814

Summary: /etc/init.d/nagios stop hangs forever
Product: [Fedora] Fedora EPEL
Component: nagios
Version: el6
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Dimitri Maziuk <dmaziuk>
Assignee: Stephen John Smoogen <smooge>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: affix, athmanem, b.heden, herrold, jose.p.oliveira.oss, lemenkov, nb, ondrejj, peter.meier, redhat, shawn.starr, smooge, smooge, s, swilkerson
Fixed In Version: nagios-4.3.4-7.el6
Last Closed: 2017-12-10 23:28:03 UTC
Type: Bug

Description Dimitri Maziuk 2017-10-20 17:05:06 UTC
Description of problem:

service nagios stop

hangs at or after

```
+ remove_commandfile
+ '[' -p /var/spool/nagios/cmd/nagios.cmd ']'
+ mv -f /var/spool/nagios/cmd/nagios.cmd /var/spool/nagios/cmd/nagios.cmd~
+ echo -n ''
+ dd if=/var/spool/nagios/cmd/nagios.cmd~ count=0
```

(dd redirects stderr to /dev/null so I'm not sure what happens after that)

Version-Release number of selected component (if applicable):

nagios-4.3.2-5.el6.x86_64

How reproducible:

always

Steps to Reproduce:
1. service nagios stop

Actual results:

hangs forever

Expected results:

stop nagios service

Additional info:

Looking at the comment in the script, it looks like the cure is worse than the disease: the dd trick is supposed to prevent a deadlock that may happen; instead it's making one always happen.

Comment 1 Dimitri Maziuk 2017-10-20 17:05:55 UTC
... it *looks* like the cure is worse ...

Comment 2 Stephen John Smoogen 2017-10-20 22:50:54 UTC
I have not been able to replicate this on my el6 systems. Could you let me know if you have selinux enabled, and if you do, whether you have the nagios-selinux rpm installed? After that I would edit the file, comment out the /dev/null part, and see if it says something.

We can go from there.

Comment 3 Dimitri Maziuk 2017-10-20 23:19:36 UTC
No, no selinux here. There is nagiosgraph and a whole lot of custom monitoring scripts (some of which seem to have b0rked on this update) but nothing obvious to explain this. Without the redirect:
```
[root@hereford init.d]# ./nagios stop
Stopping nagios:. done.
0+0 records in
0+0 records out
0 bytes (0 B) copied, 7.2405e-05 s, 0.0 kB/s
```
and it's hanging:
```
[root@hereford ~]# ps auxw | grep nagios
root      5875  0.0  0.0 106512  1668 pts/2    S+   18:07   0:00 /bin/sh ./nagios stop
root      5887  0.0  0.0 106512   860 pts/2    S+   18:07   0:00 /bin/sh ./nagios stop
root      5929  0.0  0.0 103324   824 pts/0    S+   18:10   0:00 grep nagios
```

I do have two more notes after looking:

1. does this look right:
```
(dd if=$NagiosCommandFile~ count=0 2>/dev/null & echo -n "" >$NagiosCommandFile~)
```
-- I mean the single ampersand before echo.

2. Stopping nagios should probably print "done" only after it has removed the command file, cleaned up, etc.

Comment 4 Dimitri Maziuk 2017-10-20 23:23:45 UTC
I tried this:
dd if=/var/spool/nagios/cmd/nagios.cmd~ count=0 of=/dev/null
just to make sure it's not hanging trying to write to the tty, and it's still sitting there. So it looks to me like it's waiting on the pipe -- no idea why
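
For reference, a minimal sketch of the blocking behaviour suspected here: on Linux, opening a FIFO read-only normally blocks until some process opens it for writing, so dd can stall before it copies anything. The temporary path and the use of `timeout` are assumptions for illustration only; none of this is taken from the nagios init script.

```shell
#!/bin/bash
# Hypothetical scratch FIFO; this does not touch the real nagios command file.
fifo=$(mktemp -u /tmp/fifo-demo.XXXXXX)
mkfifo "$fifo"

# Case 1: no writer exists, so dd blocks while opening the FIFO.
# timeout kills it after 2 seconds and reports exit status 124.
timeout 2 dd if="$fifo" count=0 of=/dev/null 2>/dev/null
no_writer_status=$?

# Case 2: a writer shows up one second later; both opens complete
# and dd (count=0) exits straight away with status 0.
( sleep 1; : > "$fifo" ) &
timeout 5 dd if="$fifo" count=0 of=/dev/null 2>/dev/null
with_writer_status=$?
wait

rm -f "$fifo"
echo "no writer: $no_writer_status, with writer: $with_writer_status"
```

If the first status is 124 and the second is 0, the stall is in the open of the pipe rather than in any write to the terminal.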

Comment 5 Dimitri Maziuk 2017-10-20 23:27:39 UTC
PPS: sending `dd` a USR1:
```
dd: opening `/var/spool/nagios/cmd/nagios.cmd~': Interrupted system call
+ rm -f /var/spool/nagios/cmd/nagios.cmd /var/spool/nagios/cmd/nagios.cmd~
+ rm -f /var/log/nagios/status.dat /var/run/nagios/nagios.pid /var/lock/subsys/nagios
```

Comment 6 Stephen John Smoogen 2017-10-20 23:48:55 UTC
OK, the & is correct because it puts that process into the background. The echo "" > $NagiosCommandFile~ should then cause the background task and any other items attached to the pipe to fail and close.

Could you see if the following script also locks?
#!/bin/bash
rm -f /tmp/test.pipe
mkfifo /tmp/test.pipe
ls -1 /tmp > /tmp/test.pipe &
mv /tmp/test.pipe /tmp/test.pipe~
## The () make this a subshell. The & is intentional as we want to bg this.
( dd if=/tmp/test.pipe~ count=0 2>/dev/null & 
  echo -n "" > /tmp/test.pipe~ )
jobs

rm -f /tmp/test.pipe~
### end of script

Comment 7 Stephen John Smoogen 2017-10-21 00:00:44 UTC
OK, the dd output suggests that something is blocking the dd subprocess from opening the pipe file for reading, but the only causes I can think of would be selinux or a filesystem error. What kind of filesystem is /var/spool on this box, and could you do a

ls -lZ /var/spool/nagios/cmd/nagios.cmd*

when it is running so I can see what the file is setup as?

Thanks.

Comment 8 Dimitri Maziuk 2017-10-23 17:45:53 UTC
```
# ls -lZ /var/spool/nagios/cmd/nagios.cmd*
prw-rw---- nagios nagios ?                                /var/spool/nagios/cmd/nagios.cmd
```

It's an ext4 on spinning rust with like 4 out of 200GB used.

The script hangs about 15% of the time. Can't say I see a rhyme or reason to it.
```
[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
^C./testpipe.sh: line 8: /tmp/test.pipe~: Interrupted system call

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
[1]+  Running                 ls -1 /tmp > /tmp/test.pipe &

[root@hereford ~]# ./testpipe.sh
^C./testpipe.sh: line 8: /tmp/test.pipe~: Interrupted system call

[root@hereford ~]# ./testpipe.sh
^C./testpipe.sh: line 8: /tmp/test.pipe~: Interrupted system call
```
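
One possible reading of the intermittent hang, offered as an assumption rather than a confirmed diagnosis: dd with count=0 exits as soon as its open succeeds, and the `ls` writer that is already waiting is enough to let that open succeed. If dd and `ls` are both gone before the echo redirection opens the pipe, the redirection finds no reader and blocks, which would match the "line 8 ... Interrupted system call" message. The sketch below forces that losing order deliberately; the scratch path and `timeout` are illustrative additions.

```shell
#!/bin/bash
# Hypothetical scratch FIFO used only for this demonstration.
fifo=$(mktemp -u /tmp/fifo-race.XXXXXX)
mkfifo "$fifo"

# A writer that is already waiting, like the ls in the test script.
ls -1 /tmp > "$fifo" 2>/dev/null &

# dd's open pairs up with the waiting writer; count=0 makes dd exit
# without reading anything, which closes the only read end.
dd if="$fifo" count=0 2>/dev/null
wait   # ls has finished, or has been stopped by a broken pipe

# A writer arriving late finds no reader, so its open blocks until
# timeout gives up (exit status 124).
timeout 2 bash -c ': > "$1"' _ "$fifo"
late_writer_status=$?

rm -f "$fifo"
echo "late writer exit status: $late_writer_status"
```

In the original script dd and echo run concurrently, so whether echo arrives in time depends on scheduling, which would fit a hang that shows up only some of the time.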

Comment 9 Stephen John Smoogen 2017-10-23 22:03:33 UTC
OK that is weird. I haven't been able to replicate it yet, but I fear that nagios is going to hang whether or not the line is in the service script, because something is holding on to the named pipe and isn't allowing an interrupt.

I am going to get some input from more experienced engineers on what should be done here.

Comment 10 Dimitri Maziuk 2017-10-23 22:22:38 UTC
This is an old dual-core pentium D with 4GB RAM, but about the only other thing that's running there is salt-minion. There's maybe a GB of RAM used and an occasional 0.2% iowait or sys load... can't really say the hardware is strained. Other than that it's a stock centos 6.latest x86_64, it's been running nagios happily for years until this update, nagiosgraph is about the only non-stock thing on it that I can think of.

Comment 11 Stephen John Smoogen 2017-10-24 19:41:56 UTC
The consensus I have gotten so far is that it working for me is a fluke and I will raise it with upstream to remove this from their code. 

https://github.com/NagiosEnterprises/nagioscore/issues/443

I am commenting this out and will put in an updated build.

Comment 12 Dimitri Maziuk 2017-10-24 19:49:38 UTC
LOL

It is a weird one though, I managed to do a couple of /etc/init.d/nagios stops without the hang, too, while playing with it.

Comment 13 Peter Meier 2017-11-16 16:49:30 UTC
I'm observing/affected by the same behavior.

Comment 14 Peter Meier 2017-11-16 16:55:25 UTC
Interestingly enough, it worked fine once I restarted the server. I have now added the fix from upstream. As I regularly restart nagios due to config changes, I assume I will see it again quite soon if it still happens.

Comment 15 Stephen John Smoogen 2017-11-20 20:59:34 UTC
OK I got a probable fix for this. The problem is that the echo also needs to be backgrounded so that the () goes completely into the background. I have tried it on my test boxes and it doesn't hang. But I was not able to trigger it very often. A -5 with this will appear in epel-testing soon.
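
A rough sketch of the idea described in this comment, assuming the change amounts to backgrounding the redirection as well; the actual patch is not shown in this report. With the writer in the background, the calling script carries on even when nothing ever opens the read end. The scratch path, the timing check, and the cleanup step are illustrative additions.

```shell
#!/bin/bash
# Hypothetical scratch FIFO; not the nagios command file.
fifo=$(mktemp -u /tmp/fifo-bg.XXXXXX)
mkfifo "$fifo"

start=$(date +%s)
# Nothing reads the FIFO, so this writer blocks in its open.
# Because it is backgrounded, the subshell returns immediately.
( echo -n "" > "$fifo" & )
elapsed=$(( $(date +%s) - start ))
echo "caller continued after ${elapsed}s"

# Cleanup for the demo: opening the read end releases the blocked
# writer, which writes nothing and exits, so cat sees end-of-file.
timeout 5 cat "$fifo" > /dev/null
cleanup_status=$?
rm -f "$fifo"
```

Note that the blocked writer does not go away by itself; in this sketch it lingers until a reader opens the pipe.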

Comment 16 R P Herrold 2017-11-20 21:20:45 UTC
If accessing through ssh, adding
   -t
sometimes helps, as does, when applicable, adding a
   < /dev/null

to keep it from hanging while looking for stdin.

Comment 17 Dimitri Maziuk 2017-11-20 21:26:44 UTC
ssh would probably be the most common: restart after reconfig, but you could fire it from e.g. salt state or something too...

Comment 18 Fedora Update System 2017-11-20 23:08:09 UTC
nagios-4.3.4-5.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-723d26389e

Comment 19 Fedora Update System 2017-11-21 19:28:05 UTC
nagios-4.3.4-5.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-723d26389e

Comment 20 Fedora Update System 2017-11-24 13:16:15 UTC
nagios-4.3.4-6.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-ecb67df0a6

Comment 21 Fedora Update System 2017-11-25 05:23:13 UTC
nagios-4.3.4-7.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-ecb67df0a6

Comment 22 Fedora Update System 2017-12-10 23:28:03 UTC
nagios-4.3.4-7.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.