Description of problem:
service nagios stop hangs at or after
```
+ remove_commandfile
+ '[' -p /var/spool/nagios/cmd/nagios.cmd ']'
+ mv -f /var/spool/nagios/cmd/nagios.cmd /var/spool/nagios/cmd/nagios.cmd~
+ echo -n ''
+ dd if=/var/spool/nagios/cmd/nagios.cmd~ count=0
```
(dd redirects stderr to /dev/null so I'm not sure what happens after that)

Version-Release number of selected component (if applicable):
nagios-4.3.2-5.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. service nagios stop

Actual results:
hangs forever

Expected results:
stop nagios service

Additional info:
Looking at the comment in the script, like the cure is worse than the disease: dd trick is supposed to prevent a deadlock that may happen, instead it's making one always happen.
... it *looks* like the cure is worse ...
I have not been able to replicate this on my el6 systems. Could you let me know if you have selinux enabled and, if you do, whether you have the nagios-selinux rpm installed? After that I would edit the file, comment out the /dev/null part, and see if it says something. We can go from there.
No, no selinux here. There is nagiosgraph and a whole lot of custom monitoring scripts (some of which seem to have b0rked on this update) but nothing obvious to explain this.

Without the redirect:
```
[root@hereford init.d]# ./nagios stop
Stopping nagios:. done.
0+0 records in
0+0 records out
0 bytes (0 B) copied, 7.2405e-05 s, 0.0 kB/s
```
and it's hanging:
```
[root@hereford ~]# ps auxw | grep nagios
root 5875 0.0 0.0 106512 1668 pts/2 S+ 18:07 0:00 /bin/sh ./nagios stop
root 5887 0.0 0.0 106512 860 pts/2 S+ 18:07 0:00 /bin/sh ./nagios stop
root 5929 0.0 0.0 103324 824 pts/0 S+ 18:10 0:00 grep nagios
```

I do have two more notes after looking:

1. Does this look right:
```
(dd if=$NagiosCommandFile~ count=0 2>/dev/null & echo -n "" >$NagiosCommandFile~)
```
-- I mean the single ampersand before echo.

2. Stopping nagios should probably print "done" only after it has removed the command file, cleaned up, etc.
I tried this: `dd if=/var/spool/nagios/cmd/nagios.cmd~ count=0 of=/dev/null` just to make sure it's not hanging while trying to write to the tty, and it's still sitting there. So it looks to me like it's waiting on the pipe -- no idea why.
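For reference, a `dd` that sits in the open of a named pipe is the normal FIFO behaviour when nothing has the other end open: on Linux a blocking open for reading waits until some process opens the same FIFO for writing. The sketch below shows this on a scratch FIFO, not on the nagios command file. The path under /tmp and the use of GNU coreutils `timeout` are assumptions for illustration only, and it does not explain why no writer shows up in the init script case.

```shell
#!/bin/bash
# Scratch FIFO for the demonstration (hypothetical path, not the nagios one).
fifo=$(mktemp -u /tmp/fifo-demo.XXXXXX)
mkfifo "$fifo"

# Case 1: no writer exists, so dd blocks in open() until timeout kills it.
# GNU timeout reports a timed-out command with exit status 124.
timeout 1 dd if="$fifo" count=0 of=/dev/null 2>/dev/null
rc_alone=$?
echo "reader without a writer: exit $rc_alone"

# Case 2: a background writer opens the FIFO, so both opens complete.
# ':' writes nothing; the redirection alone performs the open for writing.
( : > "$fifo" ) &
timeout 5 dd if="$fifo" count=0 of=/dev/null 2>/dev/null
rc_paired=$?
echo "reader with a writer: exit $rc_paired"

wait
rm -f "$fifo"
```

The same holds in the other direction: a shell redirection such as `> fifo` blocks until a reader opens the pipe.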
PPS: sending `dd` a USR1:
```
dd: opening `/var/spool/nagios/cmd/nagios.cmd~': Interrupted system call
+ rm -f /var/spool/nagios/cmd/nagios.cmd /var/spool/nagios/cmd/nagios.cmd~
+ rm -f /var/log/nagios/status.dat /var/run/nagios/nagios.pid /var/lock/subsys/nagios
```
OK, the & is correct: it puts that process into the background. The echo "" > $NagiosCommandFile~ should then cause the background task and anything else attached to the pipe to fail and close. Could you see if the following script also locks?

    #!/bin/bash

    rm -f /tmp/test.pipe
    mkfifo /tmp/test.pipe
    ls -1 /tmp > /tmp/test.pipe &
    mv /tmp/test.pipe /tmp/test.pipe~
    ## The () make this a subshell. The & is intentional as we want to bg this.
    ( dd if=/tmp/test.pipe~ count=0 2>/dev/null & echo -n "" > /tmp/test.pipe~ )
    jobs
    rm -f /tmp/test.pipe~
    ### end of script
OK, the dd output says something is blocking the dd subprocess from reading the pipe file, but the only causes I can think of would be selinux or a filesystem error. What kind of filesystem is /var/spool on this box, and could you run ls -lZ /var/spool/nagios/cmd/nagios.cmd* while nagios is running so I can see how the file is set up? Thanks.
```
# ls -lZ /var/spool/nagios/cmd/nagios.cmd*
prw-rw---- nagios nagios ? /var/spool/nagios/cmd/nagios.cmd
```
It's an ext4 on spinning rust with like 4 out of 200GB used.

The script hangs about 15% of the time. Can't say I see a rhyme or reason to it.
```
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
^C./testpipe.sh: line 8: /tmp/test.pipe~: Interrupted system call
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
[1]+ Running ls -1 /tmp > /tmp/test.pipe &
[root@hereford ~]# ./testpipe.sh
^C./testpipe.sh: line 8: /tmp/test.pipe~: Interrupted system call
[root@hereford ~]# ./testpipe.sh
^C./testpipe.sh: line 8: /tmp/test.pipe~: Interrupted system call
```
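An intermittent hang like this can be counted without sitting at the terminal with ^C. The following is only a rough sketch: `count_hangs` is a hypothetical helper, the time limits are arbitrary, and it relies on GNU coreutils `timeout` exiting with status 124 when the limit is hit. The two calls at the end use `true` and `sleep` as stand-ins so the sketch runs anywhere; the real target would be `./testpipe.sh`.

```shell
#!/bin/bash
# count_hangs RUNS LIMIT CMD [ARGS...]
# Hypothetical helper: runs CMD RUNS times, kills any run that exceeds
# LIMIT seconds, and prints how many runs had to be killed.
count_hangs() {
    local runs=$1 limit=$2 hangs=0 i
    shift 2
    for ((i = 0; i < runs; i++)); do
        timeout "$limit" "$@" >/dev/null 2>&1
        # Exit status 124 means timeout expired and the command was killed.
        if [ $? -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
    done
    echo "$hangs"
}

# Stand-in commands: 'true' never hangs, 'sleep 5' always exceeds 1 second.
quick=$(count_hangs 3 2 true)
stuck=$(count_hangs 2 1 sleep 5)
echo "quick: $quick hung, stuck: $stuck hung"
```

Something like `count_hangs 100 10 ./testpipe.sh` would then give a hang count per 100 runs, assuming 10 seconds is long enough for a healthy run.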
OK, that is weird. I haven't been able to replicate it yet, but I fear that nagios is going to hang whether or not the line is in the services, because something is holding on to the named pipe and isn't allowing an interrupt. I am going to get some input from more experienced engineers on what should be done here.
This is an old dual-core pentium D with 4GB RAM, but about the only other thing that's running there is salt-minion. There's maybe a GB of RAM used and an occasional 0.2% iowait or sys load... can't really say the hardware is strained. Other than that it's a stock centos 6.latest x86_64, it's been running nagios happily for years until this update, nagiosgraph is about the only non-stock thing on it that I can think of.
The consensus I have gotten so far is that it working for me is a fluke and I will raise it with upstream to remove this from their code. https://github.com/NagiosEnterprises/nagioscore/issues/443 I am commenting this out and will put in an updated build.
LOL It is a weird one though, I managed to do a couple of /etc/init.d/nagios stops without the hang, too, while playing with it.
I'm observing/affected by the same behavior.
Interestingly enough, it worked fine once I restarted the server. I have now added the fix from upstream. As I regularly restart nagios due to config changes, I assume I will see it again quite soon if it still happens.
OK I got a probable fix for this. The problem is that the echo also needs to be backgrounded so that the () goes completely into the background. I have tried it on my test boxes and it doesn't hang. But I was not able to trigger it very often. A -5 with this will appear in epel-testing soon.
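The idea of backgrounding both halves can be sketched on a scratch FIFO as below. This is not the packaged change: `drain_fifo` is a hypothetical name, the /tmp path is made up, and the `timeout` wrappers (GNU coreutils) are an extra guard that is not mentioned in this bug, added so that neither side can stay blocked forever if its peer never opens the pipe.

```shell
#!/bin/bash
# drain_fifo FIFO
# Hypothetical helper: opens FIFO once for reading and once for writing,
# both in the background, so the caller never blocks on either open.
drain_fifo() {
    local fifo=$1
    (
        # Reader: the open completes once a writer has the FIFO open.
        timeout 5 dd if="$fifo" count=0 of=/dev/null 2>/dev/null &
        # Writer: ':' writes nothing; the redirection performs the open.
        timeout 5 sh -c ': > "$1"' sh "$fifo" >/dev/null 2>&1 &
    )
}

# Demonstration on a scratch FIFO (hypothetical path).
fifo=$(mktemp -u /tmp/drain-demo.XXXXXX)
mkfifo "$fifo"

start=$SECONDS
drain_fifo "$fifo"
elapsed=$((SECONDS - start))
echo "drain_fifo returned after ${elapsed}s"

sleep 1          # give the background pair a moment to meet and exit
rm -f "$fifo"
```

Because the subshell only launches the two jobs and exits, the caller continues immediately whether or not the reader and writer ever meet.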
If accessing through ssh, adding -t sometimes helps, as does, when applicable, adding < /dev/null to keep it from hanging while looking for stdin.
ssh would probably be the most common: restart after reconfig, but you could fire it from e.g. salt state or something too...
nagios-4.3.4-5.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-723d26389e
nagios-4.3.4-5.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-723d26389e
nagios-4.3.4-6.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-ecb67df0a6
nagios-4.3.4-7.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-ecb67df0a6
nagios-4.3.4-7.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.