Bug 980632
Summary: Chrooted named does not start after unclean shutdown of server
Product: Red Hat Enterprise Linux 6
Reporter: Simon Matter <simon.matter>
Component: bind
Assignee: Tomáš Hozza <thozza>
Status: CLOSED ERRATA
QA Contact: qe-baseos-daemons
Severity: unspecified
Priority: unspecified
Version: 6.4
CC: hhorak, milan.kerslager, simon.matter, tgl
Target Milestone: rc
Keywords: Patch
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: bind-9.8.2-0.26.rc1.el6
Doc Type: Bug Fix
Doc Text:
Cause:
The named initscript did not check whether the PID written in the named PID file actually belonged to a running named server.
Consequence:
After an unclean server shutdown, the PID written in the PID file could belong to an unrelated existing process while the named server was not running. As a consequence, the initscript identified the server as running and the user was unable to start it.
Fix:
The initscript now checks whether the PID written in the PID file belongs to a running named server. If it does not, the initscript deletes the PID file. The check is performed before starting, stopping, or reloading the server and before checking its status.
Result:
Now, if after an unclean server shutdown the PID written in the PID file belongs to an existing process while the named server is not running, the initscript identifies the server as not running. The user is therefore able to start the server without problems.
Story Points: ---
Last Closed: 2014-10-14 04:34:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 1070830

Description
Simon Matter
2013-07-02 21:40:16 UTC
Attached patch seems to fix it.

Created attachment 768165 [details]
Start named after unclean shutdown of server

Hi Simon. Thank you for your report and patch. I'll have a look at it in the near future.

I just ran into this same situation, but I would suggest that the problem is only partly that the start script isn't being very careful about whether named is really running. Rather, the difficulty is that the system expects dead pidfiles to be cleaned out during reboot, and this does not happen for /var/named/chroot/var/run. I patched it like this:

$ diff -c /etc/rc.d/rc.sysinit~ /etc/rc.d/rc.sysinit
*** /etc/rc.d/rc.sysinit~ Thu Mar 7 10:37:43 2013
--- /etc/rc.d/rc.sysinit Thu Aug 1 15:54:57 2013
***************
*** 588,593 ****
--- 588,595 ----
  # Clean up /var.
  rm -rf /var/lock/cvs/* /var/run/screen/*
  find /var/lock /var/run ! -type d -exec rm -f {} \;
+ # bind chroot needs that too -- tgl
+ find /var/named/chroot/var/run ! -type d -exec rm -f {} \;
  rm -f /var/lib/rpm/__db* &> /dev/null
  rm -f /var/gdm/.gdmfifo &> /dev/null

but there may be a cleaner place to do it.

(In reply to Simon Matter from comment #0)
> Description of problem:
> Chrooted named does not start after unclean shutdown and new boot of server
> while being configured to start automatically.
>
> Version-Release number of selected component (if applicable):
> bind-9.8.2-0.17.rc1.el6_4.4.x86_64
>
> How reproducible:
> Always
>
> Steps to Reproduce:
> 1. chkconfig named on ; service named start
> 2. power off server
> 3. power on server
>
> Actual results:
> Named does not start with the message "named: already running"
>
> Expected results:
> Named should start on boot automatically
>
> Additional info:
> The problem seems to be that after unclean shutdown, on the next boot
> /etc/rc.sysinit cleans up pid files in /var/run/ but with chrooted bind,
> /var/run/named.pid is a symlink to
> /var/named/chroot//var/run/named/named.pid.
> A quick look into the init script indicates that pidofnamed() checks the
> file "$ROOTDIR/$PIDFILE" while the symlink /var/run/"$named".pid seems more
> appropriate.

I'm trying to reproduce your problem, but with no luck so far. named initscript has no problem with existing pid file in the chroot, since "pidofnamed" function is as follows:

pidofnamed() {
  pidofproc -p "$ROOTDIR/$PIDFILE" "$named";
}

This means that even if the pid file is present after unclean server shutdown and named is really not running, the pidofnamed returns empty string and named is started without any problems. Are you sure there is no other problem? Is there anything else logged in named log or in /var/log/messages? Thanks!

(In reply to Tom Lane from comment #4)
> I just ran into this same situation, but I would suggest that the problem is
> only partly that the start script isn't being very careful about whether
> named is really running. Rather, the difficulty is that the system expects
> dead pidfiles to be cleaned out during reboot, and this does not happen for
> /var/named/chroot/var/run. I patched it like this:

It is not nice that the pid file remains in the chroot in case of unclean shutdown, but as far as I tested it's causing no problems with starting named. Please check system and named logs if there is anything suspicious. Thanks!

(In reply to Tomas Hozza from comment #5)
> (In reply to Tom Lane from comment #4)
> > I just ran into this same situation, but I would suggest that the problem is
> > only partly that the start script isn't being very careful about whether
> > named is really running. Rather, the difficulty is that the system expects
> > dead pidfiles to be cleaned out during reboot, and this does not happen for
> > /var/named/chroot/var/run. I patched it like this:
>
> It is not nice that the pid file remains in the chroot in case of unclean
> shutdown, but as far as I tested it's causing no problems with starting
> named. Please check system and named logs if there is anything suspicious.

There isn't anything at all, AFAICS, that's related to named in my /var/log/messages from the boot cycle where named failed to start.
A quick look at the init script suggests that there wouldn't be --- it looks to me like the daemon() function just doesn't print anything if it thinks the pidfile indicates the program is already running.

I've had some experience with this type of bug, and it's not 100% reproducible. What happens is that the PID assigned to the daemon process will be nearly, but not always exactly, the same from boot cycle to boot cycle. So what can happen is that the instant that somebody looks at the stale pidfile, that PID *is* in use in the new boot cycle, but not by the daemon but just some random shell script or subprocess that's part of the boot sequence. It's not exactly uncommon for the PID to belong to the process that's looking at the PID file and would launch the daemon as its next step (giving the daemon the next higher PID) if it wasn't confused by the chance match.

AFAICS there are no defenses against false matches in the named initscript, so I suspect that I got burned by a false match. A full defense against false matches is pretty hard, so arranging to nuke leftover pidfiles at boot is the easiest fix.

BTW, it's not the pidofnamed test that gets confused in this scenario. Rather, it's __pids_var_run (in the functions file, it's called by the daemon function) that decides the PID is in use. And that function is seriously stupid about false matches: if it finds the PID in /proc, it thinks there's a match, without any examination of that process's name or ownership or anything. Doesn't even notice if it's its *own* PID.

It might be an idea to file an RFE to make __pids_var_run at least a bit more wary. I'm not sure what would be a general purpose check though.

Thanks Tom, I think we are getting closer now.

Tomas, when you tried to reproduce it, I think my description was not detailed enough. Maybe the procedure below is better.

Steps to Reproduce:
1. Install chrooted bind including bind-chroot package and configure it.
2. chkconfig named on
3. shutdown -r now
4. wait until the machine has completely started
5. power off server
6. power on server

Now, there is a chance the newly started named gets the same PID as before and then may not start?

I think /var/log/boot.log should possibly show more info. Unfortunately I can not power off the server in question at the moment to test.

Regards,
Simon

(In reply to Simon Matter from comment #8)
> Now, there is a chance the newly started named gets the same PID as before
> and then may not start?

Actually, the dangerous scenario is where the new named should have gotten a PID a few counts higher than named had in the system's previous life, and by bad luck that PID is in use by the initscript itself (or some child process, or some nearby other process of the boot sequence) at the moment the initscript looks. While you can trigger this just with repeated unclean-shutdown & restart, it's a bit more likely to happen if you just added some other service that starts before named does, so that some more PIDs got eaten before the named script starts.

It might be worth noting that when I got burnt, I was testing powerfail scenarios with a new UPS, and had accidentally caused the unclean shutdown due to premature removal of power. It's quite possible that I'd added or changed the "ups" service during the previous boot cycle, but I don't remember for sure anymore.

You're right, so here we go:

1. service named start
2. killall -KILL named
3. echo 1 > /var/named/chroot/var/run/named/named.pid
4. service named start

That's what can happen on boot after unclean shutdown.

@Tomas, please have a look here and at the attached new patch.

Steps to reproduce:
1. service named start
2. killall -KILL named
3. echo 1 > /var/named/chroot/var/run/named/named.pid
# here is what /etc/rc.sysinit does
4. rm -f /var/run/named.pid
5. service named start

Without the patch it says named is already running, with the patch it starts named and also updates the content of named.pid.

Created attachment 789462 [details]
Start named after unclean shutdown of server
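
The false match discussed in the comments above can be illustrated without touching a real named installation. The following is only a sketch, assuming a Linux /proc filesystem. `naive_pid_in_use` is a hypothetical stand-in for the kind of check attributed to `__pids_var_run` above (the PID exists in /proc, nothing else is examined); it is not the code from the functions file. The stale pidfile lives in a temporary directory, and the PID of the running shell plays the role of an unrelated boot-time process.

```shell
#!/bin/bash
# Hypothetical stand-in for the check described in the comments:
# a PID counts as "in use" as soon as /proc/<pid> exists.
naive_pid_in_use() {
    local pidfile="$1" pid=""
    [ -f "$pidfile" ] || return 1
    read -r pid < "$pidfile"
    [ -n "$pid" ] && [ -d "/proc/$pid" ]
}

# Simulate a pidfile left behind by an unclean shutdown. It holds the PID
# of this shell, which is alive but is obviously not named.
demo_dir=$(mktemp -d)
stale_pidfile="$demo_dir/named.pid"
echo "$$" > "$stale_pidfile"

if naive_pid_in_use "$stale_pidfile"; then
    echo "false match: PID $$ is alive, but it is not named"
fi

# Remove the temporary directory used for the demonstration.
rm -rf "$demo_dir"
```

Because the check never looks at the process name, any live process that happens to hold the recorded PID is enough to make the daemon look as if it were already running.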

(In reply to Simon Matter from comment #11)
> @Tomas, please have a look here and at the attached new patch.
>
> Steps to reproduce:
> 1. service named start
> 2. killall -KILL named
> 3. echo 1 > /var/named/chroot/var/run/named/named.pid
> # here is what /etc/rc.sysinit does
> 4. rm -f /var/run/named.pid
> 5. service named start

Thanks for the explanation of your issue. I can see it now. However, I think that the issue is bigger than it seems.

The patch has one problem: 'pidofproc' without a given pidfile would also identify a named instance not run by the initscript as running, even if that instance has its pidfile somewhere else or has no pidfile at all.

The bigger problem I'm talking about is that 'service named status/stop' also don't work as expected after such an unexpected server shutdown. 'service named stop' would try to kill the process with the PID written in "$ROOTDIR/$PIDFILE". 'service named status' would report that rndc can not connect to the server. If there is a process with the PID written in the pidfile, then it would report that the process with pid ... is running.

I think it would be better to add an extra function that would check the name of the process with the PID acquired via 'pidofproc'. If the names don't match, it would delete the pidfile ("$ROOTDIR/$PIDFILE"). Then the function should be used in start/stop/status.

Created attachment 789572 [details]
Proposed patch
Proposed patch. Tested with reproducer. service named start/stop/status works
as expected with this patch after unclean server shutdown.
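
The attached patch is not reproduced in this report. Purely as an illustration, the approach described above (compare the name of the process behind the recorded PID with the daemon name and delete the pidfile on a mismatch) could look roughly like the sketch below. It assumes Linux, where /proc/<pid>/comm holds the process name. `clean_stale_pidfile` and its arguments are hypothetical names, not taken from the patch.

```shell
#!/bin/bash
# Hypothetical helper: remove PIDFILE unless the PID stored in it belongs
# to a live process whose name is PROG.
# Returns 0 if the pidfile was kept (PROG appears to be running),
# 1 if the pidfile was missing or has been removed as stale.
clean_stale_pidfile() {
    local pidfile="$1" prog="$2" pid="" comm=""
    [ -f "$pidfile" ] || return 1
    read -r pid < "$pidfile"
    case "$pid" in
        ''|*[!0-9]*)
            # Empty or not a plain decimal PID: treat the file as stale.
            ;;
        *)
            if [ -r "/proc/$pid/comm" ]; then
                read -r comm < "/proc/$pid/comm"
                # Keep the pidfile only when the process name matches.
                [ "$comm" = "$prog" ] && return 0
            fi
            ;;
    esac
    rm -f "$pidfile"
    return 1
}
```

Under these assumptions, an initscript would call something like `clean_stale_pidfile "$ROOTDIR/$PIDFILE" "$named"` at the top of its start, stop, reload and status branches, before the existing logic looks at the pidfile.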
Created attachment 789585 [details]
Updated patch including reload
May I propose this updated patch which also works for the reload target.

(In reply to Simon Matter from comment #15)
> Created attachment 789585 [details]
> Updated patch including reload
>
> May I propose this updated patch which also works for the reload target.

Thanks for catching this. Now it should be all covered.

*** Bug 1039685 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1373.html