Bug 980632

Summary: Chrooted named does not start after unclean shutdown of server
Product: Red Hat Enterprise Linux 6 Reporter: Simon Matter <simon.matter>
Component: bind Assignee: Tomáš Hozza <thozza>
Status: CLOSED ERRATA QA Contact: qe-baseos-daemons
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.4 CC: hhorak, milan.kerslager, simon.matter, tgl
Target Milestone: rc Keywords: Patch
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: bind-9.8.2-0.26.rc1.el6 Doc Type: Bug Fix
Doc Text:
Cause: The named initscript did not check whether the PID written in the named PID file really belonged to a running named server.
Consequence: After an unclean server shutdown, the PID written in the PID file could belong to an unrelated existing process while the named server was not running. The initscript then identified the server as running, and the user was unable to start it.
Fix: The initscript now checks whether the PID written in the PID file belongs to a running named server and deletes the PID file if it does not. The check is performed before starting, stopping, or reloading the server and before checking its status.
Result: If, after an unclean server shutdown, the PID written in the PID file belongs to an existing process but the named server is not running, the initscript identifies the server as not running. The user is therefore able to start the server without problems.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-14 04:34:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1070830    
Attachments:
Description | Flags
Start named after unclean shutdown of server | none
Start named after unclean shutdown of server | none
Proposed patch | none
Updated patch including reload | none

Description Simon Matter 2013-07-02 21:40:16 UTC
Description of problem:
Chrooted named does not start after unclean shutdown and new boot of server while being configured to start automatically.

Version-Release number of selected component (if applicable):
bind-9.8.2-0.17.rc1.el6_4.4.x86_64

How reproducible:
Always

Steps to Reproduce:
1. chkconfig named on ; service named start
2. power off server
3. power on server

Actual results:
Named does not start with the message "named: already running"

Expected results:
Named should start on boot automatically

Additional info:
The problem seems to be that after unclean shutdown, on the next boot /etc/rc.sysinit cleans up pid files in /var/run/ but with chrooted bind, /var/run/named.pid is a symlink to /var/named/chroot//var/run/named/named.pid.
A quick look into the init script indicates that pidofnamed() checks the file "$ROOTDIR/$PIDFILE" while the symlink /var/run/"$named".pid seems more appropriate.
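
The effect described above can be illustrated in a scratch directory. This is a minimal sketch, not taken from rc.sysinit or the named initscript: the layout under a temporary directory and the function name demo_symlink_cleanup are made up for illustration. It shows that a find/rm cleanup of the directory holding the symlink removes only the symlink and leaves the real pid file inside the chroot untouched.

```shell
#!/bin/sh
# Hypothetical demo: mimic the chroot pid file layout inside a temp directory.
demo_symlink_cleanup() {
    tmp=$(mktemp -d) || return 1
    mkdir -p "$tmp/chroot/var/run/named" "$tmp/var/run"
    # Stale pid file left behind inside the (fake) chroot.
    echo 1234 > "$tmp/chroot/var/run/named/named.pid"
    # Symlink outside the chroot, as described in the report.
    ln -s "$tmp/chroot/var/run/named/named.pid" "$tmp/var/run/named.pid"
    # Same style of cleanup that the boot script applies to /var/run.
    find "$tmp/var/run" ! -type d -exec rm -f {} \;
    if [ -e "$tmp/chroot/var/run/named/named.pid" ]; then
        echo "target survived"
    else
        echo "target removed"
    fi
    rm -rf "$tmp"
}

demo_symlink_cleanup
```

Because find does not follow symlinks by default and rm removes the link itself, the stale file in the chroot is still there on the next start.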

I know, such power off situations should never happen, but in RL they do.

Comment 1 Simon Matter 2013-07-03 08:56:31 UTC
The attached patch seems to fix it.

Comment 2 Simon Matter 2013-07-03 08:57:42 UTC
Created attachment 768165 [details]
Start named after unclean shutdown of server

Comment 3 Tomáš Hozza 2013-07-03 12:27:39 UTC
Hi Simon.

Thank you for your report and patch. I'll have a look at it in the near future.

Comment 4 Tom Lane 2013-08-01 20:15:28 UTC
I just ran into this same situation, but I would suggest that the problem is only partly that the start script isn't being very careful about whether named is really running.  Rather, the difficulty is that the system expects dead pidfiles to be cleaned out during reboot, and this does not happen for /var/named/chroot/var/run.  I patched it like this:

$ diff -c /etc/rc.d/rc.sysinit~ /etc/rc.d/rc.sysinit 
*** /etc/rc.d/rc.sysinit~       Thu Mar  7 10:37:43 2013
--- /etc/rc.d/rc.sysinit        Thu Aug  1 15:54:57 2013
***************
*** 588,593 ****
--- 588,595 ----
  # Clean up /var.
  rm -rf /var/lock/cvs/* /var/run/screen/*
  find /var/lock /var/run ! -type d -exec rm -f {} \;
+ # bind chroot needs that too -- tgl
+ find /var/named/chroot/var/run ! -type d -exec rm -f {} \;
  rm -f /var/lib/rpm/__db* &> /dev/null
  rm -f /var/gdm/.gdmfifo &> /dev/null
  
but there may be a cleaner place to do it.

Comment 5 Tomáš Hozza 2013-08-22 09:17:40 UTC
(In reply to Simon Matter from comment #0)
> Description of problem:
> Chrooted named does not start after unclean shutdown and new boot of server
> while being configured to start automatically.
> 
> Version-Release number of selected component (if applicable):
> bind-9.8.2-0.17.rc1.el6_4.4.x86_64
> 
> How reproducible:
> Always
> 
> Steps to Reproduce:
> 1. chkconfig named on ; service named start
> 2. power off server
> 3. power on server
> 
> Actual results:
> Named does not start with the message "named: already running"
> 
> Expected results:
> Named should start on boot automatically
> 
> Additional info:
> The problem seems to be that after unclean shutdown, on the next boot
> /etc/rc.sysinit cleans up pid files in /var/run/ but with chrooted bind,
> /var/run/named.pid is a symlink to
> /var/named/chroot//var/run/named/named.pid.
> A quick look into the init script indicates that pidofnamed() checks the
> file "$ROOTDIR/$PIDFILE" while the symlink /var/run/"$named".pid seems more
> appropriate.

I'm trying to reproduce your problem, but with no luck so far. The named
initscript has no problem with an existing pid file in the chroot, since the
"pidofnamed" function is as follows:

pidofnamed() {
	pidofproc -p "$ROOTDIR/$PIDFILE" "$named";
}

This means that even if the pid file is present after an unclean server shutdown
and named is really not running, pidofnamed returns an empty string and named
is started without any problems.

Are you sure there is no other problem? Is there anything else logged in named
log or in /var/log/messages?

Thanks!


(In reply to Tom Lane from comment #4)
> I just ran into this same situation, but I would suggest that the problem is
> only partly that the start script isn't being very careful about whether
> named is really running.  Rather, the difficulty is that the system expects
> dead pidfiles to be cleaned out during reboot, and this does not happen for
> /var/named/chroot/var/run.  I patched it like this:

It is not nice that the pid file remains in the chroot in case of unclean
shutdown, but as far as I tested it's causing no problems with starting
named. Please check system and named logs if there is anything suspicious.

Thanks!

Comment 6 Tom Lane 2013-08-22 14:32:17 UTC
(In reply to Tomas Hozza from comment #5)
> (In reply to Tom Lane from comment #4)
> > I just ran into this same situation, but I would suggest that the problem is
> > only partly that the start script isn't being very careful about whether
> > named is really running.  Rather, the difficulty is that the system expects
> > dead pidfiles to be cleaned out during reboot, and this does not happen for
> > /var/named/chroot/var/run.  I patched it like this:
> 
> It is not nice that the pid file remains in the chroot in case of unclean
> shutdown, but as far as I tested it's causing no problems with starting
> named. Please check system and named logs if there is anything suspicious.

There isn't anything at all, AFAICS, that's related to named in my /var/log/messages from the boot cycle where named failed to start.  A quick look at the init script suggests that there wouldn't be --- it looks to me like the daemon() function just doesn't print anything if it thinks the pidfile indicates the program is already running.

I've had some experience with this type of bug, and it's not 100% reproducible.  What happens is that the PID assigned to the daemon process will be nearly, but not always exactly, the same from boot cycle to boot cycle.  So what can happen is that the instant that somebody looks at the stale pidfile, that PID *is* in use in the new boot cycle, but not by the daemon but just some random shell script or subprocess that's part of the boot sequence.  It's not exactly uncommon for the PID to belong to the process that's looking at the PID file and would launch the daemon as its next step (giving the daemon the next higher PID) if it wasn't confused by the chance match.  AFAICS there are no defenses against false matches in the named initscript, so I suspect that I got burned by a false match.  A full defense against false matches is pretty hard, so arranging to nuke leftover pidfiles at boot is the easiest fix.

Comment 7 Tom Lane 2013-08-22 14:38:53 UTC
BTW, it's not the pidofnamed test that gets confused in this scenario.  Rather, it's __pids_var_run (in the functions file, it's called by the daemon function) that decides the PID is in use.  And that function is seriously stupid about false matches: if it finds the PID in /proc, it thinks there's a match, without any examination of that process's name or ownership or anything.  Doesn't even notice if it's its *own* PID.

It might be an idea to file an RFE to make __pids_var_run at least a bit more wary.  I'm not sure what would be a general purpose check though.
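
To make the false match concrete, here is a minimal sketch of an existence-only check of the kind described above. It is not the code from the functions file; naive_pid_in_use is a hypothetical name, and the sketch assumes a Linux /proc filesystem. It reports a match for any live PID, including the PID of the shell that runs the check.

```shell
#!/bin/sh
# Hypothetical existence-only check: succeeds if the PID stored in the
# pid file belongs to ANY live process, regardless of its name.
naive_pid_in_use() {
    pidfile=$1
    [ -f "$pidfile" ] || return 1
    pid=
    read -r pid < "$pidfile" || :
    [ -n "$pid" ] && [ -d "/proc/$pid" ]
}

# Simulate a stale pid file whose PID was reused by this very shell.
stale=$(mktemp) || exit 1
echo "$$" > "$stale"
if naive_pid_in_use "$stale"; then
    echo "false match: PID $$ is alive, but it is not named"
fi
rm -f "$stale"
```

A check like this cannot tell a running named from a boot-time shell script that happens to hold the same PID.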

Comment 8 Simon Matter 2013-08-22 14:55:22 UTC
Thanks Tom, I think we are getting closer now.

Tomas, when you tried to reproduce it, I think my description was not detailed enough. Maybe the procedure below is better.

Steps to Reproduce:
1. Install chrooted bind including bind-chroot package and configure it.
2. chkconfig named on
3. shutdown -r now
4. wait until the machine has completely started
5. power off server
6. power on server

Now, there is a chance the newly started named gets the same PID as before and then may not start?

I think /var/log/boot.log should possibly show more info.

Unfortunately I can not power off the server in question at the moment to test.

Regards,
Simon

Comment 9 Tom Lane 2013-08-22 15:06:42 UTC
(In reply to Simon Matter from comment #8)
> Now, there is a chance the newly started named gets the same PID as before
> and then may not start?

Actually, the dangerous scenario is where the new named should have gotten a PID a few counts higher than named had in the system's previous life, and by bad luck that PID is in use by the initscript itself (or some child process, or some nearby other process of the boot sequence) at the moment the initscript looks.

While you can trigger this just with repeated unclean-shutdown & restart, it's a bit more likely to happen if you just added some other service that starts before named does, so that some more PIDs got eaten before the named script starts.

It might be worth noting that when I got burnt, I was testing powerfail scenarios with a new UPS, and had accidentally caused the unclean shutdown due to premature removal of power.  It's quite possible that I'd added or changed the "ups" service during the previous boot cycle, but I don't remember for sure anymore.

Comment 10 Simon Matter 2013-08-22 15:30:48 UTC
You're right, so here we go:

1. service named start
2. killall -KILL named
3. echo 1 > /var/named/chroot/var/run/named/named.pid
4. service named start

That's what can happen on boot after unclean shutdown.

Comment 11 Simon Matter 2013-08-23 06:16:40 UTC
@Tomas, please have a look here and at the attached new patch.

Steps to reproduce:
1. service named start
2. killall -KILL named
3. echo 1 > /var/named/chroot/var/run/named/named.pid
# here is what /etc/rc.sysinit does
4. rm -f /var/run/named.pid
5. service named start

Without the patch it says named is already running; with the patch it starts named and also updates the content of named.pid.

Comment 12 Simon Matter 2013-08-23 06:18:34 UTC
Created attachment 789462 [details]
Start named after unclean shutdown of server

Comment 13 Tomáš Hozza 2013-08-23 11:45:23 UTC
(In reply to Simon Matter from comment #11)
> @Tomas, please have a look here and at the attached new patch.
> 
> Steps to reproduce:
> 1. service named start
> 2. killall -KILL named
> 3. echo 1 > /var/named/chroot/var/run/named/named.pid
> # here is what /etc/rc.sysinit does
> 4. rm -f /var/run/named.pid
> 5. service named start

Thanks for the explanation of your issue. I can see it now. However, I think
that the issue is bigger than it seems.

The patch has one problem: 'pidofproc' without a given pidfile would also
identify a named instance not started by the initscript as running, even if
that instance has its pidfile somewhere else or has no pidfile at all.

The bigger problem I'm talking about is that 'service named status/stop' also
doesn't work as expected after such an unexpected server shutdown.

'service named stop' would try to kill the process with PID written in
"$ROOTDIR/$PIDFILE".

'service named status' would report that rndc cannot connect to the
server. If there is a process with the PID written in the pidfile, it would
then report that the process with pid ... is running.

I think it would be better to add an extra function that would check the name
of the process with the PID acquired via 'pidofproc'. If the names don't match,
it would delete the pidfile ("$ROOTDIR/$PIDFILE"). The function should then be
used in start/stop/status.
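
The approach proposed above could look roughly like the following sketch. It is not the attached patch: the function name remove_stale_pidfile and its arguments are hypothetical, and it reads the process name from /proc/PID/comm, which assumes a Linux kernel that provides that file (ps -o comm= -p PID is a possible substitute elsewhere). If the PID in the pid file does not belong to a live process with the expected name, the pid file is deleted.

```shell
#!/bin/sh
# Hypothetical helper: delete the pid file unless its PID belongs to a
# live process whose command name matches the expected one.
# Usage: remove_stale_pidfile PIDFILE EXPECTED_NAME
remove_stale_pidfile() {
    pidfile=$1
    expected=$2
    [ -f "$pidfile" ] || return 0      # nothing to clean up
    pid=
    read -r pid < "$pidfile" || :
    actual=
    case $pid in
        ''|*[!0-9]*)
            # Empty or non-numeric content: treat the pid file as stale.
            ;;
        *)
            if [ -r "/proc/$pid/comm" ]; then
                read -r actual < "/proc/$pid/comm" || :
            fi
            ;;
    esac
    if [ "$actual" != "$expected" ]; then
        # The PID is dead or belongs to some other program.
        rm -f "$pidfile"
    fi
    return 0
}
```

In the initscript this might be called as remove_stale_pidfile "$ROOTDIR/$PIDFILE" named before start, stop, and status, so that the later pid file checks only ever see a pid file that points at a real named process.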

Comment 14 Tomáš Hozza 2013-08-23 11:55:14 UTC
Created attachment 789572 [details]
Proposed patch

Proposed patch. Tested with reproducer. service named start/stop/status works
as expected with this patch after unclean server shutdown.

Comment 15 Simon Matter 2013-08-23 12:50:45 UTC
Created attachment 789585 [details]
Updated patch including reload

May I propose this updated patch which also works for the reload target.

Comment 16 Tomáš Hozza 2013-08-23 13:08:38 UTC
(In reply to Simon Matter from comment #15)
> Created attachment 789585 [details]
> Updated patch including reload
> 
> May I propose this updated patch which also works for the reload target.

Thanks for catching this. Now it should be all covered.

Comment 17 Tomáš Hozza 2013-12-10 09:51:30 UTC
*** Bug 1039685 has been marked as a duplicate of this bug. ***

Comment 22 errata-xmlrpc 2014-10-14 04:34:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1373.html