Bug 505172 - a failure to start a single nanny kills off *all* running nannies
Summary: a failure to start a single nanny kills off *all* running nannies
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: piranha
Version: 5.3
Hardware: i386
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Marek Grac
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 593728
 
Reported: 2009-06-10 21:17 UTC by Dan Yocum
Modified: 2018-11-14 14:37 UTC
CC List: 8 users

Fixed In Version: piranha-0.8.4-22
Doc Type: Bug Fix
Doc Text:
A new keyword for lvs.cf was added: hard_shutdown = (0 | 1). With 1 (the default), a problem with a single nanny kills all nannies. With 0, a problem with a single nanny does not kill all nannies, but the system needs manual intervention.
Clone Of:
: 593728
Environment:
Last Closed: 2011-07-21 11:23:10 UTC
Target Upstream Version:
Embargoed:


Attachments
Small patch that prevents the complete shutdown after a nanny error (999 bytes, patch)
2009-06-23 08:13 UTC, J. Kost


Links
Red Hat Product Errata RHBA-2011:1059 (normal, SHIPPED_LIVE): piranha bug fix update. Last updated 2011-07-20 15:43:20 UTC.

Description Dan Yocum 2009-06-10 21:17:46 UTC
Description of problem:

After setting a real server to active = 0 and weight = 0 and reloading pulse, performing some work on the real server, and then setting active = 1 and weight = 3 and reloading pulse again, lvsd first creates the monitor for the service, which dies for some strange reason, and then proceeds to shut down *all* virtual services!!
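
To make the reload cycle concrete, the relevant real-server stanza in lvs.cf has roughly this shape (a sketch only: the addresses and monitoring settings below are placeholders, not our real configuration; the service and server names match the log excerpts further down). The reload toggles the server's active line between 0 and 1 and its weight line between 0 and 3:

virtual saz-admin {
     active = 1
     address = 192.0.2.10 eth0:1
     port = 8443
     protocol = tcp
     server fg5x3 {
          address = 192.0.2.21
          active = 1
          weight = 3
     }
}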

Version-Release number of selected component (if applicable):

piranha-0.8.4-9.3.el5 and piranha-0.8.4-11.el5



How reproducible:

always

Steps to Reproduce:
1. configure lvs.cf with a service on a new real server
2. make sure the service is NOT running on the real server
3. service pulse reload; tail -f /var/log/messages
4. PANIC!
  
Actual results:

lvs[19604]: rereading configuration file
lvs[19604]: create_monitor for saz-admin:8443/fg5x3 running as pid 31729
lvs[19604]: create_monitor for saz-admin:8443/fg6x3 running as pid 31730
lvs[19604]: nanny for child saz-admin:8443/fg5x3 died! shutting down lvs
lvs[19604]: shutting down virtual service MYSQL:3306
lvs[19604]: shutting down virtual service SAZ:8888
lvs[19604]: shutting down virtual service SAZ:8881
lvs[19604]: shutting down virtual service SAZ:8882
lvs[19604]: shutting down virtual service voms:8443
lvs[19604]: shutting down virtual service voms-osg:8443
lvs[19604]: shutting down virtual service gums:8443
nanny[19614]: Terminating due to signal 15
nanny[19617]: Terminating due to signal 15
nanny[19622]: Terminating due to signal 15
nanny[19644]: Terminating due to signal 15
nanny[19645]: Terminating due to signal 15
nanny[19647]: Terminating due to signal 15


Expected results:

lvs[18998]: rereading configuration file
lvs[18998]: starting virtual service saz-admin:8450 active: 8450
lvs[18998]: create_monitor for saz-admin:8450/fgt6x6 running as pid 10673
nanny[10673]: starting LVS client monitor for 131.225.81.155:8450
nanny[10673]: [ active ] making 131.225.81.131:8450 available


Additional info:

Comment 1 J. Kost 2009-06-23 08:13:37 UTC
Created attachment 349055 [details]
Small patch that prevents the complete shutdown after a nanny error

Small patch that prevents the complete lvsd shutdown after a nanny error

Comment 8 David Jacobson 2011-04-14 17:06:23 UTC
Hi,

Running the following:

CentOS 5.3
piranha-0.8.4-19.el5
ipvsadm-1.24-12.el5

We have just been hit by this bug; see the logs below:

Apr 14 17:55:18 serverhostname nanny[15067]: [inactive] shutting down 196.x.x.x:25 due to connection failure
Apr 14 17:56:47 serverhostname nanny[20959]: [inactive] shutting down 196.x.x.x:25 due to connection failure
Apr 14 17:56:47 serverhostname nanny[20959]: /sbin/ipvsadm command failed!
Apr 14 17:56:47 serverhostname lvs[20911]: nanny died! shutting down lvs
Apr 14 17:56:47 serverhostname lvs[20911]: shutting down virtual service balancer

Similarly, the problem occurred when restarting pulse, which incorrectly tried to bring up the nanny process twice:

Apr 14 17:59:18 serverhostname nanny[15067]: [ active ] making 196.x.x.x:25 available
Apr 14 17:59:53 serverhostname nanny[22170]: [ active ] making 196.x.x.x:25 available
Apr 14 17:59:53 serverhostname nanny[22170]: /sbin/ipvsadm command failed!
Apr 14 17:59:53 serverhostname lvs[22111]: nanny died! shutting down lvs
Apr 14 17:59:53 serverhostname lvs[22111]: shutting down virtual service balancer

From what I can see, the root cause of the issue is that it tries to shut down the connection twice; if it did it just once, all would be fine.
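
As a quick illustration of that theory (the addresses are placeholders, and this is only a sketch of the kind of ipvsadm call the nanny makes, not its exact invocation): deleting the same real server from the IPVS table twice makes the second call fail, which matches the "/sbin/ipvsadm command failed!" line right before lvsd shuts everything down.

# first removal of the real server from the virtual service succeeds
/sbin/ipvsadm -d -t 192.0.2.10:25 -r 192.0.2.21:25

# repeating the same removal fails with a non-zero exit status,
# because the destination is no longer in the IPVS table
/sbin/ipvsadm -d -t 192.0.2.10:25 -r 192.0.2.21:25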

I also agree, though, that lvsd should not die completely.

This bug has been open for over two years; is there going to be any progress on it?

Regards,
David

Comment 10 Marek Grac 2011-06-06 15:40:23 UTC
http://git.fedorahosted.org/git/?p=piranha.git;a=commit;h=8ca36132f5aad67fb0977f2efbf9d25b55776642

[the test case is not valid; it ends up in another branch very close to the original problem]

Joerg, thanks for the patch. I have added a new keyword, hard_shutdown = (0 | 1), which goes in the global section of lvs.cf. 1 (the default) keeps the backward-compatible behaviour, where the system is either running completely or not at all. Setting hard_shutdown = 0 gives the functionality you would like to use.

If you are willing to test, I would like to send you a preliminary version.
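
For anyone testing, the keyword sits in the global section of lvs.cf next to the existing global settings, roughly like this (a sketch; all values other than the hard_shutdown line are placeholders from a typical configuration):

serial_no = 1
primary = 192.0.2.2
backup = 192.0.2.3
backup_active = 1
service = lvs
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 18
network = nat
nat_router = 192.0.2.254 eth1:1
hard_shutdown = 0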

Comment 11 Marek Grac 2011-06-06 15:40:23 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
New keyword for lvs.cf was added. 

hard_shutdows = (0 | 1) 

1 (default) => problem with single nanny will kill all nannies
0 => problem with single nanny won't kill all nannies but system needs manual intervention

Comment 13 Bill McGonigle 2011-06-07 19:53:42 UTC
FYI, typo in the Technical Note.

Comment 22 errata-xmlrpc 2011-07-21 11:23:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1059.html

