Bug 505172

Summary: a failure to start a single nanny kills off *all* running nannies
Product: Red Hat Enterprise Linux 5
Reporter: Dan Yocum <dyocum>
Component: piranha
Assignee: Marek Grac <mgrac>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: low
Version: 5.3
CC: adrew, bill-bugzilla.redhat.com, cluster-maint, davidj, djansa, ffotorel, lscalabr, wcooley
Target Milestone: rc
Hardware: i386
OS: Linux
Fixed In Version: piranha-0.8.4-22
Doc Type: Bug Fix
Doc Text: A new keyword for lvs.cf was added: hard_shutdown = (0 | 1). With 1 (the default), a problem with a single nanny kills all nannies. With 0, a problem with a single nanny does not kill all nannies, but the system needs manual intervention.
Last Closed: 2011-07-21 07:23:10 EDT
Bug Blocks: 593728
Attachments: Small patch that prevents the complete shutdown after a nanny error

Description Dan Yocum 2009-06-10 17:17:46 EDT
Description of problem:

After setting a real server to active = 0 and weight = 0 and reloading pulse, <perform some work on the RS>, then setting active = 1 and weight = 3 and reloading pulse again, lvsd first creates the monitor for the process. That monitor dies for some strange reason, and lvsd then proceeds to shut down *all* virtual services!!
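(For context, a rough sketch of the kind of lvs.cf stanza being toggled follows; the service and server names are taken from the logs below, but the addresses and port details are illustrative placeholders, not values from the reporter's actual configuration. The real server is first set to active = 0 / weight = 0, work is done on it, and it is then set back to active = 1 / weight = 3 before pulse is reloaded again.)

    virtual saz-admin {
        address = 10.0.0.100 eth0:1
        port = 8443
        protocol = tcp
        server fg5x3 {
            address = 10.0.0.10
            active = 0
            weight = 0
        }
    }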

Version-Release number of selected component (if applicable):

piranha-0.8.4-9.3.el5 and piranha-0.8.4-11.el5



How reproducible:

always

Steps to Reproduce:
1. configure lvs.cf with a service on a new real server
2. make sure the service is NOT running on the real server
3. service pulse reload; tail -f /var/log/messages
4. PANIC!
  
Actual results:

lvs[19604]: rereading configuration file
lvs[19604]: create_monitor for saz-admin:8443/fg5x3 running as pid 31729
lvs[19604]: create_monitor for saz-admin:8443/fg6x3 running as pid 31730
lvs[19604]: nanny for child saz-admin:8443/fg5x3 died! shutting down lvs
lvs[19604]: shutting down virtual service MYSQL:3306
lvs[19604]: shutting down virtual service SAZ:8888
lvs[19604]: shutting down virtual service SAZ:8881
lvs[19604]: shutting down virtual service SAZ:8882
lvs[19604]: shutting down virtual service voms:8443
lvs[19604]: shutting down virtual service voms-osg:8443
lvs[19604]: shutting down virtual service gums:8443
nanny[19614]: Terminating due to signal 15
nanny[19617]: Terminating due to signal 15
nanny[19622]: Terminating due to signal 15
nanny[19644]: Terminating due to signal 15
nanny[19645]: Terminating due to signal 15
nanny[19647]: Terminating due to signal 15


Expected results:

lvs[18998]: rereading configuration file
lvs[18998]: starting virtual service saz-admin:8450 active: 8450
lvs[18998]: create_monitor for saz-admin:8450/fgt6x6 running as pid 10673
nanny[10673]: starting LVS client monitor for 131.225.81.155:8450
nanny[10673]: [ active ] making 131.225.81.131:8450 available


Additional info:
Comment 1 J. Kost 2009-06-23 04:13:37 EDT
Created attachment 349055 [details]
Small patch that prevents the complete shutdown after a nanny error

Small patch that prevents the complete lvsd shutdown after a nanny error
Comment 8 David Jacobson 2011-04-14 13:06:23 EDT
Hi,

Running the following :

CentOS 5.3
piranha-0.8.4-19.el5
ipvsadm-1.24-12.el5

We have just been hit by this bug; see the logs below:

Apr 14 17:55:18 serverhostname nanny[15067]: [inactive] shutting down 196.x.x.x:25 due to connection failure
Apr 14 17:56:47 serverhostname nanny[20959]: [inactive] shutting down 196.x.x.x:25 due to connection failure
Apr 14 17:56:47 serverhostname nanny[20959]: /sbin/ipvsadm command failed!
Apr 14 17:56:47 serverhostname lvs[20911]: nanny died! shutting down lvs
Apr 14 17:56:47 serverhostname lvs[20911]: shutting down virtual service balancer

Similarly, the problem occurred when restarting pulse, which incorrectly tried to bring up the nanny process twice:

Apr 14 17:59:18 serverhostname nanny[15067]: [ active ] making 196.x.x.x:25 available
Apr 14 17:59:53 serverhostname nanny[22170]: [ active ] making 196.x.x.x:25 available
Apr 14 17:59:53 serverhostname nanny[22170]: /sbin/ipvsadm command failed!
Apr 14 17:59:53 serverhostname lvs[22111]: nanny died! shutting down lvs
Apr 14 17:59:53 serverhostname lvs[22111]: shutting down virtual service balancer

From what I can see, the root cause of the issue is that it tries to shut down the connection twice; if it did so only once, all would be fine.
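(For illustration only: the "ipvsadm command failed" messages are consistent with nanny trying to delete a real-server entry that has already been removed, in which case the delete exits non-zero. A hypothetical example of such a call, where the virtual IP is a placeholder not taken from these logs:

    /sbin/ipvsadm -d -t <virtual-ip>:25 -r 196.x.x.x:25

Run a second time against the same entry, ipvsadm reports an error because the destination no longer exists, the nanny then exits, and lvsd tears down every virtual service.)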

I also agree, though, that lvsd should not die completely.

This bug has been open for over two years; is there going to be any progress on this?

Regards,
David
Comment 10 Marek Grac 2011-06-06 11:40:23 EDT
http://git.fedorahosted.org/git/?p=piranha.git;a=commit;h=8ca36132f5aad67fb0977f2efbf9d25b55776642

[the test case is not valid; it ends in a different branch very close to the original problem]

Joerg, thanks for the patch. I have added a new keyword, hard_shutdown = (0 | 1), which should go in the global section of lvs.cf. 1 (the default) keeps backward compatibility, where the system is either running completely or not at all. Setting hard_shutdown = 0 gives the functionality you would like to use.
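For anyone willing to test, a minimal sketch of what the global section might look like with the new keyword; the other directives are illustrative placeholders, and only hard_shutdown is the addition:

    serial_no = 1
    primary = 10.0.0.1
    service = lvs
    hard_shutdown = 0

With hard_shutdown = 0, a failed nanny is left to be dealt with manually instead of taking the whole lvsd instance down; with 1 (the default), the current all-or-nothing behaviour is kept.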

If you are willing to test, I would like to send you a preliminary version.
Comment 11 Marek Grac 2011-06-06 11:40:23 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A new keyword for lvs.cf was added.

hard_shutdown = (0 | 1)

1 (default) => a problem with a single nanny will kill all nannies
0 => a problem with a single nanny won't kill all nannies, but the system needs manual intervention
Comment 13 Bill McGonigle 2011-06-07 15:53:42 EDT
FYI, typo in the Technical Note.
Comment 22 errata-xmlrpc 2011-07-21 07:23:10 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1059.html