
Bug 505172

Summary: a failure to start a single nanny kills off *all* running nannys
Product: Red Hat Enterprise Linux 5
Reporter: Dan Yocum <dyocum>
Component: piranha
Assignee: Marek Grac <mgrac>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: low
Version: 5.3
CC: adrew, bill-bugzilla.redhat.com, cluster-maint, davidj, djansa, ffotorel, lscalabr, wcooley
Target Milestone: rc
Target Release: ---
Hardware: i386
OS: Linux
Whiteboard:
Fixed In Version: piranha-0.8.4-22
Doc Type: Bug Fix
Doc Text:
A new keyword for lvs.cf was added.
hard_shutdown = (0 | 1)
1 (default) => a problem with a single nanny will kill all nannies
0 => a problem with a single nanny won't kill all nannies, but the system needs manual intervention
Clone Of:
Clones: 593728
Last Closed: 2011-07-21 11:23:10 UTC
Bug Depends On:    
Bug Blocks: 593728    
Attachments:
Small patch that prevents the complete shutdown after a nanny error (flags: none)

Description Dan Yocum 2009-06-10 21:17:46 UTC
Description of problem:

After setting a real server to active = 0 and weight = 0 and reloading pulse, performing some work on the real server, and then setting active = 1 and weight = 3 and reloading pulse again, lvsd first creates the monitor for the real server. The monitor dies for some strange reason, and lvsd then proceeds to shut down *all* virtual services!

Version-Release number of selected component (if applicable):

piranha-0.8.4-9.3.el5 and piranha-0.8.4-11.el5



How reproducible:

always

Steps to Reproduce:
1. configure lvs.cf with a service on a new real server
2. make sure the service is NOT running on the real server
3. service pulse reload; tail -f /var/log/messages
4. PANIC!
  
Actual results:

lvs[19604]: rereading configuration file
lvs[19604]: create_monitor for saz-admin:8443/fg5x3 running as pid 31729
lvs[19604]: create_monitor for saz-admin:8443/fg6x3 running as pid 31730
lvs[19604]: nanny for child saz-admin:8443/fg5x3 died! shutting down lvs
lvs[19604]: shutting down virtual service MYSQL:3306
lvs[19604]: shutting down virtual service SAZ:8888
lvs[19604]: shutting down virtual service SAZ:8881
lvs[19604]: shutting down virtual service SAZ:8882
lvs[19604]: shutting down virtual service voms:8443
lvs[19604]: shutting down virtual service voms-osg:8443
lvs[19604]: shutting down virtual service gums:8443
nanny[19614]: Terminating due to signal 15
nanny[19617]: Terminating due to signal 15
nanny[19622]: Terminating due to signal 15
nanny[19644]: Terminating due to signal 15
nanny[19645]: Terminating due to signal 15
nanny[19647]: Terminating due to signal 15
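
The failure signature in the log above can be picked out of /var/log/messages mechanically. The following is a minimal Python sketch, not part of piranha: the function name `find_nanny_deaths` is made up for illustration, and the pattern is based only on the message wording that appears in this report (both the "nanny for child X died!" form and the shorter "nanny died!" form).

```python
import re

# Matches the lvs message quoted in this report. The optional group covers
# both "nanny for child X died!" and the shorter "nanny died!" wording.
DIED = re.compile(
    r"lvs\[(\d+)\]: nanny(?: for child (\S+))? died! shutting down lvs"
)


def find_nanny_deaths(lines):
    """Yield (lvs_pid, child_or_None) for each fatal nanny message."""
    for line in lines:
        match = DIED.search(line)
        if match:
            yield int(match.group(1)), match.group(2)


sample = [
    "lvs[19604]: create_monitor for saz-admin:8443/fg5x3 running as pid 31729",
    "lvs[19604]: nanny for child saz-admin:8443/fg5x3 died! shutting down lvs",
    "lvs[19604]: shutting down virtual service MYSQL:3306",
]
print(list(find_nanny_deaths(sample)))
# [(19604, 'saz-admin:8443/fg5x3')]
```

Passing an open file object for /var/log/messages instead of `sample` should work the same way, since the function only iterates over lines.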


Expected results:

lvs[18998]: rereading configuration file
lvs[18998]: starting virtual service saz-admin:8450 active: 8450
lvs[18998]: create_monitor for saz-admin:8450/fgt6x6 running as pid 10673
nanny[10673]: starting LVS client monitor for 131.225.81.155:8450
nanny[10673]: [ active ] making 131.225.81.131:8450 available


Additional info:

Comment 1 J. Kost 2009-06-23 08:13:37 UTC
Created attachment 349055

Small patch that prevents the complete lvsd shutdown after a nanny error

Comment 8 David Jacobson 2011-04-14 17:06:23 UTC
Hi,

We are running the following:

CentOS 5.3
piranha-0.8.4-19.el5
ipvsadm-1.24-12.el5

We have just been hit by this bug, see logs below:

Apr 14 17:55:18 serverhostname nanny[15067]: [inactive] shutting down 196.x.x.x:25 due to connection failure
Apr 14 17:56:47 serverhostname nanny[20959]: [inactive] shutting down 196.x.x.x:25 due to connection failure
Apr 14 17:56:47 serverhostname nanny[20959]: /sbin/ipvsadm command failed!
Apr 14 17:56:47 serverhostname lvs[20911]: nanny died! shutting down lvs
Apr 14 17:56:47 serverhostname lvs[20911]: shutting down virtual service balancer

Similarly, the problem occurred when restarting pulse, which incorrectly tried to bring up the nanny process twice:

Apr 14 17:59:18 serverhostname nanny[15067]: [ active ] making 196.x.x.x:25 available
Apr 14 17:59:53 serverhostname nanny[22170]: [ active ] making 196.x.x.x:25 available
Apr 14 17:59:53 serverhostname nanny[22170]: /sbin/ipvsadm command failed!
Apr 14 17:59:53 serverhostname lvs[22111]: nanny died! shutting down lvs
Apr 14 17:59:53 serverhostname lvs[22111]: shutting down virtual service balancer

From what I can see, the root cause of the issue is that it tries to shut down the connection twice. If it did so only once, all would be fine.

I also agree, though, that lvsd should not die completely.

This bug has been open for over 2 years. Is there going to be any progress on this?

Regards,
David

Comment 10 Marek Grac 2011-06-06 15:40:23 UTC
http://git.fedorahosted.org/git/?p=piranha.git;a=commit;h=8ca36132f5aad67fb0977f2efbf9d25b55776642

[The test case is not valid; it ends up in another branch very close to the original problem.]

Joerg, thanks for the patch. I have added a new keyword, hard_shutdown = (0 | 1), which should go in the global section of lvs.cf. The default, 1, keeps backward compatibility: the system is either running completely or not at all. Setting hard_shutdown = 0 gives the behaviour you are asking for.

If you are willing to test, I would like to send you a preliminary version.
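
The two settings can be pictured as a small supervisor policy. This is an illustrative Python model under stated assumptions, not piranha's actual C code: the function name `on_nanny_exit`, the pid-to-label dictionary, and the pids themselves are hypothetical. Only the keyword hard_shutdown and the meaning of its two values come from this report.

```python
def on_nanny_exit(monitors, dead_pid, hard_shutdown=1):
    """Return the pids that should be terminated after one nanny dies.

    monitors: dict mapping nanny pid -> label (hypothetical model).
    hard_shutdown: 1 models the backward-compatible all-or-nothing
    behaviour; 0 models leaving the surviving nannies running.
    """
    survivors = [pid for pid in monitors if pid != dead_pid]
    if hard_shutdown:
        # All-or-nothing: one failure takes every remaining monitor down.
        return sorted(survivors)
    # Keep the rest running; the dead nanny needs manual attention.
    return []


# Made-up pids and labels, for illustration only.
monitors = {101: "service-a", 102: "service-b", 103: "service-c"}
print(on_nanny_exit(monitors, 102, hard_shutdown=1))  # [101, 103]
print(on_nanny_exit(monitors, 102, hard_shutdown=0))  # []
```

With hard_shutdown = 0 the model deliberately does nothing about the dead nanny, so the affected real server still has to be recovered by hand.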

Comment 11 Marek Grac 2011-06-06 15:40:23 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
New keyword for lvs.cf was added. 

hard_shutdows = (0 | 1) 

1 (default) => problem with single nanny will kill all nannies
0 => problem with single nanny won't kill all nannies but system needs manual intervention
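
As a rough illustration of where the keyword lives, the Python sketch below reads the global value from lvs.cf-style text. It rests on several assumptions: that the keyword is spelled hard_shutdown as in Comment 10, that global settings are plain `key = value` lines outside any `{ ... }` block, and that a missing keyword means the default of 1. The function name `read_hard_shutdown` and the sample names `example` and `rs1` are hypothetical, and this is not how piranha itself parses the file.

```python
def read_hard_shutdown(text, default=1):
    """Return the global hard_shutdown value from lvs.cf-style text."""
    depth = 0  # brace nesting level; 0 means the global section
    value = default
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("{"):
            depth += 1
            continue
        if line == "}":
            depth -= 1
            continue
        if depth == 0 and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == "hard_shutdown":
                value = int(val.strip())
    return value


# Hypothetical configuration text, for illustration only.
sample = """
hard_shutdown = 0
virtual example {
    active = 1
    server rs1 {
        active = 1
        weight = 3
    }
}
"""
print(read_hard_shutdown(sample))  # 0
```

Settings inside a virtual or server block are ignored on purpose, because the keyword is described as belonging to the global section.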

Comment 13 Bill McGonigle 2011-06-07 19:53:42 UTC
FYI, there is a typo in the Technical Note ("hard_shutdows" should read "hard_shutdown").

Comment 22 errata-xmlrpc 2011-07-21 11:23:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1059.html