Bug 739223

Summary: Nanny crashes and shuts LVS down if a service is deleted using ipvsadm and then the corresponding real server goes down.
Product: Red Hat Enterprise Linux 5
Component: piranha
Version: 5.5
Reporter: Joel <jweirauch>
Assignee: Ryan O'Hara <rohara>
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: rc
Hardware: All
OS: Linux
Fixed In Version: piranha-0.8.4-25.el5
Doc Type: Bug Fix
Last Closed: 2013-01-08 07:33:18 UTC
CC: bugzilla-redhat, cluster-maint, djansa, lscalabr, mgrac, mjuricek, uwe.knop
Bug Blocks: 807971
Attachments:
Errors from ipvsadm should not be fatal. (flags: none)

Description Joel 2011-09-16 21:07:32 UTC
Description of problem:
If you use ipvsadm to delete a service from LVS, the nanny process monitoring that service remains active.  If you then shut down the service on the real server, nanny detects this and attempts to remove the service from the LVS configuration.  Since the service has already been deleted, the removal fails and the nanny process dies, resulting in LVS shutting down entirely, even with hard_shutdown=0 in the lvs configuration file.

Version-Release number of selected component (if applicable): piranha-0.8.4-22.el5


How reproducible:
Always

Steps to Reproduce:
1. Use ipvsadm to delete a real server, e.g.: ipvsadm -d -t 172.16.0.2:80 -r 172.16.0.5
2. Stop the corresponding service on the real server, e.g.: "service httpd stop" on 172.16.0.5
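The two steps above can be sketched as a shell script. The VIP (172.16.0.2:80) and real server address (172.16.0.5) are the example values from this report; the RUN_REPRO guard is a hypothetical switch so the sketch does nothing unless explicitly enabled.

```shell
#!/bin/sh
# Sketch of the reproduction steps; addresses are the examples from this
# report. Set RUN_REPRO=1 (hypothetical guard) to actually run step 1 as
# root on the LVS director.
VIP="172.16.0.2:80"
RS="172.16.0.5"

# Step 1 (on the director): delete the real server behind nanny's back.
delete_real_server() {
    ipvsadm -d -t "$VIP" -r "$RS"
}

# Step 2 (on the real server): stop the monitored service so the nanny
# health check fails and nanny tries to delete the already-deleted rule.
stop_service() {
    service httpd stop
}

if [ "${RUN_REPRO:-0}" = "1" ]; then
    delete_real_server
fi
```

After step 2, watch /var/log/messages on the director for the failure.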
  
Actual results:
/var/log/messages will now show that the nanny process died and that it is shutting down lvs

Expected results:
Given that hard_shutdown=0 and I'm running piranha-0.8.4-22, I would expect that a failure of a single nanny process would not stop LVS entirely.

However, I would expect that when using ipvsadm to delete a service from LVS, the corresponding nanny process would be stopped.

Comment 1 Andrej Moravcik 2011-09-22 17:25:33 UTC
Description of problem:

Hi, I have exactly the same behaviour here. I hit this bug while disabling a real server for maintenance. The backup LVS router didn't initiate failover, because the pulse process was still running on the master.


Version-Release number of selected component (if applicable):

piranha-0.8.5-7.el6.i686
Scientific Linux release 6.1 (Carbon)


Actual results:

/var/log/messages:
Sep 22 15:05:46 lb1 nanny[2360]: Trouble. Received results are not what we expected from (10.11.12.141:443)
Sep 22 15:05:46 lb1 nanny[2360]: [inactive] shutting down 10.11.12.141:443 due to connection failure
Sep 22 15:05:46 lb1 nanny[2360]: /sbin/ipvsadm command failed!
Sep 22 15:05:46 lb1 lvs[2343]: nanny died! shutting down lvs
Sep 22 15:05:46 lb1 lvs[2343]: shutting down virtual service service1-nonssl
Sep 22 15:05:46 lb1 nanny[2356]: Terminating due to signal 15
Sep 22 15:05:46 lb1 nanny[2356]: /sbin/ipvsadm command failed!
Sep 22 15:05:46 lb1 nanny[2357]: Terminating due to signal 15
Sep 22 15:05:46 lb1 lvs[2343]: shutting down virtual service service1-ssl
Sep 22 15:05:46 lb1 nanny[2361]: Terminating due to signal 15
Sep 22 15:05:46 lb1 lvs[2343]: shutting down virtual service service2-nonssl
Sep 22 15:05:46 lb1 nanny[2366]: Terminating due to signal 15
Sep 22 15:05:46 lb1 nanny[2367]: Terminating due to signal 15
Sep 22 15:05:46 lb1 lvs[2343]: shutting down virtual service service2-ssl
Sep 22 15:05:46 lb1 nanny[2371]: Terminating due to signal 15
Sep 22 15:05:46 lb1 nanny[2372]: Terminating due to signal 15
Sep 22 15:37:05 lb1 pulse[2332]: Terminating due to signal 15
Sep 22 15:37:06 lb1 ntpd[1365]: Deleting interface #11 eth0:2, 10.11.12.5#123, interface stats: received=0, sent=0, dropped=0, active_time=3121756 secs
Sep 22 15:37:06 lb1 ntpd[1365]: Deleting interface #12 eth0:1, 10.11.12.4#123, interface stats: received=0, sent=0, dropped=0, active_time=3121756 secs
Sep 22 15:37:06 lb1 ntpd[1365]: Deleting interface #13 eth1:1, 10.11.12.131#123, interface stats: received=0, sent=0, dropped=0, active_time=3121756 secs

Comment 2 Ryan O'Hara 2011-10-27 16:17:01 UTC
This happens because you first remove the ipvsadm rule, then stop the service. The pulse service does not check ipvsadm rules and will not recognize that you've removed one. When you stop the service, nanny detects this and attempts to remove the ipvsadm rule and fails because it is already gone.

The workaround is simple -- don't remove the ipvsadm rule, just stop the service. I do think it is reasonable to expect that a failed attempt to remove a non-existent ipvsadm rule should not cause all nanny processes to fail. I've asked for feedback from the author of the hard_shutdown option.
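The workaround can be sketched in shell. The helper below is hypothetical (not part of piranha); it simply checks whether a real server still appears in the IPVS table after nanny has reacted.

```shell
#!/bin/sh
# Workaround sketch: do NOT remove the ipvsadm rule by hand while
# pulse/nanny are running. On the real server, just stop the service:
#   service httpd stop
# nanny's health check then fails and nanny removes the rule itself.
#
# Hypothetical helper to confirm (on the director) that nanny removed
# the entry; takes a real server address, returns 0 if still listed.
real_server_present() {
    ipvsadm -L -n | grep -q "$1"
}

# Example check after maintenance:
#   real_server_present 172.16.0.5 || echo "entry removed by nanny"
```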

Comment 4 RHEL Program Management 2012-04-02 10:51:08 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 6 Ryan O'Hara 2012-04-19 22:35:42 UTC
I've recreated this problem on RHEL6.3 using piranha-0.8.5-15.el5.

On the director, start pulse:

# service pulse start
Starting pulse:                                            [  OK  ]

Now check the output of ipvsadm. Here we have a single virtual service (192.168.1.201:80) with two real servers.

# ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.1.201:http rr
  -> 192.168.1.108:http           Route   1      0          0         
  -> 192.168.1.109:http           Route   1      0          0         

Now remove one of the real servers using ipvsadm. Note that the nanny process responsible for health checking this real server is still active and the service itself (http in this case) is still running on the real server.

# ipvsadm -d -t 192.168.1.201:80 -r 192.168.1.108

Check that the real server was removed:

# ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.1.201:http rr
  -> 192.168.1.109:http           Route   1      0          2         

On the real server, stop the actual service (httpd in this case):

# service httpd stop
Stopping httpd:                                            [  OK  ]

Now the health check will fail and nanny will attempt to remove the real server, which will fail because the real server was already removed. From /var/log/messages:

Apr 19 17:25:25 mobil-virt-06 nanny[25340]: [inactive] shutting down 192.168.1.108:80 due to connection failure
Apr 19 17:25:26 mobil-virt-06 nanny[25340]: /sbin/ipvsadm command failed!
Apr 19 17:25:26 mobil-virt-06 lvs[25330]: nanny died! shutting down lvs
Apr 19 17:25:26 mobil-virt-06 lvs[25330]: shutting down virtual service VIP_201
Apr 19 17:25:26 mobil-virt-06 nanny[25341]: Terminating due to signal 15

Note that nanny dies and lvsd dies in turn.

Comment 7 Ryan O'Hara 2012-04-23 15:41:57 UTC
The crude fix is to ignore the error returned by nanny when it fails to remove a real server that had already been removed, but this seems like a bad idea. Unfortunately there does not appear to be a simple way to check for the existence of a real server prior to removing it, so we'll need something clever. One idea is to adjust a real server's weight to 0 prior to removal. If this fails, can we be reasonably sure that the real server had already been removed? More investigation needed.
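The weight-probe idea can be sketched in shell. This is an assumption about the approach, not the shipped fix; the function name and log message are illustrative.

```shell
#!/bin/sh
# Sketch of the probe idea: try setting the real server's weight to 0
# before deleting it. If the edit fails, the entry is presumably already
# gone, so skip the delete instead of treating it as fatal.
safe_remove() {
    vip="$1"; rs="$2"
    if ipvsadm -e -t "$vip" -r "$rs" -w 0 2>/dev/null; then
        ipvsadm -d -t "$vip" -r "$rs"
    else
        logger -t nanny "real server $rs already removed from $vip; skipping delete"
    fi
}
```

Usage: safe_remove 192.168.1.201:80 192.168.1.108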

Comment 9 Ryan O'Hara 2012-07-03 21:35:18 UTC
It appears that removing a real server from a service via the ipvsadm tool can result in one of two errors in libipvs: ENOENT if the real server does not exist, or ESRCH if the service does not exist. These error conditions aren't available to nanny, so when nanny sees that the ipvsadm call to remove a real server has failed, it simply exits.

I think that failing to remove a real server that does not exist should not be a fatal error. I was curious to see how keepalived handled this situation, so I ran the same test as above with keepalived version 1.2.2 on RHEL6.3. Keepalived will write a message to the logs if the real server being deleted does not exist, but it will not exit. This seems reasonable to me.

Comment 10 Ryan O'Hara 2012-07-05 13:53:50 UTC
This problem also occurs when using the quiesce_server option and nanny attempts to adjust the real server's weight to 0. Again, if the real server has been removed manually with ipvsadm, the ipvsadm command exec'd by nanny will fail and nanny/lvsd will terminate.

Comment 11 Ryan O'Hara 2012-07-05 21:40:05 UTC
After further investigation, this problem can occur for more than just removal of a real server. The nanny process does a fork/exec of ipvsadm as a means to add/remove/modify real servers. If any of these commands encounters an error, nanny/lvsd will exit abnormally. Specifically, nanny will exit if it attempts to do any of the following:

- Add a real server that already exists.
- Remove a real server that does not exist.
- Modify a real server that does not exist.

None of these should cause a fatal error. Instead, log a message and continue. Patch forthcoming.

Comment 12 Ryan O'Hara 2012-07-09 17:55:51 UTC
Created attachment 597119 [details]
Errors from ipvsadm should not be fatal.

This patch changes the way that nanny handles errors from ipvsadm. Previously, any error encountered when nanny exec's ipvsadm would cause nanny to exit abnormally. With this patch, errors from running ipvsadm do not cause nanny to exit; instead nanny prints a message to syslog. For example, if nanny wants to remove a real server but that real server does not exist, this should not be a fatal error. Similarly, if nanny tries to add a real server that already exists or modify a real server (e.g. change its weight) that does not exist, these are no longer considered fatal errors. The nanny process will print an error and continue.
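The patched behavior can be sketched in shell (nanny itself does this in C; the wrapper name and log wording below are illustrative, not the actual patch):

```shell
#!/bin/sh
# Sketch: a failed ipvsadm exec is logged to syslog and nanny keeps
# running, instead of exiting and taking lvsd down with it.
run_ipvsadm() {
    if ! ipvsadm "$@"; then
        logger -t nanny "/sbin/ipvsadm command failed (args: $*); continuing"
        return 0   # non-fatal: keep health checking
    fi
}

# Example: a delete of an already-removed real server now just logs.
#   run_ipvsadm -d -t 10.0.0.1:80 -r 10.0.0.2
```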

Comment 26 errata-xmlrpc 2013-01-08 07:33:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0065.html