Bug 739223
| Field | Value |
|---|---|
| Summary | Nanny crashes and shuts LVS down if a service is deleted using ipvsadm and then the corresponding real server goes down. |
| Product | Red Hat Enterprise Linux 5 |
| Component | piranha |
| Reporter | Joel <jweirauch> |
| Assignee | Ryan O'Hara <rohara> |
| Status | CLOSED ERRATA |
| QA Contact | Cluster QE <mspqa-list> |
| Severity | high |
| Priority | medium |
| Version | 5.5 |
| CC | bugzilla-redhat, cluster-maint, djansa, lscalabr, mgrac, mjuricek, uwe.knop |
| Target Milestone | rc |
| Hardware | All |
| OS | Linux |
| Fixed In Version | piranha-0.8.4-25.el5 |
| Doc Type | Bug Fix |
| Last Closed | 2013-01-08 07:33:18 UTC |
| Bug Blocks | 807971 |
Description
Joel
2011-09-16 21:07:32 UTC
Description of problem:
Hi, I have exactly the same behaviour here. I hit this bug while disabling a real server and doing maintenance. The backup LVS router didn't initiate failover, because the pulse process was still running on the master.

Version-Release number of selected component (if applicable):
piranha-0.8.5-7.el6.i686
Scientific Linux release 6.1 (Carbon)

Actual results:
/var/log/messages:

Sep 22 15:05:46 lb1 nanny[2360]: Trouble. Received results are not what we expected from (10.11.12.141:443)
Sep 22 15:05:46 lb1 nanny[2360]: [inactive] shutting down 10.11.12.141:443 due to connection failure
Sep 22 15:05:46 lb1 nanny[2360]: /sbin/ipvsadm command failed!
Sep 22 15:05:46 lb1 lvs[2343]: nanny died! shutting down lvs
Sep 22 15:05:46 lb1 lvs[2343]: shutting down virtual service service1-nonssl
Sep 22 15:05:46 lb1 nanny[2356]: Terminating due to signal 15
Sep 22 15:05:46 lb1 nanny[2356]: /sbin/ipvsadm command failed!
Sep 22 15:05:46 lb1 nanny[2357]: Terminating due to signal 15
Sep 22 15:05:46 lb1 lvs[2343]: shutting down virtual service service1-ssl
Sep 22 15:05:46 lb1 nanny[2361]: Terminating due to signal 15
Sep 22 15:05:46 lb1 lvs[2343]: shutting down virtual service service2-nonssl
Sep 22 15:05:46 lb1 nanny[2366]: Terminating due to signal 15
Sep 22 15:05:46 lb1 nanny[2367]: Terminating due to signal 15
Sep 22 15:05:46 lb1 lvs[2343]: shutting down virtual service service2-ssl
Sep 22 15:05:46 lb1 nanny[2371]: Terminating due to signal 15
Sep 22 15:05:46 lb1 nanny[2372]: Terminating due to signal 15
Sep 22 15:37:05 lb1 pulse[2332]: Terminating due to signal 15
Sep 22 15:37:06 lb1 ntpd[1365]: Deleting interface #11 eth0:2, 10.11.12.5#123, interface stats: received=0, sent=0, dropped=0, active_time=3121756 secs
Sep 22 15:37:06 lb1 ntpd[1365]: Deleting interface #12 eth0:1, 10.11.12.4#123, interface stats: received=0, sent=0, dropped=0, active_time=3121756 secs
Sep 22 15:37:06 lb1 ntpd[1365]: Deleting interface #13 eth1:1, 10.11.12.131#123, interface stats: received=0,
sent=0, dropped=0, active_time=3121756 secs

This happens because you first remove the ipvsadm rule, then stop the service. The pulse service does not check ipvsadm rules and will not recognize that you've removed one. When you stop the service, nanny detects this, attempts to remove the ipvsadm rule, and fails because it is already gone. The workaround is simple: don't remove the ipvsadm rule, just stop the service. I do think it is reasonable to expect that a failed attempt to remove a non-existent ipvsadm rule should not cause all nanny processes to fail. I've asked for feedback from the author of the hard_shutdown option.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

I've recreated this problem on RHEL6.3 using piranha-0.8.5-15.el5. On the director, start pulse:

# service pulse start
Starting pulse:                                            [  OK  ]

Now check the output of ipvsadm. Here we have a single virtual service (192.168.1.201:80) with two real servers:

# ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.1.201:http rr
  -> 192.168.1.108:http           Route   1      0          0
  -> 192.168.1.109:http           Route   1      0          0

Now remove one of the real servers using ipvsadm. Note that the nanny process responsible for health checking this real server is still active, and the service itself (http in this case) is still active on the real server.
# ipvsadm -d -t 192.168.1.201:80 -r 192.168.1.108

Check that the real server was removed:

# ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.1.201:http rr
  -> 192.168.1.109:http           Route   1      0          2

On the real server, stop the actual service (httpd in this case):

# service httpd stop
Stopping httpd:                                            [  OK  ]

Now the health check will fail and nanny will attempt to remove the real server, which will fail because the real server was already removed. From /var/log/messages:

Apr 19 17:25:25 mobil-virt-06 nanny[25340]: [inactive] shutting down 192.168.1.108:80 due to connection failure
Apr 19 17:25:26 mobil-virt-06 nanny[25340]: /sbin/ipvsadm command failed!
Apr 19 17:25:26 mobil-virt-06 lvs[25330]: nanny died! shutting down lvs
Apr 19 17:25:26 mobil-virt-06 lvs[25330]: shutting down virtual service VIP_201
Apr 19 17:25:26 mobil-virt-06 nanny[25341]: Terminating due to signal 15

Note that nanny dies and lvsd dies in turn.

The crude fix is to ignore the error returned by nanny when it fails to remove a real server that had already been removed, but this seems like a bad idea. Unfortunately there does not appear to be a simple way to check for the existence of a real server prior to removing it, so we'll need something clever. One idea is to adjust a real server's weight to 0 prior to removal. If this fails, can we be reasonably sure that the real server had already been removed? More investigation needed.

It appears that removing a real server from a service via the ipvsadm tool can result in one of two errors in libipvs: ENOENT if the real server does not exist, or ESRCH if the service does not exist. These error conditions aren't available to nanny, so when nanny sees that the ipvsadm call to remove a real server has failed, it simply exits. I think that failing to remove a real server that does not exist should not be a fatal error.
I was curious to see how keepalived handled this situation, so I ran the same test as above with keepalived version 1.2.2 on RHEL6.3. Keepalived will write a message to the logs if the real server being deleted does not exist, but it will not exit. This seems reasonable to me.

This problem also occurs when using the quiesce_server option and nanny attempts to adjust the real server's weight to 0. Again, if the real server has been removed manually with ipvsadm, the ipvsadm command exec'd by nanny will fail and nanny/lvsd will terminate.

After further investigation, this problem can occur for more than just removal of a real server. The nanny process does a fork/exec of ipvsadm as a means to add/remove/modify real servers. If any of these commands encounters an error, nanny/lvsd will exit abnormally. Specifically, nanny will exit if it attempts to do any of the following:

- Add a real server that already exists.
- Remove a real server that does not exist.
- Modify a real server that does not exist.

None of these should cause a fatal error. Instead, log a message and continue. Patch forthcoming.

Created attachment 597119 [details]
Errors from ipvsadm should not be fatal.
This patch changes the way that nanny handles errors from ipvsadm. Previously, any error encountered when nanny exec's ipvsadm would cause nanny to exit abnormally. With this patch, errors from running ipvsadm do not cause nanny to exit; instead nanny prints a message to syslog. For example, if nanny wants to remove a real server but that real server does not exist, this is no longer a fatal error. Similarly, if nanny tries to add a real server that already exists, or modify a real server (e.g. change its weight) that does not exist, these are no longer considered fatal errors. The nanny process will print an error and continue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-0065.html