Description of problem:

A customer reported that all http/https services were suspended when they encountered an issue where the nanny process for a test script was killed by the timeout command. After that, all web services were unavailable until Pulse was restarted.

Both routers end up with nothing in the LVS routing table.

Before, on the active router, when all is good:

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.1.1.10:http wrr
  -> 10.1.1.8:http                Route   1      0          0
  -> 10.1.1.9:http                Route   1      0          0
TCP  10.1.1.10:https wrr
  -> 10.1.1.8:https               Route   1      0          0
  -> 10.1.1.9:https               Route   1      0          0

Reproduce:

Mar 17 16:07:57 pulse[32412]: STARTING PULSE AS MASTER
Mar 17 16:09:17 pulse[32412]: backup inactive: activating lvs
Mar 17 16:09:17 lvsd[32418]: starting virtual service sxxx_http active: 80
Mar 17 16:09:17 lvsd[32418]: create_monitor for sxxx_http/hosta01-prod running as pid 32422
Mar 17 16:09:17 lvsd[32418]: create_monitor for sxxx_http/hosta02-prod running as pid 32423
Mar 17 16:09:17 lvsd[32418]: starting virtual service sxxx_https active: 443
Mar 17 16:09:17 nanny[32423]: starting LVS client monitor for VIP:80 -> real_server_ip1:80
Mar 17 16:09:17 nanny[32422]: starting LVS client monitor for VIP:80 -> real_server_ip2:80
Mar 17 16:09:17 lvsd[32418]: create_monitor for sxxx_https/hosta01-prod running as pid 32425
Mar 17 16:09:17 lvsd[32418]: create_monitor for sxxx_https/hosta02-prod running as pid 32426
Mar 17 16:09:17 nanny[32425]: External program use requested: (/usr/local/sbin/https-test.sh %), IGNORING send string option ((null))
Mar 17 16:09:17 nanny[32425]: starting LVS client monitor for VIP:443 -> real_server_ip2:443
Mar 17 16:09:17 nanny[32426]: External program use requested: (/usr/local/sbin/https-test.sh %), IGNORING send string option ((null))
Mar 17 16:09:17 nanny[32426]: starting LVS client monitor for VIP:443 -> real_server_ip1:443
Mar 17 16:09:17 nanny[32423]: [ active ] making real_server_ip1:80 available
Mar 17 16:09:17 nanny[32426]: Trouble. Received results are not what we expected from (real_server_ip1:443)

Note the 2 second interval before the services are stopped:

Mar 17 16:09:19 nanny[32425]: Terminating due to signal 15
Mar 17 16:09:19 nanny[32425]: Killing child 32431
Mar 17 16:09:19 lvsd[32418]: nanny died! shutting down lvs
Mar 17 16:09:19 lvsd[32418]: shutting down virtual service spcd_http
Mar 17 16:09:19 nanny[32422]: Terminating due to signal 15
Mar 17 16:09:19 nanny[32423]: Terminating due to signal 15
Mar 17 16:09:19 kernel: IPVS: __ip_vs_del_service: enter
Mar 17 16:09:19 lvsd[32418]: shutting down virtual service spcd_https
Mar 17 16:09:19 nanny[32426]: Terminating due to signal 15
Mar 17 16:09:19 kernel: IPVS: __ip_vs_del_service: enter
Mar 17 16:09:19 pulse[32412]: Child process 32418 exited with status 0
Mar 17 16:09:22 pulse[32428]: gratuitous lvs arps finished

After, on both routers:

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
#

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
#

Version-Release number of selected component (if applicable):
piranha-0.8.6-4.el6_5.2.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Configure a basic Piranha LVS setup and kill one of the nanny processes on the active router with SIGTERM (kill -15).

Actual results:
All services are disabled and become unavailable.

Expected results:
With "hard_shutdown = 0" we would expect the other services to remain functional, but the "hard_shutdown" directive does not seem to make a difference in this behaviour.

Additional info:
I have a reproducer set up on some VMs.
/etc/ha/lvs.cf extract:

serial_no = 8
primary = IP p
primary_private = 1.1.1.2
service = lvs
backup_active = 1
backup = IP b
backup_private = 1.1.1.1
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 18
network = direct
debug_level = NONE
hard_shutdown = 0 <---
monitor_links = 0
syncdaemon = 0
.
.
.

This was supposed to be fixed as per:
https://bugzilla.redhat.com/show_bug.cgi?id=505172
But does not appear to be.
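
For anyone scripting a check for the failure state described above (an empty LVS table on both routers), the following is a minimal sketch in Python. It only parses text in the `ipvsadm -L` layout quoted in this report; the function name `count_ipvs_entries` and the two sample strings are illustrative assumptions, not part of Piranha or ipvsadm.

```python
from typing import Tuple

# Sample text copied from the "before" and "after" listings in this report.
HEALTHY = """IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.1.1.10:http wrr
  -> 10.1.1.8:http                Route   1      0          0
  -> 10.1.1.9:http                Route   1      0          0
TCP  10.1.1.10:https wrr
  -> 10.1.1.8:https               Route   1      0          0
  -> 10.1.1.9:https               Route   1      0          0
"""

EMPTY = """IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
"""


def count_ipvs_entries(ipvsadm_output: str) -> Tuple[int, int]:
    """Return (virtual_services, real_servers) found in `ipvsadm -L` text.

    Hypothetical helper: it relies only on the line prefixes visible in the
    listings above (a protocol column for services, "->" for real servers).
    """
    services = 0
    real_servers = 0
    for raw_line in ipvsadm_output.splitlines():
        line = raw_line.strip()
        if line.startswith(("TCP", "UDP", "FWM")):
            services += 1
        elif line.startswith("->") and "RemoteAddress" not in line:
            # Skip the "-> RemoteAddress:Port ..." column header.
            real_servers += 1
    return services, real_servers


if __name__ == "__main__":
    print(count_ipvs_entries(HEALTHY))  # (2, 4)
    print(count_ipvs_entries(EMPTY))    # (0, 0)
```

A result of `(0, 0)` on the active router corresponds to the "after" state in this report, where every virtual service has been removed.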
Created attachment 1144774
Set remote port prior to comparison

This patch correctly sets the port and rport in findClientConfig before doing the comparison. It changes "port" to be the virtual service port and "rport" (remote port) to be the service's port.

Before this change, the remote port was not set ahead of the comparison and so was always 0, which caused the comparison to always fail. When a nanny process died, lvsd was unable to locate the associated service entry, which in turn caused lvsd to exit immediately. The result was that the hard_shutdown option was never honored.

With this patch, the service entry associated with a failed nanny process should always be identified correctly.
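
The effect of the unset remote port can be illustrated with a small model. This is a sketch in Python, not the Piranha C code: `MonitorKey`, `find_client_config`, and `on_nanny_death` are hypothetical names, the field layout is assumed, and it assumes that a non-zero hard_shutdown means lvsd shuts everything down when a nanny dies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MonitorKey:
    """Hypothetical identity of one nanny monitor (not the real struct)."""
    vip: str
    port: int   # virtual service port
    rip: str
    rport: int  # real server (remote) port


def find_client_config(entries: List[MonitorKey],
                       key: MonitorKey) -> Optional[MonitorKey]:
    """Return the entry matching key on every field, or None."""
    for entry in entries:
        if entry == key:
            return entry
    return None


def on_nanny_death(entries: List[MonitorKey], key: MonitorKey,
                   hard_shutdown: int) -> str:
    """Model the decision taken when a nanny exits."""
    entry = find_client_config(entries, key)
    if entry is None:
        # Lookup failed: the dead nanny cannot be tied to a service,
        # so everything is shut down regardless of hard_shutdown.
        return "shutdown_all"
    if hard_shutdown:
        # Assumption: non-zero hard_shutdown requests a full shutdown.
        return "shutdown_all"
    return "keep_running"


ENTRIES = [
    MonitorKey("10.1.1.10", 80, "10.1.1.8", 80),
    MonitorKey("10.1.1.10", 443, "10.1.1.8", 443),
]

# Before the patch: rport is never filled in, so it stays 0.
unpatched_key = MonitorKey("10.1.1.10", 443, "10.1.1.8", 0)
# After the patch: port and rport are both set before comparing.
patched_key = MonitorKey("10.1.1.10", 443, "10.1.1.8", 443)

if __name__ == "__main__":
    print(on_nanny_death(ENTRIES, unpatched_key, hard_shutdown=0))  # shutdown_all
    print(on_nanny_death(ENTRIES, patched_key, hard_shutdown=0))    # keep_running
```

With the key's rport left at 0 the lookup can never match, so the hard_shutdown value is never consulted, which matches the behaviour reported in this bug.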
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0722.html