Bug 1319376

Summary: The "hard_shutdown" directive in lvs.cf does not work
Product: Red Hat Enterprise Linux 6 Reporter: Jonathan Maxwell <jmaxwell>
Component: piranha Assignee: Ryan O'Hara <rohara>
Status: CLOSED ERRATA QA Contact: Brandon Perkins <bperkins>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.7 CC: cluster-maint
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: piranha-0.8.6-7.el6 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-21 11:10:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1269194    
Attachments:
Description Flags
Set remote port prior to comparison none

Description Jonathan Maxwell 2016-03-19 05:13:39 UTC
Description of problem:

We have a customer who reported that all http/https services were suspended when they encountered an issue where the nanny process for a test script was killed by the timeout command. After that, all web services were unavailable until Pulse was restarted.

Both routers end up with nothing in the LVS routing table:

Before on active when all is good:

 # ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.1.1.10:http wrr
  -> 10.1.1.8:http         Route   1      0          0         
  -> 10.1.1.9:http         Route   1      0          0         
TCP  10.1.1.10:https wrr
  -> 10.1.1.8:https        Route   1      0          0         
  -> 10.1.1.9:https        Route   1      0          0        

Reproduce:

Mar 17 16:07:57  pulse[32412]: STARTING PULSE AS MASTER
Mar 17 16:09:17  pulse[32412]: backup inactive: activating lvs
Mar 17 16:09:17  lvsd[32418]: starting virtual service sxxx_http active: 80
Mar 17 16:09:17  lvsd[32418]: create_monitor for sxxx_http/hosta01-prod running as pid 32422
Mar 17 16:09:17  lvsd[32418]: create_monitor for sxxx_http/hosta02-prod running as pid 32423
Mar 17 16:09:17  lvsd[32418]: starting virtual service sxxx_https active: 443
Mar 17 16:09:17  nanny[32423]: starting LVS client monitor for VIP:80 -> real_server_ip1:80
Mar 17 16:09:17  nanny[32422]: starting LVS client monitor for VIP:80 -> real_server_ip2:80
Mar 17 16:09:17  lvsd[32418]: create_monitor for sxxx_https/hosta01-prod running as pid 32425
Mar 17 16:09:17  lvsd[32418]: create_monitor for sxxx_https/hosta02-prod running as pid 32426
Mar 17 16:09:17  nanny[32425]: External program use requested: (/usr/local/sbin/https-test.sh %), IGNORING send string option ((null))
Mar 17 16:09:17  nanny[32425]: starting LVS client monitor for VIP:443 -> real_server_ip2:443
Mar 17 16:09:17  nanny[32426]: External program use requested: (/usr/local/sbin/https-test.sh %), IGNORING send string option ((null))
Mar 17 16:09:17  nanny[32426]: starting LVS client monitor for VIP:443 -> real_server_ip1:443
Mar 17 16:09:17  nanny[32423]: [ active ] making real_server_ip1:80 available
Mar 17 16:09:17  nanny[32426]: Trouble. Received results are not what we expected from (real_server_ip1:443)

Note the 2 second interval before the services are stopped:

Mar 17 16:09:19  nanny[32425]: Terminating due to signal 15
Mar 17 16:09:19  nanny[32425]: Killing child 32431
Mar 17 16:09:19  lvsd[32418]: nanny died! shutting down lvs
Mar 17 16:09:19  lvsd[32418]: shutting down virtual service spcd_http
Mar 17 16:09:19  nanny[32422]: Terminating due to signal 15
Mar 17 16:09:19  nanny[32423]: Terminating due to signal 15
Mar 17 16:09:19  kernel: IPVS: __ip_vs_del_service: enter
Mar 17 16:09:19  lvsd[32418]: shutting down virtual service spcd_https
Mar 17 16:09:19  nanny[32426]: Terminating due to signal 15
Mar 17 16:09:19  kernel: IPVS: __ip_vs_del_service: enter
Mar 17 16:09:19  pulse[32412]: Child process 32418 exited with status 0
Mar 17 16:09:22  pulse[32428]: gratuitous lvs arps finished
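
The 2 second interval noted above can be read straight off the timestamps. As a minimal sketch, the following Python snippet does that arithmetic on two lines copied from the log. The regular expression and the helper name `first_time` are illustrative assumptions, not part of piranha, and the snippet only compares timestamps; it does not claim which event caused which.

```python
import re
from datetime import datetime

# Two lines copied from the log excerpt in this report.
LOG = """\
Mar 17 16:09:17  nanny[32426]: Trouble. Received results are not what we expected from (real_server_ip1:443)
Mar 17 16:09:19  lvsd[32418]: nanny died! shutting down lvs
"""

# Layout assumed from the excerpt: month, day, HH:MM:SS, process[pid]: message
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+(\d\d:\d\d:\d\d)\s+(\w+)\[(\d+)\]:\s+(.*)$")


def first_time(log, needle):
    """Return the time of day of the first line whose message contains needle."""
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m and needle in m.group(4):
            # Only the time of day is parsed; both lines are from the same day.
            return datetime.strptime(m.group(1), "%H:%M:%S")
    return None


trouble = first_time(LOG, "Trouble.")
died = first_time(LOG, "nanny died!")
interval_seconds = int((died - trouble).total_seconds())
print(interval_seconds)
```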

After on both routers:

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
#

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
#
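
A quick way to tell the healthy table from the empty one is to count the virtual service rows. The sketch below is a hypothetical helper (the name `list_virtual_services` is an assumption) that parses `ipvsadm -L` text in the layout shown in this report. It only recognises rows starting with TCP or UDP, and it is fed the sample output from this report rather than calling ipvsadm itself.

```python
HEALTHY = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.1.1.10:http wrr
  -> 10.1.1.8:http         Route   1      0          0
  -> 10.1.1.9:http         Route   1      0          0
TCP  10.1.1.10:https wrr
  -> 10.1.1.8:https        Route   1      0          0
  -> 10.1.1.9:https        Route   1      0          0
"""

EMPTY = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
"""


def list_virtual_services(ipvsadm_output):
    """Return 'PROTO address:port' for each virtual service row."""
    services = []
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        # Virtual service rows start with a protocol keyword;
        # header rows start with "IP"/"Prot" and real-server rows with "->".
        if len(fields) >= 2 and fields[0] in ("TCP", "UDP"):
            services.append(fields[0] + " " + fields[1])
    return services


print(list_virtual_services(HEALTHY))
print(list_virtual_services(EMPTY))
```

An empty list on both routers corresponds to the state described here, where no virtual service is left in the LVS routing table.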

Version-Release number of selected component (if applicable):

piranha-0.8.6-4.el6_5.2.x86_64

How reproducible:

Always.

Steps to Reproduce:

1. Just configure a basic Piranha LVS setup and kill one of the nanny processes on the active router with SIGTERM (kill -15).

Actual results:

All services are disabled and become unavailable.

Expected results:

With "hard_shutdown = 0" we would expect the other services to remain functional, but the "hard_shutdown" directive does not seem to make a difference in this behaviour.

Additional info:

I have a reproducer setup on some VMs.

/etc/ha/lvs.cf extract:

serial_no = 8
primary = IP p
primary_private = 1.1.1.2
service = lvs
backup_active = 1
backup = IP b
backup_private = 1.1.1.1
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 18
network = direct
debug_level = NONE
hard_shutdown = 0 <--- 
monitor_links = 0
syncdaemon = 0
.
.
.

This was supposed to be fixed as per:

https://bugzilla.redhat.com/show_bug.cgi?id=505172

But does not appear to be.

Comment 4 Ryan O'Hara 2016-04-07 15:09:36 UTC
Created attachment 1144774 [details]
Set remote port prior to comparison

This patch will correctly set the port and rport in findClientConfig prior to doing the comparison. It changes the "port" to be the virtual service port and the "rport" (remote port) to the service's port.

Prior to this, the remote port was not being set prior to comparison and thus would always be 0. This caused the comparison to always fail. In the case where a nanny process had died, lvsd was unable to locate the associated service entry, which in turn caused lvsd to exit immediately. The result is that the hard_shutdown option was never honored. With this patch, the service entry associated with a failed nanny process should always be identified correctly.
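
The failure mode described above can be modelled in a few lines. The following is a simplified Python model, not the piranha C source: `MonitorEntry`, `find_entry`, and `on_nanny_death` are hypothetical names, the addresses and ports are taken from the ipvsadm output in the description, and what lvsd actually does to the affected entry when hard_shutdown = 0 is reduced here to "only that service is affected".

```python
from dataclasses import dataclass


@dataclass
class MonitorEntry:
    """Hypothetical stand-in for the per-nanny record that lvsd searches."""
    service: str
    address: str
    port: int   # virtual service port
    rport: int  # remote (real server) port


def find_entry(entries, address, port, rport):
    """Return the entry whose address, port and rport all match, else None."""
    for entry in entries:
        if (entry.address, entry.port, entry.rport) == (address, port, rport):
            return entry
    return None


def on_nanny_death(entries, key, hard_shutdown):
    """Return the names of the services that end up shut down."""
    entry = find_entry(entries, *key)
    if entry is None or hard_shutdown:
        # Lookup failed (the bug) or hard_shutdown = 1: everything goes down.
        return sorted({e.service for e in entries})
    # hard_shutdown = 0 and the entry was found: only that service is affected.
    return [entry.service]


entries = [
    MonitorEntry("sxxx_http", "10.1.1.8", 80, 80),
    MonitorEntry("sxxx_http", "10.1.1.9", 80, 80),
    MonitorEntry("sxxx_https", "10.1.1.8", 443, 443),
    MonitorEntry("sxxx_https", "10.1.1.9", 443, 443),
]

# Before the patch: rport was never set, so the probe key carried rport = 0.
print(on_nanny_death(entries, ("10.1.1.8", 443, 0), hard_shutdown=False))
# After the patch: rport is set before the comparison.
print(on_nanny_death(entries, ("10.1.1.8", 443, 443), hard_shutdown=False))
```

In this model the unpatched lookup never matches, so every service is shut down regardless of hard_shutdown, which is consistent with the empty routing table in the original description.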

Comment 14 errata-xmlrpc 2017-03-21 11:10:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0722.html