1319376 – The "hard_shutdown" lvs.cf derivative in lvs.cf does not work

Summary: The "hard_shutdown" lvs.cf derivative in lvs.cf does not work

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	piranha
Sub Component:
Version:	6.7
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Ryan O'Hara
QA Contact:	Brandon Perkins
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1269194

Reported:	2016-03-19 05:13 UTC by Jonathan Maxwell
Modified:	2019-10-10 11:37 UTC
CC List:	1 user
Fixed In Version:	piranha-0.8.6-7.el6
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-03-21 11:10:40 UTC
Target Upstream Version:
Embargoed:

Attachments
Set remote port prior to comparison (943 bytes, patch) 2016-04-07 15:09 UTC, Ryan O'Hara	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:0722	0	normal	SHIPPED_LIVE	piranha bug fix update	2017-03-21 12:42:13 UTC

Description Jonathan Maxwell 2016-03-19 05:13:39 UTC

Description of problem:

We have a customer who reported that all http/https services were suspended when they encountered an issue where the nanny process for a test script was killed by the timeout command. After that, all web services were unavailable until Pulse was restarted.

Both routers end up with nothing in the LVS routing table:

Before on active when all is good:

 # ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.1.1.10:http wrr
  -> 10.1.1.8:http         Route   1      0          0         
  -> 10.1.1.9:http         Route   1      0          0         
TCP  10.1.1.10:https wrr
  -> 10.1.1.8:https        Route   1      0          0         
  -> 10.1.1.9:https        Route   1      0          0        

Reproduce:

Mar 17 16:07:57  pulse[32412]: STARTING PULSE AS MASTER
Mar 17 16:09:17  pulse[32412]: backup inactive: activating lvs
Mar 17 16:09:17  lvsd[32418]: starting virtual service sxxx_http active: 80
Mar 17 16:09:17  lvsd[32418]: create_monitor for sxxx_http/hosta01-prod running as pid 32422
Mar 17 16:09:17  lvsd[32418]: create_monitor for sxxx_http/hosta02-prod running as pid 32423
Mar 17 16:09:17  lvsd[32418]: starting virtual service sxxx_https active: 443
Mar 17 16:09:17  nanny[32423]: starting LVS client monitor for VIP:80 -> real_server_ip1:80
Mar 17 16:09:17  nanny[32422]: starting LVS client monitor for VIP:80 -> real_server_ip2:80
Mar 17 16:09:17  lvsd[32418]: create_monitor for sxxx_https/hosta01-prod running as pid 32425
Mar 17 16:09:17  lvsd[32418]: create_monitor for sxxx_https/hosta02-prod running as pid 32426
Mar 17 16:09:17  nanny[32425]: External program use requested: (/usr/local/sbin/https-test.sh %), IGNORING send string option ((null))
Mar 17 16:09:17  nanny[32425]: starting LVS client monitor for VIP:443 -> real_server_ip2:443
Mar 17 16:09:17  nanny[32426]: External program use requested: (/usr/local/sbin/https-test.sh %), IGNORING send string option ((null))
Mar 17 16:09:17  nanny[32426]: starting LVS client monitor for VIP:443 -> real_server_ip1:443
Mar 17 16:09:17  nanny[32423]: [ active ] making real_server_ip1:80 available
Mar 17 16:09:17  nanny[32426]: Trouble. Received results are not what we expected from (real_server_ip1:443)

Note the 2 second interval before the services are stopped:

Mar 17 16:09:19  nanny[32425]: Terminating due to signal 15
Mar 17 16:09:19  nanny[32425]: Killing child 32431
Mar 17 16:09:19  lvsd[32418]: nanny died! shutting down lvs
Mar 17 16:09:19  lvsd[32418]: shutting down virtual service spcd_http
Mar 17 16:09:19  nanny[32422]: Terminating due to signal 15
Mar 17 16:09:19  nanny[32423]: Terminating due to signal 15
Mar 17 16:09:19  kernel: IPVS: __ip_vs_del_service: enter
Mar 17 16:09:19  lvsd[32418]: shutting down virtual service spcd_https
Mar 17 16:09:19  nanny[32426]: Terminating due to signal 15
Mar 17 16:09:19  kernel: IPVS: __ip_vs_del_service: enter
Mar 17 16:09:19  pulse[32412]: Child process 32418 exited with status 0
Mar 17 16:09:22  pulse[32428]: gratuitous lvs arps finished
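
For anyone checking their own logs for the same pattern, here is a minimal sketch that lists which nanny PIDs were terminated and by which signal. The function name and regular expression are illustrative assumptions based only on the message format shown in the excerpt above; they are not part of piranha.

```python
import re

# Pattern assumed from the log excerpt above:
#   "nanny[PID]: Terminating due to signal N"
TERMINATED = re.compile(r"nanny\[(\d+)\]: Terminating due to signal (\d+)")


def terminated_nannies(log_text):
    """Return (pid, signal) pairs for every nanny termination message."""
    return [(int(pid), int(sig)) for pid, sig in TERMINATED.findall(log_text)]


# Lines copied from the excerpt above.
LOG = """\
Mar 17 16:09:19  nanny[32425]: Terminating due to signal 15
Mar 17 16:09:19  nanny[32425]: Killing child 32431
Mar 17 16:09:19  lvsd[32418]: nanny died! shutting down lvs
Mar 17 16:09:19  nanny[32422]: Terminating due to signal 15
Mar 17 16:09:19  nanny[32423]: Terminating due to signal 15
Mar 17 16:09:19  nanny[32426]: Terminating due to signal 15
"""

for pid, sig in terminated_nannies(LOG):
    print(f"nanny {pid} terminated by signal {sig}")
```

In this excerpt all four nanny processes report signal 15, which matches the observation that every monitor was stopped, not just the one that was killed first.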

After on both routers:

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
#

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
#
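
To compare the "before" and "after" states programmatically, the following sketch parses the text printed by ipvsadm -L into a mapping of virtual services to real servers. The helper name is hypothetical, and the parsing assumes only the column layout visible in the output above; an empty result corresponds to the empty table seen on both routers.

```python
def parse_ipvsadm(output):
    """Map each virtual service ("TCP addr:port") to its real servers.

    Assumes the layout shown in this report: a virtual service line
    starts with the protocol, and each real server line starts with "->".
    """
    services = {}
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith(("TCP", "UDP", "FWM")):
            fields = stripped.split()
            current = f"{fields[0]} {fields[1]}"
            services[current] = []
        elif stripped.startswith("->") and current is not None:
            # The "-> RemoteAddress:Port" header row precedes any service
            # line, so it is skipped while current is still None.
            services[current].append(stripped.split()[1])
    return services


BEFORE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.1.1.10:http wrr
  -> 10.1.1.8:http         Route   1      0          0
  -> 10.1.1.9:http         Route   1      0          0
TCP  10.1.1.10:https wrr
  -> 10.1.1.8:https        Route   1      0          0
  -> 10.1.1.9:https        Route   1      0          0
"""

AFTER = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
"""

print(len(parse_ipvsadm(BEFORE)), "virtual services before")
print(len(parse_ipvsadm(AFTER)), "virtual services after")
```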

Version-Release number of selected component (if applicable):

piranha-0.8.6-4.el6_5.2.x86_64

How reproducible:

Always.

Steps to Reproduce:

1. Just configure a basic Piranha LVS setup and kill one of the nanny processes on the active router with SIGTERM (kill -15).

Actual results:

All services are disabled and become unavailable.

Expected results:

With "hard_shutdown = 0" we would expect the other services to remain functional, but the "hard_shutdown" directive does not seem to make any difference to this behaviour.

Additional info:

I have a reproducer setup on some VMs.

/etc/ha/lvs.cf extract:

serial_no = 8
primary = IP p
primary_private = 1.1.1.2
service = lvs
backup_active = 1
backup = IP b
backup_private = 1.1.1.1
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 18
network = direct
debug_level = NONE
hard_shutdown = 0 <--- 
monitor_links = 0
syncdaemon = 0
.
.
.

This was supposed to be fixed as per:

https://bugzilla.redhat.com/show_bug.cgi?id=505172

But does not appear to be.
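
As a quick way to confirm which value of hard_shutdown a router is actually configured with, here is a minimal sketch that reads the global "key = value" lines of an lvs.cf extract. The function name is hypothetical, the sample is a subset of the extract above, and the parser deliberately ignores anything that is not a simple assignment, so it is not a full lvs.cf parser.

```python
def parse_lvs_globals(text):
    """Collect simple "key = value" lines from an lvs.cf extract."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip elision dots, blank lines, and block braces
        key, _, value = line.partition("=")
        # Drop the "<---" marker that was added in this report for emphasis.
        value = value.split("<---")[0].strip()
        settings[key.strip()] = value
    return settings


# Subset of the extract quoted in this report.
EXTRACT = """\
serial_no = 8
service = lvs
backup_active = 1
debug_level = NONE
hard_shutdown = 0 <---
monitor_links = 0
syncdaemon = 0
.
.
"""

settings = parse_lvs_globals(EXTRACT)
print("hard_shutdown =", settings["hard_shutdown"])
```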

Comment 4 Ryan O'Hara 2016-04-07 15:09:36 UTC

Created attachment 1144774 [details]
Set remote port prior to comparison

This patch will correctly set the port and rport in findClientConfig prior to doing the comparison. It changes the "port" to be the virtual service port and the "rport" (remote port) to the service's port.

Prior to this, the remote port was not being set prior to comparison and thus would always be 0. This caused the comparison to always fail. In the case where a nanny process had died, lvsd was unable to locate the associated service entry, which in turn caused lvsd to exit immediately. The result is that the hard_shutdown option was never honored. With this patch, the service entry associated with a failed nanny process should always be identified correctly.
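
The failure mode described above can be illustrated with a small sketch. This is not piranha's code: the Monitor class, the find_monitor function, and the sample entries are hypothetical stand-ins, and the only behaviour taken from the patch description is that a lookup comparing against a remote port of 0 never matches.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Hypothetical stand-in for a monitored service entry."""
    service: str
    port: int   # virtual service port
    rport: int  # remote (real server) port


def find_monitor(monitors, port, rport):
    """Return the entry whose port and rport both match, else None."""
    for monitor in monitors:
        if monitor.port == port and monitor.rport == rport:
            return monitor
    return None


monitors = [
    Monitor("sxxx_http", 80, 80),
    Monitor("sxxx_https", 443, 443),
]

# Assumed pre-patch behaviour: rport was never filled in, so it stayed 0
# and no entry could match.
before_patch = find_monitor(monitors, port=443, rport=0)

# Assumed post-patch behaviour: both ports are set before the comparison.
after_patch = find_monitor(monitors, port=443, rport=443)

print("before patch:", before_patch)
print("after patch:", after_patch)
```

With the lookup returning nothing, the caller has no service entry to act on, which is consistent with lvsd exiting immediately regardless of the hard_shutdown setting.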

Comment 14 errata-xmlrpc 2017-03-21 11:10:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0722.html