The keepalived process running inside the Red Hat OpenStack Platform Load-balancing service (octavia) instance (amphora) can abnormally terminate and interrupt UDP traffic. The cause of this issue is a UDP health monitor timeout value that is too small.
+
Workaround: specify a new timeout value that is greater than two seconds: `$ openstack loadbalancer healthmonitor set --timeout 3 <health_monitor_id>`
+
For more information, search for "loadbalancer healthmonitor" in the Command Line Interface Reference.
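+
To confirm that the new value has been applied, you can, for example, run `$ openstack loadbalancer healthmonitor show <health_monitor_id>` and check the `timeout` field in the output.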
Description: Gregory Thiemonge, 2020-05-19 09:12:08 UTC
Incorrect behavior with UDP health-monitors on CentOS using devstack + octavia master (a33d42fa59ada38dd5a5d3114a9a83f03a30d0b2)
Tempest failure
---------------
Octavia tempest scenario octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic fails on CentOS with the following error:
--
Traceback (most recent call last):
File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 471, in test_healthmonitor_udp_traffic
self._test_healthmonitor_traffic(const.UDP, 8080)
File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 415, in _test_healthmonitor_traffic
traffic_member_count=1)
File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1090, in check_members_balanced
protocol_port, protocol, verify)
File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1016, in _wait_for_lb_functional
raise Exception()
Exception
--
It fails because a member that should be seen as ONLINE goes into ERROR.
Amphora logs
------------
I reproduced it manually:
# openstack loadbalancer healthmonitor create --delay 3 --timeout 2 --max-retries 2 --type UDP-CONNECT --name hm1 pool1
Amphora logs:
--
May 19 06:25:00 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Activating healthchecker for service [10.0.0.81]:udp:8080 for VS [10.0.0.146]:udp:80
May 19 06:25:00 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Activating healthchecker for service [10.0.0.91]:udp:8080 for VS [10.0.0.146]:udp:80
May 19 06:25:05 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Misc check for [[10.0.0.91]:udp:8080 VS [10.0.0.146]:udp:80] by [/var/lib/octavia/lvs/check/udp_check.sh] timed out.
May 19 06:25:05 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Removing service [10.0.0.91]:udp:8080 to VS [10.0.0.146]:udp:80
May 19 06:25:06 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Misc check for [[10.0.0.81]:udp:8080 VS [10.0.0.146]:udp:80] by [/var/lib/octavia/lvs/check/udp_check.sh] timed out.
May 19 06:25:06 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Removing service [10.0.0.81]:udp:8080 to VS [10.0.0.146]:udp:80
--
Using tcpdump I can see that the UDP packets from the server are correctly received by the amphora.
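For reference, the capture was done along these lines (the interface name inside the amphora network namespace is a guess and may differ):
--
# ip netns exec amphora-haproxy tcpdump -n -i eth1 udp port 8080
--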
Checking health check script manually
-------------------------------------
The keepalived configuration contains:
--
MISC_CHECK {
misc_path "/var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080"
misc_timeout 2
}
--
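For comparison, with the 3-second workaround timeout the generated stanza would presumably look like this (a sketch, assuming the health monitor --timeout value maps directly to misc_timeout as above):
--
MISC_CHECK {
misc_path "/var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080"
misc_timeout 3
}
--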
But when running the udp_check.sh script manually, I observed strange behavior:
--
# ip netns exec amphora-haproxy /var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.81:8080.
Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 9 bytes received in 2.02 seconds.
--
The health check script ran successfully, but netcat exited after 2.02 seconds, which is longer than misc_timeout.
This means that keepalived killed the script and considered the member to be in ERROR.
After investigation, it seems that the '-w' option passed to nmap-netcat (used by CentOS) does nothing for UDP: judging from the nmap source code, it is only honored for stream sockets.
However, netcat provides a '-i' option, which is the timeout after the connection goes idle. Its default is 2 seconds, which explains the duration of the previous run; it also means the bug is not triggered if --timeout is greater (>) than 2 seconds.
--
# ip netns exec amphora-haproxy /bin/nc -uzv -i1 10.0.0.81 8080
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.81:8080.
10.0.0.81Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 9 bytes received in 1.01 seconds.
--
Note that https://review.opendev.org/#/c/718881/ fixed the UDP health-check timeout value, which was incorrectly using the delay value.
Before that commit, the tempest test used a 3-second timeout, so the bug was not triggered.
I would recommend detecting which netcat implementation is in use (e.g. "nc --version" returns "Ncat: Version 7.70 ( https://nmap.org/ncat )" on CentOS) and adapting the netcat options based on the result, as sketched below.
Note that hard-coding '-w1' or '-i1' would make the timeout configuration for UDP health checks useless: nc would time out after only 1 sec.
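A rough sketch of that idea, assuming the timeout is passed to the check script as a third argument (the structure is illustrative, not the actual Octavia fix):
--
#!/bin/bash
# Hypothetical udp_check.sh <ip> <port> [timeout]
IP=$1; PORT=$2; TIMEOUT=${3:-3}
if nc --version 2>&1 | grep -q 'Ncat'; then
    # nmap-ncat (CentOS/RHEL): '-w' is ignored for UDP, so use the idle timeout '-i' instead
    exec nc -uzv -i"$TIMEOUT" "$IP" "$PORT"
else
    # other netcat implementations honor '-w' for UDP
    exec nc -uzv -w"$TIMEOUT" "$IP" "$PORT"
fi
--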
Comment 1: Gregory Thiemonge, 2020-05-19 09:13:42 UTC
This issue also affects OSP 16.1.
Comment 4: Gregory Thiemonge, 2020-05-27 13:02:06 UTC
Note that this behavior can be worked around by using a timeout value greater than 2 sec (>= 3 sec) for UDP health monitors.
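For example, recreating the reproducer health monitor with a 3-second timeout (and a delay at least as large, since the timeout is normally expected not to exceed the delay) should avoid the failure:
--
# openstack loadbalancer healthmonitor create --delay 5 --timeout 3 --max-retries 2 --type UDP-CONNECT --name hm1 pool1
--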
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2020:4284