Bug 1837316 - Incorrect behavior with UDP health-monitors on CentOS/RHEL
Summary: Incorrect behavior with UDP health-monitors on CentOS/RHEL
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Gregory Thiemonge
QA Contact: Bruna Bonguardo
Blocks: 1801721
 
Reported: 2020-05-19 09:12 UTC by Gregory Thiemonge
Modified: 2020-10-28 15:38 UTC
CC: 9 users

Fixed In Version: openstack-octavia-5.0.2-0.20200626184853.e6b94f7.el8ost
Doc Type: Known Issue
Doc Text:
The keepalived instance in the Red Hat OpenStack Platform Load-balancing service (octavia) instance (amphora) can abnormally terminate and interrupt UDP traffic. The cause of this issue is that the timeout value for the UDP health monitor is too small. Workaround: specify a new timeout value that is greater than two seconds: `$ openstack loadbalancer healthmonitor set --timeout 3 <health_monitor_id>`. For more information, search for "loadbalancer healthmonitor" in the Command Line Interface Reference.
Last Closed: 2020-10-28 15:37:32 UTC




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 729230 0 None MERGED Fix netcat option in udp_check.sh for CentOS/RHEL 2021-01-25 11:11:00 UTC
OpenStack gerrit 738023 0 None MERGED Fix netcat option in udp_check.sh for CentOS/RHEL 2021-01-25 11:10:18 UTC
Red Hat Product Errata RHEA-2020:4284 0 None None None 2020-10-28 15:38:12 UTC
Storyboard 2007688 0 None None None 2020-05-19 09:12:08 UTC

Description Gregory Thiemonge 2020-05-19 09:12:08 UTC
Incorrect behavior with UDP health-monitors on CentOS using devstack + octavia master (a33d42fa59ada38dd5a5d3114a9a83f03a30d0b2)

Tempest failure
---------------

Octavia tempest scenario octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic fails on CentOS with the following error:

--
Traceback (most recent call last):

File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 471, in test_healthmonitor_udp_traffic
self._test_healthmonitor_traffic(const.UDP, 8080)

File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 415, in _test_healthmonitor_traffic
traffic_member_count=1)

File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1090, in check_members_balanced
protocol_port, protocol, verify)

File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1016, in _wait_for_lb_functional
raise Exception()

Exception
--

It fails because a member that should be seen as ONLINE goes into ERROR.


Amphora logs
------------

I reproduced it manually:

# openstack loadbalancer healthmonitor create --delay 3 --timeout 2 --max-retries 2 --type UDP-CONNECT --name hm1 pool1

Amphora logs:

--
May 19 06:25:00 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Activating healthchecker for service [10.0.0.81]:udp:8080 for VS [10.0.0.146]:udp:80
May 19 06:25:00 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Activating healthchecker for service [10.0.0.91]:udp:8080 for VS [10.0.0.146]:udp:80
May 19 06:25:05 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Misc check for [[10.0.0.91]:udp:8080 VS [10.0.0.146]:udp:80] by [/var/lib/octavia/lvs/check/udp_check.sh] timed out.
May 19 06:25:05 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Removing service [10.0.0.91]:udp:8080 to VS [10.0.0.146]:udp:80
May 19 06:25:06 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Misc check for [[10.0.0.81]:udp:8080 VS [10.0.0.146]:udp:80] by [/var/lib/octavia/lvs/check/udp_check.sh] timed out.
May 19 06:25:06 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Removing service [10.0.0.81]:udp:8080 to VS [10.0.0.146]:udp:80
--

Using tcpdump I can see that the UDP packets from the server are correctly received by the amphora.
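
A hypothetical capture command for verifying this (run in the amphora-haproxy namespace shown below; the interface name eth1 is an assumption):

--
# ip netns exec amphora-haproxy tcpdump -ni eth1 udp and host 10.0.0.81
--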


Checking health check script manually
-------------------------------------

keepalived configuration contains:

--
MISC_CHECK {
    misc_path "/var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080"
    misc_timeout 2
}
--
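
The script essentially wraps a netcat UDP probe; a minimal hypothetical sketch of its core logic, assuming the '-w1' option discussed below (not the exact shipped script):

--
#!/bin/bash
# Probe the member with netcat in UDP (-u), zero-I/O (-z), verbose (-v)
# mode and a 1-second connection timeout (-w1); keepalived reads the
# member state from the script's exit code.
nc -uzv -w1 "$1" "$2" > /dev/null 2>&1
--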

But when running the udp_check.sh script manually, I got strange behavior:

--
# ip netns exec amphora-haproxy /var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.81:8080.
Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 9 bytes received in 2.02 seconds.
--

The health check script ran successfully, but netcat exited after 2.02 seconds, which is longer than misc_timeout.
It means that keepalived killed the script and considered the member to be in ERROR.

After investigation, it seems that the '-w' option passed to nmap-netcat (used by CentOS) doesn't do anything for UDP (from the nmap source code, it's only used for stream sockets).
But ncat provides a '-i' option that sets the timeout after the connection goes idle. The default for -i is 2 seconds, which explains the previous run's duration; it also means that the bug is not triggered if --timeout is greater (>) than 2 seconds.

--
# ip netns exec amphora-haproxy /bin/nc -uzv -i1 10.0.0.81 8080
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.81:8080.
10.0.0.81Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 9 bytes received in 1.01 seconds.
--

Note that https://review.opendev.org/#/c/718881/ fixed the UDP health-check timeout value, which was incorrectly using the delay value.
Before that commit, the tempest test used a 3-second timeout, so the bug was not triggered.

I would recommend detecting which netcat implementation is used (e.g. "nc --version" returns "Ncat: Version 7.70 ( https://nmap.org/ncat )" on CentOS) and adapting the netcat options based on the result.
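
A minimal sketch of that detection (hypothetical, not the merged fix; it assumes the '-i1'/'-w1' option mapping described in this report):

--
#!/bin/bash
# Pick the timeout flag based on the installed netcat implementation:
# nmap-ncat ignores '-w' for UDP, so use its idle timeout '-i' instead.
if nc --version 2>&1 | grep -q Ncat; then
    timeout_opt="-i1"   # nmap-ncat (CentOS/RHEL)
else
    timeout_opt="-w1"   # other flavors, e.g. openbsd netcat
fi
nc -uzv "$timeout_opt" "$1" "$2" > /dev/null 2>&1
--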

Note that '-w1' or '-i1' makes the timeout configuration for the UDP health check useless: nc will time out after only 1 second.

Comment 1 Gregory Thiemonge 2020-05-19 09:13:42 UTC
This issue also affects OSP 16.1.

Comment 4 Gregory Thiemonge 2020-05-27 13:02:06 UTC
Note that this behavior can be worked around by using a timeout value greater than 2 seconds (>= 3 seconds) for UDP health monitors.
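
For example, using the workaround command from the doc text:

--
$ openstack loadbalancer healthmonitor set --timeout 3 <health_monitor_id>
--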

Comment 10 spower 2020-07-15 11:36:13 UTC
The z1 TM is removed because it was not approved for inclusion as per the process.

Comment 18 errata-xmlrpc 2020-10-28 15:37:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

