The keepalived process running inside the Red Hat OpenStack Platform Load-balancing service (octavia) instance (amphora) can abnormally terminate and interrupt UDP traffic. The cause of this issue is a UDP health monitor timeout value that is too small.
+
Workaround: specify a new timeout value that is greater than two seconds: `$ openstack loadbalancer healthmonitor set --timeout 3 <health_monitor_id>`
+
For more information, search for "loadbalancer healthmonitor" in the Command Line Interface Reference.
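+
To confirm that the new value has been applied, you can, for example, run `$ openstack loadbalancer healthmonitor show <health_monitor_id>` and check the `timeout` field in the output.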
Description: Gregory Thiemonge, 2020-05-19 09:12:08 UTC
Incorrect behavior with UDP health-monitors on CentOS using devstack + octavia master (a33d42fa59ada38dd5a5d3114a9a83f03a30d0b2)
Tempest failure
---------------
Octavia tempest scenario octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic fails on CentOS with the following error:
--
Traceback (most recent call last):
File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 471, in test_healthmonitor_udp_traffic
self._test_healthmonitor_traffic(const.UDP, 8080)
File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 415, in _test_healthmonitor_traffic
traffic_member_count=1)
File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1090, in check_members_balanced
protocol_port, protocol, verify)
File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1016, in _wait_for_lb_functional
raise Exception()
Exception
--
It fails because a member that should be seen as ONLINE goes into ERROR.
Amphora logs
------------
I reproduced it manually:
# openstack loadbalancer healthmonitor create --delay 3 --timeout 2 --max-retries 2 --type UDP-CONNECT --name hm1 pool1
Amphora logs:
--
May 19 06:25:00 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Activating healthchecker for service [10.0.0.81]:udp:8080 for VS [10.0.0.146]:udp:80
May 19 06:25:00 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Activating healthchecker for service [10.0.0.91]:udp:8080 for VS [10.0.0.146]:udp:80
May 19 06:25:05 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Misc check for [[10.0.0.91]:udp:8080 VS [10.0.0.146]:udp:80] by [/var/lib/octavia/lvs/check/udp_check.sh] timed out.
May 19 06:25:05 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Removing service [10.0.0.91]:udp:8080 to VS [10.0.0.146]:udp:80
May 19 06:25:06 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Misc check for [[10.0.0.81]:udp:8080 VS [10.0.0.146]:udp:80] by [/var/lib/octavia/lvs/check/udp_check.sh] timed out.
May 19 06:25:06 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Removing service [10.0.0.81]:udp:8080 to VS [10.0.0.146]:udp:80
--
Using tcpdump I can see that the UDP packets from the server are correctly received by the amphora.
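For reference, the capture was done along these lines (the interface name inside the amphora network namespace is a guess and may differ):
--
# ip netns exec amphora-haproxy tcpdump -n -i eth1 udp port 8080
--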
Checking health check script manually
-------------------------------------
The keepalived configuration contains:
--
MISC_CHECK {
misc_path "/var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080"
misc_timeout 2
}
--
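For comparison, with the 3-second workaround timeout the generated stanza would presumably look like this (a sketch, assuming the health monitor --timeout value maps directly to misc_timeout as above):
--
MISC_CHECK {
misc_path "/var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080"
misc_timeout 3
}
--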
But when running the udp_check.sh script manually, I observed strange behavior:
--
# ip netns exec amphora-haproxy /var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.81:8080.
Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 9 bytes received in 2.02 seconds.
--
The health check script ran successfully, but netcat exited after 2.02 seconds, which is longer than misc_timeout.
This means that keepalived killed the script and considered the member to be in ERROR.
After investigation, it seems that the '-w' option passed to nmap-netcat (used by CentOS) does nothing for UDP: judging from the nmap source code, it is only honored for stream sockets.
However, netcat provides a '-i' option, which is the timeout after the connection goes idle. Its default is 2 seconds, which explains the duration of the previous run; it also means the bug is not triggered if --timeout is greater (>) than 2 seconds.
--
# ip netns exec amphora-haproxy /bin/nc -uzv -i1 10.0.0.81 8080
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.81:8080.
10.0.0.81Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 9 bytes received in 1.01 seconds.
--
Note that https://review.opendev.org/#/c/718881/ fixed the UDP health-check timeout value, which was incorrectly using the delay value.
Before that commit, the tempest test used a 3-second timeout, so the bug was not triggered.
I would recommend detecting which netcat implementation is in use (e.g. "nc --version" returns "Ncat: Version 7.70 ( https://nmap.org/ncat )" on CentOS) and adapting the netcat options based on the result, as sketched below.
Note that hard-coding '-w1' or '-i1' would make the timeout configuration for UDP health checks useless: nc would time out after only 1 sec.
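A rough sketch of that idea, assuming the timeout is passed to the check script as a third argument (the structure is illustrative, not the actual Octavia fix):
--
#!/bin/bash
# Hypothetical udp_check.sh <ip> <port> [timeout]
IP=$1; PORT=$2; TIMEOUT=${3:-3}
if nc --version 2>&1 | grep -q 'Ncat'; then
    # nmap-ncat (CentOS/RHEL): '-w' is ignored for UDP, so use the idle timeout '-i' instead
    exec nc -uzv -i"$TIMEOUT" "$IP" "$PORT"
else
    # other netcat implementations honor '-w' for UDP
    exec nc -uzv -w"$TIMEOUT" "$IP" "$PORT"
fi
--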
Comment 1: Gregory Thiemonge, 2020-05-19 09:13:42 UTC
This issue also affects OSP 16.1.
Comment 4: Gregory Thiemonge, 2020-05-27 13:02:06 UTC
Note that this behavior can be worked around by using a timeout value greater than 2 sec (>= 3 sec) for UDP health monitors.
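For example, recreating the reproducer health monitor with a 3-second timeout (and a delay at least as large, since the timeout is normally expected not to exceed the delay) should avoid the failure:
--
# openstack loadbalancer healthmonitor create --delay 5 --timeout 3 --max-retries 2 --type UDP-CONNECT --name hm1 pool1
--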
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2020:4284