Incorrect behavior with UDP health-monitors on CentOS using devstack + octavia master (a33d42fa59ada38dd5a5d3114a9a83f03a30d0b2)

Tempest failure
---------------
The Octavia tempest scenario test octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic fails on CentOS with the following error:

--
Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 471, in test_healthmonitor_udp_traffic
    self._test_healthmonitor_traffic(const.UDP, 8080)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 415, in _test_healthmonitor_traffic
    traffic_member_count=1)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1090, in check_members_balanced
    protocol_port, protocol, verify)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1016, in _wait_for_lb_functional
    raise Exception()
Exception
--

It fails because a member that should be reported as ONLINE goes into ERROR.
Amphora logs
------------
I reproduced the issue manually:

# openstack loadbalancer healthmonitor create --delay 3 --timeout 2 --max-retries 2 --type UDP-CONNECT --name hm1 pool1

Amphora logs:

--
May 19 06:25:00 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Activating healthchecker for service [10.0.0.81]:udp:8080 for VS [10.0.0.146]:udp:80
May 19 06:25:00 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Activating healthchecker for service [10.0.0.91]:udp:8080 for VS [10.0.0.146]:udp:80
May 19 06:25:05 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Misc check for [[10.0.0.91]:udp:8080 VS [10.0.0.146]:udp:80] by [/var/lib/octavia/lvs/check/udp_check.sh] timed out.
May 19 06:25:05 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Removing service [10.0.0.91]:udp:8080 to VS [10.0.0.146]:udp:80
May 19 06:25:06 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Misc check for [[10.0.0.81]:udp:8080 VS [10.0.0.146]:udp:80] by [/var/lib/octavia/lvs/check/udp_check.sh] timed out.
May 19 06:25:06 amphora-b55220fa-faef-4688-b13c-84b23092fba7.novalocal Keepalived_healthcheckers_amphora-haproxy[11839]: Removing service [10.0.0.81]:udp:8080 to VS [10.0.0.146]:udp:80
--

Using tcpdump, I can see that the UDP packets from the server are correctly received by the amphora.
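The "timed out" lines above are keepalived's MISC_CHECK behavior: the check script is killed after misc_timeout seconds, and any non-zero exit status (including being killed) marks the member as failed. This can be approximated with coreutils' timeout(1) in a minimal sketch (the "sleep 3" stands in for a udp_check.sh run that idles past misc_timeout; this is an illustration, not actual keepalived code):

```shell
#!/bin/sh
# Sketch of keepalived MISC_CHECK semantics: kill the check after
# misc_timeout seconds, treat a non-zero exit as member failure.
misc_timeout=2
# "sleep 3" stands in for a udp_check.sh invocation that waits on
# netcat's 2-second idle timeout and therefore outlives misc_timeout.
if timeout "$misc_timeout" sh -c 'sleep 3'; then
    member_state="ONLINE"
else
    member_state="ERROR"
fi
echo "member $member_state"
```

Even though the member answered the UDP probe, the check process is still alive when misc_timeout expires, so keepalived reports ERROR.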
Checking health check script manually
-------------------------------------
The keepalived configuration contains:

--
MISC_CHECK {
    misc_path "/var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080"
    misc_timeout 2
}
--

But when running the udp_check.sh script manually, I got a strange behavior:

--
# ip netns exec amphora-haproxy /var/lib/octavia/lvs/check/udp_check.sh 10.0.0.81 8080
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.81:8080.
Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 9 bytes received in 2.02 seconds.
--

The health check script ran successfully, but netcat exited after 2.02 seconds, which is longer than misc_timeout (2 sec). This means that keepalived killed the script and considered the member to be in ERROR.

After investigation, it appears that the '-w' option passed to nmap-netcat (the implementation used by CentOS) has no effect for UDP: from the nmap source code, it is only used for stream sockets. However, netcat provides a '-i' option, which is the timeout after the connection goes idle. The default for '-i' is 2 sec, which explains the duration of the previous run; it also means that the bug is not triggered when --timeout is greater (>) than 2 sec.

--
# ip netns exec amphora-haproxy /bin/nc -uzv -i1 10.0.0.81 8080
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.81:8080.
10.0.0.81Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 9 bytes received in 1.01 seconds.
--

Note that https://review.opendev.org/#/c/718881/ fixed the UDP health-check timeout value, which was incorrectly using the delay value. Before that commit, the tempest test used 3 sec as the timeout, so the bug was not triggered.

I would recommend detecting which netcat implementation is used (e.g. "nc --version" returns "Ncat: Version 7.70 ( https://nmap.org/ncat )" on CentOS) and adapting the netcat options based on the result. Note that a hardcoded '-w1' or '-i1' would make the timeout configuration for UDP health checks useless, as nc would always time out after 1 sec.
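The detection recommended above could be sketched as follows. This is a hypothetical helper, not the actual Octavia fix; it assumes that nmap-ncat identifies itself with an "Ncat:" prefix in its version output (as shown in the CentOS runs above) and that other netcat flavors honor '-w' for UDP sockets:

```shell
#!/bin/sh
# Hypothetical helper: choose the netcat timeout option based on which
# implementation "nc --version" reports. nmap-ncat ignores -w for UDP
# (datagram) sockets but honors -i (idle timeout); other netcat
# implementations honor -w for UDP.
pick_timeout_opt() {
    version_output="$1"   # output of: nc --version 2>&1
    case "$version_output" in
        Ncat:*) echo "-i" ;;  # nmap-ncat (CentOS/RHEL)
        *)      echo "-w" ;;  # traditional / OpenBSD netcat
    esac
}

# Example with the version string seen on CentOS:
opt=$(pick_timeout_opt "Ncat: Version 7.70 ( https://nmap.org/ncat )")
echo "would run: nc -uzv ${opt}2 10.0.0.81 8080"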
This issue also affects OSP 16.1.
Note that this behavior can be worked around by using a timeout value greater than 2 sec (i.e. >= 3 sec) for UDP health monitors.
The z1 TM is removed because it was not approved for inclusion as per the process.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4284