Description of problem:

octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic (upstream octavia-tempest-plugin) fails on OSP 16.1.

Tempest log file ends with:

2020-04-06 10:44:04,540 298003 INFO [tempest.lib.common.rest_client] Request (TrafficOperationsScenarioTest:_run_cleanups): 200 GET http://10.0.0.124:9876/v2.0/lbaas/loadbalancers/ef7a8120-e038-4dc0-8973-db9b3395d8ea 0.131s
2020-04-06 10:44:04,541 298003 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Mon, 06 Apr 2020 14:44:04 GMT', 'server': 'Apache', 'content-length': '895', 'x-openstack-request-id': 'req-c29b703b-297e-48b4-8200-c44ea3466049', 'vary': 'Accept-Encoding', 'connection': 'close', 'content-type': 'application/json', 'status': '200', 'content-location': 'http://10.0.0.124:9876/v2.0/lbaas/loadbalancers/ef7a8120-e038-4dc0-8973-db9b3395d8ea'}
        Body: b'{"loadbalancer": {"id": "ef7a8120-e038-4dc0-8973-db9b3395d8ea", "name": "tempest-lb_member_lb1_operations-636131152", "description": "", "provisioning_status": "ACTIVE", "operating_status": "ONLINE", "admin_state_up": true, "project_id": "6969a921cc82481fa85681cc6c0630ef", "created_at": "2020-04-06T14:31:23", "updated_at": "2020-04-06T14:44:02", "vip_address": "10.1.1.119", "vip_port_id": "cc02180d-9510-47a0-bf88-a88b5b9711e9", "vip_subnet_id": "a9615df9-161a-4ecc-86cc-5684e0faed11", "vip_network_id": "e1ce7387-1b8b-4276-b9af-7ef0a58f84ca", "listeners": [{"id": "c62bc04e-6037-4de5-858e-e3a5468bcfe4"}, {"id": "1046aa2b-71bb-4318-b9d0-0b3d33846816"}], "pools": [{"id": "30d6174a-2bc3-4d3a-8638-69262702ad4e"}, {"id": "9521cdfc-ad56-4f73-8f9d-7aa73dc5ded1"}], "provider": "octavia", "flavor_id": null, "vip_qos_policy_id": null, "tags": [], "tenant_id": "6969a921cc82481fa85681cc6c0630ef"}}'
2020-04-06 10:44:04,541 298003 INFO [octavia_tempest_plugin.tests.waiters] show_loadbalancer's status updated to ACTIVE.
2020-04-06 10:44:04,542 298003 INFO [octavia_tempest_plugin.services.load_balancer.v2.base_client] Cleanup complete for member b6f2863a-b9ca-4fd2-9881-53458105b501...

Traceback (most recent call last):
  File "/home/stack/plugins/octavia/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 471, in test_healthmonitor_udp_traffic
    self._test_healthmonitor_traffic(const.UDP, 8080)
  File "/home/stack/plugins/octavia/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 402, in _test_healthmonitor_traffic
    pool_id=self.pool_ids[protocol])
  File "/home/stack/plugins/octavia/octavia_tempest_plugin/tests/waiters.py", line 96, in wait_for_status
    raise exceptions.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (TrafficOperationsScenarioTest:test_healthmonitor_udp_traffic) show_member operating_status failed to update to ERROR within the required time 300. Current status of show_member: ONLINE

It means that the failure of a server instance has not been detected by Octavia.

Version-Release number of selected component (if applicable):
16.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy Octavia on OSP 16.1
2. Run the upstream octavia-tempest-plugin tests

Actual results:
octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic fails

Expected results:
Octavia should be able to detect that a server instance is down; octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic should pass

Additional info:
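For reference, the waiter that raises the TimeoutException above is essentially a polling loop; a minimal sketch (hypothetical simplification, not the actual code from octavia_tempest_plugin/tests/waiters.py) looks like this:

```python
import time

def wait_for_status(show_func, field, expected, timeout=300, interval=5):
    """Poll show_func() until its `field` equals `expected`, or raise
    after `timeout` seconds -- mirroring the tempest-style waiter whose
    timeout message appears in the traceback above."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = show_func()[field]
        if current == expected:
            return current
        time.sleep(interval)
    raise TimeoutError(
        "%s failed to update to %s within the required time %s. "
        "Current status: %s" % (field, expected, timeout, current))
```

In this bug the member's operating_status never leaves ONLINE, so the loop runs for the full 300 seconds and raises.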
After investigation, ICMP (Internet Control Message Protocol) packets are dropped by ml2/ovn. Octavia relies on ICMP "destination unreachable" messages to detect that a server is down: the LB's health monitor sends a UDP datagram to the server, and if the port is closed (i.e. the application is not running) the Linux kernel should reply with an ICMP destination unreachable message. Using tcpdump, I checked that the ICMP destination unreachable message was sent by the server to the LB, but the LB never received it.
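This detection mechanism can be demonstrated with a minimal client-side probe (an illustrative sketch, not Octavia's health-monitor code; the function name is mine). On Linux, a connected UDP socket surfaces an incoming ICMP port-unreachable as ECONNREFUSED:

```python
import socket

def udp_port_closed(host, port, timeout=2.0):
    """Probe a UDP port, relying on ICMP port-unreachable to detect closure.

    Returns True if the kernel reports ECONNREFUSED (an ICMP 'port
    unreachable' came back), False if nothing arrived before the timeout,
    which is exactly the ambiguity this bug creates: a closed port and a
    healthy-but-silent port become indistinguishable.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    s.connect((host, port))  # connecting the socket lets ICMP errors surface
    try:
        s.send(b"healthcheck")
        s.recv(1024)         # ECONNREFUSED is raised here if ICMP came back
    except ConnectionRefusedError:
        return True          # ICMP unreachable received: port is closed
    except socket.timeout:
        return False         # no reply: ICMP was dropped, or the peer is mute
    finally:
        s.close()
    return False
```

When ml2/ovn drops the ICMP reply, this probe (and likewise the health monitor) falls into the timeout branch and the backend keeps reporting as up.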
I managed to reproduce the issue without using Octavia, with a simple heat stack:

heat_template_version: 2016-10-14

parameters:
  image:
    type: string
    default: cirros-0.4.0-x86_64-disk.img
  network:
    type: string
    default: private
  public_network:
    type: string
    default: nova

resources:
  flavor_tiny:
    type: OS::Nova::Flavor
    properties:
      disk: 1
      ram: 512
      vcpus: 1

  security_group1:
    type: OS::Neutron::SecurityGroup
    properties:
      rules:
        - remote_ip_prefix: 0.0.0.0/0
          protocol: tcp
          port_range_min: 22
          port_range_max: 22

  server1:
    type: OS::Nova::Server
    properties:
      image: { get_param: image }
      flavor: { get_resource: flavor_tiny }
      security_groups:
        - { get_resource: security_group1 }
      networks:
        - network: { get_param: network }

  fip1:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network_id: { get_param: public_network }

  fipa1:
    type: OS::Neutron::FloatingIPAssociation
    properties:
      port_id: { get_attr: [ server1, addresses, { get_param: network }, 0, port ] }
      floatingip_id: { get_resource: fip1 }

  security_group2:
    type: OS::Neutron::SecurityGroup
    properties:
      rules:
        - remote_ip_prefix: 0.0.0.0/0
          protocol: tcp
          port_range_min: 22
          port_range_max: 22
        - remote_ip_prefix: 0.0.0.0/0
          protocol: udp
          port_range_min: 2222
          port_range_max: 2222
        - remote_ip_prefix: 0.0.0.0/0
          protocol: tcp
          port_range_min: 2222
          port_range_max: 2222

  server2:
    type: OS::Nova::Server
    properties:
      image: { get_param: image }
      flavor: { get_resource: flavor_tiny }
      security_groups:
        - { get_resource: security_group2 }
      networks:
        - network: { get_param: network }

  fip2:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network_id: { get_param: public_network }

  fipa2:
    type: OS::Neutron::FloatingIPAssociation
    properties:
      port_id: { get_attr: [ server2, addresses, { get_param: network }, 0, port ] }
      floatingip_id: { get_resource: fip2 }

$ openstack stack create -t servers.yml --wait servers

The stack launches 2 servers:
- server1 is a "client", used to connect to server2 (netcat on TCP/UDP port 2222)
- server2 acts as the "server"; its SGs
allow connections on port 2222 (TCP and UDP), but no application is running on those ports.

Steps:
1. ssh to server1
2. observe the network traffic using tcpdump on server1's and server2's ports (ssh to the computes, then tcpdump -nn -i tap<port_id>)
3. send a UDP datagram to server2 from server1:
   $ date | nc -u 10.0.0.235 2222
4. the UDP packet is seen on server1's port:
   07:56:36.493472 IP 10.0.1.92.33002 > 10.0.0.235.2222: UDP, length 29
5. the UDP packet arrives on server2's port; as the UDP port is not reachable, the server replies with an ICMP message:
   07:56:36.495365 IP 10.0.0.236.33002 > 10.0.1.87.2222: UDP, length 29
   07:56:36.497758 IP 10.0.1.87 > 10.0.0.236: ICMP 10.0.1.87 udp port 2222 unreachable, length 65
6. the ICMP message is never received by server1

I found that we have a similar issue with TCP:
1-2. same as above
3. connect to server2 from server1 using TCP:
   $ nc 10.0.0.235 2222
4. the TCP SYN packet is seen on server1's port:
   08:01:38.505955 IP 10.0.1.92.37403 > 10.0.0.235.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1098874 ecr 0,nop,wscale 6], length 0
5. the TCP SYN packet arrives on server2's port; the TCP port is closed, so the Linux kernel replies with a TCP RST packet:
   08:01:38.507662 IP 10.0.0.236.37403 > 10.0.1.87.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1098874 ecr 0,nop,wscale 6], length 0
   08:01:38.510137 IP 10.0.1.87.2222 > 10.0.0.236.37403: Flags [R.], seq 0, ack 4252833580, win 0, length 0
6. the RST packet is never received by server1, which keeps trying to connect by re-sending the TCP SYN packet:
   08:01:39.504242 IP 10.0.1.92.37403 > 10.0.0.235.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1099124 ecr 0,nop,wscale 6], length 0
   08:01:41.508209 IP 10.0.1.92.37403 > 10.0.0.235.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1099625 ecr 0,nop,wscale 6], length 0
7. server2 replies with further RST packets:
   08:01:39.504563 IP 10.0.0.236.37403 > 10.0.1.87.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1099124 ecr 0,nop,wscale 6], length 0
   08:01:39.504773 IP 10.0.1.87.2222 > 10.0.0.236.37403: Flags [R.], seq 0, ack 1, win 0, length 0
   08:01:41.508651 IP 10.0.0.236.37403 > 10.0.1.87.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1099625 ecr 0,nop,wscale 6], length 0
   08:01:41.508890 IP 10.0.1.87.2222 > 10.0.0.236.37403: Flags [R.], seq 0, ack 1, win 0, length 0

This TCP issue also affects Octavia: TCP health monitors are still able to detect that a server is down, but only after a long timeout (they should be able to detect it immediately).

I also reproduced this issue in devstack using ML2/OVN. Note that this behavior is not reproducible with ML2/OVS: ICMP destination unreachable and TCP RST packets are correctly forwarded to the client.
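The TCP symptom can be demonstrated from the client side with a small connect probe (an illustrative sketch with names of my own choosing, not Octavia code). A returned RST surfaces as an immediate ConnectionRefusedError; when the RST is dropped, the connect attempt only fails after the full timeout:

```python
import socket
import time

def tcp_port_state(host, port, timeout=5.0):
    """Attempt a TCP connect and classify the outcome.

    Returns (state, elapsed_seconds) where state is:
      'open'     -- the three-way handshake completed
      'closed'   -- a RST came back, so detection is near-instant
      'filtered' -- nothing came back before the timeout, which is what
                    the client observes when ml2/ovn drops the RST
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open", time.monotonic() - start
    except ConnectionRefusedError:
        return "closed", time.monotonic() - start
    except socket.timeout:
        return "filtered", time.monotonic() - start
```

Against a closed port behind ML2/OVS this returns 'closed' almost immediately; behind the buggy ml2/ovn path the same closed port looks 'filtered' and costs the full timeout, which matches the slow TCP health-monitor detection described above.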
Updated the BZ because I used a floating IP to connect to the server, which makes the tcpdump logs difficult to read. Here are the same tests using only IP addresses from the private subnet:

UDP issue:

send a UDP packet from server1:
$ date | nc -u 10.0.1.87 2222

tcpdump on server1's port:
08:33:36.123229 IP 10.0.1.92.40133 > 10.0.1.87.2222: UDP, length 29

tcpdump on server2's port:
08:33:36.123689 IP 10.0.1.92.40133 > 10.0.1.87.2222: UDP, length 29
08:33:36.124319 IP 10.0.1.87 > 10.0.1.92: ICMP 10.0.1.87 udp port 2222 unreachable, length 65

TCP issue:

connect to server2 from server1 using TCP:
$ nc 10.0.1.87 2222

tcpdump on server1's port:
08:36:02.882839 IP 10.0.1.92.33261 > 10.0.1.87.2222: Flags [S], seq 3424505559, win 28040, options [mss 1402,sackOK,TS val 1614968 ecr 0,nop,wscale 6], length 0
08:36:03.880069 IP 10.0.1.92.33261 > 10.0.1.87.2222: Flags [S], seq 3424505559, win 28040, options [mss 1402,sackOK,TS val 1615218 ecr 0,nop,wscale 6], length 0

tcpdump on server2's port:
08:36:02.883883 IP 10.0.1.92.33261 > 10.0.1.87.2222: Flags [S], seq 3424505559, win 28040, options [mss 1402,sackOK,TS val 1614968 ecr 0,nop,wscale 6], length 0
08:36:02.885862 IP 10.0.1.87.2222 > 10.0.1.92.33261: Flags [R.], seq 0, ack 3424505560, win 0, length 0
08:36:03.880374 IP 10.0.1.92.33261 > 10.0.1.87.2222: Flags [S], seq 3424505559, win 28040, options [mss 1402,sackOK,TS val 1615218 ecr 0,nop,wscale 6], length 0
08:36:03.880602 IP 10.0.1.87.2222 > 10.0.1.92.33261: Flags [R.], seq 0, ack 1, win 0, length 0
A CI job run from May 25 passed successfully for the TrafficOperationsScenarioTest:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/QE/view/OSP16.1/job/DFG-network-octavia-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve/lastCompletedBuild/testReport/octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops/TrafficOperationsScenarioTest/

Moving the bug to VERIFIED.