Bug 1823661 - Octavia UDP health monitor test fails
Summary: Octavia UDP health monitor test fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Assaf Muller
QA Contact: Bruna Bonguardo
URL:
Whiteboard:
Depends On: 1823755
Blocks: 1801721
 
Reported: 2020-04-14 07:41 UTC by Gregory Thiemonge
Modified: 2020-07-08 17:52 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-08 17:52:12 UTC
Target Upstream Version:
Embargoed:



Description Gregory Thiemonge 2020-04-14 07:41:48 UTC
Description of problem:

octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic (upstream octavia-tempest-plugin) fails on OSP 16.1.

Tempest log file ends with:

2020-04-06 10:44:04,540 298003 INFO     [tempest.lib.common.rest_client] Request (TrafficOperationsScenarioTest:_run_cleanups): 200 GET http://10.0.0.124:9876/v2.0/lbaas/loadbalancers/ef7a8120-e038-4dc0-8973-db9b3395d8ea 0.131s
2020-04-06 10:44:04,541 298003 DEBUG    [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Mon, 06 Apr 2020 14:44:04 GMT', 'server': 'Apache', 'content-length': '895', 'x-openstack-request-id': 'req-c29b703b-297e-48b4-8200-c44ea3466049', 'vary': 'Accept-Encoding', 'connection': 'close', 'content-type': 'application/json', 'status': '200', 'content-location': 'http://10.0.0.124:9876/v2.0/lbaas/loadbalancers/ef7a8120-e038-4dc0-8973-db9b3395d8ea'}
        Body: b'{"loadbalancer": {"id": "ef7a8120-e038-4dc0-8973-db9b3395d8ea", "name": "tempest-lb_member_lb1_operations-636131152", "description": "", "provisioning_status": "ACTIVE", "operating_status": "ONLINE", "admin_state_up": true, "project_id": "6969a921cc82481fa85681cc6c0630ef", "created_at": "2020-04-06T14:31:23", "updated_at": "2020-04-06T14:44:02", "vip_address": "10.1.1.119", "vip_port_id": "cc02180d-9510-47a0-bf88-a88b5b9711e9", "vip_subnet_id": "a9615df9-161a-4ecc-86cc-5684e0faed11", "vip_network_id": "e1ce7387-1b8b-4276-b9af-7ef0a58f84ca", "listeners": [{"id": "c62bc04e-6037-4de5-858e-e3a5468bcfe4"}, {"id": "1046aa2b-71bb-4318-b9d0-0b3d33846816"}], "pools": [{"id": "30d6174a-2bc3-4d3a-8638-69262702ad4e"}, {"id": "9521cdfc-ad56-4f73-8f9d-7aa73dc5ded1"}], "provider": "octavia", "flavor_id": null, "vip_qos_policy_id": null, "tags": [], "tenant_id": "6969a921cc82481fa85681cc6c0630ef"}}'
2020-04-06 10:44:04,541 298003 INFO     [octavia_tempest_plugin.tests.waiters] show_loadbalancer's status updated to ACTIVE.
2020-04-06 10:44:04,542 298003 INFO     [octavia_tempest_plugin.services.load_balancer.v2.base_client] Cleanup complete for member b6f2863a-b9ca-4fd2-9881-53458105b501...

Traceback (most recent call last):
  File "/home/stack/plugins/octavia/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 471, in test_healthmonitor_udp_traffic
    self._test_healthmonitor_traffic(const.UDP, 8080)
  File "/home/stack/plugins/octavia/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 402, in _test_healthmonitor_traffic
    pool_id=self.pool_ids[protocol])
  File "/home/stack/plugins/octavia/octavia_tempest_plugin/tests/waiters.py", line 96, in wait_for_status
    raise exceptions.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (TrafficOperationsScenarioTest:test_healthmonitor_udp_traffic) show_member operating_status failed to update to ERROR within the required time 300. Current status of show_member: ONLINE


This means that the failure of a server instance was not detected by Octavia.



Version-Release number of selected component (if applicable):
16.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy Octavia on OSP16.1
2. Run upstream octavia-tempest-plugin tests
3.

Actual results:

octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic fails


Expected results:

Octavia should be able to detect that a server instance is down, octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_udp_traffic should pass
Additional info:

Comment 2 Gregory Thiemonge 2020-04-14 07:50:50 UTC
After investigation, it turns out that ICMP packets are dropped by ml2/ovn.

Octavia relies on ICMP destination unreachable messages to detect whether a server is down: the LB's health monitor sends a UDP datagram to the server and, if the port is closed (i.e. the application is not running), the Linux kernel should reply with an ICMP destination unreachable message. Using tcpdump I verified that the ICMP destination unreachable message was sent by the server to the LB, but the LB never received it.
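
To make the mechanism concrete, here is a minimal sketch of such a UDP probe (this is not Octavia's health-manager code; the function name and payload are made up for illustration). On Linux, an ICMP port unreachable reply to a connected UDP socket is surfaced as ECONNREFUSED on a later socket call, so when the ICMP message is dropped the probe can only time out instead of failing fast:

import socket

def udp_member_alive(host, port, timeout=2.0):
    # Hypothetical probe, for illustration only.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))   # connect() so ICMP errors are reported on this socket
        s.send(b"healthcheck")
        s.recv(1024)               # wait for an application reply or an ICMP error
        return True                # got an answer: member is up
    except ConnectionRefusedError:
        return False               # ICMP port unreachable received: member is down
    except socket.timeout:
        return None                # no answer at all (e.g. the ICMP reply was dropped)
    finally:
        s.close()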

Comment 3 Gregory Thiemonge 2020-04-14 08:13:03 UTC
I managed to reproduce the issue without using Octavia, with a simple heat stack:


heat_template_version: 2016-10-14

parameters:
    image:
        type: string
        default: cirros-0.4.0-x86_64-disk.img

    network:
        type: string
        default: private

    public_network:
        type: string
        default: nova

resources:
    flavor_tiny:
        type: OS::Nova::Flavor
        properties:
            disk: 1
            ram: 512
            vcpus: 1

    security_group1:
        type: OS::Neutron::SecurityGroup
        properties:
            rules:
                - remote_ip_prefix: 0.0.0.0/0
                  protocol: tcp
                  port_range_min: 22
                  port_range_max: 22

    server1:
        type: OS::Nova::Server
        properties:
            image: { get_param: image }
            flavor: { get_resource: flavor_tiny }
            security_groups:
               - { get_resource: security_group1 }
            networks:
              - network: { get_param: network }

    fip1:
        type: OS::Neutron::FloatingIP
        properties:
            floating_network_id: { get_param: public_network }

    fipa1:
        type: OS::Neutron::FloatingIPAssociation
        properties:
            port_id: { get_attr: [ server1, addresses, { get_param: network }, 0, port ] }
            floatingip_id: { get_resource: fip1 }

    security_group2:
        type: OS::Neutron::SecurityGroup
        properties:
            rules:
                - remote_ip_prefix: 0.0.0.0/0
                  protocol: tcp
                  port_range_min: 22
                  port_range_max: 22
                - remote_ip_prefix: 0.0.0.0/0
                  protocol: udp
                  port_range_min: 2222
                  port_range_max: 2222
                - remote_ip_prefix: 0.0.0.0/0
                  protocol: tcp
                  port_range_min: 2222
                  port_range_max: 2222

    server2:
        type: OS::Nova::Server
        properties:
            image: { get_param: image }
            flavor: { get_resource: flavor_tiny }
            security_groups:
               - { get_resource: security_group2 }
            networks:
              - network: { get_param: network }

    fip2:
        type: OS::Neutron::FloatingIP
        properties:
            floating_network_id: { get_param: public_network }

    fipa2:
        type: OS::Neutron::FloatingIPAssociation
        properties:
            port_id: { get_attr: [ server2, addresses, { get_param: network }, 0, port ] }
            floatingip_id: { get_resource: fip2 }



$ openstack stack create -t servers.yml --wait servers


The stack launches 2 servers:
- server1 is the "client", used to connect to server2 (netcat on TCP/UDP port 2222)
- server2 acts as the "server": its security group allows connections on port 2222 (TCP and UDP), but no application is listening on that port.


Steps:

1 ssh server1
2 observe network traffic using tcpdump on server1's and server2's ports (ssh to the compute nodes, then run tcpdump -nn -i tap<port_id>)
3 send a UDP datagram to server2 from server1:

$ date | nc -u 10.0.0.235 2222

4 UDP packet is seen on server1 port:

07:56:36.493472 IP 10.0.1.92.33002 > 10.0.0.235.2222: UDP, length 29

5 UDP packet arrives on server2's port; as the UDP port is not open, the server replies with an ICMP message:

07:56:36.495365 IP 10.0.0.236.33002 > 10.0.1.87.2222: UDP, length 29
07:56:36.497758 IP 10.0.1.87 > 10.0.0.236: ICMP 10.0.1.87 udp port 2222 unreachable, length 65

6 ICMP message is never received by server1


I've found out that we have a similar issue with TCP:

1...2 same as above
3 connect to server2 from server1 using TCP

$ nc 10.0.0.235 2222

4 TCP SYN packet is seen on server1 port:

08:01:38.505955 IP 10.0.1.92.37403 > 10.0.0.235.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1098874 ecr 0,nop,wscale 6], length 0


5 TCP SYN packet arrives on server2's port; the TCP port is closed, so the Linux kernel replies with a TCP RST packet:

08:01:38.507662 IP 10.0.0.236.37403 > 10.0.1.87.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1098874 ecr 0,nop,wscale 6], length 0
08:01:38.510137 IP 10.0.1.87.2222 > 10.0.0.236.37403: Flags [R.], seq 0, ack 4252833580, win 0, length 0

6 RST packet is never received by server1, which keeps trying to connect by re-sending the TCP SYN packet:

08:01:39.504242 IP 10.0.1.92.37403 > 10.0.0.235.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1099124 ecr 0,nop,wscale 6], length 0
08:01:41.508209 IP 10.0.1.92.37403 > 10.0.0.235.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1099625 ecr 0,nop,wscale 6], length 0

7 server2 replies to each retransmitted SYN with another RST packet:

08:01:39.504563 IP 10.0.0.236.37403 > 10.0.1.87.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1099124 ecr 0,nop,wscale 6], length 0
08:01:39.504773 IP 10.0.1.87.2222 > 10.0.0.236.37403: Flags [R.], seq 0, ack 1, win 0, length 0
08:01:41.508651 IP 10.0.0.236.37403 > 10.0.1.87.2222: Flags [S], seq 4252833579, win 28040, options [mss 1402,sackOK,TS val 1099625 ecr 0,nop,wscale 6], length 0
08:01:41.508890 IP 10.0.1.87.2222 > 10.0.0.236.37403: Flags [R.], seq 0, ack 1, win 0, length 0


This TCP issue also affects Octavia: TCP health monitors can still detect that a server is down, but only after a long timeout (they should detect it immediately, thanks to the RST reply).
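
A rough sketch of that difference, assuming a plain socket-based check (again, not the Octavia implementation): a closed port normally refuses the connection immediately (RST -> ConnectionRefusedError), but when the RST is lost the client keeps retransmitting SYNs until the socket timeout expires.

import socket

def tcp_member_alive(host, port, timeout=5.0):
    # Hypothetical probe, for illustration only.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True            # connection established: member is up
    except ConnectionRefusedError:
        return False               # RST received: failure detected immediately
    except socket.timeout:
        return False               # RST dropped: failure only seen after `timeout`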


I also reproduced that issue in devstack using ML2/OVN.
Note that this behavior is not reproducible with ML2/OVS: ICMP destination unreachable and TCP RST packets are correctly forwarded to the client.

Comment 4 Gregory Thiemonge 2020-04-14 08:37:30 UTC
Updating the BZ because in the previous comment I used a floating IP to connect to the server, which makes the tcpdump logs difficult to read.
Here is the same reproduction using only IP addresses from the private subnet:


UDP issue:

send UDP packet from server1:

$ date | nc -u 10.0.1.87 2222

tcpdump on server1 port:

08:33:36.123229 IP 10.0.1.92.40133 > 10.0.1.87.2222: UDP, length 29

tcpdump on server2 port:

08:33:36.123689 IP 10.0.1.92.40133 > 10.0.1.87.2222: UDP, length 29
08:33:36.124319 IP 10.0.1.87 > 10.0.1.92: ICMP 10.0.1.87 udp port 2222 unreachable, length 65


TCP issue:

connect to server2 from server1 using TCP:

$ nc 10.0.1.87 2222

tcpdump on server1 port:

08:36:02.882839 IP 10.0.1.92.33261 > 10.0.1.87.2222: Flags [S], seq 3424505559, win 28040, options [mss 1402,sackOK,TS val 1614968 ecr 0,nop,wscale 6], length 0
08:36:03.880069 IP 10.0.1.92.33261 > 10.0.1.87.2222: Flags [S], seq 3424505559, win 28040, options [mss 1402,sackOK,TS val 1615218 ecr 0,nop,wscale 6], length 0

tcpdump on server2 port:

08:36:02.883883 IP 10.0.1.92.33261 > 10.0.1.87.2222: Flags [S], seq 3424505559, win 28040, options [mss 1402,sackOK,TS val 1614968 ecr 0,nop,wscale 6], length 0
08:36:02.885862 IP 10.0.1.87.2222 > 10.0.1.92.33261: Flags [R.], seq 0, ack 3424505560, win 0, length 0
08:36:03.880374 IP 10.0.1.92.33261 > 10.0.1.87.2222: Flags [S], seq 3424505559, win 28040, options [mss 1402,sackOK,TS val 1615218 ecr 0,nop,wscale 6], length 0
08:36:03.880602 IP 10.0.1.87.2222 > 10.0.1.92.33261: Flags [R.], seq 0, ack 1, win 0, length 0

