Bug 1792157

Summary: TrafficOperationsScenarioTest fails when ACTIVE_STANDBY is enabled
Product: Red Hat OpenStack Reporter: Gregory Thiemonge <gthiemon>
Component: openstack-octaviaAssignee: Gregory Thiemonge <gthiemon>
Status: CLOSED CURRENTRELEASE QA Contact: Bruna Bonguardo <bbonguar>
Severity: high Docs Contact:
Priority: high    
Version: 16.0 (Train)CC: ihrachys, lpeer, majopela, mfuruta, scohen
Target Milestone: betaKeywords: TestOnly, Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-02-04 16:00:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1792160, 1801895    
Bug Blocks: 1774913    

Description Gregory Thiemonge 2020-01-17 07:30:34 UTC
Description of problem:
Running TrafficOperationsScenarioTest tempest scenarios fails if ACTIVE_STANDBY topology is enabled.

Version-Release number of selected component (if applicable):
16 (RHOS_TRUNK-16.0-RHEL-8-20200113.n.0)


How reproducible:
100%


Steps to Reproduce:
1. enable ACTIVE_STANDBY topology in octavia controllers
2. launch tempest TrafficOperationsScenarioTest
3.

Actual results:
Tests fail because LB doesn't process the traffic:

{0} octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_basic_traffic [75.602990s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_traffic [86.241716s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_l7policies_and_l7rules [125.917772s] ... FAILED
[..]

    Traceback (most recent call last):
      File "/home/stack/devel/octavia-tempest-plugin/octavia_tempest_plugin/tests/scenario/v2/test_traffic_ops.py", line 618, in test_l7policies_and_l7rules
        url_for_member1)
      File "/home/stack/devel/octavia-tempest-plugin/octavia_tempest_plugin/tests/test_base.py", line 973, in assertConsistentResponse
        timeout=timeout, **kwargs)
      File "/home/stack/devel/tempest/.tox/tempest/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
        resp = self.send(prep, **send_kwargs)
      File "/home/stack/devel/tempest/.tox/tempest/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
        r = adapter.send(request, **kwargs)
      File "/home/stack/devel/tempest/.tox/tempest/lib/python3.6/site-packages/requests/adapters.py", line 504, in send
        raise ConnectTimeout(e, request=request)
    requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='10.0.0.217', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f8e69ee8048>, 'Connection to 10.0.0.217 timed out. (connect timeout=2)'))


Expected results:
Tests should pass

Additional info:

Bug is also reproducible with centos8 amphora.

Reading amphorae logs, it seems that keepalived (2.0.10) goes into a FAULT STATE during the execution of the test:

In MASTER amphora:

Jan 15 16:46:33 amphora-fe00831a-b328-4587-87aa-11063687deb8 gunicorn.gunicorn.error: [898] GET //
Jan 15 16:46:33 amphora-fe00831a-b328-4587-87aa-11063687deb8 gunicorn.gunicorn.access: [898] ::ffff:192.168.0.156 - - [15/Jan/2020:16:46:33 +0000] "GET // HTTP/1.1" 200 21 "-" "Octavia HaProxy Rest Client/0.5 (https://wiki.openstack.org/wi
ki/Octavia)"
Jan 15 16:46:33 amphora-fe00831a-b328-4587-87aa-11063687deb8 gunicorn.gunicorn.error: [898] POST /1.0/plug/network
Jan 15 16:46:33 amphora-fe00831a-b328-4587-87aa-11063687deb8 amphora-agent[898]: 2020-01-15 16:46:33.810 898 INFO octavia.amphorae.backends.agent.api_server.plug [-] Plugged interface ens8 will become eth2 in the namespace amphora-haproxy
Jan 15 16:46:33 amphora-fe00831a-b328-4587-87aa-11063687deb8 Keepalived_vrrp[1312]: Interface name has changed from ens8 to eth2
Jan 15 16:46:33 amphora-fe00831a-b328-4587-87aa-11063687deb8 Keepalived_vrrp[1312]: (baa7f2985b7f42ce8901194bc4c482bc) Entering FAULT STATE
Jan 15 16:46:33 amphora-fe00831a-b328-4587-87aa-11063687deb8 Keepalived_vrrp[1312]: (baa7f2985b7f42ce8901194bc4c482bc) sent 0 priority

In BACKUP amphora:

Jan 15 16:46:34 amphora-29f925aa-0c9c-4e04-99c4-ba359ce0e785 Keepalived_vrrp[1301]: (baa7f2985b7f42ce8901194bc4c482bc) Entering MASTER STATE
[...]
Jan 15 16:46:53 amphora-29f925aa-0c9c-4e04-99c4-ba359ce0e785 gunicorn.gunicorn.error: [885] GET //
Jan 15 16:46:53 amphora-29f925aa-0c9c-4e04-99c4-ba359ce0e785 gunicorn.gunicorn.access: [885] ::ffff:192.168.0.156 - - [15/Jan/2020:16:46:53 +0000] "GET // HTTP/1.1" 200 21 "-" "Octavia HaProxy Rest Client/0.5 (https://wiki.openstack.org/wiki/Octavia)"
Jan 15 16:46:53 amphora-29f925aa-0c9c-4e04-99c4-ba359ce0e785 gunicorn.gunicorn.error: [885] POST /1.0/plug/network
Jan 15 16:46:53 amphora-29f925aa-0c9c-4e04-99c4-ba359ce0e785 amphora-agent[885]: 2020-01-15 16:46:53.256 885 INFO octavia.amphorae.backends.agent.api_server.plug [-] Plugged interface ens9 will become eth3 in the namespace amphora-haproxy
Jan 15 16:46:53 amphora-29f925aa-0c9c-4e04-99c4-ba359ce0e785 Keepalived_vrrp[1301]: Interface name has changed from ens9 to eth3
Jan 15 16:46:53 amphora-29f925aa-0c9c-4e04-99c4-ba359ce0e785 Keepalived_vrrp[1301]: (baa7f2985b7f42ce8901194bc4c482bc) Entering FAULT STATE

Then no amphora has a VRRP VIP address on eth1, so the LB is not reachable


Simple reproducer

I reproduced the issue using one LB in ACTIVE_STANDBY, one listener and some iproute commands that simulate what amphora-agent does, it shows that the issue might be in keepalived code.

Connect into the MASTER amphora, check VRRP ip address and VRRP state in the logs

bash-4.4# ip -n amphora-haproxy a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:1f:f5:b5 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.189/26 brd 10.0.0.191 scope global eth1
       valid_lft forever preferred_lft forever
    inet 10.0.0.161/32 scope global eth1
       valid_lft forever preferred_lft forever

bash-4.4# journalctl -le | grep Keep
[..]
Jan 15 14:56:25 amphora-f3af9d57-790c-4471-ae1f-82ee48ec2f68.novalocal Keepalived_vrrp[1346]: (2877477fd523485ebb7edbb2bf0967e4) Entering BACKUP STATE
Jan 15 14:56:28 amphora-f3af9d57-790c-4471-ae1f-82ee48ec2f68.novalocal Keepalived_vrrp[1346]: (2877477fd523485ebb7edbb2bf0967e4) received lower priority (90) advert from 10.0.0.173 - discarding
Jan 15 14:56:29 amphora-f3af9d57-790c-4471-ae1f-82ee48ec2f68.novalocal Keepalived_vrrp[1346]: (2877477fd523485ebb7edbb2bf0967e4) Entering MASTER STATE
Create a dummy interface, move it to amphora-haproxy ns, check VRRP ip address

bash-4.4# ip link add veth0 type veth peer veth1
bash-4.4# ip link set veth1 up
bash-4.4# ip link set veth1 netns amphora-haproxy
bash-4.4# ip -n amphora-haproxy a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:1f:f5:b5 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.189/26 brd 10.0.0.191 scope global eth1
       valid_lft forever preferred_lft forever
    inet 10.0.0.161/32 scope global eth1
       valid_lft forever preferred_lft forever
4: veth1@if5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether fa:0f:5b:55:44:f5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Rename the dummy interface, check VRRP ip address and the logs -> address has been removed, keepalived is in FAULT STATE

bash-4.4# ip -n amphora-haproxy link set veth1 name foo0
bash-4.4# ip -n amphora-haproxy a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:1f:f5:b5 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.189/26 brd 10.0.0.191 scope global eth1
       valid_lft forever preferred_lft forever
4: foo0@if5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether fa:0f:5b:55:44:f5 brd ff:ff:ff:ff:ff:ff link-netnsid 0

bash-4.4# journalctl -le | grep Keep
[..]
Jan 15 16:31:27 amphora-f3af9d57-790c-4471-ae1f-82ee48ec2f68.novalocal Keepalived_vrrp[1346]: Interface name has changed from veth1 to foo0
Jan 15 16:31:27 amphora-f3af9d57-790c-4471-ae1f-82ee48ec2f68.novalocal Keepalived_vrrp[1346]: (2877477fd523485ebb7edbb2bf0967e4) Entering FAULT STATE
Jan 15 16:31:27 amphora-f3af9d57-790c-4471-ae1f-82ee48ec2f68.novalocal Keepalived_vrrp[1346]: (2877477fd523485ebb7edbb2bf0967e4) sent 0 priority

Comment 1 Gregory Thiemonge 2020-01-17 07:33:50 UTC
The bug is not present when using more recent keepalived releases (compiled from source). Using git bisect I found out that https://github.com/acassen/keepalived/commit/30eeb48b1a0737dc7443fd421fd6613e0d55fd17 fixes the issue, this commit is included in keepalived>=2.0.11 releases.

After cherrypicking that commit in keepalived 2.0.10, and updating the keepalived binary in the amphora, tests passed successfully:

all run-test-pre: PYTHONHASHSEED='567330168'
all run-test: commands[0] | find . -type f -name '*.pyc' -delete
all run-test: commands[1] | tempest run --regex test_traffic_ops
{0} octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_basic_traffic [80.633249s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_healthmonitor_traffic [86.448448s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_l7policies_and_l7rules [139.114542s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest.test_mixed_ipv4_ipv6_members_traffic [49.931187s] ... ok