Description of problem:

A 3-controller L3 HA setup fails when running tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps and tempest.scenario.test_network_basic_ops.TestNetworkBasicOps in parallel.

2016-10-19 08:23:50,648 1853 DEBUG [tempest.scenario.manager] TestNetworkAdvancedServerOps:test_server_connectivity_suspend_resume finishes ping 10.0.0.215 in 120 sec and the ping result is unexpected
2016-10-19 08:23:50.649 1853 ERROR tempest.scenario.manager AssertionError: False is not true : Timed out waiting for 10.0.0.215 to become reachable

After the failure, tempest usually prints the serial console log of the VM:

checking http://169.254.169.254/2009-04-04/instance-id
successful after 1/20 tries: up 3.45. iid=i-00000160
...
=== cirros: current=0.3.4 uptime=8.96 ===
...

The instance was able to reach the metadata service and was also able to boot up within 5 sec (nested guest virt). The test fails even with a 1200 sec timeout; 120 sec is already a huge timeout.
Version-Release number of selected component (if applicable):

openstack-neutron-openvswitch-9.0.0-0.20160929051647.71f2d2b.el7ost.noarch
python-neutron-lib-0.4.0-0.20160915221324.705fd90.el7ost.noarch
openstack-neutron-bigswitch-lldp-7.0.5-0.20160927101710.06e6160.el7ost.noarch
puppet-neutron-9.4.0-1.el7ost.noarch
python-neutron-9.0.0-0.20160929051647.71f2d2b.el7ost.noarch
openstack-neutron-9.0.0-0.20160929051647.71f2d2b.el7ost.noarch
openstack-neutron-lbaas-9.0.0-0.20160921180958.6528738.el7ost.noarch
openstack-neutron-metering-agent-9.0.0-0.20160929051647.71f2d2b.el7ost.noarch
python-neutronclient-6.0.0-0.20160916123315.f53d624.el7ost.noarch
python-neutron-tests-9.0.0-0.20160929051647.71f2d2b.el7ost.noarch
python-neutron-lbaas-9.0.0-0.20160921180958.6528738.el7ost.noarch
openstack-neutron-sriov-nic-agent-9.0.0-0.20160929051647.71f2d2b.el7ost.noarch
openstack-neutron-bigswitch-agent-7.0.5-0.20160927101710.06e6160.el7ost.noarch
openstack-neutron-common-9.0.0-0.20160929051647.71f2d2b.el7ost.noarch
openstack-neutron-ml2-9.0.0-0.20160929051647.71f2d2b.el7ost.noarch

How reproducible:
In my setup, at least one test failed on every iteration of:

while ostestr -c 2 -r net.*ops; do echo retry; done

Steps to Reproduce:
1. TripleO setup with 3 controllers and 1 compute; controllers have 12-16G memory and 4 cores. The L0 host does not have swap, and nested virtualization is enabled. L0 and L1 are RHEL 7.3; L2 is the cirros 3.4 single-disk image.

Actual results:
Random test failures:

${PYTHON:-python} -m subunit.run discover -t ${OS_TOP_LEVEL:-./} ${OS_TEST_PATH:-./tempest/test_discover} --load-list /tmp/tmpRODP_c
{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_connectivity_between_vms_on_different_networks ... SKIPPED: Skipped until Bug: 1610994 is resolved.
{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic [233.720208s] ... ok
{0} tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_pause_unpause [324.674474s] ... FAILED
{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops [269.790599s] ... ok
{0} tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_reboot [305.526910s] ... FAILED
{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_port_security_macspoofing_port [281.436929s] ... FAILED
{0} tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_rebuild [183.523064s] ... ok
{0} tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_resize ... SKIPPED: Resize is not available.
{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_preserve_preexisting_port [218.333963s] ... ok
{0} tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_stop_start [298.827240s] ... FAILED
{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_router_rescheduling [250.539876s] ... ok
{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_subnet_details [154.856716s] ... ok
{0} tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_suspend_resume [305.881300s] ... FAILED
{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_instance_port_admin_state [189.813823s] ... ok
{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state [183.177830s] ... ok
Test id                                                                                                                                                                   Runtime (s)
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_reboot[compute,id-7b6860c2-afa3-4846-9522-adeb38dfbe08,network]    557.334
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_rebuild[compute,id-88a529c2-1daa-4c85-9aec-d541ba3eb699,network]   464.926

This run used the following timeout settings:

tempest.conf:
[validation]
ping_timeout = 7200

export OS_TEST_TIMEOUT=7200

Unless it was the extra API queries or pings that made things work, it looks like the very high timeouts themselves help in some cases. 9 minutes for a test that is supposed to finish in less than 2 minutes is a lot.
Checking whether this happens just because swift is hammering the controller nodes: https://bugzilla.redhat.com/show_bug.cgi?id=1387129
After killing the non-mandatory swift components, at least one test still takes a lot of time, for example:

Slowest Tests:

Test id                                                                                                                                                                               Runtime (s)
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_rebuild[compute,id-88a529c2-1daa-4c85-9aec-d541ba3eb699,network]               665.545
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state[compute,id-04b9fe4e-85e8-4aea-b937-ea93885ac59f,network]                                    187.484
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic[compute,id-c5adff73-e961-41f1-b4a9-343614f18cfa,network]                                                  157.486
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_reboot[compute,id-7b6860c2-afa3-4846-9522-adeb38dfbe08,network]                156.837
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_stop_start[compute,id-61f1aa9a-1573-410e-9054-afa557cab021,network,stress]     154.163
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_port_security_macspoofing_port[compute,id-7c0bb1a2-d053-49a4-98f9-ca1a1d849f63,network]                               147.403
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops[compute,id-f323b3ba-82f8-4db7-8ea6-6a895869ec49,network,smoke]                                      145.183
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_pause_unpause[compute,id-2b2642db-6568-4b35-b812-eceed3fa20ce,network]         137.101
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_suspend_resume[compute,id-5cdf9499-541d-4923-804e-b9a60620a7f0,network]        133.159
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_router_rescheduling[compute,id-2e788c46-fb3f-4ac9-8f82-0561555bea73,network]                                          113.859
Removing blocker as right now we think that this is a test only issue.
I have doubled the CPU resources given to the controller nodes and it did not help. I am also using newer versions of everything, so it is possible the tests are now failing for a different reason... It was an automated run, and the environment got destroyed; I have not tried to debug the issue in a `manual` setup so far. 8 vCPU / 20G memory controller VMs are definitely not overloaded on the 32-thread 2.4GHz host, so this cannot be a HW resource (CPU/io/memory) starvation issue. These tempest tests are run by ~10000 upstream gate jobs per week in parallel and have no known issue. AFAIK an L3 HA setup has not been intensively tested via tempest before. The bad part is that even if you `prove` the tempest test is wrong (unlikely), this RH OSP default setup can also be expected to fail any 3rd-party certification (compatibility test) process, because most of those tools are tempest based as well. It is important to be able to run these tests reliably in all cases. So please take this issue seriously until you can _prove_ your thesis. I know it is not easy to find the root cause for these unexpected delays, but you should not remove the blocker flag just because you are impatient (or your dice told you to).
Setting l3_ha=false and restarting the neutron services on the controllers helps. I wonder whether it could be some unfair (not first-come-first-served) task queue letting one user's requests starve?
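For reference, the workaround described above amounts to a one-line config change; this is a sketch (the file path assumes a default TripleO overcloud layout):

```ini
# /etc/neutron/neutron.conf on each controller -- workaround sketch, not a fix.
# Routers created after the neutron services restart will be non-HA;
# pre-existing HA routers keep their HA setup until recreated.
[DEFAULT]
l3_ha = False
```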
IPv6 setups, for an unknown reason, do not seem to be affected.
I did some troubleshooting today and tracked the issue down to the point where an iptables NAT rule does not get matched even though packets should match it. I have output from a successful and a failed run:

SUCCESS:

Chain PREROUTING (policy ACCEPT 2 packets, 402 bytes)
 pkts bytes target     prot opt in     out     source               destination
   42  2922 neutron-l3-agent-PREROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain INPUT (policy ACCEPT 34 packets, 2322 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    2    80 neutron-l3-agent-OUTPUT  all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain POSTROUTING (policy ACCEPT 8 packets, 600 bytes)
 pkts bytes target     prot opt in     out     source               destination
   10   680 neutron-l3-agent-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0
    8   600 neutron-postrouting-bottom  all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain neutron-l3-agent-OUTPUT (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DNAT       all  --  *      *       0.0.0.0/0            10.0.0.217           to:10.100.0.7

Chain neutron-l3-agent-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
    2    80 ACCEPT     all  --  !qg-97b7ef5d-1c !qg-97b7ef5d-1c  0.0.0.0/0            0.0.0.0/0            ! ctstate DNAT

Chain neutron-l3-agent-PREROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
   32  1920 REDIRECT   tcp  --  qr-+   *       0.0.0.0/0            169.254.169.254      tcp dpt:80 redir ports 9697
    8   600 DNAT       all  --  *      *       0.0.0.0/0            10.0.0.217           to:10.100.0.7

Chain neutron-l3-agent-float-snat (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 SNAT       all  --  *      *       10.100.0.7           0.0.0.0/0            to:10.0.0.217

Chain neutron-l3-agent-snat (1 references)
 pkts bytes target     prot opt in     out     source               destination
    8   600 neutron-l3-agent-float-snat  all  --  *      *       0.0.0.0/0            0.0.0.0/0
    0     0 SNAT       all  --  *      qg-97b7ef5d-1c  0.0.0.0/0            0.0.0.0/0            to:10.0.0.211
    0     0 SNAT       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x2/0xffff ctstate DNAT to:10.0.0.211

Chain neutron-postrouting-bottom (1 references)
 pkts bytes target     prot opt in     out     source               destination
    8   600 neutron-l3-agent-snat  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* Perform source NAT on outgoing traffic. */

FAILED:

Chain PREROUTING (policy ACCEPT 3 packets, 730 bytes)
 pkts bytes target     prot opt in     out     source               destination
   19  1690 neutron-l3-agent-PREROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain INPUT (policy ACCEPT 19 packets, 1690 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    2    80 neutron-l3-agent-OUTPUT  all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    2    80 neutron-l3-agent-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0
    0     0 neutron-postrouting-bottom  all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain neutron-l3-agent-OUTPUT (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DNAT       all  --  *      *       0.0.0.0/0            10.0.0.211           to:10.100.0.5

Chain neutron-l3-agent-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
    2    80 ACCEPT     all  --  !qg-e2553c2b-d5 !qg-e2553c2b-d5  0.0.0.0/0            0.0.0.0/0            ! ctstate DNAT

Chain neutron-l3-agent-PREROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
   16   960 REDIRECT   tcp  --  qr-+   *       0.0.0.0/0            169.254.169.254      tcp dpt:80 redir ports 9697
    0     0 DNAT       all  --  *      *       0.0.0.0/0            10.0.0.211           to:10.100.0.5

Chain neutron-l3-agent-float-snat (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 SNAT       all  --  *      *       10.100.0.5           0.0.0.0/0            to:10.0.0.211

Chain neutron-l3-agent-snat (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 neutron-l3-agent-float-snat  all  --  *      *       0.0.0.0/0            0.0.0.0/0
    0     0 SNAT       all  --  *      qg-e2553c2b-d5  0.0.0.0/0            0.0.0.0/0            to:10.0.0.214
    0     0 SNAT       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x2/0xffff ctstate DNAT to:10.0.0.214

Chain neutron-postrouting-bottom (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 neutron-l3-agent-snat  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* Perform source NAT on outgoing traffic. */

tcpdumps from both runs:

In the successful run, as FIPs are reused, I see ARPs sent unicast to the wrong MAC address fa:16:3e:51:5f:9d; later the router replies with the correct fa:16:3e:21:7b:c8 and ICMP starts passing:

2016-11-01 17:31:36.217230 qg-97b7ef5d-1c 52:54:00:36:7d:f6 > fa:16:3e:51:5f:9d, ethertype ARP (0x0806), length 42: Request who-has 10.0.0.217 tell 10.0.0.92, length 28
2016-11-01 17:31:37.232220 qg-97b7ef5d-1c 52:54:00:36:7d:f6 > fa:16:3e:51:5f:9d, ethertype IPv4 (0x0800), length 98: 10.0.0.92 > 10.0.0.217: ICMP echo request, id 8520, seq 2, length 64
2016-11-01 17:31:37.232220 qg-a473bf60-83 52:54:00:36:7d:f6 > fa:16:3e:51:5f:9d, ethertype IPv4 (0x0800), length 98: 10.0.0.92 > 10.0.0.217: ICMP echo request, id 8520, seq 2, length 64
2016-11-01 17:31:37.885429 tap9a2b52de-f7 fa:16:3e:72:37:04 > fa:16:3e:2c:1d:3c, ethertype ARP (0x0806), length 42: Request who-has 10.100.0.7 tell 10.100.0.2, length 28
2016-11-01 17:31:37.887657 tap9a2b52de-f7 fa:16:3e:2c:1d:3c > fa:16:3e:72:37:04, ethertype ARP (0x0806), length 42: Reply 10.100.0.7 is-at fa:16:3e:2c:1d:3c, length 28
2016-11-01 17:31:38.221431 qr-431ca732-04 fa:16:3e:56:4e:05 > fa:16:3e:2c:1d:3c, ethertype ARP (0x0806), length 42: Request who-has 10.100.0.7 tell 10.100.0.1, length 28
2016-11-01 17:31:38.224582 qr-431ca732-04 fa:16:3e:2c:1d:3c > fa:16:3e:56:4e:05, ethertype ARP (0x0806), length 42: Reply 10.100.0.7 is-at fa:16:3e:2c:1d:3c, length 28
2016-11-01 17:31:38.240401 qg-97b7ef5d-1c 52:54:00:36:7d:f6 > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 10.0.0.217 tell 10.0.0.92, length 28
2016-11-01 17:31:38.240454 qg-97b7ef5d-1c fa:16:3e:21:7b:c8 > 52:54:00:36:7d:f6, ethertype ARP (0x0806), length 42: Reply 10.0.0.217 is-at fa:16:3e:21:7b:c8, length 28
2016-11-01 17:31:38.240399 qg-a473bf60-83 52:54:00:36:7d:f6 > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 10.0.0.217 tell 10.0.0.92, length 28
2016-11-01 17:31:38.241854 qg-97b7ef5d-1c 52:54:00:36:7d:f6 > fa:16:3e:21:7b:c8, ethertype IPv4 (0x0800), length 98: 10.0.0.92 > 10.0.0.217: ICMP echo request, id 8524, seq 1, length 64
2016-11-01 17:31:38.241904 qr-431ca732-04 fa:16:3e:56:4e:05 > fa:16:3e:2c:1d:3c, ethertype IPv4 (0x0800), length 98: 10.0.0.92 > 10.100.0.7: ICMP echo request, id 8524, seq 1, length 64
2016-11-01 17:31:38.243078 qg-97b7ef5d-1c fa:16:3e:21:7b:c8 > 52:54:00:36:7d:f6, ethertype IPv4 (0x0800), length 98: 10.0.0.217 > 10.0.0.92: ICMP echo reply, id 8524, seq 1, length 64
2016-11-01 17:31:38.243050 qr-431ca732-04 fa:16:3e:2c:1d:3c > fa:16:3e:56:4e:05, ethertype IPv4 (0x0800), length 98: 10.100.0.7 > 10.0.0.92: ICMP echo reply, id 8524, seq 1, length 64

In the failed run, we don't see any traffic being sent to the 10.100.0.0/24 subnet.
The culprit was found and verified: keepalived doesn't send out GARPs when it receives SIGHUP. Due to a recent change in IPAM, FIPs are reused more frequently, which makes the machine running tempest send ICMP packets to the MAC address of the old FIP holder, since it never received a gratuitous ARP for the new one. After changing the L3 agent to send those gratuitous ARPs itself, the issue is gone. I'll send a patch upstream soon.
For the record: we need to make sure that only the agent running the master keepalived sends the GARP. For that, we probably need to monitor new VIPs on the qg interface and, once they are up, attempt to send a GARP (and be prepared for this to fail in case keepalived immediately falls back to standby). We also need to make sure that the GARP attempt fails once we are in standby; that should be guaranteed by the ip_nonlocal_bind sysctl knob being *off*.
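To illustrate what the agent needs to emit: a gratuitous ARP is an ARP request broadcast on the wire with both sender and target protocol address set to the advertised IP, so every host on the segment refreshes its cache entry. The sketch below builds such a 42-byte frame (the "length 42" seen in the tcpdump above) purely for illustration, using the FIP and router MAC from the capture; the real L3 agent sends GARPs via an external utility rather than hand-crafting frames.

```python
import struct

def build_garp(mac: bytes, ip: bytes) -> bytes:
    """Build a gratuitous ARP request frame for the given MAC/IPv4 pair."""
    assert len(mac) == 6 and len(ip) == 4
    # Ethernet header: broadcast destination, our MAC, EtherType 0x0806 (ARP)
    eth = b"\xff" * 6 + mac + struct.pack("!H", 0x0806)
    # ARP header: htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, op=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac + ip             # sender hardware + protocol address
    arp += b"\x00" * 6 + ip     # target hardware (ignored) + same protocol address
    return eth + arp

# FIP 10.0.0.217 advertised from router MAC fa:16:3e:21:7b:c8 (from the tcpdump)
frame = build_garp(bytes.fromhex("fa163e217bc8"), bytes([10, 0, 0, 217]))
print(len(frame))  # 42-byte frame, matching the ARP frames in the capture
```

Sending this on a standby node would require binding to an address the node does not hold, which is exactly what ip_nonlocal_bind=0 forbids, hence the guarantee mentioned above.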
I believe that keepalived is the one responsible for sending the GARPs - it doesn't make sense that keepalived switches to master -> notifies the python script -> goes through the l3 agent's queue -> has the l3 agent send GARPs. That's just waaay too many loops. Can we get Ryan O'Hara to comment on this, please?
John, keepalived does not notify a script. The -change script in neutron is managed by the agent (through the process manager). But I agree keepalived should handle GARPs itself. It actually does so on a clean start; just not on HUPs. That's the bug Jakub reported (bug 1391553).
(In reply to John Schwarz from comment #17) > I believe that keepalived is the one responsible for sending the garps - it > doesn't make sense that keepalived switches to master -> notifies the python > script -> goes through to the l3 agent to its' queue -> have the l3 agent > send garps. That's just waaay too many loops. > > Can we get Ryan Ohara to comment on this, please? A keepalived process will send gratuitous ARPs for the VIPs when it becomes master. If a keepalived process is already master for a particular VRID and you send the process a SIGHUP, there is no transition to master and therefore no gratuitous ARPs.
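For context, the keepalived instance the L3 agent spawns for an HA router looks roughly like the fragment below (interface names and addresses here are hypothetical, loosely modeled on the qg-/ha- devices in the dumps above). Gratuitous ARPs for the virtual addresses are emitted only on a BACKUP -> MASTER transition, which, as Ryan explains, a SIGHUP reload of an already-MASTER instance never produces:

```
vrrp_instance VR_1 {
    state BACKUP
    interface ha-1234abcd            # HA network port (hypothetical name)
    virtual_router_id 1
    priority 50
    nopreempt
    virtual_ipaddress {
        10.0.0.217/32 dev qg-97b7ef5d-1c   # FIP from the tcpdump above
    }
}
```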
According to the latest CI reports, executed on openstack-neutron-9.1.0-4.el7ost.src.rpm, there were no basic or advanced network test failures on those jobs.
Can we get this backported to Mitaka? Not necessarily a blocker, but pretty critical.
(In reply to David Paterson from comment #26) > Can we get this backported in mitaka? Not nec. a blocker but pretty > critical. Yes, I've cloned this to OSP 9 (Mitaka). https://bugzilla.redhat.com/show_bug.cgi?id=1397926 It was already in the works. We are currently waiting for the following upstream change to be merged in stable/mitaka, then we already have a RH gerrit for it to be backported into OSP 9. https://review.openstack.org/#/c/400348/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html