Description of problem:

The following test fails whenever RHOSP 12 is deployed with neutron-lbaas:

neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromLegacy.test_from_legacy_to_ha

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 68, in test_from_legacy_to_ha
    after_dvr=False, after_ha=True)
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 55, in _test_migration
    self._check_connectivity()
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_dvr.py", line 29, in _check_connectivity
    self.keypair['private_key'])
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/base.py", line 204, in check_connectivity
    ssh_client.test_connection_auth()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.224 via SSH timed out. User: cirros, Password: None

This one might also be failing for the same reason:
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_suspend_resume

Version-Release number of selected component (if applicable):
Latest RHOSP 12

How reproducible:
100%

Steps to Reproduce:
1. Deploy RHOSP 12 + LBaaS
2. Run the Tempest API tests

Expected results:
Tests pass successfully
The job DFG-network-neutron-lbaas-12_director-rhel-virthost-3cont_2comp-ipv4-vxlan-dvr does not currently exist in CI. @Arie Is it possible to set up this job and reproduce this issue? Thanks, Anil
The migration scenario tests are stable in upstream master. We need to backport any relevant changes to make them pass in https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-neutron-12_director-rhel-virthost-3cont_2comp-ipv4-vxlan-dvr/2//artifact/tempest-results/tempest-results-neutron.1.html
We may need to backport these patches:
https://review.openstack.org/#/c/507856/
https://review.openstack.org/#/c/500379/
https://review.openstack.org/#/c/501717/
https://review.openstack.org/#/c/501366/
https://review.openstack.org/#/c/493501/
https://review.openstack.org/#/c/383827/
I will first try to backport the patches Anil pointed to into Pike and then see whether it helps. If not, I will dig deeper into why it's failing...
https://review.openstack.org/#/c/507856/ - patch proposed for Pike
https://review.openstack.org/#/c/500379/ - already in stable/pike
https://review.openstack.org/#/c/501717/ - already in stable/pike
https://review.openstack.org/#/c/501366/ - already in stable/pike
https://review.openstack.org/#/c/493501/ - already in stable/pike
https://review.openstack.org/#/c/383827/ - already in stable/pike
I also checked the OSP-12 repo: all patches that are already in stable/pike are in the OSP-12 repo as well.
After backporting the last missing patch to OSP-12 I checked this CI job and the test is still failing there. See: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-neutron-12_director-rhel-virthost-3cont_2comp-ipv4-vxlan-dvr/lastSuccessfulBuild/artifact/tempest-results/tempest-results-neutron.1.html

After checking the logs from this job, it looks to me like, after the migration to HA, the router transitioned to master on one node but the FIP was not configured on the router after the transition. I will now deploy an OSP-12 environment with 2 controllers and try to reproduce this issue manually.
I was wrong about those logs. The FIP is added by the keepalived process based on its config file, so there is nothing about it in the L3 agent's logs, as I verified locally. I tried to reproduce the same migration process manually and it worked fine every time, so I couldn't reproduce this issue on my local env. I will have to connect to a CI node running this job and try to debug there...
I checked a few runs of those tests on OSP-12 and also deployed it on my node with our Jenkins CI. What I found is that neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromLegacy.test_from_legacy_to_ha is passing now (at least on my node deployed with Jenkins), but neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromHA.test_from_ha_to_legacy is failing. It is also failing in upstream stable/pike jobs, see e.g.: http://logs.openstack.org/80/586980/1/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/0dce998/logs/testr_results.html.gz

So I think we should now find the reason why this test is failing.
I just found that there is an error like the one below while the router's ports are being reconfigured from an HA router to a legacy router:

2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager [req-529f617b-f65b-40b5-82f0-fe7cad6ae4d7 6cd46c101f494126abbb658ff71103b7 fdf5da99bcfb4544bf380e215dcfd25c - default default] Error during notification for neutron.services.l3_router.l3_router_plugin.L3RouterPlugin._reconfigure_ha_resources-679142 router, after_update: StaleDataError: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched.
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager Traceback (most recent call last):
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron_lib/callbacks/manager.py", line 171, in _notify_loop
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     callback(resource, event, trigger, **kwargs)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/db/l3_hamode_db.py", line 491, in _reconfigure_ha_resources
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self.schedule_router(context, router_id)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/db/l3_agentschedulers_db.py", line 469, in schedule_router
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self, context, router, candidates=candidates)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/scheduler/l3_agent_scheduler.py", line 54, in schedule
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     plugin, context, router_id, candidates=candidates)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/scheduler/l3_agent_scheduler.py", line 233, in _schedule_router
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     if not plugin.router_supports_scheduling(context, router_id):
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/services/l3_router/l3_router_plugin.py", line 128, in router_supports_scheduling
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     return self.l3_driver_controller.uses_scheduler(context, router_id)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/services/l3_router/service_providers/driver_controller.py", line 195, in uses_scheduler
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     return (self.get_provider_for_router(context, router_id).
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/services/l3_router/service_providers/driver_controller.py", line 149, in get_provider_for_router
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     context, [router_id]).get(router_id)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/db/servicetype_db.py", line 73, in get_provider_names_by_resource_ids
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     context, resource_id=resource_ids)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/objects/base.py", line 463, in get_objects
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     with context.session.begin(subtransactions=True):
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 824, in begin
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self, nested=nested)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 218, in __init__
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self._take_snapshot()
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 327, in _take_snapshot
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self.session.flush()
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2171, in flush
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self._flush(objects)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2291, in _flush
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     transaction.rollback(_capture_exception=True)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     compat.reraise(exc_type, exc_value, exc_tb)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2255, in _flush
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     flush_context.execute()
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 389, in execute
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     rec.execute(self)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 548, in execute
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     uow
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 177, in save_obj
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     mapper, table, update)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 760, in _emit_update_statements
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     (table.description, len(records), rows))
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager StaleDataError: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched.
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager
I just found that this issue might be fixed by patch https://review.openstack.org/#/c/584785/ which is already proposed for stable/pike upstream. I will test it now on my env.
Backport of https://review.openstack.org/#/c/584785/ proposed to OSP-12: https://code.engineering.redhat.com/gerrit/#/c/144683/ - let's wait and see how it goes with this one as well :)
So the patch that fixes neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromHA.test_from_ha_to_legacy is now merged to OSP-12, but in CI the test neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromLegacy.test_from_legacy_to_ha fails again. I will try to reproduce it somehow on my node.
I think I found what is breaking the connectivity there. Unfortunately I'm not sure yet why it happens, and I could reproduce it only when I run all migration tests together; if I run only this one test, it doesn't fail for me :/

OK, so what I found is:

The MAC address of the router's port on controller-0 (the active one) is fa:16:3e:2b:4c:ea

Controller-0 (active):

[root@controller-0 site-packages]# ip netns exec qrouter-0dd10372-9b0c-4995-81a0-b7c59d89fc23 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
966: ha-cafff8d4-6d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:46:99:dd brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.7/18 brd 169.254.255.255 scope global ha-cafff8d4-6d
       valid_lft forever preferred_lft forever
    inet 169.254.0.2/24 scope global ha-cafff8d4-6d
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe46:99dd/64 scope link
       valid_lft forever preferred_lft forever
967: qr-8c6cb474-bf: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:2b:4c:ea brd ff:ff:ff:ff:ff:ff
    inet 10.100.0.1/28 scope global qr-8c6cb474-bf
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe2b:4cea/64 scope link nodad
       valid_lft forever preferred_lft forever
968: qg-c04955f5-a3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:65:1a:87 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.236/32 scope global qg-c04955f5-a3
       valid_lft forever preferred_lft forever
    inet 10.0.0.237/24 scope global qg-c04955f5-a3
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe65:1a87/64 scope l

[root@controller-0 site-packages]# ovs-vsctl show
4ce8648e-793b-4f6b-b59e-307fa2443220
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-tun
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-tun
            Interface br-tun
                type: internal
        Port "vxlan-ac110211"
            Interface "vxlan-ac110211"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.18", out_key=flow, remote_ip="172.17.2.17"}
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "vxlan-ac110213"
            Interface "vxlan-ac110213"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.18", out_key=flow, remote_ip="172.17.2.19"}
        Port "vxlan-ac110210"
            Interface "vxlan-ac110210"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.18", out_key=flow, remote_ip="172.17.2.16"}

Compute-1 (the VM is on it):

[root@compute-1 heat-admin]# ovs-vsctl show
0a9ed74b-8353-489c-b06a-383e382daf5b
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-tun
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "vxlan-ac110210" (controller-1)
            Interface "vxlan-ac110210"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.17", out_key=flow, remote_ip="172.17.2.16"}
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "vxlan-ac110212" (controller-0)
            Interface "vxlan-ac110212"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.17", out_key=flow, remote_ip="172.17.2.18"}
        Port "vxlan-ac110213" (controller-2)
            Interface "vxlan-ac110213"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.17", out_key=flow, remote_ip="172.17.2.19"}

[root@compute-1 heat-admin]# ovs-ofctl show br-tun
OFPT_FEATURES_REPLY (xid=0x2): dpid:00003ec091751e4e
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(patch-int): addr:2e:f1:2f:10:09:c3
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 168(vxlan-ac110212): addr:c6:87:c4:c2:f9:95
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 169(vxlan-ac110213): addr:76:07:fe:09:1f:fc
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 170(vxlan-ac110210): addr:de:d6:ea:bb:a7:8e
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-tun): addr:3e:c0:91:75:1e:4e
     config: PORT_DOWN
     state: LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

[root@compute-1 heat-admin]# ovs-ofctl dump-flows br-tun
 cookie=0x799fae05fa32071e, duration=177814.034s, table=0, n_packets=24105, n_bytes=1921372, priority=1,in_port="patch-int" actions=resubmit(,1)
 cookie=0x799fae05fa32071e, duration=2638.912s, table=0, n_packets=130, n_bytes=7718, priority=1,in_port="vxlan-ac110212" actions=resubmit(,4)
 cookie=0x799fae05fa32071e, duration=2638.871s, table=0, n_packets=102, n_bytes=11801, priority=1,in_port="vxlan-ac110213" actions=resubmit(,4)
 cookie=0x799fae05fa32071e, duration=2638.830s, table=0, n_packets=37, n_bytes=4928, priority=1,in_port="vxlan-ac110210" actions=resubmit(,4)
 cookie=0x799fae05fa32071e, duration=177814.635s, table=0, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x799fae05fa32071e, duration=177814.028s, table=1, n_packets=23625, n_bytes=1881309, priority=0 actions=resubmit(,2)
 cookie=0x799fae05fa32071e, duration=177814.633s, table=2, n_packets=10312, n_bytes=1040962, priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,20)
 cookie=0x799fae05fa32071e, duration=177814.628s, table=2, n_packets=13524, n_bytes=864716, priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,22)
 cookie=0x799fae05fa32071e, duration=177814.625s, table=3, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x799fae05fa32071e, duration=2645.661s, table=4, n_packets=223, n_bytes=18547, priority=1,tun_id=0x3e actions=mod_vlan_vid:117,resubmit(,9)
 cookie=0x799fae05fa32071e, duration=177814.622s, table=4, n_packets=634, n_bytes=59656, priority=0 actions=drop
 cookie=0x799fae05fa32071e, duration=177814.619s, table=6, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x799fae05fa32071e, duration=177813.875s, table=9, n_packets=62, n_bytes=11781, priority=1,dl_src=fa:16:3f:2c:d7:79 actions=output:"patch-int"
 cookie=0x799fae05fa32071e, duration=177745.764s, table=9, n_packets=28, n_bytes=2488, priority=1,dl_src=fa:16:3f:1c:85:07 actions=output:"patch-int"
 cookie=0x799fae05fa32071e, duration=177738.398s, table=9, n_packets=22, n_bytes=1640, priority=1,dl_src=fa:16:3f:67:8b:37 actions=output:"patch-int"
 cookie=0x799fae05fa32071e, duration=177690.087s, table=9, n_packets=24, n_bytes=2020, priority=1,dl_src=fa:16:3f:b4:9f:27 actions=output:"patch-int"
 cookie=0x799fae05fa32071e, duration=177814.031s, table=9, n_packets=10123, n_bytes=1215986, priority=0 actions=resubmit(,10)
 cookie=0x799fae05fa32071e, duration=177814.617s, table=10, n_packets=10123, n_bytes=1215986, priority=1 actions=learn(table=20,hard_timeout=300,priority=1,cookie=0x799fae05fa32071e,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:OXM_OF_IN_PORT[]),output:"patch-int"
 cookie=0x799fae05fa32071e, duration=2629.361s, table=20, n_packets=7, n_bytes=552, priority=2,dl_vlan=117,dl_dst=fa:16:3e:fb:7c:85 actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110212"
 cookie=0x799fae05fa32071e, duration=2629.326s, table=20, n_packets=7, n_bytes=552, priority=2,dl_vlan=117,dl_dst=fa:16:3e:76:a8:c0 actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110213"
 cookie=0x799fae05fa32071e, duration=2629.306s, table=20, n_packets=7, n_bytes=552, priority=2,dl_vlan=117,dl_dst=fa:16:3e:ce:87:fc actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110210"
 cookie=0x799fae05fa32071e, duration=2618.870s, table=20, n_packets=193, n_bytes=14808, priority=2,dl_vlan=117,dl_dst=fa:16:3e:2b:4c:ea actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110213"
 cookie=0x799fae05fa32071e, duration=177814.615s, table=20, n_packets=350, n_bytes=33044, priority=0 actions=resubmit(,22)
 cookie=0x799fae05fa32071e, duration=2433.527s, table=22, n_packets=12, n_bytes=1312, priority=1,dl_vlan=117 actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110212",output:"vxlan-ac110213",output:"vxlan-ac110210"
 cookie=0x799fae05fa32071e, duration=177814.613s, table=22, n_packets=12635, n_bytes=743918, priority=0 actions=drop

So this one rule should not be here:

 cookie=0x799fae05fa32071e, duration=2618.870s, table=20, n_packets=193, n_bytes=14808, priority=2,dl_vlan=117,dl_dst=fa:16:3e:2b:4c:ea actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110213"

It sends packets that should go to fa:16:3e:2b:4c:ea to controller-2 instead of controller-0, where this router is configured now. Just for confirmation: I manually removed this one rule from br-tun and the connection was fine.
I checked a few times, and when I run only this one test with migration from legacy to HA, everything works fine and I didn't see such fdb entries being added for this qr-XX port on the compute node. It happens only when I run more tests in parallel, and I don't know yet why that is. I will continue debugging this issue in the next days...
I was testing this today: if I run all tests one by one, or only this one test, it passes every time. When I run all tests with 4 workers, this one test fails very often. I still don't know exactly why...

From log analysis it looks like, when everything is fine, a remove_fdb_entries RPC message is sent to the agents when the router is set to admin_state_up=False. But when the test fails, this information is missing from the neutron-server log file, so the message is not sent and that is why there is an issue with the fdb entries. I still don't know why it isn't sent properly.
So I did another test; here is what I found (together with some of my additional log lines):

Router ID: b1da9aca-046a-4354-911d-6cf4832956be
Router's port: cb31cdca-810b-45c8-a8b4-177c7f1aed09

When the legacy router is set to "DOWN", its port should also be set to DOWN. The OVS agent should then send an update_device_list RPC message to neutron-server, and that really is done in neutron-openvswitch-agent:

2018-08-03 17:46:37.718 633108 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-b4346f5d-cdbd-4f60-b6f0-42b994722205 - - - - -] Ports set([u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', u'f7a40997-efdc-405b-9bf9-5b35b4e1d604']) removed

It is also visible in the neutron-server logs:

2018-08-03 17:46:37.740 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; update_device_list; kwargs = {u'devices_up': [], u'host': u'controller-1.localdomain', u'devices_down': [u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', u'f7a40997-efdc-405b-9bf9-5b35b4e1d604'], u'agent_id': u'ovs-agent-controller-1.localdomain'}
2018-08-03 17:46:37.741 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; devices_up = []
2018-08-03 17:46:37.741 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; devices_down = [u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', u'f7a40997-efdc-405b-9bf9-5b35b4e1d604']
2018-08-03 17:46:37.742 519815 DEBUG neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] Device cb31cdca-810b-45c8-a8b4-177c7f1aed09 no longer exists at agent ovs-agent-controller-1.localdomain update_device_down /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:213

These log lines come from the neutron.plugins.ml2.rpc.RpcCallbacks.update_device_list() method.
So the device is set to DOWN and neutron.plugins.ml2.rpc.RpcCallbacks.update_device_down() is called:

2018-08-03 17:46:39.101 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; PORT = <neutron.db.models_v2.Port[object at 7fce176756d0] {project_id=u'768d5f7a31184ee1aacd9552c63e5723', id=u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', name=u'', network_id=u'42500424-d630-4ab9-b6f0-9a35e5831639', mac_address=u'fa:16:3e:68:9a:d3', admin_state_up=True, status=u'DOWN', device_id=u'b1da9aca-046a-4354-911d-6cf4832956be', device_owner=u'network:ha_router_replicated_interface', ip_allocation=u'immediate', standard_attr_id=37601}>
2018-08-03 17:46:39.102 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; HOST = controller-1.localdomain
2018-08-03 17:46:39.103 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; STATUS = DOWN

But then the l2_population update_port_down() method determines that this port is a "backup" port of an HA router, so it doesn't send the fdb entries to be removed:

2018-08-03 17:46:40.295 519815 ERROR neutron.plugins.ml2.drivers.l2pop.mech_driver [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; L2pop update_port_down; port = {'status': 'DOWN', 'binding:host_id': u'controller-1.localdomain', 'description': u'', 'allowed_address_pairs': [], 'tags': [], 'extra_dhcp_opts': [], 'updated_at': '2018-08-03T17:46:38Z', 'device_owner': u'network:ha_router_replicated_interface', 'revision_number': 10, 'port_security_enabled': False, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'9945eac0-0e0d-48b7-92eb-128f6d7b62c2', 'ip_address': u'10.100.0.1'}], 'id': u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', 'security_groups': [], 'device_id': u'b1da9aca-046a-4354-911d-6cf4832956be', 'name': u'', 'admin_state_up': True, 'network_id': u'42500424-d630-4ab9-b6f0-9a35e5831639', 'tenant_id': u'768d5f7a31184ee1aacd9552c63e5723', 'binding:vif_details': {u'port_filter': True, u'datapath_type': u'system', u'ovs_hybrid_plug': True}, 'binding:vnic_type': u'normal', 'binding:vif_type': u'ovs', 'qos_policy_id': None, 'mac_address': u'fa:16:3e:68:9a:d3', 'project_id': u'768d5f7a31184ee1aacd9552c63e5723', 'created_at': '2018-08-03T17:44:20Z'}
2018-08-03 17:46:41.649 519815 ERROR neutron.plugins.ml2.drivers.l2pop.mech_driver [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; L3 plugin on controller-1.localdomain still have port {'status': 'DOWN', 'binding:host_id': u'controller-1.localdomain', 'description': u'', 'allowed_address_pairs': [], 'tags': [], 'extra_dhcp_opts': [], 'updated_at': '2018-08-03T17:46:38Z', 'device_owner': u'network:ha_router_replicated_interface', 'revision_number': 10, 'port_security_enabled': False, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'9945eac0-0e0d-48b7-92eb-128f6d7b62c2', 'ip_address': u'10.100.0.1'}], 'id': u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', 'security_groups': [], 'device_id': u'b1da9aca-046a-4354-911d-6cf4832956be', 'name': u'', 'admin_state_up': True, 'network_id': u'42500424-d630-4ab9-b6f0-9a35e5831639', 'tenant_id': u'768d5f7a31184ee1aacd9552c63e5723', 'binding:vif_details': {u'port_filter': True, u'datapath_type': u'system', u'ovs_hybrid_plug': True}, 'binding:vnic_type': u'normal', 'binding:vif_type': u'ovs', 'qos_policy_id': None, 'mac_address': u'fa:16:3e:68:9a:d3', 'project_id': u'768d5f7a31184ee1aacd9552c63e5723', 'created_at': '2018-08-03T17:44:20Z'}; don't sent fdb_remove

This happens because the router was transitioned to HA 10 seconds before this RPC was finished:

2018-08-03 17:46:30.010 519796 DEBUG neutron.services.l3_router.service_providers.driver_controller [req-94f31138-4650-4612-ae92-6038cc367b03 07f9e30fd7a14a11806856ad5b60ec9b ab6138b7124d425c8956108d8d9aa268 - default default] Router b1da9aca-046a-4354-911d-6cf4832956be migrating from <neutron.services.l3_router.service_providers.single_node.SingleNodeDriver object at 0x7fce17fb8f10> provider to <neutron.services.l3_router.service_providers.ha.HaDriver object at 0x7fce17fb8cd0> provider. _update_router_provider /usr/lib/python2.7/site-packages/neutron/services/l3_router/service_providers/driver_controller.py:138

----

From tempest.log we can see that the router update to admin_state_up=False was done at:

2018-08-03 13:46:23.406 18722 INFO tempest.lib.common.rest_client [req-1808254d-e24c-4d53-a015-79bf743be2dd ] Request (NetworkMigrationFromLegacy:test_from_legacy_to_ha): 200 PUT http://10.0.0.104:9696/v2.0/routers/b1da9aca-046a-4354-911d-6cf4832956be 8.390s
2018-08-03 13:46:23.410 18722 DEBUG tempest.lib.common.rest_client [req-1808254d-e24c-4d53-a015-79bf743be2dd ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'} Body: {"router": {"external_gateway_info": {"network_id": "9f0fa67d-34b4-47e5-ba4e-3d3138bee8da"}, "name": "tempest-router-1000907614", "admin_state_up": false}}

And the migration of the router to HA at:

2018-08-03 13:47:04.391 18722 INFO tempest.lib.common.rest_client [req-94f31138-4650-4612-ae92-6038cc367b03 ] Request (NetworkMigrationFromLegacy:test_from_legacy_to_ha): 200 PUT http://10.0.0.104:9696/v2.0/routers/b1da9aca-046a-4354-911d-6cf4832956be 40.331s
2018-08-03 13:47:04.391 18722 DEBUG tempest.lib.common.rest_client [req-94f31138-4650-4612-ae92-6038cc367b03 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'} Body: {"router": {"external_gateway_info": {"network_id": "9f0fa67d-34b4-47e5-ba4e-3d3138bee8da"}, "ha": true, "distributed": false, "name": "tempest-router-1000907614", "admin_state_up": false}}
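To make the skipped notification easier to follow, here is a simplified, self-contained paraphrase of the decision the l2pop mechanism driver appears to make in update_port_down(), reconstructed from the "don't sent fdb_remove" log lines above rather than copied from the Pike source; the function name and signature below are mine, for illustration only, and the exact L3 plugin API it consults is an assumption:

# Hypothetical helper paraphrasing the l2pop update_port_down() decision
# (not the actual neutron code): if the L3 plugin still reports the port's
# router as hosted on the agent's host, the port is treated as an HA backup
# port and no remove_fdb_entries RPC is sent to the other agents.
def should_send_fdb_remove(port, agent_host, l3plugin, admin_context):
    router_id = port['device_id']
    if agent_host and l3plugin is not None:
        # Assumed plugin call; the real driver asks the L3 plugin whether
        # the router is still scheduled to this host.
        hosted_routers = l3plugin.list_router_ids_on_host(
            admin_context, agent_host, [router_id])
        if hosted_routers:
            # The router has just been migrated to the HA provider, so it is
            # still (or again) scheduled to this host -- the learned fdb
            # entries are kept, which is exactly the race hit here.
            return False
    return True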
To confirm that the issue is caused by this race between setting the router to DOWN and the migration to HA, I added (in the test code) a wait until the router's port is really DOWN before migrating it to HA. I ran the modified migration tests over the whole weekend and they passed 151 times. So now I'm sure this is the issue here, and I will need to figure out how to fix it :)
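For illustration, a minimal sketch of that kind of wait, assuming a tempest-style network client that exposes list_ports() and update_router(); the helper name, timeout values and client attribute are illustrative, not the actual test patch:

import time


def wait_for_router_ports_down(client, router_id, timeout=120, interval=5):
    """Poll until every port owned by the router reports status DOWN."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        ports = client.list_ports(device_id=router_id)['ports']
        if ports and all(p['status'] == 'DOWN' for p in ports):
            return
        time.sleep(interval)
    raise RuntimeError('Ports of router %s did not go DOWN within %s s'
                       % (router_id, timeout))


# In the migration test, between the two router updates (illustrative usage):
#   self.client.update_router(router_id, admin_state_up=False)
#   wait_for_router_ports_down(self.client, router_id)
#   self.client.update_router(router_id, ha=True, distributed=False)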
I also reported this bug upstream.
As can be seen at https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-neutron-12_director-rhel-virthost-3cont_2comp-ipv4-vxlan-dvr/26//artifact/tempest-results/tempest-results-neutron.1.html my last patch https://code.engineering.redhat.com/gerrit/#/c/146252/2 fixed the issue with the migration tests. I still have to find a way to solve this issue in the L2pop mech_driver, but this "test patch" should at least unblock CI.
There is now a patch with the fix proposed upstream: https://review.openstack.org/589885
The fix for the failing test is merged to OSP-12 (it's a test-only fix: https://code.engineering.redhat.com/gerrit/#/c/146252/) and is included in openstack-neutron-11.0.4-4.el7ost.
There will also be a "real" fix for this issue in L2pop: https://review.openstack.org/#/c/589885/ but for now I think the issue can be set to MODIFIED, as it should not impact CI systems.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:3792