Bug 1539638 - neutron tempest scenario "test_from_legacy_to_ha" fails
Summary: neutron tempest scenario "test_from_legacy_to_ha" fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 12.0 (Pike)
Assignee: Slawek Kaplonski
QA Contact: Roee Agiman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-29 11:22 UTC by Arie Bregman
Modified: 2022-07-09 12:32 UTC
CC List: 10 users

Fixed In Version: openstack-neutron-11.0.4-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-05 18:53:19 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1785582 0 None None None 2018-08-06 09:18:28 UTC
OpenStack gerrit 584785 0 None None None 2018-07-23 10:15:44 UTC
OpenStack gerrit 589410 0 None None None 2018-08-07 09:40:00 UTC
OpenStack gerrit 589412 0 None None None 2018-08-07 09:48:32 UTC
OpenStack gerrit 589885 0 None None None 2018-08-08 13:08:26 UTC
Red Hat Issue Tracker OSP-17342 0 None None None 2022-07-09 12:32:07 UTC
Red Hat Product Errata RHSA-2018:3792 0 None None None 2018-12-05 18:53:35 UTC

Description Arie Bregman 2018-01-29 11:22:59 UTC
Description of problem:

The following test fails whenever RHOSP 12 is deployed with neutron-lbaas:

neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromLegacy.test_from_legacy_to_ha

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 68, in test_from_legacy_to_ha
    after_dvr=False, after_ha=True)
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 55, in _test_migration
    self._check_connectivity()
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_dvr.py", line 29, in _check_connectivity
    self.keypair['private_key'])
  File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/base.py", line 204, in check_connectivity
    ssh_client.test_connection_auth()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.224 via SSH timed out.
User: cirros, Password: None

This one might also be failing for the same reason:
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_suspend_resume


Version-Release number of selected component (if applicable): Latest RHOSP 12


How reproducible: 100%


Steps to Reproduce:
1. Deploy RHOSP 12 + LBaaS
2. Run Tempest API tests


Expected results: Tests pass successfully

Comment 2 anil venkata 2018-06-26 06:58:46 UTC
The job DFG-network-neutron-lbaas-12_director-rhel-virthost-3cont_2comp-ipv4-vxlan-dvr no longer exists in CI.

@Arie
Is it possible to set up this job and reproduce this issue?

Thanks
Anil

Comment 4 anil venkata 2018-06-26 08:04:05 UTC
Migration scenario tests are stable in upstream master. We need to backport any relevant changes to make them pass in
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-neutron-12_director-rhel-virthost-3cont_2comp-ipv4-vxlan-dvr/2//artifact/tempest-results/tempest-results-neutron.1.html

Comment 7 Slawek Kaplonski 2018-07-23 09:43:40 UTC
I will first try to backport the patches Anil pointed to into Pike and then see if that helps. If not, I will dig deeper into why it's failing...

Comment 8 Slawek Kaplonski 2018-07-23 10:20:49 UTC
https://review.openstack.org/#/c/507856/ - patch for Pike proposed
https://review.openstack.org/#/c/500379/ - already in stable/Pike
https://review.openstack.org/#/c/501717/ - already in stable/Pike
https://review.openstack.org/#/c/501366/ - already in stable/Pike
https://review.openstack.org/#/c/493501/ - already in stable/Pike
https://review.openstack.org/#/c/383827/ - already in stable/Pike

Comment 9 Slawek Kaplonski 2018-07-23 10:32:22 UTC
I also checked the OSP-12 repo; all patches that are already in stable/pike are also in the OSP-12 repo.

Comment 10 Slawek Kaplonski 2018-07-25 09:18:58 UTC
After backporting the last missing patch to OSP-12, I checked this CI job and the test is still failing there. See: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-neutron-12_director-rhel-virthost-3cont_2comp-ipv4-vxlan-dvr/lastSuccessfulBuild/artifact/tempest-results/tempest-results-neutron.1.html

After checking the logs from this job, it looks to me like, after the migration to HA, the router was transitioned to master on one node but the FIP wasn't configured on the router after the transition.
I will now deploy an OSP-12 env with 2 controllers and try to reproduce this issue manually.

Comment 11 Slawek Kaplonski 2018-07-25 15:04:48 UTC
I was wrong about those logs. The FIP is added by the keepalived process based on its config file, so there is nothing about that in the L3 agent's logs, as I checked locally.

I tried to reproduce the same migration process manually and it was fine every time, so I couldn't reproduce this issue in my local env.
I will have to connect to the CI node running this job and try to debug there...

Comment 12 Slawek Kaplonski 2018-07-30 12:24:24 UTC
I checked a few runs of those tests on OSP-12 and I also deployed it on my node with our Jenkins CI.
What I found is that neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromLegacy.test_from_legacy_to_ha is passing now (at least on my node deployed with Jenkins).

But I found that neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromHA.test_from_ha_to_legacy
is failing, and it is also failing in upstream stable/pike jobs, see e.g.: http://logs.openstack.org/80/586980/1/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/0dce998/logs/testr_results.html.gz


So I think we should now find the reason why this test is failing.

Comment 13 Slawek Kaplonski 2018-07-30 12:56:18 UTC
I just found that there is an error like this:

2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager [req-529f617b-f65b-40b5-82f0-fe7cad6ae4d7 6cd46c101f494126abbb658ff71103b7 fdf5da99bcfb4544bf380e215dcfd25c - default default] Error during notification for neutron.services.l3_router.l3_router_plugin.L3RouterPlugin._reconfigure_ha_resources-679142 router, after_update: StaleDataError: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched.
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager Traceback (most recent call last):
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron_lib/callbacks/manager.py", line 171, in _notify_loop
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     callback(resource, event, trigger, **kwargs)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/db/l3_hamode_db.py", line 491, in _reconfigure_ha_resources
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self.schedule_router(context, router_id)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/db/l3_agentschedulers_db.py", line 469, in schedule_router
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self, context, router, candidates=candidates)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/scheduler/l3_agent_scheduler.py", line 54, in schedule
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     plugin, context, router_id, candidates=candidates)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/scheduler/l3_agent_scheduler.py", line 233, in _schedule_router
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     if not plugin.router_supports_scheduling(context, router_id):
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/services/l3_router/l3_router_plugin.py", line 128, in router_supports_scheduling
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     return self.l3_driver_controller.uses_scheduler(context, router_id)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/services/l3_router/service_providers/driver_controller.py", line 195, in uses_scheduler
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     return (self.get_provider_for_router(context, router_id).
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/services/l3_router/service_providers/driver_controller.py", line 149, in get_provider_for_router
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     context, [router_id]).get(router_id)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/db/servicetype_db.py", line 73, in get_provider_names_by_resource_ids
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     context, resource_id=resource_ids)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib/python2.7/site-packages/neutron/objects/base.py", line 463, in get_objects
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     with context.session.begin(subtransactions=True):
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 824, in begin
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self, nested=nested)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 218, in __init__
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self._take_snapshot()
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 327, in _take_snapshot
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self.session.flush()
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2171, in flush
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     self._flush(objects)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2291, in _flush
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     transaction.rollback(_capture_exception=True)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     compat.reraise(exc_type, exc_value, exc_tb)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2255, in _flush
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     flush_context.execute()
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 389, in execute
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     rec.execute(self)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 548, in execute
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     uow
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 177, in save_obj
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     mapper, table, update)
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 760, in _emit_update_statements
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager     (table.description, len(records), rows))
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager StaleDataError: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched.
2018-07-30 12:55:19.443 289096 ERROR neutron_lib.callbacks.manager


during reconfiguration of the router's ports from an HA router to a legacy router.
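
For context, this StaleDataError comes from SQLAlchemy's optimistic version counter: the standardattributes row carries a revision_number that a concurrent writer bumps, so the pending UPDATE matches zero rows. Below is a minimal standalone sketch of that failure mode (hypothetical, simplified model, not the real neutron schema):

from sqlalchemy import Column, Integer, String, create_engine, text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm.exc import StaleDataError

Base = declarative_base()


class StandardAttr(Base):
    # Hypothetical stand-in for neutron's standardattributes table.
    __tablename__ = 'standardattributes'
    id = Column(Integer, primary_key=True)
    description = Column(String(255))
    revision_number = Column(Integer, nullable=False)
    # Optimistic locking: every ORM UPDATE adds "WHERE revision_number = <old>".
    __mapper_args__ = {'version_id_col': revision_number}


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

setup = Session()
setup.add(StandardAttr(id=1, description='router'))
setup.commit()

session = Session()
row = session.query(StandardAttr).get(1)
# Simulate a concurrent writer (e.g. the HA migration) bumping the revision
# behind this session's back:
session.execute(
    text("UPDATE standardattributes SET revision_number = revision_number + 1"))
row.description = 'updated by the scheduler'
try:
    # Flush emits: UPDATE ... WHERE id = 1 AND revision_number = <old revision>,
    # which now matches 0 rows -> same failure mode as in the traceback above.
    session.commit()
except StaleDataError as exc:
    print('StaleDataError: %s' % exc)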

Comment 15 Slawek Kaplonski 2018-07-30 13:01:45 UTC
I just found that this issue might be fixed by patch https://review.openstack.org/#/c/584785/, which is already proposed upstream for stable/pike.
I will test it now on my env.

Comment 16 Slawek Kaplonski 2018-07-30 15:10:13 UTC
Backport of https://review.openstack.org/#/c/584785/ proposed to OSP-12: https://code.engineering.redhat.com/gerrit/#/c/144683/ - let's wait and see how it goes with this one as well :)

Comment 17 Slawek Kaplonski 2018-07-31 08:11:43 UTC
So the patch which fixes neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromHA.test_from_ha_to_legacy is now merged to OSP-12, but in CI the test
neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromLegacy.test_from_legacy_to_ha
fails again.

I will try to reproduce it somehow on my node then.

Comment 18 Slawek Kaplonski 2018-08-02 11:57:20 UTC
I think I found what is breaking the connectivity there.
Unfortunately I'm not sure yet why it happens, and I could reproduce it only when I run all the migration tests together. If I run only this one test, it doesn't fail for me :/

OK, so what I found is:

The MAC address of the router's port on controller-0 (the active one) is fa:16:3e:2b:4c:ea

Controller-0 (active):

[root@controller-0 site-packages]# ip netns exec qrouter-0dd10372-9b0c-4995-81a0-b7c59d89fc23 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
966: ha-cafff8d4-6d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:46:99:dd brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.7/18 brd 169.254.255.255 scope global ha-cafff8d4-6d
       valid_lft forever preferred_lft forever
    inet 169.254.0.2/24 scope global ha-cafff8d4-6d
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe46:99dd/64 scope link
       valid_lft forever preferred_lft forever
967: qr-8c6cb474-bf: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:2b:4c:ea brd ff:ff:ff:ff:ff:ff
    inet 10.100.0.1/28 scope global qr-8c6cb474-bf
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe2b:4cea/64 scope link nodad
       valid_lft forever preferred_lft forever
968: qg-c04955f5-a3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:65:1a:87 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.236/32 scope global qg-c04955f5-a3
       valid_lft forever preferred_lft forever
    inet 10.0.0.237/24 scope global qg-c04955f5-a3
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe65:1a87/64 scope l


[root@controller-0 site-packages]# ovs-vsctl show
4ce8648e-793b-4f6b-b59e-307fa2443220
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-tun
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-tun
            Interface br-tun
                type: internal
        Port "vxlan-ac110211"
            Interface "vxlan-ac110211"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.18", out_key=flow, remote_ip="172.17.2.17"}
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "vxlan-ac110213"
            Interface "vxlan-ac110213"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.18", out_key=flow, remote_ip="172.17.2.19"}
        Port "vxlan-ac110210"
            Interface "vxlan-ac110210"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.18", out_key=flow, remote_ip="172.17.2.16"}


Compute-1 (the VM is on it)

[root@compute-1 heat-admin]# ovs-vsctl show
0a9ed74b-8353-489c-b06a-383e382daf5b
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-tun
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "vxlan-ac110210" (controller-1)
            Interface "vxlan-ac110210"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.17", out_key=flow, remote_ip="172.17.2.16"}
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "vxlan-ac110212" (controller-0)
            Interface "vxlan-ac110212"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.17", out_key=flow, remote_ip="172.17.2.18"}
        Port "vxlan-ac110213" (controller-2)
            Interface "vxlan-ac110213"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.17", out_key=flow, remote_ip="172.17.2.19"}


[root@compute-1 heat-admin]# ovs-ofctl show br-tun
OFPT_FEATURES_REPLY (xid=0x2): dpid:00003ec091751e4e
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(patch-int): addr:2e:f1:2f:10:09:c3
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 168(vxlan-ac110212): addr:c6:87:c4:c2:f9:95
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 169(vxlan-ac110213): addr:76:07:fe:09:1f:fc
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 170(vxlan-ac110210): addr:de:d6:ea:bb:a7:8e
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-tun): addr:3e:c0:91:75:1e:4e
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0



[root@compute-1 heat-admin]# ovs-ofctl dump-flows br-tun
 cookie=0x799fae05fa32071e, duration=177814.034s, table=0, n_packets=24105, n_bytes=1921372, priority=1,in_port="patch-int" actions=resubmit(,1)
 cookie=0x799fae05fa32071e, duration=2638.912s, table=0, n_packets=130, n_bytes=7718, priority=1,in_port="vxlan-ac110212" actions=resubmit(,4)
 cookie=0x799fae05fa32071e, duration=2638.871s, table=0, n_packets=102, n_bytes=11801, priority=1,in_port="vxlan-ac110213" actions=resubmit(,4)
 cookie=0x799fae05fa32071e, duration=2638.830s, table=0, n_packets=37, n_bytes=4928, priority=1,in_port="vxlan-ac110210" actions=resubmit(,4)
 cookie=0x799fae05fa32071e, duration=177814.635s, table=0, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x799fae05fa32071e, duration=177814.028s, table=1, n_packets=23625, n_bytes=1881309, priority=0 actions=resubmit(,2)
 cookie=0x799fae05fa32071e, duration=177814.633s, table=2, n_packets=10312, n_bytes=1040962, priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,20)
 cookie=0x799fae05fa32071e, duration=177814.628s, table=2, n_packets=13524, n_bytes=864716, priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,22)
 cookie=0x799fae05fa32071e, duration=177814.625s, table=3, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x799fae05fa32071e, duration=2645.661s, table=4, n_packets=223, n_bytes=18547, priority=1,tun_id=0x3e actions=mod_vlan_vid:117,resubmit(,9)
 cookie=0x799fae05fa32071e, duration=177814.622s, table=4, n_packets=634, n_bytes=59656, priority=0 actions=drop
 cookie=0x799fae05fa32071e, duration=177814.619s, table=6, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x799fae05fa32071e, duration=177813.875s, table=9, n_packets=62, n_bytes=11781, priority=1,dl_src=fa:16:3f:2c:d7:79 actions=output:"patch-int"
 cookie=0x799fae05fa32071e, duration=177745.764s, table=9, n_packets=28, n_bytes=2488, priority=1,dl_src=fa:16:3f:1c:85:07 actions=output:"patch-int"
 cookie=0x799fae05fa32071e, duration=177738.398s, table=9, n_packets=22, n_bytes=1640, priority=1,dl_src=fa:16:3f:67:8b:37 actions=output:"patch-int"
 cookie=0x799fae05fa32071e, duration=177690.087s, table=9, n_packets=24, n_bytes=2020, priority=1,dl_src=fa:16:3f:b4:9f:27 actions=output:"patch-int"
 cookie=0x799fae05fa32071e, duration=177814.031s, table=9, n_packets=10123, n_bytes=1215986, priority=0 actions=resubmit(,10)
 cookie=0x799fae05fa32071e, duration=177814.617s, table=10, n_packets=10123, n_bytes=1215986, priority=1 actions=learn(table=20,hard_timeout=300,priority=1,cookie=0x799fae05fa32071e,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:OXM_OF_IN_PORT[]),output:"patch-int"
 cookie=0x799fae05fa32071e, duration=2629.361s, table=20, n_packets=7, n_bytes=552, priority=2,dl_vlan=117,dl_dst=fa:16:3e:fb:7c:85 actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110212"
 cookie=0x799fae05fa32071e, duration=2629.326s, table=20, n_packets=7, n_bytes=552, priority=2,dl_vlan=117,dl_dst=fa:16:3e:76:a8:c0 actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110213"
 cookie=0x799fae05fa32071e, duration=2629.306s, table=20, n_packets=7, n_bytes=552, priority=2,dl_vlan=117,dl_dst=fa:16:3e:ce:87:fc actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110210"
 cookie=0x799fae05fa32071e, duration=2618.870s, table=20, n_packets=193, n_bytes=14808, priority=2,dl_vlan=117,dl_dst=fa:16:3e:2b:4c:ea actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110213"
 cookie=0x799fae05fa32071e, duration=177814.615s, table=20, n_packets=350, n_bytes=33044, priority=0 actions=resubmit(,22)
 cookie=0x799fae05fa32071e, duration=2433.527s, table=22, n_packets=12, n_bytes=1312, priority=1,dl_vlan=117 actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110212",output:"vxlan-ac110213",output:"vxlan-ac110210"
 cookie=0x799fae05fa32071e, duration=177814.613s, table=22, n_packets=12635, n_bytes=743918, priority=0 actions=drop



So this one rule should not be here:
 cookie=0x799fae05fa32071e, duration=2618.870s, table=20, n_packets=193, n_bytes=14808, priority=2,dl_vlan=117,dl_dst=fa:16:3e:2b:4c:ea actions=strip_vlan,load:0x3e->NXM_NX_TUN_ID[],output:"vxlan-ac110213"

It sends packets which should be sent to fa:16:3e:2b:4c:ea to controller-2 instead of controller-0, where this router is configured now.
Just for confirmation: I manually removed this one rule from br-tun and connectivity was fine.
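
For reference, a quick way to look for such stale entries is to filter the learned unicast flows in br-tun table 20 for the router port's MAC. A small, hypothetical helper along those lines (just a convenience wrapper around the same ovs-ofctl command used above, not part of neutron or tempest):

import subprocess


def fdb_flows_for_mac(mac, bridge='br-tun', table=20):
    """Return the learned unicast flows in `table` that match dl_dst=mac."""
    out = subprocess.check_output(
        ['ovs-ofctl', 'dump-flows', bridge, 'table=%d' % table],
        universal_newlines=True)
    return [line.strip() for line in out.splitlines()
            if 'dl_dst=%s' % mac in line]


# MAC of the router's qr- port from the output above:
for flow in fdb_flows_for_mac('fa:16:3e:2b:4c:ea'):
    print(flow)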

Comment 19 Slawek Kaplonski 2018-08-02 19:42:56 UTC
I checked a few times, and when I run only this one test with migration from legacy to HA, everything works fine; I did not see such fdb entries being added for this qr-XX port on the compute node.
It appears only when I run more tests in parallel, and I don't know why that is.
I will continue debugging this issue in the next days...

Comment 20 Slawek Kaplonski 2018-08-03 15:48:21 UTC
I was testing this today: if I run all tests one by one, or if I run only this one test, it passes every time.
When I run all tests with 4 workers, this one test fails very often.
I still don't know exactly why that is...

From log analysis it looks like, when everything is fine, a remove_fdb_entries RPC message is sent to the agents when the router is set to admin_state_up=False.
But when the test fails, there is no such message in the neutron-server log file, so the message is not sent, and because of that there is an issue with the fdb entries.

But I still don't know why it isn't sent properly.

Comment 21 Slawek Kaplonski 2018-08-03 19:34:35 UTC
So I did another test, and here is what I found (together with some of my additional log lines):

Router ID: b1da9aca-046a-4354-911d-6cf4832956be
Router's port: cb31cdca-810b-45c8-a8b4-177c7f1aed09

So when the legacy router is set to "DOWN", its port should also be set to DOWN. The OVS agent should then send an update_device_list RPC message to neutron-server, and that is indeed done in neutron-openvswitch-agent:

2018-08-03 17:46:37.718 633108 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-b4346f5d-cdbd-4f60-b6f0-42b994722205 - - - - -] Ports set([u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', u'f7a40997-efdc-405b-9bf9-5b35b4e1d604']) removed

And it is also visible in neutron-server logs:

2018-08-03 17:46:37.740 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; update_device_list; kwargs = {u'devices_up': [], u'host': u'controller
-1.localdomain', u'devices_down': [u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', u'f7a40997-efdc-405b-9bf9-5b35b4e1d604'], u'agent_id': u'ovs-agent-controller-1.localdomain'}
2018-08-03 17:46:37.741 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; devices_up = []
2018-08-03 17:46:37.741 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; devices_down = [u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', u'f7a40997-ef
dc-405b-9bf9-5b35b4e1d604']
2018-08-03 17:46:37.742 519815 DEBUG neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] Device cb31cdca-810b-45c8-a8b4-177c7f1aed09 no longer exists at agent ovs-
agent-controller-1.localdomain update_device_down /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:213

This is a log from the neutron.plugins.ml2.rpc.RpcCallbacks.update_device_list() method.

So the device is set to DOWN and neutron.plugins.ml2.rpc.RpcCallbacks.update_device_down() is called:

2018-08-03 17:46:39.101 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; PORT = <neutron.db.models_v2.Port[object at 7fce176756d0] {project_id=u'768d5f7a31184ee1aacd9552c63e5723', id=u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', name=u'', network_id=u'42500424-d630-4ab9-b6f0-9a35e5831639', mac_address=u'fa:16:3e:68:9a:d3', admin_state_up=True, status=u'DOWN', device_id=u'b1da9aca-046a-4354-911d-6cf4832956be', device_owner=u'network:ha_router_replicated_interface', ip_allocation=u'immediate', standard_attr_id=37601}>
2018-08-03 17:46:39.102 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; HOST = controller-1.localdomain
2018-08-03 17:46:39.103 519815 ERROR neutron.plugins.ml2.rpc [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; STATUS = DOWN


But then, in the l2_population update_port_down() method, it finds that this port is a "backup" port of an HA router, so it doesn't send the fdb entries to be removed:

2018-08-03 17:46:40.295 519815 ERROR neutron.plugins.ml2.drivers.l2pop.mech_driver [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; L2pop update_port_down; port = {'status': 'DOWN', 'binding:host_id': u'controller-1.localdomain', 'description': u'', 'allowed_address_pairs': [], 'tags': [], 'extra_dhcp_opts': [], 'updated_at': '2018-08-03T17:46:38Z', 'device_owner': u'network:ha_router_replicated_interface', 'revision_number': 10, 'port_security_enabled': False, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'9945eac0-0e0d-48b7-92eb-128f6d7b62c2', 'ip_address': u'10.100.0.1'}], 'id': u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', 'security_groups': [], 'device_id': u'b1da9aca-046a-4354-911d-6cf4832956be', 'name': u'', 'admin_state_up': True, 'network_id': u'42500424-d630-4ab9-b6f0-9a35e5831639', 'tenant_id': u'768d5f7a31184ee1aacd9552c63e5723', 'binding:vif_details': {u'port_filter': True, u'datapath_type': u'system', u'ovs_hybrid_plug': True}, 'binding:vnic_type': u'normal', 'binding:vif_type': u'ovs', 'qos_policy_id': None, 'mac_address': u'fa:16:3e:68:9a:d3', 'project_id': u'768d5f7a31184ee1aacd9552c63e5723', 'created_at': '2018-08-03T17:44:20Z'}
2018-08-03 17:46:41.649 519815 ERROR neutron.plugins.ml2.drivers.l2pop.mech_driver [req-831f2d85-c079-4b56-ae5e-e48705d478c5 - - - - -] SK; L3 plugin on controller-1.localdomain still have port {'status': 'DOWN', 'binding:host_id': u'controller-1.localdomain', 'description': u'', 'allowed_address_pairs': [], 'tags': [], 'extra_dhcp_opts': [], 'updated_at': '2018-08-03T17:46:38Z', 'device_owner': u'network:ha_router_replicated_interface', 'revision_number': 10, 'port_security_enabled': False, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'9945eac0-0e0d-48b7-92eb-128f6d7b62c2', 'ip_address': u'10.100.0.1'}], 'id': u'cb31cdca-810b-45c8-a8b4-177c7f1aed09', 'security_groups': [], 'device_id': u'b1da9aca-046a-4354-911d-6cf4832956be', 'name': u'', 'admin_state_up': True, 'network_id': u'42500424-d630-4ab9-b6f0-9a35e5831639', 'tenant_id': u'768d5f7a31184ee1aacd9552c63e5723', 'binding:vif_details': {u'port_filter': True, u'datapath_type': u'system', u'ovs_hybrid_plug': True}, 'binding:vnic_type': u'normal', 'binding:vif_type': u'ovs', 'qos_policy_id': None, 'mac_address': u'fa:16:3e:68:9a:d3', 'project_id': u'768d5f7a31184ee1aacd9552c63e5723', 'created_at': '2018-08-03T17:44:20Z'}; don't sent fdb_remove


That's because the router was transitioned to HA 10 seconds before this RPC finished:

2018-08-03 17:46:30.010 519796 DEBUG neutron.services.l3_router.service_providers.driver_controller [req-94f31138-4650-4612-ae92-6038cc367b03 07f9e30fd7a14a11806856ad5b60ec9b ab6138b7124d425c8956108d8d9aa268 - default default] Router b1da9aca-046a-4354-911d-6cf4832956be migrating from <neutron.services.l3_router.service_providers.single_node.SingleNodeDriver object at 0x7fce17fb8f10> provider to <neutron.services.l3_router.service_providers.ha.HaDriver object at 0x7fce17fb8cd0> provider. _update_router_provider /usr/lib/python2.7/site-packages/neutron/services/l3_router/service_providers/driver_controller.py:138
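
To make the race easier to follow, here is a simplified, hypothetical sketch of the decision the L2pop driver effectively ends up making (an illustration of the behaviour described above, not the actual neutron code):

HA_PORT_OWNER = 'network:ha_router_replicated_interface'


def should_send_fdb_remove(port, router_ids_still_hosted_on_port_host):
    """Decide whether agents are told to drop fdb entries for this port.

    `router_ids_still_hosted_on_port_host` stands in for whatever the plugin
    uses to check which routers the L3 agent on port['binding:host_id'] still
    hosts.
    """
    if port['device_owner'] != HA_PORT_OWNER:
        # A plain legacy router port going DOWN: fdb entries are removed on
        # all agents, so compute nodes forget the router's MAC.
        return True
    # In the failing run the legacy->HA migration has already happened by the
    # time the "port DOWN" RPC is processed, so the port now looks like a
    # backup HA port still hosted on this node. fdb_remove is skipped and the
    # compute node keeps the stale unicast flow for the router MAC.
    return port['device_id'] not in router_ids_still_hosted_on_port_host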




---- From tempest.log we can see that the router update to admin_state_up=False was done at:

2018-08-03 13:46:23.406 18722 INFO tempest.lib.common.rest_client [req-1808254d-e24c-4d53-a015-79bf743be2dd ] Request (NetworkMigrationFromLegacy:test_from_legacy_to_ha): 200 PUT http://10.0.0.104:9696/v2.0/routers/b1da9aca-046a-4354-911d-6cf4832956be 8.390s
2018-08-03 13:46:23.410 18722 DEBUG tempest.lib.common.rest_client [req-1808254d-e24c-4d53-a015-79bf743be2dd ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"router": {"external_gateway_info": {"network_id": "9f0fa67d-34b4-47e5-ba4e-3d3138bee8da"}, "name": "tempest-router-1000907614", "admin_state_up": false}}


And the router's transformation to HA at:

2018-08-03 13:47:04.391 18722 INFO tempest.lib.common.rest_client [req-94f31138-4650-4612-ae92-6038cc367b03 ] Request (NetworkMigrationFromLegacy:test_from_legacy_to_ha): 200 PUT http://10.0.0.104:9696/v2.0/routers/b1da9aca-046a-4354-911d-6cf4832956be 40.331s
2018-08-03 13:47:04.391 18722 DEBUG tempest.lib.common.rest_client [req-94f31138-4650-4612-ae92-6038cc367b03 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"router": {"external_gateway_info": {"network_id": "9f0fa67d-34b4-47e5-ba4e-3d3138bee8da"}, "ha": true, "distributed": false, "name": "tempest-router-1000907614", "admin_state_up": false}}

Comment 23 Slawek Kaplonski 2018-08-06 07:51:33 UTC
To confirm that the issue is caused by this race between setting the router to DOWN and the migration to HA, I added (in the test code) a wait until the router's port is really DOWN before migrating it to HA.
I ran such modified migration tests over the whole weekend and they passed 151 times. So now I'm sure this is the issue, and I will need to figure out how to fix it :)
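
A minimal sketch of the kind of wait I added in the test code (the client call and names here are illustrative only, not the exact patch):

import time


def wait_until_router_ports_down(client, router_id, timeout=60, interval=2):
    """Poll until all ports of the router report status DOWN, or time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        ports = client.list_ports(device_id=router_id)['ports']
        if all(port['status'] == 'DOWN' for port in ports):
            return
        time.sleep(interval)
    raise RuntimeError(
        'Ports of router %s still not DOWN after %s seconds' %
        (router_id, timeout))


# Called right after setting admin_state_up=False and before issuing the PUT
# that sets "ha": true on the router.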

Comment 24 Slawek Kaplonski 2018-08-06 09:18:29 UTC
I also reported this bug upstream.

Comment 25 Slawek Kaplonski 2018-08-08 07:00:06 UTC
As can be seen at https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-neutron-12_director-rhel-virthost-3cont_2comp-ipv4-vxlan-dvr/26//artifact/tempest-results/tempest-results-neutron.1.html, my last patch, https://code.engineering.redhat.com/gerrit/#/c/146252/2, fixed the issue with the migration tests.
I will now have to find a way to solve this issue in the L2pop mech_driver, but this "test patch" should at least unblock CI.

Comment 26 Slawek Kaplonski 2018-08-08 13:09:15 UTC
A patch with the fix is now proposed upstream: https://review.openstack.org/589885

Comment 27 Slawek Kaplonski 2018-08-13 11:48:54 UTC
The fix for the failing test is merged to OSP-12 (it's a test-only fix: https://code.engineering.redhat.com/gerrit/#/c/146252/) and is included in openstack-neutron-11.0.4-4.el7ost.
There will also be a "real" fix for this issue in L2pop: https://review.openstack.org/#/c/589885/ but for now I think the issue can be set to MODIFIED, as it should not impact CI systems.

Comment 38 errata-xmlrpc 2018-12-05 18:53:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3792

