Bug 1556686

Summary: [Driver] resource leaks after running tempest scenario tests
Product: Red Hat OpenStack
Reporter: jamo luhrsen <jluhrsen>
Component: python-networking-odl
Assignee: Mike Kolesnik <mkolesni>
Status: CLOSED DUPLICATE
QA Contact: Itzik Brown <itbrown>
Severity: low
Priority: medium
Version: 13.0 (Queens)
CC: jluhrsen, mkolesni, nyechiel, wznoinsk
Target Milestone: z1
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard: odl_driver
Environment: N/A
Last Closed: 2018-05-21 12:11:52 UTC
Type: Bug
Attachments:
 - dhcp-agent
 - opendaylight karaf.log
 - neutron server log
 - 4/11 repro karaf.log
 - 4/11 repro server.log
 - 20180513_043751-test_delete_policies_while_tenant_attached_to_net-sosreport-controller-0
 - 20180513_043751-test_delete_policies_while_tenant_attached_to_net-odltools-undercloud-0.redhat.local
 - 20180513_043751-test_delete_policies_while_tenant_attached_to_net-sosreport-compute-0
 - 20180513_043751-test_delete_policies_while_tenant_attached_to_net-sosreport-compute-1

Description jamo luhrsen 2018-03-15 04:03:39 UTC
Description of problem:

With an OpenDaylight/OpenStack integration, some resources (e.g.
routers, ports) are leaked after running certain tempest scenario
tests. Tempest attempts to clean up after itself, and the leaks do
not happen every time.

This is also being tracked in upstream OpenDaylight:
  https://jira.opendaylight.org/projects/NETVIRT/issues/NETVIRT-1019

It also happens in our downstream QE jobs, for example:
  https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/opendaylight/view/odl-netvirt/job/DFG-opendaylight-odl-netvirt-12_director-rhel-virthost-1cont_2comp-ipv4-vxlan-tempest/

Version-Release number of selected component (if applicable):


How reproducible:
Most of the time

Steps to Reproduce:
1. sudo tempest run --conf /etc/tempest/tempest.conf --regex tempest.api.network.admin.*snat
2. Check for leftover resources (openstack router list); it can take a few
   iterations of step 1 before leftovers appear (see the sketch below).
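
Since the leak is intermittent, a small driver loop helps catch it. A minimal
sketch, assuming an authenticated overcloud session (source overcloudrc) and
the conf path and regex used in this report:

    #!/usr/bin/env python
    # Rerun the snat tempest subset until leftover routers appear.
    import subprocess

    TEMPEST = ["tempest", "run", "--conf", "/etc/tempest/tempest.conf",
               "--regex", "tempest.api.network.admin.*snat"]

    for attempt in range(1, 11):
        subprocess.check_call(TEMPEST)
        # Any router still listed after a run is a leak candidate.
        routers = subprocess.check_output(
            ["openstack", "router", "list", "-f", "value",
             "-c", "ID", "-c", "Name"]).decode()
        if routers.strip():
            print("Leftover routers after attempt %d:\n%s" % (attempt, routers))
            break
    else:
        print("No leftovers in 10 attempts.")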

Actual results:

Leftover tempest-related resources, such as routers and ports, remain
even though tempest attempts to clean up after itself.

Expected results:

All tempest-created resources should be gone.


Additional info:

Comment 1 jamo luhrsen 2018-03-15 04:07:33 UTC
This issue should be analyzed more deeply to understand where the problem
really lies. The leak could worsen over time and eventually cause real
issues. It could be a bug in ODL, tempest, neutron, etc.

Additionally, this bug causes headaches in CI environments that expect
a clean environment between runs.

Comment 2 Mike Kolesnik 2018-03-19 14:23:55 UTC
Can you please attach logs when this happened, so that we can further analyse the bug?

Comment 3 jamo luhrsen 2018-03-19 21:22:04 UTC
(In reply to Mike Kolesnik from comment #2)
> Can you please attach logs when this happened, so that we can further
> analyse the bug?

There are some details in the upstream OpenDaylight JIRA linked above.
What logs do you want? Below is a reproduction; I've attached the
neutron server log, the dhcp-agent log, and the ODL karaf.log.

Reproduction (it takes a few iterations to hit the problem with this
narrower tempest regex):

Mon Mar 19 17:07:21 EDT 2018
(overcloud) [stack@undercloud-0 tempest-dir]$ openstack router list

(overcloud) [stack@undercloud-0 tempest-dir]$ sudo tempest run --conf /etc/tempest/tempest.conf --regex tempest.api.network.admin.*snat
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover --list 
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover  --load-list /tmp/tmpewb2dI
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover  --load-list /tmp/tmpWVaYrj
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_create_router_with_default_snat_value [3.352111s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_create_router_with_default_snat_value [3.675526s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_create_router_with_snat_explicit [5.881535s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_create_router_with_snat_explicit [6.039762s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_reset_gateway_without_snat [4.106241s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_reset_gateway_without_snat [3.959537s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_set_gateway_with_snat_explicit [3.090799s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_set_gateway_with_snat_explicit [3.249783s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_set_gateway_without_snat [3.781734s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_set_gateway_without_snat [3.575197s] ... ok

======
Totals
======
Ran: 10 tests in 30.0000 sec.
 - Passed: 10
 - Skipped: 0
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 40.7122 sec.

==============
Worker Balance
==============
 - Worker 0 (5 tests) => 0:00:20.502541
 - Worker 1 (5 tests) => 0:00:20.214550
(overcloud) [stack@undercloud-0 tempest-dir]$ openstack router list

(overcloud) [stack@undercloud-0 tempest-dir]$ date
Mon Mar 19 17:08:10 EDT 2018
(overcloud) [stack@undercloud-0 tempest-dir]$ openstack router list
(overcloud) [stack@undercloud-0 tempest-dir]$ sudo tempest run --conf /etc/tempest/tempest.conf --regex tempest.api.network.admin.*snat
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover --list 
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover  --load-list /tmp/tmpY36MZP
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover  --load-list /tmp/tmpV566i5
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_create_router_with_default_snat_value [4.249675s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_create_router_with_default_snat_value [4.377853s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_create_router_with_snat_explicit [6.316407s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_create_router_with_snat_explicit [7.588991s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_reset_gateway_without_snat [4.324835s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_reset_gateway_without_snat [4.410701s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_set_gateway_with_snat_explicit [4.006442s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_set_gateway_with_snat_explicit [3.591074s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_set_gateway_without_snat [3.553873s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_set_gateway_without_snat [3.163889s] ... ok

======
Totals
======
Ran: 10 tests in 32.0000 sec.
 - Passed: 10
 - Skipped: 0
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 45.5837 sec.

==============
Worker Balance
==============
 - Worker 0 (5 tests) => 0:00:23.007013
 - Worker 1 (5 tests) => 0:00:22.581773
(overcloud) [stack@undercloud-0 tempest-dir]$ openstack router list
+--------------------------------------+--------------------------------+--------+-------+-------------+-------+----------------------------------+
| ID                                   | Name                           | Status | State | Distributed | HA    | Project                          |
+--------------------------------------+--------------------------------+--------+-------+-------------+-------+----------------------------------+
| ee5f0f97-a8f9-407d-b912-8a3e8ed024d7 | tempest-snat-router-1366717702 | ACTIVE | UP    | False       | False | 6016d798aca9407295f66349f337c307 |
+--------------------------------------+--------------------------------+--------+-------+-------------+-------+----------------------------------+
(overcloud) [stack@undercloud-0 tempest-dir]$ date
Mon Mar 19 17:08:57 EDT 2018

Comment 4 jamo luhrsen 2018-03-19 21:24:49 UTC
Created attachment 1410156 [details]
dhcp-agent

Comment 5 jamo luhrsen 2018-03-19 21:25:30 UTC
Created attachment 1410157 [details]
opendaylight karaf.log

Comment 6 jamo luhrsen 2018-03-19 21:26:08 UTC
Created attachment 1410158 [details]
neutron server log

Comment 10 jamo luhrsen 2018-03-22 19:34:54 UTC
some extra info from an IRC chat:

look at this: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-ocata-upstream-stateful-carbon/460/robot-plugin/log_05_tempest.html.gz#s1-k2-k1-k5-k1-k4

That is Carbon with Ocata: the output of 'router list' after all our tempest tests are done.

And here is Carbon with Pike, which has a bunch of leftover routers:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-pike-upstream-stateful-carbon/244/robot-plugin/log_05_tempest.html.gz#s1-k2-k1-k5-k1-k4

Comment 11 Mike Kolesnik 2018-04-09 06:52:54 UTC
(In reply to jamo luhrsen from comment #3)
> (In reply to Mike Kolesnik from comment #2)
> > Can you please attach logs when this happened, so that we can further
> > analyse the bug?
> 
> there are some details in the upstream opendaylight jira that I gave a link
> to. what logs do you want? below is a reproduction and I've attached the
> neutron server log, dhcp-agent log and odl karaf.log
> 
> reproduction (it takes a few iterations to hit the problem with this smaller
> tempest regex):
> 
> Mon Mar 19 17:07:21 EDT 2018
> (overcloud) [stack@undercloud-0 tempest-dir]$ openstack router list
> 
> (overcloud) [stack@undercloud-0 tempest-dir]$ sudo tempest run --conf
> /etc/tempest/tempest.conf --regex tempest.api.network.admin.*snat
... snip ...
> ==============
> Worker Balance
> ==============
>  - Worker 0 (5 tests) => 0:00:23.007013
>  - Worker 1 (5 tests) => 0:00:22.581773
> (overcloud) [stack@undercloud-0 tempest-dir]$ openstack router list
> +--------------------------------------+--------------------------------+----
> ----+-------+-------------+-------+----------------------------------+
> | ID                                   | Name                           |
> Status | State | Distributed | HA    | Project                          |
> +--------------------------------------+--------------------------------+----
> ----+-------+-------------+-------+----------------------------------+
> | ee5f0f97-a8f9-407d-b912-8a3e8ed024d7 | tempest-snat-router-1366717702 |
> ACTIVE | UP    | False       | False | 6016d798aca9407295f66349f337c307 |
> +--------------------------------------+--------------------------------+----
> ----+-------+-------------+-------+----------------------------------+
> (overcloud) [stack@undercloud-0 tempest-dir]$ date
> Mon Mar 19 17:08:57 EDT 2018

This is weird: when I search for this router by name, I see it has the ID 092757a0-8e79-4471-9198-53927439abb6, not the ID listed here.
The ID ee5f0f97-a8f9-407d-b912-8a3e8ed024d7 is only listed as the device_id of a router gateway port; I see no mention of it being created, accessed, or deleted.

Also, I see that the router with the id 092757a0-8e79-4471-9198-53927439abb6 was deleted:

2018-03-19 21:08:32.090 108283 INFO neutron.wsgi [req-6949247f-e09d-403e-8b43-a946c9026335 f1e3d25d45a9482c899960ba39cd6b76 6016d798aca9407295f66349f337c307 - default default] 172.17.1.16 "DELETE /v2.0/routers/092757a0-8e79-4471-9198-53927439abb6 HTTP/1.1" status: 204  len: 168 time: 1.5183730
2018-03-19 21:08:32.102 108283 INFO networking_odl.journal.journal [req-6a1b0cc9-ebce-453a-a88f-7605aa601b82 - - - - -] Completed (Entry ID: 9256) - delete router 092757a0-8e79-4471-9198-53927439abb6

Perhaps there are two routers with the same name (though I thought the name is random)?

Jamo, can you please attach neutron and karaf logs that start from when the server started and cover the reproduced behavior?
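
For reference, a minimal sketch of this kind of ID correlation: it pulls every
server.log line mentioning a given UUID to reconstruct that router's lifecycle.
The log path and UUID arguments are placeholders; the only assumption is that
neutron log lines begin with a timestamp, so matches come out in order:

    # Print every log line mentioning a router UUID, in chronological
    # order, to see its create/update/delete history (or the lack of it).
    import sys

    def timeline(log_path, uuid):
        with open(log_path) as log:
            for line in log:
                if uuid in line:
                    print(line.rstrip())

    if __name__ == "__main__":
        # e.g.: python timeline.py server.log ee5f0f97-a8f9-407d-b912-8a3e8ed024d7
        timeline(sys.argv[1], sys.argv[2])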

Comment 12 jamo luhrsen 2018-04-11 21:55:53 UTC
Created attachment 1420519 [details]
4/11 repro karaf.log

Comment 13 jamo luhrsen 2018-04-11 21:56:18 UTC
Created attachment 1420520 [details]
4/11 repro server.log

Comment 14 jamo luhrsen 2018-04-11 21:58:03 UTC
(In reply to Mike Kolesnik from comment #11)
> (In reply to jamo luhrsen from comment #3)
> > (In reply to Mike Kolesnik from comment #2)
> > > Can you please attach logs when this happened, so that we can further
> > > analyse the bug?
> > 
> > there are some details in the upstream opendaylight jira that I gave a link
> > to. what logs do you want? below is a reproduction and I've attached the
> > neutron server log, dhcp-agent log and odl karaf.log
> > 
> > reproduction (it takes a few iterations to hit the problem with this smaller
> > tempest regex):
> > 
> > Mon Mar 19 17:07:21 EDT 2018
> > (overcloud) [stack@undercloud-0 tempest-dir]$ openstack router list
> > 
> > (overcloud) [stack@undercloud-0 tempest-dir]$ sudo tempest run --conf
> > /etc/tempest/tempest.conf --regex tempest.api.network.admin.*snat
> ... snip ...
> > ==============
> > Worker Balance
> > ==============
> >  - Worker 0 (5 tests) => 0:00:23.007013
> >  - Worker 1 (5 tests) => 0:00:22.581773
> > (overcloud) [stack@undercloud-0 tempest-dir]$ openstack router list
> > +--------------------------------------+--------------------------------+----
> > ----+-------+-------------+-------+----------------------------------+
> > | ID                                   | Name                           |
> > Status | State | Distributed | HA    | Project                          |
> > +--------------------------------------+--------------------------------+----
> > ----+-------+-------------+-------+----------------------------------+
> > | ee5f0f97-a8f9-407d-b912-8a3e8ed024d7 | tempest-snat-router-1366717702 |
> > ACTIVE | UP    | False       | False | 6016d798aca9407295f66349f337c307 |
> > +--------------------------------------+--------------------------------+----
> > ----+-------+-------------+-------+----------------------------------+
> > (overcloud) [stack@undercloud-0 tempest-dir]$ date
> > Mon Mar 19 17:08:57 EDT 2018
> 
> This is weird, when I search for this router by name I see that it has the
> ID 092757a0-8e79-4471-9198-53927439abb6 and not that ID which is listed here.
> The ID ee5f0f97-a8f9-407d-b912-8a3e8ed024d7 is only listed as the device_id
> for a router gateway port, but I see no mention of it being created or
> accessed or deleted..
> 
> Also, I see that the router with the id 092757a0-8e79-4471-9198-53927439abb6
> was deleted:
> 
> 2018-03-19 21:08:32.090 108283 INFO neutron.wsgi
> [req-6949247f-e09d-403e-8b43-a946c9026335 f1e3d25d45a9482c899960ba39cd6b76
> 6016d798aca9407295f66349f337c307 - default default] 172.17.1.16 "DELETE
> /v2.0/routers/092757a0-8e79-4471-9198-53927439abb6 HTTP/1.1" status: 204 
> len: 168 time: 1.5183730
> 2018-03-19 21:08:32.102 108283 INFO networking_odl.journal.journal
> [req-6a1b0cc9-ebce-453a-a88f-7605aa601b82 - - - - -] Completed (Entry ID:
> 9256) - delete router 092757a0-8e79-4471-9198-53927439abb6
> 
> Perhaps there's two routers with the same name (Though I thought it's a
> random name)?
> 
> Jamo, can you please attach a neutron & karaf log from when the server
> started and that has the reproduced behavior?


Mike,

attached new server.log and karaf.log files for another reproduction
("4/11 repro" in the attachment names). Here's the command-line output:

(overcloud) [stack@undercloud-0 ~]$ tempest run --conf /etc/tempest/tempest.conf --regex tempest.api.network.admin.*snat
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover --list 
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover  --load-list /tmp/tmpf_1wkz
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover  --load-list /tmp/tmpGjw8Mn
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_create_router_with_default_snat_value [4.264322s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_create_router_with_default_snat_value [4.437107s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_create_router_with_snat_explicit [6.570603s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_create_router_with_snat_explicit [7.621135s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_reset_gateway_without_snat [4.224652s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_reset_gateway_without_snat [4.536168s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_set_gateway_with_snat_explicit [4.679894s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_set_gateway_with_snat_explicit [4.522345s] ... ok
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_set_gateway_without_snat [4.244492s] ... ok
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_set_gateway_without_snat [4.016432s] ... ok

======
Totals
======
Ran: 10 tests in 33.0000 sec.
 - Passed: 10
 - Skipped: 0
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 49.1172 sec.

==============
Worker Balance
==============
 - Worker 0 (5 tests) => 0:00:24.159222
 - Worker 1 (5 tests) => 0:00:24.963881
(overcloud) [stack@undercloud-0 ~]$ openstack router list
+--------------------------------------+-----------------------------+--------+-------+-------------+-------+----------------------------------+
| ID                                   | Name                        | Status | State | Distributed | HA    | Project                          |
+--------------------------------------+-----------------------------+--------+-------+-------------+-------+----------------------------------+
| a1dd6ef4-a9a9-4747-bcff-3a3f80b4abb3 | tempest-snat-router-9379766 | ACTIVE | UP    | False       | False | ee0b40bcd52e4747a888f82e5107e122 |
+--------------------------------------+-----------------------------+--------+-------+-------------+-------+----------------------------------+


Also, note that I inserted "JAMO: TEST START" and "JAMO: TEST END" marker
strings into the log files immediately before and after running the above
tempest command.

Comment 15 Mike Kolesnik 2018-05-08 11:51:14 UTC
I'm sorry, but the new log still doesn't contain the needed information.

For reference, here are the JAMO marker lines:
[mkolesni@localhost ~]$ grep -A1 JAMO /tmp/server.log 
JAMO: TEST START
2018-04-11 21:49:25.327 125358 DEBUG neutron.wsgi [-] (125358) accepted ('172.17.1.17', 55186) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
--
JAMO: TEST END
2018-04-11 21:50:03.378 125359 DEBUG neutron.wsgi [-] (125359) accepted ('172.17.1.17', 57274) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883


As you can see, if you grep for the router ID it's only mentioned once (the timestamp is indeed within the test's time frame):

[mkolesni@localhost ~]$ grep a1dd6ef4-a9a9-4747-bcff-3a3f80b4abb3 /tmp/server.log 
2018-04-11 21:49:39.464 125359 DEBUG neutron.api.rpc.handlers.resources_rpc [req-80539b19-186d-4d2c-a341-2ba6af54e882 3713c00dea3247bca8359c5cb812fbde ee0b40bcd52e4747a888f82e5107e122 - - -] neutron.api.rpc.handlers.resources_rpc.ResourcesPushRpcApi method push called with arguments (<neutron_lib.context.Context object at 0x89f4d50>, [Port(admin_state_up=True,allowed_address_pairs=[],binding=PortBinding,binding_levels=[],created_at=2018-04-11T21:49:39Z,data_plane_status=<?>,description='',device_id='a1dd6ef4-a9a9-4747-bcff-3a3f80b4abb3',device_owner='network:router_gateway',dhcp_options=[],distributed_binding=None,dns=None,fixed_ips=[IPAllocation],id=aa8b6068-35e6-4a9a-94f5-866da44201ae,mac_address=fa:16:3e:7f:a9:9d,name='',network_id=103184c3-d46d-4028-978e-50ff00c97dad,project_id='',qos_policy_id=None,revision_number=3,security=PortSecurity(aa8b6068-35e6-4a9a-94f5-866da44201ae),security_group_ids=set([]),status='DOWN',updated_at=2018-04-11T21:49:39Z)], 'updated') {} wrapper /usr/lib/python2.7/site-packages/oslo_log/helpers.py:66
    "device_id": "a1dd6ef4-a9a9-4747-bcff-3a3f80b4abb3"
      "device_id" : "a1dd6ef4-a9a9-4747-bcff-3a3f80b4abb3",
[mkolesni@localhost ~]$ 

So there's no mention of the router being created or deleted.

If we try to grep for the router name we get several matches:
[mkolesni@localhost ~]$ grep tempest-snat-router-9379766 /tmp/server.log 
2018-04-11 21:49:38.590 125359 DEBUG neutron.api.v2.base [req-80539b19-186d-4d2c-a341-2ba6af54e882 3713c00dea3247bca8359c5cb812fbde ee0b40bcd52e4747a888f82e5107e122 - default default] Request body: {u'router': {u'external_gateway_info': {u'network_id': u'103184c3-d46d-4028-978e-50ff00c97dad', u'enable_snat': False}, u'name': u'tempest-snat-router-9379766'}} prepare_request_body /usr/lib/python2.7/site-packages/neutron/api/v2/base.py:695
2018-04-11 21:49:39.861 125359 DEBUG neutron.api.v2.base [req-80539b19-186d-4d2c-a341-2ba6af54e882 3713c00dea3247bca8359c5cb812fbde ee0b40bcd52e4747a888f82e5107e122 - default default] Request body: {u'router': {u'external_gateway_info': {u'network_id': u'103184c3-d46d-4028-978e-50ff00c97dad', u'enable_snat': False}, u'name': u'tempest-snat-router-9379766'}} prepare_request_body /usr/lib/python2.7/site-packages/neutron/api/v2/base.py:695
    "name": "tempest-snat-router-9379766"
      "name" : "tempest-snat-router-9379766",
2018-04-11 21:49:42.067 125357 DEBUG neutron.api.v2.base [req-af49b3b0-28e5-408f-af36-fed0b6085c5d 3713c00dea3247bca8359c5cb812fbde ee0b40bcd52e4747a888f82e5107e122 - default default] Request body: {u'router': {u'external_gateway_info': {u'network_id': u'103184c3-d46d-4028-978e-50ff00c97dad', u'enable_snat': True}, u'name': u'tempest-snat-router-9379766'}} prepare_request_body /usr/lib/python2.7/site-packages/neutron/api/v2/base.py:695
    "name": "tempest-snat-router-9379766"
      "name" : "tempest-snat-router-9379766",
[mkolesni@localhost ~]$ 

The first match is a create attempt that failed, the second is a successful attempt that created the router with ID 35fd120a-937b-4cd4-8aba-d9135775b233, and the third created a router with the same name (not a unique property in neutron anyway) and the ID 408594a8-2ebe-4f41-93aa-3271883268e2. Mind you, both of these routers were deleted, and that can also be seen in the log.

It's impossible to see when this "leftover" router was created or deleted, or where it even came from, but it's not one of the other routers with the same name.
You could probably also check the tempest log and see that it reuses the name for various routers (again, the name doesn't matter; the only thing that matters is the ID).

So if you have any logs that actually capture the creation and attempted deletion of the router, that would be great; otherwise it's impossible to understand where it comes from.

Also, please attach the tempest log as well so that we can get a better understanding of what's going on.
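
A minimal sketch of the kind of check that would settle this, assuming the
networking-odl journal logs "create router <id>" / "delete router <id>"
completion messages mirroring the delete line quoted in comment 11:

    # Diff the router IDs the journal reports as created vs deleted,
    # to spot routers that were never cleaned up.
    import re
    import sys

    UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    PATTERN = re.compile(r"(create|delete) router (%s)" % UUID)

    created, deleted = set(), set()
    with open(sys.argv[1]) as log:  # e.g. server.log
        for line in log:
            m = PATTERN.search(line)
            if m:
                (created if m.group(1) == "create" else deleted).add(m.group(2))

    print("created but never deleted: %s" % sorted(created - deleted))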

Comment 16 Waldemar Znoinski 2018-05-13 18:24:09 UTC
There are random tempest tests failing with errors like:
RBAC policy on object d5d7ca23-dd42-4ec9-a78e-64460f985bd6 cannot be removed because other objects depend on it.\nDetails: There are routers attached to this network that depend on this policy for access.
or
Details: {u'message': u'Unable to complete operation on network 0e9548e4-f181-4cd8-8ea7-d2287f5f3229. There are one or more ports still in use on the network.', u'type': u'NetworkInUse', u'detail': u''}

I'm attaching datastore dumps and sosreports from the run where I saw both of the above messages (https://url.corp.redhat.com/75d5ae3). The sosreports clearly show multiple routers and security groups created but not removed by the tempest tests (karaf and neutron logs are inside the same package).

NOTE: the above run #68 was done using Vishal's fix (from downstream gerrit: https://url.corp.redhat.com/7d6a080) for https://bugzilla.redhat.com/show_bug.cgi?id=1567507. No FIP issues were spotted in 2 consecutive runs (#68 and #66 downstream) with that fix, and the leftover resources may be more apparent now that the FIP issue is not occurring.
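
A minimal sketch for enumerating such leftovers from the CLI (an authenticated
admin session is assumed; note that resources with empty names, e.g. some
ports, will escape a name-based filter):

    # List tempest-named routers, networks, and security groups that
    # survived the test run's cleanup.
    import subprocess

    for resource in (["router"], ["network"], ["security", "group"]):
        cmd = ["openstack"] + resource + ["list", "-f", "value",
                                          "-c", "ID", "-c", "Name"]
        out = subprocess.check_output(cmd).decode()
        leftovers = [line for line in out.splitlines() if "tempest" in line]
        if leftovers:
            print("leftover %s(s):" % " ".join(resource))
            for line in leftovers:
                print("  " + line)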

Comment 17 Waldemar Znoinski 2018-05-13 18:28:38 UTC
Created attachment 1435839 [details]
20180513_043751-test_delete_policies_while_tenant_attached_to_net-sosreport-controller-0

Comment 18 Waldemar Znoinski 2018-05-13 18:29:19 UTC
Created attachment 1435840 [details]
20180513_043751-test_delete_policies_while_tenant_attached_to_net-odltools-undercloud-0.redhat.local

Comment 19 Waldemar Znoinski 2018-05-13 18:30:16 UTC
Created attachment 1435841 [details]
20180513_043751-test_delete_policies_while_tenant_attached_to_net-sosreport-compute-0

Comment 20 Waldemar Znoinski 2018-05-13 18:30:48 UTC
Created attachment 1435842 [details]
20180513_043751-test_delete_policies_while_tenant_attached_to_net-sosreport-compute-1

Comment 21 jamo luhrsen 2018-05-14 03:36:25 UTC
Mike,

What else do you need?

Also, just double checking that you noticed the semi-reasonable
reproduction steps.

Finally, this is happening in upstream ODL (last time I checked),
and we keep a lot of logs for those jobs. We can sift through those
as well.

Thanks,
JamO

Comment 22 Mike Kolesnik 2018-05-21 12:11:52 UTC

*** This bug has been marked as a duplicate of bug 1519783 ***

Comment 23 Mike Kolesnik 2018-05-21 12:15:10 UTC
After debugging this, it seems it's the same issue seen in bug 1519783, where a router gets created on the Neutron side but is never synced to ODL.

The main reason is that recording the operation to the journal fails (probably due to a DB failure) and the create is retried, so a new router gets created and the old router is left dangling in Neutron (in bug 1519783 the scale test tries to delete it and fails, since it's not in ODL).
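
A schematic sketch of the suspected sequence (illustrative only, not actual
networking-odl code; the helper names and in-memory "DB" are stand-ins):

    import uuid

    NEUTRON_DB = []   # stand-in for the Neutron routers table
    JOURNAL = []      # stand-in for the networking-odl journal

    class JournalError(Exception):
        pass

    def save_router_to_neutron_db(name):
        router_id = str(uuid.uuid4())
        NEUTRON_DB.append((router_id, name))
        return router_id

    def create_router(name, journal_fails):
        router_id = save_router_to_neutron_db(name)
        if journal_fails:
            # The router row is already committed, but the journal entry
            # that would sync it to ODL is never recorded.
            raise JournalError("DB failure while recording journal entry")
        JOURNAL.append(("create router", router_id))
        return router_id

    # The failed create is retried with the same (reused) name:
    try:
        create_router("tempest-snat-router-9379766", journal_fails=True)
    except JournalError:
        create_router("tempest-snat-router-9379766", journal_fails=False)

    # Only the second router is known to ODL; the first dangles in Neutron.
    synced = {rid for _, rid in JOURNAL}
    dangling = [(rid, name) for rid, name in NEUTRON_DB if rid not in synced]
    print("dangling in Neutron, unknown to ODL: %s" % dangling)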

This should be fixed by L3 flavors, which should be available in OSP 14.