Bug 1576414 - [HA] massive FIP issues (most of the scenario tempest tests fail)
Summary: [HA] massive FIP issues (most of the scenario tempest tests fail)
Keywords:
Status: CLOSED DUPLICATE of bug 1567507
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Aswin Suryanarayanan
QA Contact: Itzik Brown
URL:
Whiteboard: odl_netvirt, odl_ha
Depends On: 1567507
Blocks:
Reported: 2018-05-09 12:07 UTC by Waldemar Znoinski
Modified: 2018-10-24 12:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2018-05-14 10:55:44 UTC
Target Upstream Version:
Embargoed:


Attachments
massive_FIP_issues_tempest_results (6.07 MB, text/html), 2018-05-09 12:07 UTC, Waldemar Znoinski
massive_FIP_issues_tempest_results (3.81 MB, text/html), 2018-05-09 12:12 UTC, Waldemar Znoinski
opendaylight captures (datastores) (85.01 KB, application/x-gzip), 2018-05-09 12:32 UTC, Waldemar Znoinski
openvswitch capture controller-0 (617.92 KB, text/plain), 2018-05-09 12:34 UTC, Waldemar Znoinski
openvswitch capture compute-0 (103.44 KB, text/plain), 2018-05-09 12:36 UTC, Waldemar Znoinski
openvswitch capture compute-1 (169.55 KB, text/plain), 2018-05-09 12:38 UTC, Waldemar Znoinski
docker logs with karaf (1.22 MB, application/zip), 2018-05-09 13:49 UTC, Waldemar Znoinski
docker logs with karaf controller-1 (807.06 KB, application/zip), 2018-05-09 14:09 UTC, Waldemar Znoinski
docker logs with karaf controller-2 (1.30 MB, application/zip), 2018-05-09 14:11 UTC, Waldemar Znoinski
neutron server.log controller-0 (660.99 KB, application/zip), 2018-05-09 14:44 UTC, Waldemar Znoinski
neutron server.log.1 controller-0 (2.08 MB, application/zip), 2018-05-09 14:45 UTC, Waldemar Znoinski
neutron server.log.2 controller-0 (853.75 KB, application/zip), 2018-05-09 14:46 UTC, Waldemar Znoinski
neutron server.log.3 controller-0 (1002.30 KB, application/zip), 2018-05-09 14:46 UTC, Waldemar Znoinski
neutron server.log controller-1 (650.06 KB, application/zip), 2018-05-09 14:49 UTC, Waldemar Znoinski
neutron server.log.1 controller-1 (1.57 MB, application/zip), 2018-05-09 14:49 UTC, Waldemar Znoinski
neutron server.log.2 controller-1 (1.12 MB, application/zip), 2018-05-09 14:50 UTC, Waldemar Znoinski
neutron server.log.3 controller-1 (861.06 KB, application/zip), 2018-05-09 14:50 UTC, Waldemar Znoinski
neutron server.log controller-2 (668.85 KB, application/zip), 2018-05-09 14:56 UTC, Waldemar Znoinski
neutron server.log.1 controller-2 (1.62 MB, application/zip), 2018-05-09 14:56 UTC, Waldemar Znoinski
neutron server.log.2 controller-2 (1.21 MB, application/zip), 2018-05-09 14:57 UTC, Waldemar Znoinski
neutron server.log.3 controller-2 (993.83 KB, application/zip), 2018-05-09 14:57 UTC, Waldemar Znoinski

Description Waldemar Znoinski 2018-05-09 12:07:30 UTC
Created attachment 1433772 [details]
massive_FIP_issues_tempest_results

Description of problem:
A massive share (50-90%) of the scenario tempest tests fail in some deployments - see attachment (massive_FIP_issues_tempest_results)

When trying manually, I spawned two VMs with the cirros image. They got fixed IPs and can ping their router (192.168.99.1), but communication from a VM to an external FIP, or from the undercloud to either of the two VMs' FIPs, is _not_ working.
The VMs are on two different computes.
table=21 and table=48 are present on the computes.

- the VM gets a fixed IP
- the VM can ping its router
- the ping from VM1 to VM2 (on a different compute, on the same network) works
- all 3 ODL controllers have been up for the last 47h (none of them was restarted or brought down)


Version-Release number of selected component (if applicable):
osp13, puddle 2018-05-07.2
opendaylight-8.0.0-9.el7ost.noarch
puppet-opendaylight-8.1.2-1.38977efgit.el7ost.noarch


How reproducible:
happens in 80% of HA CI deployments
happens in 10% of non-ha deployments

Steps to Reproduce:
1. deploy osp13 with opendaylight
2. run tempest scenario tests or create network/router/vms manually

Actual results:
FIP not working

Expected results:
FIP working

Additional info:
- it looks like it's not the table=48 issue (https://bugzilla.redhat.com/show_bug.cgi?id=1568989) 
- it's not the random FIP issue (https://bugzilla.redhat.com/show_bug.cgi?id=1570615)
- it's not https://bugzilla.redhat.com/show_bug.cgi?id=1478061
- it doesn't match any other known FIP related bugzillas

Comment 1 Waldemar Znoinski 2018-05-09 12:12:20 UTC
Created attachment 1433785 [details]
massive_FIP_issues_tempest_results

Comment 2 Waldemar Znoinski 2018-05-09 12:30:51 UTC
It's not possible to attach all logs for all failures spotted in this deployment, hence all the attachments I'm providing below relate to the first noticed issue (timewise): test_server_connectivity_cold_migration

Comment 3 Waldemar Znoinski 2018-05-09 12:32:21 UTC
Created attachment 1433789 [details]
opendaylight captures (datastores)

Comment 4 Waldemar Znoinski 2018-05-09 12:34:44 UTC
Created attachment 1433790 [details]
openvswitch capture controller-0

Comment 5 Waldemar Znoinski 2018-05-09 12:36:48 UTC
Created attachment 1433791 [details]
openvswitch capture compute-0

Comment 6 Waldemar Znoinski 2018-05-09 12:38:08 UTC
Created attachment 1433794 [details]
openvswitch capture compute-1

Comment 7 Waldemar Znoinski 2018-05-09 13:49:31 UTC
Created attachment 1433874 [details]
docker logs with karaf

Comment 8 Waldemar Znoinski 2018-05-09 14:09:18 UTC
Created attachment 1433879 [details]
docker logs with karaf controller-1

Comment 9 Waldemar Znoinski 2018-05-09 14:11:19 UTC
Created attachment 1433881 [details]
docker logs with karaf controller-2

Comment 10 Waldemar Znoinski 2018-05-09 14:15:46 UTC
quite possibly it's https://bugzilla.redhat.com/show_bug.cgi?id=1567507 we're seeing, thanks Aswin

Comment 11 Aswin Suryanarayanan 2018-05-09 14:33:16 UTC
(In reply to Waldemar Znoinski from comment #10)
> quite possibly it's https://bugzilla.redhat.com/show_bug.cgi?id=1567507
> we're seeing, thanks Aswin

From the flow dump, the router id seems to be -1 for NAT flows, so we should be hitting 1567507.

cookie=0x8000006, duration=302.720s, table=26, n_packets=0, n_bytes=0, priority=5,ip,metadata=0xfffffe/0xfffffe actions=ct(table=46,zone=5001,nat)
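
A minimal sketch of how a flow dump could be scanned for entries like the one above, in Python. It rests on an assumption that is not stated in this bug: that the router/VPN id is carried in the metadata bits covered by the mask 0xfffffe, so an id of -1 would show up as all ones in that field. The function name, the regular expression and the default table list are illustrative only and are not part of any ODL or OVS tooling.

```python
import re

# ASSUMPTION (not confirmed by this bug): the router/VPN id occupies the
# metadata bits covered by this mask, so id -1 appears as all ones there.
ASSUMED_ID_MASK = 0xFFFFFE

# Matches the first "table=N" and the following "metadata=value[/mask]"
# in one line of `ovs-ofctl dump-flows` style output.
FLOW_RE = re.compile(
    r"\btable=(\d+).*?\bmetadata=(0x[0-9a-fA-F]+)(?:/(0x[0-9a-fA-F]+))?"
)


def suspicious_flows(dump_lines, tables=(26,)):
    """Return (table, line) pairs whose id field is all ones under the mask."""
    hits = []
    for line in dump_lines:
        match = FLOW_RE.search(line)
        if not match:
            continue
        table = int(match.group(1))
        if table not in tables:
            continue
        value = int(match.group(2), 16)
        # No explicit mask means the flow matches the full 64-bit metadata.
        mask = int(match.group(3), 16) if match.group(3) else 0xFFFFFFFFFFFFFFFF
        covers_field = (mask & ASSUMED_ID_MASK) == ASSUMED_ID_MASK
        all_ones = (value & ASSUMED_ID_MASK) == ASSUMED_ID_MASK
        if covers_field and all_ones:
            hits.append((table, line.strip()))
    return hits
```

Feeding it the lines of a saved flow dump from a compute node would list the table=26 flows that look like the one quoted above; an empty result would suggest the deployment is not showing this particular symptom.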

Comment 12 Waldemar Znoinski 2018-05-09 14:44:40 UTC
Created attachment 1433889 [details]
neutron server.log controller-0

Comment 13 Waldemar Znoinski 2018-05-09 14:45:02 UTC
Created attachment 1433890 [details]
neutron server.log.1 controller-0

Comment 14 Waldemar Znoinski 2018-05-09 14:46:20 UTC
Created attachment 1433892 [details]
neutron server.log.2 controller-0

Comment 15 Waldemar Znoinski 2018-05-09 14:46:40 UTC
Created attachment 1433893 [details]
neutron server.log.3 controller-0

Comment 16 Waldemar Znoinski 2018-05-09 14:49:32 UTC
Created attachment 1433895 [details]
neutron server.log controller-1

Comment 17 Waldemar Znoinski 2018-05-09 14:49:56 UTC
Created attachment 1433896 [details]
neutron server.log.1 controller-1

Comment 18 Waldemar Znoinski 2018-05-09 14:50:18 UTC
Created attachment 1433897 [details]
neutron server.log.2 controller-1

Comment 19 Waldemar Znoinski 2018-05-09 14:50:39 UTC
Created attachment 1433898 [details]
neutron server.log.3 controller-1

Comment 20 Waldemar Znoinski 2018-05-09 14:56:19 UTC
Created attachment 1433900 [details]
neutron server.log controller-2

Comment 21 Waldemar Znoinski 2018-05-09 14:56:46 UTC
Created attachment 1433901 [details]
neutron server.log.1 controller-2

Comment 22 Waldemar Znoinski 2018-05-09 14:57:29 UTC
Created attachment 1433902 [details]
neutron server.log.2 controller-2

Comment 23 Waldemar Znoinski 2018-05-09 14:57:57 UTC
Created attachment 1433903 [details]
neutron server.log.3 controller-2

Comment 24 Mike Kolesnik 2018-05-13 13:19:44 UTC
(In reply to Waldemar Znoinski from comment #10)
> quite possibly it's https://bugzilla.redhat.com/show_bug.cgi?id=1567507
> we're seeing, thanks Aswin

Aswin can you double check this?

If so I think we can close this bug as duplicate.

Comment 25 Waldemar Znoinski 2018-05-13 17:37:57 UTC
Aswin, I see 'uncaught exception' in the karaf.log of this bug, but would you be able to confirm that it's the only thing causing the failures mentioned in this bz? I'm afraid we're seeing "uncaught exception" incidentally, alongside other problems, and I wouldn't want to close this bz if we need to track those other problems.

Comment 26 Aswin Suryanarayanan 2018-05-14 06:00:07 UTC
(In reply to Waldemar Znoinski from comment #25)
> Aswin I see ' uncaught exception ' in karaf.log of this bug but would you be
> able to confirm that's the only thing that causes failures mentioned in this
> bz? I'm afraid we're seeing  "uncaught exception"  by the way of other
> things and wouldn't want to close this bz if we need to track these other
> problems

Waldek, the uncaught exception is due to conflicting datastore writes, and we do have some of these exceptions in the whitelist, as we have not identified them causing any functional issue. I think it should not be the root cause of the FIP failures, as the logs do not seem to be associated with any preceding NAT operation.

The exception related to the FIP issue (router id being -1) is not present in the logs, but we lost some logs due to the file size issue. It is from the flows that we were able to identify that the router id was -1.

Can we mark this as duplicate now and if 1567507 does not solve this issue we can reopen?
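
A rough sketch, in Python, of the kind of filtering the whitelist mentioned above implies: keep only the log lines that mention "uncaught exception" and do not match a known, whitelisted pattern. The function name and the idea of matching whitelist entries as plain substrings are assumptions for illustration; the actual whitelist contents and its format are not given in this bug.

```python
def unexplained_exceptions(log_lines, whitelist=()):
    """Return 'uncaught exception' lines that match no whitelist entry.

    `whitelist` holds hypothetical substrings; matching is case-insensitive.
    """
    hits = []
    for line in log_lines:
        lowered = line.lower()
        if "uncaught exception" not in lowered:
            continue
        # Skip lines already known not to cause functional issues.
        if any(entry.lower() in lowered for entry in whitelist):
            continue
        hits.append(line.rstrip("\n"))
    return hits
```

Anything this returns for a karaf.log would still need to be read by hand; it only narrows down which entries are not already explained by the whitelist.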

Comment 27 Waldemar Znoinski 2018-05-14 09:23:24 UTC
Thanks for the good explanation Aswin, 
This (1576414) bug, so far, is leaning towards some fixed ip / metadata agent problems. The VMs don't seem to be getting fixed ip. 

Do you think this may be caused by conflicting datastore writes you're solving in 1567507?

Comment 28 Waldemar Znoinski 2018-05-14 10:55:44 UTC
Aswin,
sorry, I got two bugzillas mixed up
let's close this one and reopen it if needed, as you said

*** This bug has been marked as a duplicate of bug 1567507 ***

