Bug 1576414

Summary: [HA] massive FIP issues (most of the scenario tempest tests fail)
Product: Red Hat OpenStack
Reporter: Waldemar Znoinski <wznoinsk>
Component: opendaylight
Assignee: Aswin Suryanarayanan <asuryana>
Status: CLOSED DUPLICATE
QA Contact: Itzik Brown <itbrown>
Severity: urgent
Priority: unspecified
Version: 13.0 (Queens)
CC: aadam, asuryana, jluhrsen, mkolesni, nyechiel
Hardware: Unspecified
OS: Unspecified
Whiteboard: odl_netvirt, odl_ha
Doc Type: If docs needed, set a value
Environment: N/A
Last Closed: 2018-05-14 10:55:44 UTC
Type: Bug
Bug Depends On: 1567507
Attachments (Description; Flags are "none" for all):
massive_FIP_issues_tempest_results
massive_FIP_issues_tempest_results
opendaylight captures (datastores)
openvswitch capture controller-0
openvswitch capture compute-0
openvswitch capture compute-1
docker logs with karaf
docker logs with karaf controller-1
docker logs with karaf controller-2
neutron server.log controller-0
neutron server.log.1 controller-0
neutron server.log.2 controller-0
neutron server.log.3 controller-0
neutron server.log controller-1
neutron server.log.1 controller-1
neutron server.log.2 controller-1
neutron server.log.3 controller-1
neutron server.log controller-2
neutron server.log.1 controller-2
neutron server.log.2 controller-2
neutron server.log.3 controller-2

Description Waldemar Znoinski 2018-05-09 12:07:30 UTC
Created attachment 1433772 [details]
massive_FIP_issues_tempest_results

Description of problem:
A large share (50-90%) of the tempest scenario tests fail in some deployments; see the attachment (massive_FIP_issues_tempest_results).

When reproducing manually, I spawned two VMs from the cirros image. Both got a fixed IP and can ping their router (192.168.99.1), but communication from a VM to an external FIP, or from the undercloud to either of the two VMs' FIPs, is _not_ working.
The VMs are on two different computes.
Flows for table=21 and table=48 are present on the computes.

- the VM gets a fixed IP
- the VM can ping its router
- the ping from VM1 to VM2 (on a different compute, on the same network) works
- all 3 ODL controllers have been up for the last 47h (none of them was restarted or brought down)
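
The check that table=21 and table=48 have flows on a compute can be scripted. The following is a minimal sketch, assuming the flows were saved as text from `ovs-ofctl dump-flows`; the helper names `flows_per_table` and `missing_tables` are hypothetical and not part of any tool mentioned in this report. It only counts flows per table and does not validate their contents.

```python
import re
from collections import Counter

# ovs-ofctl prints the flow's own "table=<n>" field before match and actions,
# so the first occurrence on a line is the table the flow belongs to.
TABLE_RE = re.compile(r"\btable=(\d+)")


def flows_per_table(dump_text):
    """Count flows per OpenFlow table in saved `ovs-ofctl dump-flows` output.

    Hypothetical helper for illustration; lines without a table= field
    (for example the reply header) are ignored.
    """
    counts = Counter()
    for line in dump_text.splitlines():
        match = TABLE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def missing_tables(dump_text, required=(21, 48)):
    """Return the tables from `required` that have no flows in the dump."""
    counts = flows_per_table(dump_text)
    return [table for table in required if counts[table] == 0]
```

An empty list from `missing_tables` only means that each required table holds at least one flow; it says nothing about whether those flows are correct.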


Version-Release number of selected component (if applicable):
osp13, puddle 2018-05-07.2
opendaylight-8.0.0-9.el7ost.noarch
puppet-opendaylight-8.1.2-1.38977efgit.el7ost.noarch


How reproducible:
happens in 80% of HA CI deployments
happens in 10% of non-ha deployments

Steps to Reproduce:
1. Deploy OSP13 with OpenDaylight
2. Run the tempest scenario tests, or create the network/router/VMs manually

Actual results:
FIP not working

Expected results:
FIP working

Additional info:
- it looks like it's not the table=48 issue (https://bugzilla.redhat.com/show_bug.cgi?id=1568989) 
- it's not the random FIP issue (https://bugzilla.redhat.com/show_bug.cgi?id=1570615)
- it's not https://bugzilla.redhat.com/show_bug.cgi?id=1478061
- it doesn't match any other known FIP related bugzillas

Comment 1 Waldemar Znoinski 2018-05-09 12:12:20 UTC
Created attachment 1433785 [details]
massive_FIP_issues_tempest_results

Comment 2 Waldemar Znoinski 2018-05-09 12:30:51 UTC
It's not possible to attach all logs for all the failures spotted in this deployment, so all the attachments I'm providing below relate to the first issue noticed (timewise): test_server_connectivity_cold_migration

Comment 3 Waldemar Znoinski 2018-05-09 12:32:21 UTC
Created attachment 1433789 [details]
opendaylight captures (datastores)

Comment 4 Waldemar Znoinski 2018-05-09 12:34:44 UTC
Created attachment 1433790 [details]
openvswitch capture controller-0

Comment 5 Waldemar Znoinski 2018-05-09 12:36:48 UTC
Created attachment 1433791 [details]
openvswitch capture compute-0

Comment 6 Waldemar Znoinski 2018-05-09 12:38:08 UTC
Created attachment 1433794 [details]
openvswitch capture compute-1

Comment 7 Waldemar Znoinski 2018-05-09 13:49:31 UTC
Created attachment 1433874 [details]
docker logs with karaf

Comment 8 Waldemar Znoinski 2018-05-09 14:09:18 UTC
Created attachment 1433879 [details]
docker logs with karaf controller-1

Comment 9 Waldemar Znoinski 2018-05-09 14:11:19 UTC
Created attachment 1433881 [details]
docker logs with karaf controller-2

Comment 10 Waldemar Znoinski 2018-05-09 14:15:46 UTC
Quite possibly it's https://bugzilla.redhat.com/show_bug.cgi?id=1567507 we're seeing, thanks Aswin

Comment 11 Aswin Suryanarayanan 2018-05-09 14:33:16 UTC
(In reply to Waldemar Znoinski from comment #10)
> quite possibly it's https://bugzilla.redhat.com/show_bug.cgi?id=1567507
> we're seeing, thanks Aswin

From the flow dump, the router id appears to be -1 for the NAT flows, so we are most likely hitting 1567507.

cookie=0x8000006, duration=302.720s, table=26, n_packets=0, n_bytes=0, priority=5,ip,metadata=0xfffffe/0xfffffe actions=ct(table=46,zone=5001,nat)
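
The check described in this comment can be sketched as a small script. This is a minimal sketch, not netvirt code: it assumes that the router id is carried in the metadata bits covered by the mask shown in the flow above (0xfffffe), so an id of -1 truncated to that field shows up as all ones (value equal to mask). The names `parse_flow`, `has_invalid_router_id` and `ROUTER_ID_MASK` are hypothetical.

```python
import re

# Extract "table=<n>" and "metadata=<value>/<mask>" from one dump-flows line.
TABLE_RE = re.compile(r"\btable=(\d+)")
METADATA_RE = re.compile(r"\bmetadata=(0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)")

# Assumption: mask of the metadata field that holds the router id,
# copied from the flow quoted in this comment.
ROUTER_ID_MASK = 0xFFFFFE


def parse_flow(line):
    """Return (table, metadata_value, metadata_mask), or None if a field is absent."""
    table = TABLE_RE.search(line)
    metadata = METADATA_RE.search(line)
    if not table or not metadata:
        return None
    return (
        int(table.group(1)),
        int(metadata.group(1), 16),
        int(metadata.group(2), 16),
    )


def has_invalid_router_id(line, mask=ROUTER_ID_MASK):
    """True if the masked metadata field is all ones, i.e. consistent with id -1."""
    parsed = parse_flow(line)
    if parsed is None:
        return False
    _table, value, flow_mask = parsed
    # The flow must match on the whole field, and every bit of it must be set.
    return (flow_mask & mask) == mask and (value & mask) == mask
```

Running `has_invalid_router_id` over each line of a saved flow dump would list the candidate flows; whether they are actually caused by 1567507 still needs the datastore and karaf logs to confirm.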

Comment 12 Waldemar Znoinski 2018-05-09 14:44:40 UTC
Created attachment 1433889 [details]
neutron server.log controller-0

Comment 13 Waldemar Znoinski 2018-05-09 14:45:02 UTC
Created attachment 1433890 [details]
neutron server.log.1 controller-0

Comment 14 Waldemar Znoinski 2018-05-09 14:46:20 UTC
Created attachment 1433892 [details]
neutron server.log.2 controller-0

Comment 15 Waldemar Znoinski 2018-05-09 14:46:40 UTC
Created attachment 1433893 [details]
neutron server.log.3 controller-0

Comment 16 Waldemar Znoinski 2018-05-09 14:49:32 UTC
Created attachment 1433895 [details]
neutron server.log controller-1

Comment 17 Waldemar Znoinski 2018-05-09 14:49:56 UTC
Created attachment 1433896 [details]
neutron server.log.1 controller-1

Comment 18 Waldemar Znoinski 2018-05-09 14:50:18 UTC
Created attachment 1433897 [details]
neutron server.log.2 controller-1

Comment 19 Waldemar Znoinski 2018-05-09 14:50:39 UTC
Created attachment 1433898 [details]
neutron server.log.3 controller-1

Comment 20 Waldemar Znoinski 2018-05-09 14:56:19 UTC
Created attachment 1433900 [details]
neutron server.log controller-2

Comment 21 Waldemar Znoinski 2018-05-09 14:56:46 UTC
Created attachment 1433901 [details]
neutron server.log.1 controller-2

Comment 22 Waldemar Znoinski 2018-05-09 14:57:29 UTC
Created attachment 1433902 [details]
neutron server.log.2 controller-2

Comment 23 Waldemar Znoinski 2018-05-09 14:57:57 UTC
Created attachment 1433903 [details]
neutron server.log.3 controller-2

Comment 24 Mike Kolesnik 2018-05-13 13:19:44 UTC
(In reply to Waldemar Znoinski from comment #10)
> quite possibly it's https://bugzilla.redhat.com/show_bug.cgi?id=1567507
> we're seeing, thanks Aswin

Aswin, can you double-check this?

If so, I think we can close this bug as a duplicate.

Comment 25 Waldemar Znoinski 2018-05-13 17:37:57 UTC
Aswin, I see "uncaught exception" in the karaf.log of this bug, but would you be able to confirm that it's the only thing causing the failures mentioned in this bz? I'm afraid we're seeing "uncaught exception" as a by-product of other things, and I wouldn't want to close this bz if we need to track those other problems.

Comment 26 Aswin Suryanarayanan 2018-05-14 06:00:07 UTC
(In reply to Waldemar Znoinski from comment #25)
> Aswin, I see "uncaught exception" in the karaf.log of this bug, but would
> you be able to confirm that it's the only thing causing the failures
> mentioned in this bz? I'm afraid we're seeing "uncaught exception" as a
> by-product of other things, and I wouldn't want to close this bz if we
> need to track those other problems.

Waldek, the uncaught exception is due to conflicting datastore writes. We do have some of these exceptions on the whitelist, as we have not found them to cause any functional issue. I think it should not be the root cause of the FIP failures, as those log entries do not seem to be associated with any preceding NAT operation.

The exception related to the FIP issue (router id being -1) is not present in the logs, but we lost some of the logs due to the file size issue. It is from the flows that we were able to identify that the router id was -1.

Can we mark this as a duplicate now, and reopen it if 1567507 does not solve the issue?

Comment 27 Waldemar Znoinski 2018-05-14 09:23:24 UTC
Thanks for the good explanation, Aswin.
So far this bug (1576414) is leaning towards some fixed IP / metadata agent problems. The VMs don't seem to be getting a fixed IP.

Do you think this may be caused by the conflicting datastore writes you're solving in 1567507?

Comment 28 Waldemar Znoinski 2018-05-14 10:55:44 UTC
Aswin,
sorry, I got two bugzillas mixed up.
Let's close this one and reopen it if needed, as you said.

*** This bug has been marked as a duplicate of bug 1567507 ***