Created attachment 1433772 [details]
massive_FIP_issues_tempest_results

Description of problem:
A massive share (50-90%) of tempest scenario tests fail in some deployments - see attachment (massive_FIP_issues_tempest_results).

When trying manually, I spawned two VMs with the cirros image; they got fixed IPs and can ping their router (192.168.99.1), but communication from a VM to an external FIP, or from the undercloud to either of the two VMs' FIPs, is _not_ working. The VMs are on two different computes.

- table=21 and table=48 are present on the computes
- the VM gets a fixed IP
- the VM can ping its router
- ping from VM1 to VM2 (on a different compute, on the same network) works
- all 3 ODL controllers have been up for the last 47h (none of them was restarted/brought down)

Version-Release number of selected component (if applicable):
osp13, puddle 2018-05-07.2
opendaylight-8.0.0-9.el7ost.noarch
puppet-opendaylight-8.1.2-1.38977efgit.el7ost.noarch

How reproducible:
happens in 80% of HA CI deployments
happens in 10% of non-HA deployments

Steps to Reproduce:
1. deploy osp13 with opendaylight
2. run tempest scenario tests or create network/router/VMs manually
3.

Actual results:
FIP not working

Expected results:
FIP working

Additional info:
- it looks like it's not the table=48 issue (https://bugzilla.redhat.com/show_bug.cgi?id=1568989)
- it's not the random FIP issue (https://bugzilla.redhat.com/show_bug.cgi?id=1570615)
- it's not https://bugzilla.redhat.com/show_bug.cgi?id=1478061
- it doesn't match any other known FIP-related bugzillas
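The manual table checks described above (flows present in table=21 and table=48 on the computes) can be automated against a saved `ovs-ofctl dump-flows` capture. A minimal sketch, assuming the dump is available as plain text; the sample flow lines below are illustrative, not copied from the attachments:

```python
# Hedged sketch: scan a saved OVS flow dump for the tables the report
# verifies by hand (21 and 48). The sample dump is illustrative only.
import re

sample_dump = """\
cookie=0x8000003, duration=10.1s, table=21, n_packets=4, n_bytes=392, priority=42,ip actions=goto_table:22
cookie=0x8000006, duration=10.1s, table=48, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,49),resubmit(,50)
"""

def tables_present(dump: str, wanted=(21, 48)):
    """Return a map of wanted table id -> whether any flow uses it."""
    found = {int(m) for m in re.findall(r"table=(\d+),", dump)}
    return {t: (t in found) for t in wanted}

print(tables_present(sample_dump))  # both tables present in the sample
```

Running this against the real captures from compute-0/compute-1 would confirm the "table=21 and table=48 are there" observation mechanically.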
Created attachment 1433785 [details] massive_FIP_issues_tempest_results
It's not possible to attach logs for every failure spotted in this deployment, hence all the attachments I'm providing below relate to the first issue noticed (timewise): test_server_connectivity_cold_migration.
Created attachment 1433789 [details] opendaylight captures (datastores)
Created attachment 1433790 [details] openvswitch capture controller-0
Created attachment 1433791 [details] openvswitch capture compute-0
Created attachment 1433794 [details] openvswitch capture compute-1
Created attachment 1433874 [details] docker logs with karaf
Created attachment 1433879 [details] docker logs with karaf controller-1
Created attachment 1433881 [details] docker logs with karaf controller-2
Quite possibly it's https://bugzilla.redhat.com/show_bug.cgi?id=1567507 we're seeing, thanks Aswin.
(In reply to Waldemar Znoinski from comment #10)
> quite possibly it's https://bugzilla.redhat.com/show_bug.cgi?id=1567507
> we're seeing, thanks Aswin

From the flow dump, the router id seems to be -1 for the NAT flows, so we should be hitting 1567507.

cookie=0x8000006, duration=302.720s, table=26, n_packets=0, n_bytes=0, priority=5,ip,metadata=0xfffffe/0xfffffe actions=ct(table=46,zone=5001,nat)
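The "router id is -1" reading of `metadata=0xfffffe/0xfffffe` above can be sanity-checked with a small decoder. This is a sketch under an assumption not verified against the netvirt source: that the router/VPN id is stored shifted left by one bit under the mask 0xfffffe, so an all-ones masked value corresponds to an id of -1 (unset):

```python
# Hedged sketch: decode the router/VPN id from the metadata match in the
# quoted NAT flow. ASSUMPTION (not confirmed from netvirt code): the id
# occupies the bits of mask 0xfffffe, shifted left by one.
MASK = 0xFFFFFE

def decode_router_id(metadata: int, mask: int = MASK) -> int:
    raw = (metadata & mask) >> 1      # undo the assumed 1-bit shift
    width = mask.bit_length() - 1     # number of id bits under the mask (23)
    if raw == (1 << width) - 1:       # all ones -> the id was written as -1
        return -1
    return raw

print(decode_router_id(0xFFFFFE))  # -1, matching the suspect flow
```

Under that assumption, `0xfffffe` decodes to -1, consistent with the diagnosis that an unresolved router id was programmed into the NAT flow.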
Created attachment 1433889 [details] neutron server.log controller-0
Created attachment 1433890 [details] neutron server.log.1 controller-0
Created attachment 1433892 [details] neutron server.log.2 controller-0
Created attachment 1433893 [details] neutron server.log.3 controller-0
Created attachment 1433895 [details] neutron server.log controller-1
Created attachment 1433896 [details] neutron server.log.1 controller-1
Created attachment 1433897 [details] neutron server.log.2 controller-1
Created attachment 1433898 [details] neutron server.log.3 controller-1
Created attachment 1433900 [details] neutron server.log controller-2
Created attachment 1433901 [details] neutron server.log.1 controller-2
Created attachment 1433902 [details] neutron server.log.2 controller-2
Created attachment 1433903 [details] neutron server.log.3 controller-2
(In reply to Waldemar Znoinski from comment #10)
> quite possibly it's https://bugzilla.redhat.com/show_bug.cgi?id=1567507
> we're seeing, thanks Aswin

Aswin, can you double check this? If so, I think we can close this bug as a duplicate.
Aswin, I see 'uncaught exception' in the karaf.log of this bug, but would you be able to confirm that's the only thing causing the failures mentioned in this bz? I'm afraid we're seeing "uncaught exception" by way of other things and wouldn't want to close this bz if we need to track those other problems.
(In reply to Waldemar Znoinski from comment #25)
> Aswin, I see 'uncaught exception' in the karaf.log of this bug, but would
> you be able to confirm that's the only thing causing the failures mentioned
> in this bz? I'm afraid we're seeing "uncaught exception" by way of other
> things and wouldn't want to close this bz if we need to track those other
> problems

Waldek,

The uncaught exceptions are due to conflicting datastore writes, and we do have some of these exceptions in the whitelist, as we have not identified them causing any functional issue. I think they should not be the root cause of the FIP failures, as the log entries do not seem to be associated with any preceding NAT operation. The exception related to the FIP issue (router id being -1) is not present in the logs, but we lost some logs due to the file size issue; it was from the flows that we were able to identify that the router id was -1. Can we mark this as a duplicate for now, and reopen if 1567507 does not solve the issue?
Thanks for the good explanation, Aswin.
This bug (1576414), so far, is leaning towards fixed IP / metadata agent problems - the VMs don't seem to be getting a fixed IP. Do you think this may be caused by the conflicting datastore writes you're solving in 1567507?
Aswin, sorry, I mixed up two bugzillas - let's close this one and reopen as you said.

*** This bug has been marked as a duplicate of bug 1567507 ***