Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1625995

Summary: [CI] no connectivity to public addresses via br-ex
Product: Red Hat OpenStack
Reporter: Waldemar Znoinski <wznoinsk>
Component: opendaylight
Assignee: Mike Kolesnik <mkolesni>
Status: CLOSED DUPLICATE
QA Contact: Noam Manos <nmanos>
Severity: high
Docs Contact:
Priority: unspecified
Version: 14.0 (Rocky)
CC: aadam, abregman, mkolesni, nyechiel
Target Milestone: ---
Keywords: AutomationBlocker
Target Release: ---
Flags: abregman: needinfo-
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment: N/A
Last Closed: 2018-09-17 12:34:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1626488
Bug Blocks:

Description Waldemar Znoinski 2018-09-06 10:45:20 UTC
Description of problem:
after deploying OSP14 + ODL, overcloud nodes (e.g. controller-0) have no connectivity to public addresses via br-ex

controller-0:

[root@controller-0 ~]# ip -o a sh br-ex                                                                                                                                                                            
7: br-ex    inet 10.0.0.108/24 brd 10.0.0.255 scope global br-ex\       valid_lft forever preferred_lft forever
7: br-ex    inet 10.0.0.101/32 brd 10.0.0.255 scope global br-ex\       valid_lft forever preferred_lft forever
7: br-ex    inet6 fe80::5054:ff:fe53:7e69/64 scope link \       valid_lft forever preferred_lft forever


[root@controller-0 ~]# ip r
default via 10.0.0.1 dev br-ex 
10.0.0.0/24 dev br-ex proto kernel scope link src 10.0.0.108 
169.254.169.254 via 192.168.24.1 dev eth0 
172.17.1.0/24 dev vlan20 proto kernel scope link src 172.17.1.29 
172.17.2.0/24 dev vlan50 proto kernel scope link src 172.17.2.24 
172.17.3.0/24 dev vlan30 proto kernel scope link src 172.17.3.14 
172.17.4.0/24 dev vlan40 proto kernel scope link src 172.17.4.15 
172.31.0.0/24 dev docker0 proto kernel scope link src 172.31.0.1 
192.168.24.0/24 dev eth0 proto kernel scope link src 192.168.24.15 
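
Given the routing table above, a quick way to confirm which device the kernel would use for a failing destination is `ip route get` (a generic sketch, not taken from the report; on controller-0 the destinations of interest would be the 10.0.0.x addresses, which should resolve to br-ex):

```shell
# Ask the kernel which route/device a destination would use.
# On controller-0, `ip route get 10.0.0.105` should show "dev br-ex";
# here we query the loopback address so the example runs anywhere.
ip route get 127.0.0.1
```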


ping controller-0 -> compute-0
[root@controller-0 ~]# ping -c 3 10.0.0.105
PING 10.0.0.105 (10.0.0.105) 56(84) bytes of data.
From 10.0.0.108 icmp_seq=1 Destination Host Unreachable
From 10.0.0.108 icmp_seq=2 Destination Host Unreachable
From 10.0.0.108 icmp_seq=3 Destination Host Unreachable


ping controller-0 -> undercloud-0
[root@controller-0 ~]# ping -c 3 10.0.0.13
PING 10.0.0.13 (10.0.0.13) 56(84) bytes of data.
From 10.0.0.108 icmp_seq=1 Destination Host Unreachable
From 10.0.0.108 icmp_seq=2 Destination Host Unreachable
From 10.0.0.108 icmp_seq=3 Destination Host Unreachable


ping controller-0 -> host (physical server controller-0 is a VM on)
[root@controller-0 ~]# ping -c 3 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
From 10.0.0.108 icmp_seq=1 Destination Host Unreachable
From 10.0.0.108 icmp_seq=2 Destination Host Unreachable
From 10.0.0.108 icmp_seq=3 Destination Host Unreachable


the same problem exists on the other controllers
(we don't use 10.0.0.x IPs on computes)





Version-Release number of selected component (if applicable):
osp14 (puddle 2018-08-23.3) + opendaylight-8.3.0-3


How reproducible:
100%


Steps to Reproduce:
1.
2.
3.

Actual results:
ping not working


Expected results:
ping to work


Additional info:

[root@controller-0 ~]# ovs-vsctl show
d45f7d11-db46-48c0-a7ab-f7d468d85869
    Manager "tcp:172.17.1.29:6640"
        is_connected: true
    Manager "tcp:172.17.1.10:6640"
        is_connected: true
    Manager "tcp:172.17.1.21:6640"
        is_connected: true
    Manager "ptcp:6639:127.0.0.1"
    Bridge br-isolated
        fail_mode: standalone
        Port "vlan40"
            tag: 40
            Interface "vlan40"
                type: internal
        Port br-isolated
            Interface br-isolated
                type: internal
        Port "eth1"
            Interface "eth1"
        Port "vlan20"
            tag: 20
            Interface "vlan20"
                type: internal
        Port "vlan30"
            tag: 30
            Interface "vlan30"
                type: internal
        Port "vlan50"
            tag: 50
            Interface "vlan50"
                type: internal
    Bridge br-ex
        fail_mode: standalone
        Port "eth2"
            Interface "eth2"
        Port br-ex
            Interface br-ex
                type: internal
        Port br-ex-int-patch
            Interface br-ex-int-patch
                type: patch
                options: {peer=br-ex-patch}
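
One thing worth checking in output like the above is how many ODL managers report is_connected: true. A small sketch (not part of the report; it reads a saved copy of the snippet above, while on a live node you would pipe `ovs-vsctl show` straight into grep):

```shell
# Count connected OVSDB managers in saved `ovs-vsctl show` output.
# The snippet is the manager section captured in this bug; the ptcp
# local manager has no is_connected line, so it is not counted.
cat > /tmp/ovs_show.txt <<'EOF'
    Manager "tcp:172.17.1.29:6640"
        is_connected: true
    Manager "tcp:172.17.1.10:6640"
        is_connected: true
    Manager "tcp:172.17.1.21:6640"
        is_connected: true
    Manager "ptcp:6639:127.0.0.1"
EOF
grep -c 'is_connected: true' /tmp/ovs_show.txt
```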

Comment 1 Mike Kolesnik 2018-09-06 13:13:59 UTC
Arie,

Are you seeing this on your OSP 14 CI jobs as well?

This seems to me like a general deployment issue which might not be related to ODL.

Comment 2 Waldemar Znoinski 2018-09-06 13:34:00 UTC
more observations:

1. there's no communication from overcloud nodes to anything in 10.0.0.0/24 (e.g. the undercloud), so tempest can't even start
2. the ovs-vswitchd process dies on controllers; from its logfile:
...

2018-09-06T12:41:42.473Z|00056|connmgr|INFO|br-isolated: added service controller "punix:/var/run/openvswitch/br-isolated.mgmt"
2018-09-06T12:41:42.503Z|00057|rconn|INFO|br-int<->tcp:172.17.1.29:6653: connected
2018-09-06T12:41:42.520Z|00058|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.0
2018-09-06T12:41:43.012Z|00059|connmgr|INFO|br-int<->tcp:172.17.1.29:6653: sending OFPGMFC_GROUP_EXISTS error reply to OFPT_GROUP_MOD message
2018-09-06T12:41:43.167Z|00060|rconn|INFO|br-int<->tcp:172.17.1.21:6653: connection timed out
2018-09-06T12:41:43.167Z|00061|rconn|INFO|br-int<->tcp:172.17.1.21:6653: waiting 1 seconds before reconnect
2018-09-06T12:41:43.167Z|00062|rconn|INFO|br-int<->tcp:172.17.1.10:6653: connection timed out
2018-09-06T12:41:43.167Z|00063|rconn|INFO|br-int<->tcp:172.17.1.10:6653: waiting 1 seconds before reconnect
2018-09-06T12:41:44.166Z|00064|rconn|INFO|br-int<->tcp:172.17.1.21:6653: connecting...
2018-09-06T12:41:44.166Z|00065|rconn|INFO|br-int<->tcp:172.17.1.10:6653: connecting...
2018-09-06T12:41:45.166Z|00066|rconn|INFO|br-int<->tcp:172.17.1.21:6653: connection timed out
2018-09-06T12:41:45.166Z|00067|rconn|INFO|br-int<->tcp:172.17.1.21:6653: waiting 2 seconds before reconnect
2018-09-06T12:41:45.166Z|00068|rconn|INFO|br-int<->tcp:172.17.1.10:6653: connection timed out
2018-09-06T12:41:45.166Z|00069|rconn|INFO|br-int<->tcp:172.17.1.10:6653: waiting 2 seconds before reconnect
2018-09-06T12:41:47.167Z|00070|rconn|INFO|br-int<->tcp:172.17.1.21:6653: connecting...
2018-09-06T12:41:47.167Z|00071|rconn|INFO|br-int<->tcp:172.17.1.10:6653: connecting...
2018-09-06T12:41:48.641Z|00001|util(handler29)|EMER|./include/openvswitch/list.h:261: assertion !ovs_list_is_empty(list) failed in ovs_list_back()


3. there's no communication between the overcloud nodes themselves on the vlan10/20/40/50 subnets; communication on vlan30 works
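
Since ovs-vswitchd dies silently between the log lines above, a quick liveness check on each controller helps correlate the connectivity loss with the crash. A hypothetical sketch (not part of the report; the restart command is commented out because it needs root on the node):

```shell
# Report whether ovs-vswitchd is alive; useful for spotting the crash
# described above. -x matches the exact process name.
check_vswitchd() {
  if pgrep -x ovs-vswitchd >/dev/null 2>&1; then
    echo "ovs-vswitchd running"
  else
    echo "ovs-vswitchd dead"
    # systemctl restart openvswitch   # requires root on the node
  fi
}
check_vswitchd
```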

Comment 3 Waldemar Znoinski 2018-09-07 13:26:55 UTC
after looking at it with Sridhar it looks like:

1. openvswitch 2.10 (in OSP14) behaves differently from the previously used 2.9 (OSP13); 2.10 dies with:

util(handler28)|EMER|./include/openvswitch/list.h:261: assertion !ovs_list_is_empty(list) failed in ovs_list_back()

reported as https://bugzilla.redhat.com/show_bug.cgi?id=1626488

2. there are "Unexpected exceptions" due to features in ovs2.10 not yet handled by oxygen: https://bugzilla.redhat.com/show_bug.cgi?id=1626497 (may not be directly related to this external connectivity issue)


after restarting ovs-vswitchd on the controllers, external connectivity works for a while; then ovs-vswitchd dies again (because of bug 1. above) and external connectivity is lost again

as a test we've tried installing ovs 2.9 (from OSP13) instead of 2.10 (OSP14) and everything works fine, even after a longer period of time
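
The workaround boils down to checking which openvswitch series a node is running and downgrading if it is on 2.10. An illustrative sketch (the version string is hard-coded from this bug; on a node it would come from `rpm -q --qf '%{VERSION}\n' openvswitch`):

```shell
# Decide whether the installed openvswitch needs the 2.9 downgrade
# used as the workaround above. Version hard-coded for illustration.
installed="2.10.0"
case "$installed" in
  2.9.*) echo "ok: openvswitch $installed" ;;
  *)     echo "downgrade needed: $installed -> 2.9.x" ;;
esac
```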

Comment 4 Waldemar Znoinski 2018-09-17 12:34:45 UTC

*** This bug has been marked as a duplicate of bug 1626488 ***