Description of problem:

We are using RHOSP7 GA with Jiri and Marios's patch (https://github.com/jistr/tripleo-bigswitch-temporary-setup). RHOSP7 sets l3_ha = true by default. However, we have noticed that when we attach a network to a router, the gateway namespace is allocated on a controller node that does not match the record in the Neutron DB. This happens very frequently and makes L3 almost unusable.

The following is one example. Create a router and a network, then attach the network to the router.

1. Neutron reports that the gateway IP 1.1.1.1 is on controller-1:

[stack@c5220-01 ~]$ neutron port-show 3306c360-5a3d-4a08-aa92-017498758963
+-----------------------+--------------------------------------------------------------------------------+
| Field                 | Value                                                                          |
+-----------------------+--------------------------------------------------------------------------------+
| admin_state_up        | True                                                                           |
| allowed_address_pairs |                                                                                |
| binding:host_id       | overcloud-controller-1.localdomain                                             |
| binding:profile       | {}                                                                             |
| binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                 |
| binding:vif_type      | ovs                                                                            |
| binding:vnic_type     | normal                                                                         |
| device_id             | 934f0b90-2d98-4d54-b9ca-5222aac2199d                                           |
| device_owner          | network:router_interface                                                       |
| extra_dhcp_opts       |                                                                                |
| fixed_ips             | {"subnet_id": "463c2f0c-5d56-4abb-8b30-8450d8306f46", "ip_address": "1.1.1.1"} |
| id                    | 3306c360-5a3d-4a08-aa92-017498758963                                           |
| mac_address           | fa:16:3e:72:34:4c                                                              |
| name                  |                                                                                |
| network_id            | 98f125b6-6d4d-4417-a0b3-e8d9ff530d6f                                           |
| security_groups       |                                                                                |
| status                | ACTIVE                                                                         |
| tenant_id             | 4ef11838925940eb9d177ae9345711ee                                               |
+-----------------------+--------------------------------------------------------------------------------+

2.
However, the gateway IP is actually on controller-2:

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns exec qrouter-934f0b90-2d98-4d54-b9ca-5222aac2199d ifconfig
ha-6d47f13a-b7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.192.6  netmask 255.255.192.0  broadcast 169.254.255.255
        inet6 fe80::f816:3eff:fe43:9b80  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:43:9b:80  txqueuelen 1000  (Ethernet)
        RX packets 20  bytes 1638 (1.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 309  bytes 16926 (16.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qg-22431202-eb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.8.87.25  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::f816:3eff:febd:56ad  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:bd:56:ad  txqueuelen 1000  (Ethernet)
        RX packets 36  bytes 2746 (2.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 43  bytes 2890 (2.8 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qr-3306c360-5a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 1.1.1.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::f816:3eff:fe72:344c  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:72:34:4c  txqueuelen 1000  (Ethernet)
        RX packets 95  bytes 5856 (5.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 90  bytes 4200 (4.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

3.
On controller-1, there is no such IP:

[heat-admin@overcloud-controller-1 ~]$ sudo ip netns exec qrouter-934f0b90-2d98-4d54-b9ca-5222aac2199d ifconfig
ha-7ff9abd2-bd: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.192.5  netmask 255.255.192.0  broadcast 169.254.255.255
        inet6 fe80::f816:3eff:fe9d:275c  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:9d:27:5c  txqueuelen 1000  (Ethernet)
        RX packets 321  bytes 19678 (19.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12  bytes 1008 (1008.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qg-22431202-eb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fa:16:3e:bd:56:ad  txqueuelen 1000  (Ethernet)
        RX packets 42  bytes 3360 (3.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 110 (110.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qr-3306c360-5a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fa:16:3e:72:34:4c  txqueuelen 1000  (Ethernet)
        RX packets 105  bytes 6456 (6.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 110 (110.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

4.
On controller-0, there is no such IP:

[heat-admin@overcloud-controller-0 ~]$ sudo ip netns exec qrouter-934f0b90-2d98-4d54-b9ca-5222aac2199d ifconfig
ha-8dccf24a-2e: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.192.4  netmask 255.255.192.0  broadcast 169.254.255.255
        inet6 fe80::f816:3eff:fe98:83dd  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:98:83:dd  txqueuelen 1000  (Ethernet)
        RX packets 1140  bytes 68618 (67.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12  bytes 1008 (1008.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qg-22431202-eb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fa:16:3e:bd:56:ad  txqueuelen 1000  (Ethernet)
        RX packets 42  bytes 3244 (3.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 110 (110.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qr-3306c360-5a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fa:16:3e:72:34:4c  txqueuelen 1000  (Ethernet)
        RX packets 1753  bytes 105336 (102.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 110 (110.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
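The per-controller check above can be scripted. The helper below is a minimal sketch that only greps an ifconfig dump for the gateway IP (1.1.1.1, from the port's fixed_ips); the loop in the comment is hypothetical and assumes heat-admin SSH access to the controllers, as in the sessions above:

```shell
# holds_gw_ip: succeed if the ifconfig output on stdin contains the gateway IP.
holds_gw_ip() {
    grep -q 'inet 1\.1\.1\.1 '
}

# On a live deployment one could then run something like (hypothetical loop,
# assuming heat-admin SSH access to the controllers):
#   for c in 0 1 2; do
#     ssh heat-admin@overcloud-controller-$c \
#       sudo ip netns exec qrouter-934f0b90-2d98-4d54-b9ca-5222aac2199d ifconfig \
#       | holds_gw_ip && echo "controller-$c holds 1.1.1.1"
#   done
```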
I can recreate this on my dev setup, as described below, but I am still not clear on the exact problem (for me, connectivity was not an issue until after a service restart/reboot, and even then it eventually came back anyway). So I guess questions for moving forward:

1. How did you create the router and network, attach the interface, etc.? (exact commands please are best)
2. Apart from the incorrect status reported, is connectivity otherwise OK? Can you ping the router interface?
3. Does "neutron l3-agent-list-hosting-router <router_name>" also lie?
4. More info about the exact problem, please. I mean, how is the specific issue induced? In your report you say "This happens very frequently": what does, the incorrect report of status or a loss of connectivity? Is it a random occurrence? Is the port unusable immediately upon creation? I guess you meant loss of connectivity, because of "makes l3 almost not usable". Is time a factor here; does it eventually recover after a few minutes, as described below?

More context and what I tried below:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 2

Created a router, port (float-ip) and instance as in [4]. Instance + connectivity healthy: ssh fedora.2.46 (floatip) and from it wget google.com fine.

I can confirm that, as BigSwitch reports, binding:host_id [1] is a lie in the port-show for the created router IP; in my setup:

| binding:host_id | overcloud-controller-1.localdomai

However, at least "neutron l3-agent-list-hosting-router default-router" gives the truth about that address:

| 9bd5a4a0-26f8-46d7-b6ab-7eea5eb3c28c | overcloud-controller-0.localdomain | True | :-) | active |

showing it active on controller-0 (which I verified on the node itself with "for i in `ip netns list` ; do ip netns exec $i ifconfig; done;").
I also noticed that, whilst the router iface and the VM floating IP were up and pinging fine, the status of that router iface was DOWN. I came across this bug [2] and a couple of other reports/questions [3] that suggest this is a failure to update the reported status, but not the actual status, of the port (i.e. the port *is* responding fine).

Whilst debugging, I tried "pcs resource restart neutron-server" on the controllers (restarts all neutron-* services in sequence) and later a full nova reboot. Connectivity was lost to the VM and gateway for a good 3-5 mins but eventually recovered.

As a final note, I just checked again and it seems the neutron port-show does catch up eventually. I have the port showing as ACTIVE and on the correct host:

| binding:host_id | overcloud-controller-0.localdomain |
| status          | ACTIVE                             |

thanks, marios

[1] http://developer.openstack.org/api-ref-networking-v2-ext.html#listPorts
[2] https://bugs.launchpad.net/neutron/+bug/1192883
[3] https://ask.openstack.org/en/question/25234/one-router-port-is-always-down/
[4] glance image-create --name user --is-public True --disk-format qcow2 --container-format bare --file fedora-user.qcow2

NETWORK_CIDR='10.0.0.0/8'
OVERCLOUD_NAMESERVER='8.8.8.8'
FLOATING_IP_CIDR='192.0.2.0/24'
FLOATING_IP_START='192.0.2.45'
FLOATING_IP_END='192.0.2.64'
BM_NETWORK_GATEWAY='192.0.2.1'
NETWORK_JSON=$(mktemp)
jq "." <<EOF > $NETWORK_JSON
{
    "float": {
        "cidr": "$NETWORK_CIDR",
        "name": "default-net",
        "nameserver": "$OVERCLOUD_NAMESERVER"
    },
    "external": {
        "name": "ext-net",
        "cidr": "$FLOATING_IP_CIDR",
        "allocation_start": "$FLOATING_IP_START",
        "allocation_end": "$FLOATING_IP_END",
        "gateway": "$BM_NETWORK_GATEWAY"
    }
}
EOF
setup-neutron -n $NETWORK_JSON
neutron net-list
NET_ID=$(neutron net-list -f csv --quote none | grep default-net | cut -d, -f1)
if ! nova keypair-show default 2>/dev/null; then tripleo user-config; fi
nova boot --poll --key-name default --flavor m1.demo --image user --nic net-id=$NET_ID demo
PRIVATEIP=$(nova list | grep demo | awk -F"default-net=" '{print $2}' | awk '{print $1}')
tripleo wait_for 10 5 neutron port-list -f csv -c id --quote none \| grep id
PORT=$(neutron port-list | grep $PRIVATEIP | cut -d'|' -f2)
FLOATINGIP=$(neutron floatingip-create ext-net --port-id "${PORT//[[:space:]]/}" | awk '$2=="floating_ip_address" {print $4}')
SECGROUPID=$(nova secgroup-list | grep default | cut -d ' ' -f2)
neutron security-group-rule-create $SECGROUPID --protocol icmp \
    --direction ingress --port-range-min 8 || true
neutron security-group-rule-create $SECGROUPID --protocol tcp \
    --direction ingress --port-range-min 22 --port-range-max 22 || true
1. We use the Horizon GUI to create the router and network and to add the network interface to that router.
2. The connectivity is not OK, because Neutron reports the wrong state. The way the Big Switch ML2 plugin works is that when a router port is created in Neutron, the plugin sends a REST call to the Big Switch controller to prepare flow entries for that port. In this case, the flow entries were wrong.
3. neutron l3-agent-list-hosting-router will list all three OpenStack controllers, because the router namespace is brought up on all of them.
4. This problem does not happen all the time. Where the IP is actually located depends on VRRP. Whenever the Neutron DB does not match the actual port location, connectivity becomes a problem.
(In reply to bigswitch from comment #4)
> 1. We use horizon gui to create router, network and add network interface to
> that router.
> 2. The connectivity is not ok because neutron reports the wrong state. The
> way big switch ml2 plugin works is that when a router port is created in
> neutron, the plugin will send a rest call to big switch controller to
> prepare flow entries for that port. In this case, the flow entries were
> wrong.
> 3. neutron l3-agent-list-hosting-router will list all the three openstack
> controllers, because the router namespace is brought up on all openstack
> controllers.

yes but only one of them is active, right? like:

[stack@instack ~]$ neutron l3-agent-list-hosting-router default-router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 9bd5a4a0-26f8-46d7-b6ab-7eea5eb3c28c | overcloud-controller-0.localdomain | True           | :-)   | active   |
| 503cefd8-2a61-496e-835c-745cffc2a805 | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| cbf8f100-2101-4fa9-be32-21f3d868ca40 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

> 4. This problem does not happen all the time. Where the IP is actually
> located depends on VRRP. Whenever neutron db does not match the actual port
> location, connectivity becomes a problem.
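As a quick cross-check, the active host from the agent listing can be compared against the port's binding:host_id. The awk helpers below are a minimal sketch that parses the two table layouts pasted in this bug (the column positions are assumed from those pastes; on a live cloud you would pipe in the real neutron output instead of the sample rows):

```shell
# Print the host whose ha_state column is "active" in
# "neutron l3-agent-list-hosting-router <router>" output (read from stdin).
active_l3_host() {
    awk -F'|' '$6 ~ /active/ { gsub(/ /, "", $3); print $3 }'
}

# Print binding:host_id from "neutron port-show <port>" output (stdin).
port_binding_host() {
    awk -F'|' '$2 ~ /binding:host_id/ { gsub(/ /, "", $3); print $3 }'
}

# Sample rows taken from this bug report:
agents='| 9bd5a4a0 | overcloud-controller-0.localdomain | True | :-) | active |'
port='| binding:host_id | overcloud-controller-1.localdomain |'

active=$(printf '%s\n' "$agents" | active_l3_host)
bound=$(printf '%s\n' "$port" | port_binding_host)
# For this sample data the two disagree, so a MISMATCH line is printed.
[ "$active" = "$bound" ] || echo "MISMATCH: active=$active, binding:host_id=$bound"
```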
@Marios, yes, you are right. The ML2 plugin should be able to check who is active and configure the Big Switch controller accordingly. However, the current ML2 plugin doesn't have this capability. I will work with my colleague to figure out a solution. At the same time, the fact that neutron port-show doesn't match the actual port allocation is still a problem.
For now, we work around this problem by setting l3_ha = False and allow_automatic_l3agent_failover = True. At the same time, we have opened a Neutron upstream bug to track it: https://bugs.launchpad.net/neutron/+bug/1494866. We can close this bug here.
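For reference, the workaround corresponds to the following neutron.conf settings on the controllers (a sketch; these are the Kilo-era option names, applied with a restart of neutron-server and the L3 agents). Note it trades VRRP-based failover for the slower agent-rescheduling mechanism:

```ini
[DEFAULT]
# Disable VRRP-based HA routers; the Big Switch ML2 plugin cannot yet tell
# which VRRP instance is master, so its flow entries can point at the wrong node.
l3_ha = False
# Instead, let Neutron reschedule routers away from a dead L3 agent.
allow_automatic_l3agent_failover = True
```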
Re-opening to start a discussion on backporting the fix (https://review.openstack.org/#/c/141114/) as reported in the upstream bug https://bugs.launchpad.net/neutron/+bug/1494866.
*** Bug 1263520 has been marked as a duplicate of this bug. ***
The fix was backported a while ago and will be available in the OSP 7 z3 release.
*** Bug 1287809 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 1253953 ***