Description of problem:

We are using RHOSP7 GA with Jiri and Marios's patch (https://github.com/jistr/tripleo-bigswitch-temporary-setup). RHOSP7 sets l3_ha = true by default. However, we have noticed that when we attach a network to a router, the gateway namespace is allocated on a controller node that does not match the record in the Neutron DB. This happens very frequently and makes L3 almost unusable.

The following is one example. Create a router and a network, then attach the network to the router.

1. Neutron reports that the gateway IP 1.1.1.1 is on controller-1:

[stack@c5220-01 ~]$ neutron port-show 3306c360-5a3d-4a08-aa92-017498758963
+-----------------------+--------------------------------------------------------------------------------+
| Field                 | Value                                                                          |
+-----------------------+--------------------------------------------------------------------------------+
| admin_state_up        | True                                                                           |
| allowed_address_pairs |                                                                                |
| binding:host_id       | overcloud-controller-1.localdomain                                             |
| binding:profile       | {}                                                                             |
| binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                 |
| binding:vif_type      | ovs                                                                            |
| binding:vnic_type     | normal                                                                         |
| device_id             | 934f0b90-2d98-4d54-b9ca-5222aac2199d                                           |
| device_owner          | network:router_interface                                                       |
| extra_dhcp_opts       |                                                                                |
| fixed_ips             | {"subnet_id": "463c2f0c-5d56-4abb-8b30-8450d8306f46", "ip_address": "1.1.1.1"} |
| id                    | 3306c360-5a3d-4a08-aa92-017498758963                                           |
| mac_address           | fa:16:3e:72:34:4c                                                              |
| name                  |                                                                                |
| network_id            | 98f125b6-6d4d-4417-a0b3-e8d9ff530d6f                                           |
| security_groups       |                                                                                |
| status                | ACTIVE                                                                         |
| tenant_id             | 4ef11838925940eb9d177ae9345711ee                                               |
+-----------------------+--------------------------------------------------------------------------------+

2.
However, the gateway IP is actually on controller-2:

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns exec qrouter-934f0b90-2d98-4d54-b9ca-5222aac2199d ifconfig
ha-6d47f13a-b7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.192.6  netmask 255.255.192.0  broadcast 169.254.255.255
        inet6 fe80::f816:3eff:fe43:9b80  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:43:9b:80  txqueuelen 1000  (Ethernet)
        RX packets 20  bytes 1638 (1.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 309  bytes 16926 (16.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qg-22431202-eb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.8.87.25  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::f816:3eff:febd:56ad  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:bd:56:ad  txqueuelen 1000  (Ethernet)
        RX packets 36  bytes 2746 (2.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 43  bytes 2890 (2.8 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qr-3306c360-5a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 1.1.1.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::f816:3eff:fe72:344c  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:72:34:4c  txqueuelen 1000  (Ethernet)
        RX packets 95  bytes 5856 (5.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 90  bytes 4200 (4.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

3.
On controller-1, there is no such IP:

[heat-admin@overcloud-controller-1 ~]$ sudo ip netns exec qrouter-934f0b90-2d98-4d54-b9ca-5222aac2199d ifconfig
ha-7ff9abd2-bd: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.192.5  netmask 255.255.192.0  broadcast 169.254.255.255
        inet6 fe80::f816:3eff:fe9d:275c  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:9d:27:5c  txqueuelen 1000  (Ethernet)
        RX packets 321  bytes 19678 (19.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12  bytes 1008 (1008.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qg-22431202-eb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fa:16:3e:bd:56:ad  txqueuelen 1000  (Ethernet)
        RX packets 42  bytes 3360 (3.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 110 (110.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qr-3306c360-5a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fa:16:3e:72:34:4c  txqueuelen 1000  (Ethernet)
        RX packets 105  bytes 6456 (6.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 110 (110.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

4.
On controller-0, there is no such IP:

[heat-admin@overcloud-controller-0 ~]$ sudo ip netns exec qrouter-934f0b90-2d98-4d54-b9ca-5222aac2199d ifconfig
ha-8dccf24a-2e: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.192.4  netmask 255.255.192.0  broadcast 169.254.255.255
        inet6 fe80::f816:3eff:fe98:83dd  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:98:83:dd  txqueuelen 1000  (Ethernet)
        RX packets 1140  bytes 68618 (67.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12  bytes 1008 (1008.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qg-22431202-eb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fa:16:3e:bd:56:ad  txqueuelen 1000  (Ethernet)
        RX packets 42  bytes 3244 (3.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 110 (110.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

qr-3306c360-5a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether fa:16:3e:72:34:4c  txqueuelen 1000  (Ethernet)
        RX packets 1753  bytes 105336 (102.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1  bytes 110 (110.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
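The per-controller check above can be scripted. The helper below is a minimal sketch that only greps an ifconfig dump for the gateway IP (1.1.1.1, from the port's fixed_ips); the loop in the comment is hypothetical and assumes heat-admin SSH access to the controllers, as in the sessions above:

```shell
# holds_gw_ip: succeed if the ifconfig output on stdin contains the gateway IP.
holds_gw_ip() {
    grep -q 'inet 1\.1\.1\.1 '
}

# On a live deployment one could then run something like (hypothetical loop,
# assuming heat-admin SSH access to the controllers):
#   for c in 0 1 2; do
#     ssh heat-admin@overcloud-controller-$c \
#       sudo ip netns exec qrouter-934f0b90-2d98-4d54-b9ca-5222aac2199d ifconfig \
#       | holds_gw_ip && echo "controller-$c holds 1.1.1.1"
#   done
```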
I can recreate this on my dev setup, as described below, but I am still not clear on the exact problem (for me, connectivity was not an issue until after a service restart/reboot, and even then it eventually came back anyway). So I guess questions for moving forward:

1. How did you create the router and network, attach the interface, etc.? (exact commands please are best)
2. Apart from the incorrect status reported, is connectivity otherwise OK? Can you ping the router interface?
3. Does "neutron l3-agent-list-hosting-router <router_name>" also lie?
4. More info about the exact problem, please. I mean, how is the specific issue induced? In your report you say "This happens very frequently": what does, the incorrect report of status or a loss of connectivity? Is it a random occurrence? Is the port unusable immediately upon creation? I guess you meant loss of connectivity, because of "makes l3 almost not usable". Is time a factor here; does it eventually recover after a few minutes, as described below?

More context and what I tried below:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 2

Created a router, port (float-ip) and instance as in [4]. Instance + connectivity healthy: ssh fedora.2.46 (floatip) and from it wget google.com fine.

I can confirm that, as BigSwitch reports, binding:host_id [1] is a lie in the port-show for the created router IP; in my setup:

| binding:host_id | overcloud-controller-1.localdomai

However, at least "neutron l3-agent-list-hosting-router default-router" gives the truth about that address:

| 9bd5a4a0-26f8-46d7-b6ab-7eea5eb3c28c | overcloud-controller-0.localdomain | True | :-) | active |

showing it active on controller-0 (which I verified on the node itself with "for i in `ip netns list` ; do ip netns exec $i ifconfig; done;").
I also noticed that, whilst the router iface and the VM floating IP were up and pinging fine, the status of that router iface was DOWN. I came across this bug [2] and a couple of other reports/questions [3] that suggest this is a failure to update the reported status, but not the actual status, of the port (i.e. the port *is* responding fine).

Whilst debugging, I tried "pcs resource restart neutron-server" on the controllers (restarts all neutron-* services in sequence) and later a full nova reboot. Connectivity was lost to the VM and gateway for a good 3-5 mins but eventually recovered.

As a final note, I just checked again and it seems the neutron port-show does catch up eventually. I have the port showing as ACTIVE and on the correct host:

| binding:host_id | overcloud-controller-0.localdomain |
| status          | ACTIVE                             |

thanks, marios

[1] http://developer.openstack.org/api-ref-networking-v2-ext.html#listPorts
[2] https://bugs.launchpad.net/neutron/+bug/1192883
[3] https://ask.openstack.org/en/question/25234/one-router-port-is-always-down/
[4] glance image-create --name user --is-public True --disk-format qcow2 --container-format bare --file fedora-user.qcow2

NETWORK_CIDR='10.0.0.0/8'
OVERCLOUD_NAMESERVER='8.8.8.8'
FLOATING_IP_CIDR='192.0.2.0/24'
FLOATING_IP_START='192.0.2.45'
FLOATING_IP_END='192.0.2.64'
BM_NETWORK_GATEWAY='192.0.2.1'
NETWORK_JSON=$(mktemp)
jq "." <<EOF > $NETWORK_JSON
{
    "float": {
        "cidr": "$NETWORK_CIDR",
        "name": "default-net",
        "nameserver": "$OVERCLOUD_NAMESERVER"
    },
    "external": {
        "name": "ext-net",
        "cidr": "$FLOATING_IP_CIDR",
        "allocation_start": "$FLOATING_IP_START",
        "allocation_end": "$FLOATING_IP_END",
        "gateway": "$BM_NETWORK_GATEWAY"
    }
}
EOF
setup-neutron -n $NETWORK_JSON
neutron net-list
NET_ID=$(neutron net-list -f csv --quote none | grep default-net | cut -d, -f1)
if ! nova keypair-show default 2>/dev/null; then tripleo user-config; fi
nova boot --poll --key-name default --flavor m1.demo --image user --nic net-id=$NET_ID demo
PRIVATEIP=$(nova list | grep demo | awk -F"default-net=" '{print $2}' | awk '{print $1}')
tripleo wait_for 10 5 neutron port-list -f csv -c id --quote none \| grep id
PORT=$(neutron port-list | grep $PRIVATEIP | cut -d'|' -f2)
FLOATINGIP=$(neutron floatingip-create ext-net --port-id "${PORT//[[:space:]]/}" | awk '$2=="floating_ip_address" {print $4}')
SECGROUPID=$(nova secgroup-list | grep default | cut -d ' ' -f2)
neutron security-group-rule-create $SECGROUPID --protocol icmp \
    --direction ingress --port-range-min 8 || true
neutron security-group-rule-create $SECGROUPID --protocol tcp \
    --direction ingress --port-range-min 22 --port-range-max 22 || true
1. We use the Horizon GUI to create the router and network and to add the network interface to that router.
2. The connectivity is not OK, because Neutron reports the wrong state. The way the Big Switch ML2 plugin works is that when a router port is created in Neutron, the plugin sends a REST call to the Big Switch controller to prepare flow entries for that port. In this case, the flow entries were wrong.
3. neutron l3-agent-list-hosting-router will list all three OpenStack controllers, because the router namespace is brought up on all of them.
4. This problem does not happen all the time. Where the IP is actually located depends on VRRP. Whenever the Neutron DB does not match the actual port location, connectivity becomes a problem.
(In reply to bigswitch from comment #4)
> 1. We use horizon gui to create router, network and add network interface to
> that router.
> 2. The connectivity is not ok because neutron reports the wrong state. The
> way big switch ml2 plugin works is that when a router port is created in
> neutron, the plugin will send a rest call to big switch controller to
> prepare flow entries for that port. In this case, the flow entries were
> wrong.
> 3. neutron l3-agent-list-hosting-router will list all the three openstack
> controllers, because the router namespace is brought up on all openstack
> controllers.

yes but only one of them is active, right? like:

[stack@instack ~]$ neutron l3-agent-list-hosting-router default-router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 9bd5a4a0-26f8-46d7-b6ab-7eea5eb3c28c | overcloud-controller-0.localdomain | True           | :-)   | active   |
| 503cefd8-2a61-496e-835c-745cffc2a805 | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| cbf8f100-2101-4fa9-be32-21f3d868ca40 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

> 4. This problem does not happen all the time. Where the IP is actually
> located depends on VRRP. Whenever neutron db does not match the actual port
> location, connectivity becomes a problem.
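As a quick cross-check, the active host from the agent listing can be compared against the port's binding:host_id. The awk helpers below are a minimal sketch that parses the two table layouts pasted in this bug (the column positions are assumed from those pastes; on a live cloud you would pipe in the real neutron output instead of the sample rows):

```shell
# Print the host whose ha_state column is "active" in
# "neutron l3-agent-list-hosting-router <router>" output (read from stdin).
active_l3_host() {
    awk -F'|' '$6 ~ /active/ { gsub(/ /, "", $3); print $3 }'
}

# Print binding:host_id from "neutron port-show <port>" output (stdin).
port_binding_host() {
    awk -F'|' '$2 ~ /binding:host_id/ { gsub(/ /, "", $3); print $3 }'
}

# Sample rows taken from this bug report:
agents='| 9bd5a4a0 | overcloud-controller-0.localdomain | True | :-) | active |'
port='| binding:host_id | overcloud-controller-1.localdomain |'

active=$(printf '%s\n' "$agents" | active_l3_host)
bound=$(printf '%s\n' "$port" | port_binding_host)
# For this sample data the two disagree, so a MISMATCH line is printed.
[ "$active" = "$bound" ] || echo "MISMATCH: active=$active, binding:host_id=$bound"
```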
@Marios, yes, you are right. The ML2 plugin should be able to check who is active and configure the Big Switch controller accordingly. However, the current ML2 plugin doesn't have this capability. I will work with my colleague to figure out a solution. At the same time, the fact that neutron port-show doesn't match the actual port allocation is still a problem.
For now, we work around this problem by setting l3_ha = False and allow_automatic_l3agent_failover = True. At the same time, we have opened a Neutron upstream bug to track it: https://bugs.launchpad.net/neutron/+bug/1494866. We can close this bug here.
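For reference, the workaround corresponds to the following neutron.conf settings on the controllers (a sketch; these are the Kilo-era option names, applied with a restart of neutron-server and the L3 agents). Note it trades VRRP-based failover for the slower agent-rescheduling mechanism:

```ini
[DEFAULT]
# Disable VRRP-based HA routers; the Big Switch ML2 plugin cannot yet tell
# which VRRP instance is master, so its flow entries can point at the wrong node.
l3_ha = False
# Instead, let Neutron reschedule routers away from a dead L3 agent.
allow_automatic_l3agent_failover = True
```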
Re-opening to start a discussion on backporting the fix (https://review.openstack.org/#/c/141114/) as reported in the upstream bug https://bugs.launchpad.net/neutron/+bug/1494866.
*** Bug 1263520 has been marked as a duplicate of this bug. ***
The fix was backported a while ago and will be available in the OSP 7 z3 release.
*** Bug 1287809 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 1253953 ***