Bug 1260298
| Summary: | router allocation doesn't match the record in neutron db when l3_ha is true | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | bigswitch <rhosp-bugs-internal> |
| Component: | openstack-neutron | Assignee: | Assaf Muller <amuller> |
| Status: | CLOSED DUPLICATE | QA Contact: | Ofer Blaut <oblaut> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.0 (Kilo) | CC: | amuller, chrisw, jschluet, lpeer, mandreou, mburns, mcornea, nyechiel, rhel-osp-director-maint, tdunnon, yeylon |
| Target Milestone: | async | Keywords: | Reopened, ZStream |
| Target Release: | 7.0 (Kilo) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-12-16 17:16:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
bigswitch
2015-09-05 19:03:15 UTC
I can recreate this on my dev setup, as described below, but I am still not clear on the exact problem (for me connectivity was not an issue until after a service restart/reboot, and even then it eventually came back anyway). So I guess these are the questions for moving forward:

1. How did you create the router and network, attach them, etc.? (Exact commands are best, please.)
2. Apart from the incorrect status reported, is connectivity otherwise ok? Can you ping the router interface?
3. Does neutron l3-agent-list-hosting-router <router_name> also lie?
4. More info about the exact problem please. I mean, how is the specific issue induced? In your report you say "This happens very frequently" - what does, the incorrect report of status or a loss of connectivity? Is it a random occurrence? Is the port unusable immediately upon creation? I guess you meant loss of connectivity because of "makes l3 almost not usable". Is time a factor here, i.e. does it eventually recover after a few mins as described below?

More context and what I tried below:

    openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 2

Created a router port (float-ip) and instance like: [4]

Instance + connectivity healthy: ssh fedora.2.46 (floatip) and from it wget google.com works fine.

I can confirm that, as BigSwitch reports, binding:host_id [1] is a lie on the port show for the created router ip. In my setup:

    | binding:host_id | overcloud-controller-1.localdomai

However, at least neutron l3-agent-list-hosting-router default-router gives the truth about that address:

    | 9bd5a4a0-26f8-46d7-b6ab-7eea5eb3c28c | overcloud-controller-0.localdomain | True | :-) | active |

showing it active on controller-0 (which I verified on the node itself like "for i in `ip netns list` ; do ip netns exec $i ifconfig; done;").

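
The comparison described above (the host reported as active by neutron l3-agent-list-hosting-router versus the binding:host_id shown by neutron port-show) can be sketched as a small shell check. This is only an illustration, not part of the reported fix: the function names active_l3_host, binding_host and check_binding are made up here, and the functions assume the table layout shown in this report, reading saved CLI output rather than calling neutron themselves.

```shell
# Sketch: compare the HA-active L3 agent host with the binding:host_id of a
# router port. All function names are hypothetical; input is saved table
# output in the layout quoted in this bug report.

# Print the host of the row whose ha_state column is "active".
# Reads `neutron l3-agent-list-hosting-router <router>` output on stdin.
active_l3_host() {
  awk -F'|' 'NF >= 6 {
    state = $6; gsub(/[[:space:]]/, "", state)
    if (state == "active") { host = $3; gsub(/[[:space:]]/, "", host); print host }
  }'
}

# Print the value of the binding:host_id row.
# Reads `neutron port-show <port>` output on stdin.
binding_host() {
  awk -F'|' '{
    key = $2; gsub(/[[:space:]]/, "", key)
    if (key == "binding:host_id") { val = $3; gsub(/[[:space:]]/, "", val); print val }
  }'
}

# usage: check_binding <agent-list-output-file> <port-show-output-file>
# Returns non-zero when the recorded binding differs from the active host.
check_binding() {
  active=$(active_l3_host < "$1")
  bound=$(binding_host < "$2")
  if [ "$active" = "$bound" ]; then
    echo "OK: port is bound to the active host $active"
  else
    echo "MISMATCH: active=$active binding:host_id=$bound"
    return 1
  fi
}
```

To try it, save the output of the two neutron commands to files and pass them to check_binding; a MISMATCH line corresponds to the situation reported here, where the port record points at a standby controller.
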
I also noticed that whilst the router iface and the vm floating ip were up and pinging fine, the status of that router iface was DOWN. I came across this bug [2] and a couple of other reports/questions [3] that suggest this is a failure to update the reported status but not the actual status of the port (i.e. the port *is* responding fine).

Whilst debugging, I tried pcs resource restart neutron-server on the controllers (restarts all neutron-* services in sequence) and later a full nova reboot. Connectivity was lost to the vm and gateway for a good 3-5 mins but eventually recovered.

As a final note, I just checked again and it seems the neutron port-show does catch up eventually. I have the port showing as ACTIVE and on the correct host:

    | binding:host_id | overcloud-controller-0.localdomain |
    | status          | ACTIVE                             |

thanks, marios

[1] http://developer.openstack.org/api-ref-networking-v2-ext.html#listPorts
[2] https://bugs.launchpad.net/neutron/+bug/1192883
[3] https://ask.openstack.org/en/question/25234/one-router-port-is-always-down/
[4]

    glance image-create --name user --is-public True --disk-format qcow2 --container-format bare --file fedora-user.qcow2

    NETWORK_CIDR='10.0.0.0/8'
    OVERCLOUD_NAMESERVER='8.8.8.8'
    FLOATING_IP_CIDR='192.0.2.0/24'
    FLOATING_IP_START='192.0.2.45'
    FLOATING_IP_END='192.0.2.64'
    BM_NETWORK_GATEWAY='192.0.2.1'
    NETWORK_JSON=$(mktemp)
    jq "." <<EOF > $NETWORK_JSON
    {
        "float": {
            "cidr": "$NETWORK_CIDR",
            "name": "default-net",
            "nameserver": "$OVERCLOUD_NAMESERVER"
        },
        "external": {
            "name": "ext-net",
            "cidr": "$FLOATING_IP_CIDR",
            "allocation_start": "$FLOATING_IP_START",
            "allocation_end": "$FLOATING_IP_END",
            "gateway": "$BM_NETWORK_GATEWAY"
        }
    }
    EOF
    setup-neutron -n $NETWORK_JSON
    neutron net-list
    NET_ID=$(neutron net-list -f csv --quote none | grep default-net | cut -d, -f1)
    if ! nova keypair-show default 2>/dev/null; then tripleo user-config; fi
    nova boot --poll --key-name default --flavor m1.demo --image user --nic net-id=$NET_ID demo
    PRIVATEIP=$(nova list | grep demo | awk -F"default-net=" '{print $2}' | awk '{print $1}')
    tripleo wait_for 10 5 neutron port-list -f csv -c id --quote none \| grep id
    PORT=$(neutron port-list | grep $PRIVATEIP | cut -d'|' -f2)
    FLOATINGIP=$(neutron floatingip-create ext-net --port-id "${PORT//[[:space:]]/}" | awk '$2=="floating_ip_address" {print $4}')
    SECGROUPID=$(nova secgroup-list | grep default | cut -d ' ' -f2)
    neutron security-group-rule-create $SECGROUPID --protocol icmp \
        --direction ingress --port-range-min 8 || true
    neutron security-group-rule-create $SECGROUPID --protocol tcp \
        --direction ingress --port-range-min 22 --port-range-max 22 || true

1. We use horizon gui to create router, network and add network interface to that router.
2. The connectivity is not ok because neutron reports the wrong state. The way big switch ml2 plugin works is that when a router port is created in neutron, the plugin will send a rest call to big switch controller to prepare flow entries for that port. In this case, the flow entries were wrong.
3. neutron l3-agent-list-hosting-router will list all the three openstack controllers, because the router namespace is brought up on all openstack controllers.
4. This problem does not happen all the time. Where the IP is actually located depends on VRRP. Whenever neutron db does not match the actual port location, connectivity becomes a problem.

(In reply to bigswitch from comment #4)
> 1. We use horizon gui to create router, network and add network interface to that router.
> 2. The connectivity is not ok because neutron reports the wrong state. The way big switch ml2 plugin works is that when a router port is created in neutron, the plugin will send a rest call to big switch controller to prepare flow entries for that port. In this case, the flow entries were wrong.
> 3. neutron l3-agent-list-hosting-router will list all the three openstack controllers, because the router namespace is brought up on all openstack controllers.

yes but only one of them is active, right? like:

    [stack@instack ~]$ neutron l3-agent-list-hosting-router default-router
    +--------------------------------------+------------------------------------+----------------+-------+----------+
    | id                                   | host                               | admin_state_up | alive | ha_state |
    +--------------------------------------+------------------------------------+----------------+-------+----------+
    | 9bd5a4a0-26f8-46d7-b6ab-7eea5eb3c28c | overcloud-controller-0.localdomain | True           | :-)   | active   |
    | 503cefd8-2a61-496e-835c-745cffc2a805 | overcloud-controller-2.localdomain | True           | :-)   | standby  |
    | cbf8f100-2101-4fa9-be32-21f3d868ca40 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
    +--------------------------------------+------------------------------------+----------------+-------+----------+

> 4. This problem does not happen all the time. Where the IP is actually located depends on VRRP. Whenever neutron db does not match the actual port location, connectivity becomes a problem.

@Marios, yes, you are right. The ml2 plugin should be able to check who's active and configure the big switch controller accordingly. However, the current ml2 plugin doesn't have this capability. I will work with my colleague to figure out a solution. At the same time, neutron port-show not matching the actual port allocation is still a problem.

For now, we work around this problem by making l3_ha = False and allow_automatic_l3agent_failover = True. At the same time, we have opened an upstream neutron bug to track it: https://bugs.launchpad.net/neutron/+bug/1494866 We can close this bug here.

Re-opening to start a discussion on backporting the fix https://review.openstack.org/#/c/141114/ as reported in the upstream bug https://bugs.launchpad.net/neutron/+bug/1494866

*** Bug 1263520 has been marked as a duplicate of this bug. ***

The fix was backported a while ago and will be available in the OSP 7 z3 release.

*** Bug 1287809 has been marked as a duplicate of this bug. ***

*** This bug has been marked as a duplicate of bug 1253953 ***
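
For anyone relying on the workaround mentioned in the comments above (l3_ha = False and allow_automatic_l3agent_failover = True), a quick way to confirm that a given config file carries both settings might look like the sketch below. It assumes both options are set in the [DEFAULT] section of an ini-style neutron.conf (typically /etc/neutron/neutron.conf); the helper names conf_get and workaround_active are hypothetical and not part of any neutron tooling.

```shell
# Sketch: check whether the l3_ha workaround is present in a neutron.conf.
# Assumption: both options live in the [DEFAULT] section of the file.

# usage: conf_get <file> <key>
# Print the last value assigned to <key> inside [DEFAULT], ignoring comments.
conf_get() {
  awk -F'=' -v key="$2" '
    /^\[/ { in_default = ($0 ~ /^\[DEFAULT\]/) }
    in_default && $0 !~ /^[[:space:]]*#/ {
      k = $1; gsub(/[[:space:]]/, "", k)
      if (k == key) { v = $2; gsub(/[[:space:]]/, "", v); found = 1 }
    }
    END { if (found) print v }' "$1"
}

# usage: workaround_active <file>
# Succeeds only if l3_ha is false and automatic L3 agent failover is true.
workaround_active() {
  [ "$(conf_get "$1" l3_ha | tr 'A-Z' 'a-z')" = "false" ] &&
  [ "$(conf_get "$1" allow_automatic_l3agent_failover | tr 'A-Z' 'a-z')" = "true" ]
}
```

For example, `workaround_active /etc/neutron/neutron.conf && echo "workaround in place"` on each controller. The check only reads the file; it does not tell you whether neutron-server has been restarted since the change.
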