Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1502572

Summary: After update to OSP 10 release 4, floating IP addresses not added to router's qg interface when guests spawned
Product: Red Hat OpenStack
Reporter: Paul Needle <pneedle>
Component: openstack-neutron
Assignee: Brian Haley <bhaley>
Status: CLOSED ERRATA
QA Contact: Toni Freger <tfreger>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 10.0 (Newton)
CC: akaris, amuller, bhaley, bschmaus, chrisw, ealcaniz, ihrachys, ipetrova, jlibosva, jmelvin, ljozsa, lpeer, lruzicka, mschuppe, nyechiel, pablo.iranzo, pneedle, srevivo, vkommadi
Target Milestone: z6
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Flags: lpeer: needinfo+
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: openstack-neutron-9.4.1-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1505770, 1505771 (view as bug list)
Environment:
Last Closed: 2017-11-15 13:53:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1505770, 1505771

Description Paul Needle 2017-10-16 08:58:37 UTC
Description of problem:

After a minor release update to OSP 10 version 4, newly created instances cannot be accessed via their associated floating IP addresses.

Specifically, it looks like the floating IPs associated with newly spawned guest instances are not being added to the 'qg' interface within the qrouter namespace on the controller node holding the active routing role.

When the customer adds the address manually (using 'ip a a <ip address> dev <qg-xy>', i.e. 'ip addr add'), connectivity is restored.

This is blocking the upgrade of the primary production platform until the symptoms can be avoided.
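The manual workaround can be sketched as a small script. Everything below is a placeholder sketch: the namespace, qg- device name, and address are hypothetical, and the real values come from 'ip netns list' and the router's ports on the network node. With the default RUN=echo it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the manual workaround; all values are placeholders.
# Requires root and iproute2 on the network node when actually run.
ROUTER_NS=${ROUTER_NS:-qrouter-00000000-0000-0000-0000-000000000000}
QG_DEV=${QG_DEV:-qg-00000000-00}        # hypothetical qg- port name
FIP=${FIP:-203.0.113.10/32}             # floating IP, /32 on the qg device
RUN=${RUN:-echo}                        # set RUN= (empty) to actually execute

# 'ip a a' in the report is shorthand for 'ip addr add'
$RUN ip netns exec "$ROUTER_NS" ip addr add "$FIP" dev "$QG_DEV"
$RUN ip netns exec "$ROUTER_NS" ip addr show "$QG_DEV"
```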

Version-Release number of selected component (if applicable):

- OSP 10 version 4.
- RHEL 7.
- openstack-neutron-9.4.0-2.el7ost.noarch

How reproducible:

Every time (by the customer).

Steps to Reproduce:

1. After updating to OSP 10 version 4, spawn a new guest.

2. The floating IP for the new guest is unresponsive to pings.

3. On the actively routing controller, the qg interface in the router namespace has not been updated with the newly created floating IP address.

4. Adding the address manually results in successful pings to the floating IP.

Actual results:

Guests spawned after the minor release update do not have their floating IP addresses added to the associated router namespace's qg interface.

Expected results:

The floating IP address is added to the qg interface appropriately so that floating IPs can be used effectively.

Comment 7 Benjamin Schmaus 2017-10-17 16:32:58 UTC
@Brian - This does not appear to be the same situation as the OSP 9 issue: there were no xtables lock messages in the logs on the controller this was happening on, and the customer had already applied the OSP 9 neutron fix for the same iptables issue.

I am asking them for reconfirmation and will let you know once I get that answer.

I am also waiting on confirmation that manually adding the address works in the interim.

Comment 9 Brian Haley 2017-10-18 12:52:51 UTC
From the log, the neutron code seems to be working as I'd expect:

 - run 'iptables-restore -n'
 - If fail, run 'iptables-restore -n -w 20'

Is there an sosreport for this other customer?  Are there any traces in the l3-agent log?

Comment 11 Brian Haley 2017-10-18 19:35:57 UTC
So I've been trying to dissect logs.

The most likely issue is that the iptables calls are failing and we are never adding the IP addresses to the namespaces (or keepalived isn't).  This is because we add the NAT rules first, then the IP.

After talking to Jakub he mentioned something about checking how long we were waiting, and if that could be part of the problem.

In the case above, since all values are at their default, we are using:

  (report_interval/3.0), or (30/3.0), or 10 seconds
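As a quick sanity check, that wait-time derivation can be computed directly. This is only a sketch of the arithmetic described above (report_interval divided by 3.0), not neutron code:

```shell
#!/bin/sh
# xtables lock wait derived from report_interval, per the comment above.
calc_wait() {
    awk -v r="$1" 'BEGIN { printf "%g", r / 3.0 }'
}
calc_wait 30    # default report_interval -> 10 seconds
echo
calc_wait 60    # doubled report_interval -> 20 seconds
echo
```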

That should be plenty of time, but I do see errors like this sometimes:

  ; Stdout: ; Stderr: Another app is currently holding the xtables lock; still 1s 0us time ahead to have a chance to grab the lock...

I still don't understand why we could be taking so long to get the lock, but one option might be to try setting some config values to see if it helps these customers get past the problem.  The two values that would have to change are:

  agent_down_time = 150
  report_interval = 60

That is double their default values, which brings the iptables wait time to 20 seconds instead of 10; it would also make the l3-agent send less frequent updates to the server and have the server wait longer before declaring the agent down.
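A sketch of where those values would be set. The section names are my assumption: report_interval is normally an agent-side option in neutron.conf on the node running the l3-agent, while agent_down_time is a server-side option.

```ini
# neutron.conf on the neutron server (assumed section)
[DEFAULT]
agent_down_time = 150

# neutron.conf on the node running the l3-agent (assumed section)
[AGENT]
report_interval = 60
```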

It's just a theory until we can debug this further.

Also, if this is being seen in an environment we can get access to it might help.

Comment 14 Brian Haley 2017-10-19 20:07:19 UTC
So I've resorted to debugging this on live systems experiencing the problem that Ladislav gave me access to - the central CI systems.

I re-started the l3-agents on all three of their controller nodes based on code in these two upstream patches:

https://review.openstack.org/#/c/513171/
https://review.openstack.org/#/c/513489/

The first "learns" whether iptables-restore with the -w flag succeeded, and always uses -w from then on, avoiding a call to plain iptables-restore that is likely to fail.

The second increases the rate at which we try to get the iptables lock, from the default of once per second to five times per second.  I'm hoping this will lead to fewer timeouts where we don't apply iptables rules at all, which I believe is what leads to the failure, since we don't configure the IP address (or notify keepalived to) unless that step succeeds.
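The idea behind that change, polling a contended lock more often within the same overall time budget, can be illustrated with a generic flock-based sketch. This is an analogy only, not neutron's actual code; the real change lives in neutron's iptables manager:

```shell
#!/bin/sh
# Poll for an exclusive lock ~5 times per second instead of once per
# second, keeping the same overall 10-second budget (analogy only).
LOCKFILE=${LOCKFILE:-/tmp/xtables-demo.lock}
TRIES=${TRIES:-50}            # 50 attempts * 0.2s = 10s budget
i=0
until flock -n "$LOCKFILE" true 2>/dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$TRIES" ]; then
        echo "lock timeout"
        exit 1
    fi
    sleep 0.2                 # 5 attempts per second
done
echo "lock acquired"
```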

Comment 15 Brian Haley 2017-10-19 21:10:26 UTC
After restarting all the l3-agents with the configuration from comment #14, all the NAT rules seem to be configured on all nodes, but for some reason keepalived is not configuring the floating IP addresses.  Looking at the configuration files, all three seem to be in backup state:

vrrp_instance VR_1 {
    state BACKUP
    interface ha-9fa65ca6-14
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
[...]

I would have hoped that one would have taken over and become MASTER.

Will have to figure out how best to do this without breaking something.

Comment 16 Brian Haley 2017-10-20 21:53:11 UTC
Just wanted to update.

The BACKUP settings in the keepalived.conf files were a red herring.  Others took a look and found that the l3-agent hadn't processed the router, so the keepalived state wasn't updated.  When they bounced the router state (admin_state_up=False, then True), the change was picked up and things started to work again.

Daniel was going to look into recent changes to see what might have caused this regression, since constantly bouncing the state shouldn't be required; for now, though, it can serve as a workaround if things get stuck for a floating IP.
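The "bounce" workaround described here can be sketched with the neutron CLI. The router name is a placeholder, and with the default RUN=echo the script only prints the commands; note that actually running it briefly disrupts traffic through the router:

```shell
#!/bin/sh
# Sketch of the admin_state_up bounce workaround; router name is a placeholder.
ROUTER=${ROUTER:-router1}
RUN=${RUN:-echo}     # set RUN= (empty) to actually execute

$RUN neutron router-update "$ROUTER" --admin-state-up False
$RUN neutron router-update "$ROUTER" --admin-state-up True
```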

Comment 17 Benjamin Schmaus 2017-10-20 22:40:59 UTC
Brian,

I assume changing the admin status to down then up would be similar to doing a failover as well right?  That is what my customer has been doing but in either case I would assume that both cause network connectivity loss for existing instances when the router is bounced or failed over?

Comment 18 Jakub Libosvar 2017-10-23 08:19:04 UTC
(In reply to Benjamin Schmaus from comment #17)
> Brian,
> 
> I assume changing the admin status to down then up would be similar to doing
> a failover as well right?  That is what my customer has been doing but in
> either case I would assume that both cause network connectivity loss for
> existing instances when the router is bounced or failed over?

I don't think failover helps in this case. The culprit is the agent being out of sync with the server. While a failover only involves communication between the agents' keepalived processes, setting the admin state down triggers communication between the agent and the server, causing the agent to fetch fresh data and write the keepalived config file correctly.

Comment 21 anil venkata 2017-10-23 12:01:15 UTC
My analysis:

Whenever a floating IP is added, the l3 agent will:
1) add it to its internal cache, and then
2) write it to the config file and SIGHUP the keepalived process to reload the new config.
But I suspect step 2 is not happening here because the HA network port status is DOWN.

https://review.openstack.org/#/c/512179/ addresses this issue. Once the backports are merged upstream, we will backport it downstream and can provide a hotfix.

Note: restarting the l2 agent (after restarting the l3 agent) should fix this issue as well.
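A sketch of that restart order on a controller. The systemd unit names are my assumption for an OSP 10 / RHEL 7 node, and with the default RUN=echo the script only prints the commands:

```shell
#!/bin/sh
# Restart the l3-agent first, then the l2 (Open vSwitch) agent, per the
# note above. Unit names are assumptions for an OSP 10 / RHEL 7 controller.
RUN=${RUN:-echo}     # set RUN= (empty) to actually execute

$RUN systemctl restart neutron-l3-agent
$RUN systemctl restart neutron-openvswitch-agent
```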

Comment 28 Brian Haley 2017-11-01 13:10:44 UTC
Toni,

I think all the customers had to do to reproduce this was have a neutron environment configured with L3 HA, boot an instance, associate a floating IP to it, and then try to ping/ssh the floating IP.  In some cases that would fail.  Unfortunately it wasn't always reproducible, because the cause was a race condition in agent notifications, but trying with multiple instances and/or multiple tenants (and therefore different routers) would be a good test.

Thanks,

-Brian

Comment 30 Ihar Hrachyshka 2017-11-06 14:35:33 UTC
*** Bug 1507570 has been marked as a duplicate of this bug. ***

Comment 31 Toni Freger 2017-11-07 10:43:47 UTC
Tested on latest OSP 10, openstack-neutron-9.4.1-5.el7ost.noarch
Setup: 3 controllers, 1 compute

Reproduction steps:

1) VM spawned and floating IP attached; connectivity tested.
2) L3 agent of the MASTER router restarted several times during a continuous ping to the VM's floating IP.
3) Spawned 2 additional VMs on a different internal network, FIPs attached. Connectivity to them tested.
4) keepalived config contains all FIPs as expected; see below.
global_defs {
    notification_email_from neutron
    router_id neutron
}
vrrp_instance VR_1 {
    state BACKUP
    interface ha-774ae136-05
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    track_interface {
        ha-774ae136-05
    }
    virtual_ipaddress {
        169.254.0.1/24 dev ha-774ae136-05
    }
    virtual_ipaddress_excluded {
        10.0.0.215/24 dev qg-83d040e6-18
        10.0.0.217/32 dev qg-83d040e6-18
        10.0.0.218/32 dev qg-83d040e6-18
        10.0.0.219/32 dev qg-83d040e6-18
        40.40.40.1/24 dev qr-725d4569-ef
        70.70.70.1/24 dev qr-bd2bda7c-84
        fe80::f816:3eff:fe3c:d138/64 dev qr-bd2bda7c-84 scope link
        fe80::f816:3eff:fed2:1db7/64 dev qr-725d4569-ef scope link
        fe80::f816:3eff:feef:4d78/64 dev qg-83d040e6-18 scope link
    }
    virtual_routes {
        0.0.0.0/0 via 10.0.0.1 dev qg-83d040e6-18
    }
}

Comment 33 errata-xmlrpc 2017-11-15 13:53:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3234