Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1503818

Summary: [OSP9] Neutron L3-Agent silently stops updating router namespace
Product: Red Hat OpenStack
Reporter: Benjamin Schmaus <bschmaus>
Component: openstack-neutron
Assignee: Brian Haley <bhaley>
Status: CLOSED ERRATA
QA Contact: Toni Freger <tfreger>
Severity: high
Docs Contact:
Priority: high
Version: 9.0 (Mitaka)
CC: akaris, amuller, bhaley, chrisw, jlibosva, nyechiel, samccann, srevivo, vkommadi
Target Milestone: zstream
Keywords: Triaged, ZStream
Target Release: 9.0 (Mitaka)
Hardware: x86_64
OS: All
Whiteboard:
Fixed In Version: openstack-neutron-8.4.0-8.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-08 18:36:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Benjamin Schmaus 2017-10-18 20:17:49 UTC
Description of problem:
When the customer attaches a new floating IP (FIP), the L3 agent does not appear to configure it, and the instance has no connectivity. If the customer fails the L3 agent over to the passive node, the FIP becomes available and reachable; however, the failover event causes an outage for existing instances.


Version-Release number of selected component (if applicable):
OSP9

How reproducible:
100%

Steps to Reproduce:
1. Boot a new instance
2. Attach a floating IP to the instance
3. The FIP is not attached to the namespace

Actual results:
No connectivity unless L3 agent is failed over

Expected results:
Once FIP is attached the instance should have network connectivity

Additional info:
This seems similar to BZ#1502572, but the workaround of manually adding the FIP to the namespace does not work in this case.

Comment 2 Benjamin Schmaus 2017-10-18 20:22:57 UTC
Packages in configuration:

openstack-neutron-8.4.0-6.el7ost.noarch
openstack-neutron-metering-agent-8.4.0-6.el7ost.noarch
python-neutron-8.4.0-6.el7ost.noarch
python-neutron-lbaas-8.4.0-1.el7ost.noarch
openstack-neutron-common-8.4.0-6.el7ost.noarch
openstack-neutron-lbaas-8.4.0-1.el7ost.noarch
python-neutronclient-4.1.1-2.el7ost.noarch
openstack-neutron-bigswitch-lldp-8.40.7-2.el7ost.noarch
openstack-neutron-openvswitch-8.4.0-6.el7ost.noarch
openstack-neutron-ml2-8.4.0-6.el7ost.noarch
python-neutron-lib-0.0.2-1.el7ost.noarch

Comment 4 Brian Haley 2017-10-20 15:34:14 UTC
This still has similarities to bz 1502572. It could be that we are tripping over the iptables-restore issue here, so the NAT rules are not getting added to the namespace, and adding the IP on its own doesn't work.

One thing we've been using to try to work around this is to set admin_state_up=False and then True on the affected router, which triggers the agent to refresh things and get the rules and IP configured.  We are still trying to root-cause the underlying issue and will update when we have more info.
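
The admin_state_up toggle could be scripted roughly as follows. This is only a sketch: bounce_router and pause_seconds are hypothetical names, the pause length is a guess, and the client argument is assumed to expose update_router(router_id, body) the way python-neutronclient's v2.0 Client does.

```python
import time


def bounce_router(client, router_id, pause_seconds=5.0, sleep=time.sleep):
    """Set admin_state_up to False, wait, then set it back to True.

    `client` is assumed to expose update_router(router_id, body), as
    neutronclient.v2_0.client.Client does. The helper name and the
    pause are illustrative, not part of any neutron tooling.
    """
    for state in (False, True):
        client.update_router(router_id, {"router": {"admin_state_up": state}})
        if state is False:
            # Give the agent time to process the "down" update first.
            sleep(pause_seconds)
```

Note that taking the router down interrupts traffic through it for the duration of the pause, so this is best limited to the affected router.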

Comment 5 Brian Haley 2017-10-20 15:50:04 UTC
The log file does have the iptables traces. It also has this:

2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task [req-19638c71-4ad9-412f-b5d7-dc9cb84eca4f - - - - -] Error during L3NATAgentWithStateReport.periodic_sync_routers_task
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task Traceback (most recent call last):
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task     task(self, context)
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 568, in periodic_sync_routers_task
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task     self.fetch_and_sync_all_routers(context, ns_manager)
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 603, in fetch_and_sync_all_routers
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task     r['id'], r.get(l3_constants.HA_ROUTER_STATE_KEY))
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 120, in check_ha_state_for_router
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task     if ri and current_state != TRANSLATION_MAP[ri.ha_state]:
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 81, in ha_state
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task     ha_state_path = self.keepalived_manager.get_full_config_file_path(
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task AttributeError: 'NoneType' object has no attribute 'get_full_config_file_path'
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task

I pinged someone to look at that since it could be related to why an IP did not get configured.
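
For illustration, the traceback shows ha_state dereferencing self.keepalived_manager while it is None. The sketch below reproduces that shape with a guard added. HaRouterSketch and the "unknown" return value are hypothetical, and this is not the actual neutron code or the eventual upstream patch.

```python
class HaRouterSketch(object):
    """Hypothetical stand-in for an HA router object; not neutron's class."""

    def __init__(self, keepalived_manager=None):
        # None models a router whose keepalived manager was never set up.
        self.keepalived_manager = keepalived_manager

    @property
    def ha_state(self):
        if self.keepalived_manager is None:
            # Without this guard the next statement raises the
            # AttributeError seen in the traceback.
            return "unknown"
        path = self.keepalived_manager.get_full_config_file_path("state")
        try:
            with open(path) as f:
                return f.read().strip()
        except (OSError, IOError):
            # The state file may not exist yet.
            return "unknown"
```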

Comment 6 anil venkata 2017-10-23 12:03:02 UTC
Whenever a floating IP is added, the L3 agent will
1) add it to its internal cache, and then
2) write it to the config file and send SIGHUP to the keepalived process to reload the new config.
But I suspect step 2 is not happening here because the HA network port status is DOWN.
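
The two steps can be sketched as below. This is a simplified illustration, not the agent's real code: add_vip_and_reload, render and the temp-file-plus-rename approach are assumptions made for the example.

```python
import os
import signal


def add_vip_and_reload(cache, config_path, render, vip, keepalived_pid,
                       kill=os.kill):
    """Record a VIP, rewrite the keepalived config, and signal a reload.

    `render` is a hypothetical callable that turns the list of cached
    addresses into the full config text.
    """
    # Step 1: remember the address in the in-memory cache.
    if vip not in cache:
        cache.append(vip)
    # Step 2a: write the full config to a temp file, then rename it into
    # place so keepalived never reads a half-written file.
    tmp_path = config_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(render(cache))
    os.rename(tmp_path, config_path)
    # Step 2b: SIGHUP asks keepalived to reload its configuration.
    kill(keepalived_pid, signal.SIGHUP)
```

If step 2 is skipped, the address exists only in the cache and never reaches the namespace, which matches the symptom described here.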

https://review.openstack.org/#/c/512179/ addresses this issue. Once the backports are merged upstream, we will backport it downstream and can provide a hotfix.

Note: restarting the L2 agent (after restarting the L3 agent) should fix this issue as well.

Comment 7 anil venkata 2017-10-23 12:05:25 UTC
The error below is a different issue:
2017-10-12 16:17:03.425 12387 ERROR oslo_service.periodic_task AttributeError: 'NoneType' object has no attribute 'get_full_config_file_path'

I will propose a patch upstream for that.

However, the patch mentioned earlier (https://review.openstack.org/#/c/512179/) should fix the floating IP issue.

Comment 9 anil venkata 2017-10-23 12:38:33 UTC
Maybe tomorrow, if https://review.openstack.org/#/c/514138/ and https://review.openstack.org/#/c/514139/ get merged today (I hope they can be).

Comment 11 Brian Haley 2017-10-24 18:15:12 UTC
Since there were a few related bugs with slightly different descriptions, another was cloned to track all the backports from upstream.

https://bugzilla.redhat.com/show_bug.cgi?id=1505771

Comment 16 anil venkata 2017-11-02 11:57:29 UTC
Steps to reproduce
1) In OSP9 (or OSP10, OSP11), restart the L3 agent.
2) Then spawn a VM and add a floating IP.
3) Ping the floating IP; it should succeed with the fix.
4) Also check that the floating IP has been added to the keepalived config file.
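
Step 3 could be automated with a small probe like the one below. It is a sketch only: fip_reachable is a hypothetical helper, and the -c and -W options assume the Linux iputils ping.

```python
import subprocess


def fip_reachable(address, attempts=3, run=subprocess.call):
    """Return True if at least one ICMP echo to `address` succeeds."""
    for _ in range(attempts):
        # Linux iputils ping: -c 1 sends one probe, -W 2 waits two seconds.
        if run(["ping", "-c", "1", "-W", "2", address]) == 0:
            return True
    return False
```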

Comment 20 Toni Freger 2017-11-07 07:59:00 UTC
Tested on the latest OSP9 build, openstack-neutron-8.4.0-8.el7ost.noarch
Setup: 3 controllers, 1 compute

Reproduction steps:

1) Spawned a VM and attached a floating IP; connectivity tested.
2) Restarted the L3 agent of the MASTER router several times during a continuous ping to the VM's floating IP.
3) Spawned 2 additional VMs with FIPs; connectivity to them tested.
4) The keepalived conf contains all FIPs as expected, see below.
vrrp_instance VR_1 {
    state BACKUP
    interface ha-0d868774-03
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    track_interface {
        ha-0d868774-03
    }
    virtual_ipaddress {
        169.254.0.1/24 dev ha-0d868774-03
    }
    virtual_ipaddress_excluded {
        10.0.0.210/24 dev qg-dacf0c94-d3
        10.0.0.211/32 dev qg-dacf0c94-d3
        10.0.0.212/32 dev qg-dacf0c94-d3
        10.0.0.213/32 dev qg-dacf0c94-d3
        30.30.30.1/24 dev qr-2520ff23-1b
        fe80::f816:3eff:fe15:ac1/64 dev qg-dacf0c94-d3 scope link
        fe80::f816:3eff:fe7f:30af/64 dev qr-2520ff23-1b scope link
    }
    virtual_routes {
        0.0.0.0/0 via 10.0.0.1 dev qg-dacf0c94-d3
    }
}
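
A config like the one above can be checked with a short script instead of by eye. The sketch below is a minimal line-based parser, not a full keepalived config parser. The function names are hypothetical, and treating /32 entries on qg- devices as floating IPs is an assumption based on this dump.

```python
import re


def excluded_vips(conf_text):
    """Return (cidr, device) pairs from virtual_ipaddress_excluded blocks."""
    pairs = []
    in_block = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("virtual_ipaddress_excluded"):
            in_block = True
        elif in_block and stripped == "}":
            in_block = False
        elif in_block:
            match = re.match(r"(\S+)\s+dev\s+(\S+)", stripped)
            if match:
                pairs.append(match.groups())
    return pairs


def floating_ips(conf_text):
    """Assumption: /32 entries on qg- devices are floating IPs."""
    return [cidr.split("/")[0] for cidr, dev in excluded_vips(conf_text)
            if cidr.endswith("/32") and dev.startswith("qg-")]
```

Run against the dump above, floating_ips should list 10.0.0.211, 10.0.0.212 and 10.0.0.213, while the /24 gateway address on the same device is skipped.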

Comment 22 errata-xmlrpc 2017-11-08 18:36:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3152