Bug 1502572
| Summary: | After update to OSP 10 release 4, floating IP addresses not added to router's qg interface when guests spawned | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Paul Needle <pneedle> | |
| Component: | openstack-neutron | Assignee: | Brian Haley <bhaley> | |
| Status: | CLOSED ERRATA | QA Contact: | Toni Freger <tfreger> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 10.0 (Newton) | CC: | akaris, amuller, bhaley, bschmaus, chrisw, ealcaniz, ihrachys, ipetrova, jlibosva, jmelvin, ljozsa, lpeer, lruzicka, mschuppe, nyechiel, pablo.iranzo, pneedle, srevivo, vkommadi | |
| Target Milestone: | z6 | Keywords: | Triaged, ZStream | |
| Target Release: | 10.0 (Newton) | Flags: | lpeer: needinfo+ | |
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | openstack-neutron-9.4.1-4.el7ost | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1505770 1505771 (view as bug list) | Environment: | ||
| Last Closed: | 2017-11-15 13:53:31 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1505770, 1505771 | |||
Description
Paul Needle
2017-10-16 08:58:37 UTC
@Brian - This does not appear to be the same situation as the OSP9 issue: there were no xtables locks in the logs on the controller where this was happening, and the customer had already applied the neutron fix in OSP9 for the same iptables issue. I am asking them to reconfirm and will let you know once I get that answer. I am also waiting on confirmation that manually adding the route works in the interim.

From the log, the neutron code seems to be working as I'd expect:
- run 'iptables-restore -n'
- if that fails, run 'iptables-restore -n -w 20'

Is there an sosreport for this other customer? Are there any traces in the l3-agent log?

So I've been trying to dissect logs. The most likely issue is that the iptables calls are failing and we are never adding the IP addresses to the namespaces (or keepalived isn't). This is because we add the NAT rules first, then the IP.

After talking to Jakub, he mentioned checking how long we were waiting, and whether that could be part of the problem. In the case above, since all values are at their defaults, we are using (report_interval/3.0), or (30/3.0), or 10 seconds.

That should be plenty of time, but I do see errors like this sometimes:

; Stdout: ; Stderr: Another app is currently holding the xtables lock; still 1s 0us time ahead to have a chance to grab the lock...

I still don't understand why we could be taking so long to get the lock, but one option might be to try setting some config values to see if it helps these customers get past the problem. The two values that would have to change are:

agent_down_time = 150
report_interval = 60

That is double their default values, which raises the iptables wait time to 20 seconds instead of 10. It would also mean the l3-agent sends less frequent updates to the server, and the server waits longer between updates before declaring the agent down.

It's just a theory until we can debug this further. Also, if this is being seen in an environment we can get access to, it might help.
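The wait-time arithmetic discussed above can be sketched as follows. This is only an illustration of the formula quoted in the comment (report_interval/3.0); the function name is hypothetical and this is not the neutron implementation.

```python
def iptables_wait_seconds(report_interval: float) -> float:
    """Lock wait time derived from report_interval, per the formula quoted in the comment."""
    return report_interval / 3.0


# Default report_interval (30) versus the proposed value (60).
default_wait = iptables_wait_seconds(30)
proposed_wait = iptables_wait_seconds(60)

print(default_wait)   # 10.0
print(proposed_wait)  # 20.0
```

Under that assumption, doubling report_interval doubles the time the agent is willing to wait for the xtables lock.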
So I've resorted to debugging this on live systems experiencing the problem, which Ladislav gave me access to: the central CI systems. I restarted the l3-agents on all three of their controller nodes with code from these two upstream patches:

https://review.openstack.org/#/c/513171/
https://review.openstack.org/#/c/513489/

The first "learns" if we used iptables-restore with the -w flag successfully, and always uses it in the future, saving a possible failing call to iptables-restore. The second increases the rate at which we try to get the iptables lock, from the default of 1 per second to 5 per second. I'm hoping this will lead to fewer timeouts where we don't apply iptables rules at all, which I believe is leading to the failure, since we don't configure the IP address (or notify keepalived to) unless that step is successful.

After restarting all the l3-agents in the configuration from comment #14, all the NAT rules seem to be configured on all nodes, but for some reason keepalived is not configuring the floating IP addresses. Looking at the configuration files, all three seem to be in backup state:

vrrp_instance VR_1 {
    state BACKUP
    interface ha-9fa65ca6-14
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    [...]

I would have hoped that one would have taken over and become MASTER. Will have to figure out how best to do this without breaking something.

Just wanted to update. The BACKUP settings in the keepalived.conf files were a red herring. Others took a look and found that the l3-agent hadn't processed the router, and so keepalived state wasn't updated. When they bounced the router state (admin_state_up=False/True), the change was picked up and things started to work again. Daniel was going to look into recent changes to see what might have caused this regression, since constantly bouncing the state shouldn't be required, but for now it might be a workaround if things get stuck for a floating IP.
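The "learning" behaviour described for the first patch can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from the upstream review: the class name `LearningRestore` and the injected `run` callable are hypothetical, and the only command forms used are the two quoted earlier in this bug ('iptables-restore -n' and 'iptables-restore -n -w <seconds>').

```python
from typing import Callable, Sequence


class LearningRestore:
    """Remembers whether the '-w' form of iptables-restore has worked before."""

    def __init__(self, run: Callable[[Sequence[str]], bool], wait_seconds: int = 10):
        # 'run' executes a command and returns True on success (hypothetical hook).
        self._run = run
        self._wait_seconds = wait_seconds
        self.use_wait_flag = False

    def apply(self) -> bool:
        plain = ["iptables-restore", "-n"]
        waiting = plain + ["-w", str(self._wait_seconds)]

        # Once '-w' is known to work, skip the plain call that may fail.
        if self.use_wait_flag:
            return self._run(waiting)

        if self._run(plain):
            return True

        # Plain call failed: retry with '-w' and remember it if it succeeds.
        succeeded = self._run(waiting)
        if succeeded:
            self.use_wait_flag = True
        return succeeded
```

Injecting the runner keeps the sketch testable without touching real firewall state; in practice it would wrap whatever mechanism the agent uses to execute commands.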
Brian,

I assume changing the admin status to down then up would be similar to doing a failover as well, right? That is what my customer has been doing, but in either case I would assume that both cause network connectivity loss for existing instances when the router is bounced or failed over?

(In reply to Benjamin Schmaus from comment #17)
> Brian,
>
> I assume changing the admin status to down then up would be similar to doing
> a failover as well right? That is what my customer has been doing but in
> either case I would assume that both cause network connectivity loss for
> existing instances when the router is bounced or failed over?

I don't think failover helps in this case. The culprit is the agent being out of sync with the server. While failover involves communication only between the agents' keepalived processes, setting the admin state down triggers communication between agent and server, leading the agent to fetch new data and write the keepalived config file correctly.

My analysis: whenever a floating IP is added, the l3 agent will
1) add it to its internal cache, and then
2) write it to the config file and SIGHUP the keepalived process to reload the new config.

But I suspect step 2 is not happening here because the HA network port status is DOWN. https://review.openstack.org/#/c/512179/ addresses this issue. Once backports are merged upstream, we will backport it downstream and can provide a hotfix.

Note: restarting the l2 agent (after restarting the l3 agent) should fix this issue as well.

Toni,

I think all the customers had to do to reproduce this was to have a neutron environment configured with L3 HA, boot an instance, associate a floating IP to it, then try to ping/ssh the floating IP. In some cases that would fail. Unfortunately it wasn't always reproducible because it was due to a race condition in agent notifications, but trying with multiple instances and/or multiple tenants (so different routers) would be a good test.

Thanks,
-Brian

*** Bug 1507570 has been marked as a duplicate of this bug. ***

Tested on latest OSP10 openstack-neutron-9.4.1-5.el7ost.noarch
Setup: 3 Controllers, 1 Compute
Reproduction steps:
1) VM spawned and floating IP attached; connectivity tested.
2) L3 agent of the MASTER router restarted several times during a continuous ping to the floating IP of the VM.
3) Spawned 2 additional VMs with a different internal network, FIPs attached. Connectivity to them tested.
4) Keepalived conf contains all FIPs as expected, see below.
global_defs {
    notification_email_from neutron
    router_id neutron
}
vrrp_instance VR_1 {
    state BACKUP
    interface ha-774ae136-05
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    track_interface {
        ha-774ae136-05
    }
    virtual_ipaddress {
        169.254.0.1/24 dev ha-774ae136-05
    }
    virtual_ipaddress_excluded {
        10.0.0.215/24 dev qg-83d040e6-18
        10.0.0.217/32 dev qg-83d040e6-18
        10.0.0.218/32 dev qg-83d040e6-18
        10.0.0.219/32 dev qg-83d040e6-18
        40.40.40.1/24 dev qr-725d4569-ef
        70.70.70.1/24 dev qr-bd2bda7c-84
        fe80::f816:3eff:fe3c:d138/64 dev qr-bd2bda7c-84 scope link
        fe80::f816:3eff:fed2:1db7/64 dev qr-725d4569-ef scope link
        fe80::f816:3eff:feef:4d78/64 dev qg-83d040e6-18 scope link
    }
    virtual_routes {
        0.0.0.0/0 via 10.0.0.1 dev qg-83d040e6-18
    }
}
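The check in step 4 (that the keepalived conf lists every floating IP) could be automated along these lines. This is a rough sketch that assumes the file layout shown above, with one address per line inside a virtual_ipaddress_excluded block; the function names and the sample text are illustrative only.

```python
from typing import List


def excluded_addresses(conf_text: str) -> List[str]:
    """Return the addresses (without prefix length) in virtual_ipaddress_excluded."""
    addresses = []
    in_block = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("virtual_ipaddress_excluded"):
            in_block = True
            continue
        if in_block:
            if line == "}":
                break
            if line:
                # First token is "address/prefix"; keep only the address.
                addresses.append(line.split()[0].split("/")[0])
    return addresses


def missing_fips(conf_text: str, expected: List[str]) -> List[str]:
    """Return the expected floating IPs that are absent from the conf."""
    present = set(excluded_addresses(conf_text))
    return [ip for ip in expected if ip not in present]


SAMPLE = """
virtual_ipaddress {
    169.254.0.1/24 dev ha-774ae136-05
}
virtual_ipaddress_excluded {
    10.0.0.215/24 dev qg-83d040e6-18
    10.0.0.217/32 dev qg-83d040e6-18
}
"""

print(missing_fips(SAMPLE, ["10.0.0.217", "10.0.0.218"]))  # ['10.0.0.218']
```

An empty result would mean every expected floating IP was written to the conf, which is the condition the test run above verified by inspection.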
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3234