Bug 1584779 - [HA] Floating IP issues after introducing failures to Cluster nodes
Summary: [HA] Floating IP issues after introducing failures to Cluster nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: z3
Target Release: 13.0 (Queens)
Assignee: Sridhar Gaddam
QA Contact: Tomas Jamrisko
URL:
Whiteboard: HA
Depends On:
Blocks:
 
Reported: 2018-05-31 16:00 UTC by Tomas Jamrisko
Modified: 2018-11-19 17:34 UTC
CC List: 7 users

Fixed In Version: opendaylight-8.3.0-4.el7ost
Doc Type: Bug Fix
Doc Text:
Cause: Null Pointer Exceptions (NPEs) were seen in netvirt when some of the controller nodes were brought down.
Consequence: The NPEs caused missing flows and stale group entries.
Fix: The NPEs are now fixed and the OVS pipeline is programmed correctly for the floating IP use case.
Result: The FIP use case continues to work even when disruptive tests are performed on the controller/compute nodes.
Clone Of:
Environment:
Last Closed: 2018-11-13 23:32:54 UTC
Target Upstream Version:
Embargoed:


Attachments
controller-0 logs (17.60 MB, application/x-tar), 2018-05-31 16:05 UTC, Tomas Jamrisko
controller-2 logs (18.03 MB, application/x-tar), 2018-05-31 16:09 UTC, Tomas Jamrisko
compute-1-ovslogs (230.00 KB, application/x-tar), 2018-05-31 16:10 UTC, Tomas Jamrisko
compute-0-ovs logs (190.00 KB, application/x-tar), 2018-05-31 16:10 UTC, Tomas Jamrisko
Controller-1 logs (281.62 KB, application/x-xz), 2018-05-31 16:12 UTC, Tomas Jamrisko
controller-1 neutron server logs (413.68 KB, application/x-xz), 2018-05-31 16:13 UTC, Tomas Jamrisko


Links
OpenDaylight Bug NETVIRT-1133 (last updated 2018-06-28 04:39:49 UTC)
OpenDaylight Bug NETVIRT-1353 (last updated 2018-06-28 04:36:37 UTC)
Red Hat Product Errata RHBA-2018:3614 (last updated 2018-11-13 23:34:31 UTC)

Description Tomas Jamrisko 2018-05-31 16:00:41 UTC
Description of problem:
Floating IPs become unreachable after some disruptive operations are performed on the cluster nodes

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-11.el7ost


Steps to Reproduce:
1. Reboot controller nodes or compute nodes
2. Start VMs
3. Create floating IPs, attach them to the VMs, and try to ping them

Actual results:
VMs will be unreachable

Expected results:
VMs are reachable

(will attach logs shortly)
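For illustration, a minimal CLI sketch of steps 2-3 (flavor, image, network and VM names are placeholders, not values from this report):

# placeholders for illustration only
openstack server create --flavor m1.tiny --image cirros --network net1 vm1
openstack floating ip create public
openstack server add floating ip vm1 <allocated-fip>
ping -c 3 <allocated-fip>    # times out when the bug is hit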

Comment 1 Tomas Jamrisko 2018-05-31 16:05:39 UTC
Created attachment 1446321 [details]
controller-0 logs

Comment 2 Tomas Jamrisko 2018-05-31 16:09:28 UTC
Created attachment 1446322 [details]
controller-2 logs

Comment 3 Tomas Jamrisko 2018-05-31 16:10:17 UTC
Created attachment 1446323 [details]
compute-1-ovslogs

Comment 4 Tomas Jamrisko 2018-05-31 16:10:50 UTC
Created attachment 1446324 [details]
compute-0-ovs logs

Comment 5 Tomas Jamrisko 2018-05-31 16:12:09 UTC
Created attachment 1446325 [details]
Controller-1 logs

Comment 6 Tomas Jamrisko 2018-05-31 16:13:32 UTC
Created attachment 1446326 [details]
controller-1 neutron server logs

Comment 7 Aswin Suryanarayanan 2018-06-04 06:59:53 UTC
This issue happens when the shard leader changes. Once failures are introduced to the nodes, the shard leader changes and packets punted from OVS to the controller (netvirt, to be precise) fail to reach it. For FIP traffic from the undercloud, the return-path packet is sent to the controller so that it can learn the undercloud MAC; this packet is not reaching the controller.


When we faced this issue earlier, we observed that if we introduced failures again so that the initial shard leader was re-elected, the FIP started working again.
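As a quick check of whether this punt path is involved, one could look at the flows whose action sends packets to the controller and at the switch's OpenFlow connection state (a hedged sketch; the table number is taken from the traces later in this bug):

# On the compute node: flows that punt packets to the controller
sudo ovs-ofctl -O OpenFlow13 dump-flows br-int table=22
# OpenFlow connection state towards the ODL controllers
sudo ovs-vsctl list controller | grep -E 'target|is_connected|role'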

Comment 8 Stephen Kitt 2018-06-04 09:14:18 UTC
(In reply to Aswin Suryanarayanan from comment #7)
> This issue happens when the shard leader changes. Once failures are
> introduced to the nodes, the shard leader changes and packets punted from
> OVS to the controller (netvirt, to be precise) fail to reach it. For FIP
> traffic from the undercloud, the return-path packet is sent to the
> controller so that it can learn the undercloud MAC; this packet is not
> reaching the controller.

When you say the packet is not reaching the controller, do you know where it's disappearing? Is it not reaching the VM/container hosting ODL? Or is it reaching the VM, but not being processed by the controller?
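One way to narrow this down is to watch the OpenFlow session on both ends (a hedged sketch; port 6653 matches the OVS logs attached here, and the host addresses are placeholders):

# On the compute node: is OVS sending anything towards the controllers?
sudo tcpdump -ni any 'tcp port 6653 and host <controller-ip>' -c 50
# On the controller node/VM hosting ODL: does the same traffic arrive?
sudo tcpdump -ni any 'tcp port 6653 and host <compute-ip>' -c 50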

Comment 10 Aswin Suryanarayanan 2018-06-04 09:52:12 UTC
(In reply to Stephen Kitt from comment #8)
> (In reply to Aswin Suryanarayanan from comment #7)
> > This issue happens when the shard leader changes. Once failures are
> > introduced to the nodes, the shard leader changes and packets punted from
> > OVS to the controller (netvirt, to be precise) fail to reach it. For FIP
> > traffic from the undercloud, the return-path packet is sent to the
> > controller so that it can learn the undercloud MAC; this packet is not
> > reaching the controller.
> 
> When you say the packet is not reaching the controller, do you know where
> it's disappearing? Is it not reaching the VM/container hosting ODL? Or is it
> reaching the VM, but not being processed by the controller?

I am not sure where exactly it is dropped; it didn't reach netvirt. Tomas, do you have some info regarding this?

Comment 11 Tomas Jamrisko 2018-06-04 10:18:56 UTC
I don't know much more, I can try to get a deployment, reproduce the issue, but I'm not sure what to look for. Would you be willing to take a look at the broken deployment?

(In reply to Aswin Suryanarayanan from comment #10)
> I am not sure where exactly it is dropped; it didn't reach netvirt.
> Tomas, do you have some info regarding this?

Comment 12 Sridhar Gaddam 2018-06-06 06:16:17 UTC
(In reply to Tomas Jamrisko from comment #11)
> I don't know much more, I can try to get a deployment, reproduce the issue,
> but I'm not sure what to look for. Would you be willing to take a look at
> the broken deployment?

Tomas, please reproduce this issue and provide us access to the environment. We will debug it. Thanks.

Comment 13 Sridhar Gaddam 2018-06-06 21:23:59 UTC
I had a look at the setup along with @Aswin, and we found the following issues.

Setup: 2 Computes and 3 Controllers

The setup had three VMs where FIPs were not working.

NAPT Switch is located on Controller-0
Tenant network (net1) is VxLAN and the public network is a FLAT network

vm5 : MAC: fa:16:3e:fa:b3:fc, net1=192.168.99.5, 10.0.0.216, on Compute-1 
vm6 : MAC: fa:16:3e:8b:4a:3b, net1=192.168.99.6, 10.0.0.225 on Compute-0
vm9 : MAC: fa:16:3e:00:90:52, net1=192.168.99.9, 10.0.0.213 on Compute-1

While trying to ping the FIP from the undercloud (the same issue occurs when the VM tries to ping the DC-GW/undercloud), the packet is dropped for one of the following reasons.

Issue-1: Stale Group entry seen for vm5 and vm6
===============================================
sudo ovs-appctl ofproto/trace br-int 'in_port=1,dl_src=52:54:00:d6:b0:82,dl_dst=fa:16:3e:3f:e5:1b,dl_type=0x0800,nw_src=10.0.0.1,nw_dst=10.0.0.216,nw_proto=1,nw_tos=0,nw_ttl=128,icmp_type=8,icmp_code=0'
Flow: icmp,in_port=1,vlan_tci=0x0000,dl_src=52:54:00:d6:b0:82,dl_dst=fa:16:3e:3f:e5:1b,nw_src=10.0.0.1,nw_dst=10.0.0.216,nw_tos=0,nw_ecn=0,nw_ttl=128,icmp_type=8,icmp_code=0
----------------
 0. in_port=1,vlan_tci=0x0000/0x1fff, priority 4, cookie 0x8000000
    write_metadata:0x180000000001/0xffffff0000000001
    goto_table:17
17. metadata=0x180000000000/0xffffff0000000000, priority 10, cookie 0x8000001
    load:0x19e10->NXM_NX_REG3[0..24]
    write_metadata:0x9000180000033c20/0xfffffffffffffffe
    goto_table:19
19. metadata=0x33c20/0xfffffe,dl_dst=fa:16:3e:3f:e5:1b, priority 20, cookie 0x8000009
    write_metadata:0x33c22/0xfffffe
    goto_table:21
21. ip,metadata=0x33c22/0xfffffe,nw_dst=10.0.0.216, priority 42, cookie 0x8000003
    set_field:fa:16:3e:3f:e5:1b->eth_dst
    goto_table:25
25. ip,dl_dst=fa:16:3e:3f:e5:1b,nw_dst=10.0.0.216, priority 10, cookie 0x8000004
    set_field:192.168.99.5->ip_dst
    write_metadata:0x33c26/0xfffffe
    goto_table:27
27. ip,metadata=0x33c26/0xfffffe,nw_dst=192.168.99.5, priority 10, cookie 0x8000004
    resubmit(,21)
21. ip,metadata=0x33c26/0xfffffe,nw_dst=192.168.99.5, priority 42, cookie 0x8000003
    group:155003
    set_field:fa:16:3e:ee:48:f9->eth_src
    set_field:fa:16:3e:8c:0c:5c->eth_dst     ====> The MAC address does not belong to vm5 
    load:0x1d00->NXM_NX_REG6[]
    resubmit(,220)
220. No match.
    drop

When we looked at the config store, we could see an entry in the groupTable with the wrong MAC address.
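For reference, a hedged sketch of how the group can be compared between the switch and the ODL config datastore (the RESTCONF path, DPN id and credentials are illustrative assumptions, not values from this setup):

# On the compute node: the group as programmed in OVS
sudo ovs-ofctl -O OpenFlow13 dump-groups br-int | grep 'group_id=155003'
# On a controller: the same group in the openflowplugin config datastore
curl -s -u admin:admin "http://localhost:8181/restconf/config/opendaylight-inventory:nodes/node/openflow:<dpnid>/flow-node-inventory:group/155003"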

Issue-2: missing FIB entry seen for vm9
========================================
[heat-admin@compute-1 SampleScripts]$ sudo ovs-appctl ofproto/trace br-int 'in_port=1,dl_src=52:54:00:d6:b0:82,dl_dst=fa:16:3e:ee:f3:07,dl_type=0x0800,nw_src=10.0.0.1,nw_dst=10.0.0.213,nw_proto=1,nw_tos=0,nw_ttl=128,icmp_type=8,icmp_code=0'
 0. in_port=1,vlan_tci=0x0000/0x1fff, priority 4, cookie 0x8000000
    write_metadata:0x180000000001/0xffffff0000000001
    goto_table:17
17. metadata=0x180000000000/0xffffff0000000000, priority 10, cookie 0x8000001
    load:0x19e10->NXM_NX_REG3[0..24]
    write_metadata:0x9000180000033c20/0xfffffffffffffffe
    goto_table:19
19. metadata=0x33c20/0xfffffe,dl_dst=fa:16:3e:ee:f3:07, priority 20, cookie 0x8000009
    write_metadata:0x33c22/0xfffffe
    goto_table:21
21. ip,metadata=0x33c22/0xfffffe,nw_dst=10.0.0.213, priority 42, cookie 0x8000003
    set_field:fa:16:3e:ee:f3:07->eth_dst
    goto_table:25
25. ip,dl_dst=fa:16:3e:ee:f3:07,nw_dst=10.0.0.213, priority 10, cookie 0x8000004
    set_field:192.168.99.9->ip_dst
    write_metadata:0x33c26/0xfffffe
    goto_table:27
27. ip,metadata=0x33c26/0xfffffe,nw_dst=192.168.99.9, priority 10, cookie 0x8000004
    resubmit(,21)
21. ip,metadata=0x33c26/0xfffffe,nw_dst=192.168.99.0/24, priority 34, cookie 0x8000003
    write_metadata:0x157e033c26/0xfffffffffe
    goto_table:22
22. priority 0, cookie 0x8000004
    CONTROLLER:65535

The issue here is a missing FIB entry, and the FIB entry is missing because of a missing group entry.
Basically, the config store does have an entry for 192.168.99.9 (FIP: 10.0.0.213) in Table 21, but the flow's action sends the packet to a specific group ID that is missing on the switch (and also in the config store). When the request to program the flow is made, the OVS switch (on Compute-1) seems to reject it because there is no corresponding group entry.
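A hedged sketch of how this can be confirmed on Compute-1 (the first two commands are expected to return nothing here, and the log path is an assumption that may differ on containerized deployments):

# Did the /32 FIB flow for the VM actually get programmed on the switch?
sudo ovs-ofctl -O OpenFlow13 dump-flows br-int table=21 | grep '192.168.99.9'
# Does the group referenced by the config-store flow exist on the switch?
sudo ovs-ofctl -O OpenFlow13 dump-groups br-int | grep 'group_id=<id-from-config-store>'
# OVS logs the rejection when a flow references a missing group (see the errors below)
sudo grep OFPBAC_BAD_OUT_GROUP /var/log/openvswitch/ovs-vswitchd.log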

Some errors:
============
2018-06-06T08:58:43,799 | ERROR | ForkJoinPool-1-worker-1 | ElanForwardingEntriesHandler     | 347 - org.opendaylight.netvirt.elanmanager-impl - 0.6.0.redhat-10 | Static MAC address PhysAddress [_value=fa:16:3e:8c:0c:5c] has already been added for the same ElanInstance bbd99f4b-6ef8-48e2-9e3c-a580fba3eff2 on the same Logical Interface Port 7f06eb09-66d5-4bd3-97b7-d4a8d8ac7ac6. No operation will be done.

2018-06-06T13:33:09.332Z|00558|rconn|INFO|br-int<->tcp:172.17.1.15:6653: connected
2018-06-06T13:33:09.834Z|00559|connmgr|INFO|br-int<->tcp:172.17.1.19:6653: sending OFPGMFC_GROUP_EXISTS error reply to OFPT_GROUP_MOD message 
2018-06-06T13:33:10.688Z|00560|ofp_actions|WARN|bad action at offset 0 (OFPBMC_BAD_FIELD):
00000000  00 19 00 08 00 00 00 00-ff ff 00 18 00 00 23 20
00000010  00 07 00 1f 00 01 0c 04-00 00 00 00 00 00 01 00
00000020  ff ff 00 10 00 00 23 20-00 0e ff f8 dc 00 00 00
2018-06-06T13:33:10.688Z|00561|connmgr|INFO|br-int<->tcp:172.17.1.19:6653: sending OFPBMC_BAD_FIELD error reply to OFPT_FLOW_MOD message 
2018-06-06T13:33:10.688Z|00562|connmgr|INFO|br-int<->tcp:172.17.1.19:6653: sending OFPBAC_BAD_OUT_GROUP error reply to OFPT_FLOW_MOD message 

Observations/Next Steps:
========================
While searching for patterns (IP addresses, MAC addresses, etc.), we found several logs showing that VM IP addresses were re-used (i.e., a VM was deleted and a newly spawned VM got the same IP), that the switch lost connections to the controllers, and that there were group-entry errors in the OVS switch.

It looks like this is a different issue from the one @Aswin mentioned in Comment #7. @Tomas had to try several combinations (rebooting controller nodes and compute nodes, spawning VMs while the controller/shard leader was down, disassociating/re-associating FIPs, deleting VMs, spawning new VMs, etc.) before the FIPs started to fail. It would help if an easy way to reproduce this issue were identified, so that we can enable the necessary logging to capture more details.

Comment 17 Tomas Jamrisko 2018-06-19 12:49:44 UTC
Looks like I can reproduce one of the FIP issues by re-using a previously created FIP.

Start a VM, add a FIP, kill the compute node on which the VM is running, delete the VM, create a new VM, and attach the same FIP.
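A hedged CLI sketch of that sequence (names, image and flavor are placeholders; the compute-node kill itself is done out of band, e.g. by powering it off):

# placeholders for illustration only
openstack server create --flavor m1.tiny --image cirros --network net1 vm-a
openstack server add floating ip vm-a <fip>
# power off / hard-reset the compute node hosting vm-a, then:
openstack server delete vm-a
openstack server create --flavor m1.tiny --image cirros --network net1 vm-b
openstack server add floating ip vm-b <fip>
ping -c 3 <fip>    # fails against the stale group entry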

Comment 18 Ariel Adam 2018-06-19 13:03:59 UTC
Tomas, very nice, finally it can be reproduced :-).

Comment 21 Sridhar Gaddam 2018-06-27 19:01:55 UTC
(In reply to Tomas Jamrisko from comment #17)
> Looks like I can reproduce one of the FIP issues by re-using a previously
> created FIP.
> 
> Start a VM, add a FIP, kill the compute node on which the VM is running,
> delete the VM, create a new VM, and attach the same FIP.

I had a look at the setup and the reason for the failure is a "Stale Group entry".

Basically, when we ping the FIP from the undercloud, the packet is successfully DNAT'ed (0->17->19->21->25->27->21->GroupEntry->Table 220/drop), but when the FIB table (21) sends it to the group entry (which is stale), it is rewritten with wrong values and hence gets dropped in Table 220.

The stale group entry is as shown below.
 group_id=152503,type=all,bucket=actions=set_field:fa:16:3e:8f:a3:48->eth_src,set_field:fa:16:3e:0e:e2:7e->eth_dst,load:0x1d00->NXM_NX_REG6[],resubmit(,220)

i.e., group:152503
      set_field:fa:16:3e:8f:a3:48->eth_src ----> This is correct and points to the neutron router interface MAC
      set_field:fa:16:3e:0e:e2:7e->eth_dst ----> This is INCORRECT and points to MAC address of a deleted VM.
      load:0x1d00->NXM_NX_REG6[]
      resubmit(,220)

Ideally, group 152503 should be deleted when the corresponding VM is deleted, but since the compute node hosting the VM was down when nova delete was invoked for the VM, the entry may have remained stale (due to a possible race condition).

Another observation is that the tenant IP address allocated to the VM was recycled and assigned to a new VM.
Since the group entry was not deleted and remained stale on the compute node, the FIP use case fails.

I had a look at the config datastore and could see that the group entry there matched the entry on the compute node (i.e., the stale entry).
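This is not the actual fix, but as a hedged manual-cleanup sketch the stale group could be removed both on the switch and from the config datastore so that ODL does not re-program it (the RESTCONF path, DPN id and credentials are assumptions):

# On the compute node: drop the stale group from OVS
sudo ovs-ofctl -O OpenFlow13 del-groups br-int group_id=152503
# On a controller: remove it from the openflowplugin config datastore
curl -s -u admin:admin -X DELETE "http://localhost:8181/restconf/config/opendaylight-inventory:nodes/node/openflow:<dpnid>/flow-node-inventory:group/152503"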

Further analysis showed two main NullPointerExceptions in the Karaf console.

1. NPE at updateVpnInterfacesForUnProcessAdjancencies: I proposed the following patch, which addresses this exception: https://git.opendaylight.org/gerrit/#/c/73491/

2. NPE with "Unable to handle the TEP event": this issue was already fixed/merged upstream a few days ago as part of https://jira.opendaylight.org/browse/NETVIRT-1133.
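A hedged sketch for confirming these NPEs in the ODL logs (the karaf.log path is an assumption for a containerized OSP 13 deployment; adjust as needed):

sudo grep -c 'NullPointerException' /var/log/containers/opendaylight/karaf.log
sudo grep -A 20 'updateVpnInterfacesForUnProcessAdjancencies' /var/log/containers/opendaylight/karaf.log
sudo grep -A 20 'Unable to handle the TEP event' /var/log/containers/opendaylight/karaf.log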

I tried to reproduce the issue locally (1) by re-using the same IP address and (2) by resetting the compute node hosting the VM, but the issue did not reproduce.

Next Steps:
@Tomas, can you please retry the same use case with an image that includes the fix I proposed here - https://git.opendaylight.org/gerrit/#/c/73491/ - and let us know the results.

Comment 28 Ariel Adam 2018-08-13 10:38:16 UTC
Tomas, could we try and run an automation on this bug?
We'd like to see a long stable run.

Comment 41 errata-xmlrpc 2018-11-13 23:32:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3614

