Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1600076

Summary: [HA] Missing rules makes external IP unreachable from network namespace
Product: Red Hat OpenStack
Reporter: Janki <jchhatba>
Component: opendaylight
Assignee: Vishal Thapar <vthapar>
Status: CLOSED WORKSFORME
QA Contact: Noam Manos <nmanos>
Severity: high
Priority: high
Version: 13.0 (Queens)
CC: aadam, asuryana, jchhatba, mkolesni, nyechiel
Target Milestone: z3
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Flags: jchhatba: needinfo-, jchhatba: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard: HA
Doc Type: If docs needed, set a value
Environment: N/A
Last Closed: 2018-10-09 12:12:49 UTC
Type: Bug
Attachments:
karaf logs, data dumps, ovs logs, flows, OS CLI outputs (flags: none)

Description Janki 2018-07-11 11:29:37 UTC
Created attachment 1458044
karaf logs, data dumps, ovs logs, flows, OS CLI outputs

Description of problem:
We are able to ping the DC-GW but not any external IP (like 8.8.8.8).

Version-Release number of selected component (if applicable):
OSP13+Oxygen

How reproducible:
Sometimes

Steps to Reproduce:
1. Create a network and attach it to a router.
2. Attach the router to the external network.
3. From the network namespace on the controller node, ping 8.8.8.8.

Actual results:
Ping to Google DNS server fails

Expected results:
Google DNS server should be reachable

Additional info:
This is a new issue where the group entry is present but does not have the right actions. It only has a drop action, and hence the use-case is failing.

Comment 1 Mike Kolesnik 2018-07-12 08:38:22 UTC
Janki, can you reproduce it with ping from within a VM to an external IP?

Comment 2 Sridhar Gaddam 2018-07-19 19:23:28 UTC
On careful analysis of the logs and odl-dumps, the issue is as follows.
We are able to ping the DCGW (10.0.0.1) but are unable to ping an external IP (like 8.8.8.8).

The flows used to support ping to the DCGW [*] are slightly different from those used for the external IP [#] in the FIB table (table 21).
In the case of the DCGW, the use-case is treated as a PNF use-case, whereas for the external IP it is handled like a defaultSubnetRoute.

[*] table=21, n_packets=30, n_bytes=2909, priority=42,ip,metadata=0x324b2/0xfffffe,nw_dst=10.0.0.1 actions=set_field:52:54:00:15:d5:42->eth_dst,load:0x1800->NXM_NX_REG6[],resubmit(,220)
[#] table=21, n_packets=2024, n_bytes=198352, priority=10,ip,metadata=0x324b2/0xfffffe actions=group:227500
    group_id=227500,type=all,bucket=actions=drop
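
A quick way to spot this symptom in a group dump is to look for groups whose every bucket has only the drop action. The following is a minimal sketch, assuming the dump uses the same line format as the group entry quoted above; the function name drop_only_groups and the second sample group (id 1000, with its actions) are hypothetical and only there for contrast.

```python
import re


def drop_only_groups(dump_text):
    """Return the ids of groups whose buckets all carry only 'drop'."""
    flagged = []
    for line in dump_text.splitlines():
        match = re.search(r"group_id=(\d+)", line)
        if not match:
            continue
        # Everything after each "bucket=" describes one bucket.
        buckets = line.split("bucket=")[1:]
        actions = []
        for bucket in buckets:
            # Keep only the text after "actions=" for this bucket.
            _, _, acts = bucket.partition("actions=")
            actions.append(acts.strip().rstrip(","))
        if actions and all(a == "drop" for a in actions):
            flagged.append(int(match.group(1)))
    return flagged


# First line is the group from this bug; the second is a made-up healthy group.
sample = """\
 group_id=227500,type=all,bucket=actions=drop
 group_id=1000,type=all,bucket=actions=set_field:52:54:00:15:d5:42->eth_dst,output:1
"""

print(drop_only_groups(sample))  # [227500]
```

Any group id reported this way is a candidate for the problem described here and can be cross-checked against the table=21 flow that points at it.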

Now coming back to the actual issue: on Controller-2, when ExternalNetworkGroupInstaller tried to program group_id=227500 (at 2018-07-10T11:02:03,304, say t1), the DCGW MAC had not yet been resolved. The group entry therefore did not have the action to set the destination MAC address, and hence the packets were getting dropped.
Logs: installExtNetGroupEntries : Installing ext-net group 227500 entry for subnet 046eb4f0-86e1-4ab8-b784-aa8ccc9aecb6 with macAddress null 

The DCGW MAC was actually resolved on Controller-0 (as it was the southbound leader processing the packetIns) at 2018-07-10T11:02:08,731 (say t2).
Logs: createLearntVpnVipToPort: ARP learned for fixedIp: 10.0.0.1, vpn e85c5d41-c6b9-4445-abe4-e5b7b3bfec9d, interface 36012076628633:br-ex-patch:trunk, mac 52:54:00:15:D5:42, isSubnetIp {} added to VpnPortipToPort DS
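
To make the size of the window between t1 and t2 explicit, the two karaf timestamps quoted above can be subtracted. This is a small sketch; the format string and the helper name parse_ts are assumptions based on how the timestamps appear in these log excerpts.

```python
from datetime import datetime

# Timestamp layout as it appears in the karaf log excerpts in this bug.
KARAF_TS = "%Y-%m-%dT%H:%M:%S,%f"


def parse_ts(text):
    """Parse a karaf-style timestamp such as 2018-07-10T11:02:03,304."""
    return datetime.strptime(text, KARAF_TS)


t1 = parse_ts("2018-07-10T11:02:03,304")  # group 227500 installed with macAddress null
t2 = parse_ts("2018-07-10T11:02:08,731")  # ARP for 10.0.0.1 learned on Controller-0

gap = (t2 - t1).total_seconds()
print(f"{gap:.3f} s")  # 5.427 s
```

So the group was programmed roughly 5.4 seconds before the gateway MAC became known.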

Netvirt currently has a listener (SubnetGwMacChangeListener) which gets notified when the LearntVpnVipToPort model is updated.
This class takes care of updating the externalNetworkGroupEntries accordingly. But it is an AsyncDataTreeChangeListenerBase, which gets notified only on the shard leader.
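
The suspected sequence can be illustrated with a toy model. This is not Netvirt code: ExtNetGroup, on_gw_mac_learned and the listener_active flag are hypothetical names, and the single set_field action is a placeholder for whatever the real group entry contains once the MAC is known.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtNetGroup:
    """Toy stand-in for an external-network group entry."""
    group_id: int
    gw_mac: Optional[str] = None

    def bucket_actions(self) -> List[str]:
        # With no gateway MAC there is no eth_dst to set, so the
        # bucket only drops (as observed for group 227500).
        if self.gw_mac is None:
            return ["drop"]
        # Placeholder: the real entry is assumed to carry more actions.
        return [f"set_field:{self.gw_mac}->eth_dst"]


def on_gw_mac_learned(group: ExtNetGroup, mac: str, listener_active: bool) -> ExtNetGroup:
    """Toy version of the MAC-change notification.

    The group is refreshed only when the listener actually fires,
    which in this model means the shard leader was active.
    """
    if listener_active:
        group.gw_mac = mac
    return group


group = ExtNetGroup(227500)           # t1: installed with macAddress null
print(group.bucket_actions())         # ['drop']

# t2: MAC learned, but the listener does not fire (leader inactive).
on_gw_mac_learned(group, "52:54:00:15:d5:42", listener_active=False)
print(group.bucket_actions())         # ['drop']
```

In this model the group stays drop-only until a notification is delivered with listener_active=True, which mirrors the hypothesis that the update was missed while the shard leader was inactive.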

It appears that the shard leader for the cluster was Controller-1 and, based on the logs, it went into some inactive state between t1 and t2.
Exception: 2018-07-10T11:02:04,081 | ERROR | CommitFutures-6  | TransactionChainManager          | 385 - org.opendaylight.openflowplugin.common - 0.6.0.redhat-9 | Transaction commit failed.
<SNIP>
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ModifiedNodeDoesNotExistException: Node /(urn:opendaylight:inventory?revision=2013-08-19)nodes/node/node[{(urn:opendaylight:inventory?revision=2013-08-19)id=openflow:242085300750378}]/AugmentationIdentifier{childNames=[(urn:opendaylight:flow:inventory?revision=2013-08-19)description, (urn:opendaylight:flow:inventory?revision=2013-08-19)supported-actions, (urn:opendaylight:flow:inventory?revision=2013-08-19)hardware, (urn:opendaylight:flow:inventory?revision=2013-08-19)switch-features, (urn:opendaylight:flow:inventory?revision=2013-08-19)stale-meter, (urn:opendaylight:flow:inventory?revision=2013-08-19)supported-instructions, (urn:opendaylight:flow:inventory?revision=2013-08-19)meter, (urn:opendaylight:flow:inventory?revision=2013-08-19)serial-number, (urn:opendaylight:flow:inventory?revision=2013-08-19)stale-group, (urn:opendaylight:flow:inventory?revision=2013-08-19)supported-match-types, (urn:opendaylight:flow:inventory?revision=2013-08-19)port-number, (urn:opendaylight:flow:inventory?revision=2013-08-19)table, (urn:opendaylight:flow:inventory?revision=2013-08-19)group, (urn:opendaylight:flow:inventory?revision=2013-08-19)manufacturer, (urn:opendaylight:flow:inventory?revision=2013-08-19)table-features, (urn:opendaylight:flow:inventory?revision=2013-08-19)software, (urn:opendaylight:flow:inventory?revision=2013-08-19)ip-address]}/(urn:opendaylight:flow:inventory?revision=2013-08-19)group/group[{(urn:opendaylight:flow:inventory?revision=2013-08-19)group-id=227501}] does not exist. Cannot apply modification to its children.

In the above exception, the node ID openflow:242085300750378 corresponds to Controller-1.
I tried to look at the logs on Controller-1, but because of a repeated debug message the attached log file does not have sufficient information.
Though not related to this bug, we have to analyse why the following debug message keeps appearing in the console of Controller-1:
https://gist.githubusercontent.com/sridhargaddam/5c31ef0d4babe21f2f8ddda2375eaa1b/raw/eb21f0623380f5d7feef11e6f51f708bb2cc90b8/RHBZ%25231600076%2520log

Based on the code walkthrough, I feel that SubnetGwMacChangeListener properly handles updating the ExternalNetworkGroupEntries if it gets fired (on the shard leader).

Comment 6 Vishal Thapar 2018-09-20 13:19:33 UTC
Janki, as discussed on IRC, can you update this as not reproducible? I'll leave it up to you to close it or just reduce the priority.