Bug 1600076
| Summary: | [HA] Missing rules makes external IP unreachable from network namespace | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Janki <jchhatba> | ||||
| Component: | opendaylight | Assignee: | Vishal Thapar <vthapar> | ||||
| Status: | CLOSED WORKSFORME | QA Contact: | Noam Manos <nmanos> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 13.0 (Queens) | CC: | aadam, asuryana, jchhatba, mkolesni, nyechiel | ||||
| Target Milestone: | z3 | Keywords: | Triaged, ZStream | ||||
| Target Release: | 13.0 (Queens) | Flags: | jchhatba: needinfo- | ||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | HA | ||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | N/A | |||||
| Last Closed: | 2018-10-09 12:12:49 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: | ||||||
Janki, can you reproduce it with a ping from within a VM to an external IP?
After careful analysis of the logs and odl-dumps, the issue is as follows.
We are able to ping the DCGW (10.0.0.1) but unable to ping an external IP (such as 8.8.8.8).
The FIB table flow used for the ping to the DCGW [*] differs slightly from the one used for the external IP [#].
In the case of the DCGW, the use case is treated as a PNF use case, whereas for the external IP it is handled like a defaultSubnetRoute.
[*] table=21, n_packets=30, n_bytes=2909, priority=42,ip,metadata=0x324b2/0xfffffe,nw_dst=10.0.0.1 actions=set_field:52:54:00:15:d5:42->eth_dst,load:0x1800->NXM_NX_REG6[],resubmit(,220)
[#] table=21, n_packets=2024, n_bytes=198352, priority=10,ip,metadata=0x324b2/0xfffffe actions=group:227500
group_id=227500,type=all,bucket=actions=drop
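The symptom above (a group whose only bucket drops traffic) can be checked mechanically. The following is a minimal sketch, assuming the group table has been dumped as text in the one-line-per-group form shown above (for example with `ovs-ofctl -O OpenFlow13 dump-groups br-int`, where the bridge name is an assumption). The function name `find_drop_only_groups` and the second sample group are hypothetical and only serve as illustration.

```python
import re


def find_drop_only_groups(dump):
    """Return ids of groups whose buckets all consist of a bare 'drop' action."""
    drop_only = []
    for line in dump.splitlines():
        match = re.search(r"group_id=(\d+)", line)
        if not match:
            continue
        # Everything after each 'bucket=' is one bucket definition.
        buckets = line.split("bucket=")[1:]
        actions = [b.split("actions=", 1)[-1].strip().rstrip(",") for b in buckets]
        if buckets and all(a == "drop" for a in actions):
            drop_only.append(int(match.group(1)))
    return drop_only


# First line is taken from this bug; the second group is a made-up healthy one.
sample = """\
group_id=227500,type=all,bucket=actions=drop
group_id=1,type=all,bucket=actions=output:2
"""
print(find_drop_only_groups(sample))
```

On the sample, only group 227500 is reported, which matches the default-route flow in table 21 that sends external traffic to `group:227500`.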
Coming back to the actual issue: on Controller-2, when ExternalNetworkGroupInstaller tried to program group_id=227500 (at 2018-07-10T11:02:03,304, say t1), the DCGW MAC had not yet been resolved. The group entry therefore had no action to set the destination MAC address, and the packets were dropped.
Logs: installExtNetGroupEntries : Installing ext-net group 227500 entry for subnet 046eb4f0-86e1-4ab8-b784-aa8ccc9aecb6 with macAddress null
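To find other occurrences of this condition in a karaf log, the message above can be matched with a regular expression. This is a minimal sketch that assumes the message format is exactly as quoted in this comment; the helper name `groups_installed_without_mac` is hypothetical.

```python
import re

# Pattern built from the log message quoted in this bug.
NULL_MAC = re.compile(
    r"Installing ext-net group (?P<group>\d+) entry for subnet "
    r"(?P<subnet>[0-9a-f-]+) with macAddress null"
)


def groups_installed_without_mac(log_lines):
    """Yield (group_id, subnet_id) for ext-net groups installed with a null MAC."""
    for line in log_lines:
        m = NULL_MAC.search(line)
        if m:
            yield int(m.group("group")), m.group("subnet")


line = (
    "installExtNetGroupEntries : Installing ext-net group 227500 entry for "
    "subnet 046eb4f0-86e1-4ab8-b784-aa8ccc9aecb6 with macAddress null"
)
print(list(groups_installed_without_mac([line])))
```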
The DCGW MAC was actually resolved on Controller-0 (as it was the southbound leader processing the PacketIns) at 2018-07-10T11:02:08,731 (say t2).
Logs: createLearntVpnVipToPort: ARP learned for fixedIp: 10.0.0.1, vpn e85c5d41-c6b9-4445-abe4-e5b7b3bfec9d, interface 36012076628633:br-ex-patch:trunk, mac 52:54:00:15:D5:42, isSubnetIp {} added to VpnPortipToPort DS
Netvirt currently has a listener (SubnetGwMacChangeListener) which is notified when the LearntVpnVipToPort model is updated.
This class takes care of updating the externalNetworkGroupEntries accordingly. But it is an AsyncDataTreeChangeListenerBase, which is notified only on the shard leader.
It appears that the shard leader for the cluster was Controller-1 and, based on the logs, it went into some inactive state between t1 and t2.
Exception: 2018-07-10T11:02:04,081 | ERROR | CommitFutures-6 | TransactionChainManager | 385 - org.opendaylight.openflowplugin.common - 0.6.0.redhat-9 | Transaction commit failed.
<SNIP>
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ModifiedNodeDoesNotExistException: Node /(urn:opendaylight:inventory?revision=2013-08-19)nodes/node/node[{(urn:opendaylight:inventory?revision=2013-08-19)id=openflow:242085300750378}]/AugmentationIdentifier{childNames=[(urn:opendaylight:flow:inventory?revision=2013-08-19)description, (urn:opendaylight:flow:inventory?revision=2013-08-19)supported-actions, (urn:opendaylight:flow:inventory?revision=2013-08-19)hardware, (urn:opendaylight:flow:inventory?revision=2013-08-19)switch-features, (urn:opendaylight:flow:inventory?revision=2013-08-19)stale-meter, (urn:opendaylight:flow:inventory?revision=2013-08-19)supported-instructions, (urn:opendaylight:flow:inventory?revision=2013-08-19)meter, (urn:opendaylight:flow:inventory?revision=2013-08-19)serial-number, (urn:opendaylight:flow:inventory?revision=2013-08-19)stale-group, (urn:opendaylight:flow:inventory?revision=2013-08-19)supported-match-types, (urn:opendaylight:flow:inventory?revision=2013-08-19)port-number, (urn:opendaylight:flow:inventory?revision=2013-08-19)table, (urn:opendaylight:flow:inventory?revision=2013-08-19)group, (urn:opendaylight:flow:inventory?revision=2013-08-19)manufacturer, (urn:opendaylight:flow:inventory?revision=2013-08-19)table-features, (urn:opendaylight:flow:inventory?revision=2013-08-19)software, (urn:opendaylight:flow:inventory?revision=2013-08-19)ip-address]}/(urn:opendaylight:flow:inventory?revision=2013-08-19)group/group[{(urn:opendaylight:flow:inventory?revision=2013-08-19)group-id=227501}] does not exist. Cannot apply modification to its children.
In the above exception, the nodeID openflow:242085300750378 corresponds to Controller-1.
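The ordering of the three events quoted in this comment can be verified from their timestamps. This is a small sketch assuming the karaf timestamp format `YYYY-MM-DDTHH:MM:SS,mmm` as it appears in the excerpts; the variable names are illustrative only.

```python
from datetime import datetime

# Karaf timestamps use a comma before the millisecond part.
KARAF_TS = "%Y-%m-%dT%H:%M:%S,%f"


def parse_ts(text):
    """Parse a karaf log timestamp such as 2018-07-10T11:02:03,304."""
    return datetime.strptime(text, KARAF_TS)


t1 = parse_ts("2018-07-10T11:02:03,304")              # group installed with null MAC
t2 = parse_ts("2018-07-10T11:02:08,731")              # DCGW MAC learned via ARP
commit_failure = parse_ts("2018-07-10T11:02:04,081")  # transaction commit failed

gap_seconds = (t2 - t1).total_seconds()
failure_in_window = t1 <= commit_failure <= t2
print(gap_seconds, failure_in_window)
```

The commit failure falls inside the roughly five-second window between t1 and t2, which is consistent with the listener not updating the group entry during that interval.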
I tried to look at the logs on Controller-1, but because of a repeated debug message the attached log file does not contain sufficient information.
Though not related to this bug, we have to analyse why the following debug message appears repeatedly in the console of Controller-1:
https://gist.githubusercontent.com/sridhargaddam/5c31ef0d4babe21f2f8ddda2375eaa1b/raw/eb21f0623380f5d7feef11e6f51f708bb2cc90b8/RHBZ%25231600076%2520log
Based on a code walkthrough, I believe SubnetGwMacChangeListener properly handles updating the ExternalNetworkGroupEntries if it is fired (on the shard leader).
Janki, as discussed in IRC, can you update this as not reproducible? I'll leave it up to you to close it or just reduce priority.
Created attachment 1458044 [details]
karaf logs, data dumps, ovs logs, flows, OS CLI outputs

Description of problem:
We are able to ping to the DC-GW but not to any externalIP (like 8.8.8.8)

Version-Release number of selected component (if applicable):
OSP13+Oxygen

How reproducible:
Sometimes

Steps to Reproduce:
1. Create a network and attach to a router.
2. Attach router to external network
3. From network ns on controller node, ping 8.8.8.8

Actual results:
Ping to Google DNS server fails

Expected results:
Google DNS server should be reachable

Additional info:
This is a new issue where group entry is present but it does not have the right actions. It only has a drop rule and hence the use-case is failing
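
The reproduction steps in the description can be sketched as CLI invocations. This is only an illustration, not the exact commands used by the reporter: all resource names, the subnet range, the external network name, and the namespace name `qdhcp-<network-id>` are assumptions. The sketch only builds and prints the commands and does not execute them.

```python
import shlex


def reproduction_commands(net="net1", subnet="subnet1", cidr="192.168.10.0/24",
                          router="router1", ext_net="public",
                          netns="qdhcp-<network-id>"):
    """Build the command lines for the three reproduction steps (names are placeholders)."""
    return [
        # Step 1: create a network with a subnet and attach it to a router.
        ["openstack", "network", "create", net],
        ["openstack", "subnet", "create", "--network", net,
         "--subnet-range", cidr, subnet],
        ["openstack", "router", "create", router],
        ["openstack", "router", "add", "subnet", router, subnet],
        # Step 2: attach the router to the external network.
        ["openstack", "router", "set", "--external-gateway", ext_net, router],
        # Step 3: ping an external IP from the network namespace on the controller.
        ["ip", "netns", "exec", netns, "ping", "-c", "3", "8.8.8.8"],
    ]


for cmd in reproduction_commands():
    print(shlex.join(cmd))
```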