Bug 1568989
Summary: | [BasicFunctioniality] Some VMs are unpingable through floating IP in OSP+ODL setup | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Sai Sindhur Malleni <smalleni>
Component: | puppet-tripleo | Assignee: | Tim Rozet <trozet>
Status: | CLOSED ERRATA | QA Contact: | Itzik Brown <itbrown>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 13.0 (Queens) | CC: | aadam, asuryana, itbrown, jchhatba, jjoyce, jluhrsen, jschluet, mkolesni, nyechiel, oblaut, sgaddam, skitt, slinaber, smalleni, tjamrisk, trozet, tvignaud, wznoinsk
Target Milestone: | rc | Keywords: | Triaged
Target Release: | 13.0 (Queens) | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | BasicFunctioniality | |
Fixed In Version: | puppet-tripleo-8.3.2-7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1588115 1588116 (view as bug list) | Environment: | N/A
Last Closed: | 2018-06-27 13:52:00 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Sai Sindhur Malleni, 2018-04-18 13:31:05 UTC
I reran the test case, booting and pinging 50 VMs. One VM remained unpingable. Here is the debug info. The FIP is 172.21.0.145.

    (overcloud) [stack@c08-h26-r630 ansible]$ openstack server list --all | grep 172.21.0.145
    | b9d6f2c8-6c1b-4644-a7df-86280a0fa2ac | s_rally_d355dea0_9Lo9XC0N | ACTIVE | s_rally_d355dea0_gBMBEqAr=10.2.19.6, 172.21.0.145 | cirros |

    neutron port-list --device-id=b9d6f2c8-6c1b-4644-a7df-86280a0fa2ac
    | id                                   | name | tenant_id                        | mac_address       | fixed_ips                                                                         |
    | bc295e73-a824-4156-b7ed-62a457243814 |      | 6fecd2ad58fc4a76bde60f5907b50786 | fa:16:3e:dd:be:e3 | {"subnet_id": "a88a3e91-adc5-4f9c-b9e3-f81ede8ae736", "ip_address": "10.2.19.6"}  |

Verified that the VM was able to DHCP by checking the nova console-log:

    Starting network...
    udhcpc (v1.20.1) started
    Sending discover...
    Sending select for 10.2.19.6...
    Lease of 10.2.19.6 obtained, lease time 86400
    route: SIOCADDRT: File exists
    WARN: failed: route add -net "0.0.0.0/0" gw "10.2.19.1"
    cirros-ds 'net' up at 0.86

According to Aswin, the issue seems to be a missing flow in table 21. Here are the flows on the compute node for the VM with FIP 172.21.0.145: http://file.rdu.redhat.com/~smalleni/flows

Based on the packet count, traffic seems to be reaching the compute node hosting FIP 172.21.0.145, but the reverse traffic is dropped in table 21 without any flows matching it. The table-miss flow that sends the traffic to table 26 for FIP translation is missing for metadata 0x30e26:

    table=19, n_packets=116, n_bytes=11368, priority=20,metadata=0x30e26/0xfffffe,dl_dst=fa:16:3e:44:af:8e actions=goto_table:21

Tested and can still reproduce it. Worth mentioning that we are testing clustered setups.

(In reply to Sai Sindhur Malleni from comment #12)
> Tested and can still reproduce it. Worth mentioning that we are testing
> clustered setups.

This seems to be similar to what our upstream CSIT deals with in those 3-node (aka clustered) setups. Instance IP connectivity has sporadic failures.

(In reply to Sai Sindhur Malleni from comment #12)
> Tested and can still reproduce it. Worth mentioning that we are testing
> clustered setups.

What's the ratio of VMs hitting this?

(In reply to Mike Kolesnik from comment #17)
> What's the ratio of VMs hitting this?

I think this issue is frequent when we create the neutron resources in parallel (concurrency 8 in this case).

I'm having an issue where VMs on one compute are pingable but not on the other compute. Is it the same on your setup?

(In reply to Itzik Brown from comment #21)
> I'm having an issue where VMs on one compute are pingable but not on the
> other compute. Is it the same on your setup?

It seems like the failure might manifest on specific compute nodes but not on others. If Sai/Janki can check if this is indeed the case I think it will help shed some light on this bug.
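One way to check this per compute node, tying back to the missing table 21 entry described in the original report, is to dump that table on each compute and look for an entry for the affected network's metadata. These are illustrative commands only, not taken from the original thread; the bridge name and the metadata value 0x30e26 come from the flow analysis above.

```shell
# Dump only table 21 on the compute node hosting the FIP
sudo ovs-ofctl -OOpenFlow13 dump-flows br-int table=21

# Look for an entry for the affected network's metadata (0x30e26 in this bug);
# per the analysis above, a healthy node should also have the miss flow that
# sends unmatched traffic on to table 26 for FIP translation.
sudo ovs-ofctl -OOpenFlow13 dump-flows br-int table=21 | grep 0x30e26
```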
Hey Mike,

Yes, in our setup all the failed VMs are on compute-1. 10 VMs failed ping and all 10 are on compute-1.

As another data point, we are also seeing this in CI with our upstream robot suites. I don't think we consider this a scale/perf issue anymore anyway, because the root cause is the missing flow in table 48, which Aswin says should be installed at startup/deployment anyway.

So, for one network we spin up three instances and two of those failed to get IPs, while another instance did.

The compute node with the failed instances has no flow in table=48:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/robot/report/log.html#s1-s1-t8-k7-k2-k1-k3-k1-k12-k4

whereas the other compute node does have that flow, which resubmits to tables 49 and 50:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/robot/report/log.html#s1-s1-t8-k7-k2-k1-k2-k1-k12-k4

We have some (if not all) of the relevant logs as build artifacts in zipped files per node here:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/artifact/

@Aswin, you asked me this on IRC, but I don't get it now:

    jamoluhrsen: is the ovs where the flow installation failed and the node which has the flow connected to same controller?

The OVS instances are connected to all three controllers. Looks like this:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/robot/report/log.html#s1-s1-t8-k7-k2-k1-k3-k1-k9-k4

or maybe you want to know which controller is the master vs the other two that would be slaves?
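To compare the two compute nodes directly for the table 48 symptom described above, a per-table dump of each node is enough. This is only a sketch, not a command from the thread; it assumes SSH access by hostname with the heat-admin overcloud user and the compute-0/compute-1 names used elsewhere in this bug.

```shell
# Run from the undercloud or any host that can reach the compute nodes.
# On the affected compute this prints nothing; on the healthy one it should
# show the table=48 entry that resubmits to tables 49 and 50.
for node in compute-0 compute-1; do
  echo "=== $node ==="
  ssh heat-admin@$node sudo ovs-ofctl -OOpenFlow13 dump-flows br-int table=48
done
```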
(In reply to jamo luhrsen from comment #31)
> or maybe you want to know which controller is the master vs the other two
> that would be slaves?

Yes, that was my question: which controller is the master for the OVS? I think this is not the same as the shard leader who does the flow programming.

These missing flows are due to the exception in bug 1573273.

When the exception happens in ElanNodeListener, all the flows programmed by that class will be missing, which includes the table 48 flows. (Other missing flows include the default miss entries in tables 50, 51, etc.)

This is quite random and can happen in multiple **NodeListener classes, and all the flows installed by those classes will be missing.

(In reply to Aswin Suryanarayanan from comment #33)
> This is quite random and can happen in multiple **NodeListener and all the
> flows installed in these classes will be missing.

This job [0] also has the symptom of a missing table=48 flow in one compute node. But I cannot find any "frozen class" errors in the opendaylight logs for that job. I don't know what that means, but maybe it's not totally 100% caused by that specific problem.

[0] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27
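For anyone repeating that kind of search, a simple sweep of the collected OpenDaylight container logs across all controllers could look like the sketch below. It is illustrative only: the "frozen class" string and the log path are taken from the comments in this bug, and the controller-0/controller-1 directory names are assumed by analogy with controller-2.

```shell
# Search each controller's collected OpenDaylight log for the error
# associated with bug 1573273 ("frozen class" wording from this thread).
for c in controller-0 controller-1 controller-2; do
  echo "=== $c ==="
  grep -i "frozen class" \
    ./$c/var/log/extra/docker/containers/opendaylight_api/stdout.log
done
```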
(In reply to Aswin Suryanarayanan from comment #32)
> Yes that was my question which controller is the master for the OVS. I think
> this not the same as the shard leader who does the flow programming.

It's easy to get confused with the logs across all the controllers and tracking down the right node, IP, MAC, etc.

BUT, I think I can say that the br-int on controller-2 was the SLAVE in this setup when it was finally deployed.

    [jluhrsen@jamo tmp]$ rg 206921423413162 ./controller-2/var/log/extra/docker/containers/opendaylight_api/stdout.log | rg 'SLAVE|MASTER'
    2018-05-01T20:01:59,070 | INFO | nioEventLoopGroup-9-2 | RoleContextImpl | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | Started timer for setting SLAVE role on device openflow:206921423413162 if no role will be set in 20s.
    2018-05-01T20:02:19,071 | INFO | pool-86-thread-1 | SalRoleServiceImpl | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | SetRole called with input:SetRoleInput [_controllerRole=BECOMESLAVE, _node=NodeRef [_value=KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes, org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node[key=NodeKey [_id=Uri [_value=openflow:206921423413162]]]]}], augmentation=[]]
    2018-05-01T20:02:19,071 | INFO | pool-86-thread-1 | SalRoleServiceImpl | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | RoleChangeTask called on device:openflow:206921423413162 OFPRole:BECOMESLAVE
    2018-05-01T20:02:19,073 | INFO | nioEventLoopGroup-9-2 | RoleService | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | submitRoleChange called for device:Uri [_value=openflow:206921423413162], role:BECOMESLAVE
    2018-05-01T20:02:19,074 | INFO | nioEventLoopGroup-9-2 | RoleService | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | submitRoleChange onSuccess for device:Uri [_value=openflow:206921423413162], role:BECOMESLAVE
    2018-05-01T20:02:19,074 | INFO | nioEventLoopGroup-9-2 | ContextChainHolderImpl | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | Role SLAVE was granted to device openflow:206921423413162

The MAC of br-int is bc:31:a5:f0:5f:aa, found here; that translates to 206921423413162 in decimal, which is the search I did above, showing it ended up as SLAVE.

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/robot/report/log.html#s1-s1-t8-k7-k2-k1-k3-k1-k11-k4
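For reference, the MAC-to-node-id translation used above is just a hex-to-decimal conversion of the bridge MAC with the colons stripped. This is a quick sketch, not part of the original comment:

```shell
# bc:31:a5:f0:5f:aa -> 0xbc31a5f05faa -> 206921423413162,
# i.e. the openflow:<datapath-id> that appears in the OpenDaylight logs
printf '%d\n' 0x$(echo bc:31:a5:f0:5f:aa | tr -d ':')
```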
(In reply to jamo luhrsen from comment #35)
> BUT, I think I can say that the br-int on controller-2 was the SLAVE in this
> setup when it was finally deployed.
There seems to be an issue other than the frozen class that is leading to missing flows. In all the instances I have observed so far, it seems to happen on the node where the OVS master is not the same as the shard leader. Also, the flows programmed by the node listeners are not all missing; they are missing only for some of the listeners. This seems to be some timing issue. We can use this bug to track the missing flows that are not caused by the frozen class.

Aswin, any update on this issue?

(In reply to Mike Kolesnik from comment #41)
> Aswin, any update on this issue?

I have patch [1] d/s and am currently testing it with CI to see if the issue is fixed; if not, we have to explore further to find the root cause.

[1] https://code.engineering.redhat.com/gerrit/#/c/138935/

*** Bug 1573224 has been marked as a duplicate of this bug. ***

A workaround: rebooting the compute node works.

Can someone please try setting the following parameter in your deployment and let us know if you are able to reproduce the issue?

    OpenDaylightCheckURL: diagstatus

(In reply to Tim Rozet from comment #46)
> Can someone please try setting the following parameter in your deployment
> and let us know if you are able to reproduce the issue?
> OpenDaylightCheckURL: diagstatus

I have a patch to do this for one of our jobs: https://code.engineering.redhat.com/gerrit/#/c/140309/

I gave it a -1 (see comments). It's running in my private job in the staging jenkins. I don't know if it will work or not; I've never tried a 3-node job on my baremetal yet.

https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/opendaylight/view/odl-netvirt/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit-jamo-poc/

I'll monitor and report back here.
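For anyone else trying the suggestion from comment #46, the parameter would normally be passed to the overcloud deploy through a custom environment file. The snippet below is only a sketch of the usual TripleO pattern; the file name is arbitrary and not part of this bug.

```shell
# Write a small environment file and include it in the deployment command, e.g.
#   openstack overcloud deploy ... -e odl-checkurl.yaml
cat > odl-checkurl.yaml <<'EOF'
parameter_defaults:
  OpenDaylightCheckURL: diagstatus
EOF
```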
(In reply to jamo luhrsen from comment #47)
> I have a patch to do this for one of our jobs:
> https://code.engineering.redhat.com/gerrit/#/c/140309/

The /diagstatus endpoint is not available in our d/s distro, but we have a related endpoint that we can use:

    jolokia/exec/org.opendaylight.infrautils.diagstatus:type=SvcStatus/acquireServiceStatus

I was able to reproduce the problem where table=48 is not present on one of the computes when using this other CheckUrl. Below is the full flow table of that compute.

    [heat-admin@compute-0 ~]$ sudo ovs-ofctl dump-flows br-int -OOpenFlow13
    cookie=0x8000001, duration=1907.386s, table=0, n_packets=600, n_bytes=56790, priority=5,in_port=tuna1eead404eb actions=write_metadata:0x50000000001/0xfffff0000000001,goto_table:36
    cookie=0x8000001, duration=1907.386s, table=0, n_packets=599, n_bytes=56724, priority=5,in_port=tunb1dbca2df9a actions=write_metadata:0x40000000001/0xfffff0000000001,goto_table:36
    cookie=0x8000001, duration=1902.056s, table=0, n_packets=599, n_bytes=57488, priority=5,in_port=tun486a7e7734e actions=write_metadata:0x90000000001/0xfffff0000000001,goto_table:36
    cookie=0x8000001, duration=1894.153s, table=0, n_packets=596, n_bytes=56436, priority=5,in_port=tun7f59979156c actions=write_metadata:0xe0000000001/0xfffff0000000001,goto_table:36
    cookie=0x8220015, duration=1906.986s, table=19, n_packets=0, n_bytes=0, priority=100,arp,arp_op=1 actions=resubmit(,17)
    cookie=0x8220016, duration=1906.986s, table=19, n_packets=0, n_bytes=0, priority=100,arp,arp_op=2 actions=resubmit(,17)
    cookie=0x1080000, duration=1906.986s, table=19, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)
    cookie=0x1030000, duration=1906.986s, table=20, n_packets=0, n_bytes=0, priority=0 actions=goto_table:80
    cookie=0x8000004, duration=1906.986s, table=22, n_packets=0, n_bytes=0, priority=0 actions=CONTROLLER:65535
    cookie=0x1080000, duration=1906.986s, table=23, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)
    cookie=0x822002e, duration=1907.434s, table=43, n_packets=0, n_bytes=0, priority=100,arp,arp_op=2 actions=CONTROLLER:65535,resubmit(,48)
    cookie=0x1030000, duration=1906.984s, table=80, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)
    cookie=0x6900000, duration=1907.434s, table=210, n_packets=0, n_bytes=0, priority=63009,arp actions=drop
    cookie=0x6900000, duration=1907.434s, table=210, n_packets=0, n_bytes=0, priority=61009,ipv6 actions=drop
    cookie=0x6900000, duration=1907.434s, table=210, n_packets=0, n_bytes=0, priority=61009,ip actions=drop
    cookie=0x6900000, duration=1907.434s, table=210, n_packets=0, n_bytes=0, priority=0 actions=drop
    cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,tcp actions=write_metadata:0/0x2,goto_table:212
    cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,udp actions=write_metadata:0/0x2,goto_table:212
    cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,icmp actions=write_metadata:0/0x2,goto_table:212
    cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,icmp6 actions=write_metadata:0/0x2,goto_table:212
    cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,tcp6 actions=write_metadata:0/0x2,goto_table:212
    cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,udp6 actions=write_metadata:0/0x2,goto_table:212
    cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=0 actions=write_metadata:0x2/0x2,goto_table:214
    cookie=0x6900000, duration=1907.434s, table=212, n_packets=0, n_bytes=0, priority=0 actions=drop
    cookie=0x6900000, duration=1907.434s, table=213, n_packets=0, n_bytes=0, priority=0 actions=goto_table:214
    cookie=0x6900000, duration=1907.434s, table=214, n_packets=0, n_bytes=0, priority=62030,ct_state=-new-est+rel-inv+trk,ct_mark=0x1/0x1 actions=ct_clear,resubmit(,17)
    cookie=0x6900000, duration=1907.434s, table=214, n_packets=0, n_bytes=0, priority=62030,ct_state=-new+est-rel-inv+trk,ct_mark=0x1/0x1 actions=ct_clear,resubmit(,17)
    cookie=0x6900000, duration=1907.434s, table=214, n_packets=0, n_bytes=0, priority=62030,ct_state=-trk actions=ct_clear,resubmit(,242)
    cookie=0x6900000, duration=1907.434s, table=214, n_packets=0, n_bytes=0, priority=0 actions=drop
    cookie=0x6900000, duration=1907.434s, table=215, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,214)
    cookie=0x6900000, duration=1907.434s, table=216, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,214)
    cookie=0x6900000, duration=1907.434s, table=217, n_packets=0, n_bytes=0, priority=0 actions=drop
    cookie=0x8000007, duration=1906.922s, table=220, n_packets=0, n_bytes=0, priority=9,reg6=0x500 actions=output:tuna1eead404eb
    cookie=0x8000007, duration=1906.765s, table=220, n_packets=0, n_bytes=0, priority=9,reg6=0x400 actions=output:tunb1dbca2df9a
    cookie=0x8000007, duration=1901.824s, table=220, n_packets=0, n_bytes=0, priority=9,reg6=0x900 actions=output:tun486a7e7734e
    cookie=0x8000007, duration=1893.413s, table=220, n_packets=0, n_bytes=0, priority=9,reg6=0xe00 actions=output:tun7f59979156c
    cookie=0x6900000, duration=1907.434s, table=239, n_packets=0, n_bytes=0, priority=100,ip actions=ct_clear,goto_table:240
    cookie=0x6900000, duration=1907.434s, table=239, n_packets=0, n_bytes=0, priority=100,ipv6 actions=ct_clear,goto_table:240
    cookie=0x6900000, duration=1907.434s, table=239, n_packets=0, n_bytes=0, priority=0 actions=goto_table:240
    cookie=0x6900000, duration=1907.434s, table=240, n_packets=0, n_bytes=0, priority=61010,ip,dl_dst=ff:ff:ff:ff:ff:ff,nw_dst=255.255.255.255 actions=goto_table:241
    cookie=0x6900000, duration=1907.434s, table=240, n_packets=0, n_bytes=0, priority=61005,dl_dst=ff:ff:ff:ff:ff:ff actions=resubmit(,220)
    cookie=0x6900000, duration=1907.434s, table=240, n_packets=0, n_bytes=0, priority=0 actions=drop
    cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,icmp actions=write_metadata:0/0x2,goto_table:242
    cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,icmp6 actions=write_metadata:0/0x2,goto_table:242
    cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,tcp6 actions=write_metadata:0/0x2,goto_table:242
    cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,udp6 actions=write_metadata:0/0x2,goto_table:242
    cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,tcp actions=write_metadata:0/0x2,goto_table:242
    cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,udp actions=write_metadata:0/0x2,goto_table:242
    cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=0 actions=write_metadata:0x2/0x2,goto_table:244
    cookie=0x6900000, duration=1907.434s, table=242, n_packets=0, n_bytes=0, priority=0 actions=drop
    cookie=0x6900000, duration=1907.434s, table=243, n_packets=0, n_bytes=0, priority=0 actions=goto_table:244
    cookie=0x6900000, duration=1907.434s, table=244, n_packets=0, n_bytes=0, priority=62030,ct_state=-trk actions=ct_clear,resubmit(,242)
    cookie=0x6900000, duration=1907.434s, table=244, n_packets=0, n_bytes=0, priority=62030,ct_state=-new-est+rel-inv+trk,ct_mark=0x1/0x1 actions=ct_clear,resubmit(,220)
    cookie=0x6900000, duration=1907.434s, table=244, n_packets=0, n_bytes=0, priority=62030,ct_state=-new+est-rel-inv+trk,ct_mark=0x1/0x1 actions=ct_clear,resubmit(,220)
    cookie=0x6900000, duration=1907.434s, table=244, n_packets=0, n_bytes=0, priority=0 actions=drop
    cookie=0x6900000, duration=1907.434s, table=245, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,244)
    cookie=0x6900000, duration=1907.434s, table=246, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,244)
    cookie=0x6900000, duration=1907.434s, table=247, n_packets=0, n_bytes=0, priority=0 actions=drop
    [heat-admin@compute-0 ~]$

If you disconnect and reconnect the OpenFlow manager on an affected OVS node, the flows will be reprogrammed and that should rectify this issue. This is only a workaround, and the root cause should still be found and fixed.

Steps (NOTE: replace the IP addresses as needed):

    ovs-vsctl del-controller br-int
    ovs-vsctl set-controller br-int tcp:172.17.1.16:6653 tcp:172.17.1.20:6653 tcp:172.17.1.24:6653

This was observed in a non-clustered environment as well. Here table 19 was missing, but the other tables were present. The flows were present in the config DS. When the controller was set again, the flows were programmed and the FIP started working.

The plan is to insert a workaround into puppet-tripleo to resync the OVS OpenFlow tables with ODL when tables are missing.

Patch posted upstream, which works locally for me. Need to test in a deployment.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086