Bug 1568989 - [BasicFunctioniality] Some VMs are unpingable through floating IP in OSP+ODL setup
Summary: [BasicFunctioniality] Some VMs are unpingable through floating IP in OSP+ODL ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 13.0 (Queens)
Assignee: Tim Rozet
QA Contact: Itzik Brown
URL:
Whiteboard: BasicFunctioniality
Duplicates: 1573224
Depends On:
Blocks:
 
Reported: 2018-04-18 13:31 UTC by Sai Sindhur Malleni
Modified: 2018-10-18 07:21 UTC
CC List: 18 users

Fixed In Version: puppet-tripleo-8.3.2-7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1588115, 1588116
Environment:
N/A
Last Closed: 2018-06-27 13:52:00 UTC
Target Upstream Version:



Links
System ID Priority Status Summary Last Updated
Launchpad 1775436 None None None 2018-06-06 17:04:10 UTC
OpenStack gerrit 572804 None master: MERGED puppet-tripleo: Adds check and resyncs ODL/OVS OF pipeline (I28d13a26198268cfd1f3e9e64236605f24319a04) 2018-06-08 15:53:20 UTC
OpenStack gerrit 573226 None stable/queens: NEW puppet-tripleo: Adds check and resyncs ODL/OVS OF pipeline (I28d13a26198268cfd1f3e9e64236605f24319a04) 2018-06-08 15:53:10 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:52:54 UTC

Description Sai Sindhur Malleni 2018-04-18 13:31:05 UTC
Description of problem:
When running a Browbeat+Rally use case that does:
Create a network
Create a subnet
Create a router
Attach router to subnet and public network
Boot VM with floating IP
Ping VM

with concurrency 8 and times set to 50, we see that some VMs remain unpingable even after 300 seconds.

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-5.el7ost.noarch
OSP 13


How reproducible:
About 10-15% of the VMs spawned are always unpingable.

Steps to Reproduce:
1. Install OSP with ODL
2. Run scale use case as above
3.

Actual results:
Some VMs are unpingable until the test stops trying after a 300s timeout

Expected results:
All VMs should be pingable.

Additional info:

Comment 4 Sai Sindhur Malleni 2018-04-19 20:07:02 UTC
I reran the test case, trying to boot and ping 50 VMs. One VM remained unpingable. Here is the debug info.

The FIP is 172.21.0.145
(overcloud) [stack@c08-h26-r630 ansible]$ openstack server list --all | grep 172.21.0.145
| b9d6f2c8-6c1b-4644-a7df-86280a0fa2ac | s_rally_d355dea0_9Lo9XC0N | ACTIVE | s_rally_d355dea0_gBMBEqAr=10.2.19.6, 172.21.0.145  | cirros     |


neutron port-list --device-id=b9d6f2c8-6c1b-4644-a7df-86280a0fa2ac
+--------------------------------------+------+----------------------------------+-------------------+----------------------------------------------------------------------------------+
| id                                   | name | tenant_id                        | mac_address       | fixed_ips                                                                        |
+--------------------------------------+------+----------------------------------+-------------------+----------------------------------------------------------------------------------+
| bc295e73-a824-4156-b7ed-62a457243814 |      | 6fecd2ad58fc4a76bde60f5907b50786 | fa:16:3e:dd:be:e3 | {"subnet_id": "a88a3e91-adc5-4f9c-b9e3-f81ede8ae736", "ip_address": "10.2.19.6"} |
+--------------------------------------+------+----------------------------------+-------------------+----------------------------------------------------------------------------------+


Verified that the VM was able to DHCP by checking nova console-log:
Starting network...
udhcpc (v1.20.1) started
Sending discover...
Sending select for 10.2.19.6...
Lease of 10.2.19.6 obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.2.19.1"
cirros-ds 'net' up at 0.86


According to Aswin, the issue seems to be a missing flow in table 21.

Comment 5 Sai Sindhur Malleni 2018-04-19 20:18:01 UTC
Here are the flows on the compute node for the VM with FIP 172.21.0.145:
http://file.rdu.redhat.com/~smalleni/flows

Comment 6 Aswin Suryanarayanan 2018-04-20 12:42:50 UTC
Based on the packet count, traffic seems to be reaching the compute node with FIP 172.21.0.145, but the reverse traffic is dropped in table 21 without any flows matching it. The table-miss flow that sends the traffic to table 26 for FIP translation is missing for metadata 0x30e26:

table=19, n_packets=116, n_bytes=11368, priority=20,metadata=0x30e26/0xfffffe,dl_dst=fa:16:3e:44:af:8e actions=goto_table:21
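When digging through a saved `ovs-ofctl dump-flows` output for a specific metadata value like this, the relevant fields can be pulled out with a few lines of Python. This is a throwaway sketch for eyeballing dumps, not project code, and the field names in the returned dict are my own:

```python
import re

def parse_flow(line):
    """Extract table, metadata match (value/mask), and actions from one
    ovs-ofctl dump-flows line. A rough helper, not a full OpenFlow parser."""
    out = {}
    m = re.search(r"table=(\d+)", line)
    if m:
        out["table"] = int(m.group(1))
    m = re.search(r"metadata=(0x[0-9a-fA-F]+)(?:/(0x[0-9a-fA-F]+))?", line)
    if m:
        out["metadata"] = int(m.group(1), 16)
        out["mask"] = int(m.group(2), 16) if m.group(2) else None
    m = re.search(r"actions=(\S+)", line)
    if m:
        out["actions"] = m.group(1)
    return out

# The flow from this comment:
flow = ("table=19, n_packets=116, n_bytes=11368, priority=20,"
        "metadata=0x30e26/0xfffffe,dl_dst=fa:16:3e:44:af:8e actions=goto_table:21")
print(parse_flow(flow))
```

Filtering a whole dump for `metadata=0x30e26` with this makes it quick to see which tables have flows for the affected network and which do not.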

Comment 12 Sai Sindhur Malleni 2018-04-30 21:03:43 UTC
Tested and can still reproduce it. Worth mentioning that we are testing clustered setups.

Comment 13 jamo luhrsen 2018-04-30 21:07:25 UTC
(In reply to Sai Sindhur Malleni from comment #12)
> Tested and can still reproduce it. Worth mentioning that we are testing
> clustered setups.

This seems to be similar to what our upstream CSIT deals with in those
3-node (aka clustered) setups. Instance IP connectivity has sporadic
failures.

Comment 17 Mike Kolesnik 2018-05-01 04:48:53 UTC
(In reply to Sai Sindhur Malleni from comment #12)
> Tested and can still reproduce it. Worth mentioning that we are testing
> clustered setups.

What's the ratio of VMs hitting this?

Comment 20 Aswin Suryanarayanan 2018-05-01 07:19:05 UTC
(In reply to Mike Kolesnik from comment #17)
> (In reply to Sai Sindhur Malleni from comment #12)
> > Tested and can still reproduce it. Worth mentioning that we are testing
> > clustered setups.
> 
> What's the ratio of VMs hitting this?

I think this issue is frequent when we create the neutron resources in parallel (concurrency 8 in this case).

Comment 21 Itzik Brown 2018-05-01 11:51:38 UTC
I'm having an issue where VMs on one compute are pingable but not on the other compute. Is it the same on your setup?

Comment 23 Mike Kolesnik 2018-05-01 12:05:11 UTC
(In reply to Itzik Brown from comment #21)
> I'm having an issue where VMs on one compute are pingable but no on the
> other compute. Is it the same on your setup?

It seems like the failure might manifest on specific compute nodes but not on others.
If Sai/Janki can check if this is indeed the case I think it will help shed some light on this bug.

Comment 24 Sai Sindhur Malleni 2018-05-01 16:51:15 UTC
Hey Mike,

Yes, in our setup all the failed VMs are on compute-1. 10 VMs failed ping and all 10 are on compute-1.

Comment 31 jamo luhrsen 2018-05-03 00:11:36 UTC
As another data point, we are also seeing this in CI with our upstream robot
suites. I don't think this is a scale/perf issue anymore, because the root
cause is the missing flow in table 48, which Aswin says should
be programmed at startup/deployment.

So, for one network we spin up three instances, and two of those failed to get
IPs, while the third did.

The compute node with the failed instances has no flow in table=48:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/robot/report/log.html#s1-s1-t8-k7-k2-k1-k3-k1-k12-k4

whereas the other compute node does have that flow which resubmits to tables
49 and 50:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/robot/report/log.html#s1-s1-t8-k7-k2-k1-k2-k1-k12-k4


We have some (if not all) of the relevant logs as build artifacts in
zipped files per node here:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/artifact/


@Aswin, you asked me this on IRC, but I don't get it now:

    jamoluhrsen: is the ovs where the flow installation failed and the node which has the flow connected to same controller?

The ovs instances are connected to all three controllers. Looks like this:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/robot/report/log.html#s1-s1-t8-k7-k2-k1-k3-k1-k9-k4

or maybe you want to know which controller is the master vs the other two that
would be slaves?

Comment 32 Aswin Suryanarayanan 2018-05-03 11:59:08 UTC
(In reply to jamo luhrsen from comment #31)
> [...]
> or maybe you want to know which controller is the master vs the other two
> that would be slaves?

Yes, that was my question: which controller is the master for the OVS? I think this is not the same as the shard leader, which does the flow programming.

Comment 33 Aswin Suryanarayanan 2018-05-03 12:03:39 UTC
These missing flows are due to the exception in bug 1573273.

When the exception happens in ElanNodeListener, all the flows programmed by that class will be missing, which includes the table 48 flows. (Other flows include the default miss entries in tables 50, 51, etc.)

This is quite random and can happen in multiple **NodeListener classes, and all the flows installed by those classes will be missing.

Comment 34 jamo luhrsen 2018-05-03 21:27:11 UTC
(In reply to Aswin Suryanarayanan from comment #33)
> This missing flows are due to the exception in the bug  1573273 . 
> 
> When  the exception happens in ElanNodeListener all the flows programmed by
> the class will be missing which includes table 48 flows. (Other flows
> include the default miss entry in table 50, 51 etc). 
> 
> This is quite random and can happen in multiple **NodeListener and all the
> flows installed in these classes will be missing.

This job [0] also has the symptom of a missing table=48 flow on one compute
node. But I cannot find any "frozen class" errors in the OpenDaylight
logs for that job. I don't know what that means, but maybe the issue is not
entirely caused by that specific problem.

[0] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27

Comment 35 jamo luhrsen 2018-05-05 23:43:42 UTC
(In reply to Aswin Suryanarayanan from comment #32)
> [...]
> Yes that was my question which controller is the master for the OVS. I think
> this not the same as the shard leader who does the flow programming.


It's easy to get confused with the logs across all the controllers and tracking
down the right node, IP, MAC, etc.

BUT, I think I can say that the br-int on controller-2 was the SLAVE in this
setup when it was finally deployed.

[jluhrsen@jamo tmp]$ rg 206921423413162 ./controller-2/var/log/extra/docker/containers/opendaylight_api/stdout.log | rg 'SLAVE|MASTER'
2018-05-01T20:01:59,070 | INFO  | nioEventLoopGroup-9-2 | RoleContextImpl                  | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | Started timer for setting SLAVE role on device openflow:206921423413162 if no role will be set in 20s.
2018-05-01T20:02:19,071 | INFO  | pool-86-thread-1 | SalRoleServiceImpl               | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | SetRole called with input:SetRoleInput [_controllerRole=BECOMESLAVE, _node=NodeRef [_value=KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes, org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node[key=NodeKey [_id=Uri [_value=openflow:206921423413162]]]]}], augmentation=[]]
2018-05-01T20:02:19,071 | INFO  | pool-86-thread-1 | SalRoleServiceImpl               | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | RoleChangeTask called on device:openflow:206921423413162 OFPRole:BECOMESLAVE
2018-05-01T20:02:19,073 | INFO  | nioEventLoopGroup-9-2 | RoleService                      | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | submitRoleChange called for device:Uri [_value=openflow:206921423413162], role:BECOMESLAVE
2018-05-01T20:02:19,074 | INFO  | nioEventLoopGroup-9-2 | RoleService                      | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | submitRoleChange onSuccess for device:Uri [_value=openflow:206921423413162], role:BECOMESLAVE
2018-05-01T20:02:19,074 | INFO  | nioEventLoopGroup-9-2 | ContextChainHolderImpl           | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-9 | Role SLAVE was granted to device openflow:206921423413162


The MAC of br-int is bc:31:a5:f0:5f:aa, found here. That translates to 206921423413162
in decimal, which is the value I searched for above, showing it ended up as SLAVE.

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/27/robot/report/log.html#s1-s1-t8-k7-k2-k1-k3-k1-k11-k4
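The MAC-to-decimal translation above can be checked with a couple of lines of Python: the `openflow:<id>` node id here is just the bridge MAC read as a 48-bit integer (strictly, the OpenFlow datapath id is 64 bits with the MAC in the lower 48; the upper bits are zero for this br-int):

```python
def mac_to_dpid(mac):
    """Map a bridge MAC to the decimal id used in openflow:<id> node names:
    strip the colons and read the 48-bit hex value as an integer."""
    return int(mac.replace(":", ""), 16)

print(mac_to_dpid("bc:31:a5:f0:5f:aa"))  # 206921423413162
```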

Comment 36 Aswin Suryanarayanan 2018-05-07 18:21:18 UTC
(In reply to jamo luhrsen from comment #35)
> [...]
> BUT, I think I can say that the br-int on controller-2 was the SLAVE in this
> setup when it was finally deployed.

There seems to be an issue other than the frozen class that is leading to missing flows. In all the instances I have observed so far, it seems to happen on the node where the OVS master is not the same as the shard leader. Also, the flows programmed by the node listeners are not all missing; they are missing only for some of the listeners.

This seems to be some timing issue. We can use this bug to track the missing flows that are not caused by the frozen class.

Comment 41 Mike Kolesnik 2018-05-21 12:45:09 UTC
Aswin, any update on this issue?

Comment 42 Aswin Suryanarayanan 2018-05-22 06:57:56 UTC
(In reply to Mike Kolesnik from comment #41)
> Aswin, any update on this issue?

I have a patch [1] downstream and am currently testing it with CI to see if the issue is fixed; if not, we have to explore further to find the root cause.

[1] https://code.engineering.redhat.com/gerrit/#/c/138935/

Comment 43 Vishal Thapar 2018-05-23 07:51:36 UTC
*** Bug 1573224 has been marked as a duplicate of this bug. ***

Comment 44 Itzik Brown 2018-05-29 10:43:38 UTC
A workaround:
Rebooting the compute node works.

Comment 46 Tim Rozet 2018-05-30 14:44:44 UTC
Can someone please try setting the following parameter in your deployment and let us know if you are able to reproduce the issue?
OpenDaylightCheckURL: diagstatus

Comment 47 jamo luhrsen 2018-05-30 22:53:59 UTC
(In reply to Tim Rozet from comment #46)
> Can someone please try setting the following parameter in your deployment
> and let us know if you are able to reproduce the issue?
> OpenDaylightCheckURL: diagstatus

I have a patch to do this for one of our jobs:
  https://code.engineering.redhat.com/gerrit/#/c/140309/

I gave it a -1 (see comments)

It's running in my private job in the staging Jenkins. I don't
know if it will work or not. I've never tried a 3-node job
on my baremetal yet.

  https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/opendaylight/view/odl-netvirt/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit-jamo-poc/

I'll monitor and report back here.

Comment 48 jamo luhrsen 2018-06-01 00:24:19 UTC
(In reply to jamo luhrsen from comment #47)
> [...]
> I'll monitor and report back here.

The /diagstatus endpoint is not available in our downstream distro, but we have
a related endpoint that we can use:

  jolokia/exec/org.opendaylight.infrautils.diagstatus:type=SvcStatus/acquireServiceStatus

I was able to reproduce the problem where table=48 is not present on one
of the computes when using this other CheckURL. Below is the full flow
table of that compute:


[heat-admin@compute-0 ~]$ sudo ovs-ofctl dump-flows br-int -OOpenFlow13
 cookie=0x8000001, duration=1907.386s, table=0, n_packets=600, n_bytes=56790, priority=5,in_port=tuna1eead404eb actions=write_metadata:0x50000000001/0xfffff0000000001,goto_table:36
 cookie=0x8000001, duration=1907.386s, table=0, n_packets=599, n_bytes=56724, priority=5,in_port=tunb1dbca2df9a actions=write_metadata:0x40000000001/0xfffff0000000001,goto_table:36
 cookie=0x8000001, duration=1902.056s, table=0, n_packets=599, n_bytes=57488, priority=5,in_port=tun486a7e7734e actions=write_metadata:0x90000000001/0xfffff0000000001,goto_table:36
 cookie=0x8000001, duration=1894.153s, table=0, n_packets=596, n_bytes=56436, priority=5,in_port=tun7f59979156c actions=write_metadata:0xe0000000001/0xfffff0000000001,goto_table:36
 cookie=0x8220015, duration=1906.986s, table=19, n_packets=0, n_bytes=0, priority=100,arp,arp_op=1 actions=resubmit(,17)
 cookie=0x8220016, duration=1906.986s, table=19, n_packets=0, n_bytes=0, priority=100,arp,arp_op=2 actions=resubmit(,17)
 cookie=0x1080000, duration=1906.986s, table=19, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)
 cookie=0x1030000, duration=1906.986s, table=20, n_packets=0, n_bytes=0, priority=0 actions=goto_table:80
 cookie=0x8000004, duration=1906.986s, table=22, n_packets=0, n_bytes=0, priority=0 actions=CONTROLLER:65535
 cookie=0x1080000, duration=1906.986s, table=23, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)
 cookie=0x822002e, duration=1907.434s, table=43, n_packets=0, n_bytes=0, priority=100,arp,arp_op=2 actions=CONTROLLER:65535,resubmit(,48)
 cookie=0x1030000, duration=1906.984s, table=80, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)
 cookie=0x6900000, duration=1907.434s, table=210, n_packets=0, n_bytes=0, priority=63009,arp actions=drop
 cookie=0x6900000, duration=1907.434s, table=210, n_packets=0, n_bytes=0, priority=61009,ipv6 actions=drop
 cookie=0x6900000, duration=1907.434s, table=210, n_packets=0, n_bytes=0, priority=61009,ip actions=drop
 cookie=0x6900000, duration=1907.434s, table=210, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,tcp actions=write_metadata:0/0x2,goto_table:212
 cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,udp actions=write_metadata:0/0x2,goto_table:212
 cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,icmp actions=write_metadata:0/0x2,goto_table:212
 cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,icmp6 actions=write_metadata:0/0x2,goto_table:212
 cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,tcp6 actions=write_metadata:0/0x2,goto_table:212
 cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=100,udp6 actions=write_metadata:0/0x2,goto_table:212
 cookie=0x6900000, duration=1907.434s, table=211, n_packets=0, n_bytes=0, priority=0 actions=write_metadata:0x2/0x2,goto_table:214
 cookie=0x6900000, duration=1907.434s, table=212, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x6900000, duration=1907.434s, table=213, n_packets=0, n_bytes=0, priority=0 actions=goto_table:214
 cookie=0x6900000, duration=1907.434s, table=214, n_packets=0, n_bytes=0, priority=62030,ct_state=-new-est+rel-inv+trk,ct_mark=0x1/0x1 actions=ct_clear,resubmit(,17)
 cookie=0x6900000, duration=1907.434s, table=214, n_packets=0, n_bytes=0, priority=62030,ct_state=-new+est-rel-inv+trk,ct_mark=0x1/0x1 actions=ct_clear,resubmit(,17)
 cookie=0x6900000, duration=1907.434s, table=214, n_packets=0, n_bytes=0, priority=62030,ct_state=-trk actions=ct_clear,resubmit(,242)
 cookie=0x6900000, duration=1907.434s, table=214, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x6900000, duration=1907.434s, table=215, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,214)
 cookie=0x6900000, duration=1907.434s, table=216, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,214)
 cookie=0x6900000, duration=1907.434s, table=217, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x8000007, duration=1906.922s, table=220, n_packets=0, n_bytes=0, priority=9,reg6=0x500 actions=output:tuna1eead404eb
 cookie=0x8000007, duration=1906.765s, table=220, n_packets=0, n_bytes=0, priority=9,reg6=0x400 actions=output:tunb1dbca2df9a
 cookie=0x8000007, duration=1901.824s, table=220, n_packets=0, n_bytes=0, priority=9,reg6=0x900 actions=output:tun486a7e7734e
 cookie=0x8000007, duration=1893.413s, table=220, n_packets=0, n_bytes=0, priority=9,reg6=0xe00 actions=output:tun7f59979156c
 cookie=0x6900000, duration=1907.434s, table=239, n_packets=0, n_bytes=0, priority=100,ip actions=ct_clear,goto_table:240
 cookie=0x6900000, duration=1907.434s, table=239, n_packets=0, n_bytes=0, priority=100,ipv6 actions=ct_clear,goto_table:240
 cookie=0x6900000, duration=1907.434s, table=239, n_packets=0, n_bytes=0, priority=0 actions=goto_table:240
 cookie=0x6900000, duration=1907.434s, table=240, n_packets=0, n_bytes=0, priority=61010,ip,dl_dst=ff:ff:ff:ff:ff:ff,nw_dst=255.255.255.255 actions=goto_table:241
 cookie=0x6900000, duration=1907.434s, table=240, n_packets=0, n_bytes=0, priority=61005,dl_dst=ff:ff:ff:ff:ff:ff actions=resubmit(,220)
 cookie=0x6900000, duration=1907.434s, table=240, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,icmp actions=write_metadata:0/0x2,goto_table:242
 cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,icmp6 actions=write_metadata:0/0x2,goto_table:242
 cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,tcp6 actions=write_metadata:0/0x2,goto_table:242
 cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,udp6 actions=write_metadata:0/0x2,goto_table:242
 cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,tcp actions=write_metadata:0/0x2,goto_table:242
 cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=100,udp actions=write_metadata:0/0x2,goto_table:242
 cookie=0x6900000, duration=1907.434s, table=241, n_packets=0, n_bytes=0, priority=0 actions=write_metadata:0x2/0x2,goto_table:244
 cookie=0x6900000, duration=1907.434s, table=242, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x6900000, duration=1907.434s, table=243, n_packets=0, n_bytes=0, priority=0 actions=goto_table:244
 cookie=0x6900000, duration=1907.434s, table=244, n_packets=0, n_bytes=0, priority=62030,ct_state=-trk actions=ct_clear,resubmit(,242)
 cookie=0x6900000, duration=1907.434s, table=244, n_packets=0, n_bytes=0, priority=62030,ct_state=-new-est+rel-inv+trk,ct_mark=0x1/0x1 actions=ct_clear,resubmit(,220)
 cookie=0x6900000, duration=1907.434s, table=244, n_packets=0, n_bytes=0, priority=62030,ct_state=-new+est-rel-inv+trk,ct_mark=0x1/0x1 actions=ct_clear,resubmit(,220)
 cookie=0x6900000, duration=1907.434s, table=244, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x6900000, duration=1907.434s, table=245, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,244)
 cookie=0x6900000, duration=1907.434s, table=246, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,244)
 cookie=0x6900000, duration=1907.434s, table=247, n_packets=0, n_bytes=0, priority=0 actions=drop
[heat-admin@compute-0 ~]$
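The missing-table condition shown in this dump can be checked mechanically. A sketch of that check follows; the expected-table set is illustrative, collected from the tables discussed in this bug, not an authoritative description of the ODL netvirt pipeline:

```python
import re

# Tables this bug has shown can go missing; illustrative, not exhaustive.
EXPECTED_TABLES = {17, 21, 36, 48, 50, 51}

def tables_present(dump):
    """Table numbers that have at least one flow in ovs-ofctl dump-flows output."""
    return {int(n) for n in re.findall(r"table=(\d+),", dump)}

def missing_tables(dump, expected=EXPECTED_TABLES):
    return sorted(expected - tables_present(dump))

# Two lines from the dump above: tables 19 and 20 exist, table 48 does not.
sample = """\
 cookie=0x8220015, duration=1906.986s, table=19, n_packets=0, n_bytes=0, priority=100,arp,arp_op=1 actions=resubmit(,17)
 cookie=0x1030000, duration=1906.986s, table=20, n_packets=0, n_bytes=0, priority=0 actions=goto_table:80
"""
print(missing_tables(sample))
```

Run against the full dump above, this reports table 48 (among others) as absent, matching what the robot suite saw.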

Comment 49 jamo luhrsen 2018-06-01 18:51:48 UTC
If you disconnect and reconnect the OpenFlow manager on an affected OVS node,
the flows will be reprogrammed, which should rectify this issue. This is only
a workaround; the root cause should still be found and fixed.

steps:
(NOTE: replace the IP addresses as needed)

ovs-vsctl del-controller br-int 
ovs-vsctl set-controller br-int tcp:172.17.1.16:6653 tcp:172.17.1.20:6653 tcp:172.17.1.24:6653
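Scripted, the check-and-resync amounts to something like the following. This is only a sketch of the approach the later puppet-tripleo fix automates; the bridge name, table number, and controller list are taken from this thread's examples and would need adjusting for a real deployment:

```python
import re
import subprocess

# Controller list from the workaround above; placeholders for a real setup.
CONTROLLERS = ["tcp:172.17.1.16:6653", "tcp:172.17.1.20:6653", "tcp:172.17.1.24:6653"]

def has_table(dump, table):
    """True if at least one flow exists for the given table in dump-flows output."""
    return re.search(r"\btable={},".format(table), dump) is not None

def resync_if_table_missing(bridge="br-int", table=48, controllers=CONTROLLERS):
    """Dump the bridge's flows and, if the expected table has no entries,
    bounce the OpenFlow controllers so ODL reprograms the pipeline."""
    dump = subprocess.run(
        ["ovs-ofctl", "dump-flows", bridge, "-OOpenFlow13"],
        capture_output=True, text=True, check=True).stdout
    if has_table(dump, table):
        return False  # pipeline looks intact; nothing to do
    subprocess.run(["ovs-vsctl", "del-controller", bridge], check=True)
    subprocess.run(["ovs-vsctl", "set-controller", bridge, *controllers], check=True)
    return True
```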

Comment 50 Aswin Suryanarayanan 2018-06-05 12:23:28 UTC
This was observed in a non-clustered environment as well. Here table 19 was missing, but other tables were present. The flows were present in the config DS. When the controller was set again, the flows were programmed and the FIP started working.

Comment 53 Tim Rozet 2018-06-06 17:00:08 UTC
The plan is to insert a workaround into puppet-tripleo to resync the OVS OpenFlow tables with ODL when tables are missing.

Comment 54 Tim Rozet 2018-06-06 17:09:35 UTC
Patch posted upstream, which works locally for me. Need to test it in a deployment.

Comment 68 errata-xmlrpc 2018-06-27 13:52:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

