Bug 1389451

Summary: While upgrading from 3.2.0 to 3.2.1.17 (3.2 latest) ovs flows are added correctly but are missed out while upgrading from 3.2.1.17 (3.2 latest) to 3.3 (3.3.0.35)
Product: OpenShift Container Platform
Reporter: Miheer Salunke <misalunk>
Component: Networking
Assignee: Dan Winship <danw>
Status: CLOSED NOTABUG
QA Contact: Meng Bo <bmeng>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.1.0
CC: aos-bugs, bbennett, clichybi, erich, erjones, ndordet
Target Milestone: ---
Keywords: UpcomingRelease
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2016-11-11 13:12:06 UTC
Type: Bug

Description Miheer Salunke 2016-10-27 15:23:46 UTC
Description of problem:


When upgrading from 3.2.0 to 3.2.1.17 (3.2 latest), the hostsubnet flows are added correctly to each node's OVS flow tables. When upgrading from 3.2.1.17 (3.2 latest) to 3.3 (3.3.0.35), some hostsubnet flows are randomly missing from the nodes' OVS flow tables.

According to the comment
https://access.redhat.com/support/cases/#/case/01697073?commentId=a0aA000000HzifqIAB
the registry is located at 10.1.1.2 on node 542.
Node 538 can reach the registry; its OVS knows the subnet 10.1.1.0/24.
Node 540 is unable to reach the registry because its OVS doesn't know the subnet 10.1.1.0/24, so it can't forward network frames to host 542 over the VXLAN.
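
The comparison described above (hostsubnets known to the master versus subnets present in a node's flow table) can be sketched roughly as follows. This is only a sketch: the helper name missing_subnets and the /tmp file paths are hypothetical, and the oc and ovs-ofctl invocations in the comments are assumptions about how the two input files could be produced on an openshift-sdn node.

```shell
# Print every subnet listed in file $1 that does not appear in the
# OVS flow dump saved in file $2. (Hypothetical helper, not part of any tool.)
missing_subnets() {
    subnets_file=$1
    flows_file=$2
    while IFS= read -r subnet; do
        # Skip empty lines in the subnet list.
        [ -n "$subnet" ] || continue
        # -F: match the subnet as a fixed string; -w: require word boundaries
        # so 10.1.1.0/24 does not match inside 110.1.1.0/24; -q: no output.
        grep -qwF -- "$subnet" "$flows_file" || printf '%s\n' "$subnet"
    done < "$subnets_file"
}

# Assumed way to produce the inputs (run on the affected node):
#   oc get hostsubnets -o jsonpath='{range .items[*]}{.subnet}{"\n"}{end}' > /tmp/subnets.txt
#   ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/flows.txt
#   missing_subnets /tmp/subnets.txt /tmp/flows.txt
```

On a node in the state described for node 540, this would be expected to print 10.1.1.0/24. Note that the node's own subnet may legitimately appear differently in its flows, so a hit for the local subnet is not necessarily a problem.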

We deleted, then re-added node 542, and this fixed the issue.
The missing rules are now present on node 540.
It is the only workaround we found.

"oc get hostsubnets" has returned a hostsubnet for every node since the beginning.


So yesterday we restored a 3.2.0 backup, and everything went well.
Then we updated to 3.2.1.17 (3.2 latest), and everything went well too.
Then we updated to 3.3 (3.3.0.35) and hit the issue again.

The workaround (deleting and re-adding a node) is fine for a test environment but not acceptable for production, where little or no downtime can be tolerated.


PS:
We also tried the following on one node:
systemctl stop atomic-openshift-node
ovs-vsctl del-br br0
systemctl start atomic-openshift-node

But it had no effect.





Version-Release number of selected component (if applicable):
3.3.0.35

How reproducible:
Always on customer side

Steps to Reproduce:
1. See the description above.

Actual results:
After upgrading from 3.2.0 to 3.2.1.17 (3.2 latest), the hostsubnet flows are present in each node's OVS flow tables. After upgrading from 3.2.1.17 (3.2 latest) to 3.3 (3.3.0.35), some hostsubnet flows are randomly missing from the nodes' OVS flow tables.

Expected results:
Upgrading from 3.2.1.17 (3.2 latest) to 3.3 (3.3.0.35) should not randomly drop hostsubnets from the nodes' OVS flow tables.

Additional info:

Comment 18 Ben Bennett 2016-11-11 13:12:06 UTC
The resolution was that they had duplicated the UUID of a hostsubnet when creating one manually, and that broke things.
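
For anyone hitting something similar, a quick check for duplicated hostsubnet UUIDs might look like the sketch below. It assumes the duplication is visible in each object's metadata.uid field; the helper name duplicate_uids is hypothetical, and the oc invocation in the comment is an assumed way to produce the "NAME UID" input.

```shell
# Read "NAME UID" pairs on stdin (one hostsubnet per line) and print each
# UID that is shared by more than one hostsubnet. (Hypothetical helper.)
duplicate_uids() {
    # Count occurrences of the second column, ignoring malformed lines,
    # then print the UIDs seen more than once in sorted order.
    awk 'NF >= 2 { count[$2]++ }
         END { for (uid in count) if (count[uid] > 1) print uid }' | sort
}

# Assumed usage against a live cluster:
#   oc get hostsubnets \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.uid}{"\n"}{end}' \
#     | duplicate_uids
```

Empty output means no UID is shared between hostsubnets; any printed UID points at objects worth inspecting with "oc get hostsubnets -o yaml".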