Bug 1712494

Summary: Exception Opening Socket error - service endpoints removed but not added back
Product: OpenShift Container Platform
Reporter: Greg Rodriguez II <grodrigu>
Component: Networking
Assignee: Casey Callendrello <cdc>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: urgent
Priority: urgent
CC: aos-bugs, jpriddy, knakai, mhernon, msweiker, openshift-bugs-escalate, piqin, rhowe, scuppett, sponnaga, trankin
Version: 3.10.0
Target Milestone: ---
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-09-26 20:32:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Greg Rodriguez II 2019-05-21 15:47:26 UTC
Description of problem:
The customer is seeing failures when making calls through services in OCP 3.10.XX. When the failures occur, their application logs the following error:

~~~~
2019-05-02 15:09:11.940  INFO 9 --- [xtgen.svc:27017] org.mongodb.driver.cluster: Exception in monitor thread while connecting to server mongodb.ode-b92590-nextgen.svc:27017
com.mongodb.MongoSocketOpenException: Exception opening socket
~~~~

A review of the sdn pod logs showed that the service endpoints were removed and not added back. Restarting the sdn pod resolves the issue.
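
For anyone checking a saved sdn pod log for the same symptom, here is a minimal sketch that lists services whose most recent endpoint event was a removal. The two message patterns ("Removing endpoints for ..." and "Setting endpoints for ... to [...]") are assumptions about how the proxy logs endpoint changes and are not taken from the customer logs; adjust them to match the actual log. The function name is hypothetical.

```python
import re
import sys
from typing import Dict, Iterable, List

# ASSUMED message shapes, not copied from the logs attached to this bug.
# Adjust both patterns to whatever the sdn pod log actually prints.
REMOVE_RE = re.compile(r"Removing endpoints for (\S+)")
SET_RE = re.compile(r"Setting endpoints for (\S+) to \[(.*?)\]")


def services_left_without_endpoints(lines: Iterable[str]) -> List[str]:
    """Return services whose last endpoint event in the log was a removal."""
    last_event: Dict[str, str] = {}
    for line in lines:
        match = SET_RE.search(line)
        if match:
            # Setting an empty endpoint list is treated like a removal.
            has_endpoints = bool(match.group(2).strip())
            last_event[match.group(1)] = "set" if has_endpoints else "removed"
            continue
        match = REMOVE_RE.search(line)
        if match:
            last_event[match.group(1)] = "removed"
    return sorted(svc for svc, event in last_event.items() if event == "removed")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_endpoints.py <saved-sdn-pod-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for service in services_left_without_endpoints(log_file):
            print(service)
```

A service showing up in the output only means the log ends with a removal for it; it still has to be compared against the current Endpoints object to confirm that the proxy state is stale.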

NOTE: There were also xtables lock issues, as this environment has more than 16k iptables rules. To resolve them, the iptables min-sync-period parameter was increased to 30s. In addition, BZ 1669311 was hit, and the limits/requests of the sdn and ovs pods were changed to prevent oom-killer events.
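
To put a number on the rule count mentioned in the note, a rough sketch like the following can tally rules per table from a saved `iptables-save` dump. It relies only on the standard dump layout, where a `*table` line opens each table and each rule is an `-A` line; the function name is hypothetical, and how the >16k figure was originally measured is not stated in this bug.

```python
import sys
from collections import Counter
from typing import Iterable


def count_rules_per_table(dump_lines: Iterable[str]) -> Counter:
    """Count '-A' rule lines under each '*table' header of an iptables-save dump."""
    counts: Counter = Counter()
    table = None
    for raw in dump_lines:
        line = raw.strip()
        if line.startswith("*"):
            # A line such as '*nat' or '*filter' starts a new table section.
            table = line[1:]
        elif line.startswith("-A ") and table is not None:
            counts[table] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: iptables-save > rules.txt; python count_rules.py rules.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as dump:
        per_table = count_rules_per_table(dump)
    for name, total in sorted(per_table.items()):
        print(f"{name}: {total}")
    print(f"total: {sum(per_table.values())}")
```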

Version-Release number of selected component (if applicable):
OCP v3.11.111

Additional info:
The sdn and ovs pod logs, as well as a sosreport from the affected node, are provided in a private update.

Comment 4 Greg Rodriguez II 2019-05-21 16:23:04 UTC
(In reply to Greg Rodriguez II from comment #0)

Correction to the BZ description: this is for OCP v3.10.111, as confirmed by the customer.

Comment 9 Ryan Howe 2019-06-20 16:11:46 UTC
Seems like this is a dup of the following bug: 

** https://bugzilla.redhat.com/show_bug.cgi?id=1590589 ** 

I think this is the fix for them:
https://github.com/openshift/origin/pull/21618/files

Merged upstream Kube: 
https://github.com/kubernetes/kubernetes/pull/71735

Comment 11 Ryan Howe 2019-09-26 20:32:22 UTC

*** This bug has been marked as a duplicate of bug 1734009 ***