Bug 1372824

Summary: Sporadic failures connecting to the cluster registry using the service IP
Product: OpenShift Online
Component: Networking
Version: 3.x
Severity: high
Priority: high
Status: CLOSED DUPLICATE
Reporter: Thomas Wiest <twiest>
Assignee: Dan Williams <dcbw>
QA Contact: Meng Bo <bmeng>
CC: agrimm, aloughla, anli, aos-bugs, bbennett, dcbw, eparis, sspeiche, sten, sukulkar, tstclair, twiest, wcohen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2016-09-15 22:08:02 UTC
Type: Bug
Bug Blocks: 1303130    

Description Thomas Wiest 2016-09-02 19:24:28 UTC
Description of problem:

Connections to the registry's service IP intermittently fail. We're seeing this in our cluster registry health check on our 'preview' production cluster (aka dev-preview).

When we hit the individual endpoint IP addresses behind the service directly, they appear to always work.

We discussed this issue on a conference call with Dan Williams a couple of weeks ago and gave him a set of logs, which he said he would analyze and get back to us on. This bug tracks that effort.

Here is the output from when the service IP works and then fails:

[root@preview-master-afbb8 ~]# curl --head https://172.30.47.227:5000
HTTP/1.1 200 OK
Cache-Control: no-cache
Date: Thu, 18 Aug 2016 14:41:52 GMT
Content-Type: text/plain; charset=utf-8

[root@preview-master-afbb8 ~]# curl --head https://172.30.47.227:5000
curl: (7) Failed connect to 172.30.47.227:5000; No route to host
[root@preview-master-afbb8 ~]#
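
For reference, a minimal sketch of how the individual endpoint IPs mentioned above can be checked directly; the service name and namespace (docker-registry in the default project) are assumptions, not taken from this report:

# oc get endpoints docker-registry -n default
# curl --head https://<endpoint-ip>:5000    (substitute one of the endpoint IP:port pairs listed by the command above)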




Version-Release number of selected component (if applicable):
atomic-openshift-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-master-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-clients-3.2.1.15-1.git.8.c402626.el7.x86_64
tuned-profiles-atomic-openshift-node-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-sdn-ovs-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-node-3.2.1.15-1.git.8.c402626.el7.x86_64




How reproducible:
Very sporadic, but our registry health check shows it happening on a regular basis.



Steps to Reproduce:
1. Unknown; the failure is sporadic, but we see it when running the curl command above.


Actual results:
We're sporadically unable to connect to the registry using the service IP, but connections using the individual endpoint IPs succeed.

This might be load-related, as preview prod is one of our biggest and busiest clusters.


Expected results:
We should always be able to connect to the registry using the service IP.


Additional Info:
Even though this bug specifically concerns the registry, this might be a general issue with the kube proxy.
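
As a minimal diagnostic sketch, one way to check whether kube-proxy's iptables NAT rules for the service IP are in place at the moment of a failure, using the service IP from the curl output above:

# iptables-save -t nat | grep 172.30.47.227

If no KUBE-SERVICES / KUBE-SVC-* entries appear while the curl is failing, the rules for this service were missing or being rewritten at that moment, which would point at kube-proxy / iptables syncing rather than the registry itself.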

Comment 5 Sten Turpin 2016-09-07 14:19:48 UTC
On 29 Aug, we changed the openshift-node iptablesSyncPeriod from 5s (the shipped default) to 300s; since that change we have not been able to reliably reproduce this issue.
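
For context, a sketch of that configuration change, assuming the standard node config path /etc/origin/node/node-config.yaml and the atomic-openshift-node service name (both assumptions here, not stated in this comment):

# In /etc/origin/node/node-config.yaml, raise the proxy sync interval:
#     iptablesSyncPeriod: "300s"    (was the shipped default "5s")
# Then restart the node service so the new value takes effect:
# systemctl restart atomic-openshift-node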

Comment 9 Timothy St. Clair 2016-09-13 19:24:53 UTC
Sounds like a combination of: 

https://bugzilla.redhat.com/show_bug.cgi?id=1367199 
https://bugzilla.redhat.com/show_bug.cgi?id=1362661

Comment 10 William Cohen 2016-09-13 20:48:26 UTC
It would be worthwhile to see how iptables-restore is spending that time, determine whether there are hotspots in the code, and see whether something in the iptables-restore process could be made more efficient. Flame graphs (http://www.brendangregg.com/flamegraphs.html) give a diagram of the stack backtraces from "perf record" data, showing which functions (and their children) the processor is spending time in. To get user-space function names in the analysis, you should use the following command to install the associated debuginfo for iptables-restore before running the experiments:

# debuginfo-install iptables
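
A minimal sketch of the profiling steps described above, assuming perf is installed and the FlameGraph scripts (https://github.com/brendangregg/FlameGraph) have been cloned into ./FlameGraph; the 60-second sample window is arbitrary:

# perf record -g -a -- sleep 60    (sample all CPUs while the iptables-restore activity is happening)
# perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > iptables-restore.svg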

Comment 11 Dan Williams 2016-09-14 22:53:54 UTC
I'm not sure what else OpenShift networking can do here right now, given that we have a fix to decrease the contention (installer defaults in 1367199) and issues in the kernel too (1362661).  Should I dupe this issue to one of those, re-assign to iptables, or close?

Comment 12 Timothy St. Clair 2016-09-14 23:01:58 UTC
IMHO this is a dupe; I'd close this one.

Comment 13 Dan Williams 2016-09-15 22:08:02 UTC

*** This bug has been marked as a duplicate of bug 1362661 ***