Description of problem:
The service IP for the registry intermittently fails; we see this in our cluster registry health check. It is happening in our 'preview' production cluster (aka dev-preview). When we hit the individual pod IP addresses behind the service, they always seem to work. We had a conference call with Dan Williams a couple of weeks ago and gave him a set of logs that he said he would analyze and get back to us; this bug tracks that effort.

Here is the output from when the service IP works and then fails:

[root@preview-master-afbb8 ~]# curl --head https://172.30.47.227:5000
HTTP/1.1 200 OK
Cache-Control: no-cache
Date: Thu, 18 Aug 2016 14:41:52 GMT
Content-Type: text/plain; charset=utf-8

[root@preview-master-afbb8 ~]# curl --head https://172.30.47.227:5000
curl: (7) Failed connect to 172.30.47.227:5000; No route to host
[root@preview-master-afbb8 ~]#

Version-Release number of selected component (if applicable):
atomic-openshift-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-master-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-clients-3.2.1.15-1.git.8.c402626.el7.x86_64
tuned-profiles-atomic-openshift-node-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-sdn-ovs-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-node-3.2.1.15-1.git.8.c402626.el7.x86_64

How reproducible:
Very sporadic, but our health check shows it happening on a regular basis.

Steps to Reproduce:
1. Unknown; it's sporadic, but we see it when running the curl command above.

Actual results:
We are sporadically unable to connect to the registry using the service IP, but can connect using the individual pod IPs. This might be load related, as preview prod is one of our biggest / busiest clusters.

Expected results:
We should always be able to connect to the registry using the service IP.

Additional info:
Even though this bug specifically mentions the registry, this might be a general issue with the kube proxy.
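For reference, a rough sketch of the kind of check we run (not our exact health check): it curls the service IP and each registry pod IP in a loop and logs any attempt that fails, so a service-IP failure can be distinguished from a pod failure. The pod IPs below are placeholders; substitute the values from "oc get pods -o wide -n default".

SERVICE_IP=172.30.47.227
POD_IPS="10.1.2.3 10.1.4.5"        # placeholder pod IPs, not real values

while true; do
    for ip in $SERVICE_IP $POD_IPS; do
        # --max-time keeps a hung connect from stalling the whole loop
        if ! curl --head --silent --max-time 5 https://$ip:5000 >/dev/null; then
            echo "$(date -u +%FT%TZ) FAIL $ip"
        fi
    done
    sleep 10
done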
On 29 Aug we changed the openshift-node iptablesSyncPeriod from 5s (the shipped default) to 300s; since that change we have not been able to reliably reproduce this issue.
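For anyone else wanting to try the same workaround, a sketch of the change on each node, assuming the stock 3.2 enterprise paths and service name (/etc/origin/node/node-config.yaml, atomic-openshift-node); adjust if your layout differs:

# grep iptablesSyncPeriod /etc/origin/node/node-config.yaml
iptablesSyncPeriod: "300s"
# systemctl restart atomic-openshift-node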
Sounds like a combination of:
https://bugzilla.redhat.com/show_bug.cgi?id=1367199
https://bugzilla.redhat.com/show_bug.cgi?id=1362661
It would be worthwhile to see how iptables-restore is spending that time, determine whether there are hotspots in the code, and see whether anything in the iptables-restore process could be made more efficient. Flame graphs (http://www.brendangregg.com/flamegraphs.html) give a diagram of the stack backtraces from "perf record" data, showing which functions (and their children) the processor spends its time in. To get user-space function names in the analysis, install the debuginfo for iptables-restore before running the experiments:

# debuginfo-install iptables
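As a rough sketch of the capture, assuming perf and the stackcollapse-perf.pl / flamegraph.pl scripts from the FlameGraph repository (https://github.com/brendangregg/FlameGraph) are available on the node; the sampling window and output names are only illustrative. Sampling system-wide avoids having to race the short-lived iptables-restore process:

# perf record -a -g -- sleep 60
# perf script > out.perf
# ./stackcollapse-perf.pl out.perf > out.folded
# ./flamegraph.pl out.folded > iptables-restore.svg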
I'm not sure what else OpenShift networking can do here right now, given that we have a fix to decrease the contention (installer defaults in 1367199) and issues in the kernel too (1362661). Should I dupe this issue to one of those, re-assign to iptables, or close?
IMHO this is a dupe, so let's close this one.
*** This bug has been marked as a duplicate of bug 1362661 ***