Bug 1322942 - Service with active endpoints not routing traffic, returns connection refused
Summary: Service with active endpoints not routing traffic, returns connection refused
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Ben Bennett
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-31 17:54 UTC by Mike Fiedler
Modified: 2016-05-31 06:25 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 16:35:20 UTC
Target Upstream Version:
Embargoed:


Attachments
Docker registry logs (152.62 KB, application/x-gzip), attached 2016-03-31 17:54 UTC by Mike Fiedler


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:1064 0 normal SHIPPED_LIVE Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update 2016-05-12 20:19:17 UTC

Description Mike Fiedler 2016-03-31 17:54:17 UTC
Created attachment 1142322 [details]
Docker registry logs

Description of problem:

With > 100 concurrent builds running, builds start failing with:

F0331 13:27:06.828479       1 builder.go:204] Error: build error: Failed to push image. Response from registry is: Put http://172.25.149.234:5000/v1/repositories/t100/django-example-1/: dial tcp 172.25.149.234:5000: connection refused

The registry pods stay in the Running state and never restart, but any push after this condition is triggered hits the problem. Only restarting the registry allows builds to succeed again.

Config:
   3 masters
   2 docker-registry
   2 router
   100 worker/build nodes

This is on AWS EC2 with S3 storage for the registry.

I've run 100 concurrent builds before, but never on a cluster this large. There is potentially more traffic hitting the registry at once, since the builds are spread over a larger number of nodes.


Version-Release number of selected component (if applicable): 3.2.0.9


How reproducible: Always


Steps to Reproduce:
1. Set up cluster as described above
2. Create 100 projects and oc new-app the django-example in each
3. From 4 shells, run the build_test.py script (openshift/svt repo) and launch 25 builds from each instance (a rough command-level equivalent is sketched below).
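
For reference, a sketch of a command-level equivalent of steps 2-3, assuming the django-example template is loaded in the cluster; the loop structure and names are illustrative, and the actual test drove the builds through build_test.py:

for i in $(seq 1 100); do
  oc new-project t$i
  oc new-app django-example -n t$i
done

for i in $(seq 1 100); do
  oc start-build django-example -n t$i &    # the real test launched 4 x 25 builds via build_test.py
done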

Actual results:

Some builds succeed, but at some point the builds start failing with:

F0331 13:27:06.828479       1 builder.go:204] Error: build error: Failed to push image. Response from registry is: Put http://172.25.149.234:5000/v1/repositories/t100/django-example-1/: dial tcp 172.25.149.234:5000: connection refused



Expected results:

All builds succeed


Additional info:

docker-registry logs from the 2 registry instances are attached

Comment 1 Andy Goldstein 2016-03-31 18:11:22 UTC
Can you access the registry pods at their pod IPs? curl http://<ip>:5000/ ?

Can you access the registry via the service? curl http://172.25.149.234:5000/ ?

The registry pod logs indicate that the health check the node performs is able to retrieve /healthz after the time the build failed. I'm wondering if this is some other networking issue that's unrelated to the registry itself.
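
For reference, the service IP and the backing pod IPs being probed here can be read from the service and its endpoints (a sketch; object names as used in this report):

oc get svc docker-registry -n default                    # service (cluster) IP
oc get endpoints docker-registry -n default              # backing pod IP:port pairs
oc get pods -n default -o wide | grep docker-registry    # pod IPs and the nodes they run on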

Comment 2 Mike Fiedler 2016-03-31 18:27:39 UTC
Curl to service IP:

root@ip-172-31-15-66: ~ # curl http://172.25.149.234:5000/ 
curl: (7) Failed connect to 172.25.149.234:5000; Connection refused


Curl to pod IP:

docker-registry #1

root@ip-172-31-15-66: ~ # curl http://172.20.7.2:5000
root@ip-172-31-15-66: ~ # 


docker-registry #2

root@ip-172-31-15-66: ~ # curl http://172.20.3.2:5000
root@ip-172-31-15-66: ~ #

Comment 3 Andy Goldstein 2016-03-31 18:41:26 UTC
Chatted on IRC. This is not a registry issue. The registry pods are responding to requests. For some reason, the service is not routing packets to the registry pods. Continuing to debug on IRC.
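
A quick way to confirm "not routing" at the node level is to compare the endpoints object with the iptables rules kube-proxy programmed for the service IP (a sketch, assuming the iptables proxier; the IP below is the docker-registry service IP from this report):

oc get endpoints docker-registry -n default
iptables-save -t nat | grep 172.25.149.234    # expect KUBE-SVC/KUBE-SEP DNAT rules, not a "has no endpoints" REJECT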

Comment 5 Solly Ross 2016-04-04 16:00:32 UTC
Hmmm...

This is odd: "Service 'docker-registry' in namespace 'default' has an Endpoint pointing to pod 172.20.3.2 in namespace 't23'" -- it looks like some failed builder pods are sticking around and their IP addresses have been reused.

All of the affected services appear to have the same issue (a reused IP address). That is the likely culprit, but I'm unsure of the root cause. We'll continue to investigate.
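
One way to spot this kind of reuse is to cross-check the endpoint IP against the pods that currently hold it (a sketch; the IP is the one quoted above):

oc get endpoints docker-registry -n default -o yaml | grep -A2 addresses
oc get pods --all-namespaces -o wide | grep 172.20.3.2    # which pod(s), in which namespace, own this IP now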

Comment 6 Diego Castro 2016-04-05 18:33:34 UTC
Hello, I have a very similar issue: suddenly I can't reach the ClusterIP, only the pod IP. For example:

# oc get svc
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    SELECTOR                  AGE
anitta    172.30.228.178   <none>        8080/TCP   deploymentconfig=anitta   1h
mysql     172.30.11.181    <none>        3306/TCP   name=mysql                1h

# oc get endpoints
NAME      ENDPOINTS        AGE
anitta    10.1.3.15:8080   1h
mysql     10.1.6.8:3306    1h

# telnet 172.30.11.181 3306
Trying 172.30.11.181...
telnet: connect to address 172.30.11.181: Connection refused

# telnet 10.1.6.8 3306
Trying 10.1.6.8...
Connected to 10.1.6.8.
Escape character is '^]'.
J
5.6.26+Yba@X{&�ic5z-iO/:


I've found the following iptables rules pointing to the REJECT target:

iptables -nv -L KUBE-SERVICES --line-numbers
Chain KUBE-SERVICES (1 references)
num   pkts bytes target     prot opt in     out     source               destination
1        0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.236.207       /* mateus/mysqlpizza:mysql has no endpoints */ tcp dpt:3306 reject-with icmp-port-unreachable
2        0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.35.6          /* pensatica/postgresql:postgresql has no endpoints */ tcp dpt:5432 reject-with icmp-port-unreachable
3        0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.226.230       /* pensatica/web:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
4        0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.65.219        /* mybff/wordpress:web has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
5        0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.207.105       /* pensatica/elasticsearch:elasticsearch has no endpoints */ tcp dpt:9200 reject-with icmp-port-unreachable
6        5   300 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.11.181        /* anittaoficial/mysql:mysql has no endpoints */ tcp dpt:3306 reject-with icmp-port-unreachable
7        0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.137.82        /* bluegreen/green:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
8        0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.231.72        /* orlandowebtravel/site:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
9        0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.186.185       /* mateus/api:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
10       0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.183.213       /* facoeaconteco/mysql:mysql has no endpoints */ tcp dpt:3306 reject-with icmp-port-unreachable
11       0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.67.91         /* panda/appphp:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
12       0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.24.171        /* abdeployment/app-b:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
13       0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.148.58        /* getup/console:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
14       0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.248.49        /* bluegreen/blue:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
15       0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.157.252       /* cdablog/wordpress:web has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
16       0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.9.41          /* cdablog/mysql:mysql has no endpoints */ tcp dpt:3306 reject-with icmp-port-unreachable


Note that if I manually delete the rule, the error becomes:

# telnet 172.30.11.181 3306
Trying 172.30.11.181...
telnet: connect to address 172.30.11.181: No route to host
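
In other words, the kube-proxy iptables view disagrees with the API: the endpoints object lists 10.1.6.8:3306, yet KUBE-SERVICES still carries a "has no endpoints" REJECT for the service IP. A sketch of checking both sides at once (the namespace is taken from the rule comment above):

oc get endpoints mysql -n anittaoficial
iptables -t nat -S KUBE-SERVICES | grep 172.30.11.181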

Comment 9 Ben Bennett 2016-04-13 17:59:39 UTC
This was merged to origin in https://github.com/openshift/origin/pull/8468

Comment 10 Troy Dawson 2016-04-15 16:32:20 UTC
This should be in atomic-openshift-3.2.0.16-1.git.0.738b760.el7, which has been built and is ready for QE.

Comment 11 Meng Bo 2016-04-19 09:53:58 UTC
Verified on atomic-openshift-3.2.0.17 with the following steps (a rough command sketch follows the list):

1. Set up an environment with 1 master, 1 node, and the multitenant plugin
2. Create two namespaces
3. Create a service in ns1 whose selector matches "name=test-pods"
4. In ns2, create a pod with the label "name=test-pods" that ends up in an error state
5. After that pod fails, create a normal pod with the label "name=test-pods" in ns1
6. Delete the errored pod in ns2
7. Check the endpoints in ns1; they show the correct pod IP and port.
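
A rough command sketch of these steps (object and file names such as ns1, ns2, bad-pod, and good-pod are illustrative; the original verification used the selector name=test-pods):

oc new-project ns1 && oc new-project ns2
oc create -f service.json -n ns1      # service whose selector is name=test-pods
oc create -f bad-pod.json -n ns2      # pod labeled name=test-pods that enters an error state
oc create -f good-pod.json -n ns1     # healthy pod labeled name=test-pods, created after the failure
oc delete pod bad-pod -n ns2
oc get endpoints -n ns1               # should list the healthy pod's IP and port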

The node log shows:
Apr 19 16:01:19 ose-node1.bmeng.local atomic-openshift-node[38217]: W0419 16:01:19.091269   38217 registry.go:508] IP '10.128.2.2' was marked as used by namespace 'u1p1' (pod '7d851838-0604-11e6-a8fc-525400dd3698')... updating to namespace 'u1p1' (pod 'e0e1516f-0604-11e6-965c-525400dd3698')

Comment 13 errata-xmlrpc 2016-05-12 16:35:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064

