Created attachment 1142322 [details]
Docker registry logs

Description of problem:
Running > 100 concurrent builds, builds start failing with:

F0331 13:27:06.828479 1 builder.go:204] Error: build error: Failed to push image. Response from registry is: Put http://172.25.149.234:5000/v1/repositories/t100/django-example-1/: dial tcp 172.25.149.234:5000: connection refused

The registry pods stay in the Running state and never restart, but any push after this condition is triggered hits the problem. Only restarting the registry allows builds to succeed again.

Config:
3 masters
2 docker-registry
2 router
100 worker/build nodes

This is on AWS EC2 with S3 storage for the registry. I've done 100 concurrent builds before, but never on a cluster this large. There is potentially more traffic hitting the registry at once, since the builds are spread over a larger number of nodes.

Version-Release number of selected component (if applicable):
3.2.0.9

How reproducible:
Always

Steps to Reproduce:
1. Set up the cluster as described above
2. Create 100 projects and oc new-app the django-example in each
3. From 4 shells, run the build_test.py script (openshift/svt repo) and launch 25 builds from each instance

Actual results:
Some builds succeed, but at some point builds start failing with:

F0331 13:27:06.828479 1 builder.go:204] Error: build error: Failed to push image. Response from registry is: Put http://172.25.149.234:5000/v1/repositories/t100/django-example-1/: dial tcp 172.25.149.234:5000: connection refused

Expected results:
All builds succeed

Additional info:
docker-registry logs from the 2 registry instances are attached
Can you access the registry pods at their pod IPs (curl http://<ip>:5000/)? Can you access the registry via the service (curl http://172.25.149.234:5000/)?

The registry pod logs indicate that the health check the node performs was still able to retrieve /healthz after the time the build failed. I'm wondering if this is some other networking issue that's unrelated to the registry itself.
Curl to the service IP:

root@ip-172-31-15-66: ~ # curl http://172.25.149.234:5000/
curl: (7) Failed connect to 172.25.149.234:5000; Connection refused

Curl to the pod IPs:

docker-registry #1
root@ip-172-31-15-66: ~ # curl http://172.20.7.2:5000
root@ip-172-31-15-66: ~ #

docker-registry #2
root@ip-172-31-15-66: ~ # curl http://172.20.3.2:5000
root@ip-172-31-15-66: ~ #
Chatted on IRC. This is not a registry issue. The registry pods are responding to requests. For some reason, the service is not routing packets to the registry pods. Continuing to debug on IRC.
Hmmm... this is odd: "Service 'docker-registry' in namespace 'default' has an Endpoint pointing to pod 172.20.3.2 in namespace 't23'" -- it looks like there are some failed builder pods sticking around, and their IP addresses have been reused. All of the affected services appear to have the same issue (a reused IP address). This looks like the likely culprit, then, but I'm unsure of the root cause. We'll continue to investigate.
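One way to spot this condition is to look for a pod IP that is reported for more than one pod across namespaces. A minimal sketch, run here against hypothetical sample output (the pod names and IPs below are made up; on a live cluster you would pipe in `oc get pods --all-namespaces -o wide --no-headers` instead, and the column positions may differ by version):

```shell
# Sketch: flag pod IPs that appear under more than one pod, which would
# indicate a reused/stale IP as described above.
# Sample columns: NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
pods='
t23      django-example-1-build   0/1  Error    0  1h  172.20.3.2  node1
default  docker-registry-1-deploy 1/1  Running  0  2h  172.20.3.2  node2
default  router-1-abc             1/1  Running  0  2h  172.20.7.9  node3
'
dupes=$(echo "$pods" | awk 'NF { seen[$7] = seen[$7] " " $1 "/" $2 }
  END { for (ip in seen) if (split(seen[ip], a, " ") > 1) print ip seen[ip] }')
echo "$dupes"
```

Any line printed names an IP plus the namespace/pod pairs sharing it, which can then be checked against the service endpoints.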
Hello, I have a very similar issue: suddenly I can't talk to the ClusterIP, only the pod IP. For example:

# oc get svc
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    SELECTOR                  AGE
anitta    172.30.228.178   <none>        8080/TCP   deploymentconfig=anitta   1h
mysql     172.30.11.181    <none>        3306/TCP   name=mysql                1h

# oc get endpoints
NAME      ENDPOINTS        AGE
anitta    10.1.3.15:8080   1h
mysql     10.1.6.8:3306    1h

# telnet 172.30.11.181 3306
Trying 172.30.11.181...
telnet: connect to address 172.30.11.181: Connection refused

# telnet 10.1.6.8 3306
Trying 10.1.6.8...
Connected to 10.1.6.8.
Escape character is '^]'.
J 5.6.26+Yba@X{&�ic5z-iO/

I've found the following iptables rules pointing to the REJECT target:

# iptables -nv -L KUBE-SERVICES --line-numbers
Chain KUBE-SERVICES (1 references)
num  pkts bytes target prot opt in  out  source     destination
1    0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.236.207  /* mateus/mysqlpizza:mysql has no endpoints */ tcp dpt:3306 reject-with icmp-port-unreachable
2    0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.35.6     /* pensatica/postgresql:postgresql has no endpoints */ tcp dpt:5432 reject-with icmp-port-unreachable
3    0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.226.230  /* pensatica/web:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
4    0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.65.219   /* mybff/wordpress:web has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
5    0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.207.105  /* pensatica/elasticsearch:elasticsearch has no endpoints */ tcp dpt:9200 reject-with icmp-port-unreachable
6    5    300   REJECT tcp  --  *   *    0.0.0.0/0  172.30.11.181   /* anittaoficial/mysql:mysql has no endpoints */ tcp dpt:3306 reject-with icmp-port-unreachable
7    0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.137.82   /* bluegreen/green:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
8    0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.231.72   /* orlandowebtravel/site:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
9    0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.186.185  /* mateus/api:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
10   0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.183.213  /* facoeaconteco/mysql:mysql has no endpoints */ tcp dpt:3306 reject-with icmp-port-unreachable
11   0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.67.91    /* panda/appphp:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
12   0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.24.171   /* abdeployment/app-b:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
13   0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.148.58   /* getup/console:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
14   0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.248.49   /* bluegreen/blue:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
15   0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.157.252  /* cdablog/wordpress:web has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
16   0    0     REJECT tcp  --  *   *    0.0.0.0/0  172.30.9.41     /* cdablog/mysql:mysql has no endpoints */ tcp dpt:3306 reject-with icmp-port-unreachable

Note that if I manually delete the rule, the error changes to:

# telnet 172.30.11.181 3306
Trying 172.30.11.181...
telnet: connect to address 172.30.11.181: No route to host
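Rules like these are what kube-proxy installs for a service it believes has no ready endpoints, so extracting the comments is a quick way to enumerate the affected services and compare them against `oc get endpoints`. A minimal sketch, run here against two sample iptables-save-style lines (hypothetical data; on a node you would feed in real `iptables-save -t filter` output instead):

```shell
# Sketch: pull the "namespace/service:port" names out of the
# "has no endpoints" REJECT rule comments.
# The rules below are sample text, not live iptables output.
rules='
-A KUBE-SERVICES -d 172.30.11.181/32 -p tcp -m comment --comment "anittaoficial/mysql:mysql has no endpoints" -m tcp --dport 3306 -j REJECT
-A KUBE-SERVICES -d 172.30.9.41/32 -p tcp -m comment --comment "cdablog/mysql:mysql has no endpoints" -m tcp --dport 3306 -j REJECT
'
rejected=$(echo "$rules" | sed -n 's/.*--comment "\([^" ]*\) has no endpoints".*/\1/p')
echo "$rejected"
```

Each printed name is a service the proxy considers endpoint-less; if `oc get endpoints` disagrees, the proxy's view is stale, which matches the reused-IP theory above.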
https://github.com/openshift/openshift-sdn/pull/285
This was merged to origin in https://github.com/openshift/origin/pull/8468
This should be in atomic-openshift-3.2.0.16-1.git.0.738b760.el7, which has been built and is ready for QE.
Checked on atomic-openshift-3.2.0.17 with these steps:

1. Set up an env with 1 master, 1 node and the multitenant plugin
2. Create two namespaces
3. Create a service in ns1 matching the selector "name=test-pods"
4. Create a pod with label "name=test-pods" in ns2 which will end up in an error state
5. After the above pod has failed, create a normal pod with label "name=test-pods" in ns1
6. Delete the errored pod in ns2
7. Check the endpoints in ns1: they have the correct pod IP and port.

The node log shows:

Apr 19 16:01:19 ose-node1.bmeng.local atomic-openshift-node[38217]: W0419 16:01:19.091269 38217 registry.go:508] IP '10.128.2.2' was marked as used by namespace 'u1p1' (pod '7d851838-0604-11e6-a8fc-525400dd3698')... updating to namespace 'u1p1' (pod 'e0e1516f-0604-11e6-965c-525400dd3698')
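That registry.go warning is the signal that the stale IP mapping was detected and reassigned, so it is the line to look for when verifying the fix. As a sketch, the reused IP can be pulled out of such a line mechanically (the log text below is copied from the comment above; on a live node you would grep the journal instead):

```shell
# Sketch: extract the reused IP from the registry.go warning line.
log="W0419 16:01:19.091269 38217 registry.go:508] IP '10.128.2.2' was marked as used by namespace 'u1p1' (pod '7d851838-0604-11e6-a8fc-525400dd3698')... updating to namespace 'u1p1' (pod 'e0e1516f-0604-11e6-965c-525400dd3698')"
reused_ip=$(printf '%s\n' "$log" | sed -n "s/.*IP '\([0-9.]*\)'.*/\1/p")
echo "$reused_ip"
```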
Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064