Description of problem:

Very frequent errors with the integrated registry in free-int, sometimes on the push after a successful build, sometimes on the pull.

Version-Release number of selected component (if applicable):

registry.ops.openshift.com/openshift3/ose-docker-registry:v3.7.0-0.125.0
atomic-openshift-3.7.0-0.125.0.git.0.c710e11.el7.x86_64

How reproducible:

Happening a lot, but not always, and possibly only on some nodes and not others.

Steps to Reproduce:

If the issue is occurring:

1. Log in to a free-int master and run:

   oc project openshift-infra
   oc get pods

   You should see a "hibernation" pod running, as well as "analytics".

2. oc get events exposes the problem.

Actual results:

5m    6m    4    hibernation-4-fsxwq    Pod    spec.containers{hibernation}    Warning    Failed    kubelet, ip-172-31-53-92.ec2.internal    Failed to pull image "172.30.215.46:5000/openshift-infra/hibernation@sha256:4b976fa71a6fd3261c34985c61b026e797798a6fb5c8dce06d4cfae43b877211": rpc error: code = 2 desc = Get https://172.30.215.46:5000/v2/: dial tcp 172.30.215.46:5000: getsockopt: no route to host

4m    6m    6    hibernation-4-fsxwq    Pod    spec.containers{hibernation}    Normal    BackOff    kubelet, ip-172-31-53-92.ec2.internal    Back-off pulling image "172.30.215.46:5000/openshift-infra/hibernation@sha256:4b976fa71a6fd3261c34985c61b026e797798a6fb5c8dce06d4cfae43b877211"

Additional info:

The image referenced above exists:

[root@free-int-master-5470f ~]# oc get images | grep 77211
sha256:4b976fa71a6fd3261c34985c61b026e797798a6fb5c8dce06d4cfae43b877211   172.30.215.46:5000/openshift-infra/hibernation@sha256:4b976fa71a6fd3261c34985c61b026e797798a6fb5c8dce06d4cfae43b877211

The build actually completed and was able to push the image. However, another similarly timed build for the analytics deployment config failed to push:

Pushing image 172.30.215.46:5000/openshift-infra/analytics:latest ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Registry server Address:
Registry server User Name: serviceaccount
Registry server Email: serviceaccount
Registry server Password: <<non-empty>>
error: build error: Failed to push image: After retrying 6 times, Push image still failed

These builds were executing at roughly the same time but on different nodes, leading us to believe the problem may only be affecting some nodes. It appears to be network related:

[root@free-int-master-5470f ~]# oc project default
Now using project "default" on server "https://internal.api.free-int.openshift.com:443".
[root@free-int-master-5470f ~]# oc get svc
NAME               CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
docker-registry    172.30.215.46    <none>        5000/TCP                  169d
kubernetes         172.30.0.1       <none>        443/TCP,53/UDP,53/TCP     169d
registry-console   172.30.143.202   <none>        9000/TCP                  169d
router             172.30.122.144   <none>        80/TCP,443/TCP,1936/TCP   169d
zagg-service       172.30.181.159   <none>        80/TCP,443/TCP            169d
[root@free-int-master-5470f ~]# ping 172.30.215.46
PING 172.30.215.46 (172.30.215.46) 56(84) bytes of data.
From 10.128.4.1 icmp_seq=1 Destination Host Unreachable
From 10.128.4.1 icmp_seq=2 Destination Host Unreachable
From 10.128.4.1 icmp_seq=3 Destination Host Unreachable
From 10.128.4.1 icmp_seq=4 Destination Host Unreachable
^C
--- 172.30.215.46 ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4000ms
pipe 4

This has been happening repeatedly in this cluster: we log in periodically and find the hibernation pod is not running, with an ErrImagePull event present.
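(For reference, a quick way to spot this state on a later login is roughly the following; just a sketch, the grep pattern simply matches the event text shown above.)

   oc get events -n openshift-infra --sort-by=.lastTimestamp | grep -iE 'failed|back-off'
   oc get pods -n openshift-infra | grep -v Running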
FWIW, you can't ping a service IP address. If you want to test connectivity to the registry, do:

curl 172.30.215.46:5000
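(Hedged aside: since the kubelet errors above are hitting https://172.30.215.46:5000/v2/, a check that exercises that same path would be roughly:

   curl -kv https://172.30.215.46:5000/v2/

where -k skips certificate verification and /v2/ is the standard Docker registry API ping endpoint; either an empty JSON body or a 401 means the service is reachable.)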
We worked past this by rebooting the affected nodes. Ben requested the following if we see it again:

12:04pm <bbennett> dgoodwin: mwoodson I was looking into that... sorry I missed all of this chat. I'm trying to use my tier1/2 access to debug and missing a bunch of commands
12:05pm <bbennett> next time can we capture iptables and ovs rules and we can see what's broken
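(For next time, the capture being asked for would look roughly like the following, run on the affected node; a sketch only, assuming the default openshift-sdn bridge name br0 and its usual OpenFlow 1.3 protocol:

   iptables-save > /tmp/iptables.dump
   ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/ovs-flows.dump
   ovs-vsctl show > /tmp/ovs-vsctl.dump
)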
Created attachment 1331375 [details] iptables rules
Created attachment 1331377 [details] ovs dump
The problem has resurfaced.

21m    23m    4    hibernation-2-q14ph    Pod    spec.containers{hibernation}    Warning    Failed    kubelet, ip-172-31-53-92.ec2.internal    Failed to pull image "172.30.215.46:5000/openshift-infra/hibernation@sha256:a9cec6b1a65f76b6ce4fb8567b6787f40170be6308fec2fb069eefce37f31012": rpc error: code = 2 desc = Get https://172.30.215.46:5000/v2/: dial tcp 172.30.215.46:5000: getsockopt: no route to host

The iptables and OVS dumps from the affected node are attached. Strangely, some extra nodes (beyond the masters) appear to have scheduling disabled.

[root@free-int-master-5470f ~]# oc get nodes
NAME                            STATUS                     AGE    VERSION
ip-172-31-49-31.ec2.internal    Ready                      154d   v1.7.0+695f48a16f
ip-172-31-49-44.ec2.internal    Ready,SchedulingDisabled   176d   v1.7.0+695f48a16f
ip-172-31-50-177.ec2.internal   Ready,SchedulingDisabled   176d   v1.7.0+80709908fd
ip-172-31-53-92.ec2.internal    Ready                      176d   v1.7.0+695f48a16f
ip-172-31-56-130.ec2.internal   Ready,SchedulingDisabled   176d   v1.7.0+80709908fd
ip-172-31-56-218.ec2.internal   Ready                      176d   v1.7.0+695f48a16f
ip-172-31-59-87.ec2.internal    Ready                      176d   v1.7.0+695f48a16f
ip-172-31-60-182.ec2.internal   Ready,SchedulingDisabled   125d   v1.7.0+80709908fd
ip-172-31-61-50.ec2.internal    Ready                      112d   v1.7.0+695f48a16f
ip-172-31-62-45.ec2.internal    Ready,SchedulingDisabled   176d   v1.7.0+695f48a16f
[root@free-int-master-5470f ~]#
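(Side note: the SchedulingDisabled state can be confirmed per node and, if it is just left over from maintenance, reverted; a sketch using one of the nodes listed above:

   oc describe node ip-172-31-62-45.ec2.internal | grep -i unschedulable
   oc adm manage-node ip-172-31-62-45.ec2.internal --schedulable=true

Only re-enable scheduling if ops did not intentionally cordon those nodes.)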
Devan: Was this the system where the PLEGs were failing and taking the nodes offline? If so, I suspect it is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1451902
*** This bug has been marked as a duplicate of bug 1451902 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days