Description of problem:

Very frequent errors with the integrated registry in free-int, sometimes on the push after a successful build, sometimes on the pull.

Version-Release number of selected component (if applicable):

registry.ops.openshift.com/openshift3/ose-docker-registry:v3.7.0-0.125.0
atomic-openshift-3.7.0-0.125.0.git.0.c710e11.el7.x86_64

How reproducible:

Happening a lot, but not always, and possibly only on some nodes and not others.

Steps to Reproduce:

If the issue is occurring:

1. Log in to a free-int master and run:

   oc project openshift-infra
   oc get pods

   You should see a "hibernation" pod running, as well as "analytics".

2. oc get events exposes the problem.

Actual results:

5m    6m    4    hibernation-4-fsxwq    Pod    spec.containers{hibernation}    Warning    Failed    kubelet, ip-172-31-53-92.ec2.internal    Failed to pull image "172.30.215.46:5000/openshift-infra/hibernation@sha256:4b976fa71a6fd3261c34985c61b026e797798a6fb5c8dce06d4cfae43b877211": rpc error: code = 2 desc = Get https://172.30.215.46:5000/v2/: dial tcp 172.30.215.46:5000: getsockopt: no route to host

4m    6m    6    hibernation-4-fsxwq    Pod    spec.containers{hibernation}    Normal    BackOff    kubelet, ip-172-31-53-92.ec2.internal    Back-off pulling image "172.30.215.46:5000/openshift-infra/hibernation@sha256:4b976fa71a6fd3261c34985c61b026e797798a6fb5c8dce06d4cfae43b877211"

Additional info:

The image referenced above exists:

[root@free-int-master-5470f ~]# oc get images | grep 77211
sha256:4b976fa71a6fd3261c34985c61b026e797798a6fb5c8dce06d4cfae43b877211   172.30.215.46:5000/openshift-infra/hibernation@sha256:4b976fa71a6fd3261c34985c61b026e797798a6fb5c8dce06d4cfae43b877211

The build actually completed and was able to push the image. However, another similarly timed build for the analytics deployment config failed to push:

Pushing image 172.30.215.46:5000/openshift-infra/analytics:latest ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Warning: Push failed, retrying in 5s ...
Registry server Address:
Registry server User Name: serviceaccount
Registry server Email: serviceaccount
Registry server Password: <<non-empty>>
error: build error: Failed to push image: After retrying 6 times, Push image still failed

These builds were executing at roughly the same time but on different nodes, leading us to believe the problem may only be affecting some nodes. It appears to be network related:

[root@free-int-master-5470f ~]# oc project default
Now using project "default" on server "https://internal.api.free-int.openshift.com:443".
[root@free-int-master-5470f ~]# oc get svc
NAME               CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
docker-registry    172.30.215.46    <none>        5000/TCP                  169d
kubernetes         172.30.0.1       <none>        443/TCP,53/UDP,53/TCP     169d
registry-console   172.30.143.202   <none>        9000/TCP                  169d
router             172.30.122.144   <none>        80/TCP,443/TCP,1936/TCP   169d
zagg-service       172.30.181.159   <none>        80/TCP,443/TCP            169d
[root@free-int-master-5470f ~]# ping 172.30.215.46
PING 172.30.215.46 (172.30.215.46) 56(84) bytes of data.
From 10.128.4.1 icmp_seq=1 Destination Host Unreachable
From 10.128.4.1 icmp_seq=2 Destination Host Unreachable
From 10.128.4.1 icmp_seq=3 Destination Host Unreachable
From 10.128.4.1 icmp_seq=4 Destination Host Unreachable
^C
--- 172.30.215.46 ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4000ms
pipe 4

This has been happening repeatedly in this cluster: we log in periodically and find the hibernation pod is not running, with an ErrImagePull event present.
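(For reference, a quick way to spot this state on a later login is roughly the following; just a sketch, the grep pattern simply matches the event text shown above.)

   oc get events -n openshift-infra --sort-by=.lastTimestamp | grep -iE 'failed|back-off'
   oc get pods -n openshift-infra | grep -v Running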
FWIW, you can't ping a service IP address. If you want to test connectivity to the registry, do:

curl 172.30.215.46:5000
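(Hedged aside: since the kubelet errors above are hitting https://172.30.215.46:5000/v2/, a check that exercises that same path would be roughly:

   curl -kv https://172.30.215.46:5000/v2/

where -k skips certificate verification and /v2/ is the standard Docker registry API ping endpoint; either an empty JSON body or a 401 means the service is reachable.)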
We worked past this by rebooting the affected nodes. Ben requested the following if we see it again:

12:04pm <bbennett> dgoodwin: mwoodson I was looking into that... sorry I missed all of this chat. I'm trying to use my tier1/2 access to debug and missing a bunch of commands
12:05pm <bbennett> next time can we capture iptables and ovs rules and we can see what's broken
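(For next time, the capture being asked for would look roughly like the following, run on the affected node; a sketch only, assuming the default openshift-sdn bridge name br0 and its usual OpenFlow 1.3 protocol:

   iptables-save > /tmp/iptables.dump
   ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/ovs-flows.dump
   ovs-vsctl show > /tmp/ovs-vsctl.dump
)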
Created attachment 1331375 [details] iptables rules
Created attachment 1331377 [details] ovs dump
The problem has resurfaced.

21m    23m    4    hibernation-2-q14ph    Pod    spec.containers{hibernation}    Warning    Failed    kubelet, ip-172-31-53-92.ec2.internal    Failed to pull image "172.30.215.46:5000/openshift-infra/hibernation@sha256:a9cec6b1a65f76b6ce4fb8567b6787f40170be6308fec2fb069eefce37f31012": rpc error: code = 2 desc = Get https://172.30.215.46:5000/v2/: dial tcp 172.30.215.46:5000: getsockopt: no route to host

The iptables and OVS dumps from the affected node are attached. Strangely, some extra nodes (beyond the masters) appear to have scheduling disabled.

[root@free-int-master-5470f ~]# oc get nodes
NAME                            STATUS                     AGE    VERSION
ip-172-31-49-31.ec2.internal    Ready                      154d   v1.7.0+695f48a16f
ip-172-31-49-44.ec2.internal    Ready,SchedulingDisabled   176d   v1.7.0+695f48a16f
ip-172-31-50-177.ec2.internal   Ready,SchedulingDisabled   176d   v1.7.0+80709908fd
ip-172-31-53-92.ec2.internal    Ready                      176d   v1.7.0+695f48a16f
ip-172-31-56-130.ec2.internal   Ready,SchedulingDisabled   176d   v1.7.0+80709908fd
ip-172-31-56-218.ec2.internal   Ready                      176d   v1.7.0+695f48a16f
ip-172-31-59-87.ec2.internal    Ready                      176d   v1.7.0+695f48a16f
ip-172-31-60-182.ec2.internal   Ready,SchedulingDisabled   125d   v1.7.0+80709908fd
ip-172-31-61-50.ec2.internal    Ready                      112d   v1.7.0+695f48a16f
ip-172-31-62-45.ec2.internal    Ready,SchedulingDisabled   176d   v1.7.0+695f48a16f
[root@free-int-master-5470f ~]#
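(Side note: the SchedulingDisabled state can be confirmed per node and, if it is just left over from maintenance, reverted; a sketch using one of the nodes listed above:

   oc describe node ip-172-31-62-45.ec2.internal | grep -i unschedulable
   oc adm manage-node ip-172-31-62-45.ec2.internal --schedulable=true

Only re-enable scheduling if ops did not intentionally cordon those nodes.)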
Devan: Was this the system where the PLEGs were failing and taking the nodes offline? If so, I suspect it is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1451902
*** This bug has been marked as a duplicate of bug 1451902 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days