Description of problem:

We are running a control plane stress test. It does the following:
1. Create 100 namespaces (projects).
2. In each namespace, create certain OpenShift resources/objects such as pods, deployments, imagestreams, buildconfigs and builds.

The test passes in general if we do not create buildconfigs and builds. With them, the test experiences intermittent failures in 3-4 projects out of the total 100 when creating buildconfig and build objects. This appears to be because the URL for the GitHub repo to build is not resolved. This is on baremetal and with OVNKubernetes.

In general we see 2 kinds of error messages in failed builds, relating to the build pod:

=========================================================================
State: Terminated
Reason: Error
Message: Cloning "https://github.com/openshift-scale/hello-openshift.git" ...
        error: RPC failed; result=6, HTTP code = 0
        fatal: The remote end hung up unexpectedly
===========================================================================
State: Terminated
Reason: Error
Message: Cloning "https://github.com/openshift-scale/hello-openshift.git" ...
        error: fatal: unable to access 'https://github.com/openshift-scale/hello-openshift.git/': Could not resolve host: github.com; Unknown error
===========================================================================

This happens over several attempts and several days, so it is definitely not a GitHub issue.

Version-Release number of selected component (if applicable):
4.5.5

How reproducible:
100% - at least one failure in 100 projects

Steps to Reproduce:
1. Create several builds (2-3 per project) in 100 projects

Actual results:
Some of the builds fail due to DNS issues

Expected results:
All builds must be successful

Additional info:
The template used for the builds:

---
kind: Template
apiVersion: template.openshift.io/v1
metadata:
  name: buildTemplate
  creationTimestamp:
  annotations:
    description: This template will create a single build.
    tags: ''
objects:
- apiVersion: v1
  kind: Build
  metadata:
    name: build${IDENTIFIER}-1
  spec:
    output:
      to:
        kind: ImageStreamTag
        name: build-imagestream-dest${IDENTIFIER}:latest
    resources: {}
    serviceAccount: builder
    source:
      dockerfile: |-
        FROM quay.io/openshift-scale/mastervertical-build
        USER example
      git:
        uri: https://github.com/openshift-scale/hello-openshift.git
      secrets:
      type: Git
    strategy:
      sourceStrategy:
        from:
          kind: DockerImage
          name: quay.io/openshift-scale/mastervertical-build:latest
      type: Source
parameters:
- name: IDENTIFIER
  description: Number to append to the name of resources
  value: '1'
  required: true
labels:
  template: buildTemplate

I also verified by launching a debug pod that /etc/resolv.conf in general points to the ClusterIP of the DNS service:

[kni@e16-h18-b03-fc640 ~]$ oc exec debug -- cat /etc/resolv.conf
search mastervert099.svc.cluster.local svc.cluster.local cluster.local test448.myocp4.com
nameserver 172.30.0.10
options ndots:5

[kni@e16-h18-b03-fc640 ~]$ oc get svc -A | grep 172.30.0.10
openshift-dns   dns-default   ClusterIP   172.30.0.10   <none>   53/UDP,53/TCP,9154/TCP   5d5h

This is a baremetal deployment using OVNKubernetes, and all the DNS pods on all nodes are running (nothing is in a failed state).
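One side note on load: with options ndots:5 in the resolv.conf above, a name like github.com has fewer than five dots, so the glibc resolver tries it against each search domain before querying it as an absolute name. A sketch of the candidate order (for illustration only; search domains are copied from the debug pod output above):

```shell
#!/bin/sh
# Sketch: order of queries a glibc resolver with ndots:5 attempts for a
# relative name. Search domains taken from the resolv.conf shown above.
NAME="github.com"
SEARCH="mastervert099.svc.cluster.local svc.cluster.local cluster.local test448.myocp4.com"

# Fewer dots than ndots => each search domain is tried first,
# the absolute name is tried last.
for domain in $SEARCH; do
  echo "${NAME}.${domain}"
done
echo "${NAME}."
```

Each of the first four candidates costs an NXDOMAIN round trip through 172.30.0.10 before the absolute name resolves, so the stress test generates several DNS queries per lookup.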
Set up a simple reproducer on baremetal in OCP 4.5.5.

I created a namespace called ping and ran this script to launch 500 pods in the same namespace, all trying to ping www.github.com:

i=0
NUM=$1
mkdir -p dns
while [ $i -lt $NUM ]; do
cat <<EOF > dns/pod-${i}.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ping-${i}
  labels:
    app: ping
spec:
  restartPolicy: Never
  containers:
  - name: debug
    image: quay.io/openshift-scale/mastervertical-build
    command: ["/bin/sh"]
    args: ["-c", "ping -c 1 www.github.com"]
EOF
oc create -f dns/pod-${i}.yaml
i=$((i+1))
done

Pods went into error due to DNS issues:

[kni@e16-h18-b03-fc640 ~]$ oc get pods | grep -i erro
ping-156   0/1   Error   0   9m23s
ping-484   0/1   Error   0   8m54s
[kni@e16-h18-b03-fc640 ~]$ oc logs ping-156
ping: www.github.com: Name or service not known
[kni@e16-h18-b03-fc640 ~]$ oc logs ping-484
ping: www.github.com: Name or service not known
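For anyone rerunning this, the loop above can be split so that manifest generation is separate from the oc create step, which makes the YAML easy to inspect before applying. A minimal sketch (the function name is mine; the oc call is left commented so the generation step can run anywhere):

```shell
#!/bin/sh
# Generate N ping-pod manifests into a directory, mirroring the
# reproducer script above. Apply afterwards with: oc create -f <dir>/
generate_ping_pods() {
  num=$1
  outdir=$2
  mkdir -p "$outdir"
  i=0
  while [ "$i" -lt "$num" ]; do
    cat <<EOF > "$outdir/pod-${i}.yaml"
apiVersion: v1
kind: Pod
metadata:
  name: ping-${i}
  labels:
    app: ping
spec:
  restartPolicy: Never
  containers:
  - name: debug
    image: quay.io/openshift-scale/mastervertical-build
    command: ["/bin/sh"]
    args: ["-c", "ping -c 1 www.github.com"]
EOF
    # oc create -f "$outdir/pod-${i}.yaml"   # run against the cluster
    i=$((i+1))
  done
}
```

Cleanup between runs can then use the shared label: oc delete pods -l app=ping.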
I'm able to reproduce this on AWS with OVNKubernetes, not just baremetal.
I've also added a cat /etc/resolv.conf command and see that it is updated correctly in successful as well as failed pods.

[kni@e16-h18-b03-fc640 ~]$ # successful pod
[kni@e16-h18-b03-fc640 ~]$ oc logs ping-99
+ cat /etc/resolv.conf
search ping.svc.cluster.local svc.cluster.local cluster.local test53.myocp4.com
nameserver 172.30.0.10
options ndots:5
+ ping -c 1 www.github.com
PING github.com (140.82.113.3) 56(84) bytes of data.
64 bytes from lb-140-82-113-3-iad.github.com (140.82.113.3): icmp_seq=1 ttl=43 time=9.46 ms

--- github.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 9.460/9.460/9.460/0.000 ms

[kni@e16-h18-b03-fc640 ~]$ # failed pod
[kni@e16-h18-b03-fc640 ~]$ oc logs ping-38
+ cat /etc/resolv.conf
search ping.svc.cluster.local svc.cluster.local cluster.local test53.myocp4.com
nameserver 172.30.0.10
options ndots:5
+ ping -c 1 www.github.com
ping: www.github.com: Name or service not known
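Since the successful and failed pods show identical resolv.conf contents, the failures look transient rather than a misconfiguration. As a temporary mitigation while the root cause is investigated, a lookup can be retried a few times before declaring failure; a hedged sketch (the resolve_with_retry name is mine, not part of the reproducer):

```shell
#!/bin/sh
# Retry a DNS lookup a few times before giving up, to tolerate the kind of
# transient resolution errors seen in the failed ping pods.
resolve_with_retry() {
  host=$1
  attempts=${2:-3}
  n=0
  while [ "$n" -lt "$attempts" ]; do
    # getent consults the same NSS/resolver path as ordinary applications
    if getent hosts "$host" >/dev/null 2>&1; then
      return 0
    fi
    n=$((n+1))
    sleep 1
  done
  return 1
}
```

Usage in the pod args would be along the lines of: resolve_with_retry www.github.com 5 && ping -c 1 www.github.com. This only masks the symptom; the underlying OVNKubernetes issue still needs a fix.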
Another datapoint: we cannot reproduce this with OpenShiftSDN on AWS, while it is reproducible with OVNKubernetes.
(In reply to Sai Sindhur Malleni from comment #3)
> I'm able to reproduce this on AWS with OVNKubernetes, not just baremetal.

How large was the AWS cluster used to reproduce this? A standard 3m/3w cluster, or a larger-scale cluster? What exact 4.5.z version was used to check this BZ on AWS? Still 4.5.5?

I understand this issue has been observed on 4.5.5. I am unable to reproduce it on a standard-size 3-master, 3-worker 4.5.6 AWS OVN cluster using the provided reproducer script.
Just tried to reproduce this issue on a 3 master 3 worker 4.5.5 AWS OVN cluster and was unsuccessful after running the pod ping test 1000 times. Would like to be able to reproduce this BZ, or alternatively have access to a cluster in which it is reproducible.
So the cluster on AWS is 4.5.5 with 3 master nodes and 25 worker nodes (OVNKubernetes). I have a reproducer; I'm going to contact you on Slack now.
This looks like the node networking is at fault... has this reoccurred?
This reproduces consistently on a 4.5.5 cluster on AWS with 25 worker nodes.
Steve: Can you take this and contact someone on the SDN team slack to help get to the root of this? The SDN team doesn't have the context needed to debug how coredns is using the network and failing.
Moving component to SDN as it looks to be assigned in comment #14.
Hi, just got this ticket assigned. Can I get details to connect to the cluster?
Hi,

This was a short-lived testing cluster and it is not around now. I have provided environment requirements and scripts to reproduce on AWS in the bug comment history. Let me know if you have trouble reproducing. Thanks.
I'm afraid I don't have the means to get a baremetal cluster. If you could provision it and provide details, I'd be happy to debug this further. Thanks.
The problem is not specific to baremetal. Please refer to the previous comments about it being reproduced on AWS. I was able to reproduce it on AWS with an OVNKubernetes cluster with 25 worker nodes and the reproducer script referenced earlier.
Oh, apologies, I didn't see that. Will try to repro, thanks.
Moved to Urgent, as Sai indicated he needs this fixed before he is comfortable sending status/numbers to VZ. He can give more details if needed. Thanks.
Still unable to repro:

[ricky@localhost openshift-installer]$ oc get pods | grep -i erro
ping-283   0/1   CreateContainerError   0   4m32s
ping-284   0/1   CreateContainerError   0   4m31s
ping-286   0/1   CreateContainerError   0   4m30s
ping-287   0/1   CreateContainerError   0   4m30s
ping-289   0/1   CreateContainerError   0   4m29s
Eventually all pods are deployed, no errors:

[ricky@localhost openshift-installer]$ oc get pods | grep -i erro
[ricky@localhost openshift-installer]$
How many pods are you creating? and what version?
+ oc create -f dns/pod-496.yaml
pod/ping-496 created
+ i=497
+ '[' 497 -lt 500 ']'
+ cat
+ oc create -f dns/pod-497.yaml
pod/ping-497 created
+ i=498
+ '[' 498 -lt 500 ']'
+ cat
+ oc create -f dns/pod-498.yaml
pod/ping-498 created
+ i=499
+ '[' 499 -lt 500 ']'
+ cat
+ oc create -f dns/pod-499.yaml
pod/ping-499 created
+ i=500
+ '[' 500 -lt 500 ']'
[ricky@localhost openshift-installer]$ oc get pods
^C
[ricky@localhost openshift-installer]$ oc get pods | grep -i err

Using the latest 4.5 CI image. Will push for more.
Can't repro with 1000 either:

pod/ping-996 created
pod/ping-997 created
pod/ping-998 created
pod/ping-999 created
[ricky@localhost openshift-installer]$ oc get po
^C
[ricky@localhost openshift-installer]$ oc get pods | grep -i erro
ping-351   0/1   CreateContainerError   0   11m
It is likely that the problem was resolved by some of the other bug fixes that went into 4.5. I'm unclear as to when exactly it was fixed, but trying with a 4.5.10 deployment on baremetal/AWS I do not see the issue. I think we can consider this bug fixed.

To be clear, the problem definitely existed when I opened this bug about a month ago on 4.5.5; it appears to be absent in the more recent 4.5 releases.