Bug 1869104 - DNS fails to resolve in some build pods
Summary: DNS fails to resolve in some build pods
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ricardo Carrillo Cruz
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1885761
Reported: 2020-08-16 21:52 UTC by Sai Sindhur Malleni
Modified: 2020-11-24 16:31 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1885761
Environment:
Last Closed: 2020-09-21 10:16:42 UTC
Target Upstream Version:
Embargoed:



Description Sai Sindhur Malleni 2020-08-16 21:52:37 UTC
Description of problem:
We are running a control plane stress test. It does the following:
1. Create 100 namespaces (projects).
2. In each namespace, create certain OpenShift resources/objects such as pods, deployments, imagestreams, buildconfigs, and builds.

In general the test passes if we do not create buildconfigs and builds. With them, it experiences intermittent failures in 3-4 projects out of the total 100 when creating buildconfig and build objects. This appears to be because the hostname of the GitHub repository to build from cannot be resolved.

This is on baremetal with OVNKubernetes.
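
A rough sketch of the test loop (names here are illustrative; the actual build template is included under "Additional info" below):

i=1
while [ $i -le 100 ]; do
  project=mastervert$(printf '%03d' $i)
  oc new-project $project
  # Create pods, deployments, imagestreams, buildconfigs and builds in the
  # project, e.g. by processing templates such as the build template below.
  i=$((i+1))
done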

We generally see two kinds of error messages in the failed builds, coming from the build pod:

=========================================================================
    State:      Terminated
      Reason:   Error
      Message:  Cloning "https://github.com/openshift-scale/hello-openshift.git" ...
                error: RPC failed; result=6, HTTP code = 0
                fatal: The remote end hung up unexpectedly
=========================================================================
    State:      Terminated
      Reason:   Error
      Message:  Cloning "https://github.com/openshift-scale/hello-openshift.git" ...
                error: fatal: unable to access 'https://github.com/openshift-scale/hello-openshift.git/': Could not resolve host: github.com; Unknown error
=========================================================================
This happens across several attempts over several days, so it is definitely not a GitHub issue.

Version-Release number of selected component (if applicable):
4.5.5

How reproducible:
100% - at least one failure in 100 projects

Steps to Reproduce:
1. Create several builds (2-3 per project) in 100 projects.

Actual results:
Some of the builds fail due to DNS issues.

Expected results:
All builds should complete successfully.

Additional info:

The template used for the builds:
  ---
    kind: Template
    apiVersion: template.openshift.io/v1
    metadata:
      name: buildTemplate
      creationTimestamp:
      annotations:
        description: This template will create a single build.
        tags: ''
    objects:
    - apiVersion: v1
      kind: Build
      metadata:
        name: build${IDENTIFIER}-1
      spec:
        output:
          to:
            kind: ImageStreamTag
            name: build-imagestream-dest${IDENTIFIER}:latest
        resources: {}
        serviceAccount: builder
        source:
          dockerfile: |-
            FROM quay.io/openshift-scale/mastervertical-build
            USER example
          git:
            uri: https://github.com/openshift-scale/hello-openshift.git
          secrets:
          type: Git
        strategy:
          sourceStrategy:
            from:
              kind: DockerImage
              name: quay.io/openshift-scale/mastervertical-build:latest
          type: Source
    parameters:
    - name: IDENTIFIER
      description: Number to append to the name of resources
      value: '1'
      required: true
    labels:
      template: buildTemplate
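
For reference, the template can be instantiated per project with oc process (the file name here is illustrative, and the output ImageStreamTag assumes a matching ImageStream is created elsewhere in the test):

oc process -f build-template.yaml -p IDENTIFIER=3 | oc create -f -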

I also verified, by launching a debug pod, that /etc/resolv.conf points to the ClusterIP of the DNS service as expected:

[kni@e16-h18-b03-fc640 ~]$ oc exec debug -- cat /etc/resolv.conf                                                                                                                                                                             
search mastervert099.svc.cluster.local svc.cluster.local cluster.local test448.myocp4.com
nameserver 172.30.0.10
options ndots:5
[kni@e16-h18-b03-fc640 ~]$ oc get svc -A | grep 172.30.0.10
openshift-dns                                      dns-default                                ClusterIP      172.30.0.10      <none>                                 53/UDP,53/TCP,9154/TCP         5d5h  
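
For context, with options ndots:5 a name like www.github.com (two dots, fewer than five) is first tried against each search domain before being queried as an absolute name, so every lookup fans out into several queries to 172.30.0.10. This can be observed from the debug pod, assuming dig is available in the image:

# Approximate query order for www.github.com given the search list above:
#   www.github.com.mastervert099.svc.cluster.local.  (NXDOMAIN)
#   www.github.com.svc.cluster.local.                (NXDOMAIN)
#   www.github.com.cluster.local.                    (NXDOMAIN)
#   www.github.com.test448.myocp4.com.               (NXDOMAIN)
#   www.github.com.                                  (answer)
oc exec debug -- dig +search +short www.github.com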

This is a baremetal deployment using OVNKubernetes, and all the DNS pods on all nodes are running (none are in a failed state).

Comment 2 Sai Sindhur Malleni 2020-08-17 15:46:34 UTC
Set up a simple reproducer on baremetal in OCP 4.5.5.

I created a namespace called ping and ran this script to launch 500 pods in the same namespace, all trying to ping www.github.com:

#!/bin/bash
# Launch $1 pods, each of which resolves and pings www.github.com once.
i=0
NUM=$1
mkdir -p dns
while [ $i -lt $NUM ]; do
cat <<EOF > dns/pod-${i}.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ping-${i}
  labels:
    app: ping
spec:
  restartPolicy: Never
  containers:
  - name: debug
    image: quay.io/openshift-scale/mastervertical-build
    command: ["/bin/sh"]
    args: ["-c", "ping -c 1 www.github.com"]
EOF
oc create -f dns/pod-${i}.yaml
i=$((i+1))
done

Some pods went into Error state due to the DNS issue:

[kni@e16-h18-b03-fc640 ~]$ oc get pods | grep -i erro
ping-156   0/1     Error               0          9m23s
ping-484   0/1     Error               0          8m54s
[kni@e16-h18-b03-fc640 ~]$ oc logs ping-156
ping: www.github.com: Name or service not known
[kni@e16-h18-b03-fc640 ~]$ oc logs ping-484
ping: www.github.com: Name or service not known
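
To tally the failures across the whole namespace, something like this works (a sketch):

oc get pods -n ping --no-headers | awk '$3 == "Error" {c++} END {print c+0, "pods failed"}'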

Comment 3 Sai Sindhur Malleni 2020-08-18 02:56:42 UTC
I'm able to reproduce this on AWS with OVNKubernetes, not just baremetal.

Comment 4 Sai Sindhur Malleni 2020-08-18 15:44:02 UTC
I've also added a cat /etc/resolv.conf command to the pods and see that it is populated correctly in failed pods as well as successful ones.
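
For reference, a minimal sketch of the modified container spec (the -x flag is an assumption, matching the "+" command echoes in the logs below):

    command: ["/bin/sh"]
    # -x echoes each command with a leading "+"; the commands run in sequence
    args: ["-xc", "cat /etc/resolv.conf; ping -c 1 www.github.com"]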

[kni@e16-h18-b03-fc640 ~]$ #successful pod
[kni@e16-h18-b03-fc640 ~]$ oc logs ping-99
+ cat /etc/resolv.conf
search ping.svc.cluster.local svc.cluster.local cluster.local test53.myocp4.com
nameserver 172.30.0.10
options ndots:5
+ ping -c 1 www.github.com
PING github.com (140.82.113.3) 56(84) bytes of data.
64 bytes from lb-140-82-113-3-iad.github.com (140.82.113.3): icmp_seq=1 ttl=43 time=9.46 ms

--- github.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 9.460/9.460/9.460/0.000 ms
[kni@e16-h18-b03-fc640 ~]$ #failed pod
[kni@e16-h18-b03-fc640 ~]$ oc logs ping-38
+ cat /etc/resolv.conf
search ping.svc.cluster.local svc.cluster.local cluster.local test53.myocp4.com
nameserver 172.30.0.10
options ndots:5
+ ping -c 1 www.github.com
ping: www.github.com: Name or service not known

Comment 5 Sai Sindhur Malleni 2020-08-18 15:53:25 UTC
Another data point: I cannot reproduce this with OpenShiftSDN on AWS, while it is reproducible with OVNKubernetes.

Comment 6 Stephen Greene 2020-08-18 18:38:26 UTC
(In reply to Sai Sindhur Malleni from comment #3)
> I'm able to reproduce this on AWS with OVNKubernetes, not just baremetal.

How large was the AWS cluster used to reproduce this? Standard 3m/3w cluster, or a larger scale cluster? What exact 4.5.z version was used to check this BZ on AWS? Still 4.5.5?

I understand this issue has been observed on 4.5.5. I am unable to reproduce this issue on a standard size 3 master 3 worker 4.5.6 AWS OVN cluster using the provided reproducer script.

Comment 7 Stephen Greene 2020-08-18 20:13:21 UTC
Just tried to reproduce this issue on a 3 master 3 worker 4.5.5 AWS OVN cluster and was unsuccessful after running the pod ping test 1000 times. Would like to be able to reproduce this BZ, or alternatively have access to a cluster in which it is reproducible.

Comment 8 Sai Sindhur Malleni 2020-08-19 15:03:14 UTC
So the cluster on AWS is 4.5.5 with 3 master nodes and 25 worker nodes (OVNKubernetes). I have a reproducer; I'm going to contact you on Slack now.

Comment 12 Ben Bennett 2020-08-20 16:26:38 UTC
This looks like the node networking is at fault... has this reoccurred?

Comment 13 Sai Sindhur Malleni 2020-08-21 18:00:32 UTC
This reproduces consistently on a 4.5.5 AWS cluster with 25 workers.

Comment 14 Ben Bennett 2020-08-24 14:22:01 UTC
Steve: Can you take this and contact someone on the SDN team's Slack to help get to the root of this? The SDN team doesn't have the context needed to debug how coredns is using the network and failing.

Comment 16 Andrew McDermott 2020-09-02 16:12:22 UTC
Moving the component to SDN, as that appears to be where it was assigned in comment #14.

Comment 17 Ricardo Carrillo Cruz 2020-09-08 11:13:57 UTC
Hi, just got this ticket assigned.

Can I get details to connect to the cluster?

Comment 18 Sai Sindhur Malleni 2020-09-08 15:13:26 UTC
Hi, this was a short-lived testing cluster; it is not around anymore. I have provided environment requirements and scripts to reproduce on AWS in the bug comment history. Let me know if you have trouble reproducing. Thanks.

Comment 19 Ricardo Carrillo Cruz 2020-09-09 10:34:34 UTC
I'm afraid I don't have the means to get a baremetal cluster.
If you could provision one and provide details, I'd be happy to debug this further.

Thanks

Comment 20 Sai Sindhur Malleni 2020-09-09 16:37:10 UTC
The problem is not specific to baremetal. Please refer to the previous comments about it being reproduced on AWS.

I was able to reproduce it on AWS with an OVNKubernetes cluster with 25 worker nodes and the reproducer script referenced earlier.

Comment 21 Ricardo Carrillo Cruz 2020-09-11 13:07:01 UTC
Oh, apologies, I didn't see that.
Will try to repro, thanks.

Comment 22 Rashid Khan 2020-09-14 20:33:00 UTC
Moved to Urgent, as Sai indicated he needs this fixed to be comfortable sending status/numbers to VZ.
He can give more details if needed. 

Thanks

Comment 23 Ricardo Carrillo Cruz 2020-09-16 15:08:11 UTC
Still unable to repro:

[ricky@localhost openshift-installer]$ oc get pods | grep -i erro
ping-283   0/1     CreateContainerError   0          4m32s
ping-284   0/1     CreateContainerError   0          4m31s
ping-286   0/1     CreateContainerError   0          4m30s
ping-287   0/1     CreateContainerError   0          4m30s
ping-289   0/1     CreateContainerError   0          4m29s

Comment 24 Ricardo Carrillo Cruz 2020-09-16 15:11:39 UTC
Eventually all pods are deployed, no errors:

[ricky@localhost openshift-installer]$ oc get pods | grep -i erro
[ricky@localhost openshift-installer]$

Comment 25 Sai Sindhur Malleni 2020-09-16 15:22:41 UTC
How many pods are you creating, and what version?

Comment 26 Ricardo Carrillo Cruz 2020-09-16 16:11:31 UTC
+ oc create -f dns/pod-496.yaml
pod/ping-496 created
+ i=497
+ '[' 497 -lt 500 ']'
+ cat
+ oc create -f dns/pod-497.yaml
pod/ping-497 created
+ i=498
+ '[' 498 -lt 500 ']'
+ cat
+ oc create -f dns/pod-498.yaml
pod/ping-498 created
+ i=499
+ '[' 499 -lt 500 ']'
+ cat
+ oc create -f dns/pod-499.yaml
pod/ping-499 created
+ i=500
+ '[' 500 -lt 500 ']'
[ricky@localhost openshift-installer]$ oc get pods ^C
[ricky@localhost openshift-installer]$ oc get pods | grep -i err


Using the latest 4.5 CI image.

Will try with more pods.

Comment 27 Ricardo Carrillo Cruz 2020-09-16 17:17:22 UTC
Can't repro with 1000 either:

pod/ping-996 created
pod/ping-997 created
pod/ping-998 created
pod/ping-999 created
[ricky@localhost openshift-installer]$ oc get po^C
[ricky@localhost openshift-installer]$ oc get pods | grep -i erro
ping-351   0/1     CreateContainerError   0          11m

Comment 28 Sai Sindhur Malleni 2020-09-18 16:28:02 UTC
It is likely that the problem got resolved by some of the other bug fixes that went into 4.5. I'm unclear as to when exactly it was fixed, but with a 4.5.10 deployment on baremetal/AWS I do not see the issue, so I think we can consider this bug fixed. To be clear, the problem definitely existed when I opened this bug about a month ago on 4.5.5; it seems to be absent from the more recent 4.5 releases.

