Bug 1885761

Summary: DNS fails to resolve in some pods
Product: OpenShift Container Platform
Reporter: Sai Sindhur Malleni <smalleni>
Component: Networking
Assignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes
QA Contact: Raul Sevilla <rsevilla>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: urgent
CC: aconstan, amcdermo, anbhat, anusaxen, aos-bugs, bbennett, dblack, fiezzi, jlema, jtaleric, jtanenba, mark.d.gray, mfisher, ricarril, rkhan, rsevilla, sgreene, trozet, zzhao
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-46
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1869104
Clones: 1889461 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:23:22 UTC
Type: ---
Bug Depends On: 1839102, 1869104    
Bug Blocks: 1889461    
Attachments:
openflows for table48 and table49 (flags: none)

Comment 1 Sai Sindhur Malleni 2020-10-06 20:51:19 UTC
Weirdly enough, the bug was not reproducible when we needed to debug it, but it's back in 4.6. I still have the environment where this happens, and if we can get someone to look at it in the next 1-2 days that would be useful, as I'm not sure how lucky we will get with a reproducer again.

Comment 2 Joe Talerico 2020-10-07 13:07:04 UTC
@Richardo - I am hitting this bug on OCP4.6 w/ 100 workers

Init Containers:
  git-clone:
    Container ID:  cri-o://46d1a798e4825eb76f54ae6b466253f6e9b3ac9b56837d226d51f9db8b3be6ee
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a27bd55c6ca56e8d2ef2ad694a5b7817eed55777d94c7cae6494ebeef5951d9
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a27bd55c6ca56e8d2ef2ad694a5b7817eed55777d94c7cae6494ebeef5951d9
    Port:          <none>
    Host Port:     <none>
    Args:
      openshift-git-clone
      --loglevel=0
    State:      Terminated
      Reason:   Error
      Message:  Cloning "https://github.com/openshift-scale/hello-openshift.git" ...
error: fatal: unable to access 'https://github.com/openshift-scale/hello-openshift.git/': Could not resolve host: github.com

Comment 5 Tim Rozet 2020-10-15 21:01:34 UTC
This is a race condition where not all of the OVS flows are installed yet by the time the pod comes up and tries to issue a DNS request. We need to add a check for table 48 to ensure the pod has received the flows for its egress port security.
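For illustration, a minimal sketch of what such a check could look like on a node, assuming the flows land on br-int and that matching the pod's MAC in the table 48 dump is sufficient; the function name, bridge, and polling interval are assumptions of mine, not the actual ovn-kubernetes patch:

// Hypothetical sketch: poll br-int until the pod's MAC appears in table 48
// (egress port security) flows, or give up after a timeout.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func waitForPortSecurityFlows(podMAC string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// Dump only table 48 so each poll does not scan the full flow table.
		out, err := exec.Command("ovs-ofctl", "dump-flows", "br-int", "table=48").CombinedOutput()
		if err == nil && strings.Contains(strings.ToLower(string(out)), strings.ToLower(podMAC)) {
			return nil // egress port-security flows for this pod are installed
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for table 48 flows for pod MAC %s", podMAC)
}

func main() {
	// MAC taken from the logs later in this bug, used here only as an example.
	if err := waitForPortSecurityFlows("0a:58:0a:96:00:f7", 20*time.Second); err != nil {
		fmt.Println(err)
	}
}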

Comment 7 Anurag saxena 2020-11-17 17:40:19 UTC
Verification using the steps mentioned in the description looks good to me on the Azure platform.
@zzhao Could you verify it on the bare-metal platform if you get a chance today? Thanks

Comment 8 zhaozhanqi 2020-11-18 12:49:22 UTC
@Anurag, sure. The cluster is still being deployed with a 4.7 build; let me check later.

From the bug description, this should not be a platform-specific issue.

Comment 10 zhaozhanqi 2020-11-19 02:46:09 UTC
Created attachment 1730746 [details]
openflows for table48 and table49

Comment 11 zhaozhanqi 2020-11-19 07:55:02 UTC
Assigning this bug back; please correct me if I'm wrong. Thanks.

Comment 12 Anurag saxena 2020-11-20 19:09:03 UTC
@zzhao Can you try once more with something like "args: [git clone https://github.com/openshift-scale/hello-openshift.git;sleep 20000000000]"?

AFAIU OCP supports TCP, UDP, and SCTP. The ping utility (ICMP) might not be a good way to check this at scale, IMO. Correct me, Tim, if that's not the case :)
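For what it's worth, a check that exercises DNS over UDP/TCP and then a TCP connection, rather than ICMP, could look like the following. This is my own illustration of the point above, not a step from this bug; the target host is just an example:

// Resolve a host (exercising the pod's DNS path) and open a TCP connection.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	addrs, err := net.LookupHost("github.com") // DNS over UDP/TCP
	if err != nil {
		fmt.Println("DNS lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)

	conn, err := net.DialTimeout("tcp", "github.com:443", 5*time.Second)
	if err != nil {
		fmt.Println("TCP connect failed:", err)
		return
	}
	conn.Close()
	fmt.Println("TCP connect OK")
}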

Comment 14 Anurag saxena 2020-11-23 23:10:29 UTC
@zzhao I see. It seems good to me then. Checked on all non-metal platforms. I guess it's okay to close this, IMO. Re-open if seen again.

Comment 15 zhaozhanqi 2020-11-24 01:48:03 UTC
Yes, agreed.

Re-tested on all platforms and this issue was not reproduced. Moving this bug to verified; feel free to re-open if it is seen again.

Comment 16 Sai Sindhur Malleni 2020-11-25 23:16:40 UTC
I am still seeing this on bare metal with the fix in place. I have a 250-node cluster. It looks like the scale of the environment also contributes to this. I don't think we should close this bug.

Comment 17 zhaozhanqi 2020-11-26 02:13:36 UTC
Thanks for the information, Sai. Reassigning this issue.

Comment 20 Sai Sindhur Malleni 2020-12-01 15:36:14 UTC
I understand these are separate issues. I still run into this issue even without running into https://bugzilla.redhat.com/show_bug.cgi?id=1901363

Comment 21 Tim Rozet 2020-12-14 22:55:07 UTC
Thanks to Raul and Sai for helping to reproduce. tl;dr: the fix in this bug helps reduce the occurrence of the issue, but it can still happen. When reproducing, we can see that when OVN is under heavy load the first CNI request may fail (ovnkube-node and ovn-controller logs interlaced):

Interlaced logs:
I1214 21:43:50.908845    3313 cni.go:147] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] dispatching pod network request &{ADD cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc03976a400}

2020-12-14T21:44:09Z|04949|binding|INFO|Claiming lport cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687_cluster-density-2-1-build for this chassis.
2020-12-14T21:44:09Z|04950|binding|INFO|cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687_cluster-density-2-1-build: Claiming 0a:58:0a:96:00:f7 10.150.0.247

I1214 21:44:15.213359    3313 cni.go:157] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] CNI request &{ADD cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc03976a400}, result "", err failed to configure pod interface: timed out waiting for pod flows for pod: cluster-density-2-1-build, error: timed out waiting for the condition

I1214 21:44:15.266808    3313 cni.go:157] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] CNI request &{DEL cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc02beb7900}, result "", err <nil>
I1214 21:44:15.314062    3313 cniserver.go:147] Waiting for DEL result for pod cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build
I1214 21:44:15.314082    3313 cni.go:147] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] dispatching pod network request &{DEL cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc04092ac00}
I1214 21:44:15.337861    3313 cni.go:157] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] CNI request &{DEL cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc04092ac00}, result "", err <nil>
I1214 21:44:15.723166    3313 cniserver.go:147] Waiting for ADD result for pod cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build
I1214 21:44:15.723187    3313 cni.go:147] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] dispatching pod network request &{ADD cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 127056dc4da89cebce51b8106538b145cc174c40aee542f08356aba139abc000 /var/run/netns/12fe721d-c9a1-470c-8456-0382fbf2c8bc eth0 0xc02b04bc00}
I1214 21:44:20.391537    3313 cni.go:157] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] CNI request &{ADD cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 127056dc4da89cebce51b8106538b145cc174c40aee542f08356aba139abc000 /var/run/netns/12fe721d-c9a1-470c-8456-0382fbf2c8bc eth0 0xc02b04bc00}, result "{\"Result\":{\"interfaces\":[{\"name\":\"127056dc4da89ce\",\"mac\":\"72:cc:3a:e5:66:8a\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:96:00:f7\",\"sandbox\":\"/var/run/netns/12fe721d-c9a1-470c-8456-0382fbf2c8bc\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.150.0.247/22\",\"gateway\":\"10.150.0.1\"}],\"dns\":{}},\"PodIFInfo\":null}", err <nil>


We can see that:
1. On the first CNI request, it takes ovn-controller 19 seconds to bind the port. We have a 20-second timeout for flows, so we hit it and consider the CNI ADD failed.
2. CNI then deletes the port, and a new ADD happens.
3. The second ADD succeeds and the pod is started, where it immediately tries to send a packet and fails.

It looks like either we are missing another flow to check for, or, because of the delete of the old port, ovn-controller removes and reinstalls the flows for the pod. Either way this is an OVN bug, and we can't keep hacking around it in ovn-kubernetes trying to determine when the flows are present for the pod. For now I will leave this bug open and have it depend on bug 1839102, which will give us a way to ask OVN whether the flows are present for a port. A small diagnostic sketch for the remove/reinstall theory follows.
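One way to test the remove/reinstall theory would be to watch tables 48 and 49 on br-int for the pod's MAC and log whenever its flows disappear or reappear around the DEL/ADD cycle. This is my own diagnostic sketch, not something from the bug; the MAC, bridge, poll interval, and duration are assumptions:

// Log transitions of a pod's flows in tables 48/49 on br-int over time.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func hasFlows(table, mac string) bool {
	out, err := exec.Command("ovs-ofctl", "dump-flows", "br-int", "table="+table).CombinedOutput()
	return err == nil && strings.Contains(strings.ToLower(string(out)), strings.ToLower(mac))
}

func main() {
	const mac = "0a:58:0a:96:00:f7" // pod MAC from the logs above
	last := map[string]bool{}
	for i := 0; i < 300; i++ { // roughly one minute at 200ms per poll
		for _, table := range []string{"48", "49"} {
			present := hasFlows(table, mac)
			if present != last[table] {
				fmt.Printf("%s table %s flows present=%v\n", time.Now().Format(time.RFC3339Nano), table, present)
				last[table] = present
			}
		}
		time.Sleep(200 * time.Millisecond)
	}
}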

Comment 22 Tim Rozet 2020-12-14 23:08:40 UTC
I should add that after 3 seconds all 50 of the test pods have all of their flows, in a cluster with 107 nodes, 6081 services, and 21830 pods.

Comment 23 Tim Rozet 2020-12-15 15:36:40 UTC
Dumitru seems to think that the CNI DEL ends up causing flows to be reprogrammed for the subsequent pod ADD. He had a really good idea to also include checking for the OF port in the flows. I'm going to come up with a patch for that, and then we can see whether this issue is still reproducible. The real OVN fix won't happen for some time, so we can use this as an interim fix.
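Roughly, the extra check would resolve the pod interface's OpenFlow port number and require it to show up as a match in the br-int flows before declaring the pod's flows ready. The sketch below is illustrative only, not the actual patch; in particular, the exact in_port formatting in the flow dump can differ between OVS versions, and the interface name is just the host-side veth from the CNI result above:

// Resolve an interface's ofport via ovs-vsctl, then check br-int flows for it.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// ofPort returns the OpenFlow port number assigned to an OVS interface.
func ofPort(iface string) (string, error) {
	out, err := exec.Command("ovs-vsctl", "get", "Interface", iface, "ofport").Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

// hasOFPortFlows reports whether any br-int flow matches on the given OF port.
func hasOFPortFlows(port string) (bool, error) {
	out, err := exec.Command("ovs-ofctl", "dump-flows", "br-int").Output()
	if err != nil {
		return false, err
	}
	return strings.Contains(string(out), "in_port="+port), nil
}

func main() {
	port, err := ofPort("127056dc4da89ce") // host-side veth from the CNI result
	if err != nil {
		fmt.Println("could not resolve ofport:", err)
		return
	}
	ok, err := hasOFPortFlows(port)
	fmt.Printf("flows matching in_port=%s present: %v (err=%v)\n", port, ok, err)
}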

Comment 24 Tim Rozet 2020-12-15 21:09:57 UTC
Thanks to Raul for testing out the fix; he thinks it does seem to help. The problem may still happen at higher scale, but we will defer fixing that case until there is a proper OVN fix in bug 1839102. For now, this fix should work for around 250 nodes, 7k pods, and 1k services or less.

Comment 31 errata-xmlrpc 2021-02-24 15:23:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633