Weirdly enough, the bug was not reproducible when we needed to debug it, but it's back in 4.6. I still have the environment where this happens; if we can get someone to look in the next 1-2 days that would be useful, as I'm not sure how lucky we'll get with a reproducer again.
@Richardo - I am hitting this bug on OCP 4.6 with 100 workers:

Init Containers:
  git-clone:
    Container ID:  cri-o://46d1a798e4825eb76f54ae6b466253f6e9b3ac9b56837d226d51f9db8b3be6ee
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a27bd55c6ca56e8d2ef2ad694a5b7817eed55777d94c7cae6494ebeef5951d9
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a27bd55c6ca56e8d2ef2ad694a5b7817eed55777d94c7cae6494ebeef5951d9
    Port:          <none>
    Host Port:     <none>
    Args:
      openshift-git-clone
      --loglevel=0
    State:       Terminated
      Reason:    Error
      Message:   Cloning "https://github.com/openshift-scale/hello-openshift.git" ...
                 error: fatal: unable to access 'https://github.com/openshift-scale/hello-openshift.git/': Could not resolve host: github.com
This is a race condition: not all of the OVS flows are installed yet by the time the pod comes up and tries to issue a DNS request. We need to add a check for table 48 to ensure the pod has received the flows for its egress port security.
Verification using the steps mentioned in the description looks good to me on the Azure platform. @zzhao Could you verify it on the baremetal platform if you get a chance today? Thanks
@Anurag Sure. The cluster is still deploying with a 4.7 build; let me check later. From the bug description, it should not be a platform-specific issue.
Created attachment 1730746 [details] openflows for table48 and table49
Assigning this bug; please correct me if I'm wrong. Thanks.
@zzhao Can you try once more with something like "args: [git clone https://github.com/openshift-scale/hello-openshift.git; sleep 20000000000]"? AFAIU OCP supports TCP, UDP, and SCTP; the ping utility (ICMP) might not be a good way to test this at scale, IMO. Correct me, Tim, in case :)
@zzhao I see. It seems good to me then. Checked on all non-metal platforms. I guess it's okay to close this, IMO. Re-open if seen again.
Yes, agreed. Re-tested on all platforms and the issue did not reproduce. Moving this bug to VERIFIED; feel free to re-open if seen again.
I am still seeing this on baremetal with the said fix, on a 250-node cluster. It looks like the scale of the environment also contributes to this. I don't think we should close this bug.
Thanks for the information, Sai. Reassigning this issue.
I understand these are separate issues. I still run into this issue even without running into https://bugzilla.redhat.com/show_bug.cgi?id=1901363
Thanks to Raul and Sai for helping to reproduce. tl;dr: the fix in this bug helps to reduce the occurrence of this issue, but it can still happen. When reproducing we can see that when OVN is under heavy load the first CNI request may fail. Interlaced logs from ovnkube-node and ovn-controller:

I1214 21:43:50.908845 3313 cni.go:147] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] dispatching pod network request &{ADD cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc03976a400}
2020-12-14T21:44:09Z|04949|binding|INFO|Claiming lport cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687_cluster-density-2-1-build for this chassis.
2020-12-14T21:44:09Z|04950|binding|INFO|cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687_cluster-density-2-1-build: Claiming 0a:58:0a:96:00:f7 10.150.0.247
I1214 21:44:15.213359 3313 cni.go:157] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] CNI request &{ADD cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc03976a400}, result "", err failed to configure pod interface: timed out waiting for pod flows for pod: cluster-density-2-1-build, error: timed out waiting for the condition
I1214 21:44:15.266808 3313 cni.go:157] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] CNI request &{DEL cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc02beb7900}, result "", err <nil>
I1214 21:44:15.314062 3313 cniserver.go:147] Waiting for DEL result for pod cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build
I1214 21:44:15.314082 3313 cni.go:147] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] dispatching pod network request &{DEL cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc04092ac00}
I1214 21:44:15.337861 3313 cni.go:157] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] CNI request &{DEL cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 13b55f929a4b5f25653ba5551fb7a5369c046edae440d81960a2d6dbc035ec69 /var/run/netns/4d381877-1f6f-448b-bfe4-4153eac4f2e5 eth0 0xc04092ac00}, result "", err <nil>
I1214 21:44:15.723166 3313 cniserver.go:147] Waiting for ADD result for pod cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build
I1214 21:44:15.723187 3313 cni.go:147] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] dispatching pod network request &{ADD cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 127056dc4da89cebce51b8106538b145cc174c40aee542f08356aba139abc000 /var/run/netns/12fe721d-c9a1-470c-8456-0382fbf2c8bc eth0 0xc02b04bc00}
I1214 21:44:20.391537 3313 cni.go:157] [cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687/cluster-density-2-1-build] CNI request &{ADD cluster-density-6f87d4c2-73d4-4adc-ad57-25af4e555ff0-1687 cluster-density-2-1-build 127056dc4da89cebce51b8106538b145cc174c40aee542f08356aba139abc000 /var/run/netns/12fe721d-c9a1-470c-8456-0382fbf2c8bc eth0 0xc02b04bc00}, result "{\"Result\":{\"interfaces\":[{\"name\":\"127056dc4da89ce\",\"mac\":\"72:cc:3a:e5:66:8a\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:96:00:f7\",\"sandbox\":\"/var/run/netns/12fe721d-c9a1-470c-8456-0382fbf2c8bc\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.150.0.247/22\",\"gateway\":\"10.150.0.1\"}],\"dns\":{}},\"PodIFInfo\":null}", err <nil>

We can see that:
1. On the first CNI request it takes ovn-controller 19 seconds to bind the port. We have a 20-second timeout for flows, so we hit that and consider the CNI ADD failed.
2. CNI then deletes the port, and a new ADD happens.
3. The 2nd ADD succeeds and the pod is started, where it immediately tries to send a packet and fails.

It looks like either we are missing another flow to check for, or, because of the delete of the old port, ovn-controller removes/reinstalls the flows for the pod. Either way this is an OVN bug, and we can't keep hacking around it in ovn-kubernetes to try to determine when the flows are present for the pod. For now I will leave this bug open and depend on 1839102, which will give us the ability from OVN to determine if the flows are present for a port.
I should add that after 3 seconds all 50 of the test pods have all of their flows, with 107 nodes, 6081 services, 21830 pods.
Dumitru seems to think that the CNI DEL ends up causing the flows to be reprogrammed for the subsequent pod ADD. He had a really good idea: also check for the OF port in the flows. I'm going to come up with a patch for that and then we can see if this issue is still reproducible. The real OVN fix won't happen for some time, so we can use this as an intermediary fix.
Thanks to Raul for testing out the fix; he thinks it does seem to help. The problem may still happen at higher scale, but we will defer fixing that case until there is a proper OVN fix in 1839102. For now this fix should work for around 250 nodes, 7k pods, and 1k services or less.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633