Description of problem:
SkyDNS does not respond to parallel requests from applications inside pods.

Version-Release number of selected component (if applicable):
3.9

How reproducible:
Every time

Steps to Reproduce:
1. Create a pod with a .NET app and have it connect to another service within the same project.
2.
3.

Actual results:
The app first needs to resolve the service's DNS name. It queries DNS for the svc.cluster.local name by sending A and AAAA requests in parallel, and it gets no reply to the AAAA request. After some attempts, the app sends the DNS requests sequentially, gets correct answers, and is able to proceed with the connection.

Expected results:
The app resolves the DNS name on the first attempt.

Additional info:
More info in the next comment. A pcap is also attached.
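The lookup delay described above can be checked from inside a pod without the application. This is only a minimal sketch, assuming the pod image is glibc-based and ships getent (the report states neither). On glibc, `getent ahosts` resolves through getaddrinfo, which normally issues the A and AAAA queries together. The helper name time_lookup is hypothetical, and the timing has whole-second resolution only.

```shell
#!/bin/sh
# time_lookup NAME: resolve NAME once and print the exit status and the
# elapsed wall-clock time in whole seconds.
# Assumption: getent is available in the image; if it is not, the exit
# status is reported as 127 and the elapsed time is meaningless.
time_lookup() {
    tl_start=$(date +%s)
    getent ahosts "$1" >/dev/null 2>&1
    tl_status=$?
    tl_end=$(date +%s)
    echo "$1 exit=$tl_status elapsed=$((tl_end - tl_start))s"
}
```

For example, `time_lookup os-sample-python.p1.svc.cluster.local` run after the pod has been idle for a couple of minutes would be expected to show an elapsed time of about 5s if the same delay is hit.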
The problem can now be reproduced every time.

Set up the cluster through the QE Jenkins tool: https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/

The 5s curl delay is seen only with vm_type: m1.large, not with vm_type: m1.medium.
The same 5s curl delay is seen in both v3.9.51 and v3.11.67.

[root@qe-weliang-case4master-etcd-nfs-1 ~]# oc new-app https://github.com/OpenShiftDemos/os-sample-python.git
[root@qe-weliang-case4master-etcd-nfs-1 ~]# oc rsh os-sample-python-1-hv9ms
(app-root) sh-4.2$ export svc5=os-sample-python.p1.svc.cluster.local
(app-root) sh-4.2$ OUTPUT=" %{time_namelookup} %{time_connect} %{time_appconnect} %{time_pretransfer} %{time_redirect} %{time_starttransfer} %{time_total}\n"; echo ""; echo " time namelookup connect appconnect pretransfer redirect starttransfer total"; echo "-----------------------------------------------------------------------------------------------------------------"; while true; do echo -n "$(date)"; curl -w "$OUTPUT" -o /dev/null -s $svc5:8080; sleep 2; echo -n "$(date)"; curl -w "$OUTPUT" -o /dev/null -s $svc5:8080; sleep 120; done

time                         namelookup  connect  appconnect  pretransfer  redirect  starttransfer  total
-----------------------------------------------------------------------------------------------------------------
Wed Jan 9 20:35:16 UTC 2019  0.125       0.126    0.000       0.126        0.000     0.130          0.130
Wed Jan 9 20:35:19 UTC 2019  0.012       0.013    0.000       0.013        0.000     0.016          0.016
Wed Jan 9 20:37:19 UTC 2019  5.515       5.515    0.000       5.516        0.000     5.518          5.519
Wed Jan 9 20:37:26 UTC 2019  0.012       0.013    0.000       0.013        0.000     0.016          0.016
Wed Jan 9 20:39:26 UTC 2019  5.514       5.515    0.000       5.515        0.000     5.521          5.521
Wed Jan 9 20:39:34 UTC 2019  0.012       0.013    0.000       0.013        0.000     0.016          0.016
Wed Jan 9 20:41:34 UTC 2019  5.513       5.514    0.000       5.514        0.000     5.519          5.519
Wed Jan 9 20:41:41 UTC 2019  0.012       0.013    0.000       0.013        0.000     0.014          0.014
Wed Jan 9 20:43:41 UTC 2019  5.515       5.515    0.000       5.516        0.000     5.518          5.518
Wed Jan 9 20:43:49 UTC 2019  0.012       0.013    0.000       0.013        0.000     0.014          0.014
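When the loop runs for a long time, the slow samples can be picked out automatically. The following is a rough sketch, not part of the original test: it assumes the loop output was saved to a file, that every data row ends with the seven curl timings in the order shown above, and that a 1.0s threshold is a reasonable cutoff. The function name flag_slow_lookups is hypothetical.

```shell
#!/bin/sh
# flag_slow_lookups [THRESHOLD]: read timing rows on stdin and print the rows
# whose time_namelookup exceeds THRESHOLD seconds (default 1.0, an assumption).
flag_slow_lookups() {
    awk -v limit="${1:-1.0}" '
        # A data row ends with seven numeric curl timings, so time_namelookup
        # is the seventh field from the end. The header and the dashed rule
        # do not match and are skipped.
        NF >= 7 && $(NF-6) ~ /^[0-9]+\.[0-9]+$/ {
            if ($(NF-6) + 0 > limit + 0) print "SLOW", $0
        }
    '
}
```

With the output saved to a hypothetical file timings.txt, `flag_slow_lookups 1.0 < timings.txt` prints only the rows with a name lookup slower than one second, each prefixed with SLOW.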
Further testing gave more interesting results.

Same test environment for all cases:
v3.9.51
Red Hat Enterprise Linux Server release 7.6 (Maipo)
openvswitch-2.9.0-83.el7fdp.1.x86_64
openvswitch-selinux-extra-policy-1.0-8.el7fdp.noarch

Case1: Cluster created in OpenStack with instance_type: m1.large
Test result: fail

Case2: Cluster created in OpenStack with instance_type: m1.medium
Test result: pass

Case3: Cluster created in AWS/EC2 with instance_type: m1.large
Test result: pass

Case4: Cluster created in AWS/EC2 with instance_type: m1.medium
Test result: pass

All of the above test results are consistent.
This bug is almost certainly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1661928, so I'm consolidating into the other bug (which predates this one).

*** This bug has been marked as a duplicate of bug 1661928 ***