Bug 1928773

Summary: hostname lookup delays when master node down
Product: OpenShift Container Platform
Component: Networking
Sub Component: DNS
Reporter: Miciah Dashiel Butler Masters <mmasters>
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: jechen <jechen>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aarapov, abodhe, agabriel, amcdermo, aos-bugs, bjarolim, hongli, juqiao, mfuruta, mmasters, mzali, openshift-bugs-escalate, openshift-bugzilla-robot, rbolling, rh-container, rhowe, sgaikwad, skanakal
Version: 4.6
Target Release: 4.6.z
Hardware: x86_64
OS: Linux
Clone Of: 1928304
Last Closed: 2021-03-16 23:22:18 UTC
Bug Depends On: 1928304    
Bug Blocks: 1935297    

Comment 1 Miciah Dashiel Butler Masters 2021-02-23 03:49:51 UTC
*** Bug 1930917 has been marked as a duplicate of this bug. ***

Comment 2 Miciah Dashiel Butler Masters 2021-02-25 22:20:32 UTC
Waiting for https://github.com/openshift/kubernetes/pull/581 to be merged so that I can rebase https://github.com/openshift/sdn/pull/261.

Comment 4 Miciah Dashiel Butler Masters 2021-03-03 21:00:24 UTC
Here's an update on where this backport stands.  The 4.6.z backport is waiting on two things: verification of the 4.7.z backport, and passing CI tests.  

The 4.7.z backport got delayed by a process issue unrelated to the fix itself.  CI tests are failing due to an issue with our CI infrastructure: one of the CI jobs verifies changes on GCP, and we are currently having general issues with GCE API rate limiting, again unrelated to the fix itself.  

I anticipate that the 4.7.z backport will be verified this week.  Then the 4.6.z backport can be verified next week, and shipped in the fast/stable channels approximately 2.5 weeks from now.  (There is a possibility that the 4.6.z backport will be available in the candidate-4.6 channel a little earlier.)

Comment 7 Ben Bennett 2021-03-04 14:57:41 UTC
*** Bug 1921797 has been marked as a duplicate of this bug. ***

Comment 10 jechen 2021-03-08 15:05:21 UTC
Verified in 4.6.0-0.nightly-2021-03-06-050044

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-03-06-050044   True        False         24m     Cluster version is 4.6.0-0.nightly-2021-03-06-050044

# In the first terminal, create a test pod, rsh into it, and run an infinite loop to measure DNS lookup time
$ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/aosqe-pod-for-ping.json
pod/hello-pod created
$ oc rsh hello-pod
/ # while (true); do time getent hosts kubernetes.default.svc.cluster.local; sleep 1; done
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.02s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
<---- snip----->
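Rather than eyeballing the `time` output for delays, the slow lookups can be filtered out automatically. A minimal sketch, shown here against sample `time` lines; the 0.5s threshold and the `SLOW:` label are assumptions for illustration, and in the test pod the loop's output would be piped in instead of the `printf`:

```shell
# Parse BusyBox-style `time` output and flag 'real' times above a threshold.
# The sample lines and the 0.5s threshold are illustrative assumptions.
printf 'real\t0m 0.02s\nreal\t0m 2.31s\n' |
  awk '/^real/ {
    split($2, a, "m")     # minutes from a field like "0m"
    t = a[1] * 60 + $3    # $3 like "0.02s" coerces to 0.02 in arithmetic
    if (t > 0.5) print "SLOW:", t "s"
  }'
```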


# In the second terminal, reboot one of the master nodes
$ oc get node
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-947k4ft-f76d1-9nnw9-master-0         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-master-1         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-master-2         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-b-drbnn   Ready    worker   49m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-c-fwrs8   Ready    worker   49m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-d-6jpxz   Ready    worker   49m   v1.19.0+2f3101c

$ oc debug node/ci-ln-947k4ft-f76d1-9nnw9-master-1
Starting pod/ci-ln-947k4ft-f76d1-9nnw9-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# reboot


# Monitor DNS lookup latency in the first terminal for 3-5 minutes while the master node reboots; confirm there is no delay
<---- snip----->

172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.01s
user	0m 0.00s
sys	0m 0.00s
<---- snip----->

$ oc -n openshift-dns get ds/dns-default -o yaml

<---- snip----->
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 3           <--- verified fix with https://github.com/openshift/cluster-dns-operator/pull/236
          successThreshold: 1
          timeoutSeconds: 3          <--- verified fix with https://github.com/openshift/cluster-dns-operator/pull/236

<---- snip----->
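The two fixed probe fields can also be checked without reading the full YAML. A minimal sketch that extracts them from the DaemonSet JSON; the `sample` variable stands in for live `oc ... -o json` output, an assumption for illustration:

```shell
# Extract readiness-probe timings from DaemonSet JSON and confirm the fix.
# In a live cluster, `sample` would instead hold the output of:
#   oc -n openshift-dns get ds/dns-default -o json
sample='{"periodSeconds":3,"timeoutSeconds":3}'
period=$(printf '%s' "$sample" | sed -n 's/.*"periodSeconds":\([0-9]*\).*/\1/p')
timeout=$(printf '%s' "$sample" | sed -n 's/.*"timeoutSeconds":\([0-9]*\).*/\1/p')
if [ "$period" -eq 3 ] && [ "$timeout" -eq 3 ]; then
  echo "probe timings match the fix"
fi
```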

Comment 11 jechen 2021-03-08 18:46:46 UTC
Sorry, I marked the wrong status; changed it to VERIFIED.

Comment 13 errata-xmlrpc 2021-03-16 23:22:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.21 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0753